- Containers, systemd-nspawn and overlayfs
- A quick introduction
- Using your own containers with systemd-nspawn + overlayfs
This is a little blurb about how to use namespaces most efficiently at the time of writing. In particular, systemd-nspawn offers a lot of flexibility, is lightweight, and takes advantage of having both the host and the guest running systemd. As much as I dislike systemd’s design in some ways, it is very practical in this case.
Namespaces are generally called “containers” outside the kernel world. Really, a container is a userspace process running in a set of namespaces. The most typical example is to run init (the first Linux process when you boot) or an init alternative inside the namespace, and chroot  it.
This makes the container look like a fresh/separate operating system, a little bit like a virtual machine would - except all resources (memory, disk, kernel, etc.) are seen and shared by the host.
chroot: change file-system root.
Linux namespaces are not new. They are the kernel-side mechanism that separates resources (processes, filesystems, network, users, etc.) within a single kernel.
Implementation started several years ago and became usable more recently, around the Linux kernel 3.6+ series. This was done by implementing a new creds structure in the kernel, which holds information on the current namespace and permissions of an object. If you’re interested, full details are at https://www.kernel.org/doc/Documentation/security/credentials.txt (this is a good read!).
A container does not necessarily enable all namespaces; however, most do, to achieve a more thorough separation of resources.
|Namespace|Description|
|File system|Different view of the filesystem. Generally provided by a chroot of a different mount point. Containers can often see/write the host’s or each other’s filesystem namespaces because of bind mounts (such as /proc, /sys, and /dev).|
|PID|Different view of the process table. Additionally, you can’t send signals from one PID namespace to another. Containers can’t see/write the host’s or each other’s PID namespaces.|
|IPC|Different view of the shared memory, messages, semaphores. Containers can’t see/write the host’s or each other’s IPC namespaces.|
|Network|Different view of the network interfaces, netfilter, routes, etc. Containers can’t see/write the host’s or each other’s firewall, IPs, etc.|
|UTS|Different view of the system hostname. Containers can’t see/write the host’s or each other’s hostname.|
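To make the table a little more concrete, here is a small sketch that shows which namespaces a process lives in. It relies on the /proc/<pid>/ns/ symlinks that Linux exposes; the helper names ns_id and same_ns are made up for this example.

```shell
# Print the identifier of one namespace of a process.
# $1: namespace name as listed in /proc/<pid>/ns (uts, pid, ipc, net, mnt, ...)
# $2: PID to inspect (optional, defaults to the calling process)
ns_id() {
    readlink "/proc/${2:-self}/ns/$1"
}

# Succeed when two PIDs ($2 and $3) share the namespace named in $1.
# Two processes are in the same namespace when the identifiers match.
same_ns() {
    [ "$(ns_id "$1" "$2")" = "$(ns_id "$1" "$3")" ]
}
```

For instance, `ns_id uts` prints the UTS namespace of your shell, and `same_ns net $$ <pid>` tells you whether another process shares your network namespace. Inspecting processes you do not own requires root.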
While there’s plenty of software using namespaces, I’ll just drop some notes about the most notable ones:
Docker is a well-known, dare I say, celebrity in the container world.
- Written in Go - it provides
- an API with a root daemon
- the Docker file format to create images
- a large image repository
- some utilities
- the ability to deploy someone else’s setup with one command
- Poor image repository security - you don’t know what you run (being worked on)
- Poor container isolation (being worked on)
- Hard to debug
- Always running, mandatory root daemon (by design)
LXC has been around for a while, and is seen as a qemu/VM replacement. More recently, it has been rewritten in C.
- Slightly complex to use
- Poor reputation
- Most setups can be escaped from by default by root users in guests
- Cannot easily create or import images
OpenVZ (now “OpenVZ Virtuozzo”) has also been around for a long time and contributed greatly to the popularity of Linux namespaces, including a large part of the kernel namespacing code itself.
- Large documentation
- Supports live migration of images
- Plenty of features
- Commercial support
- Heavy, complex
- Some features are commercial-only (proprietary)
The most recent, nspawn is a part of systemd - the init system. Being part of the init system gives it specific advantages such as being able to coordinate all system logs, automatically masquerade the host network, and so on.
- Fast, simple CLI commands
- No additional daemons
- Compatible with most image formats
- Can pull images directly from Docker’s repository (“one command image setup”)
- Supports cgroups and seccomp via systemd directly (single place to set up)
- No API
- Systemd guest required for using all features
This how-to will use in particular:
- Arch Linux host (lightweight but not as limited as CoreOS)
- overlayfs as root filesystem for guests
OverlayFS allows guests to share their filesystem cache and works as a COW (copy-on-write) filesystem on top of a base image. This ensures higher performance and low memory/disk usage.
- Have Arch Linux installed (https://wiki.archlinux.org/index.php/Installation_guide)
- Understand that all containers will be stored at /var/lib/container/ ;-)
- Have /var/lib/container as an ext4 filesystem (or part of an ext4 filesystem)
- That’s it.
We’re going to create the base image as /var/lib/container/default-ns-1.
While I’m calling this an image, it really just is a filesystem directory - this is so that the filesystem cache can be most efficient. Any image file can be mounted and copied to convert it to the native filesystem.
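One possible way to populate that directory is sketched below. It is an assumption on my part that pacstrap (from the arch-install-scripts package, which also ships arch-chroot) is used; the function name create_base_image is made up for this example.

```shell
# Directory holding the base image, as used throughout this how-to.
BASE=/var/lib/container/default-ns-1

# Install packages into a fresh directory tree.
# $1: target directory, remaining arguments: packages or groups to install.
# Assumption: pacstrap is available. Older pacstrap versions also need -d
# when the target directory is not a mount point.
create_base_image() {
    target=$1
    shift
    sudo mkdir -p "$target" &&
    # -c: use the host's package cache instead of the target's
    sudo pacstrap -c "$target" "$@"
}

# Usage (needs root and network access):
#   create_base_image "$BASE" base
```

Installing only the base package keeps the image small; anything you add here will be visible in every child image.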
Ok, your base image is now set up. You can modify things in it at any time - changes will go to all “child” images.
A child image is simply an overlay on top of the base image. It uses no disk space until you start using it and writing files. The child image only stores the differences from the base image.
You can make as many child images as you want. We’ll make a child image to run apache in this example.
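A minimal sketch of creating and mounting such a child image by hand follows. The directory layout (child_apache, child_apache_up, child_apache_work) is taken from the fstab entry used in this how-to; the helper names overlay_opts and mount_child are made up for this example.

```shell
CONTAINERS=/var/lib/container
BASE=$CONTAINERS/default-ns-1

# Print the overlayfs mount options for the child image named $1.
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s,workdir=%s\n' \
        "$BASE" "$CONTAINERS/${1}_up" "$CONTAINERS/${1}_work"
}

# Create the mount point, upper and work directories, then mount the
# overlay on top of the base image (needs root).
mount_child() {
    sudo mkdir -p "$CONTAINERS/$1" "$CONTAINERS/${1}_up" "$CONTAINERS/${1}_work" &&
    sudo mount -t overlay overlay -o "$(overlay_opts "$1")" "$CONTAINERS/$1"
}

# Usage:
#   mount_child child_apache
```

The upper directory receives everything the child writes, and the work directory is scratch space that overlayfs requires on the same filesystem as the upper directory.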
At this point, if you list /var/lib/container/child_apache it should have the same contents as /var/lib/container/default-ns-1.
In order to mount it at every boot you can add it to /etc/fstab:
- overlay /var/lib/container/child_apache overlay noauto,x-systemd.automount,lowerdir=/var/lib/container/default-ns-1,upperdir=/var/lib/container/child_apache_up,workdir=/var/lib/container/child_apache_work 0 0
“noauto,x-systemd.automount” ensures that systemd will not block in case something goes wrong during the mount at boot. Also, it will only mount the child image automatically when the container is started - otherwise, the image will stay unmounted.
You will need to reboot or restart systemd’s mount service for the fstab entries to be taken into account.
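If you would rather not reboot, something along these lines should work. This is a sketch: it assumes that systemd generated an automount unit for the fstab entry, and the helper names automount_unit and activate_automount are made up for this example.

```shell
# Print the name of the automount unit systemd generates for a mount point.
# $1: absolute path of the mount point
automount_unit() {
    systemd-escape --path --suffix=automount "$1"
}

# Re-run the fstab generator, then start the automount unit (needs root).
activate_automount() {
    sudo systemctl daemon-reload &&
    sudo systemctl start "$(automount_unit "$1")"
}

# Usage:
#   activate_automount /var/lib/container/child_apache
```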
You can simply start it via nspawn directly (ensure you have mounted the overlayfs already):
$ sudo systemd-nspawn --boot -j -M child_apache
[...] plenty of boot messages [...]
Login:
When started via nspawn, exiting the container will also kill it.
Via machinectl (identical to using systemctl):
$ sudo machinectl start child_apache
$ sudo machinectl login child_apache
Login:
When exiting, the container will still be running. It can be turned off via:
$ sudo machinectl poweroff child_apache
$ sudo systemctl enable systemd-nspawn@child_apache.service
Ensure that you’re using systemd-networkd host-side for networking if you would like to make use of all of systemd’s easy networking setup guest-side.
The default for systemd-nspawn is to setup an automatic veth network. If you would not like that, and prefer to disable the network namespace, you can edit the service file:
$ cp /usr/lib/systemd/system/systemd-nspawn@.service /etc/systemd/system
$ vim /etc/systemd/system/systemd-nspawn@.service
Simply remove the argument “--network-veth” from the systemd-nspawn command to disable automatic networking and network namespacing. Your guest will share the network with the host in this case. You will need to disable and re-enable any container you have already enabled with systemctl for the changes to take effect.
For system updates, we’ll take advantage of overlayfs’ exposing the filesystem differences between the base image and the child images.
First, just update the image...
$ sudo arch-chroot /var/lib/container/default-ns-1 pacman -Suy
Then, make sure the child does not hold its own, more recent copy of a file that also exists in the base image. If it does, delete the child’s copy to force the child to use the base image’s version.
The child’s version will be lost!
>>> TODO insert find cmd here
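Until the command above is filled in, here is one possible sketch. It only lists the files the child keeps in its upper directory that also exist in the base image, so you can review them before deleting anything. It does not compare modification times, and the function name list_shadowed is made up for this example.

```shell
# List regular files that exist both in a child's upper directory and in
# the base image, i.e. files where the child's copy hides the base copy.
# $1: child's upper directory, $2: base image directory.
# Prints one path per line, relative to the image root.
list_shadowed() {
    (cd "$1" && find . -type f -print) | while IFS= read -r rel; do
        rel=${rel#./}
        if [ -f "$2/$rel" ]; then
            printf '%s\n' "$rel"
        fi
    done
}

# Usage:
#   list_shadowed /var/lib/container/child_apache_up \
#                 /var/lib/container/default-ns-1
```

Once you have reviewed the list, you can remove the unwanted copies from the upper directory. It is safest to do that while the child is unmounted, since overlayfs does not define what happens when the underlying directories change under a mounted overlay.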
Repeat for every child you want to update, then remount it:
$ sudo mount -oremount /var/lib/container/child_apache