Containers, systemd-nspawn and overlayfs¶
This is a little blurb about how to use namespaces most efficiently at the time of writing. In particular, systemd-nspawn offers a lot of flexibility, is lightweight, and takes advantage of both the host and the guest running systemd. As much as I dislike systemd’s design in some ways, it is very practical in this case.
A quick introduction¶
What’s a container?¶
Namespaces are generally called “containers” outside the kernel world. Really, a container is a userspace process running in a set of namespaces. The most typical example is to run init (the first Linux process when you boot) or an init alternative inside the namespaces, and chroot [1] it.
This makes the container look like a fresh/separate operating system, a little bit like a virtual machine would - except that all resources (memory, disk, kernel, etc.) are shared with the host and visible from it.
[1] Change file-system root.
What’s a namespace really?¶
Linux namespaces are not new. They are the kernel-side mechanism that separates resources at the process, filesystem, network, user, etc. level within a single kernel.
Implementation started several years ago and became usable more recently, around the Linux kernel 3.6+ series. This was done by implementing a new creds structure in the kernel, which holds information on the current namespace and permissions of an object. If you’re interested, full details are at https://www.kernel.org/doc/Documentation/security/credentials.txt (this is a good read!).
Available namespaces¶
Note
A container does not necessarily enable all namespaces, however most do to achieve more thorough separation of resources.
Namespace | Description |
---|---|
File system | Different view of the filesystem. Generally provided by a chroot into a different mount point. Containers can often still see/write parts of the host’s or each other’s filesystems because of bind mounts (such as /proc, /sys, and /dev). |
PID | Different view of the process table. Additionally, a process can’t send signals to processes outside its own PID namespace. Containers can’t see/write the host’s or each other’s PID namespaces. |
IPC | Different view of the shared memory, messages, semaphores. Containers can’t see/write the host’s or each other’s IPC namespaces. |
Network | Different view of the network interfaces, netfilter, routes, etc. Containers can’t see/write the host’s or each other’s firewall, IPs, etc. |
UTS | Different view of the system hostname. Containers can’t see/write the host’s or each other’s hostname. |
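To check which namespaces a process actually lives in, the kernel exposes one symbolic link per namespace under /proc/<pid>/ns. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name ns_id is made up for illustration.

```shell
#!/bin/sh
# ns_id PID TYPE: print the namespace identifier of a process, in the
# form "type:[inode]". Two processes share a namespace when the
# identifiers are equal.
ns_id() {
    readlink "/proc/$1/ns/$2"
}

# Print the namespaces of the current process for the namespace types
# listed in the table above (mnt = file system, net = network).
for ns in mnt pid ipc net uts; do
    printf '%-4s %s\n' "$ns" "$(ns_id self "$ns")"
done
```

Running the same loop on the host and inside a guest should show different identifiers for every namespace the container enables. Reading the links of another process (for example `ns_id 1 uts`) may require root.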
Systemd, docker, LXC, etc.¶
While there’s plenty of software using namespaces, I’ll just drop some notes about the most notable ones:
Docker¶
Docker is a well-known, dare I say, celebrity in the container world.
- Written in Go, it provides:
  - an API with a root daemon
  - the Docker file format to create images
  - a large image repository
  - some utilities
  - the ability to deploy someone else’s setup with one command
- Cons
  - Poor image repository security - you don’t know what you run (being worked on)
  - Poor container isolation (being worked on)
  - Hard to debug
  - Always running, mandatory root daemon (by design)
LXC¶
LXC has been around for a while, and is seen as a qemu/VM [2] replacement. More recently, it has been rewritten in C.
- Cons
  - Slightly complex to use
  - Poor reputation
  - Most setups can be escaped from by default by root users in guests
  - Cannot easily create or import images
[2] Virtual machine
OpenVZ¶
OpenVZ (now “OpenVZ Virtuozzo”) has also been around for a long time and has contributed greatly to the popularity of Linux namespaces, as well as to a large part of the kernel namespacing code itself.
- Pros
  - Large documentation
  - Supports live migration of images
  - Plenty of features
  - Commercial support
- Cons
  - Heavy, complex
  - Some features are commercial-only (proprietary)
systemd-nspawn¶
The most recent, nspawn is a part of systemd - the init system. Being part of the init system gives it specific advantages such as being able to coordinate all system logs, automatically masquerade the host network, and so on.
- Pros
  - Fast, simple CLI commands
  - No additional daemons
  - Compatible with most image formats
  - Can pull images directly from Docker’s repository (“one command image setup”)
  - Supports cgroups and seccomp via systemd directly (single place to set up)
  - Flexible
- Cons
  - No API
  - Systemd guest required for using all features
Using your own containers with systemd-nspawn + overlayfs¶
This how-to will use in particular:
- Arch Linux host (lightweight but not as limited as CoreOS)
- systemd-nspawn
- overlayfs as root filesystem for guests
OverlayFS allows guests to share their filesystem cache and works as a COW [3] filesystem on top of a base image. This ensures higher performance and low memory/disk usage.
[3] Copy-on-write
Prerequisites¶
- Have Arch Linux installed (https://wiki.archlinux.org/index.php/Installation_guide)
- Understand that all containers will be stored at /var/lib/container/ ;-)
- Have /var/lib/container as an ext4 filesystem (or part of an ext4 filesystem)
That’s it.
Set up your own base system “image”¶
Dependencies¶
Ensure you have the arch-install-scripts package installed:
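A quick way to check is to look for the two tools used later in this how-to, pacstrap and arch-chroot, which are shipped by the arch-install-scripts package. This is only a sketch; the helper name have_cmd is made up for illustration.

```shell
#!/bin/sh
# have_cmd NAME: succeed if NAME can be found in PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the tools are present, and how to get them if not.
if have_cmd pacstrap && have_cmd arch-chroot; then
    echo "arch-install-scripts is already installed"
else
    echo "missing, install with: sudo pacman -S --needed arch-install-scripts"
fi
```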
Setup¶
We’re going to create the base image as /var/lib/container/default-ns-1.
While I’m calling this an image, it really just is a filesystem directory - this is so that the filesystem cache can be most efficient. Any image file can be mounted and copied to convert it to the native filesystem.
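The base image can be bootstrapped with pacstrap. The sketch below is one way to do it, not necessarily the exact commands originally used here: the helper name make_base_image is made up, and installing only the base package is an assumption. Depending on the pacstrap version, older releases may also need the -d option to accept a directory that is not a mount point.

```shell
#!/bin/sh
# make_base_image DIR [PACKAGE...]: bootstrap a minimal Arch system
# into DIR. Installs only "base" when no package is given (assumption).
make_base_image() {
    dir=$1
    shift
    [ "$#" -gt 0 ] || set -- base
    sudo mkdir -p "$dir" || return 1
    # -c: use the host's package cache rather than one inside DIR
    sudo pacstrap -c "$dir" "$@"
}

# Usage:
#   make_base_image /var/lib/container/default-ns-1
```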
Ok, your base image is now set up. You can modify things in it at any time - changes will go to all “child” images.
Create your first child “image”¶
Creation¶
A child image is simply an overlay on top of the base image. It uses no disk space until you start using it and writing files. The child image only stores the differences from the base image.
You can make as many child images as you want. We’ll make a child image to run apache in this example.
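Creating the child comes down to making three directories (mount point, upper layer, work directory) and mounting an overlay over them. A minimal sketch, assuming the "_up" and "_work" directory suffixes used in the fstab entry of this how-to; the helper names overlay_opts and make_child are made up for illustration.

```shell
#!/bin/sh
# overlay_opts BASE CHILD: print the overlayfs mount options for a child.
# upperdir and workdir must live on the same filesystem.
overlay_opts() {
    printf 'lowerdir=%s,upperdir=%s_up,workdir=%s_work\n' "$1" "$2" "$2"
}

# make_child BASE CHILD: create the directories and mount the overlay.
make_child() {
    sudo mkdir -p "$2" "${2}_up" "${2}_work" || return 1
    sudo mount -t overlay overlay -o "$(overlay_opts "$1" "$2")" "$2"
}

# Usage:
#   make_child /var/lib/container/default-ns-1 /var/lib/container/child_apache
```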
At this point, if you list /var/lib/container/child_apache it should have the same contents as /var/lib/container/default-ns-1.
Mount at boot¶
In order to mount it at every boot you can add it to /etc/fstab:
overlay /var/lib/container/child_apache overlay noauto,x-systemd.automount,lowerdir=/var/lib/container/default-ns-1,upperdir=/var/lib/container/child_apache_up,workdir=/var/lib/container/child_apache_work 0 0
“noauto,x-systemd.automount” ensures that systemd will not block the boot if something goes wrong during the mount. It also means the child image is only mounted on demand, the first time it is accessed (for example when the container is started) - otherwise, the image stays unmounted.
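With several children, writing these entries by hand gets repetitive. The sketch below generates the same kind of line for any base/child pair; the helper name fstab_line is made up, and the "_up" and "_work" suffixes are taken from the entry above.

```shell
#!/bin/sh
# fstab_line BASE CHILD: print an fstab entry that automounts CHILD as
# an overlay on top of BASE.
fstab_line() {
    printf 'overlay %s overlay noauto,x-systemd.automount,lowerdir=%s,upperdir=%s_up,workdir=%s_work 0 0\n' \
        "$2" "$1" "$2" "$2"
}

# Prints the entry for the apache child used in this how-to.
fstab_line /var/lib/container/default-ns-1 /var/lib/container/child_apache
```

The output can be reviewed first and then appended, for example by piping it to `sudo tee -a /etc/fstab`.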
Note
You will need to reboot, or have systemd reload its configuration (systemctl daemon-reload) and restart the affected mount units, for the fstab entries to be taken into account.
Start the container¶
Manual startup¶
You can simply start it via nspawn directly (ensure you have mounted the overlayfs already):
$ sudo systemd-nspawn --boot -j -D /var/lib/container/child_apache -M child_apache
[...] plenty of boot messages [...]
Login:
When started via nspawn, exiting the container will also kill it.
Via machinectl (identical to using systemctl):
$ sudo machinectl start child_apache
$ sudo machinectl login child_apache
Login:
When exiting, the container will still be running. It can be turned off via:
$ sudo machinectl poweroff child_apache
Automatic start at boot¶
$ sudo systemctl enable systemd-nspawn@child_apache.service
That’s it!
Networking notes¶
Note
Ensure that you’re using systemd-networkd host-side for networking if you would like to make use of all of systemd’s easy networking setup guest-side.
The default for the systemd-nspawn@.service unit is to set up an automatic veth network. If you would rather not have that, and prefer to disable the network namespace, you can edit the service file:
$ cp /usr/lib/systemd/system/systemd-nspawn@.service /etc/systemd/system
$ vim /etc/systemd/system/systemd-nspawn@.service
Simply remove the argument “--network-veth” from the systemd-nspawn command to disable automatic networking and network namespacing. Your guest will share the network with the host in this case. You will need to disable and re-enable any container you have already enabled with systemctl for the change to take effect.
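The same edit can be scripted instead of done in an editor. This is a sketch that assumes GNU sed (for -i) and that the argument sits on the ExecStart= line of the copied unit; the helper name strip_veth is made up for illustration.

```shell
#!/bin/sh
# strip_veth FILE: remove the --network-veth argument from the
# ExecStart= line of a unit file, editing the file in place.
# Run as root when FILE lives under /etc/systemd/system.
strip_veth() {
    sed -i '/^ExecStart=/ s/ --network-veth//' "$1"
}

# Usage, on the copy made above (never on the file in /usr/lib):
#   strip_veth /etc/systemd/system/systemd-nspawn@.service
```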
System updates and base image changes¶
For system updates, we’ll take advantage of overlayfs’ exposing the filesystem differences between the base image and the child images.
Update base image¶
First, just update the image…
$ sudo arch-chroot /var/lib/container/default-ns-1 pacman -Suy
Then, make sure the child does not hold its own, more recent copy of any file that also exists in the base image; if it does, delete the child’s copy (this forces the child to use the base image’s version).
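One possible shape for that check is sketched below, under stated assumptions: the helper name list_newer_in_child is made up, and the child’s writable layer is assumed to be the "_up" directory from the fstab entry. It only prints candidates, so that deleting them stays a deliberate manual step. Since overlayfs serves the upper copy regardless of timestamps, dropping the -nt test would list every file that shadows the base image.

```shell
#!/bin/sh
# list_newer_in_child UPPER BASE: print, relative to the image root,
# every regular file in the child's upper layer that also exists in
# the base image and is more recent than the base image's copy.
# File names containing newlines are not handled.
list_newer_in_child() {
    upper=$1
    base=$2
    ( cd "$upper" && find . -type f -print ) | while IFS= read -r rel; do
        rel=${rel#./}
        if [ -f "$base/$rel" ] && [ "$upper/$rel" -nt "$base/$rel" ]; then
            printf '%s\n' "$rel"
        fi
    done
}

# Usage:
#   list_newer_in_child /var/lib/container/child_apache_up \
#                       /var/lib/container/default-ns-1
```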
Warning
Child’s version will be lost!
>>> TODO insert find cmd here
Remount the children¶
Repeat for every child to update:
$ sudo mount -o remount /var/lib/container/child_apache
Restart the children¶
If you want to ensure that all changes are taken into account, you may need to restart services in the child, or reboot it entirely (which is probably just as fast, since these are containers):
$ sudo machinectl reboot child_apache
That’s it!