Slides
Container building blocks
- namespace
- Linux control groups (cgroups)
- UnionFS
- veth
What a container should provide
- Filesystem isolation: e.g. `chroot` to change the root directory
- Network isolation: for distributed apps, each container needs its own IP, ports, and routes; veth pairs give each container its own netdev, IP, routing table, `/proc/net`, and ports. Multiple containers on one host can each bind port 80 inside their own network namespace
- Hostname: UTS namespace for identity on the network
- IPC: separate System V IPC and POSIX message queues; only processes in the same IPC namespace can talk to each other
- User IDs: in a user namespace, UID/GID can differ from the host; an unprivileged host user can be “root” inside the container
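As a rough illustration of the isolation list above, the namespaces a process belongs to can be inspected through the symlinks in `/proc/<pid>/ns`. The sketch below is a minimal reader for those links, assuming a Linux host; the helper names (`parse_ns_link`, `list_namespaces`, `same_namespace`) and the `proc_root` parameter are hypothetical and exist only for this example.

```python
import os
import re

# Namespace link targets look like "uts:[4026531838]": a kind and an inode.
_NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a /proc/<pid>/ns/* link target into (kind, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self", proc_root="/proc"):
    """Return {entry name: (kind, inode)} for one process.

    proc_root is a parameter only so the function can be tried against a
    fake directory tree; on a real Linux host the default is what you want.
    """
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        target = os.readlink(os.path.join(ns_dir, name))
        result[name] = parse_ns_link(target)
    return result


def same_namespace(pid_a, pid_b, name, proc_root="/proc"):
    """Two processes share a namespace when kind and inode both match."""
    a = list_namespaces(pid_a, proc_root)[name]
    b = list_namespaces(pid_b, proc_root)[name]
    return a == b
```

On a Linux host, `list_namespaces()` returns entries such as `uts`, `ipc`, `pid`, `net`, `mnt`, and `user` for the current process. Reading the links of another process may require extra privileges.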
Main pieces
| Piece | Details |
|---|---|
| namespace | UTS, IPC, PID, NETWORK, MOUNT, USER |
| cgroup | CPU, memory, blkio, devices, … |
| UnionFS | aufs (Ubuntu), btrfs (SUSE), vfs, devicemapper (CentOS), overlay2 (CentOS/Ubuntu) |
| veth | Docker network modes: bridge, host, container, none |
How containers are implemented
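A container is essentially a special process created with clone(2). The glibc wrapper has the signature `int clone(int (*fn)(void *), void *child_stack, int flags, void *arg, ...)`.
(On Linux, glibc's fork() wrapper also invokes the clone system call; the notes below are simplified.)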
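- fork: the child continues after the call site
- clone: the child starts at `fn(arg)`; parent and child have separate memory views but may share pages; `child_stack` points at the child stack (the top of the buffer, since stacks grow downward on most architectures)
- flags: the low byte carries the signal sent to the parent when the child exits (usually `SIGCHLD`); the Docker-related flags each request a new namespace: `CLONE_NEWUTS`, `CLONE_NEWIPC`, `CLONE_NEWPID`, `CLONE_NEWNET`, `CLONE_NEWNS`, `CLONE_NEWUSER`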
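The way the `flags` argument packs an exit signal and namespace requests into one integer can be sketched as follows. This is only an illustration of the bit layout, not a call to clone(2) itself: the constants are copied from `<linux/sched.h>`, while the helper names `build_flags` and `describe_flags` are hypothetical.

```python
import signal

# Values copied from <linux/sched.h>.
CLONE_NEWNS = 0x00020000    # mount namespace
CLONE_NEWUTS = 0x04000000   # hostname and domain name
CLONE_NEWIPC = 0x08000000   # System V IPC, POSIX message queues
CLONE_NEWUSER = 0x10000000  # UID/GID mappings
CLONE_NEWPID = 0x20000000   # process IDs
CLONE_NEWNET = 0x40000000   # network devices, routes, ports

CSIGNAL = 0x000000FF  # mask for the exit signal stored in the low byte

# Hypothetical lookup table, ordered as in the list above.
NAMESPACE_FLAGS = {
    "uts": CLONE_NEWUTS,
    "ipc": CLONE_NEWIPC,
    "pid": CLONE_NEWPID,
    "net": CLONE_NEWNET,
    "mnt": CLONE_NEWNS,
    "user": CLONE_NEWUSER,
}


def build_flags(namespaces, exit_signal=signal.SIGCHLD):
    """OR the requested namespace bits together with the exit signal."""
    flags = int(exit_signal)
    for name in namespaces:
        flags |= NAMESPACE_FLAGS[name]
    return flags


def describe_flags(flags):
    """Split a flags word back into (exit signal number, namespace names)."""
    exit_signal = flags & CSIGNAL
    names = [name for name, bit in NAMESPACE_FLAGS.items() if flags & bit]
    return exit_signal, names
```

For example, `build_flags(["uts", "net"])` yields the value one would pass as `flags` to request new UTS and network namespaces while still delivering `SIGCHLD` to the parent on exit.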
cgroup: resource limits are configured through control files under /sys/fs/cgroup
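To make the control-file idea concrete, here is a minimal sketch that reads two limits from a cgroup directory. It assumes a cgroup v2 (unified hierarchy) host, where the files are named `memory.max` and `cpu.max`; cgroup v1 uses different file names. The function names and the `cgroup_dir` parameter are hypothetical.

```python
import os


def parse_memory_max(text):
    """memory.max holds a byte count, or the literal 'max' for no limit."""
    value = text.strip()
    return None if value == "max" else int(value)


def parse_cpu_max(text):
    """cpu.max holds '<quota> <period>' in microseconds, or 'max <period>'.

    Returns the limit as a number of CPUs, or None when unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def read_limits(cgroup_dir):
    """Read the limits that exist in cgroup_dir; None means unlimited."""
    parsers = (("memory.max", parse_memory_max), ("cpu.max", parse_cpu_max))
    limits = {}
    for filename, parser in parsers:
        path = os.path.join(cgroup_dir, filename)
        try:
            with open(path) as handle:
                limits[filename] = parser(handle.read())
        except FileNotFoundError:
            continue  # controller not enabled here, or a cgroup v1 host
    return limits
```

Setting a limit is the reverse operation: a sufficiently privileged process writes a value such as `536870912` into `memory.max` of the container's cgroup directory.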
UnionFS / overlay2: read-only image layers (lowerdir) are stacked under a writable container layer (upperdir) and presented as one merged directory
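How the layer directories combine can be sketched by building the option string that the kernel's overlay filesystem expects. This is a minimal sketch, not Docker's own code: the function names are hypothetical, and any paths passed in are placeholders chosen by the caller.

```python
def overlay_mount_options(lower_dirs, upper_dir, work_dir):
    """Build the -o option string for an overlay mount.

    lower_dirs is ordered top-most layer first, which is the order
    overlayfs uses for its colon-separated lowerdir list.
    """
    if not lower_dirs:
        raise ValueError("at least one lower (read-only) layer is required")
    return "lowerdir={},upperdir={},workdir={}".format(
        ":".join(lower_dirs), upper_dir, work_dir
    )


def overlay_mount_command(lower_dirs, upper_dir, work_dir, merged_dir):
    """Return the argv for mounting the merged view (not executed here)."""
    options = overlay_mount_options(lower_dirs, upper_dir, work_dir)
    return ["mount", "-t", "overlay", "overlay", "-o", options, merged_dir]
```

The returned list can be handed to `subprocess.run(..., check=True)` by a privileged process. Writes then land in `upper_dir`, while the lower layers stay unchanged.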
veth: a virtual Ethernet pair; one end sits inside the container's network namespace and the other is attached to the host bridge
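The wiring described above can be sketched as a sequence of iproute2 commands. The function below only builds the argument lists and does not run them, since creating interfaces needs root; the function name, its parameters, and the interface, namespace, and bridge names in the usage note are all hypothetical.

```python
IFNAME_MAX = 15  # Linux interface names are limited to 15 characters


def veth_setup_commands(host_if, container_if, netns, bridge):
    """Return the `ip` commands that connect a network namespace to a bridge."""
    for name in (host_if, container_if, bridge):
        if not name or len(name) > IFNAME_MAX:
            raise ValueError(f"invalid interface name: {name!r}")
    return [
        # Create the pair: two linked virtual interfaces.
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", container_if],
        # Move one end into the container's network namespace.
        ["ip", "link", "set", container_if, "netns", netns],
        # Attach the host end to the bridge and bring it up.
        ["ip", "link", "set", host_if, "master", bridge],
        ["ip", "link", "set", host_if, "up"],
    ]
```

For instance, `veth_setup_commands("veth0", "ceth0", "demo", "br0")` returns four commands that a root process could execute in order with `subprocess.run(cmd, check=True)`. The container end still needs an address and to be brought up from inside the namespace.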