Virtualisation

How containers really work: namespaces, cgroups and images

A container is just an isolated process — here is exactly what Linux does to make that true.

16 min read · updated 20 Jun 2026

This concept is explained in five layers — from a simple analogy up to a deep technical dive. Read top-to-bottom, or jump to your level.

On this page

Explain like I'm 5
Beginner
Intermediate
Advanced
Deep dive
FAQ

Level 1 · Containers are like labelled lunchboxes that share the same school kitchen.

Explain like I'm 5

Imagine a big school kitchen. Every child brings their own lunchbox. The kitchen — the cooker, the sink, the electricity — is shared by everyone. But each child can only see and eat what is in their own box. They cannot reach into someone else's box, and they cannot make a mess that spoils anyone else's lunch.

A container is like one of those lunchboxes. The shared kitchen is the Linux kernel running on the server. Each container is a program that thinks it is alone on the whole machine — it has its own view of files, its own network address, and its own list of processes. But underneath, it is just borrowing the school's kitchen, the same as everyone else.

Why this matters

Because everyone shares the same kitchen, containers start in a fraction of a second and use far less memory than building a whole separate kitchen, which is what a virtual machine does.

A virtual machine, by contrast, is like building an entirely new kitchen inside the first one — walls, cooker, sink, and all. It is completely separate, but it takes up a lot more room and takes longer to set up.

Level 2 · The what and why: containers share a kernel, VMs virtualise hardware.

Beginner

A container is a Linux process (or group of processes) that runs with an isolated view of the system (its own filesystem, its own network interfaces, its own process tree) while sharing the host kernel directly. There is no guest kernel to boot and no hypervisor emulating hardware underneath it.

A virtual machine does the opposite. A hypervisor (KVM, VMware, Hyper-V) virtualises the hardware itself. Each VM boots its own kernel, its own init system, its own userland. This makes VMs heavier and slower to start, but also more strongly isolated: a bug in one VM's kernel cannot directly affect another.

The single most important fact

Containers share the host kernel. That is simultaneously why they are fast, lightweight, and portable — and why their security boundary is thinner than a VM's.

Linux builds containers from three main kernel features:

Namespaces — give each container its own isolated view of kernel resources (processes, networking, filesystems, hostname, and more).
cgroups (control groups) — limit how much CPU, memory, I/O, and other resources a container may consume.
Union filesystems — let container images be assembled from read-only layers, with a thin writable layer on top, keeping images small and shareable.

When you run docker run nginx, Docker (or Podman) asks the kernel to create a new set of namespaces, applies cgroup limits, mounts the image layers, and then executes the nginx binary inside that environment. Nginx sees itself as PID 1 on a fresh machine. The host kernel sees it as just another process.
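
One way to see that a container is just a process with different namespaces is to look under /proc, where the kernel exposes each process's namespace membership as symbolic links. The sketch below is a small helper (the name list_namespaces is made up for this example) that prints those links for a PID, defaulting to the current shell. It assumes a Linux host with /proc mounted.

```shell
# list_namespaces is a hypothetical helper name, not a standard tool.
# Prints one "type -> type:[inode]" line per namespace of the given PID.
list_namespaces() {
    pid="${1:-self}"
    for ns in /proc/"$pid"/ns/*; do
        printf '%s -> %s\n' "${ns##*/}" "$(readlink "$ns")"
    done
}

# Namespaces of the current shell
list_namespaces
```

Run it once for your own shell and once (typically as root) for the host PID of a container process: every namespace type whose inode number differs is one the container does not share with you.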

Level 3 · How it really works: the full stack from image layers to the runtime.

Intermediate

Let us look at the container stack first, compare it with a VM's, then drill into each primitive.

Your App

Container (namespaces + cgroups)

Container Engine (Docker / Podman)

Host Linux Kernel

Hardware

VM stack vs container stack — containers collapse the guest OS layers entirely.

For comparison, a VM inserts a Guest OS Kernel, a Hypervisor, and often a firmware/BIOS layer between your app and the hardware. A container skips all of that.

Namespaces are the core isolation mechanism. Linux defines eight namespace types; the seven below are the ones container runtimes routinely use (the eighth, time, arrived in Linux 5.6 and is not shown):

Namespace	Flag	What it isolates
`pid`	`CLONE_NEWPID`	Process IDs — container sees its own tree starting at PID 1
`net`	`CLONE_NEWNET`	Network interfaces, routing tables, iptables rules, port numbers
`mnt`	`CLONE_NEWNS`	Mount points and filesystem view (the original namespace flag)
`uts`	`CLONE_NEWUTS`	Hostname and NIS domain name
`ipc`	`CLONE_NEWIPC`	System V IPC and POSIX message queues
`user`	`CLONE_NEWUSER`	UID/GID mappings — allows UID 0 inside to map to an unprivileged UID outside
`cgroup`	`CLONE_NEWCGROUP`	Gives the container its own view of the cgroup hierarchy

cgroups v2 (the unified hierarchy, stable since Linux 4.5 and the default on most current distributions) organises processes into a tree and enforces resource budgets. A typical container runtime sets limits such as:

# CPU — period 100 ms, quota 50 ms = 50% of one core
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/cpu.max
# 50000 100000

# Memory hard limit
cat /sys/fs/cgroup/system.slice/docker-<id>.scope/memory.max
# 268435456

Equivalent kernel cgroup knobs that `docker run --cpus 0.5 --memory 256m` writes
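
To make the raw numbers in those two files easier to read, here is a minimal sketch that converts them back into the units used on the docker run command line. The helper names cpu_limit and mem_limit_mib are invented for this example; the input formats are the ones shown above, plus the literal value max that cgroups v2 uses for "no limit".

```shell
# cpu_limit and mem_limit_mib are hypothetical helper names.
# cpu.max holds "<quota> <period>" in microseconds, or "max <period>" for no limit.
cpu_limit() {
    echo "$1" | awk '{ if ($1 == "max") print "unlimited"; else printf "%.2f\n", $1 / $2 }'
}

# memory.max holds a byte count, or "max" for no limit.
mem_limit_mib() {
    if [ "$1" = "max" ]; then echo "unlimited"; else echo "$(( $1 / 1048576 ))"; fi
}

cpu_limit "50000 100000"    # prints 0.50 (CPUs)
mem_limit_mib 268435456     # prints 256 (MiB)
```

On a live host you would feed the helpers the file contents, for example cpu_limit "$(cat .../cpu.max)" with the same path as above.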

Images and OverlayFS — a container image is a stack of read-only layers, each one a tar archive of filesystem changes. OverlayFS merges them into a single unified directory tree with one writable upper layer on top (the container layer). Writes trigger copy-on-write: the kernel copies the file from a lower read-only layer into the upper layer before modifying it. Delete and re-create a container and the upper layer vanishes; the shared lower layers remain untouched, saving both disk space and pull time.

Writable container layer (upperdir)

Image layer 3 — app binary (read-only)

Image layer 2 — apt packages (read-only)

Image layer 1 — base OS (read-only)

OverlayFS: multiple read-only image layers merged under a thin writable container layer.
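
The lookup and copy-on-write rules can be modelled in user space with ordinary directories. The sketch below is only a model: it mounts nothing, the layer names lower1, lower2 and upper and both helper functions are invented for illustration, and real OverlayFS performs the equivalent steps inside the kernel.

```shell
# A user-space model of OverlayFS lookup and copy-up using plain directories.
# It does not mount anything; real OverlayFS does this inside the kernel.
work=$(mktemp -d)
mkdir "$work/lower1" "$work/lower2" "$work/upper"
echo "base os"    > "$work/lower1/motd"
echo "app config" > "$work/lower2/app.conf"

# Read: the first layer that has the file wins, searched top-down.
overlay_read() {
    for layer in upper lower2 lower1; do
        if [ -f "$work/$layer/$1" ]; then cat "$work/$layer/$1"; return 0; fi
    done
    return 1
}

# Write: copy the file up into the writable layer first, then modify the copy.
overlay_append() {
    if [ ! -f "$work/upper/$1" ]; then
        for layer in lower2 lower1; do
            if [ -f "$work/$layer/$1" ]; then cp "$work/$layer/$1" "$work/upper/$1"; break; fi
        done
    fi
    echo "$2" >> "$work/upper/$1"
}

overlay_append motd "edited in container"
overlay_read motd           # the upper copy: both lines
cat "$work/lower1/motd"     # the lower layer still holds only "base os"
```

Emptying the upper directory in this model corresponds to deleting the container: the edit disappears and reads fall through to the untouched lower layers again.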

The runtime stack follows the OCI specifications. When you run a container:

Request flow from the CLI to the kernel when launching a container.

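Docker Engine: the user-facing API and daemon (dockerd). Podman fills the same role without a daemon and invokes the OCI runtime itself, so it does not go through containerd.
containerd: the daemon Docker delegates to; it manages image pulls, snapshots, and container lifecycle, and hands each container to an OCI runtime.
runc: the low-level runtime that actually calls clone() or unshare(), applies the cgroup limits, pivots into the root filesystem that the layers above it have already assembled from the image, and exec()s the entry-point. It is the reference implementation of the OCI Runtime Specification.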
OCI Image Spec — defines how image layers and manifests are stored and distributed.
OCI Runtime Spec — defines the config.json bundle that runc consumes.
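
To give a feel for what the runtime consumes, the sketch below writes an abridged config.json showing where the namespaces and the cgroup limits from earlier appear. It is illustrative rather than a complete bundle: several fields a real runtime expects are left out, the nginx arguments are an assumed entry-point, and the runc spec command generates a full default file.

```shell
# Writes an abridged, illustrative config.json. A real bundle needs more
# fields; `runc spec` generates a complete default file.
bundle=$(mktemp -d)
cat > "$bundle/config.json" <<'EOF'
{
  "ociVersion": "1.0.2",
  "process": { "args": ["nginx", "-g", "daemon off;"], "cwd": "/" },
  "root": { "path": "rootfs" },
  "linux": {
    "namespaces": [
      { "type": "pid" }, { "type": "network" }, { "type": "mount" },
      { "type": "uts" }, { "type": "ipc" }
    ],
    "resources": {
      "memory": { "limit": 268435456 },
      "cpu": { "quota": 50000, "period": 100000 }
    }
  }
}
EOF
cat "$bundle/config.json"
```

The resources block carries the same 256 MiB and 50 ms per 100 ms values that end up in memory.max and cpu.max.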

Level 4 · Networking internals, Kubernetes pods, and real-world resource accounting.

Advanced

Container networking with veth pairs: when the engine creates a network namespace for a container, its networking layer (libnetwork in Docker, a CNI plugin such as bridge under Kubernetes) creates a virtual Ethernet pair, that is, two linked virtual NICs. One end (eth0) lives inside the container's net namespace; the other end (veth0abc) lives in the host namespace and is attached to a Linux bridge (docker0 on Docker's default network). Traffic between containers flows across that bridge; traffic to the outside world is NAT'd via iptables MASQUERADE rules. ip link on the host shows the host-side veth interfaces and the bridge they belong to; ip netns list usually shows nothing for Docker containers, because Docker does not register its namespaces under /var/run/netns.
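
The veth plumbing can be reproduced by hand with iproute2. The sketch below is a simplified point-to-point version that leaves out the bridge and the NAT rules; the names demo, veth-host and veth-ctr and the 10.200.0.0/24 range are made up for this example. It only prints the commands unless you set APPLY=1 and run it as root.

```shell
# Builds a veth pair by hand, roughly what a bridge network driver does.
# The names demo, veth-host, veth-ctr and the 10.200.0.0/24 range are made up.
# Prints the commands by default; set APPLY=1 and run as root to execute them.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

setup_veth_demo() {
    run ip netns add demo                                    # new network namespace
    run ip link add veth-host type veth peer name veth-ctr   # the linked pair
    run ip link set veth-ctr netns demo                      # move one end inside
    run ip addr add 10.200.0.1/24 dev veth-host
    run ip link set veth-host up
    run ip netns exec demo ip addr add 10.200.0.2/24 dev veth-ctr
    run ip netns exec demo ip link set veth-ctr up
}

setup_veth_demo
```

After applying it for real, ip netns exec demo ping 10.200.0.1 should reach the host end, and ip netns del demo removes the namespace together with the pair.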

Inspecting live container namespaces

Find the container's host PID with docker inspect --format '{{.State.Pid}}' <name>, then enter any namespace with nsenter -t <pid> --net --pid -- bash. You will be inside the container's network and PID view without needing a shell in the image.

Kubernetes pods re-use the same primitives. A pod is a group of containers that share a single net, uts, and ipc namespace, while each keeps its own mnt namespace (and, by default, its own pid namespace). This is why containers in the same pod can talk on localhost and exchange data over System V IPC or POSIX message queues, yet have independent filesystems. A pause container (the 'infra' container) holds the shared namespaces open for the lifetime of the pod; the application containers join them.
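
Namespace sharing between two processes can be checked by comparing the links under /proc/<pid>/ns. The helper below, same_ns, is a hypothetical name for this sketch; it assumes a Linux host and, for processes you do not own, root privileges. The demo compares the shell with itself only so that it runs anywhere.

```shell
# same_ns is a hypothetical helper: succeeds if two PIDs share a namespace type.
# Returns 2 if either process (or namespace link) cannot be read.
same_ns() {
    a=$(readlink "/proc/$1/ns/$3") || return 2
    b=$(readlink "/proc/$2/ns/$3") || return 2
    [ "$a" = "$b" ]
}

# A process trivially shares every namespace with itself.
for t in net uts ipc mnt; do
    if same_ns "$$" "$$" "$t"; then echo "$t: shared"; else echo "$t: separate"; fi
done
```

Given the host PIDs of two containers from the same pod, you would expect net, uts and ipc to report shared and mnt to report separate.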

cgroups v1 vs v2 — cgroups v1 allowed controllers to be mounted independently, leading to inconsistency (a process could be in different hierarchies for cpu vs memory). cgroups v2 uses a single unified hierarchy at /sys/fs/cgroup. Kubernetes added full cgroups v2 support in 1.25. When specifying --cpus or memory limits, remember these are enforced on the host scheduler, not a virtual CPU count — a container limited to 0.5 CPUs on a 32-core host gets 50 ms of every 100 ms period, shared across all its threads.
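
The "shared across all its threads" point is worth working through with numbers. The sketch below uses a deliberately simplified model, assuming every thread is CPU-bound and runs on its own core, to show how quickly a multi-threaded container burns its quota. The function name throttle_profile is invented for this example.

```shell
# throttle_profile QUOTA_US PERIOD_US BUSY_THREADS
# Prints how long the threads run before the quota is spent, and how long
# they then wait for the next period (all in microseconds of wall time).
# Simplified model: assumes every thread is CPU-bound on its own core.
throttle_profile() {
    awk -v q="$1" -v p="$2" -v n="$3" 'BEGIN {
        run = q / n; if (run > p) run = p
        printf "run_us=%d throttled_us=%d\n", run, p - run
    }'
}

throttle_profile 50000 100000 1   # run_us=50000 throttled_us=50000
throttle_profile 50000 100000 4   # run_us=12500 throttled_us=87500
```

With four busy threads the 50 ms budget is gone after 12.5 ms of wall time, and the whole container then stalls for the remaining 87.5 ms of the period, which is why heavily threaded services can show latency spikes under a small CPU limit.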

Image layer deduplication: because image layers are content-addressed and stored once on disk, ten containers based on the same ubuntu:24.04 base image share one copy of those layers. This matters at scale: a node running 50 nginx containers may hold only one copy of the nginx image layers, with 50 thin writable upper-dirs. However, large writes inside a container (e.g. database files) should go to a volume rather than the overlay upper layer, which adds copy-on-write I/O overhead.

OOM kills are silent

When a container exceeds its memory limit, the kernel OOM-killer terminates it — often with no log visible inside the container. Check dmesg on the host or kubectl describe pod for OOMKilled in the container state.
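
On a cgroups v2 host the kernel also records OOM kills per cgroup in a memory.events file that sits next to memory.max. The sketch below parses that format with a hypothetical helper, oom_kills, and runs it against a sample file whose values are invented so that the example works without a container.

```shell
# oom_kills FILE: prints the oom_kill counter from a cgroup v2 memory.events file.
oom_kills() {
    awk '$1 == "oom_kill" { print $2; found = 1 } END { if (!found) print 0 }' "$1"
}

# Demo against a sample file laid out like memory.events (values invented).
sample=$(mktemp)
printf 'low 0\nhigh 0\nmax 12\noom 1\noom_kill 1\n' > "$sample"
oom_kills "$sample"   # prints 1
```

On a real host you would point it at the memory.events file in the container's cgroup directory, for example the docker-<id>.scope directory used for cpu.max and memory.max earlier.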

Level 5 · Security model, rootless containers, capabilities, seccomp, and --privileged.

Deep dive

Linux capabilities: traditional Unix privileges were binary. Root (UID 0) could do everything; everyone else was restricted. Since Linux 2.2, root's power has been split into roughly 40 discrete capabilities (CAP_NET_ADMIN, CAP_SYS_PTRACE, CAP_MKNOD, etc.). Container runtimes drop most of them by default. Docker's default set retains roughly 14 capabilities, including CAP_NET_BIND_SERVICE for binding ports below 1024 inside the container's own network namespace. That is enough to run most web services but not enough to load kernel modules (CAP_SYS_MODULE) or reconfigure network interfaces (CAP_NET_ADMIN).

# Inside the container:
grep CapEff /proc/self/status | awk '{print $2}' | xargs -I{} capsh --decode={}

# Or from the host, for any PID:
grep Cap /proc/<pid>/status

Inspect effective capabilities inside a running container
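
If capsh is not installed in the image, the CapEff bitmask can be decoded with plain shell arithmetic, because each capability is one bit in kernel-defined order. The helper decode_caps below is a hypothetical name for this sketch, and the sample mask 00000000a80425fb is the value commonly reported inside a default Docker container; treat it as an assumption, since the default set can differ between versions and configurations.

```shell
# decode_caps HEX: prints the capability names set in a CapEff-style bitmask.
# Names are listed in kernel bit order (bit 0 first), as in linux/capability.h.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in CHOWN DAC_OVERRIDE DAC_READ_SEARCH FOWNER FSETID KILL SETGID \
        SETUID SETPCAP LINUX_IMMUTABLE NET_BIND_SERVICE NET_BROADCAST NET_ADMIN \
        NET_RAW IPC_LOCK IPC_OWNER SYS_MODULE SYS_RAWIO SYS_CHROOT SYS_PTRACE \
        SYS_PACCT SYS_ADMIN SYS_BOOT SYS_NICE SYS_RESOURCE SYS_TIME \
        SYS_TTY_CONFIG MKNOD LEASE AUDIT_WRITE AUDIT_CONTROL SETFCAP \
        MAC_OVERRIDE MAC_ADMIN SYSLOG WAKE_ALARM BLOCK_SUSPEND AUDIT_READ \
        PERFMON BPF CHECKPOINT_RESTORE; do
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then echo "CAP_$name"; fi
        bit=$((bit + 1))
    done
}

# Assumed sample: the mask commonly seen in a default Docker container.
decode_caps 00000000a80425fb
```

For that sample mask the output is 14 names, including CAP_NET_BIND_SERVICE and CAP_CHOWN but not CAP_SYS_MODULE or CAP_NET_ADMIN.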

seccomp filters add a second layer: a Berkeley Packet Filter (BPF) programme attached to the process inspects every syscall number (and, optionally, its arguments) and either allows the call, fails it with an error code, or kills the process. Docker ships a default seccomp profile, written as an allowlist, that ends up blocking around 44 syscalls (including keyctl, mount, reboot, and several esoteric ones; ptrace is blocked only on kernels older than 4.8). The profile is a JSON file you can audit in the Moby repository; it is built into the daemon rather than installed as a separate file on the host.

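no_new_privs: when the container's OCI config requests it, runc sets the PR_SET_NO_NEW_PRIVS prctl flag on the container process, preventing it (and any child) from gaining new privileges via setuid binaries or file capabilities. Docker does not turn this on by default; you opt in with --security-opt no-new-privileges, and Kubernetes sets it when allowPrivilegeEscalation is false. With the flag set, even if an attacker manages to run a setuid binary inside the container, it will not elevate them.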

Rootless containers and user namespaces — this is where it gets elegant. A user namespace remaps UIDs. Inside the namespace, UID 0 is mapped to, say, host UID 100000. From the kernel's perspective, the process is unprivileged. The container process thinks it is root and can do things that require UID 0 inside its namespaces (mount certain filesystems, bind to ports below 1024 inside the net namespace, etc.) but the host kernel enforces the mapping — it can never affect anything outside the namespace with host-root power.

# On the host, see the subuid range allocated to your user:
cat /etc/subuid
# keithuk:100000:65536

# Inside a rootless container:
id
# uid=0(root) gid=0(root)

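# On the host, find the real UID of that process:
ps -o pid,uid,cmd -p $(podman inspect --format '{{.State.Pid}}' mycontainer)
# PID    UID   CMD
# 18342  1000  nginx: master process
# (1000 stands for your own login UID here: rootless Podman maps container
#  root to the invoking user, and other container UIDs into the subuid range)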

Verify UID mapping for a rootless Podman container
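
The translation the kernel applies is recorded in /proc/<pid>/uid_map, one range per line in the form "inside-start outside-start length". The sketch below walks such a map with a hypothetical helper, map_uid; the sample maps and all numbers in them are illustrative, and the exact layout on your machine depends on the tool and its configuration.

```shell
# map_uid CONTAINER_UID MAPFILE: translates a UID through uid_map-style lines,
# each "<inside-start> <outside-start> <length>".
map_uid() {
    awk -v uid="$1" '
        uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
        END { if (!found) print "unmapped" }
    ' "$2"
}

map=$(mktemp)

# One range starting at container UID 0 (illustrative numbers).
printf '0 100000 65536\n' > "$map"
map_uid 0 "$map"     # prints 100000

# Two ranges: container root to a login UID, the rest to the subuid range.
printf '0 1000 1\n1 100000 65536\n' > "$map"
map_uid 0 "$map"     # prints 1000
map_uid 33 "$map"    # prints 100032
```

To apply it to a running container, pass /proc/<pid>/uid_map as the second argument, using the host PID from the inspect command above.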

The --privileged exception

docker run --privileged disables all of this: it restores the full capability set, disables seccomp, disables AppArmor/SELinux confinement, and mounts the host /dev tree. The container can load kernel modules, mount host filesystems, and modify the host network. Never run untrusted images with --privileged — it is effectively equivalent to giving the process host root.

AppArmor and SELinux provide a further Mandatory Access Control (MAC) layer orthogonal to namespaces and capabilities. Docker generates an AppArmor profile (docker-default) that, among other things, prevents reading /proc/sysrq-trigger and writing to /proc/sys. Kubernetes nodes in the Red Hat family rely on SELinux labels (container_t, which replaced the older svirt_lxc_net_t name). These are defence-in-depth: they limit damage even if namespace isolation is bypassed.

Image supply-chain security — an image is only as trustworthy as its layers. Each layer is a tar pulled by content-addressed digest (SHA-256), so layer tampering is detectable. However, the tag-to-digest mapping is mutable. Use image digests (image@sha256:...) in production manifests rather than tags. Tools like Cosign (Sigstore) provide cryptographic image signing; Trivy or Grype scan layers for known CVEs before they reach production.
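
Content addressing is simple to check by hand: hash the blob and compare the result with the digest you pinned. The sketch below shows the idea with a stand-in file rather than a real layer. The helper name verify_digest is invented for this example, and it assumes the coreutils sha256sum command is available.

```shell
# verify_digest FILE DIGEST: succeeds if FILE's SHA-256 matches "sha256:<hex>".
verify_digest() {
    actual="sha256:$(sha256sum < "$1" | cut -d' ' -f1)"
    [ "$actual" = "$2" ]
}

# Demo with a stand-in blob; a real layer would be a tar archive from a registry.
blob=$(mktemp)
echo "layer contents" > "$blob"
pinned="sha256:$(sha256sum < "$blob" | cut -d' ' -f1)"

verify_digest "$blob" "$pinned" && echo "digest ok"
echo "tampered" >> "$blob"
verify_digest "$blob" "$pinned" || echo "digest mismatch"
```

A tag offers no such guarantee, because the registry can point it at a different digest at any time; pinning image@sha256:... makes the pull fail instead of silently changing content.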

Common escape vectors (for threat-modelling): the vast majority of historical container escapes have involved one of four things: a mounted Docker socket (/var/run/docker.sock) giving full daemon control; --privileged combined with mounting /dev or /proc; vulnerabilities in the runtime or kernel that remain reachable despite seccomp (e.g. runc CVE-2019-5736 via /proc/self/exe); and writable host path mounts. Mitigations: do not mount the socket into containers, do not use --privileged, keep the kernel and runtime patched, and run rootless.

The one-paragraph summary

A container is a Linux process given an isolated view of the world via namespaces (pid, net, mnt, uts, ipc, user, cgroup) and throttled by cgroups; its filesystem is a stack of read-only image layers merged by OverlayFS with a thin writable upper layer on top. The runtime chain (Docker to containerd to runc, or Podman straight to the OCI runtime) translates a declarative image and run command into a clone() syscall that creates those namespaces, then exec()s the entry-point inside them. Security comes from dropped capabilities, a seccomp syscall filter, optionally no_new_privs, and optionally user namespaces that map container-root to an unprivileged host UID. But --privileged tears all of that down and should be treated as equivalent to host root. Kubernetes pods are simply a set of containers sharing one net/uts/ipc namespace group, held open by a pause container.

Frequently asked questions

What is the difference between a container and a virtual machine?

A VM virtualises hardware and runs its own kernel; every VM carries a full OS. A container shares the host kernel directly and isolates using Linux namespaces and cgroups. Containers start in milliseconds, use tens of MB of memory overhead instead of GB, but offer a thinner security boundary because a kernel bug can potentially affect all containers on the host.

Is root inside a container the same as root on the host?

It is the same UID, but usually not the same power. By default, Docker runs container UID 0 as host UID 0 (which is a risk), but capabilities are heavily dropped and seccomp blocks dangerous syscalls. With rootless containers and user namespaces, container UID 0 maps to an unprivileged host UID, so it genuinely has no host-root power. The dangerous exception is --privileged, which effectively grants host root.

What is OverlayFS and why does it matter?

OverlayFS is a union filesystem built into the Linux kernel. It merges multiple read-only image layers into one view and places a writable layer on top. Writes trigger copy-on-write: the kernel copies a file from a read-only layer into the writable layer before modifying it. This lets many containers share the same base image layers on disk, saving significant space.

What is the difference between Docker and containerd?

Docker is a developer-facing toolchain (CLI, build, push, compose). Underneath it, containerd is the OCI-compliant daemon that actually manages image storage, snapshots, and container lifecycle. runc is the low-level binary that makes the actual kernel calls. Kubernetes removed dockershim, its built-in adapter for the Docker daemon, in v1.24; it now talks to containerd (or another CRI runtime such as CRI-O) directly through the CRI.

What are Linux capabilities and why do containers drop them?

Capabilities are fine-grained subdivisions of root privilege (e.g. CAP_NET_ADMIN, CAP_SYS_PTRACE). Container runtimes drop all but a minimal set by default so that even if a process inside the container is running as UID 0, it cannot perform dangerous operations like loading kernel modules or modifying host network interfaces. You can inspect the effective set with `grep CapEff /proc/self/status`.

How does a Kubernetes pod relate to containers?

A pod is a group of containers that share a single network namespace, UTS namespace, and IPC namespace — so they share an IP address and can communicate on localhost. Each container still has its own mount namespace (its own filesystem). A pause container holds the shared namespaces open for the pod's lifetime, and the application containers attach to them at start-up.
