History

Adrian Reber 60ae7091de checkpoint: support lazy migration

With the help of userfaultfd CRIU supports lazy migration. Lazy migration means that memory pages are only transferred from the migration source to the migration destination on page fault. This enables to reduce the downtime during process or container migration to a minimum as the memory does not need to be transferred during migration.

Lazy migration currently depends on userfaultfd being available on the current Linux kernel and if the used CRIU version supports lazy migration. Both dependencies can be checked by querying CRIU via RPC if the lazy migration feature is available. Using feature checking instead of version comparison enables runC to use CRIU features from the criu-dev branch. This way the user can decide if lazy migration should be available by choosing the right kernel and CRIU branch.

To use lazy migration the CRIU process during dump needs to dump everything besides the memory pages and then it opens a network port waiting for remote page fault requests:

    # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
        --status-fd /tmp/postcopy-pipe

In this example CRIU will hang/wait once it has opened the network port and wait for network connection. As runC waits for CRIU to finish it will also hang until the lazy migration has finished. To know when the restore on the destination side can start the '--status-fd' parameter is used:

    # runc checkpoint --help | grep status
        --status-fd value   criu writes \0 to this FD once lazy-pages is ready

The parameter '--status-fd' is directly from CRIU and this way the process outside of runC which controls the migration knows exactly when to transfer the checkpoint (without memory pages) to the destination and that the restore can be started.

On the destination side it is necessary to start CRIU in 'lazy-pages' mode like this:

    # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
        -D checkpoint

and tell runC to do a lazy restore:

    # runc restore -d --image-path checkpoint --work-path checkpoint \
        --lazy-pages httpd

If both processes on the restore side have the same working directory 'criu lazy-pages' creates a unix domain socket where it waits for requests from the actual restore. runC starts CRIU restore in lazy restore mode and talks to 'criu lazy-pages' that it wants to restore memory pages on demand. CRIU continues to restore the process and once the process is running and accesses the first non-existing memory page the 'criu lazy-pages' server will request the page from the source system. Thus all pages from the source system will be transferred to the destination system. Once all pages have been transferred runC on the source system will end and the container will have finished migration.

This can also be combined with CRIU's pre-copy support. The combination of pre-copy and post-copy (lazy migration) provides the possibility to migrate containers with minimal downtimes.

Some additional background about post-copy migration can be found in these articles:

    https://lisas.de/~adrian/?p=1253
    https://lisas.de/~adrian/?p=1183

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
apparmor	Updating error condition in applying apparmor profile	2016-05-04 19:10:55 +05:30
cgroups	Fix systemd cgroup after memory type changed	2017-08-25 01:14:16 -04:00
configs	Merge pull request #1477 from yummypeng/save-own-ns-path	2017-08-02 11:24:30 +01:00
criurpc	criurpc.proto: copy latest criurpc.proto from criu 3.3	2017-08-02 16:07:32 +00:00
devices	Handle non-devices correctly in DeviceFromPath	2017-08-09 08:52:20 -07:00
integration	Merge pull request #1537 from tklauser/staticcheck	2017-08-02 09:52:11 -04:00
keys	Use keyctl wrappers from x/sys/unix	2017-06-09 15:55:18 +02:00
nsenter	Pass back the pid of runc:[1:CHILD] so we can wait on it	2017-08-05 13:44:36 +10:00
seccomp	libcontainer: use Prctl() from x/sys/unix	2017-07-10 10:56:58 +02:00
specconv	fix panic when Linux is nil for rootless case	2017-08-16 09:11:13 +08:00
stacktrace	fix typos	2016-11-30 13:31:36 +08:00
system	libcontainer: use ioctl wrappers from x/sys/unix	2017-07-10 10:56:58 +02:00
user	Revert "Merge pull request #1450 from vrothberg/sgid-non-numeric"	2017-08-04 14:28:21 -07:00
utils	Move libcontainer to x/sys/unix	2017-05-22 17:35:20 -05:00
xattr	Use symlink xattr functions from x/sys/unix	2017-05-31 13:50:34 +02:00
README.md	update READ.me for new struct configs.Config.Capabilities	2017-06-09 18:47:05 +08:00
SPEC.md	Do not create /dev/fuse by default	2016-08-12 13:00:24 +01:00
capabilities_linux.go	Remove ambient build tag	2017-03-15 11:38:43 -07:00
compat_1.5_linux.go	Don't set /proc/<PID>/setgroups to deny in Go1.5	2015-08-03 14:59:15 -04:00
console.go	Remove terminal info	2017-03-16 10:23:59 -07:00
console_freebsd.go	console: don't chown(2) the slave PTY	2016-12-01 15:49:36 +11:00
console_linux.go	libcontainer: use ioctl wrappers from x/sys/unix	2017-07-10 10:56:58 +02:00
console_solaris.go	console: don't chown(2) the slave PTY	2016-12-01 15:49:36 +11:00
console_windows.go	console: don't chown(2) the slave PTY	2016-12-01 15:49:36 +11:00
container.go	libcontainer: Replace GetProcessStartTime with Stat_t.StartTime	2017-06-20 16:26:55 -07:00
container_linux.go	checkpoint: support lazy migration	2017-09-06 12:35:38 +00:00
container_linux_test.go	Update state after update	2017-08-15 14:38:44 +08:00
container_solaris.go	Get runc to build clean on Solaris	2016-04-12 16:13:08 -07:00
container_windows.go	Windows: Refactor Container interface	2015-11-02 15:12:16 -08:00
criu_opts_linux.go	checkpoint: support lazy migration	2017-09-06 12:35:38 +00:00
criu_opts_windows.go	Windows: Factor down criu_opts	2015-10-23 12:58:59 -07:00
error.go	Fix the outdated comment for Error interface	2017-01-03 15:06:47 +08:00
error_test.go	[unittest] add extra ErrorCode in TestErrorCode testcase	2016-09-22 20:15:54 +08:00
factory.go	could load a stopped container.	2017-04-07 07:39:41 -04:00
factory_linux.go	init: switch away from stateDirFd entirely	2017-08-25 13:19:03 +10:00
factory_linux_test.go	Move libcontainer to x/sys/unix	2017-05-22 17:35:20 -05:00
generic_error.go	libcontainer: refactor syncT handling	2016-12-01 15:46:04 +11:00
generic_error_test.go	add testcase in generic_error_test.go	2017-04-18 08:56:02 +08:00
init_linux.go	init: switch away from stateDirFd entirely	2017-08-25 13:19:03 +10:00
message_linux.go	Use NLA_* constants from x/sys/unix instead of syscall	2017-06-02 10:42:11 +02:00
network_linux.go	Revert "fix minor issue"	2017-03-20 12:28:43 +11:00
notify_linux.go	Fix flaky test TestNotifyOnOOM	2017-08-14 15:18:59 +08:00
notify_linux_test.go	Some fixes for testMemoryNotification	2017-08-14 15:28:03 +08:00
process.go	Add separate console socket	2017-03-16 10:23:59 -07:00
process_linux.go	init: switch away from stateDirFd entirely	2017-08-25 13:19:03 +10:00
restored_process.go	libcontainer: Replace GetProcessStartTime with Stat_t.StartTime	2017-06-20 16:26:55 -07:00
rootfs_linux.go	fix --read-only containers under --userns-remap	2017-08-24 16:43:21 -06:00
rootfs_linux_test.go	Remove check for binding to /	2016-09-29 15:26:09 -07:00
setgroups_linux.go	Don't set /proc/<PID>/setgroups to deny in Go1.5	2015-08-03 14:59:15 -04:00
setns_init_linux.go	libcontainer: use PR_SET_NO_NEW_PRIVS from x/sys/unix	2017-07-13 15:31:33 +02:00
standard_init_linux.go	init: switch away from stateDirFd entirely	2017-08-25 13:19:03 +10:00
state_linux.go	Updated logrus to v1	2017-07-19 15:20:56 +00:00
state_linux_test.go	add createdState and runningState status testcase	2017-04-19 16:28:03 +08:00
stats.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
stats_freebsd.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
stats_linux.go	Update import paths for new repository	2015-06-21 19:29:59 -07:00
stats_solaris.go	Get runc to build clean on Solaris	2016-04-12 16:13:08 -07:00
stats_windows.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
sync.go	Add separate console socket	2017-03-16 10:23:59 -07:00

README.md

libcontainer

Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container, performing additional operations after the container is created.

Container

A container is a self-contained execution environment that shares the kernel of the host system and which is (optionally) isolated from other containers in the system.

Using libcontainer

Because containers are spawned in a two-step process, you will need a binary that will be executed as the init process for the container. In libcontainer, we use the current binary (/proc/self/exe) as the init process and pass it the argument "init". We call this first step "bootstrap", so you always need an "init" function as the entry point of "bootstrap".

In addition to the Go init function, the early-stage bootstrap is handled by importing nsenter.

import (
	"os"
	"runtime"

	"github.com/opencontainers/runc/libcontainer"
	_ "github.com/opencontainers/runc/libcontainer/nsenter"
	"github.com/sirupsen/logrus"
)

func init() {
	// Only take the bootstrap path when re-executed with the "init" argument.
	if len(os.Args) > 1 && os.Args[1] == "init" {
		runtime.GOMAXPROCS(1)
		runtime.LockOSThread()
		factory, _ := libcontainer.New("")
		if err := factory.StartInitialization(); err != nil {
			logrus.Fatal(err)
		}
		panic("--this line should have never been executed, congratulations--")
	}
}
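The two-step idea can be sketched without libcontainer at all. The following is only an illustration of the argument check described above: the same binary decides from its first argument whether it is the parent or the re-executed init process. The helper names `isInitInvocation` and `initCommand` are hypothetical and are not part of libcontainer.

```go
package main

import (
	"fmt"
	"os"
)

// isInitInvocation reports whether the binary was started as the container
// init process, i.e. whether its first argument is "init".
// Hypothetical helper for illustration only.
func isInitInvocation(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

// initCommand returns the argv a parent would use to re-execute the current
// binary as init. Hypothetical helper; it only builds the slice and does not
// start anything.
func initCommand() []string {
	return []string{"/proc/self/exe", "init"}
}

func main() {
	if isInitInvocation(os.Args) {
		fmt.Println("running as container init (bootstrap step)")
		return
	}
	fmt.Println("running as parent; would re-exec:", initCommand())
}
```

Keeping the check in a small function makes it easy to see why the "init" branch must exist in every binary that uses libcontainer this way: without it, the re-executed process would simply run the parent code path again.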

Then to create a container you first have to initialize an instance of a factory that will handle the creation and initialization for a container.

factory, err := libcontainer.New("/var/lib/container", libcontainer.Cgroupfs, libcontainer.InitArgs(os.Args[0], "init"))
if err != nil {
	logrus.Fatal(err)
	return
}

Once you have created an instance of the factory, you can create a configuration struct describing how the container is to be created. A sample would look similar to this:

defaultMountFlags := unix.MS_NOEXEC | unix.MS_NOSUID | unix.MS_NODEV
config := &configs.Config{
	Rootfs: "/your/path/to/rootfs",
	Capabilities: &configs.Capabilities{
                Bounding: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Effective: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Inheritable: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Permitted: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Ambient: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
        },
	Namespaces: configs.Namespaces([]configs.Namespace{
		{Type: configs.NEWNS},
		{Type: configs.NEWUTS},
		{Type: configs.NEWIPC},
		{Type: configs.NEWPID},
		{Type: configs.NEWUSER},
		{Type: configs.NEWNET},
	}),
	Cgroups: &configs.Cgroup{
		Name:   "test-container",
		Parent: "system",
		Resources: &configs.Resources{
			MemorySwappiness: nil,
			AllowAllDevices:  nil,
			AllowedDevices:   configs.DefaultAllowedDevices,
		},
	},
	MaskPaths: []string{
		"/proc/kcore",
		"/sys/firmware",
	},
	ReadonlyPaths: []string{
		"/proc/sys", "/proc/sysrq-trigger", "/proc/irq", "/proc/bus",
	},
	Devices:  configs.DefaultAutoCreatedDevices,
	Hostname: "testing",
	Mounts: []*configs.Mount{
		{
			Source:      "proc",
			Destination: "/proc",
			Device:      "proc",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "tmpfs",
			Destination: "/dev",
			Device:      "tmpfs",
			Flags:       unix.MS_NOSUID | unix.MS_STRICTATIME,
			Data:        "mode=755",
		},
		{
			Source:      "devpts",
			Destination: "/dev/pts",
			Device:      "devpts",
			Flags:       unix.MS_NOSUID | unix.MS_NOEXEC,
			Data:        "newinstance,ptmxmode=0666,mode=0620,gid=5",
		},
		{
			Device:      "tmpfs",
			Source:      "shm",
			Destination: "/dev/shm",
			Data:        "mode=1777,size=65536k",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "mqueue",
			Destination: "/dev/mqueue",
			Device:      "mqueue",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "sysfs",
			Destination: "/sys",
			Device:      "sysfs",
			Flags:       defaultMountFlags | unix.MS_RDONLY,
		},
	},
	UidMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID:      1000,
			Size:        65536,
		},
	},
	GidMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID:      1000,
			Size:        65536,
		},
	},
	Networks: []*configs.Network{
		{
			Type:    "loopback",
			Address: "127.0.0.1/0",
			Gateway: "localhost",
		},
	},
	Rlimits: []configs.Rlimit{
		{
			Type: unix.RLIMIT_NOFILE,
			Hard: uint64(1025),
			Soft: uint64(1025),
		},
	},
}
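The UidMappings and GidMappings entries in the sample follow the usual Linux user namespace semantics: each entry maps a contiguous range of Size IDs starting at ContainerID onto host IDs starting at HostID. A minimal sketch of that arithmetic, using a local `IDMap` struct that merely mirrors the three fields shown above (it is a stand-in, not the configs.IDMap type) and a hypothetical `hostID` helper:

```go
package main

import "fmt"

// IDMap mirrors the three fields used in the sample configuration. It is a
// local stand-in for illustration, not the configs.IDMap type itself.
type IDMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// hostID translates an ID as seen inside the container into the matching
// host ID. The second return value is false when no mapping covers the ID.
// Hypothetical helper for illustration only.
func hostID(mappings []IDMap, containerID int) (int, bool) {
	for _, m := range mappings {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	// Same values as the sample configuration above.
	mappings := []IDMap{{ContainerID: 0, HostID: 1000, Size: 65536}}

	for _, id := range []int{0, 33, 65536} {
		if h, ok := hostID(mappings, id); ok {
			fmt.Printf("container ID %d -> host ID %d\n", id, h)
		} else {
			fmt.Printf("container ID %d is unmapped\n", id)
		}
	}
}
```

With the sample values, root inside the container (ID 0) corresponds to host ID 1000, and any container ID at or beyond 65536 has no mapping.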

Once you have the configuration populated you can create a container:

container, err := factory.Create("container-id", config)
if err != nil {
	logrus.Fatal(err)
	return
}

To spawn bash as the initial process inside the container and have the process's pid returned in order to wait on, signal, or kill the process:

process := &libcontainer.Process{
	Args:   []string{"/bin/bash"},
	Env:    []string{"PATH=/bin"},
	User:   "daemon",
	Stdin:  os.Stdin,
	Stdout: os.Stdout,
	Stderr: os.Stderr,
}

// scope err to each statement so the snippets compile when placed together.
if err := container.Run(process); err != nil {
	container.Destroy()
	logrus.Fatal(err)
	return
}

// wait for the process to finish.
if _, err := process.Wait(); err != nil {
	logrus.Fatal(err)
}

// destroy the container.
container.Destroy()

Additional ways to interact with a running container are:

// return all the pids for all processes running inside the container.
processes, err := container.Processes()

// get detailed cpu, memory, io, and network statistics for the container and
// its processes.
stats, err := container.Stats()

// pause all processes inside the container.
container.Pause()

// resume all paused processes.
container.Resume()

// send signal to container's init process.
container.Signal(signal)

// update container resource constraints.
container.Set(config)

// get current status of the container.
status, err := container.Status()

// get current container's state information.
state, err := container.State()

Checkpoint & Restore

libcontainer now integrates CRIU for checkpointing and restoring containers. This lets you save the state of a process running inside a container to disk, and then restore that state into a new process, on the same machine or on another machine.

criu version 1.5.2 or higher is required to use checkpoint and restore. If you don't already have criu installed, you can build it from source, following the online instructions. criu is also installed in the docker image generated when building libcontainer with docker.
