History

Xiaochen Shen 27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc		2018-10-16 14:29:29 +08:00

Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttling of memory bandwidth for software. A user controls the resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If the hardware supports it, the CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:

    mount -t resctrl resctrl /sys/fs/resctrl
    tree /sys/fs/resctrl
    /sys/fs/resctrl/
    |-- info
    |   |-- L3
    |   |   |-- cbm_mask
    |   |   |-- min_cbm_bits
    |   |   |-- num_closids
    |   |-- MB
    |       |-- bandwidth_gran
    |       |-- delay_linear
    |       |-- min_bandwidth
    |       |-- num_closids
    |-- ...
    |-- schemata
    |-- tasks
    |-- <container_id>
        |-- ...
        |-- schemata
        |-- tasks

For MBA support in `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT, which was implemented in #1279. We can also make use of the `tasks` and `schemata` configuration for memory bandwidth resource constraints.

The file `tasks` has a list of tasks that belong to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format.

Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage.

    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth is 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.

    "linux": {
        "intelRdt": {
            "memBwSchema": "MB:0=20;1=70"
        }
    }

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
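The schema format and the control-step rule (min_bw + N * bw_gran) from the commit message above can be illustrated with a small, standard-library-only sketch. This is not runc's implementation: `parseMemBwSchema` and `roundToStep` are hypothetical helper names, and rounding upward is an assumption about what "rounded to the next control step" means.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// roundToStep rounds a requested bandwidth percentage up to the nearest
// control step of the form minBw + N*gran. Rounding upward is an assumption;
// the values would come from info/MB/min_bandwidth and info/MB/bandwidth_gran.
func roundToStep(requested, minBw, gran int) int {
	if gran <= 0 || requested <= minBw {
		return minBw
	}
	steps := (requested - minBw + gran - 1) / gran
	return minBw + steps*gran
}

// parseMemBwSchema parses a line such as "MB:0=20;1=70" into a map from
// L3 cache ID to bandwidth percentage (hypothetical helper, for illustration).
func parseMemBwSchema(schema string) (map[int]int, error) {
	if !strings.HasPrefix(schema, "MB:") {
		return nil, fmt.Errorf("schema %q does not start with \"MB:\"", schema)
	}
	alloc := make(map[int]int)
	for _, entry := range strings.Split(strings.TrimPrefix(schema, "MB:"), ";") {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		id, err := strconv.Atoi(strings.TrimSpace(parts[0]))
		if err != nil {
			return nil, fmt.Errorf("bad cache id in %q: %v", entry, err)
		}
		bw, err := strconv.Atoi(strings.TrimSpace(parts[1]))
		if err != nil {
			return nil, fmt.Errorf("bad bandwidth in %q: %v", entry, err)
		}
		alloc[id] = bw
	}
	return alloc, nil
}

func main() {
	alloc, err := parseMemBwSchema("MB:0=20;1=70")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With min_bandwidth=10 and bandwidth_gran=10, both values from the
	// example are already valid control steps.
	for id, bw := range alloc {
		fmt.Printf("cache %d: requested %d%%, step %d%%\n", id, bw, roundToStep(bw, 10, 10))
	}
}
```

The sketch does not clamp values above 100%; a real caller would validate against the limits reported by the kernel.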
apparmor	libcontainer: remove dependency on libapparmor	2017-12-15 09:59:58 +01:00
cgroups	Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr	2018-10-15 17:32:15 -07:00
configs	libcontainer: intelrdt: add support for Intel RDT/MBA in runc	2018-10-16 14:29:29 +08:00
criurpc	criurpc.proto: copy latest criurpc.proto from criu 3.3	2017-08-02 16:07:32 +00:00
devices	libcontainer: devices: fix mips builds	2018-06-17 11:22:01 +10:00
integration	Fix race in runc exec	2018-06-01 16:25:58 -07:00
intelrdt	libcontainer: intelrdt: add support for Intel RDT/MBA in runc	2018-10-16 14:29:29 +08:00
keys	keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)	2018-09-17 21:38:30 +10:00
mount	remove placeholder for non-linux platforms	2017-11-24 18:14:51 +00:00
nsenter	Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr	2018-10-15 17:32:15 -07:00
seccomp	Fix breaking change in Seccomp profile behavior	2017-10-18 11:53:56 -04:00
specconv	libcontainer: intelrdt: add support for Intel RDT/MBA in runc	2018-10-16 14:29:29 +08:00
stacktrace	doc: fix typo	2018-09-07 11:58:59 +08:00
system	libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)	2018-06-14 18:33:14 +00:00
user	libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)	2018-06-14 18:33:14 +00:00
utils	test: add more test case for CleanPath	2018-09-14 21:37:12 +08:00
README.md	update READ.me for new struct configs.Config.Capabilities	2017-06-09 18:47:05 +08:00
SPEC.md	libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md	2018-10-16 14:28:19 +08:00
capabilities_linux.go	libcontainer/capabilities_linux: Drop os.Getpid() call	2018-02-19 15:47:42 -08:00
console_linux.go	tty: move IO of master pty to be done with epoll	2017-07-28 12:35:02 +01:00
container.go	libcontainer: Replace GetProcessStartTime with Stat_t.StartTime	2017-06-20 16:26:55 -07:00
container_linux.go	Disable rootless mode except RootlessCgMgr when executed as the root in userns	2018-09-07 15:05:03 +09:00
container_linux_test.go	doc: fix typo	2018-09-07 11:58:59 +08:00
criu_opts_linux.go	Update criu_opts_linux.go	2017-12-05 15:16:26 +08:00
error.go	Fix the outdated comment for Error interface	2017-01-03 15:06:47 +08:00
error_test.go	[unittest] add extra ErrorCode in TestErrorCode testcase	2016-09-22 20:15:54 +08:00
factory.go	could load a stopped container.	2017-04-07 07:39:41 -04:00
factory_linux.go	libcontainer: intelrdt: add support for Intel RDT/MBA in runc	2018-10-16 14:29:29 +08:00
factory_linux_test.go	Import docker/docker/pkg/mount into runc	2017-11-08 16:25:58 +01:00
generic_error.go	libcontainer: refactor syncT handling	2016-12-01 15:46:04 +11:00
generic_error_test.go	add testcase in generic_error_test.go	2017-04-18 08:56:02 +08:00
init_linux.go	Disable rootless mode except RootlessCgMgr when executed as the root in userns	2018-09-07 15:05:03 +09:00
message_linux.go	Disable rootless mode except RootlessCgMgr when executed as the root in userns	2018-09-07 15:05:03 +09:00
network_linux.go	Remove unused veth setup code	2018-08-24 15:41:52 -07:00
notify_linux.go	Fix flaky test TestNotifyOnOOM	2017-08-14 15:18:59 +08:00
notify_linux_test.go	Some fixes for testMemoryNotification	2017-08-14 15:28:03 +08:00
process.go	Fix race in runc exec	2018-06-01 16:25:58 -07:00
process_linux.go	Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr	2018-10-15 17:32:15 -07:00
restored_process.go	libcontainer: Replace GetProcessStartTime with Stat_t.StartTime	2017-06-20 16:26:55 -07:00
rootfs_linux.go	Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot	2018-09-04 09:26:21 -07:00
rootfs_linux_test.go	linux: drop check for /proc as invalid dest	2018-08-30 09:56:18 +02:00
setns_init_linux.go	keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)	2018-09-17 21:38:30 +10:00
standard_init_linux.go	keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)	2018-09-17 21:38:30 +10:00
state_linux.go	libcontainer: expose annotations in hooks	2018-01-11 16:54:01 +01:00
state_linux_test.go	libcontainer/state_linux_test: Add a testTransitions helper	2018-01-25 11:18:45 -08:00
stats.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
stats_linux.go	libcontainer: add support for Intel RDT/CAT in runc	2017-09-01 14:26:33 +08:00
sync.go	Add separate console socket	2017-03-16 10:23:59 -07:00

README.md

libcontainer

Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container, performing additional operations after the container is created.

Container

A container is a self-contained execution environment that shares the kernel of the host system and which is (optionally) isolated from other containers in the system.

Using libcontainer

Because containers are spawned in a two-step process, you will need a binary that will be executed as the init process for the container. In libcontainer, we use the current binary (/proc/self/exe) as the init process and pass it the argument "init". We call this first step "bootstrap", so you always need an "init" function as the entry point of "bootstrap".

In addition to the Go init function, the early-stage bootstrap is handled by importing nsenter.

import (
	"os"
	"runtime"

	"github.com/opencontainers/runc/libcontainer"
	_ "github.com/opencontainers/runc/libcontainer/nsenter"
	"github.com/sirupsen/logrus"
)

func init() {
	if len(os.Args) > 1 && os.Args[1] == "init" {
		runtime.GOMAXPROCS(1)
		runtime.LockOSThread()
		factory, _ := libcontainer.New("")
		if err := factory.StartInitialization(); err != nil {
			logrus.Fatal(err)
		}
		panic("--this line should have never been executed, congratulations--")
	}
}
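The re-exec idea behind the two-step spawn described above can be sketched with the standard library alone. This is only an illustration of the pattern, not libcontainer's bootstrap (which also involves nsenter): `isInitInvocation` and `bootstrapCommand` are hypothetical names, and /proc/self/exe is Linux-specific.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// isInitInvocation reports whether the binary was started as the
// second-stage process, i.e. with "init" as its first argument.
func isInitInvocation(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

// bootstrapCommand builds a command that re-executes the current binary
// (via /proc/self/exe) with the "init" argument and inherits stdio.
func bootstrapCommand() *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.Stdin = os.Stdin
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd
}

func main() {
	if isInitInvocation(os.Args) {
		// Second stage: a container runtime would finish setting up the
		// container environment here before running the user's process.
		fmt.Println("running as init, pid", os.Getpid())
		return
	}
	// First stage: spawn ourselves again as the init process and wait.
	if err := bootstrapCommand().Run(); err != nil {
		fmt.Fprintln(os.Stderr, "bootstrap failed:", err)
		os.Exit(1)
	}
}
```

Using the same binary for both stages means no separate helper executable has to be installed alongside the program.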

Then to create a container you first have to initialize an instance of a factory that will handle the creation and initialization for a container.

factory, err := libcontainer.New("/var/lib/container", libcontainer.Cgroupfs, libcontainer.InitArgs(os.Args[0], "init"))
if err != nil {
	logrus.Fatal(err)
	return
}

Once you have created an instance of the factory, you can create a configuration struct describing how the container is to be created. A sample would look similar to this:

defaultMountFlags := unix.MS_NOEXEC | unix.MS_NOSUID | unix.MS_NODEV
config := &configs.Config{
	Rootfs: "/your/path/to/rootfs",
	Capabilities: &configs.Capabilities{
                Bounding: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Effective: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Inheritable: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Permitted: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
                Ambient: []string{
                        "CAP_CHOWN",
                        "CAP_DAC_OVERRIDE",
                        "CAP_FSETID",
                        "CAP_FOWNER",
                        "CAP_MKNOD",
                        "CAP_NET_RAW",
                        "CAP_SETGID",
                        "CAP_SETUID",
                        "CAP_SETFCAP",
                        "CAP_SETPCAP",
                        "CAP_NET_BIND_SERVICE",
                        "CAP_SYS_CHROOT",
                        "CAP_KILL",
                        "CAP_AUDIT_WRITE",
                },
        },
	Namespaces: configs.Namespaces([]configs.Namespace{
		{Type: configs.NEWNS},
		{Type: configs.NEWUTS},
		{Type: configs.NEWIPC},
		{Type: configs.NEWPID},
		{Type: configs.NEWUSER},
		{Type: configs.NEWNET},
	}),
	Cgroups: &configs.Cgroup{
		Name:   "test-container",
		Parent: "system",
		Resources: &configs.Resources{
			MemorySwappiness: nil,
			AllowAllDevices:  nil,
			AllowedDevices:   configs.DefaultAllowedDevices,
		},
	},
	MaskPaths: []string{
		"/proc/kcore",
		"/sys/firmware",
	},
	ReadonlyPaths: []string{
		"/proc/sys", "/proc/sysrq-trigger", "/proc/irq", "/proc/bus",
	},
	Devices:  configs.DefaultAutoCreatedDevices,
	Hostname: "testing",
	Mounts: []*configs.Mount{
		{
			Source:      "proc",
			Destination: "/proc",
			Device:      "proc",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "tmpfs",
			Destination: "/dev",
			Device:      "tmpfs",
			Flags:       unix.MS_NOSUID | unix.MS_STRICTATIME,
			Data:        "mode=755",
		},
		{
			Source:      "devpts",
			Destination: "/dev/pts",
			Device:      "devpts",
			Flags:       unix.MS_NOSUID | unix.MS_NOEXEC,
			Data:        "newinstance,ptmxmode=0666,mode=0620,gid=5",
		},
		{
			Device:      "tmpfs",
			Source:      "shm",
			Destination: "/dev/shm",
			Data:        "mode=1777,size=65536k",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "mqueue",
			Destination: "/dev/mqueue",
			Device:      "mqueue",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "sysfs",
			Destination: "/sys",
			Device:      "sysfs",
			Flags:       defaultMountFlags | unix.MS_RDONLY,
		},
	},
	UidMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID:      1000,
			Size:        65536,
		},
	},
	GidMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID:      1000,
			Size:        65536,
		},
	},
	Networks: []*configs.Network{
		{
			Type:    "loopback",
			Address: "127.0.0.1/0",
			Gateway: "localhost",
		},
	},
	Rlimits: []configs.Rlimit{
		{
			Type: unix.RLIMIT_NOFILE,
			Hard: uint64(1025),
			Soft: uint64(1025),
		},
	},
}
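The UidMappings and GidMappings entries in the sample above each describe a contiguous range: IDs starting at ContainerID inside the container correspond to IDs starting at HostID on the host, for Size consecutive IDs. A minimal sketch of that translation follows; `idMap` is a hypothetical local type that mirrors only the three field names from the sample (it is not configs.IDMap), and `hostID` is a hypothetical helper.

```go
package main

import "fmt"

// idMap is a hypothetical stand-in carrying the same three fields that the
// sample configuration sets on each mapping entry.
type idMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// hostID translates an ID as seen inside the container into the matching
// host ID. The second result is false when no mapping covers the ID.
func hostID(maps []idMap, containerID int) (int, bool) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	// Same values as the sample: container IDs 0..65535 map to host IDs
	// 1000..66535.
	maps := []idMap{{ContainerID: 0, HostID: 1000, Size: 65536}}
	for _, id := range []int{0, 33, 65535, 65536} {
		if h, ok := hostID(maps, id); ok {
			fmt.Printf("container id %d -> host id %d\n", id, h)
		} else {
			fmt.Printf("container id %d is unmapped\n", id)
		}
	}
}
```

With this mapping, root (ID 0) inside the container is the unprivileged ID 1000 on the host.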

Once you have the configuration populated you can create a container:

container, err := factory.Create("container-id", config)
if err != nil {
	logrus.Fatal(err)
	return
}

To spawn bash as the initial process inside the container and have the process's pid returned in order to wait, signal, or kill the process:

process := &libcontainer.Process{
	Args:   []string{"/bin/bash"},
	Env:    []string{"PATH=/bin"},
	User:   "daemon",
	Stdin:  os.Stdin,
	Stdout: os.Stdout,
	Stderr: os.Stderr,
}

err := container.Run(process)
if err != nil {
	container.Destroy()
	logrus.Fatal(err)
	return
}

// wait for the process to finish.
// err was already declared above, so assign with = rather than :=.
_, err = process.Wait()
if err != nil {
	logrus.Fatal(err)
}

// destroy the container.
container.Destroy()

Additional ways to interact with a running container are:

// return all the pids for all processes running inside the container.
processes, err := container.Processes()

// get detailed cpu, memory, io, and network statistics for the container and
// its processes.
stats, err := container.Stats()

// pause all processes inside the container.
container.Pause()

// resume all paused processes.
container.Resume()

// send signal to container's init process.
container.Signal(signal)

// update container resource constraints.
container.Set(config)

// get current status of the container.
status, err := container.Status()

// get current container's state information.
state, err := container.State()

Checkpoint & Restore

libcontainer now integrates CRIU for checkpointing and restoring containers. This lets you save the state of a process running inside a container to disk, and then restore that state into a new process, on the same machine or on another machine.

criu version 1.5.2 or higher is required to use checkpoint and restore. If you don't already have criu installed, you can build it from source, following the online instructions. criu is also installed in the docker image generated when building libcontainer with docker.
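A caller may want to verify the minimum version stated above before attempting a checkpoint. The following is a minimal sketch of such a comparison, assuming the installed version is already available as a "major.minor" or "major.minor.patch" string; how that string is obtained is left out, and `parseVersion` and `atLeast` are hypothetical helpers rather than part of libcontainer.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "major.minor" or "major.minor.patch" into three
// integers; a missing patch component is treated as 0.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimSpace(v), ".")
	if len(parts) < 2 || len(parts) > 3 {
		return out, fmt.Errorf("unexpected version format %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid component %q in version %q", p, v)
		}
		out[i] = n
	}
	return out, nil
}

// atLeast reports whether version v is greater than or equal to min,
// comparing major, then minor, then patch.
func atLeast(v, min [3]int) bool {
	for i := range v {
		if v[i] != min[i] {
			return v[i] > min[i]
		}
	}
	return true
}

func main() {
	required, _ := parseVersion("1.5.2") // minimum stated in this README
	for _, s := range []string{"1.5.1", "1.5.2", "3.3"} {
		v, err := parseVersion(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("criu %s sufficient: %v\n", s, atLeast(v, required))
	}
}
```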
