runc/libcontainer
Justin Cormack e18de63108 If possible, apply seccomp rules immediately before exec
See https://github.com/docker/docker/issues/22252

Previously we would apply seccomp rules before applying
capabilities, because it requires CAP_SYS_ADMIN. This
however means that a seccomp profile needs to allow
operations such as setcap() and setuid() which you
might reasonably want to disallow.

If prctl(PR_SET_NO_NEW_PRIVS) has been applied however
setting a seccomp filter is an unprivileged operation.
Therefore if this has been set, apply the seccomp
filter as late as possible, after capabilities have
been dropped and the uid set.

Note a small number of syscalls will take place
after the filter is applied, such as `futex`,
`stat` and `execve`, so these still need to be allowed
in addition to any the program itself needs.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2016-04-27 20:06:14 +01:00
apparmor Adding error conditions when apparmor disabled 2015-11-22 13:14:18 +05:30
cgroups Cgroup: reduce redundant parsing of mountinfo 2016-04-22 09:41:28 +09:00
configs Merge pull request #679 from rajasec/selinux-errorcheck 2016-04-24 16:24:26 +00:00
criurpc libcontainer: update criurpc.proto 2016-02-19 02:38:02 +03:00
devices Windows: Tidy libcontainer\devices 2015-10-23 13:50:24 -07:00
integration Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
keys Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
label Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
nsenter nsexec: fix build against musl libc 2016-04-19 10:58:17 +02:00
seccomp Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
selinux Selinux: reduce redundant parsing of mountinfo 2016-04-22 09:41:28 +09:00
specconv Merge pull request #777 from cyphar/fix-null-pointer-deref 2016-04-24 19:09:30 -07:00
stacktrace Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
system Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
user libcontainer: user: general cleanups 2016-03-31 07:44:16 +11:00
utils Add unit tests for 'utils' package 2016-04-12 13:29:37 +01:00
xattr Fixing xattr test step issue 2015-11-29 09:24:42 +05:30
README.md Updating README with container signal interaction 2016-04-05 19:41:27 +05:30
SPEC.md Typo in SPEC.md 2016-04-15 14:57:14 +05:30
capabilities_linux.go Update github.com/syndtr/gocapability/capability to 2c00daeb6c3b45114c80ac44119e7b8801fdd852 2015-09-24 18:44:01 -04:00
compat_1.5_linux.go Don't set /proc/<PID>/setgroups to deny in Go1.5 2015-08-03 14:59:15 -04:00
console.go Move libcontainer into subdirectory 2015-06-21 19:29:15 -07:00
console_freebsd.go Export console New func 2015-12-09 11:59:10 -08:00
console_linux.go Export console New func 2015-12-09 11:59:10 -08:00
console_solaris.go Get runc to build clean on Solaris 2016-04-12 16:13:08 -07:00
console_windows.go Export console New func 2015-12-09 11:59:10 -08:00
container.go Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
container_linux.go Merge pull request #758 from rajasec/container-pause-comment 2016-04-19 16:16:41 -07:00
container_linux_test.go Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
container_solaris.go Get runc to build clean on Solaris 2016-04-12 16:13:08 -07:00
container_windows.go Windows: Refactor Container interface 2015-11-02 15:12:16 -08:00
criu_opts_unix.go Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
criu_opts_windows.go Windows: Factor down criu_opts 2015-10-23 12:58:59 -07:00
error.go Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
error_test.go Move libcontainer into subdirectory 2015-06-21 19:29:15 -07:00
factory.go Update import paths for new repository 2015-06-21 19:29:59 -07:00
factory_linux.go Show proper error from init process panic 2016-03-22 15:57:15 -07:00
factory_linux_test.go Serialize CommandHooks to state 2016-03-03 16:57:51 +00:00
generic_error.go Add cause to error messages 2016-04-18 11:37:26 -07:00
generic_error_test.go Move libcontainer into subdirectory 2015-06-21 19:29:15 -07:00
init_linux.go Set rlimits using prlimit in parent 2016-03-25 15:11:44 +00:00
message_linux.go Fix trivial style errors reported by `go vet` and `golint` 2016-04-12 08:13:16 +00:00
network_linux.go libcontainer: network_linux.go: fix go vet 2015-11-30 12:31:18 +01:00
notify_linux.go libcontainer: Add support for memcg pressure notifications 2015-12-28 13:36:55 -05:00
notify_linux_test.go libcontainer: Add support for memcg pressure notifications 2015-12-28 13:36:55 -05:00
process.go Update libcontainer to support rlimit per process 2016-03-10 14:35:16 -08:00
process_linux.go Add cause to error messages 2016-04-18 11:37:26 -07:00
restored_process.go Add signal API to Container interface 2015-08-03 17:07:29 -07:00
rootfs_linux.go Rootfs: reduce redundant parsing of mountinfo 2016-04-22 09:41:28 +09:00
rootfs_linux_test.go Fix setupDev logic in rootfs_linux.go 2016-04-11 10:29:40 -07:00
setgroups_linux.go Don't set /proc/<PID>/setgroups to deny in Go1.5 2015-08-03 14:59:15 -04:00
setns_init_linux.go Set rlimits using prlimit in parent 2016-03-25 15:11:44 +00:00
standard_init_linux.go If possible, apply seccomp rules immediately before exec 2016-04-27 20:06:14 +01:00
state_linux.go HookState adhears to OCI 2016-04-06 16:57:59 +01:00
state_linux_test.go Remove the nullState 2016-01-25 00:26:11 -08:00
stats.go Move libcontainer into subdirectory 2015-06-21 19:29:15 -07:00
stats_freebsd.go Move libcontainer into subdirectory 2015-06-21 19:29:15 -07:00
stats_linux.go Update import paths for new repository 2015-06-21 19:29:59 -07:00
stats_solaris.go Get runc to build clean on Solaris 2016-04-12 16:13:08 -07:00
stats_windows.go Move libcontainer into subdirectory 2015-06-21 19:29:15 -07:00

README.md

Libcontainer provides a native Go implementation for creating containers with namespaces, cgroups, capabilities, and filesystem access controls. It allows you to manage the lifecycle of the container and to perform additional operations after the container is created.

Container

A container is a self-contained execution environment that shares the kernel of the host system and is (optionally) isolated from other containers in the system.

Using libcontainer

Because containers are spawned in a two-step process, you will need a binary that will be executed as the init process for the container. In libcontainer, we use the current binary (/proc/self/exe) as the init process and pass it the argument "init". We call this first step the "bootstrap" process, so you always need an "init" function as the entry point of "bootstrap".

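func init() {
	// When re-executed with "init" as the first argument, this process acts
	// as the container's init process instead of running the normal program.
	if len(os.Args) > 1 && os.Args[1] == "init" {
		runtime.GOMAXPROCS(1)
		runtime.LockOSThread()
		factory, _ := libcontainer.New("")
		if err := factory.StartInitialization(); err != nil {
			logrus.Fatal(err)
		}
		// On success the container process replaces this one, so control
		// should never reach this point.
		panic("--this line should have never been executed, congratulations--")
	}
}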
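The re-exec pattern described above can be sketched with the standard library alone. This is only an illustration of the idea, not libcontainer's implementation: the helper names `isInitInvocation` and `bootstrapCommand` are hypothetical, and the real init path also has to pass configuration to the child, which is omitted here.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// isInitInvocation reports whether the argument list asks this binary to act
// as the container's init process. (Hypothetical helper for illustration.)
func isInitInvocation(args []string) bool {
	return len(args) > 1 && args[1] == "init"
}

// bootstrapCommand builds, but does not start, a command that re-executes the
// current binary through /proc/self/exe with "init" as its first argument.
// (Hypothetical helper; /proc/self/exe is Linux-specific.)
func bootstrapCommand() *exec.Cmd {
	cmd := exec.Command("/proc/self/exe", "init")
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	return cmd
}

func main() {
	if isInitInvocation(os.Args) {
		fmt.Println("running as init: container setup would happen here")
		return
	}
	cmd := bootstrapCommand()
	fmt.Println("would re-exec:", cmd.Path, cmd.Args[1:])
}
```

The same binary therefore serves both roles, and the "init" argument is the only thing that distinguishes the bootstrap invocation from a normal one.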

Then to create a container you first have to initialize an instance of a factory that will handle the creation and initialization for a container.

factory, err := libcontainer.New("/var/lib/container", libcontainer.Cgroupfs, libcontainer.InitArgs(os.Args[0], "init"))
if err != nil {
	logrus.Fatal(err)
	return
}
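The extra arguments passed to `libcontainer.New` are option functions that adjust the factory before it is returned. The following is a minimal sketch of that functional-options pattern using hypothetical stand-in names (`sketchFactory`, `newFactory`, `withCgroupfs`, `withInitArgs`). It is not the library's code and only illustrates how such options compose.

```go
package main

import (
	"errors"
	"fmt"
)

// sketchFactory is a hypothetical stand-in for a container factory.
type sketchFactory struct {
	root          string
	cgroupManager string
	initArgs      []string
}

// option mutates a factory under construction and may reject bad input.
type option func(*sketchFactory) error

// withCgroupfs selects a cgroupfs-style manager (illustrative only).
func withCgroupfs(f *sketchFactory) error {
	f.cgroupManager = "cgroupfs"
	return nil
}

// withInitArgs records the command line used to re-execute the init process.
func withInitArgs(args ...string) option {
	return func(f *sketchFactory) error {
		if len(args) == 0 {
			return errors.New("init args must not be empty")
		}
		f.initArgs = args
		return nil
	}
}

// newFactory applies each option in order and stops at the first error.
func newFactory(root string, opts ...option) (*sketchFactory, error) {
	f := &sketchFactory{root: root}
	for _, opt := range opts {
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := newFactory("/var/lib/container", withCgroupfs, withInitArgs("/proc/self/exe", "init"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(f.root, f.cgroupManager, f.initArgs)
}
```

The appeal of this design is that new settings can be added as new option functions without changing the constructor's signature.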

Once you have created an instance of the factory, you can create a configuration struct describing how the container is to be created. A sample would look similar to this:

defaultMountFlags := syscall.MS_NOEXEC | syscall.MS_NOSUID | syscall.MS_NODEV
config := &configs.Config{
	Rootfs: "/your/path/to/rootfs",
	Capabilities: []string{
		"CAP_CHOWN",
		"CAP_DAC_OVERRIDE",
		"CAP_FSETID",
		"CAP_FOWNER",
		"CAP_MKNOD",
		"CAP_NET_RAW",
		"CAP_SETGID",
		"CAP_SETUID",
		"CAP_SETFCAP",
		"CAP_SETPCAP",
		"CAP_NET_BIND_SERVICE",
		"CAP_SYS_CHROOT",
		"CAP_KILL",
		"CAP_AUDIT_WRITE",
	},
	Namespaces: configs.Namespaces([]configs.Namespace{
		{Type: configs.NEWNS},
		{Type: configs.NEWUTS},
		{Type: configs.NEWIPC},
		{Type: configs.NEWPID},
		{Type: configs.NEWUSER},
		{Type: configs.NEWNET},
	}),
	Cgroups: &configs.Cgroup{
		Name:   "test-container",
		Parent: "system",
		Resources: &configs.Resources{
			MemorySwappiness: nil,
			AllowAllDevices:  false,
			AllowedDevices:   configs.DefaultAllowedDevices,
		},
	},
	MaskPaths: []string{
		"/proc/kcore",
	},
	ReadonlyPaths: []string{
		"/proc/sys", "/proc/sysrq-trigger", "/proc/irq", "/proc/bus",
	},
	Devices:  configs.DefaultAutoCreatedDevices,
	Hostname: "testing",
	Mounts: []*configs.Mount{
		{
			Source:      "proc",
			Destination: "/proc",
			Device:      "proc",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "tmpfs",
			Destination: "/dev",
			Device:      "tmpfs",
			Flags:       syscall.MS_NOSUID | syscall.MS_STRICTATIME,
			Data:        "mode=755",
		},
		{
			Source:      "devpts",
			Destination: "/dev/pts",
			Device:      "devpts",
			Flags:       syscall.MS_NOSUID | syscall.MS_NOEXEC,
			Data:        "newinstance,ptmxmode=0666,mode=0620,gid=5",
		},
		{
			Device:      "tmpfs",
			Source:      "shm",
			Destination: "/dev/shm",
			Data:        "mode=1777,size=65536k",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "mqueue",
			Destination: "/dev/mqueue",
			Device:      "mqueue",
			Flags:       defaultMountFlags,
		},
		{
			Source:      "sysfs",
			Destination: "/sys",
			Device:      "sysfs",
			Flags:       defaultMountFlags | syscall.MS_RDONLY,
		},
	},
	UidMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID:      1000,
			Size:        65536,
		},
	},
	GidMappings: []configs.IDMap{
		{
			ContainerID: 0,
			HostID:      1000,
			Size:        65536,
		},
	},
	Networks: []*configs.Network{
		{
			Type:    "loopback",
			Address: "127.0.0.1/0",
			Gateway: "localhost",
		},
	},
	Rlimits: []configs.Rlimit{
		{
			Type: syscall.RLIMIT_NOFILE,
			Hard: uint64(1025),
			Soft: uint64(1025),
		},
	},
}
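The `UidMappings` and `GidMappings` entries above each describe a contiguous range: IDs starting at `ContainerID` inside the container correspond to IDs starting at `HostID` on the host, for `Size` consecutive IDs. As a rough sketch of that arithmetic, using a local `idMap` type that mirrors the three fields shown (the `hostID` helper is hypothetical, not a libcontainer function):

```go
package main

import "fmt"

// idMap mirrors the three fields used in the sample configuration.
type idMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// hostID translates an ID as seen inside the container to the matching host
// ID. It returns false when no mapping covers the given ID.
func hostID(mappings []idMap, containerID int) (int, bool) {
	for _, m := range mappings {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	mappings := []idMap{{ContainerID: 0, HostID: 1000, Size: 65536}}
	for _, id := range []int{0, 5, 65535, 65536} {
		host, ok := hostID(mappings, id)
		fmt.Println(id, host, ok)
	}
}
```

With the sample mapping, root (ID 0) inside the container corresponds to ID 1000 on the host, and container IDs at or beyond 65536 are unmapped.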

Once you have the configuration populated you can create a container:

container, err := factory.Create("container-id", config)
if err != nil {
	logrus.Fatal(err)
	return
}

To spawn bash as the initial process inside the container and have the process's pid returned in order to wait, signal, or kill the process:

process := &libcontainer.Process{
	Args:   []string{"/bin/bash"},
	Env:    []string{"PATH=/bin"},
	User:   "daemon",
	Stdin:  os.Stdin,
	Stdout: os.Stdout,
	Stderr: os.Stderr,
}

err = container.Start(process)
if err != nil {
	// logrus.Fatal exits the program, so destroy the container first.
	container.Destroy()
	logrus.Fatal(err)
}

// wait for the process to finish.
_, err = process.Wait()
if err != nil {
	logrus.Fatal(err)
}

// destroy the container.
container.Destroy()
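Because `logrus.Fatal` terminates the program, any cleanup has to happen before it is called. One way to keep the start, wait, destroy sequence tidy is to put it in a function that returns an error and uses `defer` for cleanup. The sketch below uses a hypothetical `lifecycle` interface and a fake implementation in place of the real container and process types, so it shows only the control flow.

```go
package main

import (
	"errors"
	"fmt"
)

// lifecycle is a hypothetical stand-in for the container operations used above.
type lifecycle interface {
	Start() error
	Wait() error
	Destroy() error
}

// fakeContainer records calls so the control flow can be observed.
type fakeContainer struct {
	failStart bool
	calls     []string
}

func (c *fakeContainer) Start() error {
	c.calls = append(c.calls, "start")
	if c.failStart {
		return errors.New("start failed")
	}
	return nil
}

func (c *fakeContainer) Wait() error    { c.calls = append(c.calls, "wait"); return nil }
func (c *fakeContainer) Destroy() error { c.calls = append(c.calls, "destroy"); return nil }

// runToCompletion starts the container, waits for it, and always destroys it,
// whether or not an earlier step failed.
func runToCompletion(c lifecycle) (err error) {
	defer func() {
		// Keep the first error; report a Destroy error only if nothing else failed.
		if derr := c.Destroy(); err == nil {
			err = derr
		}
	}()
	if err = c.Start(); err != nil {
		return err
	}
	return c.Wait()
}

func main() {
	c := &fakeContainer{failStart: true}
	err := runToCompletion(c)
	fmt.Println(c.calls, err)
}
```

The caller can then log the returned error and exit once, after cleanup has already run.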

Additional ways to interact with a running container are:

// return all the pids for all processes running inside the container.
processes, err := container.Processes()

// get detailed cpu, memory, io, and network statistics for the container and
// its processes.
stats, err := container.Stats()

// pause all processes inside the container.
container.Pause()

// resume all paused processes.
container.Resume()

// send signal to container's init process.
container.Signal(signal)

Checkpoint & Restore

libcontainer now integrates CRIU for checkpointing and restoring containers. This lets you save the state of a process running inside a container to disk, and then restore that state into a new process, on the same machine or on another machine.

criu version 1.5.2 or higher is required to use checkpoint and restore. If you don't already have criu installed, you can build it from source, following the online instructions. criu is also installed in the docker image generated when building libcontainer with docker.
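A caller may want to verify the installed criu version before attempting a checkpoint. The following sketch compares dotted version strings numerically. `parseVersion` and `versionAtLeast` are hypothetical helpers, and how the version string is obtained from criu is deliberately left out.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "1.5.2" into integers.
func parseVersion(v string) ([]int, error) {
	fields := strings.Split(strings.TrimSpace(v), ".")
	nums := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil {
			return nil, fmt.Errorf("invalid version %q: %v", v, err)
		}
		nums[i] = n
	}
	return nums, nil
}

// versionAtLeast reports whether have >= want, comparing component by
// component and treating missing components as zero.
func versionAtLeast(have, want string) (bool, error) {
	h, err := parseVersion(have)
	if err != nil {
		return false, err
	}
	w, err := parseVersion(want)
	if err != nil {
		return false, err
	}
	for i := 0; i < len(h) || i < len(w); i++ {
		var a, b int
		if i < len(h) {
			a = h[i]
		}
		if i < len(w) {
			b = w[i]
		}
		if a != b {
			return a > b, nil
		}
	}
	return true, nil
}

func main() {
	for _, v := range []string{"1.5.1", "1.5.2", "1.6", "2.0"} {
		ok, err := versionAtLeast(v, "1.5.2")
		fmt.Println(v, ok, err)
	}
}
```

Comparing components as integers rather than as strings matters once a component reaches two digits, since "1.10" is newer than "1.5.2".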

Code and documentation copyright 2014 Docker, Inc. Code released under the Apache 2.0 license. Docs released under Creative Commons.