runc/nsenter at 16612d74de5f84977e50a9c8ead7f0e9e13b8628 - runc


Aleksa Sarai 16612d74de	2019-03-01 23:29:08 +11:00

nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying

The usage of memfd_create(2) and other copying techniques is quite wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR. memfd_create(2) added ~10M of memory usage to the cgroup associated with the container, which can result in some setups getting OOM'd (or just hogging the hosts' memory when you have lots of created-but-not-started containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of the binary, opening that read-only bindmount, and then umounting it to ensure that the host won't accidentally be re-mounted read-write. This avoids all copying and cleans up naturally like the other techniques used. Unfortunately, like the O_TMPFILE fallback, this requires being able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting over the most obvious path -- /proc/self/exe -- is a very bad idea).

Unfortunately detecting this isn't fool-proof -- on a system with a read-only root filesystem (that might become read-write during "runc init" execution), we cannot tell whether we have already done an ro remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY environment variable which is checked alongside the protection being present.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
README.md	Update outdated nsenter README content	2018-08-07 17:53:56 +02:00
cloned_binary.c	nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying	2019-03-01 23:29:08 +11:00
namespace.h	nsenter: guarantee correct user namespace ordering	2016-10-04 16:17:55 +11:00
nsenter.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
nsenter_gccgo.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
nsenter_test.go	Move libcontainer to x/sys/unix	2017-05-22 17:35:20 -05:00
nsenter_unsupported.go	Move libcontainer into subdirectory	2015-06-21 19:29:15 -07:00
nsexec.c	nsenter: clone /proc/self/exe to avoid exposing host binary to container	2019-02-08 18:57:59 +11:00

README.md

nsenter

The nsenter package registers a special init constructor that is called before the Go runtime has a chance to boot. This lets us call setns(2) on existing namespaces and avoid the issues that the Go runtime has with multiple threads. The constructor is called whenever this package is imported into your Go application.

The nsenter package imports "C" and therefore uses cgo. In cgo, if the import of "C" is immediately preceded by a comment, that comment, called the preamble, is used as a header when compiling the C parts of the package. The preamble of package nsenter is what arranges for the C function nsexec() to be called, so importing the package is enough to make that call happen at startup. Within runc, package nsenter is imported only in init.go, so the C code runs every time the runc init command is invoked.

Because nsexec() must run before the Go runtime in order to use Linux kernel namespaces, you must import this library into a package if you plan to use libcontainer directly. Otherwise Go will not execute the nsexec() constructor, and the re-exec will not cause the namespaces to be joined. You can import it like this:

import _ "github.com/opencontainers/runc/libcontainer/nsenter"

nsexec() first gets the file descriptor number for the init pipe from the environment variable _LIBCONTAINER_INITPIPE (the pipe was opened by the parent and kept open across the fork-exec of the nsexec() init process). The init pipe is used to read bootstrap data (namespace paths, clone flags, uid and gid mappings, and the console path) from the parent process. nsexec() then calls setns(2) to join the namespaces provided in the bootstrap data (if available), uses clone(2) to create a child process with the provided clone flags, updates the user and group ID mappings, does some further miscellaneous setup steps, and sends the PID of the child process to the parent of the nsexec() "caller". Finally, the parent nsexec() exits and the child nsexec() process returns to allow the Go runtime to take over.

NOTE: We do both setns(2) and clone(2) even if we don't have any CLONE_NEW* clone flags because we must fork a new process in order to enter the PID namespace.