jasder/runc - runc - 军科开源项目托管

Commit Graph

Author	SHA1	Message	Date
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the forseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Aleksa Sarai	76aeaf8181	libcontainer: init: fix unmapped console fchown If the stdio of the container is owned by a group which is not mapped in the user namespace, attempting to fchown the file descriptor will result in EINVAL. Counteract this by simply not doing an fchown if the group owner of the file descriptor has no host mapping according to the configured GIDMappings. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:46:21 +11:00
Aleksa Sarai	d2f49696b0	runc: add support for rootless containers This enables the support for the rootless container mode. There are many restrictions on what rootless containers can do, so many different runC commands have been disabled: * runc checkpoint * runc events * runc pause * runc ps * runc restore * runc resume * runc update The following commands work: * runc create * runc delete * runc exec * runc kill * runc list * runc run * runc spec * runc state In addition, any specification options that imply joining cgroups have also been disabled. This is due to support for unprivileged subtree management not being available from Linux upstream. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:45:24 +11:00
Aleksa Sarai	6bd4bd9030	*: handle unprivileged operations and !dumpable Effectively, !dumpable makes implementing rootless containers quite hard, due to a bunch of different operations on /proc/self no longer being possible without reordering everything. !dumpable only really makes sense when you are switching between different security contexts, which is only the case when we are joining namespaces. Unfortunately this means that !dumpable will still have issues in this instance, and it should only be necessary to set !dumpable if we are not joining USER namespaces (new kernels have protections that make !dumpable no longer necessary). But that's a topic for another time. This also includes code to unset and then re-set dumpable when doing the USER namespace mappings. This should also be safe because in principle processes in a container can't see us until after we fork into the PID namespace (which happens after the user mapping). In rootless containers, it is not possible to set a non-dumpable process's /proc/self/oom_score_adj (it's owned by root and thus not writeable). Thus, it needs to be set inside nsexec before we set ourselves as non-dumpable. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:45:19 +11:00
Daniel Dao	09c72cea69	fix panic regression when config doesnt have caps When process config doesnt specify capabilities anywhere, we should not panic because setting capabilities are optional. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-03-21 00:45:26 +00:00
Michael Crosby	00a0ecf554	Add separate console socket Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-03-16 10:23:59 -07:00
Mrunal Patel	4f9cb13b64	Update runtime spec to 1.0.0.rc5 Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2017-03-15 11:38:37 -07:00
Qiang Huang	b7932a2e07	Remove unused ExecFifoPath In container process's Init function, we use fd + execFifoFilename to open exec fifo, so this field in init config is never used. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-03-09 10:58:16 +08:00
Qiang Huang	7350cd8640	Merge pull request #1285 from stevenh/signal-wait Only wait for processes after delivering SIGKILL in signalAllProcesses	2017-02-06 16:41:24 +08:00
Aleksa Sarai	e034cedce7	libcontainer: init: only pass stateDirFd when creating a container If we pass a file descriptor to the host filesystem while joining a container, there is a race condition where a process inside the container can ptrace(2) the joining process and stop it from closing its file descriptor to the stateDirFd. Then the process can access the host filesystem from that file descriptor. This was fixed in part by `5d93fed3d2` ("Set init processes as non-dumpable"), but that fix is more of a hail-mary than an actual fix for the underlying issue. To fix this, don't open or pass the stateDirFd to the init process unless we're creating a new container. A proper fix for this would be to remove the need for even passing around directory file descriptors (which are quite dangerous in the context of mount namespaces). There is still an issue with containers that have CAP_SYS_PTRACE and are using the setns(2)-style of joining a container namespace. Currently I'm not really sure how to fix it without rampant layer violation. Fixes: CVE-2016-9962 Fixes: `5d93fed3d2` ("Set init processes as non-dumpable") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-02-02 00:41:11 +11:00
Steven Hartland	82d895fbb9	Conditionally wait for children after delivering signal When signaling children and the signal is SIGKILL wait for children otherwise conditionally wait for children which are ready to report. This reaps all children which exited due to the signal sent without blocking indefinitely. Also: * Ignore ignore ECHILD, which means the child has already gone. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-02-01 13:22:37 +00:00
Steven Hartland	27a5447ea4	Only wait for processes after delivering SIGKILL in signalAllProcesses signalAllProcesses was making the assumption that the requested signal was SIGKILL, possibly due to the signal parameter being added at a later date, and hence it was safe to wait for all processes which is not the case. BaseContainer.Signal(s os.Signal, all bool) exposes this functionality to consumers, so an arbitrary signal could be used which is not guaranteed to make the processes exit. Correct the documentation for signalAllProcesses around the signal delivered and update it so that the wait is only performed on SIGKILL hence making it safe to process other signals without risk of blocking forever, while still maintaining compatibility to SIGKILL callers. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-01-21 18:20:23 +00:00
Michael Crosby	5d93fed3d2	Set init processes as non-dumpable This sets the init processes that join and setup the container's namespaces as non-dumpable before they setns to the container's pid (or any other ) namespace. This settings is automatically reset to the default after the Exec in the container so that it does not change functionality for the applications that are running inside, just our init processes. This prevents parent processes, the pid 1 of the container, to ptrace the init process before it drops caps and other sets LSMs. This patch also ensures that the stateDirFD being used is still closed prior to exec, even though it is set as O_CLOEXEC, because of the order in the kernel. https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318 The order during the exec syscall is that the process is set back to dumpable before O_CLOEXEC are processed. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-01-11 09:56:56 -08:00
Qiang Huang	20f0ca7306	Fix typos Found by: https://goreportcard.com/report/github.com/opencontainers/runc#misspell Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-01-06 10:54:33 +08:00
Aleksa Sarai	7df64f8886	runc: implement --console-socket This allows for higher-level orchestrators to be able to have access to the master pty file descriptor without keeping the runC process running. This is key to having (detach && createTTY) with a _real_ pty created inside the container, which is then sent to a higher level orchestrator over an AF_UNIX socket. This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:49:36 +11:00
Mrunal Patel	f1324a9fc1	Don't label the console as it already has the right label [@cyphar: removed mountLabel argument from .mount().] Signed-off-by: Mrunal Patel <mrunalp@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:49:36 +11:00
Aleksa Sarai	c0c8edb9e8	console: don't chown(2) the slave PTY Since the gid=X and mode=Y flags can be set inside config.json as mount options, don't override them with our own defaults. This avoids /dev/pts/* not being owned by tty in a regular container, as well as all of the issues with us implementing grantpt(3) manually. This is the least opinionated approach to take. This patch is part of the console rewrite patchset. Reported-by: Mrunal Patel <mrunalp@gmail.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:49:36 +11:00
Aleksa Sarai	244c9fc426	*: console rewrite This implements {createTTY, detach} and all of the combinations and negations of the two that were previously implemented. There are some valid questions about out-of-OCI-scope topics like !createTTY and how things should be handled (why do we dup the current stdio to the process, and how is that not a security issue). However, these will be dealt with in a separate patchset. In order to allow for late console setup, split setupRootfs into the "preparation" section where all of the mounts are created and the "finalize" section where we pivot_root and set things as ro. In between the two we can set up all of the console mountpoints and symlinks we need. We use two-stage synchronisation to ensures that when the syscalls are reordered in a suboptimal way, an out-of-place read() on the parentPipe will not gobble the ancilliary information. This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:49:36 +11:00
Aleksa Sarai	4776b4326a	libcontainer: refactor syncT handling To make the code cleaner, and more clear, refactor the syncT handling used when creating the `runc init` process. In addition, document the state changes so that people actually understand what is going on. Rather than only using syncT for the standard initProcess, use it for both initProcess and setnsProcess. This removes some special cases, as well as allowing for the use of syncT with setnsProcess. Also remove a bunch of the boilerplate around syncT handling. This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:46:04 +11:00
Michael Crosby	e58671e530	Add --all flag to kill This allows a user to send a signal to all the processes in the container within a single atomic action to avoid new processes being forked off before the signal can be sent. This is basically taking functionality that we already use being `delete` and exposing it ok the `kill` command by adding a flag. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-11-08 09:35:02 -08:00
Haiyan Meng	def07036a0	Fix the err info of chdir(cwd) failure Signed-off-by: Haiyan Meng <haiyanalady@gmail.com>	2016-08-08 12:26:59 -04:00
Petar Petrov	f9b72b1b46	Allow additional groups to be overridden in exec Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com> Signed-off-by: Petar Petrov <pppepito86@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2016-06-21 10:35:11 +03:00
Michael Crosby	3aacff695d	Use fifo for create/start This removes the use of a signal handler and SIGCONT to signal the init process to exec the users process. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-06-13 11:26:53 -07:00
Michael Crosby	3fe7d7f31e	Add create and start command for container lifecycle Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:06:41 -07:00
Julian Friedman	e91b2b8aca	Set rlimits using prlimit in parent Fixes #680 This changes setupRlimit to use the Prlimit syscall (rather than Setrlimit) and moves the call to the parent process. This is necessary because Setrlimit would affect the libcontainer consumer if called in the parent, and would fail if called from the child if the child process is in a user namespace and the requested rlimit is higher than that in the parent. Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>	2016-03-25 15:11:44 +00:00
Michael Crosby	fdb100d247	Destroy container along with processes before stdio We need to make sure the container is destroyed before closing the stdio for the container. This becomes a big issues when running in the host's pid namespace because the other processes could have inherited the stdio of the initial process. The call to close will just block as they still have the io open. Calling destroy before closing io, especially in the host pid namespace will cause all additional processes to be killed in the container's cgroup. This will allow the io to be closed successfuly. This change makes sure the order for destroy and close is correct as well as ensuring that if any errors encoutered during start or exec will be handled by terminating the process and destroying the container. We cannot use defers here because we need to enforce the correct ordering on destroy. This also sets the subreaper setting for runc so that when running in pid host, runc can wait on the addiontal processes launched by the container, useful on destroy, but also good for reaping the additional processes that were launched. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-15 13:17:11 -07:00
Michael Crosby	20422c9bd9	Update libcontainer to support rlimit per process This updates runc and libcontainer to handle rlimits per process and set them correctly for the container. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-10 14:35:16 -08:00
Phil Estes	178bad5e71	Properly setuid/setgid after entering userns The re-work of namespace entering lost the setuid/setgid that was part of the Go-routine based process exec in the prior code. A side issue was found with setting oom_score_adj before execve() in a userns that is also solved here. Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)	2016-03-04 11:12:26 -05:00
Michael Crosby	3cc90bd2d8	Add support for process overrides of settings This commit adds support to libcontainer to allow caps, no new privs, apparmor, and selinux process label to the process struct so that it can be used together of override the base settings on the container config per individual process. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-03 11:41:33 -08:00
Daniel, Dao Quang Minh	42d5d04801	Sets custom namespaces for init processes An init process can join other namespaces (pidns, ipc etc.). This leverages C code defined in nsenter package to spawn a process with correct namespaces and clone if necessary. This moves all setns and cloneflags related code to nsenter layer, which mean that we dont use Go os/exec to create process with cloneflags and set uid/gid_map or setgroups anymore. The necessary data is passed from Go to C using a netlink binary-encoding format. With this change, setns and init processes are almost the same, which brings some opportunity for refactoring. Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com> [mickael.laventure@docker.com: adapted to apply on master @ d97d5e] Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>	2016-02-28 12:26:53 -08:00
Mrunal Patel	4951f5821b	Merge pull request #582 from stefanberger/new_session_keyring Create unique session key name for every container	2016-02-25 17:54:14 -08:00
Stefan Berger	5fbf791e31	Create unique session key name for every container Create a unique session key name for every container. Use the pattern _ses.<postfix> with postfix being the container's Id. This patch does not prevent containers from joining each other's session keyring. Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>	2016-02-24 08:39:52 -05:00
Mrunal Patel	2f27649848	Move pre-start hooks after container mounts Today mounts in pre-start hooks get overriden by the default mounts. Moving the pre-start hooks to after the container mounts and before the pivot/move root gives better flexiblity in the hooks. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2016-02-23 02:50:35 -08:00
Michael Crosby	ddcee3cc2a	Do not use stream encoders Marshall the raw objects for the sync pipes so that no new line chars are left behind in the pipe causing errors. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-01-26 11:22:05 -08:00
Michael Crosby	ed7be1d082	Only set cwd when not empty For existing consumers of libconatiner to not require cwd inside the libcontainer code. This can be done at the runc level and is already evaluated there. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-01-21 11:08:32 -08:00
Mrunal Patel	269a717555	Make cwd required Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2016-01-14 19:06:56 -05:00
Mrunal Patel	4c767d7046	Merge pull request #446 from cyphar/18-add-pids-controller cgroup: add PIDs cgroup controller support	2016-01-11 16:56:00 -08:00
Aleksa Sarai	103853ead7	libcontainer: set cgroup config late Due to the fact that the init is implemented in Go (which seemingly randomly spawns new processes and loves eating memory), most cgroup configurations are required to have an arbitrary minimum dictated by the init. This confuses users and makes configuration more annoying than it should. An example of this is pids.max, where Go spawns multiple processes that then cause init to violate the pids cgroup constraint before the container can even start. Solve this problem by setting the cgroup configurations as late as possible, to avoid hitting as many of the resources hogged by the Go init as possible. This has to be done before seccomp rules are applied, as the parent and child must synchronise in order for the parent to correctly set the configurations (and writes might be blocked by seccomp). Signed-off-by: Aleksa Sarai <asarai@suse.com>	2016-01-12 10:06:35 +11:00
Jimmi Dyson	91c7024e52	Revert to non-recursive GetPids, add recursive GetAllPids Signed-off-by: Jimmi Dyson <jimmidyson@gmail.com>	2016-01-08 19:42:25 +00:00
Mrunal Patel	4124ba9468	Revert "cgroups: add pids controller support" Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2015-12-19 07:48:48 -08:00
Aleksa Sarai	14ed8696c1	libcontainer: set cgroup config late Due to the fact that the init is implemented in Go (which seemingly randomly spawns new processes and loves eating memory), most cgroup configurations are required to have an arbitrary minimum dictated by the init. This confuses users and makes configuration more annoying than it should. An example of this is pids.max, where Go spawns multiple processes that then cause init to violate the pids cgroup constraint before the container can even start. Solve this problem by setting the cgroup configurations as late as possible, to avoid hitting as many of the resources hogged by the Go init as possible. This has to be done before seccomp rules are applied, as the parent and child must synchronise in order for the parent to correctly set the configurations (and writes might be blocked by seccomp). Signed-off-by: Aleksa Sarai <asarai@suse.com>	2015-12-19 11:30:48 +11:00
Michael Crosby	219b6c99e0	Ignore changing /dev/null permissions if used in STDIO Whenever dev/null is used as one of the main processes STDIO, do not try to change the permissions on it via fchown because we should not do it in the first place and also this will fail if the container is supposed to be readonly. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-09-22 15:32:31 -07:00
Michael Crosby	0dad64f7ad	Fix STDIO permissions when container user not root Fix the permissions of the container's main processes STDIO when the process is not run as the root user. This changes the permissions right before switching to the specified user so that it's STDIO matches it's UID and GID. Add a test for checking that the STDIO of the process is owned by the specified user. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-09-18 14:11:29 -07:00
Alexander Morozov	916bd6bd68	Use github.com/vishvananda/netlink for networking Signed-off-by: Alexander Morozov <lk4d4@docker.com>	2015-09-09 19:32:46 -07:00
Vishnu Kannan	cc232c4707	Adding oom_score_adj as a container config param. Signed-off-by: Vishnu Kannan <vishnuk@google.com>	2015-08-31 14:02:59 -07:00
Matthew Heon	2ae581ae62	Convert Seccomp support to use Libseccomp This removes the existing, native Go seccomp filter generation and replaces it with Libseccomp. Libseccomp is a C library which provides architecture independent generation of Seccomp filters for the Linux kernel. This adds a dependency on v2.2.1 or above of Libseccomp. Signed-off-by: Matthew Heon <mheon@redhat.com>	2015-08-13 07:56:27 -04:00
Jin-Hwan Jeong	491cfef259	typo: tempory -> temporary Signed-off-by: Jin-Hwan Jeong <jhjeong.kr@gmail.com>	2015-07-24 11:19:25 +09:00
Michael Crosby	080df7ab88	Update import paths for new repository Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-06-21 19:29:59 -07:00
Michael Crosby	8f97d39dd2	Move libcontainer into subdirectory Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-06-21 19:29:15 -07:00

49 Commits