jasder/runc - runc - 军科 open-source project hosting

Commit Graph

Author	SHA1	Message	Date
Aleksa Sarai	16612d74de	nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying The usage of memfd_create(2) and other copying techniques is quite wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR. memfd_create(2) added ~10M of memory usage to the cgroup associated with the container, which can result in some setups getting OOM'd (or just hogging the hosts' memory when you have lots of created-but-not-started containers sticking around). The easiest way of solving this is by creating a read-only bind-mount of the binary, opening that read-only bindmount, and then umounting it to ensure that the host won't accidentally be re-mounted read-write. This avoids all copying and cleans up naturally like the other techniques used. Unfortunately, like the O_TMPFILE fallback, this requires being able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting over the most obvious path -- /proc/self/exe -- is a very bad idea). Unfortunately detecting this isn't fool-proof -- on a system with a read-only root filesystem (that might become read-write during "runc init" execution), we cannot tell whether we have already done an ro remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY environment variable which is checked alongside the protection being present. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:08 +11:00
Aleksa Sarai	af9da0a450	nsenter: cloned_binary: use the runc statedir for O_TMPFILE Writing a file to tmpfs actually incurs a memcg penalty, and thus the benefit of being able to disable memfd_create(2) with _LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should be noted that quite a few distributions don't use tmpfs for /tmp (and instead have it as a regular directory or subvolume of the host filesystem). Since runc must have write access to the state directory anyway (and the state directory is usually not on a tmpfs) we can use that instead of /tmp -- avoiding potential memcg costs with no real downside. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:51 +11:00
Aleksa Sarai	2429d59352	nsenter: cloned_binary: expand and add pre-3.11 fallbacks In order to get around the memfd_create(2) requirement, `0a8e4117e7` ("nsenter: clone /proc/self/exe to avoid exposing host binary to container") added an O_TMPFILE fallback. However, this fallback was flawed in two ways: * It required O_TMPFILE which is relatively new (having been added to Linux 3.11). * The fallback choice was made at compile-time, not runtime. This results in several complications when it comes to running binaries on different machines to the ones they were built on. The easiest way to resolve these things is to have fallbacks work in a more procedural way (though it does make the code unfortunately more complicated) and to add a new fallback that uses mkostemp(3). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:50 +11:00
Aleksa Sarai	5b775bf297	nsenter: cloned_binary: detect and handle short copies For a variety of reasons, sendfile(2) can end up doing a short-copy so we need to just loop until we hit the binary size. Since /proc/self/exe is tautologically our own binary, there's no chance someone is going to modify it underneath us (or change its size). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-26 19:51:01 +11:00
Mrunal Patel	f79e211b1d	Merge pull request #1995 from giuseppe/exec-preserve-fds exec: expose --preserve-fds	2019-02-25 17:35:28 -08:00
Giuseppe Scrivano	52f4e0facc	exec: expose --preserve-fds The implementation is already there, we only need to add the CLI option and pass it down. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-02-25 17:33:04 +01:00
Mrunal Patel	5b5130ad76	Merge pull request #1963 from adrianreber/go-criu Vendor in go-criu and use it for CRIU's RPC definition	2019-02-23 10:44:28 -08:00
Michael Crosby	8084f7611e	Merge pull request #1986 from adrianreber/master switched travis to xenial	2019-02-21 15:36:02 -05:00
Adrian Reber	f1da0d3008	switched travis to xenial The CRIU test for lazy migration was always skipped in Travis because the kernel was too old. This switches Travis testing to dist: xenial which provides a newer kernel which enables CRIU lazy migration testing. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-16 19:45:22 +01:00
Aleksa Sarai	751f18de2a	merge branch 'pr-1982' nsexec (CVE-2019-5736): avoid parsing environ LGTMs: @cyphar @crosbymichael Closes #1982	2019-02-15 18:40:33 +11:00
Adrian Reber	9edb5494bb	Use vendored in CRIU Go bindings This makes use of the vendored in Go bindings and removes the copy of the CRIU RPC interface definition. runc now relies on go-criu for RPC definition and hopefully more CRIU functions can be used in the future from the CRIU Go bindings. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-14 18:20:02 +01:00
Adrian Reber	bfca1e6262	Vendor in go-criu Now that CRIU has released Go bindings, this commit vendors those in. At first it only replaces the copy of RPC interface but the goal is to use CRIU functions from the Go bindings instead of replicating the functionality in runc. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-14 18:20:02 +01:00
Christian Brauner	bb7d8b1f41	nsexec (CVE-2019-5736): avoid parsing environ My first attempt to simplify this and make it less costly focussed on the way constructors are called. I was under the impression that the ELF specification mandated that argc, argv, and actually even envp need to be passed to functions located in the .init_array section (aka "constructors"). Actually, the specification says (cf. [2]): SHT_INIT_ARRAY This section contains an array of pointers to initialization functions, as described in ``Initialization and Termination Functions'' in Chapter 5. Each pointer in the array is taken as a parameterless procedure with a void return. which means that this becomes a libc specific decision. Glibc passes down those args, musl doesn't. So this approach can't work. However, we can at least remove the environment parsing part based on POSIX since [1] mandates that there should be an environ variable defined in unistd.h which provides access to the environment. See also the relevant Open Group specification [1]. [1]: http://pubs.opengroup.org/onlinepubs/9699919799/ [2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array Fixes: CVE-2019-5736 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2019-02-14 16:06:21 +01:00
Mrunal Patel	f414f497b5	Merge pull request #1978 from filbranden/systemd5 Remove detection for scope properties, which have always been broken	2019-02-13 11:54:08 -08:00
Daniel, Dao Quang Minh	0a012df867	Merge pull request #1973 from jhowardmsft/jjh/runtimespec Vendor opencontainers/runtime-spec 29686dbc	2019-02-12 17:07:43 +00:00
Filipe Brandenburger	cd41feb46b	Remove detection for scope properties, which have always been broken The detection for scope properties (whether scope units support DefaultDependencies= or Delegate=) has always been broken, since systemd refuses to create scopes unless at least one PID is attached to it (and this has been so since scope units were introduced in systemd v205.) This can be seen in journal logs whenever a container is started with libpod: Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing. Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing. Since this logic never worked, just assume both attributes are supported (which is what the code does when detection fails for this reason, since it's looking for an "unknown attribute" or "read-only attribute" to mark them as false) and skip the detection altogether. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2019-02-11 16:05:37 -08:00
Aleksa Sarai	6635b4f0c6	merge branch 'cve-2019-5736' nsenter: clone /proc/self/exe to avoid exposing host binary to container Fixes: CVE-2019-5736 LGTMs: @cyphar @crosbymichael	2019-02-08 18:58:10 +11:00
Aleksa Sarai	0a8e4117e7	nsenter: clone /proc/self/exe to avoid exposing host binary to container There are quite a few circumstances where /proc/self/exe pointing to a pretty important container binary is a _bad_ thing, so to avoid this we have to make a copy (preferably doing self-clean-up and not being writeable). We require memfd_create(2) -- though there is an O_TMPFILE fallback -- but we can always extend this to use a scratch MNT_DETACH overlayfs or tmpfs. The main downside to this approach is no page-cache sharing for the runc binary (which overlayfs would give us) but this is far less complicated. This is only done during nsenter so that it happens transparently to the Go code, and any libcontainer users benefit from it. This also makes ExtraFiles and --preserve-fds handling trivial (because we don't need to worry about it). Fixes: CVE-2019-5736 Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-08 18:57:59 +11:00
Aleksa Sarai	dd023c457d	merge branch 'pr-1972' Update vendored golang.org/x/sys to latest LGTMs: @crosbymichael @cyphar Closes #1972	2019-02-08 18:52:59 +11:00
John Howard	ec069fe332	Vendor opencontainers/runtime-spec 29686dbc Signed-off-by: John Howard <jhoward@microsoft.com>	2019-02-07 14:49:22 -08:00
Filipe Brandenburger	4a600c04ed	Update vendored golang.org/x/sys to latest Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2019-02-06 17:59:21 -08:00
Mrunal Patel	e4fa8a4575	Merge pull request #1955 from xiaochenshen/rdt-fix-destroy-issue libcontainer: intelrdt: fix null intelrdt path issue in Destroy()	2019-02-01 13:18:56 -08:00
Mrunal Patel	4e4c907193	Merge pull request #1950 from cloudfoundry-incubator/enter-pid-race Resilience in adding of exec tasks to cgroups	2019-02-01 13:18:16 -08:00
Mrunal Patel	6994ff2742	Merge pull request #1967 from cyphar/integration-factory-fixup integration: fix mis-use of libcontainer.Factory	2019-02-01 13:16:36 -08:00
Michael Crosby	8011af4a96	Merge pull request #1964 from adrianreber/org.criu Document 'org.criu.config' annotation	2019-01-25 14:28:19 -05:00
Aleksa Sarai	565325fc36	integration: fix mis-use of libcontainer.Factory For some reason, libcontainer/integration has a whole bunch of incorrect usages of libcontainer.Factory -- causing test failures with a set of security patches that will be published soon. Fixing this is fairly trivial (switch to creating a new libcontainer.Factory once in each process, rather than creating one in TestMain globally). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-01-24 23:12:48 +13:00
Adrian Reber	dd50c7e332	Add 'org.criu.config' annotation documentation Signed-off-by: Adrian Reber <areber@redhat.com>	2019-01-15 19:54:47 +01:00
Adrian Reber	5f32bb94fd	Update runc-checkpoint man-page This just copies the latest output from 'runc checkpoint --help' to the man page. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-01-15 19:54:47 +01:00
Michael Crosby	c1e454b2a1	Merge pull request #1960 from giuseppe/fix-kmem-systemd systemd: fix setting kernel memory limit	2019-01-15 13:21:01 -05:00
Michael Crosby	4e9d52da54	Merge pull request #1933 from adrianreber/master Add CRIU configuration file support	2019-01-15 11:22:38 -05:00
Aleksa Sarai	12f6a99120	merge branch 'pr-1962' rootfs: umount all procfs and sysfs with --no-pivot LGTMs: @mrunalp @cyphar Closes #1962	2019-01-15 15:15:53 +11:00
Giuseppe Scrivano	28a697cce3	rootfs: umount all procfs and sysfs with --no-pivot When creating a new user namespace, the kernel doesn't allow mounting a new procfs or sysfs file system if there is not already one instance fully visible in the current mount namespace. When using --no-pivot we were effectively inhibiting this protection from the kernel, as /proc and /sys from the host are still present in the container mount namespace. A container without full access to /proc could then create a new user namespace, and from there be able to mount a fully visible /proc, bypassing the limitations in the container. A simple reproducer for this issue is: unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-14 09:53:35 +01:00
Giuseppe Scrivano	f01923376d	systemd: fix setting kernel memory limit since commit `df3fa115f9` it is not possible to set a kernel memory limit when using the systemd cgroups backend as we use cgroup.Apply twice. Skip enabling kernel memory if there are already tasks in the cgroup. Without this patch, runc fails with: container_linux.go:344: starting container process caused "process_linux.go:311: applying cgroup configuration for process caused \"failed to set memory.kmem.limit_in_bytes, because either tasks have already joined this cgroup or it has children\"" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-10 11:33:50 +01:00
Xiaochen Shen	acb75d0e38	libcontainer: intelrdt: fix null intelrdt path issue in Destroy() This patch fixes a corner case when destroying a container: If we start a container without 'intelRdt' config set, and then we run "runc update --l3-cache-schema/--mem-bw-schema" to add 'intelRdt' config implicitly. Now if we enter "exit" from inside the container, we will pass through linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy(). But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still an empty string, as it hasn't been initialized yet. As a result, the rdt group directory created during "runc update" will not be removed as expected. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2019-01-05 00:34:25 +08:00
Adrian Reber	403986c5dd	Add CRIU patch to fix checkpoint test For the newly integrated feature to use CRIU configuration files the test is broken without an additional CRIU patch. The test changes CRIU's log file. Changing the log file is unfortunately the only thing which is broken in CRIU 3.11. But it is the easiest option for testing. With CRIU 3.12 this will be fixed. All other CRIU options can be changed with a CRIU configuration file. With this change the CRIU 3.11 feature can be merged into runc with a test and for the user it should just work, if they are not trying to change CRIU's log file. Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
Adrian Reber	6f3e13cc48	Added test for container specific CRIU configuration files Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
Adrian Reber	e157963054	Enable CRIU configuration files CRIU 3.11 introduces configuration files: https://criu.org/Configuration_files https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html This enables the user to influence CRIU's behaviour without code changes if using new CRIU features or if the user wants to enable certain CRIU behaviour without always specifying certain options. With this it is possible to write 'tcp-established' to the configuration file: $ echo tcp-established > /etc/criu/runc.conf and from now on all checkpoints will preserve the state of established TCP connections. This removes the need to always use $ runc checkpoint --tcp-established if the goal is to always checkpoint with '--tcp-established'. It also adds the possibility for unexpected CRIU behaviour if the user created a configuration file at some point in time and forgets about it. As a result of the discussion in https://github.com/opencontainers/runc/pull/1933 it is now also possible to define a CRIU configuration file for each container with the annotation 'org.criu.config'. If 'org.criu.config' does not exist, runc will tell CRIU to use '/etc/criu/runc.conf' if it exists. If 'org.criu.config' is set to an empty string (''), runc will tell CRIU to not use any runc specific configuration file at all. If 'org.criu.config' is set to a non-empty string, runc will use that value as an additional configuration file for CRIU. With the annotation the user can decide to use the default configuration file ('/etc/criu/runc.conf'), none or a container specific configuration file. Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
Adrian Reber	360ba8a27d	Update criurpc definition for latest features Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
Michael Crosby	bbb17efcb4	Merge pull request #1952 from JoeWrightss/patch-4 Fix .Fatalf() error message	2018-12-20 09:18:50 -05:00
JoeWrightss	0855bce448	Fix .Fatalf() error message Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-19 20:22:48 +08:00
Tom Godkin	bdf3524b34	Retry adding pids to cgroups when EINVAL occurs The kernel will sometimes return EINVAL when writing a pid to a cgroup.procs file. It does so when the task being added still has the state TASK_NEW. See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286 Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Tom Godkin <tgodkin@pivotal.io> Signed-off-by: Danail Branekov <danailster@gmail.com>	2018-12-17 15:34:47 +00:00
Aleksa Sarai	f5b99917df	merge branch 'pr-1945' Fix some typos LGTMs: @crosbymichael @cyphar Closes #1945	2018-12-11 03:43:44 +11:00
JoeWrightss	769d6c4a75	Fix some typos Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-09 23:52:54 +08:00
Daniel, Dao Quang Minh	859f74576e	Merge pull request #1942 from KentaTada/fix-kernel-config-to-adjust-to-moby Modify check-config.sh in accordance with Moby Project updates	2018-12-08 20:15:53 +00:00
Michael Crosby	25f3f893c8	Merge pull request #1939 from cyphar/nokmem-error cgroups: nokmem: error out on explicitly-set kmemcg limits	2018-12-04 11:14:56 -05:00
Michael Crosby	96ec2177ae	Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers kill: allow to signal paused containers	2018-12-03 16:55:13 -05:00
Michael Crosby	ff38d6e7cc	Merge pull request #1944 from Ace-Tang/criu_notify_pid cr: get pid from criu notify when restore	2018-12-03 10:35:58 -05:00
Ace-Tang	dce70cdff5	cr: get pid from criu notify when restore When restoring a container from a checkpoint directory, we should get the pid from the criu notify message, since c.initProcess has not been created. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-12-03 13:31:20 +08:00
Aleksa Sarai	8a4629f7b5	cgroups: nokmem: error out on explicitly-set kmemcg limits When built with nokmem we explicitly are disabling support for kmemcg, but it is a strict specification requirement that if we cannot fulfil an aspect of the container configuration that we error out. Completely ignoring explicitly-requested kmemcg limits with nokmem would undoubtedly lead to problems. Fixes: `6a2c155968` ("libcontainer: ability to compile without kmem") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-12-01 14:31:35 +11:00
Giuseppe Scrivano	07d1ad44c8	kill: allow to signal paused containers regression introduced by `87a188996e` Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-11-30 23:35:47 +01:00
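
The three 'org.criu.config' cases described in commit `e157963054` can be read as a small decision function. The Go sketch below is only an illustration of those rules as the commit message states them, not runc's actual code: the helper name `criuConfigPath` and the `defaultExists` parameter (standing in for a filesystem check on '/etc/criu/runc.conf') are assumptions made for this example.

```go
package main

import "fmt"

// defaultCriuConfig is the default path named in the commit message.
const defaultCriuConfig = "/etc/criu/runc.conf"

// criuConfigPath is a hypothetical helper, not a runc function.
// It returns the configuration file to hand to CRIU and whether any
// file should be passed at all. defaultExists is an assumed stand-in
// for checking that defaultCriuConfig is present on disk.
func criuConfigPath(annotations map[string]string, defaultExists bool) (string, bool) {
	value, present := annotations["org.criu.config"]
	switch {
	case !present:
		// Annotation absent: use the default file, but only if it exists.
		if defaultExists {
			return defaultCriuConfig, true
		}
		return "", false
	case value == "":
		// Annotation set to '': no runc-specific configuration file.
		return "", false
	default:
		// Annotation set to a non-empty string: use it as the extra file.
		return value, true
	}
}

func main() {
	// The path below is made up for the example.
	custom := map[string]string{"org.criu.config": "/etc/criu/web.conf"}
	path, use := criuConfigPath(custom, true)
	fmt.Println(path, use)
}
```

Keeping "annotation missing" and "annotation set to an empty string" as separate branches matters here, because the commit message gives them different meanings (default file versus no file).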

3795 Commits