jasder/runc - runc - Military Science Open Source Project Hosting

Commit Graph

Author	SHA1	Message	Date
Kenta Tada	65032b55b1	libcontainer: fix TestGetContainerState to check configs.NEWCGROUP This test needs to handle the case of configs.NEWCGROUP as Namespace's type. Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2019-05-21 09:10:38 +09:00
Mrunal Patel	2484581dd7	Merge pull request #2035 from cyphar/bindmount-types specconv: always set "type: bind" in case of MS_BIND	2019-05-07 15:47:58 -07:00
Mrunal Patel	a0ecf749ee	Merge pull request #2047 from filbranden/systemd7 Move systemd.Manager initialization into a function in that module	2019-05-07 15:08:41 -07:00
Filipe Brandenburger	46351eb3d1	Move systemd.Manager initialization into a function in that module This will permit us to extend the internals of systemd.Manager to include further information about the system, such as whether cgroupv1, cgroupv2 or both are in effect. Furthermore, it allows a future refactor of moving more of UseSystemd() code into the factory initialization function. Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>	2019-05-01 13:22:19 -07:00
Georgi Sabev	a146081828	Write logs to stderr by default Minor refactoring to use the filePair struct for both init sock and log pipe Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-24 15:18:14 +03:00
Georgi Sabev	68b4ff5b37	Simplify bail logic & minor nsexec improvements Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-24 15:16:11 +03:00
Xiaochen Shen	17b37ea3fa	libcontainer: intelrdt: add missing destroy handler in defer func In the exception handling of initProcess.start(), we need to add the missing IntelRdtManager.Destroy() handler in defer func. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2019-04-24 16:41:51 +08:00
Georgi Sabev	475aef10f7	Remove redundant log function Bump logrus so that we can use logrus.StandardLogger().Logf instead Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-22 17:54:55 +03:00
Georgi Sabev	ba3cabf932	Improve nsexec logging * Simplify logging function * Logs contain __FUNCTION__:__LINE__ * Bail uses write_log Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-22 17:53:52 +03:00
Aleksa Sarai	8296826da5	specconv: always set "type: bind" in case of MS_BIND We discovered in umoci that setting a dummy type of "none" would result in file-based bind-mounts no longer working properly, which is caused by a restriction for when specconv will change the device type to "bind" to work around rootfs_linux.go's ... issues. However, bind-mounts don't have a type (and Linux will ignore any type specifier you give it) because the type is copied from the source of the bind-mount. So we should always overwrite it to avoid user confusion. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-04-08 15:08:08 +10:00
Danail Branekov	c486e3c406	Address comments in PR 1861 Refactor configuring logging into a reusable component so that it can be nicely used in both main() and init process init() Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com> Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io> Co-authored-by: Claudia Beresford <cberesford@pivotal.io> Signed-off-by: Danail Branekov <danailster@gmail.com>	2019-04-04 14:57:28 +03:00
Marco Vedovati	feebfac358	Remove pipe close before exec. Pipe close before exec is not necessary as os.Pipe() is calling pipe2 with O_CLOEXEC option. Signed-off-by: Marco Vedovati <mvedovati@suse.com>	2019-04-04 14:53:30 +03:00
Marco Vedovati	9a599f62fb	Support for logging from children processes Add support for logging from child processes (including nsexec). A pipe is used to send logs from the children to the parent in JSON. The JSON format is the same one used by the logrus JSON formatter, i.e. child processes can use the standard logrus APIs. Signed-off-by: Marco Vedovati <mvedovati@suse.com>	2019-04-04 14:53:23 +03:00
Michael Crosby	11fc498ffa	Merge pull request #2023 from LittleLightLittleFire/2022-fix-runc-zombie-process-regression Fixes regression causing zombie runc:[1:CHILD] processes	2019-03-22 14:06:31 -04:00
Mrunal Patel	dd22a84864	Merge pull request #2012 from rhatdan/selinux Need to setup labeling of kernel keyrings.	2019-03-20 21:17:18 -07:00
Alex Fang	eab5330908	Fixes regression causing zombie runc:[1:CHILD] processes Whenever processes are spawned using nsexec, a zombie runc:[1:CHILD] process will always be created and will need to be reaped by the parent Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>	2019-03-21 13:43:38 +11:00
Aleksa Sarai	f56b4cbead	merge branch 'pr-2015' Use getenv not secure_getenv LGTMs: @crosbymichael @cyphar Closes #2015	2019-03-16 17:30:56 +11:00
Filipe Brandenburger	4b2b978291	Add cgroup name to error message More information should help troubleshoot an issue when this error occurs. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2019-03-14 10:25:00 -07:00
Justin Cormack	6f714aa928	Use getenv not secure_getenv secure_getenv is a Glibc extension and so this code does not compile on Musl libc any more after this patch. secure_getenv is only intended to be used in setuid binaries, in order that they should not trust their environment. It simply returns NULL if the binary is running setuid. If runc was installed setuid, the user can already do anything as root, so it is game over, so this check is not needed. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2019-03-14 10:58:10 +00:00
Daniel J Walsh	cd96170c10	Need to setup labeling of kernel keyrings. Work is ongoing in the kernel to support different kernel keyrings per user namespace. We want to allow SELinux to manage kernel keyrings inside of the container. Currently, when runc creates the kernel keyring, it gets the label which runc is running with, usually `container_runtime_t`. With this change the kernel keyring will be labeled with the container process label, container_t:s0:c1,c2. A container running as container_t:s0:c1,c2 can manage keyrings with the same label. This change required a revendoring of the SELinux Go bindings, github.com/opencontainers/selinux. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2019-03-13 17:57:30 -04:00
Mrunal Patel	2b18fe1d88	Merge pull request #1984 from cyphar/memfd-cleanups nsenter: cloned_binary: "memfd" cleanups	2019-03-07 10:18:33 -08:00
Michael Crosby	f739110263	Merge pull request #1968 from adrianreber/podman Create bind mount mountpoints during restore	2019-03-04 11:37:07 -06:00
Aleksa Sarai	2d4a37b427	nsenter: cloned_binary: userspace copy fallback if sendfile fails There are some circumstances where sendfile(2) can fail (one example is that AppArmor appears to block writing to deleted files with sendfile(2) under some circumstances) and so we need to have a userspace fallback. It's fairly trivial (and handles short-writes). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:10 +11:00
Aleksa Sarai	16612d74de	nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying The usage of memfd_create(2) and other copying techniques is quite wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR. memfd_create(2) added ~10M of memory usage to the cgroup associated with the container, which can result in some setups getting OOM'd (or just hogging the hosts' memory when you have lots of created-but-not-started containers sticking around). The easiest way of solving this is by creating a read-only bind-mount of the binary, opening that read-only bindmount, and then umounting it to ensure that the host won't accidentally be re-mounted read-write. This avoids all copying and cleans up naturally like the other techniques used. Unfortunately, like the O_TMPFILE fallback, this requires being able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting over the most obvious path -- /proc/self/exe -- is a very bad idea). Unfortunately detecting this isn't fool-proof -- on a system with a read-only root filesystem (that might become read-write during "runc init" execution), we cannot tell whether we have already done an ro remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY environment variable which is checked alongside the protection being present. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:08 +11:00
Aleksa Sarai	af9da0a450	nsenter: cloned_binary: use the runc statedir for O_TMPFILE Writing a file to tmpfs actually incurs a memcg penalty, and thus the benefit of being able to disable memfd_create(2) with _LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should be noted that quite a few distributions don't use tmpfs for /tmp (and instead have it as a regular directory or subvolume of the host filesystem). Since runc must have write access to the state directory anyway (and the state directory is usually not on a tmpfs) we can use that instead of /tmp -- avoiding potential memcg costs with no real downside. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:51 +11:00
Aleksa Sarai	2429d59352	nsenter: cloned_binary: expand and add pre-3.11 fallbacks In order to get around the memfd_create(2) requirement, `0a8e4117e7` ("nsenter: clone /proc/self/exe to avoid exposing host binary to container") added an O_TMPFILE fallback. However, this fallback was flawed in two ways: * It required O_TMPFILE which is relatively new (having been added to Linux 3.11). * The fallback choice was made at compile-time, not runtime. This results in several complications when it comes to running binaries on different machines to the ones they were built on. The easiest way to resolve these things is to have fallbacks work in a more procedural way (though it does make the code unfortunately more complicated) and to add a new fallback that uses mkostemp(3). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:50 +11:00
Aleksa Sarai	5b775bf297	nsenter: cloned_binary: detect and handle short copies For a variety of reasons, sendfile(2) can end up doing a short-copy so we need to just loop until we hit the binary size. Since /proc/self/exe is tautologically our own binary, there's no chance someone is going to modify it underneath us (or changing the size). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-26 19:51:01 +11:00
Mrunal Patel	5b5130ad76	Merge pull request #1963 from adrianreber/go-criu Vendor in go-criu and use it for CRIU's RPC definition	2019-02-23 10:44:28 -08:00
Adrian Reber	9edb5494bb	Use vendored in CRIU Go bindings This makes use of the vendored in Go bindings and removes the copy of the CRIU RPC interface definition. runc now relies on go-criu for RPC definition and hopefully more CRIU functions can be used in the future from the CRIU Go bindings. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-14 18:20:02 +01:00
Christian Brauner	bb7d8b1f41	nsexec (CVE-2019-5736): avoid parsing environ My first attempt to simplify this and make it less costly focussed on the way constructors are called. I was under the impression that the ELF specification mandated that argc, argv, and actually even envp need to be passed to functions located in the .init_array section (aka "constructors"). Actually, the specification is (cf. [2]): SHT_INIT_ARRAY This section contains an array of pointers to initialization functions, as described in ``Initialization and Termination Functions'' in Chapter 5. Each pointer in the array is taken as a parameterless procedure with a void return. which means that this becomes a libc specific decision. Glibc passes down those args, musl doesn't. So this approach can't work. However, we can at least remove the environment parsing part based on POSIX since [1] mandates that there should be an environ variable defined in unistd.h which provides access to the environment. See also the relevant Open Group specification [1]. [1]: http://pubs.opengroup.org/onlinepubs/9699919799/ [2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array Fixes: CVE-2019-5736 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2019-02-14 16:06:21 +01:00
Filipe Brandenburger	cd41feb46b	Remove detection for scope properties, which have always been broken The detection for scope properties (whether scope units support DefaultDependencies= or Delegate=) has always been broken, since systemd refuses to create scopes unless at least one PID is attached to it (and this has been so since scope units were introduced in systemd v205.) This can be seen in journal logs whenever a container is started with libpod: Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing. Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing. Since this logic never worked, just assume both attributes are supported (which is what the code does when detection fails for this reason, since it's looking for an "unknown attribute" or "read-only attribute" to mark them as false) and skip the detection altogether. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2019-02-11 16:05:37 -08:00
Adrian Reber	7354546cc8	Create mountpoints also on restore runc creates all missing mountpoints when it starts a container, this commit also creates those mountpoints during restore. Now it is possible to restore a container using the same, but newly created rootfs just as during container start. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-08 15:59:51 +01:00
Adrian Reber	f661e02343	factor out bind mount mountpoint creation During rootfs setup all mountpoints (directory and files) are created before bind mounting the bind mounts. This does not happen during container restore via CRIU. If restoring in an identical but newly created rootfs, the restore fails right now. This just factors out the code to create the bind mount mountpoints so that it also can be used during restore. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-08 15:59:51 +01:00
Aleksa Sarai	0a8e4117e7	nsenter: clone /proc/self/exe to avoid exposing host binary to container There are quite a few circumstances where /proc/self/exe pointing to a pretty important container binary is a _bad_ thing, so to avoid this we have to make a copy (preferably doing self-clean-up and not being writeable). We require memfd_create(2) -- though there is an O_TMPFILE fallback -- but we can always extend this to use a scratch MNT_DETACH overlayfs or tmpfs. The main downside to this approach is no page-cache sharing for the runc binary (which overlayfs would give us) but this is far less complicated. This is only done during nsenter so that it happens transparently to the Go code, and any libcontainer users benefit from it. This also makes ExtraFiles and --preserve-fds handling trivial (because we don't need to worry about it). Fixes: CVE-2019-5736 Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-08 18:57:59 +11:00
Mrunal Patel	e4fa8a4575	Merge pull request #1955 from xiaochenshen/rdt-fix-destroy-issue libcontainer: intelrdt: fix null intelrdt path issue in Destroy()	2019-02-01 13:18:56 -08:00
Mrunal Patel	4e4c907193	Merge pull request #1950 from cloudfoundry-incubator/enter-pid-race Resilience in adding of exec tasks to cgroups	2019-02-01 13:18:16 -08:00
Aleksa Sarai	565325fc36	integration: fix mis-use of libcontainer.Factory For some reason, libcontainer/integration has a whole bunch of incorrect usages of libcontainer.Factory -- causing test failures with a set of security patches that will be published soon. Fixing this is fairly trivial (switch to creating a new libcontainer.Factory once in each process, rather than creating one in TestMain globally). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-01-24 23:12:48 +13:00
Michael Crosby	c1e454b2a1	Merge pull request #1960 from giuseppe/fix-kmem-systemd systemd: fix setting kernel memory limit	2019-01-15 13:21:01 -05:00
Michael Crosby	4e9d52da54	Merge pull request #1933 from adrianreber/master Add CRIU configuration file support	2019-01-15 11:22:38 -05:00
Giuseppe Scrivano	28a697cce3	rootfs: umount all procfs and sysfs with --no-pivot When creating a new user namespace, the kernel doesn't allow mounting a new procfs or sysfs file system unless one instance is already fully visible in the current mount namespace. When using --no-pivot we were effectively inhibiting this protection from the kernel, as /proc and /sys from the host are still present in the container mount namespace. A container without full access to /proc could then create a new user namespace, and from there be able to mount a fully visible /proc, bypassing the limitations in the container. A simple reproducer for this issue is: unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-14 09:53:35 +01:00
Giuseppe Scrivano	f01923376d	systemd: fix setting kernel memory limit since commit `df3fa115f9` it is not possible to set a kernel memory limit when using the systemd cgroups backend as we use cgroup.Apply twice. Skip enabling kernel memory if there are already tasks in the cgroup. Without this patch, runc fails with: container_linux.go:344: starting container process caused "process_linux.go:311: applying cgroup configuration for process caused \"failed to set memory.kmem.limit_in_bytes, because either tasks have already joined this cgroup or it has children\"" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-10 11:33:50 +01:00
Xiaochen Shen	acb75d0e38	libcontainer: intelrdt: fix null intelrdt path issue in Destroy() This patch fixes a corner case when destroying a container: if we start a container without the 'intelRdt' config set, and then run “runc update --l3-cache-schema/--mem-bw-schema” to add the 'intelRdt' config implicitly, then entering "exit" from inside the container passes through linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy(). But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still an empty string because it hasn’t been initialized yet. As a result, the rdt group directory created during "runc update" will not be removed as expected. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2019-01-05 00:34:25 +08:00
Adrian Reber	e157963054	Enable CRIU configuration files CRIU 3.11 introduces configuration files: https://criu.org/Configuration_files https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html This enables the user to influence CRIU's behaviour without code changes if using new CRIU features or if the user wants to enable certain CRIU behaviour without always specifying certain options. With this it is possible to write 'tcp-established' to the configuration file: $ echo tcp-established > /etc/criu/runc.conf and from now on all checkpoints will preserve the state of established TCP connections. This removes the need to always use $ runc checkpoint --tcp-established if the goal is to always checkpoint with '--tcp-established'. It also adds the possibility for unexpected CRIU behaviour if the user created a configuration file at some point in time and forgets about it. As a result of the discussion in https://github.com/opencontainers/runc/pull/1933 it is now also possible to define a CRIU configuration file for each container with the annotation 'org.criu.config'. If 'org.criu.config' does not exist, runc will tell CRIU to use '/etc/criu/runc.conf' if it exists. If 'org.criu.config' is set to an empty string (''), runc will tell CRIU to not use any runc specific configuration file at all. If 'org.criu.config' is set to a non-empty string, runc will use that value as an additional configuration file for CRIU. With the annotation the user can decide to use the default configuration file ('/etc/criu/runc.conf'), none or a container specific configuration file. Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
Adrian Reber	360ba8a27d	Update criurpc definition for latest features Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
JoeWrightss	0855bce448	Fix .Fatalf() error message Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-19 20:22:48 +08:00
Tom Godkin	bdf3524b34	Retry adding pids to cgroups when EINVAL occurs The kernel will sometimes return EINVAL when writing a pid to a cgroup.procs file. It does so when the task being added still has the state TASK_NEW. See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286 Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Tom Godkin <tgodkin@pivotal.io> Signed-off-by: Danail Branekov <danailster@gmail.com>	2018-12-17 15:34:47 +00:00
JoeWrightss	769d6c4a75	Fix some typos Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-09 23:52:54 +08:00
Michael Crosby	25f3f893c8	Merge pull request #1939 from cyphar/nokmem-error cgroups: nokmem: error out on explicitly-set kmemcg limits	2018-12-04 11:14:56 -05:00
Michael Crosby	96ec2177ae	Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers kill: allow to signal paused containers	2018-12-03 16:55:13 -05:00
Ace-Tang	dce70cdff5	cr: get pid from criu notify when restore When restoring a container from a checkpoint directory, we should get the pid from the criu notify, since c.initProcess has not been created. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-12-03 13:31:20 +08:00
Aleksa Sarai	8a4629f7b5	cgroups: nokmem: error out on explicitly-set kmemcg limits When built with nokmem we explicitly are disabling support for kmemcg, but it is a strict specification requirement that if we cannot fulfil an aspect of the container configuration that we error out. Completely ignoring explicitly-requested kmemcg limits with nokmem would undoubtedly lead to problems. Fixes: `6a2c155968` ("libcontainer: ability to compile without kmem") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-12-01 14:31:35 +11:00
Giuseppe Scrivano	07d1ad44c8	kill: allow to signal paused containers regression introduced by `87a188996e` Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-11-30 23:35:47 +01:00
Michael Crosby	4932620b62	Merge pull request #1919 from xiaochenshen/rdt-mba-software-controller libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc	2018-11-26 16:45:42 -05:00
Michael Crosby	50e2634995	Merge pull request #1934 from lifubang/kill fix: may kill other process when container has been stopped	2018-11-21 10:30:25 -05:00
Lifubang	87a188996e	may kill other process when container has been stopped Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-11-21 17:44:52 +08:00
Aleksa Sarai	ceefc3fe4e	merge branch 'pr-1741' libcontainer: Set 'status' in hook stdin LGTMs: @cyphar @crosbymichael Closes #1741	2018-11-20 06:39:30 +11:00
Michael Crosby	76520a4bf0	Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint Respect container's cgroup path	2018-11-16 14:06:54 -05:00
W. Trevor King	e23868603a	libcontainer: Set 'status' in hook stdin Finish off the work started in `a344b2d6` (sync up `HookState` with OCI spec `State`, 2016-12-19, #1201). And drop HookState, since there's no need for a local alias for specs.State. Also set c.initProcess in newInitProcess to support OCIState calls from within initProcess.start(). I think the cyclic references between linuxContainer and initProcess are unfortunate, but didn't want to address that here. I've also left the timing of the Prestart hooks alone, although the spec calls for them to happen before start (not as part of creation) [1,2]. Once the timing gets fixed we can drop the initProcessStartTime hacks which initProcess.start currently needs. I'm not sure why we trigger the prestart hooks in response to both procReady and procHooks. But we've had two prestart rounds in initProcess.start since `2f276498` (Move pre-start hooks after container mounts, 2016-02-17, #568). I've left that alone too. I really think we should have len() guards to avoid computing the state when .Hooks is non-nil but the particular phase we're looking at is empty. Aleksa, however, is adamantly against them [3] citing a risk of sloppy copy/pastes causing the hook slice being len-guarded to diverge from the hook slice being iterated over within the guard. I think that sort of thing is very low-risk, because: * We shouldn't be copy/pasting this, right? DRY for the win :). * There's only ever a few lines between the guard and the guarded loop. That makes broken copy/pastes easy to catch in review. * We should have test coverage for these. Guarding with the wrong slice is certainly not the only thing you can break with a sloppy copy/paste. But I'm not a maintainer ;). [1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart [2]: https://github.com/opencontainers/runc/issues/1710 [3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570 Signed-off-by: W. Trevor King <wking@tremily.us>	2018-11-14 06:49:49 -08:00
Mrunal Patel	4769cdf607	Merge pull request #1916 from crosbymichael/cgns Add support for cgroup namespace	2018-11-13 12:21:38 -08:00
Mrunal Patel	f000fe11ec	Merge pull request #1917 from slp/master libcontainer: map PidsLimit to systemd's TasksMax property	2018-11-13 12:21:23 -08:00
Michael Crosby	aa7917b751	Merge pull request #1911 from theSuess/linter-fixes Various cleanups to address linter issues	2018-11-13 12:13:34 -05:00
Michael Crosby	bd420b59f1	Merge pull request #1925 from Ace-Tang/fix_dup_ns test: fix TestDupNamespaces fail to test dup-ns error	2018-11-13 12:11:11 -05:00
Xiaochen Shen	95af9eff82	libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc The MBA Software Controller feature was introduced in Linux kernel v4.18. It is a software enhancement that mitigates some limitations of MBA which are described in the kernel documentation. It also makes the interface more user friendly - we can specify memory bandwidth in "MBps" (Mega Bytes per second) as well as in "percentages". The kernel underneath uses a software feedback mechanism, or "Software Controller", which reads the actual bandwidth using MBM counters and adjusts the memory bandwidth percentages to ensure: "actual memory bandwidth < user specified memory bandwidth". We can enable this feature through the mount option "-o mba_MBps": mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl In runc, we handle both memory bandwidth schemata in a unified format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The unit of memory bandwidth is "percentages" by default, and "MBps" if the MBA Software Controller is enabled. For more information about Intel RDT and MBA Software Controller: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-11-13 23:27:08 +08:00
Ace-Tang	16d55f17a8	libcontainer: fix potential panic if spec.Process is nil In this code path, the pointer 'spec.Process' should be checked for nil first to avoid a panic. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-11-06 11:55:30 +08:00
Ace-Tang	95d1aa1886	test: fix TestDupNamespaces add Root in created spec, or error message is 'Root must be specified' Signed-off-by: Ace-Tang <aceapril@126.com>	2018-11-06 11:36:27 +08:00
Michael Crosby	b1068fb925	Merge pull request #1814 from rhatdan/selinux SELinux labels are tied to the thread	2018-11-05 10:00:11 -05:00
Aleksa Sarai	9f1e94488e	merge branch 'pr-1921' libcontainer: ability to compile without kmem LGTMs: @mrunalp @cyphar Closes #1921	2018-11-02 09:54:16 +11:00
Michael Crosby	9e5aa7494d	Merge pull request #1918 from giuseppe/skip-setgroups rootless: fix running with /proc/self/setgroups set to deny	2018-11-01 13:16:47 -04:00
Kir Kolyshkin	6a2c155968	libcontainer: ability to compile without kmem Commit `fe898e7862` (PR #1350) enables kernel memory accounting for all cgroups created by libcontainer -- even if kmem limit is not configured. Kernel memory accounting is known to be broken in some kernels, specifically the ones from RHEL7 (including RHEL 7.5). Those kernels do not support kernel memory reclaim, and are prone to oopses. Unconditionally enabling kmem acct on such kernels leads to bugs, such as * https://github.com/opencontainers/runc/issues/1725 * https://github.com/kubernetes/kubernetes/issues/61937 * https://github.com/moby/moby/issues/29638 This commit gives a way to compile runc without kernel memory setting support. To do so, use something like make BUILDTAGS="seccomp nokmem" Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2018-10-31 20:35:51 -07:00
Yuanhong Peng	df3fa115f9	Add support for cgroup namespace Cgroup namespace can be configured in `config.json` as other namespaces. Here is an example: ``` "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" }, { "type": "cgroup" } ], ``` Note that if you want to run a container which has shared cgroup ns with another container, then it's strongly recommended that you set proper `CgroupsPath` of both containers(the second container's cgroup path must be the subdirectory of the first one). Or there might be some unexpected results. Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-31 10:51:43 -04:00
Chris Aniszczyk	f3ce8221ea	Merge pull request #1913 from xiaochenshen/rdt-add-diagnostics libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors	2018-10-25 14:27:17 -05:00
Giuseppe Scrivano	869add3318	rootless: fix running with /proc/self/setgroups set to deny This is a regression from `06f789cf26` when the user namespace was configured without a privileged helper. To allow a single mapping in an user namespace, it is necessary to set /proc/self/setgroups to "deny". For a simple reproducer, the user namespace can be created with "unshare -r". Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-10-25 15:44:15 +02:00
Sergio Lopez	5c6b9c3c1c	libcontainer: map PidsLimit to systemd's TasksMax property Currently runc applies the PidsLimit restriction by writing directly to the cgroup's pids.max, without notifying systemd. As a consequence, when the latter updates the context of the corresponding scope, pids.max is reset to the value of systemd's TasksMax property. This can be easily reproduced this way (I'm using "postfix" here just as an example, any unrelated but existing service will do): # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h` # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max 111 # systemctl disable --now postfix # systemctl enable --now postfix # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max max This patch adds TasksAccounting=true and TasksMax=PidsLimit to the properties sent to systemd. Signed-off-by: Sergio Lopez <slp@redhat.com>	2018-10-24 17:20:27 +02:00
Aleksa Sarai	e93996674f	merge branch 'pr-1903' clarify license information LGTMs: @hqhq @cyphar Closes #1903	2018-10-24 22:03:44 +11:00
Aleksa Sarai	9a3a8a5ebf	libcontainer: implement CLONE_NEWCGROUP This is a very simple implementation because it doesn't require any configuration unlike the other namespaces, and in its current state it only masks paths. This feature is available in Linux 4.6+ and is enabled by default for kernels compiled with CONFIG_CGROUP=y. Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-23 16:23:00 -04:00
Xiaochen Shen	6c307f8ff2	libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors Linux kernel v4.15 introduces better diagnostics for Intel RDT operation errors. If an error is returned when making new directories or writing to any of the control files in the resctrl filesystem, reading the file /sys/fs/resctrl/info/last_cmd_status can provide more information than is conveyed in the error returned from the file operation. Some examples: echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata -bash: echo: write error: Invalid argument cat /sys/fs/resctrl/info/last_cmd_status mask f3 has non-consecutive 1-bits echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata -bash: echo: write error: Invalid argument cat /sys/fs/resctrl/info/last_cmd_status MB value 0 out of range [10,100] cd /sys/fs/resctrl mkdir 1 2 3 4 5 6 7 8 mkdir: cannot create directory '8': No space left on device cat /sys/fs/resctrl/info/last_cmd_status out of CLOSIDs See 'last_cmd_status' for more details in kernel documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt In runc, we can append the diagnostics information to the error message of Intel RDT operation errors to provide more user-friendly information. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-19 00:16:08 +08:00
Mrunal Patel	c2ab1e656e	Merge pull request #1910 from adrianreber/tip Fix travis Go: tip	2018-10-17 12:47:08 -07:00
Michael Crosby	58592df567	Merge pull request #1880 from AkihiroSuda/fix-subgid libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs	2018-10-16 15:21:51 -04:00
Xiaochen Shen	d59b17d6d5	libcontainer: intelrdt: Add more check if sub-features are enabled Double check if Intel RDT sub-features are available in "resource control" filesystem. Intel RDT sub-features can be selectively disabled or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and newer kernel. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:44 +08:00
Xiaochen Shen	f097339289	libcontainer: intelrdt: add test cases for Intel RDT/MBA Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:39 +08:00
Xiaochen Shen	27560ace2f	libcontainer: intelrdt: add support for Intel RDT/MBA in runc Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttle over memory bandwidth for the software. A user controls the resource by indicating the percentage of maximum memory bandwidth. Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If the hardware supports it, the CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ |-- info | |-- L3 | | |-- cbm_mask | | |-- min_cbm_bits | | |-- num_closids | |-- MB | |-- bandwidth_gran | |-- delay_linear | |-- min_bandwidth | |-- num_closids |-- ... |-- schemata |-- tasks |-- <container_id> |-- ... |-- schemata |-- tasks For MBA support for `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT which was implemented in #1279. We could also make use of `tasks` and `schemata` configuration for memory bandwidth resource constraints. The file `tasks` has a list of tasks that belong to this group (e.g., "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format. 
Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage. Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware. For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth is 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1. "linux": { "intelRdt": { "memBwSchema": "MB:0=20;1=70" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:29 +08:00
Xiaochen Shen	c1cece7e23	libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:28:19 +08:00
Mrunal Patel	a00bf01908	Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)	2018-10-15 17:32:15 -07:00
Dominik Süß	0b412e9482	various cleanups to address linter issues Signed-off-by: Dominik Süß <dominik@suess.wtf>	2018-10-13 21:14:03 +02:00
Adrian Reber	0d01164756	Fix travis Go: tip This fixes libcontainer/container_linux.go:1200: Error call has possible formatting directive %s Signed-off-by: Adrian Reber <areber@redhat.com>	2018-10-13 10:44:07 +00:00
Aleksa Sarai	e40d4635c4	merge branch 'pr-1894' Move spec.Linux.IntelRdt check to spec.Linux != nil block LGTMs: @crosbymichael @cyphar Closes #1894	2018-10-09 02:41:13 +11:00
Jonathan Marler	1499c746a1	Move spec.Linux.IntelRdt check to spec.Linux != nil block Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>	2018-10-04 21:30:55 -06:00
Mike Brown	26bdc0dce7	clarify license information Signed-off-by: Mike Brown <brownwm@us.ibm.com>	2018-10-03 10:39:44 -05:00
Mrunal Patel	2abd837c8c	Merge pull request #1893 from cyphar/keyctl-ignore-enosys keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)	2018-09-25 13:35:16 -07:00
Danail Branekov	a1d5398afa	Respect container's cgroup path Respect the container's cgroup path when finding the container's cgroup mount point, which is useful in multi-tenant environments, where containers have their own unique cgroup mounts Signed-off-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io> Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>	2018-09-25 17:43:36 +01:00
Aleksa Sarai	578fe65e4f	merge branch 'pr-1817' Fix duplicate entries and missing entries in getCgroupMountsHelper Add test for testing cgroup mounts on bedrock linux Stop relying on number of subsystems for cgroups LGTMs: @crosbymichael @cyphar Closes #1817	2018-09-19 19:48:17 +10:00
Michael Crosby	cc8146cf93	Merge pull request #1858 from marcov/nsenter-README Update outdated nsenter README content	2018-09-17 10:53:19 -04:00
Michael Crosby	d77251d5fc	Merge pull request #1892 from Ace-Tang/add_clean_test test: add more test case for CleanPath	2018-09-17 10:51:17 -04:00
Aleksa Sarai	40f1468413	keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING) While all modern kernels (and I do mean _all_ of them -- this syscall was added in 2.6.10 before git had begun development!) have support for this syscall, LXC has a default seccomp profile that returns ENOSYS for this syscall. For most syscalls this would be a deal-breaker, and our use of session keyrings is security-based, but there are a few mitigating factors that make this change not-completely-insane: * We already have a flag that disables the use of session keyrings (for older kernels that had system-wide keyring limits and so on). So disabling it is not a new idea. * While the primary justification of using session keys is security-based, it's more of a security-by-obscurity protection. The main defense keyrings have is VFS credentials -- which is something that users already have better security tools for (setuid(2) and user namespaces). * Given the security justification you might argue that we shouldn't silently ignore this. However, the only way for the kernel to return -ENOSYS is either being ridiculously old (at which point we wouldn't work anyway) or that there is a seccomp profile in place blocking it. Given that the seccomp profile (if malicious) could very easily just return 0 or a silly return code (or something even more clever with seccomp-bpf) and trick us without this patch, there isn't much of a significant change in how much seccomp can trick us with or without this patch. Given all of that over-analysis, I'm pretty convinced there isn't a security problem in this very specific case and it will help out the ChromeOS folks by allowing Docker to run inside their LXC container setup. I'd be happy to be proven wrong. Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565 Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-09-17 21:38:30 +10:00
Ace-Tang	5963cf2afc	test: add more test case for CleanPath Signed-off-by: Ace-Tang <aceapril@126.com>	2018-09-14 21:37:12 +08:00
Akihiro Suda	06f789cf26	Disable rootless mode except RootlessCgMgr when executed as the root in userns This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and `RootlessCgroups bool`, so as to make "runc-in-userns" more compatible with "rootful" runc. `RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in the current user namespace. `RootlessEUID` is almost identical to the former `Rootless` except for cgroups handling. `RootlessCgroups` denotes that runc is unlikely to have full access to cgroups. `RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace. Otherwise `RootlessCgroups` is set to true. (Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well) When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes), `RootlessEUID` is set to false but `RootlessCgroups` is set to true. So, "runc-in-userns" behaves almost the same as "rootful" runc except that cgroups errors are ignored. This PR does not have any impact on CLI flags and `state.json`. Note about CLI: * Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`. * Now `runc spec --rootless` is only required when `RootlessEUID` is set to true. For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of UID/GID are mapped. Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`): * `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility. (`/run/runc` is used) * If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`. This allows unprivileged users to execute runc as the root in userns, without mounting writable `/run/runc`. 
Note about `state.json`: * `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-09-07 15:05:03 +09:00
Yan Zhu	feb90346e0	doc: fix typo Signed-off-by: Yan Zhu <yanzhu@alauda.io>	2018-09-07 11:58:59 +08:00
Michael Crosby	70ca035aa6	Merge pull request #1883 from lifubang/containeridinpath fix delete other file bug when container id is ..	2018-09-05 13:43:21 -04:00
Mrunal Patel	9cda583235	Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot linux: drop check for /proc as invalid dest	2018-09-04 09:26:21 -07:00
Lifubang	4eb30fcdbe	code optimization: use securejoin.SecureJoin and CleanPath Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-09-04 09:02:18 +08:00
Lifubang	4fae8fcce2	code optimization after review Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-09-03 23:27:31 +08:00
Lifubang	d2d226e8f9	fix unexpected delete bug when container id is .. Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-08-31 11:17:42 +08:00
ChangFeng	3ce8fac7c4	libcontainer: add /proc/loadavg to the white list of bind mount Signed-off-by: JunLi <lijun.git@gmail.com>	2018-08-30 21:30:23 +08:00
Giuseppe Scrivano	636b664027	linux: drop check for /proc as invalid dest it is now allowed to bind mount /proc. This is useful for rootless containers when the PID namespace is shared with the host. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-08-30 09:56:18 +02:00
Akihiro Suda	b34d6d8a7c	libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs subgid is defined per user, not group (see subgid(5)) This commit also adds support for specifying subuid owner with a numeric UID. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-08-29 07:46:03 +09:00
Michael Crosby	1555a78945	Merge pull request #1874 from mrunalp/drop_unused_code Remove unused veth setup code	2018-08-27 11:07:25 -04:00
Qiang Huang	0228707b77	Merge pull request #1873 from rhatdan/ms_move When doing a copyup, /tmp can not be a shared mount point	2018-08-27 10:08:53 +08:00
Mrunal Patel	fe3d5c4c6e	Remove unused veth setup code Networking is setup by plugins for users of runc so it makes sense to get rid of the veth strategy. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2018-08-24 15:41:52 -07:00
Adrian Reber	fa43a72aba	criu: restore into existing namespace when specified Using CRIU to checkpoint and restore a container into an existing network namespace is not possible. If the network namespace is defined like { "type": "network", "path": "/run/netns/test" } there is the expectation that the restored container is again running in the network namespace specified with 'path'. This adds the new CRIU 'external namespace' feature to runc, where during checkpointing that specific namespace is referenced and during restore CRIU tries to restore the container in exactly that namespace. This breaks/fixes current runc behavior. If, without this patch, runc restores a container with such a network namespace definition, it is ignored and CRIU recreates a network namespace without a name. With this patch runc uses the network namespace path (if available) to checkpoint and restore the container in just that network namespace. Restore will now fail if a container was checkpointed with a network namespace path set and if that network namespace path does not exist during restore. runc still falls back to the old behavior if CRIU older than 3.11 is installed. Fixes #1786 Related to https://github.com/projectatomic/libpod/pull/469 Thanks to Andrei Vagin for all the help in getting the interface between CRIU and runc right! Signed-off-by: Adrian Reber <areber@redhat.com>	2018-08-22 23:27:20 +02:00
Daniel J Walsh	62a4763a7a	When doing a copyup, /tmp can not be a shared mount point MOVE_MOUNT will fail under certain situations. You are not allowed to MS_MOVE if the parent directory is shared. man mount ... The move operation Move a mounted tree to another place (atomically). The call is: mount --move olddir newdir This will cause the contents which previously appeared under olddir to now be accessible under newdir. The physical location of the files is not changed. Note that olddir has to be a mountpoint. Note also that moving a mount residing under a shared mount is invalid and unsupported. Use findmnt -o TARGET,PROPAGATION to see the current propagation flags. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2018-08-20 17:41:06 -04:00
Aleksa Sarai	20aff4f048	merge branch 'pr-1867' Revert "libcontainer/rootfs_linux: minor cleanup" LGTMs: @hqhq @cyphar Closes #1867	2018-08-15 15:42:56 +10:00
Mrunal Patel	26ec8a9783	Revert "libcontainer/rootfs_linux: minor cleanup" This reverts commit `1b27db67f1`. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2018-08-14 15:50:18 -07:00
Marco Vedovati	34ed62697b	Update outdated nsenter README content Signed-off-by: Marco Vedovati <mvedovati@suse.com>	2018-08-07 17:53:56 +02:00
Michael Crosby	4056a41f58	Merge pull request #1830 from crosbymichael/procs Pass GOMAXPROCS to init processes	2018-08-01 10:48:06 -04:00
Jay Kamat	a2faaa1317	Fix duplicate entries and missing entries in getCgroupMountsHelper Signed-off-by: Jay Kamat <jaygkamat@gmail.com>	2018-07-31 20:12:18 -07:00
Alban Crequy	3321aa1af7	Fix regression with mounts with non-absolute source path PR #1753 introduced a test on the mount flags but the binary operator was wrong, see https://github.com/opencontainers/runc/pull/1753#discussion_r203445652 This was noticed when investigating https://github.com/opencontainers/runtime-tools/issues/651 Symptoms: in the container, /proc/self/mountinfo displays some mounts as follows: 296 279 0:67 / /tmp rw,nosuid - tmpfs /home/dpark/go/src/github.com/opencontainers/runc/tmpfs rw,size=65536k,mode=755 Signed-off-by: Alban Crequy <alban@kinvolk.io>	2018-07-18 18:30:49 +02:00
Michael Crosby	53fddb540a	Pass GOMAXPROCS to init processes This will help runc's init to not spawn many threads on large systems when launched with max procs by the caller. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-06-26 11:23:37 -04:00
Michael Crosby	2c632d1a2d	Merge pull request #1824 from cyphar/fix-mips-build-devNumber libcontainer: devices: fix mips builds	2018-06-25 13:21:28 -04:00
Jay Kamat	e5a7c61f3c	Add test for testing cgroup mounts on bedrock linux Add a mountinfo from a bedrock linux system with 4 strata, and include it for tests Signed-off-by: Jay Kamat <jaygkamat@gmail.com> Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2018-06-24 00:01:07 +01:00
Daniel Dao	5ee0648bfb	Stop relying on number of subsystems for cgroups When there are complicated mount setups, there can be multiple mount points which have the subsystem we are looking for. Instead of counting the mountpoints, tick off subsystems until we have found them all. Without the 'all' flag, ignore duplicate subsystems after the first. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2018-06-24 00:00:58 +01:00
Aleksa Sarai	823c06eae9	libcontainer: improve "kernel.{domainname,hostname}" sysctl handling These sysctls are namespaced by CLONE_NEWUTS, and we need to use "kernel.domainname" if we want users to be able to set an NIS domainname on Linux. However we disallow "kernel.hostname" because it would conflict with the "hostname" field and cause confusion (but we include a helpful message to make it clearer to the user). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-06-18 21:48:04 +10:00
Aleksa Sarai	a0e99e7a1a	libcontainer: devices: fix mips builds It turns out that MIPS uses uint32 in the device number returned by stat(2), so explicitly wrap everything to make the compiler happy. I really wish that Go had C-like numeric type promotion. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-06-17 11:22:01 +10:00
Mrunal Patel	ad0f525506	Merge pull request #1819 from tiborvass/fix-arm32bit libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)	2018-06-15 07:06:50 -07:00
Tibor Vass	c205e9fb64	libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits) This fixes the following compilation error on 32bit ARM: ``` $ GOARCH=arm GOARM=6 go build ./libcontainer/system/ libcontainer/system/linux.go:119:89: constant 4294967295 overflows int ``` Signed-off-by: Tibor Vass <tibor@docker.com>	2018-06-14 18:33:14 +00:00
Giuseppe Scrivano	cbcc85d311	runc: not require uid/gid mappings if euid()==0 When running in a new userNS as root, don't require a mapping to be present in the configuration file. We are already skipping the test for a new userns to be present. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-06-12 12:45:54 +02:00
Daniel J Walsh	aa3fee6c80	SELinux labels are tied to the thread We need to lock the threads for the SetProcessLabel to work, and should also call SetProcessLabel("") after the container starts to go back to the default SELinux behaviour. Once you call SetProcessLabel, any process executed by runc will run with this label, even if the process is for setup rather than the container. It is always safest to call the SELinux calls just before the exec of the container, so that other processes do not get started with the incorrect label. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2018-06-11 08:34:58 -04:00
Aleksa Sarai	dd56ece823	merge branch 'pr-1812' Fix race in runc exec LGTMs: @dqminh @cyphar Closes #1812	2018-06-04 19:02:33 +10:00
Daniel, Dao Quang Minh	2e91544060	Merge pull request #1806 from cyphar/cgroup-ignorable-error-fixup cgroup: clean up isIgnorableError for skippable EROFS	2018-06-02 23:57:02 +01:00
Mrunal Patel	bd3c4f844a	Fix race in runc exec There is a race in runc exec when the init process stops just before the check for the container status. It is then wrongly assumed that we are trying to start an init process instead of an exec process. This commit adds an Init field to libcontainer Process to distinguish between init and exec processes to prevent this race. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2018-06-01 16:25:58 -07:00
Michael Crosby	0e561642f8	Merge pull request #1688 from AkihiroSuda/unshare-m-r main: support rootless mode in userns	2018-05-29 15:41:17 -04:00
Aleksa Sarai	939d5a3753	cgroup: clean up isIgnorableError for skippable EROFS Include a rootless argument for isIgnorableError to avoid people accidentally using isIgnorableError when they shouldn't (we don't ignore any errors when running as root as that really isn't safe). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-05-25 11:31:41 +10:00
Qiang Huang	dd67ab10d7	Merge pull request #1759 from cyphar/rootless-erofs-as-eperm rootless: cgroup: treat EROFS as a skippable error	2018-05-25 09:24:16 +08:00
Daniel, Dao Quang Minh	2e931185f9	Merge pull request #1805 from derekwaynecarr/systemd-cpuquota-fix fix systemd cpu quota for -1	2018-05-24 11:24:27 +01:00
Akihiro Suda	c93815738a	libcontainer: remove extra CAP_SETGID check for SetgroupAttr Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-05-24 14:59:30 +09:00
Derek Carr	b515963c10	systemd cpu quota ignores -1 Signed-off-by: Derek Carr <decarr@redhat.com>	2018-05-23 14:28:39 -04:00
Michael Crosby	fd0febd3ce	Wrap error messages during init Fixes #1437 Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-05-10 10:28:10 -04:00
Akihiro Suda	f103de57ec	main: support rootless mode in userns Running rootless containers in userns is useful for mounting filesystems (e.g. overlay) with mapped euid 0, but without actual root privilege. Usage: (Note that `unshare --mount` requires `--map-root-user`) user$ mkdir lower upper work rootfs user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" ) user$ unshare --mount --map-root-user mappedroot# runc spec --rootless mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs mappedroot# runc run foo Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-05-10 12:16:43 +09:00
Akihiro Suda	9c7d8bc1fd	libcontainer: add parser for /etc/sub{u,g}id and /proc/PID/{u,g}id_map Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-05-10 12:16:43 +09:00
Mrunal Patel	0cbfd8392f	Merge pull request #1562 from cyphar/carry-975-959-ipc-uid-namespaces nsenter: improve namespace creation and SELinux IPC handling	2018-04-26 14:12:33 -07:00
Mrunal Patel	871ba2e58e	Merge pull request #1781 from filbranden/systemd3 Make channel for StartTransientUnit buffered	2018-04-24 11:56:34 -07:00
Michael Crosby	bdbb9fab07	Merge pull request #1693 from AkihiroSuda/leave-setgroups-allow libcontainer: allow setgroup in rootless mode	2018-04-24 11:24:04 -04:00
Michael Crosby	1f11dc5dba	Merge pull request #1785 from dlorenc/seccomp Make the setupSeccomp function public.	2018-04-19 16:00:54 -04:00
Mrunal Patel	63e6708c74	Merge pull request #1784 from pierrchen/master libcontainer/rootfs_linux: minor cleanup	2018-04-17 17:02:10 -07:00
dlorenc	40680b2d37	Make the setupSeccomp function public. This function is useful for converting from the OCI spec format to the one used by runC/libcontainer. Signed-off-by: dlorenc <lorenc.d@gmail.com>	2018-04-17 10:47:22 -07:00
Michael Crosby	d56f6cc202	Merge pull request #1753 from wking/do-not-require-bind-mount-type libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts	2018-04-16 11:01:53 -04:00
Bin Chen	1b27db67f1	libcontainer/rootfs_linux: minor cleanup move variable close to where is used Signed-off-by: Bin Chen <nk@devicu.com>	2018-04-16 22:25:48 +10:00
Filipe Brandenburger	165ee45334	Make channel for StartTransientUnit buffered So that, if a timeout happens and we decide to stop blocking on the operation, the writer will not block when they try to report the result of the operation. This should address Issue #1780 and it's a follow up for PR #1683, PR #1754 and PR #1772. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2018-04-14 08:49:50 -07:00
Michael Crosby	f753f300ae	Merge pull request #1779 from runcom/gcc8-fix nsexec.c: fix GCC 8 warning	2018-04-12 12:13:43 -04:00
Michael Crosby	9f0eca2a94	Merge pull request #1777 from nalind/no-config-for-extant-netns Only configure networking when creating a net ns	2018-04-12 10:55:02 -04:00
Antonio Murdaca	1a5064622c	nsexec.c: fix GCC 8 warning Signed-off-by: Antonio Murdaca <runcom@redhat.com>	2018-04-12 12:25:06 +02:00
Nalin Dahyabhai	4521d4b19c	Only configure networking when creating a net ns When joining an existing namespace, don't default to configuring a loopback interface in that namespace. Its creator should have done that, and we don't want to fail to create the container when we don't have sufficient privileges to configure the network namespace. Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>	2018-04-11 13:28:19 -04:00
Filipe Brandenburger	0e16bd9b53	Detect whether Delegate is available on both slices and scopes Starting with systemd 237, in preparation for cgroup v2, delegation is now only available for scopes, not slices. Update libcontainer code to detect whether delegation is available on both and use that information when creating new slices. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2018-04-10 11:42:55 -07:00
Filipe Brandenburger	8ab251f298	Fix systemd.Apply() to check for DBus error before waiting on a channel. The channel was introduced in #1683 to work around a race condition. However, the check for error in StartTransientUnit ignores the error for an already existing unit, and in that case there will be no notification from DBus (so waiting on the channel will make it hang.) Later PR #1754 added a timeout, which worked around the issue, but we can fix this correctly by only waiting on the channel when there is no error. Fix the code to do so. The timeout handling was kept, since there might be other cases where this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358 mentions calling this code from inside a container, it's unclear whether an existing container was in use or not, so not sure whether this would have fixed that bug as well.) Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2018-04-09 11:51:59 -07:00
Sebastien Boeuf	985628dda0	libcontainer: Don't set container state to running when exec'ing There is no reason to set the container state to "running" as a temporary value when exec'ing a process on a container in "created" state. The problem with doing this is that consumers of the libcontainer library might use it by keeping pointers in memory. In this case, the container state will indicate that the container is running, which is wrong, and this will end up with a failure on the next action because the check for the container state transition will complain. Fixes #1767 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2018-03-30 09:29:18 -07:00
Akihiro Suda	73f3dc6389	libcontainer: allow setgroup in rootless mode Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-03-27 17:42:05 +09:00
Akihiro Suda	ed58366cc8	libcontainer: fix Boolmsg alignment Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-03-26 14:44:03 +09:00
Tamal Saha	58415b4b12	Fix error message Signed-off-by: Tamal Saha <tamal@appscode.com>	2018-03-21 20:52:09 -07:00
Aleksa Sarai	fd3a6e6c83	libcontainer: handle unset oomScoreAdj correctly Previously if oomScoreAdj was not set in config.json we would implicitly set oom_score_adj to 0. This is not allowed according to the spec: > If oomScoreAdj is not set, the runtime MUST NOT change the value of > oom_score_adj. Change this so that we do not modify oom_score_adj if oomScoreAdj is not present in the configuration. While this modifies our internal configuration types, the on-disk format is still compatible. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-03-17 13:53:42 +11:00
Aleksa Sarai	03e585985f	rootless: cgroup: treat EROFS as a skippable error In some cases, /sys/fs/cgroups is mounted read-only. In rootless containers we can consider this effectively identical to having cgroups that we don't have write permission to -- because the user isn't responsible for the read-only setup and cannot modify it. The rules are identical to when /sys/fs/cgroups is not writable by the unprivileged user. An example of this is the default configuration of Docker, where cgroups are mounted as read-only as a preventative security measure. Reported-by: Vladimir Rutsky <rutsky@google.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-03-17 13:53:42 +11:00
Daniel J Walsh	43aea05946	Label the masked tmpfs with the mount label Currently if a confined container process tries to list these directories AVC's are generated because they are labeled with external labels. Adding the mountlabel will remove these AVC's. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2018-03-09 14:29:06 -05:00
Qiang Huang	9facb87f87	Merge pull request #1754 from vikaschoudhary16/add-timeout Add timeout while waiting for StartTransientUnit completion signal	2018-03-08 09:09:34 +08:00
W. Trevor King	0aa6e4e5d3	libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts From the "Creating a bind mount" section of mount(2) [1]: > If mountflags includes MS_BIND (available since Linux 2.4), then > perform a bind mount... > > The filesystemtype and data arguments are ignored. This commit adds support for configurations that leave the OPTIONAL type [2] unset for bind mounts. There's a related spec-example change in flight with [3], although my personal preference would be a more explicit spec for the whole mount structure [4]. [1]: http://man7.org/linux/man-pages/man2/mount.2.html [2]: https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config.md#L102 [3]: https://github.com/opencontainers/runtime-spec/pull/954 [4]: https://github.com/opencontainers/runtime-spec/pull/771 Signed-off-by: W. Trevor King <wking@tremily.us>	2018-03-07 10:23:42 -08:00
vikaschoudhary16	04e95b526d	Add timeout while waiting for StartTransientUnit completion signal from dbus Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>	2018-03-07 05:11:38 -05:00
Denys Smirnov	3d26fc3fd7	cgroups/fs: fix NPE on Destroy when no cgroups are set Currently Manager accepts nil cgroups when calling Apply, but it will panic when trying to call Destroy with the same config. Signed-off-by: Denys Smirnov <denys@sourced.tech>	2018-03-06 23:31:31 +01:00
Vincent Batts	bf74951617	libcontainer/user: platform dependent calls This rearranges a bit of the user and group lookup, such that only a basic subset is exposed. Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>	2018-02-28 14:14:24 -05:00
Aleksa Sarai	757e78bebd	merge branch 'pr-1743' The setupUserNamespace function is always called. LGTMs: @crosbymichael @mrunalp @cyphar Closes #1743	2018-02-27 12:22:52 +11:00
Michael Crosby	8aca07289d	Merge pull request #1736 from allencloud/fix-lint-warning fix lint error in specconv	2018-02-26 14:21:26 -05:00
ynirk	2420eb1f4d	The setupUserNamespace function is always called. The function is called even if the user namespace is not set. This results in the wrong uid/gid being set on devices. This fix adds a test to check if the user namespace is set before calling setupUserNamespace. Fixes #1742 Signed-off-by: Julien Lavesque <julien.lavesque@gmail.com>	2018-02-26 14:27:11 +01:00
Allen Sun	3f32e72963	fix lint error in specconv Signed-off-by: Allen Sun <allensun.shl@alibaba-inc.com>	2018-02-26 15:39:54 +08:00
W. Trevor King	d71b3f5344	libcontainer/sync: Drop procConsole transaction from comments These were added in `244c9fc4` (*: console rewrite, 2016-06-04, #1018) alongside procConsole and the associated handling. procConsole and that handling were removed in `00a0ecf5` (Add separate console socket, 2017-03-02, #1356), but `00a0ecf5` missed this comment. Signed-off-by: W. Trevor King <wking@tremily.us>	2018-02-23 15:03:56 -08:00
Michael Crosby	595bea022f	Merge pull request #1722 from ravisantoshgudimetla/fix-systemd-path fix systemd slice expansion so that it could be consumed by cAdvisor	2018-02-20 09:59:24 -05:00
W. Trevor King	50dc7ee96c	libcontainer/capabilities_linux: Drop os.Getpid() call gocapability has supported 0 as "the current PID" since syndtr/gocapability@5e7cce49 (Allow to use the zero value for pid to operate with the current task, 2015-01-15, syndtr/gocapability#2). libcontainer was ported to that approach in `444cc298` (namespaces: allow to use pid namespace without mount namespace, 2015-01-27, docker/libcontainer#358), but the change was clobbered by `22df5551` (Merge branch 'master' into api, 2015-02-19, docker/libcontainer#388) which landed via `5b73860e` (Merge pull request #388 from docker/api, 2015-02-19, docker/libcontainer#388). This commit restores the changes from `444cc298`. Signed-off-by: W. Trevor King <wking@tremily.us>	2018-02-19 15:47:42 -08:00
ravisantoshgudimetla	7019e1de7b	fix systemd slice expansion so that it could be consumed by cAdvisor Signed-off-by: ravisantoshgudimetla <ravisantoshgudimetla@gmail.com>	2018-02-18 21:32:39 -05:00
Mrunal Patel	6e15bc3f92	Merge pull request #1702 from crosbymichael/chroot chroot when no mount namespaces is provided	2018-02-07 10:09:35 -08:00
W. Trevor King	be16b13645	libcontainer/state_linux_test: Add a testTransitions helper The helper DRYs up the transition tests and makes it easy to get complete coverage for invalid transitions. I'm also using t.Run() for subtests. Run() is new in Go 1.7 [1], but runc dropped support for 1.6 back in `e773f96b` (update go version at travis-ci, 2017-02-20, #1335). [1]: https://blog.golang.org/subtests Signed-off-by: W. Trevor King <wking@tremily.us>	2018-01-25 11:18:45 -08:00
Michael Crosby	91ca331474	chroot when no mount namespaces is provided Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-01-25 11:36:37 -05:00
Michael Crosby	c4e4bb0df2	Merge pull request #1699 from AkihiroSuda/indent-c make: validate C format	2018-01-25 10:09:09 -05:00
Aleksa Sarai	5a46c2ba8b	nsenter: move namespace creation after userns creation Technically, this change should not be necessary, as the kernel documentation claims that if you call clone(flags|CLONE_NEWUSER), the new user namespace will be the owner of all other namespaces created in @flags. Unfortunately this isn't always the case, due to various additional semantics and kernel bugs. One particular instance is SELinux, which acts very strangely towards the IPC namespace and mqueue. If you unshare the IPC namespace before you map a user in the user namespace, the IPC namespace's internal kern-mount for mqueue will be labelled incorrectly and the container won't be able to access it. The only way of solving this is to unshare IPC after the user has been mapped and we have changed to that user. I've also heard of this happening to the NET namespace while talking to some LXC folks, though I haven't personally seen that issue. This change matches our handling of user namespaces to be the same as how LXC handles these problems. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-01-25 23:56:49 +11:00
Akihiro Suda	dd5eb3b9e3	make: validate C format Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-01-24 10:49:50 +09:00
Ed King	5c0af14bf8	Return from goroutine when it should terminate Signed-off-by: Craig Furman <cfurman@pivotal.io>	2018-01-23 10:46:31 +00:00
Will Martin	8d3e6c9826	Avoid race when opening exec fifo When starting a container with `runc start` or `runc run`, the stub process (runc[2:INIT]) opens a fifo for writing. Its parent runc process will open the same fifo for reading. In this way, they synchronize. If the stub process exits at the wrong time, the parent runc process will block forever. This can happen when racing 2 runc operations against each other: `runc run/start`, and `runc delete`. It could also happen for other reasons, e.g. the kernel's OOM killer may select the stub process. This commit resolves this race by racing the opening of the exec fifo from the runc parent process against the stub process exiting. If the stub process exits before we open the fifo, we return an error. Another solution is to wait on the stub process. However, it seems it would require more refactoring to avoid calling wait multiple times on the same process, which is an error. Signed-off-by: Craig Furman <cfurman@pivotal.io>	2018-01-22 17:03:02 +00:00
Antonio Murdaca	cd1e7abee2	libcontainer: expose annotations in hooks Annotations weren't passed to hooks. This patch fixes that by passing annotations to stdin for hooks. Signed-off-by: Antonio Murdaca <runcom@redhat.com>	2018-01-11 16:54:01 +01:00
vikaschoudhary16	d5b4a3eddb	Fix race against systemd - T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298) - T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner. - T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>	2018-01-08 09:37:26 -05:00
Mrunal Patel	e6516b3d5d	Merge pull request #1678 from sboeuf/sboeuf/subreaper libcontainer: Do not wait for signalled processes if subreaper is set	2017-12-15 08:47:07 -08:00
Michael Crosby	7f24b40cc5	Merge pull request #1675 from tklauser/apparmor-no-cgo RFC: libcontainer: remove dependency on libapparmor	2017-12-15 11:23:35 -05:00
Tobias Klauser	db093f621f	libcontainer: remove dependency on libapparmor libapparmor is integrated in libcontainer using cgo but is only used to call a single function: aa_change_onexec. It turns out this function is simple enough (writing a string to a file in /proc/<n>/attr/...) to be re-implemented locally in libcontainer in plain Go. This allows dropping the dependency on libapparmor and the corresponding cgo integration. Fixes #1674 Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-12-15 09:59:58 +01:00
Sebastien Boeuf	bb912eb00c	libcontainer: Do not wait for signalled processes if subreaper is set When a subreaper is enabled, it might expect to reap a process and retrieve its exit code. That's the reason why this patch is giving the possibility to define the usage of a subreaper as a consumer of libcontainer. Relying on this information, libcontainer will not wait for signalled processes in case a subreaper has been set. Fixes #1677 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2017-12-14 10:37:38 -08:00
Mrunal Patel	c6e4a1ebeb	Merge pull request #1665 from Mashimiao/gidmapping-valid-fix specconv: avoid skipping gidmappings applied when uidmappings is empty	2017-12-11 09:50:54 -08:00
Mrunal Patel	b028413c35	Merge pull request #1655 from Mashimiao/add-propagation-more support unbindable,runbindable for rootfs propagation	2017-12-11 09:21:41 -08:00
Allen Sun	fec6b0fea5	Update criu_opts_linux.go Signed-off-by: Allen Sun <shlallen1990@gmail.com>	2017-12-05 15:16:26 +08:00
Michael Crosby	91e9795013	Merge pull request #1654 from dqminh/only-linux remove placeholder for non-linux platforms	2017-11-30 09:51:47 -05:00
Ma Shimiao	57edfbbaf2	specconv: avoid skipping gidmappings applied when uidmappings is empty Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-11-30 16:24:36 +08:00
Aleksa Sarai	e8149af291	merge branch 'pr-1661' Ensure container tests do not write on the host LGTMs: @hqhq @cyphar Closes #1661	2017-11-27 20:10:48 +11:00
Danail Branekov	0495fece57	Ensure container tests do not write on the host TestGetContainerStateAfterUpdate creates its state.json file in the current directory, which turns out to be the host runc directory. Whenever the test completes it leaves the state.json file behind, thus a) polluting the local git repository and b) changing the host file system, violating the principle of doing everything in an isolated container environment. This change creates a new temporary (in-container) directory and uses it as linuxContainer.root Signed-off-by: Tom Godkin <tgodkin@pivotal.io>	2017-11-27 10:43:10 +02:00
Daniel Dao	8898b6b446	remove placeholder for non-linux platforms runc currently only supports the Linux platform, and since we don't intend to expose support for other platforms, remove all placeholder code for other platforms. `libcontainer/configs` is still being used in https://github.com/moby/moby/blob/master/daemon/daemon_windows.go so keep it for now. After this, we probably should also rename files to drop linux suffixes if possible. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-11-24 18:14:51 +00:00
Daniel, Dao Quang Minh	fb871d9cd0	Merge pull request #1664 from tklauser/drop-freebsd libcontainer: drop FreeBSD support	2017-11-24 18:08:21 +00:00
Tobias Klauser	4d27f20db0	libcontainer: drop FreeBSD support runc is not supported on FreeBSD, so remove all FreeBSD specific bits. As suggested by @crosbymichael in #1653 Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-11-24 14:51:05 +01:00
Danail Branekov	38d1e6ec27	Delete xattr related code Selinux related code has been moved to the selinux package (https://github.com/opencontainers/selinux) and therefore xattr related code can be deleted from libcontainer Signed-off-by: Danail Branekov <danailster@gmail.com>	2017-11-21 12:49:28 +02:00
Ma Shimiao	17db6560be	support unbindable,runbindable for rootfs propagation Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-11-17 16:14:15 +08:00
Seth Jennings	bca53e7b49	systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling Signed-off-by: Seth Jennings <sjenning@redhat.com>	2017-11-15 20:20:06 -06:00
Vincent Demeester	3ca4c78b1a	Import docker/docker/pkg/mount into runc This will help get rid of docker/docker dependency in runc 👼 Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-11-08 16:25:58 +01:00
Michael Crosby	2f010ecf19	Merge pull request #1622 from vdemeester/import-symlink-from-docker Remove pkg/symlink from docker/docker and use cyphar/filepath-securejoin	2017-11-08 10:07:00 -05:00
Akihiro Suda	0aac2368e4	specconv.Example(): add /proc/scsi to masked paths Port over https://github.com/moby/moby/pull/35399 Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2017-11-04 17:38:14 +00:00
Michael Crosby	0232e38342	Merge pull request #1629 from masters-of-cats/busybox-inflation Avoid disk usage explosion when copying busybox	2017-11-01 09:15:22 -04:00
Danail Branekov	fdbb9e3e55	Avoid disk usage explosion when copying busybox When running runc tests with temp directory with size 500M copying busybox without preserving hardlinks causes the folder to inflate to roughly 330M. Copying busybox twice in certain tests causes the /tmp directory to overfill. Using `-a` preserves links which busybox uses to implement its choice of binary to run. Signed-off-by: Tom Godkin <tgodkin@pivotal.io>	2017-11-01 09:52:05 +00:00
Vincent Demeester	594501475e	Use cyphar/filepath-securejoin instead of docker pkg/symlink runc shouldn't depend on docker and be more self-contained. Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-10-31 16:53:45 +01:00
Lorenzo Fontana	780f8ef567	Specconv: Test create command hooks and seccomp setup Signed-off-by: Lorenzo Fontana <lo@linux.com>	2017-10-28 21:46:46 +02:00
Mrunal Patel	9a1186d128	Merge pull request #1619 from fntlnz/spec-linux-testing WIP: Better testsuite for specconv	2017-10-25 15:23:19 -07:00
Lorenzo Fontana	c0e6e12f9d	Test Cgroup creation and memory allocations Signed-off-by: Lorenzo Fontana <lo@linux.com>	2017-10-25 01:58:10 +02:00
Aleksa Sarai	ff5075c33f	init: correctly handle unmapped stdio with multiple mappings Previously we would handle the "unmapped stdio" case by just doing a simple check, however this didn't handle cases where the overflow_uid was actually mapped in the user namespace. Instead of doing some userspace checks, just try to do the fchown(2) and ignore EINVAL (unmapped) or EPERM (lacking privilege over inode) errors. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-10-25 00:12:21 +11:00
Qiang Huang	74a1729647	Merge pull request #1607 from crosbymichael/term-err libcontainer: handler errors from terminate	2017-10-20 15:15:38 +08:00
Qiang Huang	e8b9b92f57	Merge pull request #1206 from YuPengZTE/devMD026 trailing punctuation in header	2017-10-20 14:47:09 +08:00
Mrunal Patel	80ee9e50b5	Merge pull request #1616 from mheon/seccomp_fix_breakage Fix breaking change in Seccomp profile behavior	2017-10-19 14:15:04 -07:00
Aleksa Sarai	c05f6368af	merge branch 'pr-1615' libcontainer: intelrdt: fix a GetStats() issue LGTMs: @crosbymichael @cyphar Closes #1615	2017-10-19 03:41:16 +11:00
Matthew Heon	e9193ba6e6	Fix breaking change in Seccomp profile behavior Multiple conditions were previously allowed to be placed upon the same syscall argument. Restore this behavior. Signed-off-by: Matthew Heon <mheon@redhat.com>	2017-10-18 11:53:56 -04:00
Qiang Huang	3409d5c555	Merge pull request #1606 from cyphar/rootfs-propagation-no-pivot specconv: emit an error when using MS_PRIVATE with --no-pivot	2017-10-18 09:52:04 +08:00
Xiaochen Shen	d89217515b	libcontainer: intelrdt: fix a GetStats() issue This fixes a GetStats() issue introduced in #1590: If Intel RDT is enabled by hardware and kernel, but intelRdt is not specified in original config, GetStats() will return error unexpectedly because we haven't called Apply() to create intelrdt group or attach tasks for this container. As a result, runc events command will have no output. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-10-17 17:37:07 +08:00
Tobias Klauser	0eed453b21	libcontainer: use Major/Minor from x/sys/unix The Major and Minor functions were added for Linux in golang/sys@85d1495 which is already vendored in. Use these functions instead of the local re-implementation. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-10-17 09:06:42 +02:00
Aleksa Sarai	9b13f5cc7f	merge branch 'pr-1453' propagate argv0 when re-execing from /proc/self/exe LGTMs: @crosbymichael @cyphar Closes #1453	2017-10-17 03:12:22 +11:00
Michael Crosby	ff4481dbf6	Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups Support cgroups with limits as rootless	2017-10-16 12:03:49 -04:00
Petros Angelatos	8098828680	propagate argv0 when re-execing from /proc/self/exe This allows runc to be used as a target for docker's reexec module that depends on a correct argv0 to select which process entrypoint to invoke. Without this patch, when runc re-execs argv0 is set to "/proc/self/exe" and the reexec module doesn't know what to do with it. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>	2017-10-16 14:00:26 +02:00
Tobias Klauser	d2bc081420	libcontainer: merge common syscall implementations There are essentially two possible implementations for Setuid/Setgid on Linux, either using SYS_SETUID32/SYS_SETGID32 or SYS_SETUID/SYS_SETGID, depending on the architecture (see golang/go#1435 for why Setuid/Setgid aren't currently implemented for Linux neither in syscall nor in golang.org/x/sys/unix). Reduce duplication by merging the currently implemented variants and adjusting the build tags accordingly. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-10-16 11:11:18 +02:00
Aleksa Sarai	6d30f7a01b	merge branch 'pr-1424' Update Travis config to use trusty-backports libseccomp Add integration tests for multi-argument Seccomp filters Vendor updated libseccomp-golang for bugfix LGTMs: @crosbymichael @cyphar Closes #1424	2017-10-16 03:01:37 +11:00
Aleksa Sarai	d2ac52fe52	merge branch 'pr-1475' Add support for mips/mips64 Put signalMap in a separate file, so it may be arch-specific LGTMs: @crosbymichael @cyphar Closes #1475	2017-10-16 02:59:34 +11:00
Aleksa Sarai	2430a98e64	merge branch 'pr-1500' rootfs: switch ms_private remount of oldroot to ms_slave LGTMs: @crosbymichael @hqhq Closes opencontainers/runc#1500	2017-10-14 09:32:59 +11:00
Sebastien Boeuf	acb93c9c62	libcontainer: cgroups: Write freezer state after every state check This commit ensures we write the expected freezer cgroup state after every state check, in case the state check does not give the expected result. This can happen when a new task is created and prevents the whole cgroup from being FROZEN, leaving the state in FREEZING instead. This patch prevents an infinite loop from happening. Fixes https://github.com/opencontainers/runc/issues/1609 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2017-10-12 07:07:28 -07:00
Matthew Heon	bbc847a457	Add integration tests for multi-argument Seccomp filters Signed-off-by: Matthew Heon <mheon@redhat.com>	2017-10-10 15:49:08 -04:00
Michael Crosby	bfe3058fc9	Make process check more forgiving Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-10-10 15:36:19 -04:00
Steven Hartland	eb68b900bc	Prevent invalid errors from terminate Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate. Instead of letting these get returned and logged, which causes confusion, silently ignore them. Currently the test needs to be a string test as the errors are private to the runtime packages, so it's our only option. This can be seen if init fails during the setns. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-10-10 15:32:46 -04:00
Michael Crosby	4693fae411	Merge pull request #1590 from xiaochenshen/rdt-cat-support-update-command libcontainer: intelrdt: add update command support	2017-10-10 15:25:22 -04:00
Aleksa Sarai	d4f0f9a52b	specconv: emit an error when using MS_PRIVATE with --no-pivot Due to the semantics of chroot(2) when it comes to mount namespaces, it is not generally safe to use MS_PRIVATE as a mount propagation when using chroot(2). The reason for this is that this effectively results in a set of mount references being held by the chroot'd namespace which the namespace cannot free. pivot_root(2) does not have this issue because the @old_root can be unmounted by the process. Ultimately, --no-pivot is not really necessary anymore as a commonly used option since `f8e6b5af5e` ("rootfs: make pivot_root not use a temporary directory") resolved the read-only issue. But if someone really needs to use it, MS_PRIVATE is never a good idea. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-10-08 17:50:55 +11:00
Michael Crosby	f53ad9cec9	Merge pull request #1604 from AkihiroSuda/cwd libcontainer: create Cwd when it does not exist	2017-10-05 11:15:10 -04:00
Will Martin	ca4f427af1	Support cgroups with limits as rootless Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io> Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>	2017-10-05 11:22:54 +01:00
Akihiro Suda	2edd36fdff	libcontainer: create Cwd when it does not exist The benefit for doing this within runc is that it works well with userns. Actually, runc already does the same thing for mount points. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2017-10-05 05:31:46 +00:00
Konstantinos Karampogias	605dc5c811	Set initial console size based on process spec Signed-off-by: Will Martin <wmartin@pivotal.io> Signed-off-by: Petar Petrov <pppepito86@gmail.com> Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com> Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>	2017-10-04 12:32:16 +01:00
Daniel, Dao Quang Minh	0351df1c5a	Merge pull request #1600 from crosbymichael/console Bump console and sys deps	2017-09-26 10:15:10 +01:00
Michael Crosby	f364c1a58c	Set ClearONLCR in tests Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-09-25 13:35:22 -04:00
Tobias Klauser	d713652bda	libcontainer: remove unnecessary type conversions Generated using github.com/mdempsky/unconvert Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-09-25 10:41:57 +02:00
Qiang Huang	79ad714374	Merge pull request #1598 from euank/ragent libcontainer: default mount propagation correctly	2017-09-25 11:55:29 +08:00
Euan Kemp	4301b440d6	libcontainer: default mount propagation correctly The code in prepareRoot (`e385f67a0e/libcontainer/rootfs_linux.go (L599-L605)`) attempts to default the rootfs mount to `rslave`. However, since the spec conversion has already defaulted it to `rprivate`, that code doesn't actually ever do anything. This changes the spec conversion code to accept "" and treat it as 0. Implicitly, this makes rootfs propagation default to `rslave`, which is a part of fixing the moby bug https://github.com/moby/moby/issues/34672 Alternate implementations include changing this defaulting to be `rslave` and removing the defaulting code in prepareRoot, or skipping the mapping entirely for "", but I think this change is the cleanest of those options. Signed-off-by: Euan Kemp <euan.kemp@coreos.com>	2017-09-22 13:36:23 -07:00
Xiaochen Shen	2549545df5	intelrdt: always init IntelRdtManager if Intel RDT is enabled In current implementation: Either Intel RDT is not enabled by hardware and kernel, or intelRdt is not specified in original config, we don't init IntelRdtManager in the container to handle intelrdt constraint. It is a tradeoff that Intel RDT has hardware limitation to support only limited number of groups. This patch makes a minor change to support update command: Whether or not intelRdt is specified in config, we always init IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is not specified in original config, we just don't Apply() to create intelrdt group or attach tasks for this container. In update command, we could re-enable through IntelRdtManager.Apply() and then update intelrdt constraint. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-20 01:37:31 +08:00
Michael Crosby	593914b8bd	Merge pull request #1593 from s7v7nislands/drop_go1.5 Drop support golang 1.5	2017-09-12 15:22:00 -04:00
s7v7nislands	00ad8e1e56	Drop support golang 1.5 Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>	2017-09-12 20:56:51 +08:00
Qiang Huang	68e00e906b	Merge pull request #1586 from crosbymichael/set-cgroups Apply cgroups earlier	2017-09-12 12:13:29 +08:00
Yong Tang	e9944d0f4c	Disable systemd in static build This fix tries to address the warnings caused by static build with go 1.9. As systemd needs dlopen/dlclose, the following warnings will be generated for static build in go 1.9: ``` root@f4b077232050:/go/src/github.com/opencontainers/runc# make static CGO_ENABLED=1 go build -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc . /tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen': /tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking ``` This fix disables systemd when `static_build` flag is on (apply_nosystemd.go is used instead). This fix also fixes a small bug in `apply_nosystemd.go` for return value. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>	2017-09-11 18:38:22 +00:00
Mrunal Patel	d5b43c3981	Merge pull request #1455 from dqminh/epoll-io tty: move IO of master pty to be done with epoll	2017-09-11 11:32:42 -07:00
Aleksa Sarai	1a5fdc1c5f	init: support setting -u with rootless containers Now that rootless containers have support for multiple uid and gid mappings, allow --user to work as expected. If the user is not mapped, an error occurs (as usual). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:33 +10:00
Aleksa Sarai	969bb49cc3	nsenter: do not resolve path in nsexec context With the addition of our new{uid,gid}map support, we used to call execvp(3) from inside nsexec. This would mean that the path resolution for the binaries would happen in nsexec. Move the resolution to the initial setup code, and pass the absolute path to nsexec. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:33 +10:00
Aleksa Sarai	6097ce74d8	nsenter: correctly handle newgidmap path for rootless containers After quite a bit of debugging, I found that previous versions of this patchset did not include newgidmap in a rootless setting. Fix this by passing it whenever group mappings are applied, and also providing some better checking for try_mapping_tool. This commit also includes some stylistic improvements. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	3282f5a7c1	tests: fix for rootless multiple uids/gids Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	d8b669400a	rootless: allow multiple user/group mappings Take advantage of the newuidmap/newgidmap tools to allow multiple users/groups to be mapped into the new user namespace in the rootless case. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [ rebased to handle intelrdt changes. ] Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Mrunal Patel	13fa5d2953	Merge pull request #1588 from s7v7nislands/delete_unused Delete unused function	2017-09-08 17:34:00 -07:00
Michael Crosby	b82d07e816	Merge pull request #1587 from Mashimiao/fix-namespace-empty Fixes #1585 config.Namespaces is empty when accessed	2017-09-08 10:50:16 -04:00
Xiaochen Shen	88d22fde40	libcontainer: intelrdt: use init() to avoid race condition This is the follow-up PR of #1279 to fix remaining issues: Use init() to avoid race condition in IsIntelRdtEnabled(). Also rename some variables and functions. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-08 17:15:31 +08:00
s7v7nislands	c795b8690b	Delete unused function Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>	2017-09-08 10:35:46 +08:00
Ma Shimiao	c3d20e7817	Fixes #1585 config.Namespaces is empty when accessed Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-09-08 09:30:07 +08:00
Mrunal Patel	deb9d7fd96	Merge pull request #1569 from cyphar/delay-seccomp init: delay seccomp application as late as possible	2017-09-07 13:27:37 -07:00
Mrunal Patel	7e036aa0b0	Merge pull request #1541 from adrianreber/lazy checkpoint: support lazy migration	2017-09-07 13:25:04 -07:00
Michael Crosby	7062c7556b	Apply cgroups earlier This applies cgroups earlier for container creation before the init process starts running and forking off any additional processes. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-09-07 11:27:33 -04:00
Adrian Reber	60ae7091de	checkpoint: support lazy migration With the help of userfaultfd CRIU supports lazy migration. Lazy migration means that memory pages are only transferred from the migration source to the migration destination on page fault. This enables to reduce the downtime during process or container migration to a minimum as the memory does not need to be transferred during migration. Lazy migration currently depends on userfaultfd being available on the current Linux kernel and if the used CRIU version supports lazy migration. Both dependencies can be checked by querying CRIU via RPC if the lazy migration feature is available. Using feature checking instead of version comparison enables runC to use CRIU features from the criu-dev branch. This way the user can decide if lazy migration should be available by choosing the right kernel and CRIU branch. To use lazy migration the CRIU process during dump needs to dump everything besides the memory pages and then it opens a network port waiting for remote page fault requests: # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \ --status-fd /tmp/postcopy-pipe In this example CRIU will hang/wait once it has opened the network port and wait for network connection. As runC waits for CRIU to finish it will also hang until the lazy migration has finished. To know when the restore on the destination side can start the '--status-fd' parameter is used: # runc checkpoint --help | grep status --status-fd value criu writes \0 to this FD once lazy-pages is ready The parameter '--status-fd' is directly from CRIU and this way the process outside of runC which controls the migration knows exactly when to transfer the checkpoint (without memory pages) to the destination and that the restore can be started. 
On the destination side it is necessary to start CRIU in 'lazy-pages' mode like this: # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \ -D checkpoint and tell runC to do a lazy restore: # runc restore -d --image-path checkpoint --work-path checkpoint \ --lazy-pages httpd If both processes on the restore side have the same working directory 'criu lazy-pages' creates a unix domain socket where it waits for requests from the actual restore. runC starts CRIU restore in lazy restore mode and talks to 'criu lazy-pages' that it wants to restore memory pages on demand. CRIU continues to restore the process and once the process is running and accesses the first non-existing memory page the 'criu lazy-pages' server will request the page from the source system. Thus all pages from the source system will be transferred to the destination system. Once all pages have been transferred runC on the source system will end and the container will have finished migration. This can also be combined with CRIU's pre-copy support. The combination of pre-copy and post-copy (lazy migration) provides the possibility to migrate containers with minimal downtimes. Some additional background about post-copy migration can be found in these articles: https://lisas.de/~adrian/?p=1253 https://lisas.de/~adrian/?p=1183 Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Adrian Reber	a3a632ad28	checkpoint: add support to query for lazy page support Before adding the actual lazy migration support, this adds the feature check for lazy-pages. Right now lazy migration, which is based on userfaultfd, is only available in the criu-dev branch and not yet in a release. As the check does not depend on a certain version but on a CRIU feature which can be queried it can be part of runC without a new version check depending on a feature from criu-dev. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Xiaochen Shen	4d2756c116	libcontainer: add test cases for Intel RDT/CAT Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:35:40 +08:00
Xiaochen Shen	692f6e1e27	libcontainer: add support for Intel RDT/CAT in runc About Intel RDT/CAT feature: Intel platforms with new Xeon CPU support Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which currently supports L3 cache resource allocation. This feature provides a way for the software to restrict cache allocation to a defined 'subset' of L3 cache which may be overlapping with other 'subsets'. The different subsets are identified by class of service (CLOS) and each CLOS has a capacity bitmask (CBM). More information about Intel RDT/CAT can be found in section 17.17 of the Intel Software Developer Manual. About Intel RDT/CAT kernel interface: In Linux 4.10 kernel or newer, the interface is defined and exposed via "resource control" filesystem, which is a "cgroup-like" interface. Comparing with cgroups, it has similar process management lifecycle and interfaces in a container. But unlike cgroups' hierarchy, it has single level filesystem layout. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ |-- info | |-- L3 | |-- cbm_mask | |-- min_cbm_bits | |-- num_closids |-- cpus |-- schemata |-- tasks |-- <container_id> |-- cpus |-- schemata |-- tasks For runc, we can make use of `tasks` and `schemata` configuration for L3 cache resource constraints. The file `tasks` has a list of tasks that belongs to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub group, it is in the root group. The file `schemata` has allocation bitmasks/values for L3 cache on each socket, which contains L3 cache id and capacity bitmask (CBM). 
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..." For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0` which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0. The valid L3 cache CBM is a contiguous bits set and number of bits that can be set is less than the max bit. The max bits in the CBM is varied among supported Intel Xeon platforms. In Intel RDT "resource control" filesystem layout, the CBM in a group should be a subset of the CBM in root. Kernel will check if it is valid when writing. e.g., 0xfffff in root indicates the max bits of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc. For more information about Intel RDT/CAT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the default CBM is 0xfffff and the max CBM length is 20 bits. With this configuration, tasks inside the container only have access to the "upper" 80% of L3 cache id 0 and the "lower" 50% L3 cache id 1: "linux": { "intelRdt": { "l3CacheSchema": "L3:0=ffff0;1=3ff" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Xiaochen Shen	af3b0d9dce	libcontainer/SPEC.md: add documentation for Intel RDT/CAT Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Aleksa Sarai	1f32fff46d	setns init: delay seccomp as late as possible This mirrors the standard_init_linux.go seccomp code, which only applies seccomp early if NoNewPrivileges is enabled. Otherwise it's done immediately before execve to reduce the amount of syscalls necessary for users to enable in their seccomp profiles. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-26 13:42:30 +10:00
Aleksa Sarai	3ddde27d7d	init: move close(stateDirFd) before seccomp apply This further reduces the number of syscalls that a user needs to enable in their seccomp profile. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-26 13:42:26 +10:00
Qiang Huang	1c81e2a794	Merge pull request #1572 from tych0/fix-readonly-userns fix --read-only containers under --userns-remap	2017-08-26 09:38:14 +08:00
Aleksa Sarai	4d6e6720a7	Merge branch 'pr-1573' Fix systemd cgroup after memory type changed LGTMs: @crosbymichael @cyphar Closes #1573	2017-08-25 23:55:27 +10:00
Qiang Huang	acaf6897f5	Fix systemd cgroup after memory type changed Fixes: #1557 I'm not quite sure about the root cause, looks like systemd still wants them to be uint64. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-25 01:14:16 -04:00
Aleksa Sarai	7d66aab77a	init: switch away from stateDirFd entirely While we have significant protections in place against CVE-2016-9962, we still were holding onto a file descriptor that referenced the host filesystem. This meant that in certain scenarios it was still possible for a semi-privileged container to gain access to the host filesystem (if they had CAP_SYS_PTRACE). Instead, open the FIFO itself using O_PATH. This allows us to reference the FIFO directly without providing the ability for directory-level access. When opening the FIFO inside the init process, open it through procfs to re-open the actual FIFO (this is currently the only supported way to open such a file descriptor). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-25 13:19:03 +10:00
Tycho Andersen	66eb2a3e8f	fix --read-only containers under --userns-remap The documentation here: https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations says that readonly containers can't be used with user namespaces due to some kernel restriction. In fact, there is a special case in the kernel to be able to do stuff like this, so let's use it. This takes us from: ubuntu@docker:~$ docker run -it --read-only ubuntu docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"". to: ubuntu@docker:~$ docker-runc --version runc version 1.0.0-rc4+dev commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty spec: 1.0.0 ubuntu@docker:~$ docker run -it --read-only ubuntu root@181e2acb909a:/# touch foo touch: cannot touch 'foo': Read-only file system Signed-off-by: Tycho Andersen <tycho@docker.com>	2017-08-24 16:43:21 -06:00
Nikolas Sepos	da4a5a9515	Add AutoDedup option to CriuOpts Memory image deduplication, very useful for incremental dumps. See: https://criu.org/Memory_images_deduplication Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>	2017-08-18 01:21:42 +02:00
Michael Crosby	ccd2c20aa4	Merge pull request #1559 from Mashimiao/panic-fix-nil-linux fix panic when Linux is nil for rootless case	2017-08-17 09:57:35 -04:00
Ma Shimiao	2333e7dc67	fix panic when Linux is nil for rootless case The config.Sysctl setting is duplicated. When the container is rootless and Linux is nil, runc will panic. Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-08-16 09:11:13 +08:00
Mrunal Patel	b31bdfc38a	Merge pull request #1558 from hqhq/update_state Update state after update	2017-08-15 10:46:44 -07:00
Qiang Huang	e6e1c34a7d	Update state after update state.json should be a reflection of the container's realtime state, including resource configurations, so we should update state.json after updating container resources. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-15 14:38:44 +08:00
Michael Crosby	3096b3fc85	Merge pull request #1556 from hqhq/fix_flakytest_TestNotifyOnOOM Fix flaky test TestNotifyOnOOM	2017-08-14 10:03:23 -04:00
Qiang Huang	7726bcf0e2	Some fixes for testMemoryNotification Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-14 15:28:03 +08:00
Qiang Huang	40a1fb0e2f	Fix flaky test TestNotifyOnOOM Fixes: #1228 It can be reproduced by applying this patch: ```diff @@ -45,6 +46,7 @@ func registerMemoryEvent(cgDir string, evName string, arg string) (<-chan struct go func() { defer func() { close(ch) + <-time.After(1 * time.Second) eventfd.Close() evFile.Close() }() ``` We can close the channel after the fds are closed. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-14 15:18:59 +08:00
Ma Shimiao	527dc5acbb	fix panic when Linux is nil Linux is not always non-nil. If Linux is nil, a panic will occur. Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-08-10 15:57:49 -04:00
Kenfe-Mickael Laventure	3ed492ad33	Handle non-devices correctly in DeviceFromPath Before this change, some file types would be treated as char devices (e.g. symlinks). Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>	2017-08-09 08:52:20 -07:00
Alex Fang	e92add2151	Pass back the pid of runc:[1:CHILD] so we can wait on it This allows the libcontainer to automatically clean up runc:[1:CHILD] processes created as part of nsenter. Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>	2017-08-05 13:44:36 +10:00
Aleksa Sarai	45bde006ca	merge branch 'pr-1535' LGTMs: @avagin @cyphar Closes #1535	2017-08-05 13:33:07 +10:00
Aleksa Sarai	22bbec1b7f	merge branch 'pr-1548' LGTMs: @crosbymichael @mrunalp @cyphar Closes #1548	2017-08-05 13:02:46 +10:00
Mrunal Patel	135b9992b3	Merge pull request #1544 from mlaventure/fix-device-from-path Fix condition to detect device type in DeviceFromPath	2017-08-04 17:36:57 -07:00
Kenfe-Mickael Laventure	6056912217	Revert "Merge pull request #1450 from vrothberg/sgid-non-numeric" This reverts commit `5c73abbe75`, reversing changes made to `51b501dab1`. Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>	2017-08-04 14:28:21 -07:00
Kenfe-Mickael Laventure	25f4c7e72b	Move user pkg unix specific calls to unix file Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>	2017-08-03 11:31:21 -07:00
Kenfe-Mickael Laventure	9ed15e94c8	Fix condition to detect device type in DeviceFromPath Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>	2017-08-03 11:06:54 -07:00
Adrian Reber	5d386f6e2b	checkpoint: use CRIU VERSION RPC if available With this runC also uses RPC to ask CRIU for its version. CRIU supports a VERSION RPC since CRIU 3.0 and using the RPC interface does not require parsing the console output of CRIU (which could change anytime). For older CRIU versions which do not yet have the VERSION RPC runC falls back to its old CRIU output parsing mode. Once CRIU 3.0 is the minimum version required for runC the old code can be removed. v2: * adapt to changes in the previous patches based on the review Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:08:07 +00:00
Adrian Reber	2393692536	criurpc.proto: copy latest criurpc.proto from criu 3.3 Update criurpc.proto for the upcoming VERSION RPC. This includes lazy_pages for the upcoming lazy migration support. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:32 +00:00
Adrian Reber	c71d9cd447	criuSwrk: prepare for CRIU VERSION RPC To use the CRIU VERSION RPC the criuSwrk function is adapted to work with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION RPC. Also do not print c.criuVersion if it is '0' as the first RPC call will always be the VERSION call and only after that the version will be known. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:28 +00:00
Adrian Reber	c5f0ce979b	checkCriuVersion: only ask criu once about its version If the version of criu has already been determined there is no need to ask criu for the version again. Use the value from c.criuVersion. v2: * reduce unnecessary code movement in the patch series * factor out the criu version parsing into a separate function Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:15 +00:00
Adrian Reber	b6c47281db	checkCriuVersion: switch to version using int The checkCriuVersion function used a string to specify the minimum version required. This is more comfortable for an external interface but for an internal function this added unnecessary complexity. This changes the version from a string like '1.5.2' to an integer like 10502. This is already the format used internally in the function. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:05:27 +00:00
Michael Crosby	882d8eaba6	Merge pull request #1537 from tklauser/staticcheck Fix issues found by staticcheck	2017-08-02 09:52:11 -04:00
Daniel, Dao Quang Minh	b313a75364	Merge pull request #1477 from yummypeng/save-own-ns-path Always save own namespace paths	2017-08-02 11:24:30 +01:00
Tobias Klauser	e4e56cb6d8	libcontainer: remove ineffective break statements Go's switch statement doesn't need an explicit break. Remove it where that is the case and add a comment to indicate the purpose where the removal would lead to an empty case. Found with honnef.co/go/tools/cmd/staticcheck Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-28 15:13:39 +02:00
Tobias Klauser	24a4273cf9	libcontainer: handle error cases Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights. Found with honnef.co/go/tools/cmd/staticcheck Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-28 15:13:11 +02:00
Daniel Dao	91eafcbc65	tty: move IO of master pty to be done with epoll This moves all console code to use github.com/containerd/console library to handle console I/O. Also move to use EpollConsole by default when user requests a terminal so we can still cope when the other side temporarily goes away. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-07-28 12:35:02 +01:00
Michael Crosby	e775f0fba3	Merge pull request #1526 from stevenh/logrus-v1 Updated logrus to v1	2017-07-27 13:28:55 -04:00
yangshukui	5428532bdd	remove the code that closes negative descriptor Signed-off-by: yangshukui <yangshukui@huawei.com>	2017-07-24 11:10:18 +08:00
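
The `l3CacheSchema` format described in commit 692f6e1e27 can be checked with a short Go sketch. This is an illustration only, not code from runc: the helper names `parseL3Schema`, `isContiguous` and `cacheShare` are hypothetical, and the 20-bit maximum CBM length is the value assumed in that commit's example. It parses a schema line, checks that each CBM is a contiguous run of bits, and computes the fraction of the cache each mask covers.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
	"strings"
)

// parseL3Schema parses a line such as "L3:0=ffff0;1=3ff" into a map from
// L3 cache id to capacity bitmask (CBM). Hypothetical helper, not runc's API.
func parseL3Schema(line string) (map[int]uint64, error) {
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("schema %q does not start with \"L3:\"", line)
	}
	masks := make(map[int]uint64)
	for _, entry := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		id, err := strconv.Atoi(parts[0])
		if err != nil {
			return nil, fmt.Errorf("bad cache id in %q: %v", entry, err)
		}
		cbm, err := strconv.ParseUint(parts[1], 16, 64) // CBMs are hexadecimal
		if err != nil {
			return nil, fmt.Errorf("bad CBM in %q: %v", entry, err)
		}
		masks[id] = cbm
	}
	return masks, nil
}

// isContiguous reports whether the set bits of cbm form one unbroken run.
func isContiguous(cbm uint64) bool {
	if cbm == 0 {
		return false
	}
	// Drop trailing zeros; a contiguous run then looks like 0b0..011..1,
	// and adding 1 to it clears every set bit.
	shifted := cbm >> uint(bits.TrailingZeros64(cbm))
	return shifted&(shifted+1) == 0
}

// cacheShare returns the percentage of a maxBits-wide cache covered by cbm.
func cacheShare(cbm uint64, maxBits int) float64 {
	return 100 * float64(bits.OnesCount64(cbm)) / float64(maxBits)
}

func main() {
	const maxBits = 20 // assumed, as in the commit's example (root CBM 0xfffff)
	masks, err := parseL3Schema("L3:0=ffff0;1=3ff")
	if err != nil {
		panic(err)
	}
	for _, id := range []int{0, 1} {
		cbm := masks[id]
		fmt.Printf("cache %d: cbm=0x%x contiguous=%t share=%.0f%%\n",
			id, cbm, isContiguous(cbm), cacheShare(cbm, maxBits))
	}
}
```

With the schema from the commit message, 0xffff0 has 16 of 20 bits set (80%) and 0x3ff has 10 of 20 (50%), matching the "upper 80%" and "lower 50%" described there. The kernel performs its own validation when the schemata file is written, so a check like this would only serve as an early sanity test.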

1494 Commits