jasder/runc - runc - Junke Open Source Project Hosting

Commit Graph

Author	SHA1	Message	Date
W. Trevor King	be16b13645	libcontainer/state_linux_test: Add a testTransitions helper The helper DRYs up the transition tests and makes it easy to get complete coverage for invalid transitions. I'm also using t.Run() for subtests. Run() is new in Go 1.7 [1], but runc dropped support for 1.6 back in `e773f96b` (update go version at travis-ci, 2017-02-20, #1335). [1]: https://blog.golang.org/subtests Signed-off-by: W. Trevor King <wking@tremily.us>	2018-01-25 11:18:45 -08:00
Michael Crosby	91ca331474	chroot when no mount namespaces is provided Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-01-25 11:36:37 -05:00
Michael Crosby	c4e4bb0df2	Merge pull request #1699 from AkihiroSuda/indent-c make: validate C format	2018-01-25 10:09:09 -05:00
Aleksa Sarai	5a46c2ba8b	nsenter: move namespace creation after userns creation Technically, this change should not be necessary, as the kernel documentation claims that if you call clone(flags|CLONE_NEWUSER), the new user namespace will be the owner of all other namespaces created in @flags. Unfortunately this isn't always the case, due to various additional semantics and kernel bugs. One particular instance is SELinux, which acts very strangely towards the IPC namespace and mqueue. If you unshare the IPC namespace before you map a user in the user namespace, the IPC namespace's internal kern-mount for mqueue will be labelled incorrectly and the container won't be able to access it. The only way of solving this is to unshare IPC after the user has been mapped and we have changed to that user. I've also heard of this happening to the NET namespace while talking to some LXC folks, though I haven't personally seen that issue. This change matches our handling of user namespaces to be the same as how LXC handles these problems. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-01-25 23:56:49 +11:00
Akihiro Suda	dd5eb3b9e3	make: validate C format Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-01-24 10:49:50 +09:00
Ed King	5c0af14bf8	Return from goroutine when it should terminate Signed-off-by: Craig Furman <cfurman@pivotal.io>	2018-01-23 10:46:31 +00:00
Will Martin	8d3e6c9826	Avoid race when opening exec fifo When starting a container with `runc start` or `runc run`, the stub process (runc[2:INIT]) opens a fifo for writing. Its parent runc process will open the same fifo for reading. In this way, they synchronize. If the stub process exits at the wrong time, the parent runc process will block forever. This can happen when racing 2 runc operations against each other: `runc run/start`, and `runc delete`. It could also happen for other reasons, e.g. the kernel's OOM killer may select the stub process. This commit resolves this race by racing the opening of the exec fifo from the runc parent process against the stub process exiting. If the stub process exits before we open the fifo, we return an error. Another solution is to wait on the stub process. However, it seems it would require more refactoring to avoid calling wait multiple times on the same process, which is an error. Signed-off-by: Craig Furman <cfurman@pivotal.io>	2018-01-22 17:03:02 +00:00
Antonio Murdaca	cd1e7abee2	libcontainer: expose annotations in hooks Annotations weren't passed to hooks. This patch fixes that by passing annotations to stdin for hooks. Signed-off-by: Antonio Murdaca <runcom@redhat.com>	2018-01-11 16:54:01 +01:00
vikaschoudhary16	d5b4a3eddb	Fix race against systemd - T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298) - T1: runc then moves ahead and starts creating cgroup paths (.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). The kernel creates the .scope directory and the cgroup.procs file (along with other default files) in the directory automatically, in an atomic manner. - T3: the systemd execution thread which was invoked at time `T0` is still in the process of unit creation. systemd is also trying to create cgroup paths and deletes the `.scope` directory which was created at time `T1` by runc, from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>	2018-01-08 09:37:26 -05:00
Mrunal Patel	e6516b3d5d	Merge pull request #1678 from sboeuf/sboeuf/subreaper libcontainer: Do not wait for signalled processes if subreaper is set	2017-12-15 08:47:07 -08:00
Michael Crosby	7f24b40cc5	Merge pull request #1675 from tklauser/apparmor-no-cgo RFC: libcontainer: remove dependency on libapparmor	2017-12-15 11:23:35 -05:00
Tobias Klauser	db093f621f	libcontainer: remove dependency on libapparmor libapparmor is integrated in libcontainer using cgo but is only used to call a single function: aa_change_onexec. It turns out this function is simple enough (writing a string to a file in /proc/<n>/attr/...) to be re-implemented locally in libcontainer in plain Go. This allows to drop the dependency on libapparmor and the corresponding cgo integration. Fixes #1674 Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-12-15 09:59:58 +01:00
Sebastien Boeuf	bb912eb00c	libcontainer: Do not wait for signalled processes if subreaper is set When a subreaper is enabled, it might expect to reap a process and retrieve its exit code. That's the reason why this patch is giving the possibility to define the usage of a subreaper as a consumer of libcontainer. Relying on this information, libcontainer will not wait for signalled processes in case a subreaper has been set. Fixes #1677 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2017-12-14 10:37:38 -08:00
Mrunal Patel	c6e4a1ebeb	Merge pull request #1665 from Mashimiao/gidmapping-valid-fix specconv: avoid skipping gidmappings applied when uidmappings is empty	2017-12-11 09:50:54 -08:00
Mrunal Patel	b028413c35	Merge pull request #1655 from Mashimiao/add-propagation-more support unbindable,runbindable for rootfs propagation	2017-12-11 09:21:41 -08:00
Allen Sun	fec6b0fea5	Update criu_opts_linux.go Signed-off-by: Allen Sun <shlallen1990@gmail.com>	2017-12-05 15:16:26 +08:00
Michael Crosby	91e9795013	Merge pull request #1654 from dqminh/only-linux remove placeholder for non-linux platforms	2017-11-30 09:51:47 -05:00
Ma Shimiao	57edfbbaf2	specconv: avoid skipping gidmappings applied when uidmappings is empty Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-11-30 16:24:36 +08:00
Aleksa Sarai	e8149af291	merge branch 'pr-1661' Ensure container tests do not write on the host LGTMs: @hqhq @cyphar Closes #1661	2017-11-27 20:10:48 +11:00
Danail Branekov	0495fece57	Ensure container tests do not write on the host TestGetContainerStateAfterUpdate creates its state.json file in the current directory, which turns out to be the host runc directory. Thus, whenever the test completes, it leaves the state.json file behind, a) polluting the local git repository and b) changing the host file system, violating the principle of doing everything in an isolated container environment. This change creates a new temporary (in-container) directory and uses it as linuxContainer.root Signed-off-by: Tom Godkin <tgodkin@pivotal.io>	2017-11-27 10:43:10 +02:00
Daniel Dao	8898b6b446	remove placeholder for non-linux platforms runc currently only supports the Linux platform, and since we don't intend to expose support for other platforms, remove all placeholder code for other platforms. `libcontainer/configs` is still being used in https://github.com/moby/moby/blob/master/daemon/daemon_windows.go so keep it for now. After this, we should probably also rename files to drop the linux suffixes if possible. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-11-24 18:14:51 +00:00
Daniel, Dao Quang Minh	fb871d9cd0	Merge pull request #1664 from tklauser/drop-freebsd libcontainer: drop FreeBSD support	2017-11-24 18:08:21 +00:00
Tobias Klauser	4d27f20db0	libcontainer: drop FreeBSD support runc is not supported on FreeBSD, so remove all FreeBSD specific bits. As suggested by @crosbymichael in #1653 Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-11-24 14:51:05 +01:00
Danail Branekov	38d1e6ec27	Delete xattr related code Selinux related code has been moved to the selinux package (https://github.com/opencontainers/selinux) and therefore xattr related code can be deleted from libcontainer Signed-off-by: Danail Branekov <danailster@gmail.com>	2017-11-21 12:49:28 +02:00
Ma Shimiao	17db6560be	support unbindable,runbindable for rootfs propagation Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-11-17 16:14:15 +08:00
Seth Jennings	bca53e7b49	systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling Signed-off-by: Seth Jennings <sjenning@redhat.com>	2017-11-15 20:20:06 -06:00
Vincent Demeester	3ca4c78b1a	Import docker/docker/pkg/mount into runc This will help get rid of docker/docker dependency in runc 👼 Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-11-08 16:25:58 +01:00
Michael Crosby	2f010ecf19	Merge pull request #1622 from vdemeester/import-symlink-from-docker Remove pkg/symlink from docker/docker and use cyphar/filepath-securejoin	2017-11-08 10:07:00 -05:00
Akihiro Suda	0aac2368e4	specconv.Example(): add /proc/scsi to masked paths Port over https://github.com/moby/moby/pull/35399 Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2017-11-04 17:38:14 +00:00
Michael Crosby	0232e38342	Merge pull request #1629 from masters-of-cats/busybox-inflation Avoid disk usage explosion when copying busybox	2017-11-01 09:15:22 -04:00
Danail Branekov	fdbb9e3e55	Avoid disk usage explosion when copying busybox When running runc tests with temp directory with size 500M copying busybox without preserving hardlinks causes the folder to inflate to roughly 330M. Copying busybox twice in certain tests causes the /tmp directory to overfill. Using `-a` preserves links which busybox uses to implement its choice of binary to run. Signed-off-by: Tom Godkin <tgodkin@pivotal.io>	2017-11-01 09:52:05 +00:00
Vincent Demeester	594501475e	Use cyphar/filepath-securejoin instead of docker pkg/symlink runc shouldn't depend on docker and be more self-contained. Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-10-31 16:53:45 +01:00
Lorenzo Fontana	780f8ef567	Specconv: Test create command hooks and seccomp setup Signed-off-by: Lorenzo Fontana <lo@linux.com>	2017-10-28 21:46:46 +02:00
Mrunal Patel	9a1186d128	Merge pull request #1619 from fntlnz/spec-linux-testing WIP: Better testsuite for specconv	2017-10-25 15:23:19 -07:00
Lorenzo Fontana	c0e6e12f9d	Test Cgroup creation and memory allocations Signed-off-by: Lorenzo Fontana <lo@linux.com>	2017-10-25 01:58:10 +02:00
Aleksa Sarai	ff5075c33f	init: correctly handle unmapped stdio with multiple mappings Previously we would handle the "unmapped stdio" case by just doing a simple check, however this didn't handle cases where the overflow_uid was actually mapped in the user namespace. Instead of doing some userspace checks, just try to do the fchown(2) and ignore EINVAL (unmapped) or EPERM (lacking privilege over inode) errors. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-10-25 00:12:21 +11:00
Qiang Huang	74a1729647	Merge pull request #1607 from crosbymichael/term-err libcontainer: handler errors from terminate	2017-10-20 15:15:38 +08:00
Qiang Huang	e8b9b92f57	Merge pull request #1206 from YuPengZTE/devMD026 trailing punctuation in header	2017-10-20 14:47:09 +08:00
Mrunal Patel	80ee9e50b5	Merge pull request #1616 from mheon/seccomp_fix_breakage Fix breaking change in Seccomp profile behavior	2017-10-19 14:15:04 -07:00
Aleksa Sarai	c05f6368af	merge branch 'pr-1615' libcontainer: intelrdt: fix a GetStats() issue LGTMs: @crosbymichael @cyphar Closes #1615	2017-10-19 03:41:16 +11:00
Matthew Heon	e9193ba6e6	Fix breaking change in Seccomp profile behavior Multiple conditions were previously allowed to be placed upon the same syscall argument. Restore this behavior. Signed-off-by: Matthew Heon <mheon@redhat.com>	2017-10-18 11:53:56 -04:00
Qiang Huang	3409d5c555	Merge pull request #1606 from cyphar/rootfs-propagation-no-pivot specconv: emit an error when using MS_PRIVATE with --no-pivot	2017-10-18 09:52:04 +08:00
Xiaochen Shen	d89217515b	libcontainer: intelrdt: fix a GetStats() issue This fixes a GetStats() issue introduced in #1590: If Intel RDT is enabled by hardware and kernel, but intelRdt is not specified in original config, GetStats() will return error unexpectedly because we haven't called Apply() to create intelrdt group or attach tasks for this container. As a result, runc events command will have no output. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-10-17 17:37:07 +08:00
Tobias Klauser	0eed453b21	libcontainer: use Major/Minor from x/sys/unix The Major and Minor functions were added for Linux in golang/sys@85d1495 which is already vendored in. Use these functions instead of the local re-implementation. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-10-17 09:06:42 +02:00
Aleksa Sarai	9b13f5cc7f	merge branch 'pr-1453' propagate argv0 when re-execing from /proc/self/exe LGTMs: @crosbymichael @cyphar Closes #1453	2017-10-17 03:12:22 +11:00
Michael Crosby	ff4481dbf6	Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups Support cgroups with limits as rootless	2017-10-16 12:03:49 -04:00
Petros Angelatos	8098828680	propagate argv0 when re-execing from /proc/self/exe This allows runc to be used as a target for docker's reexec module that depends on a correct argv0 to select which process entrypoint to invoke. Without this patch, when runc re-execs argv0 is set to "/proc/self/exe" and the reexec module doesn't know what to do with it. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>	2017-10-16 14:00:26 +02:00
Tobias Klauser	d2bc081420	libcontainer: merge common syscall implementations There are essentially two possible implementations for Setuid/Setgid on Linux, either using SYS_SETUID32/SYS_SETGID32 or SYS_SETUID/SYS_SETGID, depending on the architecture (see golang/go#1435 for why Setuid/Setgid aren't currently implemented for Linux in either syscall or golang.org/x/sys/unix). Reduce duplication by merging the currently implemented variants and adjusting the build tags accordingly. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-10-16 11:11:18 +02:00
Aleksa Sarai	6d30f7a01b	merge branch 'pr-1424' Update Travis config to use trusty-backports libseccomp Add integration tests for multi-argument Seccomp filters Vendor updated libseccomp-golang for bugfix LGTMs: @crosbymichael @cyphar Closes #1424	2017-10-16 03:01:37 +11:00
Aleksa Sarai	d2ac52fe52	merge branch 'pr-1475' Add support for mips/mips64 Put signalMap in a separate file, so it may be arch-specific LGTMs: @crosbymichael @cyphar Closes #1475	2017-10-16 02:59:34 +11:00
Aleksa Sarai	2430a98e64	merge branch 'pr-1500' rootfs: switch ms_private remount of oldroot to ms_slave LGTMs: @crosbymichael @hqhq Closes opencontainers/runc#1500	2017-10-14 09:32:59 +11:00
Sebastien Boeuf	acb93c9c62	libcontainer: cgroups: Write freezer state after every state check This commit ensures we write the expected freezer cgroup state after every state check, in case the state check does not give the expected result. This can happen when a new task is created and prevents the whole cgroup to be FROZEN, leaving the state into FREEZING instead. This patch prevents the case of an infinite loop to happen. Fixes https://github.com/opencontainers/runc/issues/1609 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2017-10-12 07:07:28 -07:00
Matthew Heon	bbc847a457	Add integration tests for multi-argument Seccomp filters Signed-off-by: Matthew Heon <mheon@redhat.com>	2017-10-10 15:49:08 -04:00
Michael Crosby	bfe3058fc9	Make process check more forgiving Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-10-10 15:36:19 -04:00
Steven Hartland	eb68b900bc	Prevent invalid errors from terminate Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate. Instead of letting these get returned and logged, which causes confusion, silently ignore them. Currently the test needs to be a string test as the errors are private to the runtime packages, so it's our only option. This can be seen if init fails during the setns. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-10-10 15:32:46 -04:00
Michael Crosby	4693fae411	Merge pull request #1590 from xiaochenshen/rdt-cat-support-update-command libcontainer: intelrdt: add update command support	2017-10-10 15:25:22 -04:00
Aleksa Sarai	d4f0f9a52b	specconv: emit an error when using MS_PRIVATE with --no-pivot Due to the semantics of chroot(2) when it comes to mount namespaces, it is not generally safe to use MS_PRIVATE as a mount propagation when using chroot(2). The reason for this is that this effectively results in a set of mount references being held by the chroot'd namespace which the namespace cannot free. pivot_root(2) does not have this issue because the @old_root can be unmounted by the process. Ultimately, --no-pivot is not really necessary anymore as a commonly used option since `f8e6b5af5e` ("rootfs: make pivot_root not use a temporary directory") resolved the read-only issue. But if someone really needs to use it, MS_PRIVATE is never a good idea. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-10-08 17:50:55 +11:00
Michael Crosby	f53ad9cec9	Merge pull request #1604 from AkihiroSuda/cwd libcontainer: create Cwd when it does not exist	2017-10-05 11:15:10 -04:00
Will Martin	ca4f427af1	Support cgroups with limits as rootless Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io> Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>	2017-10-05 11:22:54 +01:00
Akihiro Suda	2edd36fdff	libcontainer: create Cwd when it does not exist The benefit for doing this within runc is that it works well with userns. Actually, runc already does the same thing for mount points. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2017-10-05 05:31:46 +00:00
Konstantinos Karampogias	605dc5c811	Set initial console size based on process spec Signed-off-by: Will Martin <wmartin@pivotal.io> Signed-off-by: Petar Petrov <pppepito86@gmail.com> Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com> Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>	2017-10-04 12:32:16 +01:00
Daniel, Dao Quang Minh	0351df1c5a	Merge pull request #1600 from crosbymichael/console Bump console and sys deps	2017-09-26 10:15:10 +01:00
Michael Crosby	f364c1a58c	Set ClearONLCR in tests Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-09-25 13:35:22 -04:00
Tobias Klauser	d713652bda	libcontainer: remove unnecessary type conversions Generated using github.com/mdempsky/unconvert Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-09-25 10:41:57 +02:00
Qiang Huang	79ad714374	Merge pull request #1598 from euank/ragent libcontainer: default mount propagation correctly	2017-09-25 11:55:29 +08:00
Euan Kemp	4301b440d6	libcontainer: default mount propagation correctly The code in prepareRoot (`e385f67a0e/libcontainer/rootfs_linux.go (L599-L605)`) attempts to default the rootfs mount to `rslave`. However, since the spec conversion has already defaulted it to `rprivate`, that code doesn't actually ever do anything. This changes the spec conversion code to accept "" and treat it as 0. Implicitly, this makes rootfs propagation default to `rslave`, which is a part of fixing the moby bug https://github.com/moby/moby/issues/34672 Alternate implementations include changing this defaulting to be `rslave` and removing the defaulting code in prepareRoot, or skipping the mapping entirely for "", but I think this change is the cleanest of those options. Signed-off-by: Euan Kemp <euan.kemp@coreos.com>	2017-09-22 13:36:23 -07:00
Xiaochen Shen	2549545df5	intelrdt: always init IntelRdtManager if Intel RDT is enabled In current implementation: Either Intel RDT is not enabled by hardware and kernel, or intelRdt is not specified in original config, we don't init IntelRdtManager in the container to handle intelrdt constraint. It is a tradeoff that Intel RDT has hardware limitation to support only limited number of groups. This patch makes a minor change to support update command: Whether or not intelRdt is specified in config, we always init IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is not specified in original config, we just don't Apply() to create intelrdt group or attach tasks for this container. In update command, we could re-enable through IntelRdtManager.Apply() and then update intelrdt constraint. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-20 01:37:31 +08:00
Michael Crosby	593914b8bd	Merge pull request #1593 from s7v7nislands/drop_go1.5 Drop support golang 1.5	2017-09-12 15:22:00 -04:00
s7v7nislands	00ad8e1e56	Drop support golang 1.5 Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>	2017-09-12 20:56:51 +08:00
Qiang Huang	68e00e906b	Merge pull request #1586 from crosbymichael/set-cgroups Apply cgroups earlier	2017-09-12 12:13:29 +08:00
Yong Tang	e9944d0f4c	Disable systemd in static build This fix tries to address the warnings caused by static build with go 1.9. As systemd needs dlopen/dlclose, the following warnings will be generated for static build in go 1.9: ``` root@f4b077232050:/go/src/github.com/opencontainers/runc# make static CGO_ENABLED=1 go build -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc . /tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen': /tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking ``` This fix disables systemd when `static_build` flag is on (apply_nosystemd.go is used instead). This fix also fixes a small bug in `apply_nosystemd.go` for return value. Signed-off-by: Yong Tang <yong.tang.github@outlook.com>	2017-09-11 18:38:22 +00:00
Mrunal Patel	d5b43c3981	Merge pull request #1455 from dqminh/epoll-io tty: move IO of master pty to be done with epoll	2017-09-11 11:32:42 -07:00
Aleksa Sarai	1a5fdc1c5f	init: support setting -u with rootless containers Now that rootless containers have support for multiple uid and gid mappings, allow --user to work as expected. If the user is not mapped, an error occurs (as usual). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:33 +10:00
Aleksa Sarai	969bb49cc3	nsenter: do not resolve path in nsexec context With the addition of our new{uid,gid}map support, we used to call execvp(3) from inside nsexec. This would mean that the path resolution for the binaries would happen in nsexec. Move the resolution to the initial setup code, and pass the absolute path to nsexec. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:33 +10:00
Aleksa Sarai	6097ce74d8	nsenter: correctly handle newgidmap path for rootless containers After quite a bit of debugging, I found that previous versions of this patchset did not include newgidmap in a rootless setting. Fix this by passing it whenever group mappings are applied, and also providing some better checking for try_mapping_tool. This commit also includes some stylistic improvements. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	3282f5a7c1	tests: fix for rootless multiple uids/gids Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	d8b669400a	rootless: allow multiple user/group mappings Take advantage of the newuidmap/newgidmap tools to allow multiple users/groups to be mapped into the new user namespace in the rootless case. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [ rebased to handle intelrdt changes. ] Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Mrunal Patel	13fa5d2953	Merge pull request #1588 from s7v7nislands/delete_unused Delete unused function	2017-09-08 17:34:00 -07:00
Michael Crosby	b82d07e816	Merge pull request #1587 from Mashimiao/fix-namespace-empty Fixes #1585 config.Namespaces is empty when accessed	2017-09-08 10:50:16 -04:00
Xiaochen Shen	88d22fde40	libcontainer: intelrdt: use init() to avoid race condition This is the follow-up PR of #1279 to fix remaining issues: Use init() to avoid race condition in IsIntelRdtEnabled(). Also rename some variables and functions. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-08 17:15:31 +08:00
s7v7nislands	c795b8690b	Delete unused function Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>	2017-09-08 10:35:46 +08:00
Ma Shimiao	c3d20e7817	Fixes #1585 config.Namespaces is empty when accessed Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-09-08 09:30:07 +08:00
Mrunal Patel	deb9d7fd96	Merge pull request #1569 from cyphar/delay-seccomp init: delay seccomp application as late as possible	2017-09-07 13:27:37 -07:00
Mrunal Patel	7e036aa0b0	Merge pull request #1541 from adrianreber/lazy checkpoint: support lazy migration	2017-09-07 13:25:04 -07:00
Michael Crosby	7062c7556b	Apply cgroups earlier This applies cgroups earlier for container creation before the init process starts running and forking off any additional processes. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-09-07 11:27:33 -04:00
Adrian Reber	60ae7091de	checkpoint: support lazy migration With the help of userfaultfd CRIU supports lazy migration. Lazy migration means that memory pages are only transferred from the migration source to the migration destination on page fault. This enables to reduce the downtime during process or container migration to a minimum as the memory does not need to be transferred during migration. Lazy migration currently depends on userfaultfd being available on the current Linux kernel and if the used CRIU version supports lazy migration. Both dependencies can be checked by querying CRIU via RPC if the lazy migration feature is available. Using feature checking instead of version comparison enables runC to use CRIU features from the criu-dev branch. This way the user can decide if lazy migration should be available by choosing the right kernel and CRIU branch. To use lazy migration the CRIU process during dump needs to dump everything besides the memory pages and then it opens a network port waiting for remote page fault requests: # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \ --status-fd /tmp/postcopy-pipe In this example CRIU will hang/wait once it has opened the network port and wait for network connection. As runC waits for CRIU to finish it will also hang until the lazy migration has finished. To know when the restore on the destination side can start the '--status-fd' parameter is used: # runc checkpoint --help | grep status --status-fd value criu writes \0 to this FD once lazy-pages is ready The parameter '--status-fd' is directly from CRIU and this way the process outside of runC which controls the migration knows exactly when to transfer the checkpoint (without memory pages) to the destination and that the restore can be started. 
On the destination side it is necessary to start CRIU in 'lazy-pages' mode like this: # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \ -D checkpoint and tell runC to do a lazy restore: # runc restore -d --image-path checkpoint --work-path checkpoint \ --lazy-pages httpd If both processes on the restore side have the same working directory 'criu lazy-pages' creates a unix domain socket where it waits for requests from the actual restore. runC starts CRIU restore in lazy restore mode and talks to 'criu lazy-pages' that it wants to restore memory pages on demand. CRIU continues to restore the process and once the process is running and accesses the first non-existing memory page the 'criu lazy-pages' server will request the page from the source system. Thus all pages from the source system will be transferred to the destination system. Once all pages have been transferred runC on the source system will end and the container will have finished migration. This can also be combined with CRIU's pre-copy support. The combination of pre-copy and post-copy (lazy migration) provides the possibility to migrate containers with minimal downtimes. Some additional background about post-copy migration can be found in these articles: https://lisas.de/~adrian/?p=1253 https://lisas.de/~adrian/?p=1183 Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Adrian Reber	a3a632ad28	checkpoint: add support to query for lazy page support Before adding the actual lazy migration support, this adds the feature check for lazy-pages. Right now lazy migration, which is based on userfaultfd, is only available in the criu-dev branch and not yet in a release. As the check does not depend on a certain version but on a CRIU feature which can be queried, it can be part of runC without a new version check depending on a feature from criu-dev. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Xiaochen Shen	4d2756c116	libcontainer: add test cases for Intel RDT/CAT Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:35:40 +08:00
Xiaochen Shen	692f6e1e27	libcontainer: add support for Intel RDT/CAT in runc About Intel RDT/CAT feature: Intel platforms with new Xeon CPU support Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which currently supports L3 cache resource allocation. This feature provides a way for the software to restrict cache allocation to a defined 'subset' of L3 cache which may be overlapping with other 'subsets'. The different subsets are identified by class of service (CLOS) and each CLOS has a capacity bitmask (CBM). More information about Intel RDT/CAT can be found in section 17.17 of the Intel Software Developer Manual. About Intel RDT/CAT kernel interface: In Linux 4.10 kernel or newer, the interface is defined and exposed via the "resource control" filesystem, which is a "cgroup-like" interface. Compared with cgroups, it has a similar process management lifecycle and interfaces in a container. But unlike cgroups' hierarchy, it has a single-level filesystem layout. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ |-- info | |-- L3 | |-- cbm_mask | |-- min_cbm_bits | |-- num_closids |-- cpus |-- schemata |-- tasks |-- <container_id> |-- cpus |-- schemata |-- tasks For runc, we can make use of `tasks` and `schemata` configuration for L3 cache resource constraints. The file `tasks` has a list of tasks that belong to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub group, it is in the root group. The file `schemata` has allocation bitmasks/values for L3 cache on each socket, which contains L3 cache id and capacity bitmask (CBM). 
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..." For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0` which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0. The valid L3 cache CBM is a contiguous bits set and number of bits that can be set is less than the max bit. The max bits in the CBM is varied among supported Intel Xeon platforms. In Intel RDT "resource control" filesystem layout, the CBM in a group should be a subset of the CBM in root. Kernel will check if it is valid when writing. e.g., 0xfffff in root indicates the max bits of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc. For more information about Intel RDT/CAT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the default CBM is 0xfffff and the max CBM length is 20 bits. With this configuration, tasks inside the container only have access to the "upper" 80% of L3 cache id 0 and the "lower" 50% L3 cache id 1: "linux": { "intelRdt": { "l3CacheSchema": "L3:0=ffff0;1=3ff" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Xiaochen Shen	af3b0d9dce	libcontainer/SPEC.md: add documentation for Intel RDT/CAT Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Aleksa Sarai	1f32fff46d	setns init: delay seccomp as late as possible This mirrors the standard_init_linux.go seccomp code, which only applies seccomp early if NoNewPrivileges is enabled. Otherwise it's done immediately before execve to reduce the amount of syscalls necessary for users to enable in their seccomp profiles. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-26 13:42:30 +10:00
Aleksa Sarai	3ddde27d7d	init: move close(stateDirFd) before seccomp apply This further reduces the number of syscalls that a user needs to enable in their seccomp profile. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-26 13:42:26 +10:00
Qiang Huang	1c81e2a794	Merge pull request #1572 from tych0/fix-readonly-userns fix --read-only containers under --userns-remap	2017-08-26 09:38:14 +08:00
Aleksa Sarai	4d6e6720a7	Merge branch 'pr-1573' Fix systemd cgroup after memory type changed LGTMs: @crosbymichael @cyphar Closes #1573	2017-08-25 23:55:27 +10:00
Qiang Huang	acaf6897f5	Fix systemd cgroup after memory type changed Fixes: #1557 I'm not quite sure about the root cause, looks like systemd still wants them to be uint64. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-25 01:14:16 -04:00
Aleksa Sarai	7d66aab77a	init: switch away from stateDirFd entirely While we have significant protections in place against CVE-2016-9962, we still were holding onto a file descriptor that referenced the host filesystem. This meant that in certain scenarios it was still possible for a semi-privileged container to gain access to the host filesystem (if they had CAP_SYS_PTRACE). Instead, open the FIFO itself using a O_PATH. This allows us to reference the FIFO directly without providing the ability for directory-level access. When opening the FIFO inside the init process, open it through procfs to re-open the actual FIFO (this is currently the only supported way to open such a file descriptor). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-25 13:19:03 +10:00
Tycho Andersen	66eb2a3e8f	fix --read-only containers under --userns-remap The documentation here: https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations says that readonly containers can't be used with user namespaces due to some kernel restriction. In fact, there is a special case in the kernel to be able to do stuff like this, so let's use it. This takes us from: ubuntu@docker:~$ docker run -it --read-only ubuntu docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"". to: ubuntu@docker:~$ docker-runc --version runc version 1.0.0-rc4+dev commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty spec: 1.0.0 ubuntu@docker:~$ docker run -it --read-only ubuntu root@181e2acb909a:/# touch foo touch: cannot touch 'foo': Read-only file system Signed-off-by: Tycho Andersen <tycho@docker.com>	2017-08-24 16:43:21 -06:00
Nikolas Sepos	da4a5a9515	Add AutoDedup option to CriuOpts Memory image deduplication, very useful for incremental dumps. See: https://criu.org/Memory_images_deduplication Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>	2017-08-18 01:21:42 +02:00
Michael Crosby	ccd2c20aa4	Merge pull request #1559 from Mashimiao/panic-fix-nil-linux fix panic when Linux is nil for rootless case	2017-08-17 09:57:35 -04:00
Ma Shimiao	2333e7dc67	fix panic when Linux is nil for rootless case config.Sysctl setting is duplicated. When the container is rootless and Linux is nil, runc will panic. Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>	2017-08-16 09:11:13 +08:00


1120 Commits
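
The schemata format and CBM rules described in commit 692f6e1e27 (a contiguous run of set bits that is a subset of the root CBM) can be sketched in Go as follows. This is a minimal illustration, not the code runc ships: `parseL3Schema`, `validCBM`, and `sharePercent` are hypothetical helpers, the root mask 0xfffff is taken from the example in the commit message, and the `min_cbm_bits` limit under `info/L3` is deliberately ignored.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
	"strings"
)

// parseL3Schema parses a line such as "L3:0=ffff0;1=3ff" into a map from
// L3 cache id to capacity bitmask (CBM). Hypothetical helper, not runc API.
func parseL3Schema(line string) (map[int]uint64, error) {
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("schema %q does not start with \"L3:\"", line)
	}
	masks := make(map[int]uint64)
	for _, entry := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		parts := strings.SplitN(entry, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		id, err := strconv.Atoi(strings.TrimSpace(parts[0]))
		if err != nil {
			return nil, fmt.Errorf("bad cache id in %q: %v", entry, err)
		}
		// CBMs are written as hexadecimal without a 0x prefix.
		cbm, err := strconv.ParseUint(strings.TrimSpace(parts[1]), 16, 64)
		if err != nil {
			return nil, fmt.Errorf("bad CBM in %q: %v", entry, err)
		}
		masks[id] = cbm
	}
	return masks, nil
}

// validCBM reports whether cbm is non-zero, consists of one contiguous run
// of set bits, and is a subset of rootCBM. The kernel performs its own
// check on write; this only mirrors the rules stated in the commit message.
func validCBM(cbm, rootCBM uint64) bool {
	if cbm == 0 || cbm&^rootCBM != 0 {
		return false
	}
	// After shifting out trailing zeros, a contiguous run looks like 2^n - 1.
	run := cbm >> uint(bits.TrailingZeros64(cbm))
	return run&(run+1) == 0
}

// sharePercent returns the share of the root mask covered by cbm, in percent.
func sharePercent(cbm, rootCBM uint64) int {
	return bits.OnesCount64(cbm) * 100 / bits.OnesCount64(rootCBM)
}

func main() {
	const rootCBM = 0xfffff // 20-bit root mask assumed from the commit's example

	masks, err := parseL3Schema("L3:0=ffff0;1=3ff")
	if err != nil {
		panic(err)
	}
	for _, id := range []int{0, 1} {
		cbm := masks[id]
		fmt.Printf("cache %d: cbm=0x%x valid=%v share=%d%%\n",
			id, cbm, validCBM(cbm, rootCBM), sharePercent(cbm, rootCBM))
	}
}
```

With a 20-bit root mask, 0xffff0 has 16 bits set and 0x3ff has 10, which agrees with the 80% and 50% figures quoted in the commit message.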