jasder/runc - runc - Junke Open Source Project Hosting

Commit Graph

Author	SHA1	Message	Date
Aleksa Sarai	201b063745	merge branch 'pr-2141' Radostin Stoyanov (1): criu: Ensure other users cannot read c/r files LGTMs: @crosbymichael @cyphar Closes #2141	2019-12-07 09:32:58 +11:00
Akihiro Suda	ec49f98d72	fs2: support legacy device spec (to pass CI) Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-12-06 15:53:07 +09:00
Akihiro Suda	88e8350de2	cgroup2: split fs2 from fs split fs2 package from fs, as mixing up fs and fs2 is very likely to result in unmaintainable code. Inspired by containerd/cgroups#109 Fix #2157 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-12-06 15:42:10 +09:00
Aleksa Sarai	5e63695384	merge branch 'pr-2174' Sascha Grunert (1): Expose network interfaces via runc events LGTMs: @cyphar @mrunalp Closes #2174	2019-12-06 13:07:44 +11:00
Michael Crosby	8bb10af481	Merge pull request #2165 from AkihiroSuda/travis-f31 .travis.yml: add Fedora 31 vagrant box (for cgroup2)	2019-12-05 16:26:51 -05:00
Sascha Grunert	41a20b5852	Expose network interfaces via runc events The libcontainer network statistics are unreachable without manually creating a libcontainer instance. To retrieve them via the CLI interface of runc, we now expose them as well. Signed-off-by: Sascha Grunert <sgrunert@suse.com>	2019-12-05 13:20:51 +01:00
Akihiro Suda	faf1e44ea9	cgroup2: ebpf: increase RLIMIT_MEMLOCK to avoid BPF_PROG_LOAD error Fix #2167 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-11-07 15:43:27 +09:00
Mrunal Patel	46def4cc4c	Merge pull request #2154 from jpeach/2008-remove-static-build-tag Remove the static_build build tag.	2019-11-04 17:10:59 -08:00
Akihiro Suda	ccd4436fc4	.travis.yml: add Fedora 31 vagrant box (for cgroup2) As the baby step, only unit tests are executed. Failing tests are currently skipped and will be fixed in follow-up PRs. Fix #2124 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-31 16:53:01 +09:00
Akihiro Suda	faf673ee45	cgroup2: port over eBPF device controller from crun The implementation is based on https://github.com/containers/crun/blob/0.10.2/src/libcrun/ebpf.c Although ebpf.c is originally licensed under LGPL-3.0-or-later, the author Giuseppe Scrivano agreed to relicense the file in Apache License 2.0: https://github.com/opencontainers/runc/issues/2144#issuecomment-543116397 See libcontainer/cgroups/ebpf/devicefilter/devicefilter_test.go for tested configurations. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-31 14:01:46 +09:00
Qiang Huang	e57a774066	Merge pull request #2149 from AkihiroSuda/cgroup2-ps cgroup2: implement `runc ps`	2019-10-31 09:44:39 +08:00
Qiang Huang	d239ca8425	Merge pull request #2148 from AkihiroSuda/cg2-ignore-cpuset-when-no-config cgroup2: cpuset_v2: skip Apply when no limit is specified	2019-10-29 21:57:58 +08:00
Mrunal Patel	03cf145f5a	Merge pull request #2159 from AkihiroSuda/cgroup2-mount-in-userns cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS	2019-10-28 19:19:09 -07:00
Akihiro Suda	74a3fe5d1b	cgroup2: do not parse /proc/cgroups /proc/cgroups is meaningless for v2 and should be ignored. https://github.com/torvalds/linux/blob/v5.3/Documentation/admin-guide/cgroup-v2.rst#deprecated-v1-core-features * Now GetAllSubsystems() parses /sys/fs/cgroup/cgroup.controllers, not /proc/cgroups. The function result also contains "pseudo" controllers: {"devices", "freezer"}. As it is hard to detect availability of pseudo controllers, pseudo controllers are always assumed to be available. * Now IOGroupV2.Name() returns "io", not "blkio" Fix #2155 #2156 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-28 00:00:33 +09:00
Akihiro Suda	9c81440fb5	cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared, because we cannot mount cgroup2. This behavior corresponds to crun v0.10.2. Fix #2158 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-27 23:09:41 +09:00
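The subsystem listing described in commit 74a3fe5d1b can be sketched roughly as follows. This is a minimal illustration, not runc's code: the helper name `parseControllers` is hypothetical, and it assumes the controllers file holds a single space-separated line, as described in the kernel's cgroup v2 documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// pseudoControllers are always reported because, per the commit message,
// their availability is hard to detect on cgroup v2.
var pseudoControllers = []string{"devices", "freezer"}

// parseControllers turns the contents of a cgroup.controllers file
// (a space-separated line such as "cpuset cpu io memory pids") into a
// list of subsystem names, followed by the pseudo controllers.
// The function name is hypothetical, not runc's actual API.
func parseControllers(contents string) []string {
	subsystems := strings.Fields(contents)
	return append(subsystems, pseudoControllers...)
}

func main() {
	// In practice the contents would be read from
	// /sys/fs/cgroup/cgroup.controllers; a literal stands in here.
	fmt.Println(parseControllers("cpuset cpu io memory pids\n"))
}
```

Appending the pseudo controllers unconditionally mirrors the stated design choice of assuming they are always available.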
James Peach	13919f5dfd	Remove the static_build build tag. The `static_build` build tag was introduced in `e9944d0f` to remove build warnings related to systemd cgroup driver dependencies. Since then, those dependencies have changed and building the systemd cgroup driver no longer imports dlopen. After this change, runc builds will always include the systemd cgroup driver. This fixes #2008. Signed-off-by: James Peach <jpeach@apache.org>	2019-10-26 08:28:45 +11:00
Michael Crosby	c4d8e1688c	Merge pull request #2140 from crosbymichael/fs-unified Set unified mountpoint in find mnt func	2019-10-24 15:20:47 -04:00
Akihiro Suda	dbd771e475	cgroup2: implement `runc ps` Implemented `runc ps` for cgroup v2, using a newly added method `m.GetUnifiedPath()`. Unlike the v1 implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available. Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-19 01:59:24 +09:00
Akihiro Suda	d918e7f408	cpuset_v2: skip Apply when no limit is specified Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-19 00:33:31 +09:00
Akihiro Suda	033936ef76	io_v2.go: remove blkio v1 code Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-10-18 21:33:48 +09:00
Radostin Stoyanov	a610a84821	criu: Ensure other users cannot read c/r files No checkpoint files should be readable by anyone else but the user creating it. Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>	2019-10-17 07:49:38 +01:00
Michael Crosby	b28f58f31b	Set unified mountpoint in find mnt func This is needed for the fsv2 cgroups to work when there is a unified mountpoint. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2019-10-15 15:40:03 -04:00
Radostin Stoyanov	f017e0f9e1	checkpoint: Set descriptors.json file mode to 0600 Prevent unprivileged users from being able to read descriptors.json Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>	2019-10-12 19:29:44 +01:00
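The permission change in commits f017e0f9e1 and a610a84821 amounts to creating checkpoint files with mode 0600. A minimal sketch of that idea follows; the helpers `writePrivate` and `demo` are hypothetical and only illustrate the file mode, not runc's checkpoint code.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePrivate writes data to path, creating the file with mode 0600 so
// that only the owning user can read or write it. Note that os.WriteFile
// does not change the mode of a file that already exists.
func writePrivate(path string, data []byte) error {
	return os.WriteFile(path, data, 0600)
}

// demo writes a stand-in descriptors.json into a fresh temporary
// directory and returns the permission bits of the resulting file.
func demo() (os.FileMode, error) {
	dir, err := os.MkdirTemp("", "checkpoint-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "descriptors.json")
	if err := writePrivate(path, []byte("[]")); err != nil {
		return 0, err
	}
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return fi.Mode().Perm(), nil
}

func main() {
	mode, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("descriptors.json permissions: %04o\n", uint32(mode))
}
```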
Aleksa Sarai	1b8a1eeec3	merge branch 'pr-2132' Support different field counts of cpuacct.stat LGTMs: @crosbymichael @cyphar Closes #2132	2019-10-02 01:50:47 +10:00
Aleksa Sarai	d463f6485b	*: verify that operations on /proc/... are on procfs This is an additional mitigation for CVE-2019-16884. The primary problem is that Docker can be coerced into bind-mounting a file system on top of /proc (resulting in label-related writes to /proc no longer happening). While we are working on mitigations against permitting the mounts, this helps avoid our code from being tricked into writing to non-procfs files. This is not a perfect solution (after all, there might be a bind-mount of a different procfs file over the target) but in order to exploit that you would need to be able to tweak a config.json pretty specifically (which thankfully Docker doesn't allow). Specifically this stops AppArmor from not labeling a process silently due to /proc/self/attr/... being incorrectly set, and stops any accidental fd leaks because /proc/self/fd/... is not real. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-09-30 09:06:48 +10:00
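One way to check that a /proc/... path really is on procfs, as commit d463f6485b describes, is to compare the filesystem type reported by statfs(2) against PROC_SUPER_MAGIC. The sketch below is Linux-only and assumes that approach; the helper name `isProcfs` is hypothetical and is not taken from runc.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// procSuperMagic is PROC_SUPER_MAGIC from the Linux <linux/magic.h> header.
const procSuperMagic = 0x9fa0

// isProcfs reports whether path lives on a procfs mount, using statfs(2).
// Linux-only sketch; the helper name is hypothetical.
func isProcfs(path string) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return false, err
	}
	// The integer width of Type differs per architecture, so normalise it.
	return int64(st.Type) == procSuperMagic, nil
}

func main() {
	for _, p := range []string{"/proc/self", "/"} {
		ok, err := isProcfs(p)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s on procfs: %v\n", p, ok)
	}
}
```

As the commit message notes, a check like this is not a perfect defence: a bind-mount of a different procfs file over the target would still pass it.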
tianye15	28e58a0f6a	Support different field counts of cpuacct.stat Signed-off-by: skilxnTL <tylxltt@gmail.com>	2019-09-29 10:20:58 +08:00
Julia Nedialkova	e63b797f38	Handle ENODEV when accessing the freezer.state file ...when checking if a container is paused Signed-off-by: Julia Nedialkova <julianedialkova@hotmail.com>	2019-09-27 17:02:56 +03:00
blacktop	84373aaa56	Add SCMP_ACT_LOG as a valid Seccomp action (#1951) Signed-off-by: blacktop <blacktop@users.noreply.github.com>	2019-09-26 11:03:03 -04:00
Michael Crosby	331692baa7	Only allow proc mount if it is procfs Fixes #2128 This allows proc to be bind mounted for host and rootless namespace usecases but it removes the ability to mount over the top of proc with a directory. ```bash > sudo docker run --rm apparmor docker: Error response from daemon: OCI runtime create failed: container_linux.go:346: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\" to rootfs \\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\" at \\\"/proc\\\" caused \\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\" cannot be mounted because it is not of type proc\\\"\"": unknown. > sudo docker run --rm -v /proc:/proc apparmor docker-default (enforce) root 18989 0.9 0.0 1288 4 ? Ss 16:47 0:00 sleep 20 ``` Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2019-09-24 11:00:18 -04:00
Jonathan Rudenberg	af7b6547ec	libcontainer/nsenter: Don't import C in non-cgo file Signed-off-by: Jonathan Rudenberg <jonathan@titanous.com>	2019-09-11 17:03:07 +00:00
Giuseppe Scrivano	718a566e02	cgroup: support mount of cgroup2 convert a "cgroup" mount to "cgroup2" when the system uses cgroups v2 unified hierarchy. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-09-06 17:57:14 +02:00
Sebastiaan van Stijn	eb86f6037e	bump syndtr/gocapability d98352740cb2c55f81556b63d4a1ec64c5a319c2 relevant changes: - syndtr/gocapability#14 capability: Deprecate NewPid and NewFile for NewPid2 and NewFile2 - syndtr/gocapability#16 Fix capHeader.pid type Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2019-09-06 01:44:26 +02:00
Mrunal Patel	92ac8e3f84	Merge pull request #2113 from giuseppe/cgroupv2 libcontainer: initial support for cgroups v2	2019-09-05 13:14:29 -07:00
Giuseppe Scrivano	524cb7c318	libcontainer: add systemd.UnifiedManager Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-09-05 13:02:27 +02:00
Giuseppe Scrivano	ec11136828	libcontainer, cgroups: rename systemd.Manager to LegacyManager Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-09-05 13:02:26 +02:00
Giuseppe Scrivano	1932917b71	libcontainer: add initial support for cgroups v2 allow setting which subsystems are used by libcontainer/cgroups/fs.Manager. subsystemsUnified is used on a system running with cgroups v2 unified mode. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-09-05 13:02:25 +02:00
Mrunal Patel	92d851e03b	Merge pull request #2123 from carlosedp/riscv64 Bump x/sys and update syscall for initial Risc-V support	2019-09-04 14:10:26 -07:00
Carlos de Paula	4316e4d047	Bump x/sys and update syscall to start Risc-V support Signed-off-by: Carlos de Paula <me@carlosedp.com>	2019-08-29 12:09:08 -03:00
Akihiro Suda	0bc069d795	nsenter: fix clang-tidy warning nsexec.c:148:3: warning: Initialized va_list 'args' is leaked [clang-analyzer-valist.Unterminated] Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-08-29 00:18:02 +09:00
Akihiro Suda	b225ef58fb	nsenter: minor clean up Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-08-28 19:50:35 +09:00
Daniel J Walsh	e4aa73424b	Rename cgroups_windows.go to cgroups_unsupported.go Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2019-08-26 18:13:52 -04:00
Mrunal Patel	c61c7370f9	Merge pull request #2103 from sipsma/cgnil cgroups/fs: check nil pointers in cgroup manager	2019-08-26 14:05:44 -07:00
Mrunal Patel	68d73f0a2e	Merge pull request #2107 from sashayakovtseva/public-get-devices Make get devices function public	2019-08-26 09:58:10 -07:00
Kenta Tada	c740965a18	libcontainer: update masked paths of /proc This commit updates the masked paths of /proc. Related issues: * https://github.com/moby/moby/pull/37404 * https://github.com/moby/moby/pull/38299 * https://github.com/moby/moby/pull/36368 Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2019-08-26 12:25:56 +09:00
Mrunal Patel	3525eddec5	Merge pull request #2117 from filbranden/detection1 Remove libcontainer detection for systemd features	2019-08-25 13:15:15 -07:00
Filipe Brandenburger	518c855833	Remove libcontainer detection for systemd features Transient units (and transient slice units) have been available for quite a long time and RHEL 7 with systemd v219 (likely the oldest OS we care about at this point) supports that. A system running a systemd without these features is likely to break a lot of other stuff that runc/libcontainer care about. Regarding delegated slices, modern systemd doesn't allow it and runc/libcontainer run fine on it, so we might as well just stop requesting it on older versions of systemd which allowed it. (Those versions never really changed behavior significantly when that option was passed anyways.) Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>	2019-08-22 21:53:24 -07:00
Filipe Brandenburger	588f040a77	Avoid the dependency on cgo through go-systemd/util package This dependency is only needed in package "github.com/coreos/go-systemd/util" and we only use it for IsRunningSystemd(), which is a simple Go function that just stats a file. Let's just borrow it here, so we remove the dependency and can remove that package from vendored build. This also removes dependencies on dlopen and on trying to find libsystemd.so or libsystemd-login.so in the system. Tested that this still builds and works as expected. Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>	2019-08-22 21:07:24 -07:00
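The "simple Go function that just stats a file" mentioned in commit 588f040a77 can be sketched as below. This assumes the check is an Lstat of /run/systemd/system, which is how go-systemd documents IsRunningSystemd(); the parameterised helper `isRunningSystemdAt` is hypothetical and exists only so the check can be tried on any path.

```go
package main

import (
	"fmt"
	"os"
)

// isRunningSystemdAt reports whether dir exists and is a real directory
// (Lstat does not follow symlinks). Hypothetical helper for illustration.
func isRunningSystemdAt(dir string) bool {
	fi, err := os.Lstat(dir)
	if err != nil {
		return false
	}
	return fi.IsDir()
}

// isRunningSystemd checks the conventional systemd runtime directory.
func isRunningSystemd() bool {
	return isRunningSystemdAt("/run/systemd/system")
}

func main() {
	fmt.Println("systemd detected:", isRunningSystemd())
}
```

Because the check needs only the standard library, it avoids cgo and any lookup of libsystemd.so, which is the point the commit makes.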
sashayakovtseva	afc24792dc	Make get devices function public Signed-off-by: sashayakovtseva <sasha@sylabs.io>	2019-08-15 17:16:47 +03:00
Erik Sipsma	9c822e4847	cgroups/fs: check nil pointers in cgroup manager Signed-off-by: Erik Sipsma <sipsma@amazon.com>	2019-08-14 09:50:45 -07:00
Mrunal Patel	2e94378464	Merge pull request #2094 from sipsma/2093-nodotudev Skip searching /dev/.udev for device nodes.	2019-08-05 10:41:54 -07:00
Erik Sipsma	f08cdaeec9	Skip searching /dev/.udev for device nodes. Closes: #2093 Signed-off-by: Erik Sipsma <sipsma@amazon.com>	2019-07-31 19:41:33 +00:00
Andreas Stocker	808e809f8a	doc: First process in container needs `Init: true` `Init` on the `Process` struct specifies whether the process is the first process in the container. This needs to be set to `true` when running the container. Signed-off-by: Andreas Stocker <astocker@anexia-it.com>	2019-07-29 22:24:28 +02:00
Kurnia D Win	5e0e67d76c	fix permission denied When exec runs as root and config.Cwd is not owned by root, exec will fail because root doesn't have the caps. So, Chdir should be done before setting the caps. Signed-off-by: Kurnia D Win <kurnia.d.win@gmail.com>	2019-07-18 12:49:36 +07:00
Mrunal Patel	b4a0b1d737	Merge pull request #2065 from odinuge/master Fix cgroup hugetlb size prefix for kB	2019-06-06 12:38:57 -07:00
Kenta Tada	b54fd85bbf	libcontainer: change seccomp test for clone syscall This commit changes the value of seccomp test for clone syscall. Also, hardcoded values should be changed because they make it unclear which flags are tested. Related issues: * https://github.com/containerd/containerd/pull/3314 * https://github.com/moby/moby/pull/39308 * https://github.com/opencontainers/runtime-tools/pull/694 Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2019-06-04 18:52:00 +09:00
Odin Ugedal	6f77e35daf	Export list of HugePageSizeUnits This will allow others to import it instead of copying it. Signed-off-by: Odin Ugedal <odin@ugedal.com>	2019-05-30 20:17:30 +02:00
Odin Ugedal	c6445b1c1c	Add tests for GetHugePageSize Add tests to avoid regressions Signed-off-by: Odin Ugedal <odin@ugedal.com>	2019-05-30 17:27:32 +02:00
Odin Ugedal	273e7b74a7	Fix cgroup hugetlb size prefix for kB The hugetlb cgroup control files (introduced here in 2012: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773) use "KB" and not "kB" (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349). The behavior in the kernel has not changed since the introduction, and the current code using "kB" will therefore fail on devices with small amounts of ram (see https://github.com/kubernetes/kubernetes/issues/77169) running a kernel with config flag CONFIG_HUGETLBFS=y As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB", "MB" and "GB" are used, so the others may be removed as well. Here is a real world example of the files inside the "/sys/kernel/mm/hugepages/" directory: - "hugepages-64kB" - "hugepages-2048kB" - "hugepages-32768kB" - "hugepages-1048576kB" And the corresponding cgroup files: - "hugetlb.64KB._____" - "hugetlb.2MB._____" - "hugetlb.32MB._____" - "hugetlb.1GB._____" Signed-off-by: Odin Ugedal <odin@ugedal.com>	2019-05-29 21:52:43 +02:00
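The mapping from sysfs directory names to cgroup file sizes listed in commit 273e7b74a7 can be sketched as follows. The helper `hugePageSizeFromDir` is hypothetical (it is not runc's GetHugePageSize), and it assumes sizes are only promoted to the next unit when they divide evenly by 1024.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cgroupUnits lists the suffixes used by the hugetlb cgroup files,
// per the commit message: only "KB", "MB" and "GB".
var cgroupUnits = []string{"KB", "MB", "GB"}

// hugePageSizeFromDir converts a sysfs directory name such as
// "hugepages-2048kB" into the size string used in cgroup file names ("2MB").
// Hypothetical helper; runc's own function may differ.
func hugePageSizeFromDir(name string) (string, error) {
	const prefix, suffix = "hugepages-", "kB"
	if !strings.HasPrefix(name, prefix) || !strings.HasSuffix(name, suffix) {
		return "", fmt.Errorf("unexpected directory name %q", name)
	}
	size, err := strconv.ParseUint(name[len(prefix):len(name)-len(suffix)], 10, 64)
	if err != nil {
		return "", err
	}
	// Promote KB -> MB -> GB while the value divides evenly by 1024.
	unit := 0
	for size >= 1024 && size%1024 == 0 && unit < len(cgroupUnits)-1 {
		size /= 1024
		unit++
	}
	return strconv.FormatUint(size, 10) + cgroupUnits[unit], nil
}

func main() {
	for _, dir := range []string{"hugepages-64kB", "hugepages-2048kB", "hugepages-32768kB", "hugepages-1048576kB"} {
		size, err := hugePageSizeFromDir(dir)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> hugetlb.%s\n", dir, size)
	}
}
```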
Mrunal Patel	5ef781c2e7	Merge pull request #2061 from KentaTada/add-cgroup-namespace-test libcontainer: fix TestGetContainerState to check configs.NEWCGROUP	2019-05-22 16:09:38 -07:00
Qiang Huang	c8337777b6	Merge pull request #2042 from xiaochenshen/rdt-add-missing-destroy libcontainer: intelrdt: add missing destroy handler in defer func	2019-05-21 09:48:00 +08:00
Kenta Tada	65032b55b1	libcontainer: fix TestGetContainerState to check configs.NEWCGROUP This test needs to handle the case of configs.NEWCGROUP as Namespace's type. Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>	2019-05-21 09:10:38 +09:00
Mrunal Patel	2484581dd7	Merge pull request #2035 from cyphar/bindmount-types specconv: always set "type: bind" in case of MS_BIND	2019-05-07 15:47:58 -07:00
Mrunal Patel	a0ecf749ee	Merge pull request #2047 from filbranden/systemd7 Move systemd.Manager initialization into a function in that module	2019-05-07 15:08:41 -07:00
Filipe Brandenburger	46351eb3d1	Move systemd.Manager initialization into a function in that module This will permit us to extend the internals of systemd.Manager to include further information about the system, such as whether cgroupv1, cgroupv2 or both are in effect. Furthermore, it allows a future refactor of moving more of UseSystemd() code into the factory initialization function. Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>	2019-05-01 13:22:19 -07:00
Georgi Sabev	a146081828	Write logs to stderr by default Minor refactoring to use the filePair struct for both init sock and log pipe Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-24 15:18:14 +03:00
Georgi Sabev	68b4ff5b37	Simplify bail logic & minor nsexec improvements Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-24 15:16:11 +03:00
Xiaochen Shen	17b37ea3fa	libcontainer: intelrdt: add missing destroy handler in defer func In the exception handling of initProcess.start(), we need to add the missing IntelRdtManager.Destroy() handler in defer func. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2019-04-24 16:41:51 +08:00
Georgi Sabev	475aef10f7	Remove redundant log function Bump logrus so that we can use logrus.StandardLogger().Logf instead Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-22 17:54:55 +03:00
Georgi Sabev	ba3cabf932	Improve nsexec logging * Simplify logging function * Logs contain __FUNCTION__:__LINE__ * Bail uses write_log Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com> Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2019-04-22 17:53:52 +03:00
Aleksa Sarai	8296826da5	specconv: always set "type: bind" in case of MS_BIND We discovered in umoci that setting a dummy type of "none" would result in file-based bind-mounts no longer working properly, which is caused by a restriction for when specconv will change the device type to "bind" to work around rootfs_linux.go's ... issues. However, bind-mounts don't have a type (and Linux will ignore any type specifier you give it) because the type is copied from the source of the bind-mount. So we should always overwrite it to avoid user confusion. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-04-08 15:08:08 +10:00
Danail Branekov	c486e3c406	Address comments in PR 1861 Refactor configuring logging into a reusable component so that it can be nicely used in both main() and init process init() Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com> Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io> Co-authored-by: Claudia Beresford <cberesford@pivotal.io> Signed-off-by: Danail Branekov <danailster@gmail.com>	2019-04-04 14:57:28 +03:00
Marco Vedovati	feebfac358	Remove pipe close before exec. Pipe close before exec is not necessary as os.Pipe() is calling pipe2 with O_CLOEXEC option. Signed-off-by: Marco Vedovati <mvedovati@suse.com>	2019-04-04 14:53:30 +03:00
Marco Vedovati	9a599f62fb	Support for logging from children processes Add support for children processes logging (including nsexec). A pipe is used to send logs from children to parent in JSON. The JSON format used is the same used by logrus JSON formatted, i.e. children process can use standard logrus APIs. Signed-off-by: Marco Vedovati <mvedovati@suse.com>	2019-04-04 14:53:23 +03:00
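The parent side of the child-to-parent log pipe described in commit 9a599f62fb might look roughly like this. It is a sketch under assumptions: entries are newline-delimited JSON objects with the "level" and "msg" keys that the logrus JSON formatter emits, and the names `logEntry`, `parseLogLine` and `forwardLogs` are hypothetical rather than runc's.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// logEntry mirrors the "level" and "msg" keys of a logrus JSON log line.
type logEntry struct {
	Level string `json:"level"`
	Msg   string `json:"msg"`
}

// parseLogLine decodes a single JSON-encoded log line.
func parseLogLine(line []byte) (logEntry, error) {
	var e logEntry
	err := json.Unmarshal(line, &e)
	return e, err
}

// forwardLogs reads newline-delimited JSON entries from r (the read end of
// the pipe in a real setup) and returns them formatted as "level: msg".
// Malformed lines are reported rather than silently dropped.
func forwardLogs(r io.Reader) []string {
	var out []string
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		if len(scanner.Bytes()) == 0 {
			continue
		}
		e, err := parseLogLine(scanner.Bytes())
		if err != nil {
			out = append(out, "error: malformed log line: "+scanner.Text())
			continue
		}
		out = append(out, e.Level+": "+e.Msg)
	}
	return out
}

func main() {
	// A strings.Reader stands in for the child's log pipe.
	child := strings.NewReader(`{"level":"debug","msg":"nsexec started"}` + "\n")
	for _, line := range forwardLogs(child) {
		fmt.Println(line)
	}
}
```

Using the same JSON shape on both sides is what lets a child process written in Go log through the standard logrus API while the C code in nsexec writes compatible lines by hand.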
Michael Crosby	11fc498ffa	Merge pull request #2023 from LittleLightLittleFire/2022-fix-runc-zombie-process-regression Fixes regression causing zombie runc:[1:CHILD] processes	2019-03-22 14:06:31 -04:00
Mrunal Patel	dd22a84864	Merge pull request #2012 from rhatdan/selinux Need to setup labeling of kernel keyrings.	2019-03-20 21:17:18 -07:00
Alex Fang	eab5330908	Fixes regression causing zombie runc:[1:CHILD] processes Whenever processes are spawned using nsexec, a zombie runc:[1:CHILD] process will always be created and will need to be reaped by the parent Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>	2019-03-21 13:43:38 +11:00
Aleksa Sarai	f56b4cbead	merge branch 'pr-2015' Use getenv not secure_getenv LGTMs: @crosbymichael @cyphar Closes #2015	2019-03-16 17:30:56 +11:00
Filipe Brandenburger	4b2b978291	Add cgroup name to error message More information should help troubleshoot an issue when this error occurs. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2019-03-14 10:25:00 -07:00
Justin Cormack	6f714aa928	Use getenv not secure_getenv secure_getenv is a Glibc extension and so this code does not compile on Musl libc any more after this patch. secure_getenv is only intended to be used in setuid binaries, in order that they should not trust their environment. It simply returns NULL if the binary is running setuid. If runc was installed setuid, the user can already do anything as root, so it is game over, so this check is not needed. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2019-03-14 10:58:10 +00:00
Daniel J Walsh	cd96170c10	Need to setup labeling of kernel keyrings. Work is ongoing in the kernel to support different kernel keyrings per user namespace. We want to allow SELinux to manage kernel keyrings inside of the container. Currently when runc creates the kernel keyring it gets the label which runc is running with, usually `container_runtime_t`; with this change the kernel keyring will be labeled with the container process label container_t:s0:c1,c2. A container running as container_t:s0:c1,c2 can manage keyrings with the same label. This change required a revendoring of the SELinux Go bindings, github.com/opencontainers/selinux. Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>	2019-03-13 17:57:30 -04:00
Mrunal Patel	2b18fe1d88	Merge pull request #1984 from cyphar/memfd-cleanups nsenter: cloned_binary: "memfd" cleanups	2019-03-07 10:18:33 -08:00
Michael Crosby	f739110263	Merge pull request #1968 from adrianreber/podman Create bind mount mountpoints during restore	2019-03-04 11:37:07 -06:00
Aleksa Sarai	2d4a37b427	nsenter: cloned_binary: userspace copy fallback if sendfile fails There are some circumstances where sendfile(2) can fail (one example is that AppArmor appears to block writing to deleted files with sendfile(2) under some circumstances) and so we need to have a userspace fallback. It's fairly trivial (and handles short-writes). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:10 +11:00
Aleksa Sarai	16612d74de	nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying The usage of memfd_create(2) and other copying techniques is quite wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR. memfd_create(2) added ~10M of memory usage to the cgroup associated with the container, which can result in some setups getting OOM'd (or just hogging the hosts' memory when you have lots of created-but-not-started containers sticking around). The easiest way of solving this is by creating a read-only bind-mount of the binary, opening that read-only bindmount, and then umounting it to ensure that the host won't accidentally be re-mounted read-write. This avoids all copying and cleans up naturally like the other techniques used. Unfortunately, like the O_TMPFILE fallback, this requires being able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting over the most obvious path -- /proc/self/exe -- is a very bad idea). Unfortunately detecting this isn't fool-proof -- on a system with a read-only root filesystem (that might become read-write during "runc init" execution), we cannot tell whether we have already done an ro remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY environment variable which is checked alongside the protection being present. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:08 +11:00
Aleksa Sarai	af9da0a450	nsenter: cloned_binary: use the runc statedir for O_TMPFILE Writing a file to tmpfs actually incurs a memcg penalty, and thus the benefit of being able to disable memfd_create(2) with _LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should be noted that quite a few distributions don't use tmpfs for /tmp (and instead have it as a regular directory or subvolume of the host filesystem). Since runc must have write access to the state directory anyway (and the state directory is usually not on a tmpfs) we can use that instead of /tmp -- avoiding potential memcg costs with no real downside. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:51 +11:00
Aleksa Sarai	2429d59352	nsenter: cloned_binary: expand and add pre-3.11 fallbacks In order to get around the memfd_create(2) requirement, `0a8e4117e7` ("nsenter: clone /proc/self/exe to avoid exposing host binary to container") added an O_TMPFILE fallback. However, this fallback was flawed in two ways: * It required O_TMPFILE which is relatively new (having been added to Linux 3.11). * The fallback choice was made at compile-time, not runtime. This results in several complications when it comes to running binaries on different machines to the ones they were built on. The easiest way to resolve these things is to have fallbacks work in a more procedural way (though it does make the code unfortunately more complicated) and to add a new fallback that uses mkostemp(3). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:50 +11:00
Aleksa Sarai	5b775bf297	nsenter: cloned_binary: detect and handle short copies For a variety of reasons, sendfile(2) can end up doing a short-copy so we need to just loop until we hit the binary size. Since /proc/self/exe is tautologically our own binary, there's no chance someone is going to modify it underneath us (or change its size). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-26 19:51:01 +11:00
Mrunal Patel	5b5130ad76	Merge pull request #1963 from adrianreber/go-criu Vendor in go-criu and use it for CRIU's RPC definition	2019-02-23 10:44:28 -08:00
Adrian Reber	9edb5494bb	Use vendored in CRIU Go bindings This makes use of the vendored in Go bindings and removes the copy of the CRIU RPC interface definition. runc now relies on go-criu for RPC definition and hopefully more CRIU functions can be used in the future from the CRIU Go bindings. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-14 18:20:02 +01:00
Christian Brauner	bb7d8b1f41	nsexec (CVE-2019-5736): avoid parsing environ My first attempt to simplify this and make it less costly focussed on the way constructors are called. I was under the impression that the ELF specification mandated that argc, argv, and actually even envp need to be passed to functions located in the .init_array section (aka "constructors"). Actually, the specification is (cf. [2]): SHT_INIT_ARRAY This section contains an array of pointers to initialization functions, as described in ``Initialization and Termination Functions'' in Chapter 5. Each pointer in the array is taken as a parameterless procedure with a void return. which means that this becomes a libc specific decision. Glibc passes down those args, musl doesn't. So this approach can't work. However, we can at least remove the environment parsing part based on POSIX since [1] mandates that there should be an environ variable defined in unistd.h which provides access to the environment. See also the relevant Open Group specification [1]. [1]: http://pubs.opengroup.org/onlinepubs/9699919799/ [2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array Fixes: CVE-2019-5736 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2019-02-14 16:06:21 +01:00
Filipe Brandenburger	cd41feb46b	Remove detection for scope properties, which have always been broken The detection for scope properties (whether scope units support DefaultDependencies= or Delegate=) has always been broken, since systemd refuses to create scopes unless at least one PID is attached to it (and this has been so since scope units were introduced in systemd v205.) This can be seen in journal logs whenever a container is started with libpod: Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing. Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing. Since this logic never worked, just assume both attributes are supported (which is what the code does when detection fails for this reason, since it's looking for an "unknown attribute" or "read-only attribute" to mark them as false) and skip the detection altogether. Signed-off-by: Filipe Brandenburger <filbranden@google.com>	2019-02-11 16:05:37 -08:00
Adrian Reber	7354546cc8	Create mountpoints also on restore runc creates all missing mountpoints when it starts a container, this commit also creates those mountpoints during restore. Now it is possible to restore a container using the same, but newly created rootfs just as during container start. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-08 15:59:51 +01:00
Adrian Reber	f661e02343	factor out bind mount mountpoint creation During rootfs setup all mountpoints (directory and files) are created before bind mounting the bind mounts. This does not happen during container restore via CRIU. If restoring in an identical but newly created rootfs, the restore fails right now. This just factors out the code to create the bind mount mountpoints so that it also can be used during restore. Signed-off-by: Adrian Reber <areber@redhat.com>	2019-02-08 15:59:51 +01:00
Aleksa Sarai	0a8e4117e7	nsenter: clone /proc/self/exe to avoid exposing host binary to container There are quite a few circumstances where /proc/self/exe pointing to a pretty important container binary is a _bad_ thing, so to avoid this we have to make a copy (preferably doing self-clean-up and not being writeable). We require memfd_create(2) -- though there is an O_TMPFILE fallback -- but we can always extend this to use a scratch MNT_DETACH overlayfs or tmpfs. The main downside to this approach is no page-cache sharing for the runc binary (which overlayfs would give us) but this is far less complicated. This is only done during nsenter so that it happens transparently to the Go code, and any libcontainer users benefit from it. This also makes ExtraFiles and --preserve-fds handling trivial (because we don't need to worry about it). Fixes: CVE-2019-5736 Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-08 18:57:59 +11:00
Mrunal Patel	e4fa8a4575	Merge pull request #1955 from xiaochenshen/rdt-fix-destroy-issue libcontainer: intelrdt: fix null intelrdt path issue in Destroy()	2019-02-01 13:18:56 -08:00
Mrunal Patel	4e4c907193	Merge pull request #1950 from cloudfoundry-incubator/enter-pid-race Resilience in adding of exec tasks to cgroups	2019-02-01 13:18:16 -08:00
Aleksa Sarai	565325fc36	integration: fix mis-use of libcontainer.Factory For some reason, libcontainer/integration has a whole bunch of incorrect usages of libcontainer.Factory -- causing test failures with a set of security patches that will be published soon. Fixing this is fairly trivial (switch to creating a new libcontainer.Factory once in each process, rather than creating one in TestMain globally). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-01-24 23:12:48 +13:00
Michael Crosby	c1e454b2a1	Merge pull request #1960 from giuseppe/fix-kmem-systemd systemd: fix setting kernel memory limit	2019-01-15 13:21:01 -05:00
Michael Crosby	4e9d52da54	Merge pull request #1933 from adrianreber/master Add CRIU configuration file support	2019-01-15 11:22:38 -05:00
Giuseppe Scrivano	28a697cce3	rootfs: umount all procfs and sysfs with --no-pivot When creating a new user namespace, the kernel doesn't allow mounting a new procfs or sysfs file system if there is not already one instance fully visible in the current mount namespace. When using --no-pivot we were effectively inhibiting this protection from the kernel, as /proc and /sys from the host are still present in the container mount namespace. A container without full access to /proc could then create a new user namespace, and from there be able to mount a fully visible /proc, bypassing the limitations in the container. A simple reproducer for this issue is: unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-14 09:53:35 +01:00
Giuseppe Scrivano	f01923376d	systemd: fix setting kernel memory limit since commit `df3fa115f9` it is not possible to set a kernel memory limit when using the systemd cgroups backend as we use cgroup.Apply twice. Skip enabling kernel memory if there are already tasks in the cgroup. Without this patch, runc fails with: container_linux.go:344: starting container process caused "process_linux.go:311: applying cgroup configuration for process caused \"failed to set memory.kmem.limit_in_bytes, because either tasks have already joined this cgroup or it has children\"" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-10 11:33:50 +01:00
Xiaochen Shen	acb75d0e38	libcontainer: intelrdt: fix null intelrdt path issue in Destroy() This patch fixes a corner case when destroying a container: If we start a container without 'intelRdt' config set, and then we run “runc update --l3-cache-schema/--mem-bw-schema” to add 'intelRdt' config implicitly. Now if we enter "exit" from inside the container, we will pass through linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy(). But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still a null string; it hasn’t been initialized yet. As a result, the rdt group directory created during "runc update" will not be removed as expected. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2019-01-05 00:34:25 +08:00
Adrian Reber	e157963054	Enable CRIU configuration files CRIU 3.11 introduces configuration files: https://criu.org/Configuration_files https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html This enables the user to influence CRIU's behaviour without code changes if using new CRIU features or if the user wants to enable certain CRIU behaviour without always specifying certain options. With this it is possible to write 'tcp-established' to the configuration file: $ echo tcp-established > /etc/criu/runc.conf and from now on all checkpoints will preserve the state of established TCP connections. This removes the need to always use $ runc checkpoint --tcp-established if the goal is to always checkpoint with '--tcp-established'. It also adds the possibility for unexpected CRIU behaviour if the user created a configuration file at some point in time and forgets about it. As a result of the discussion in https://github.com/opencontainers/runc/pull/1933 it is now also possible to define a CRIU configuration file for each container with the annotation 'org.criu.config'. If 'org.criu.config' does not exist, runc will tell CRIU to use '/etc/criu/runc.conf' if it exists. If 'org.criu.config' is set to an empty string (''), runc will tell CRIU to not use any runc specific configuration file at all. If 'org.criu.config' is set to a non-empty string, runc will use that value as an additional configuration file for CRIU. With the annotation the user can decide to use the default configuration file ('/etc/criu/runc.conf'), none or a container specific configuration file. Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
Adrian Reber	360ba8a27d	Update criurpc definition for latest features Signed-off-by: Adrian Reber <areber@redhat.com>	2018-12-21 07:42:12 +01:00
JoeWrightss	0855bce448	Fix .Fatalf() error message Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-19 20:22:48 +08:00
Tom Godkin	bdf3524b34	Retry adding pids to cgroups when EINVAL occurs The kernel will sometimes return EINVAL when writing a pid to a cgroup.procs file. It does so when the task being added still has the state TASK_NEW. See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286 Co-authored-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Tom Godkin <tgodkin@pivotal.io> Signed-off-by: Danail Branekov <danailster@gmail.com>	2018-12-17 15:34:47 +00:00
JoeWrightss	769d6c4a75	Fix some typos Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-09 23:52:54 +08:00
Michael Crosby	25f3f893c8	Merge pull request #1939 from cyphar/nokmem-error cgroups: nokmem: error out on explicitly-set kmemcg limits	2018-12-04 11:14:56 -05:00
Michael Crosby	96ec2177ae	Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers kill: allow to signal paused containers	2018-12-03 16:55:13 -05:00
Ace-Tang	dce70cdff5	cr: get pid from criu notify when restore When restoring a container from a checkpoint directory, we should get the pid from the criu notification, since c.initProcess has not been created. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-12-03 13:31:20 +08:00
Aleksa Sarai	8a4629f7b5	cgroups: nokmem: error out on explicitly-set kmemcg limits When built with nokmem we are explicitly disabling support for kmemcg, but it is a strict specification requirement that if we cannot fulfil an aspect of the container configuration that we error out. Completely ignoring explicitly-requested kmemcg limits with nokmem would undoubtedly lead to problems. Fixes: `6a2c155968` ("libcontainer: ability to compile without kmem") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-12-01 14:31:35 +11:00
Giuseppe Scrivano	07d1ad44c8	kill: allow to signal paused containers regression introduced by `87a188996e` Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-11-30 23:35:47 +01:00
Michael Crosby	4932620b62	Merge pull request #1919 from xiaochenshen/rdt-mba-software-controller libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc	2018-11-26 16:45:42 -05:00
Michael Crosby	50e2634995	Merge pull request #1934 from lifubang/kill fix: may kill other process when container has been stopped	2018-11-21 10:30:25 -05:00
Lifubang	87a188996e	may kill other process when container has been stopped Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-11-21 17:44:52 +08:00
Aleksa Sarai	ceefc3fe4e	merge branch 'pr-1741' libcontainer: Set 'status' in hook stdin LGTMs: @cyphar @crosbymichael Closes #1741	2018-11-20 06:39:30 +11:00
Michael Crosby	76520a4bf0	Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint Respect container's cgroup path	2018-11-16 14:06:54 -05:00
W. Trevor King	e23868603a	libcontainer: Set 'status' in hook stdin Finish off the work started in `a344b2d6` (sync up `HookState` with OCI spec `State`, 2016-12-19, #1201). And drop HookState, since there's no need for a local alias for specs.State. Also set c.initProcess in newInitProcess to support OCIState calls from within initProcess.start(). I think the cyclic references between linuxContainer and initProcess are unfortunate, but didn't want to address that here. I've also left the timing of the Prestart hooks alone, although the spec calls for them to happen before start (not as part of creation) [1,2]. Once the timing gets fixed we can drop the initProcessStartTime hacks which initProcess.start currently needs. I'm not sure why we trigger the prestart hooks in response to both procReady and procHooks. But we've had two prestart rounds in initProcess.start since `2f276498` (Move pre-start hooks after container mounts, 2016-02-17, #568). I've left that alone too. I really think we should have len() guards to avoid computing the state when .Hooks is non-nil but the particular phase we're looking at is empty. Aleksa, however, is adamantly against them [3] citing a risk of sloppy copy/pastes causing the hook slice being len-guarded to diverge from the hook slice being iterated over within the guard. I think that sort of thing is very low-risk, because: * We shouldn't be copy/pasting this, right? DRY for the win :). * There's only ever a few lines between the guard and the guarded loop. That makes broken copy/pastes easy to catch in review. * We should have test coverage for these. Guarding with the wrong slice is certainly not the only thing you can break with a sloppy copy/paste. But I'm not a maintainer ;). [1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart [2]: https://github.com/opencontainers/runc/issues/1710 [3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570 Signed-off-by: W. Trevor King <wking@tremily.us>	2018-11-14 06:49:49 -08:00
Mrunal Patel	4769cdf607	Merge pull request #1916 from crosbymichael/cgns Add support for cgroup namespace	2018-11-13 12:21:38 -08:00
Mrunal Patel	f000fe11ec	Merge pull request #1917 from slp/master libcontainer: map PidsLimit to systemd's TasksMax property	2018-11-13 12:21:23 -08:00
Michael Crosby	aa7917b751	Merge pull request #1911 from theSuess/linter-fixes Various cleanups to address linter issues	2018-11-13 12:13:34 -05:00
Michael Crosby	bd420b59f1	Merge pull request #1925 from Ace-Tang/fix_dup_ns test: fix TestDupNamespaces fail to test dup-ns error	2018-11-13 12:11:11 -05:00
Xiaochen Shen	95af9eff82	libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc MBA Software Controller feature is introduced in Linux kernel v4.18. It is a software enhancement to mitigate some limitations in MBA which are described in the kernel documentation. It also makes the interface more user friendly - we could specify memory bandwidth in "MBps" (Mega Bytes per second) as well as in "percentages". The kernel underneath would use a software feedback mechanism or a "Software Controller" which reads the actual bandwidth using MBM counters and adjusts the memory bandwidth percentages to ensure: "actual memory bandwidth < user specified memory bandwidth". We could enable this feature through mount option "-o mba_MBps": mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl In runc, we handle both memory bandwidth schemata in unified format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The unit of memory bandwidth is specified in "percentages" by default, and in "MBps" if MBA Software Controller is enabled. For more information about Intel RDT and MBA Software Controller: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-11-13 23:27:08 +08:00
Ace-Tang	16d55f17a8	libcontainer: fix potential panic if spec.Process is nil In this code path, the pointer 'spec.Process' should be checked first to avoid a panic. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-11-06 11:55:30 +08:00
Ace-Tang	95d1aa1886	test: fix TestDupNamespaces add Root in created spec, or error message is 'Root must be specified' Signed-off-by: Ace-Tang <aceapril@126.com>	2018-11-06 11:36:27 +08:00
Michael Crosby	b1068fb925	Merge pull request #1814 from rhatdan/selinux SELinux labels are tied to the thread	2018-11-05 10:00:11 -05:00
Aleksa Sarai	9f1e94488e	merge branch 'pr-1921' libcontainer: ability to compile without kmem LGTMs: @mrunalp @cyphar Closes #1921	2018-11-02 09:54:16 +11:00
Michael Crosby	9e5aa7494d	Merge pull request #1918 from giuseppe/skip-setgroups rootless: fix running with /proc/self/setgroups set to deny	2018-11-01 13:16:47 -04:00
Kir Kolyshkin	6a2c155968	libcontainer: ability to compile without kmem Commit `fe898e7862` (PR #1350) enables kernel memory accounting for all cgroups created by libcontainer -- even if kmem limit is not configured. Kernel memory accounting is known to be broken in some kernels, specifically the ones from RHEL7 (including RHEL 7.5). Those kernels do not support kernel memory reclaim, and are prone to oopses. Unconditionally enabling kmem acct on such kernels leads to bugs, such as * https://github.com/opencontainers/runc/issues/1725 * https://github.com/kubernetes/kubernetes/issues/61937 * https://github.com/moby/moby/issues/29638 This commit gives a way to compile runc without kernel memory setting support. To do so, use something like make BUILDTAGS="seccomp nokmem" Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2018-10-31 20:35:51 -07:00
Yuanhong Peng	df3fa115f9	Add support for cgroup namespace Cgroup namespace can be configured in `config.json` like other namespaces. Here is an example: ``` "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" }, { "type": "cgroup" } ], ``` Note that if you want to run a container which shares a cgroup ns with another container, then it's strongly recommended that you set a proper `CgroupsPath` for both containers (the second container's cgroup path must be a subdirectory of the first one's). Otherwise there might be some unexpected results. Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-31 10:51:43 -04:00
Chris Aniszczyk	f3ce8221ea	Merge pull request #1913 from xiaochenshen/rdt-add-diagnostics libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors	2018-10-25 14:27:17 -05:00
Giuseppe Scrivano	869add3318	rootless: fix running with /proc/self/setgroups set to deny This is a regression from `06f789cf26` when the user namespace was configured without a privileged helper. To allow a single mapping in a user namespace, it is necessary to set /proc/self/setgroups to "deny". For a simple reproducer, the user namespace can be created with "unshare -r". Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-10-25 15:44:15 +02:00
Sergio Lopez	5c6b9c3c1c	libcontainer: map PidsLimit to systemd's TasksMax property Currently runc applies PidsLimit restriction by writing directly to cgroup's pids.max, without notifying systemd. As a consequence, when the latter updates the context of the corresponding scope, pids.max is reset to the value of systemd's TasksMax property. This can be easily reproduced this way (I'm using "postfix" here just as an example, any unrelated but existing service will do): # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h` # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max 111 # systemctl disable --now postfix # systemctl enable --now postfix # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max max This patch adds TasksAccounting=true and TasksMax=PidsLimit to the properties sent to systemd. Signed-off-by: Sergio Lopez <slp@redhat.com>	2018-10-24 17:20:27 +02:00
Aleksa Sarai	e93996674f	merge branch 'pr-1903' clarify license information LGTMs: @hqhq @cyphar Closes #1903	2018-10-24 22:03:44 +11:00
Aleksa Sarai	9a3a8a5ebf	libcontainer: implement CLONE_NEWCGROUP This is a very simple implementation because it doesn't require any configuration unlike the other namespaces, and in its current state it only masks paths. This feature is available in Linux 4.6+ and is enabled by default for kernels compiled with CONFIG_CGROUP=y. Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-23 16:23:00 -04:00
Xiaochen Shen	6c307f8ff2	libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors Linux kernel v4.15 introduces better diagnostics for Intel RDT operation errors. If any error is returned when making new directories or writing to any of the control files in the resctrl filesystem, reading the file /sys/fs/resctrl/info/last_cmd_status could provide more information than can be conveyed in the errors returned from file operations. Some examples: echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata -bash: echo: write error: Invalid argument cat /sys/fs/resctrl/info/last_cmd_status mask f3 has non-consecutive 1-bits echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata -bash: echo: write error: Invalid argument cat /sys/fs/resctrl/info/last_cmd_status MB value 0 out of range [10,100] cd /sys/fs/resctrl mkdir 1 2 3 4 5 6 7 8 mkdir: cannot create directory '8': No space left on device cat /sys/fs/resctrl/info/last_cmd_status out of CLOSIDs See 'last_cmd_status' for more details in kernel documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt In runc, we could append the diagnostics information to the error message of Intel RDT operation errors to provide more user-friendly information. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-19 00:16:08 +08:00
Mrunal Patel	c2ab1e656e	Merge pull request #1910 from adrianreber/tip Fix travis Go: tip	2018-10-17 12:47:08 -07:00
Michael Crosby	58592df567	Merge pull request #1880 from AkihiroSuda/fix-subgid libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs	2018-10-16 15:21:51 -04:00
Xiaochen Shen	d59b17d6d5	libcontainer: intelrdt: Add more check if sub-features are enabled Double check if Intel RDT sub-features are available in "resource control" filesystem. Intel RDT sub-features can be selectively disabled or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and newer kernel. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:44 +08:00
Xiaochen Shen	f097339289	libcontainer: intelrdt: add test cases for Intel RDT/MBA Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:39 +08:00
Xiaochen Shen	27560ace2f	libcontainer: intelrdt: add support for Intel RDT/MBA in runc Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttle over memory bandwidth for the software. A user controls the resource by indicating the percentage of maximum memory bandwidth. Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If the hardware supports it, CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \| \|-- cbm_mask \| \| \|-- min_cbm_bits \| \| \|-- num_closids \| \|-- MB \| \|-- bandwidth_gran \| \|-- delay_linear \| \|-- min_bandwidth \| \|-- num_closids \|-- ... \|-- schemata \|-- tasks \|-- <container_id> \|-- ... \|-- schemata \|-- tasks For MBA support for `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT which was implemented in #1279. We could also make use of `tasks` and `schemata` configuration for memory bandwidth resource constraints. The file `tasks` has a list of tasks that belong to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format. Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage. Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware. For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth is 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1. "linux": { "intelRdt": { "memBwSchema": "MB:0=20;1=70" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:29 +08:00
Xiaochen Shen	c1cece7e23	libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:28:19 +08:00
Mrunal Patel	a00bf01908	Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)	2018-10-15 17:32:15 -07:00
Dominik Süß	0b412e9482	various cleanups to address linter issues Signed-off-by: Dominik Süß <dominik@suess.wtf>	2018-10-13 21:14:03 +02:00
Adrian Reber	0d01164756	Fix travis Go: tip This fixes libcontainer/container_linux.go:1200: Error call has possible formatting directive %s Signed-off-by: Adrian Reber <areber@redhat.com>	2018-10-13 10:44:07 +00:00
Aleksa Sarai	e40d4635c4	merge branch 'pr-1894' Move spec.Linux.IntelRdt check to spec.Linux != nil block LGTMs: @crosbymichael @cyphar Closes #1894	2018-10-09 02:41:13 +11:00
Jonathan Marler	1499c746a1	Move spec.Linux.IntelRdt check to spec.Linux != nil block Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>	2018-10-04 21:30:55 -06:00
Mike Brown	26bdc0dce7	clarify license information Signed-off-by: Mike Brown <brownwm@us.ibm.com>	2018-10-03 10:39:44 -05:00
Mrunal Patel	2abd837c8c	Merge pull request #1893 from cyphar/keyctl-ignore-enosys keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)	2018-09-25 13:35:16 -07:00
Danail Branekov	a1d5398afa	Respect container's cgroup path Respect the container's cgroup path when finding the container's cgroup mount point, which is useful in multi-tenant environments, where containers have their own unique cgroup mounts Signed-off-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io> Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>	2018-09-25 17:43:36 +01:00

1404 Commits