jasder/runc - runc - 军科开源项目托管

Commit Graph

Author	SHA1	Message	Date
Giuseppe Scrivano	28a697cce3	rootfs: umount all procfs and sysfs with --no-pivot When creating a new user namespace, the kernel doesn't allow to mount a new procfs or sysfs file system if there is not already one instance fully visible in the current mount namespace. When using --no-pivot we were effectively inhibiting this protection from the kernel, as /proc and /sys from the host are still present in the container mount namespace. A container without full access to /proc could then create a new user namespace, and from there able to mount a fully visible /proc, bypassing the limitations in the container. A simple reproducer for this issue is: unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger" Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2019-01-14 09:53:35 +01:00
JoeWrightss	0855bce448	Fix .Fatalf() error message Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-19 20:22:48 +08:00
JoeWrightss	769d6c4a75	Fix some typos Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>	2018-12-09 23:52:54 +08:00
Michael Crosby	25f3f893c8	Merge pull request #1939 from cyphar/nokmem-error cgroups: nokmem: error out on explicitly-set kmemcg limits	2018-12-04 11:14:56 -05:00
Michael Crosby	96ec2177ae	Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers kill: allow to signal paused containers	2018-12-03 16:55:13 -05:00
Ace-Tang	dce70cdff5	cr: get pid from criu notify when restore when restore container from a checkpoint directory, we should get pid from criu notify, since c.initProcess has not been created. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-12-03 13:31:20 +08:00
Aleksa Sarai	8a4629f7b5	cgroups: nokmem: error out on explicitly-set kmemcg limits When built with nokmem we explicitly are disabling support for kmemcg, but it is a strict specification requirement that if we cannot fulfil an aspect of the container configuration that we error out. Completely ignoring explicitly-requested kmemcg limits with nokmem would undoubtably lead to problems. Fixes: `6a2c155968` ("libcontainer: ability to compile without kmem") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-12-01 14:31:35 +11:00
Giuseppe Scrivano	07d1ad44c8	kill: allow to signal paused containers regression introduced by `87a188996e` Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-11-30 23:35:47 +01:00
Michael Crosby	4932620b62	Merge pull request #1919 from xiaochenshen/rdt-mba-software-controller libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc	2018-11-26 16:45:42 -05:00
Michael Crosby	50e2634995	Merge pull request #1934 from lifubang/kill fix: may kill other process when container has been stopped	2018-11-21 10:30:25 -05:00
Lifubang	87a188996e	may kill other process when container has been stopped Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-11-21 17:44:52 +08:00
Aleksa Sarai	ceefc3fe4e	merge branch 'pr-1741' libcontainer: Set 'status' in hook stdin LGTMs: @cyphar @crosbymichael Closes #1741	2018-11-20 06:39:30 +11:00
Michael Crosby	76520a4bf0	Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint Respect container's cgroup path	2018-11-16 14:06:54 -05:00
W. Trevor King	e23868603a	libcontainer: Set 'status' in hook stdin Finish off the work started in `a344b2d6` (sync up `HookState` with OCI spec `State`, 2016-12-19, #1201). And drop HookState, since there's no need for a local alias for specs.State. Also set c.initProcess in newInitProcess to support OCIState calls from within initProcess.start(). I think the cyclic references between linuxContainer and initProcess are unfortunate, but didn't want to address that here. I've also left the timing of the Prestart hooks alone, although the spec calls for them to happen before start (not as part of creation) [1,2]. Once the timing gets fixed we can drop the initProcessStartTime hacks which initProcess.start currently needs. I'm not sure why we trigger the prestart hooks in response to both procReady and procHooks. But we've had two prestart rounds in initProcess.start since `2f276498` (Move pre-start hooks after container mounts, 2016-02-17, #568). I've left that alone too. I really think we should have len() guards to avoid computing the state when .Hooks is non-nil but the particular phase we're looking at is empty. Aleksa, however, is adamantly against them [3] citing a risk of sloppy copy/pastes causing the hook slice being len-guarded to diverge from the hook slice being iterated over within the guard. I think that ort of thing is very lo-risk, because: * We shouldn't be copy/pasting this, right? DRY for the win :). * There's only ever a few lines between the guard and the guarded loop. That makes broken copy/pastes easy to catch in review. * We should have test coverage for these. Guarding with the wrong slice is certainly not the only thing you can break with a sloppy copy/paste. But I'm not a maintainer ;). [1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart [2]: https://github.com/opencontainers/runc/issues/1710 [3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570 Signed-off-by: W. Trevor King <wking@tremily.us>	2018-11-14 06:49:49 -08:00
Mrunal Patel	4769cdf607	Merge pull request #1916 from crosbymichael/cgns Add support for cgroup namespace	2018-11-13 12:21:38 -08:00
Mrunal Patel	f000fe11ec	Merge pull request #1917 from slp/master libcontainer: map PidsLimit to systemd's TasksMax property	2018-11-13 12:21:23 -08:00
Michael Crosby	aa7917b751	Merge pull request #1911 from theSuess/linter-fixes Various cleanups to address linter issues	2018-11-13 12:13:34 -05:00
Michael Crosby	bd420b59f1	Merge pull request #1925 from Ace-Tang/fix_dup_ns test: fix TestDupNamespaces fail to test dup-ns error	2018-11-13 12:11:11 -05:00
Xiaochen Shen	95af9eff82	libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc MBA Software Controller feature is introduced in Linux kernel v4.18. It is a software enhancement to mitigate some limitations in MBA which describes in kernel documentation. It also makes the interface more user friendly - we could specify memory bandwidth in "MBps" (Mega Bytes per second) as well as in "percentages". The kernel underneath would use a software feedback mechanism or a "Software Controller" which reads the actual bandwidth using MBM counters and adjust the memory bandwidth percentages to ensure: "actual memory bandwidth < user specified memory bandwidth". We could enable this feature through mount option "-o mba_MBps": mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl In runc, we handle both memory bandwidth schemata in unified format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The unit of memory bandwidth is specified in "percentages" by default, and in "MBps" if MBA Software Controller is enabled. For more information about Intel RDT and MBA Software Controller: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-11-13 23:27:08 +08:00
Ace-Tang	16d55f17a8	libcontainer: fix potential panic if spec.Process is nil for the code logic, pointer 'spec.Process' should be judge first to avoid panic. Signed-off-by: Ace-Tang <aceapril@126.com>	2018-11-06 11:55:30 +08:00
Ace-Tang	95d1aa1886	test: fix TestDupNamespaces add Root in created spec, or error message is 'Root must be specified' Signed-off-by: Ace-Tang <aceapril@126.com>	2018-11-06 11:36:27 +08:00
Michael Crosby	b1068fb925	Merge pull request #1814 from rhatdan/selinux SELinux labels are tied to the thread	2018-11-05 10:00:11 -05:00
Aleksa Sarai	9f1e94488e	merge branch 'pr-1921' libcontainer: ability to compile without kmem LGTMs: @mrunalp @cyphar Closes #1921	2018-11-02 09:54:16 +11:00
Michael Crosby	9e5aa7494d	Merge pull request #1918 from giuseppe/skip-setgroups rootless: fix running with /proc/self/setgroups set to deny	2018-11-01 13:16:47 -04:00
Kir Kolyshkin	6a2c155968	libcontainer: ability to compile without kmem Commit `fe898e7862` (PR #1350) enables kernel memory accounting for all cgroups created by libcontainer -- even if kmem limit is not configured. Kernel memory accounting is known to be broken in some kernels, specifically the ones from RHEL7 (including RHEL 7.5). Those kernels do not support kernel memory reclaim, and are prone to oopses. Unconditionally enabling kmem acct on such kernels lead to bugs, such as * https://github.com/opencontainers/runc/issues/1725 * https://github.com/kubernetes/kubernetes/issues/61937 * https://github.com/moby/moby/issues/29638 This commit gives a way to compile runc without kernel memory setting support. To do so, use something like make BUILDTAGS="seccomp nokmem" Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2018-10-31 20:35:51 -07:00
Yuanhong Peng	df3fa115f9	Add support for cgroup namespace Cgroup namespace can be configured in `config.json` as other namespaces. Here is an example: ``` "namespaces": [ { "type": "pid" }, { "type": "network" }, { "type": "ipc" }, { "type": "uts" }, { "type": "mount" }, { "type": "cgroup" } ], ``` Note that if you want to run a container which has shared cgroup ns with another container, then it's strongly recommended that you set proper `CgroupsPath` of both containers(the second container's cgroup path must be the subdirectory of the first one). Or there might be some unexpected results. Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-31 10:51:43 -04:00
Chris Aniszczyk	f3ce8221ea	Merge pull request #1913 from xiaochenshen/rdt-add-diagnostics libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors	2018-10-25 14:27:17 -05:00
Giuseppe Scrivano	869add3318	rootless: fix running with /proc/self/setgroups set to deny This is a regression from `06f789cf26` when the user namespace was configured without a privileged helper. To allow a single mapping in an user namespace, it is necessary to set /proc/self/setgroups to "deny". For a simple reproducer, the user namespace can be created with "unshare -r". Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-10-25 15:44:15 +02:00
Sergio Lopez	5c6b9c3c1c	libcontainer: map PidsLimit to systemd's TasksMax property Currently runc applies PidsLimit restriction by writing directly to cgroup's pids.max, without notifying systemd. As a consequence, when the later updates the context of the corresponding scope, pids.max is reset to the value of systemd's TasksMax property. This can be easily reproduced this way (I'm using "postfix" here just an example, any unrelated but existing service will do): # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h` # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max 111 # systemctl disable --now postfix # systemctl enable --now postfix # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max max This patch adds TasksAccounting=true and TasksMax=PidsLimit to the properties sent to systemd. Signed-off-by: Sergio Lopez <slp@redhat.com>	2018-10-24 17:20:27 +02:00
Aleksa Sarai	e93996674f	merge branch 'pr-1903' clarify license information LGTMs: @hqhq @cyphar Closes #1903	2018-10-24 22:03:44 +11:00
Aleksa Sarai	9a3a8a5ebf	libcontainer: implement CLONE_NEWCGROUP This is a very simple implementation because it doesn't require any configuration unlike the other namespaces, and in its current state it only masks paths. This feature is available in Linux 4.6+ and is enabled by default for kernels compiled with CONFIG_CGROUP=y. Signed-off-by: Aleksa Sarai <asarai@suse.de> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2018-10-23 16:23:00 -04:00
Xiaochen Shen	6c307f8ff2	libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors Linux kernel v4.15 introduces better diagnostics for Intel RDT operation errors. If any error returns when making new directories or writing to any of the control file in resctrl filesystem, reading file /sys/fs/resctrl/info/last_cmd_status could provide more information that can be conveyed in the error returns from file operations. Some examples: echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata -bash: echo: write error: Invalid argument cat /sys/fs/resctrl/info/last_cmd_status mask f3 has non-consecutive 1-bits echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata -bash: echo: write error: Invalid argument cat /sys/fs/resctrl/info/last_cmd_status MB value 0 out of range [10,100] cd /sys/fs/resctrl mkdir 1 2 3 4 5 6 7 8 mkdir: cannot create directory '8': No space left on device cat /sys/fs/resctrl/info/last_cmd_status out of CLOSIDs See 'last_cmd_status' for more details in kernel documentation: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt In runc, we could append the diagnostics information to the error message of Intel RDT operation errors to provide more user-friendly information. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-19 00:16:08 +08:00
Mrunal Patel	c2ab1e656e	Merge pull request #1910 from adrianreber/tip Fix travis Go: tip	2018-10-17 12:47:08 -07:00
Michael Crosby	58592df567	Merge pull request #1880 from AkihiroSuda/fix-subgid libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs	2018-10-16 15:21:51 -04:00
Xiaochen Shen	d59b17d6d5	libcontainer: intelrdt: Add more check if sub-features are enabled Double check if Intel RDT sub-features are available in "resource control" filesystem. Intel RDT sub-features can be selectively disabled or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and newer kernel. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:44 +08:00
Xiaochen Shen	f097339289	libcontainer: intelrdt: add test cases for Intel RDT/MBA Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:39 +08:00
Xiaochen Shen	27560ace2f	libcontainer: intelrdt: add support for Intel RDT/MBA in runc Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttle over memory bandwidth for the software. A user controls the resource by indicating the percentage of maximum memory bandwidth. Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \| \|-- cbm_mask \| \| \|-- min_cbm_bits \| \| \|-- num_closids \| \|-- MB \| \|-- bandwidth_gran \| \|-- delay_linear \| \|-- min_bandwidth \| \|-- num_closids \|-- ... \|-- schemata \|-- tasks \|-- <container_id> \|-- ... \|-- schemata \|-- tasks For MBA support for `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT which implemented in #1279. We could also make use of `tasks` and `schemata` configuration for memory bandwidth resource constraints. The file `tasks` has a list of tasks that belongs to this group (e.g., <container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format. Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage. Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware. For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth of 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1. "linux": { "intelRdt": { "memBwSchema": "MB:0=20;1=70" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:29 +08:00
Xiaochen Shen	c1cece7e23	libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:28:19 +08:00
Mrunal Patel	a00bf01908	Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)	2018-10-15 17:32:15 -07:00
Dominik Süß	0b412e9482	various cleanups to address linter issues Signed-off-by: Dominik Süß <dominik@suess.wtf>	2018-10-13 21:14:03 +02:00
Adrian Reber	0d01164756	Fix travis Go: tip This fixes libcontainer/container_linux.go:1200: Error call has possible formatting directive %s Signed-off-by: Adrian Reber <areber@redhat.com>	2018-10-13 10:44:07 +00:00
Aleksa Sarai	e40d4635c4	merge branch 'pr-1894' Move spec.Linux.IntelRdt check to spec.Linux != nil block LGTMs: @crosbymichael @cyphar Closes #1894	2018-10-09 02:41:13 +11:00
Jonathan Marler	1499c746a1	Move spec.Linux.IntelRdt check to spec.Linux != nil block Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>	2018-10-04 21:30:55 -06:00
Mike Brown	26bdc0dce7	clarify license information Signed-off-by: Mike Brown <brownwm@us.ibm.com>	2018-10-03 10:39:44 -05:00
Mrunal Patel	2abd837c8c	Merge pull request #1893 from cyphar/keyctl-ignore-enosys keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)	2018-09-25 13:35:16 -07:00
Danail Branekov	a1d5398afa	Respect container's cgroup path Respect the container's cgroup path when finding the container's cgroup mount point, which is useful in multi-tenant environments, where containers have their own unique cgroup mounts Signed-off-by: Danail Branekov <danailster@gmail.com> Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io> Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>	2018-09-25 17:43:36 +01:00
Aleksa Sarai	578fe65e4f	merge branch 'pr-1817' Fix duplicate entries and missing entries in getCgroupMountsHelper Add test for testing cgroup mounts on bedrock linux Stop relying on number of subsystems for cgroups LGTMs: @crosbymichael @cyphar Closes #1817	2018-09-19 19:48:17 +10:00
Michael Crosby	cc8146cf93	Merge pull request #1858 from marcov/nsenter-README Update outdated nsenter README content	2018-09-17 10:53:19 -04:00
Michael Crosby	d77251d5fc	Merge pull request #1892 from Ace-Tang/add_clean_test test: add more test case for CleanPath	2018-09-17 10:51:17 -04:00
Aleksa Sarai	40f1468413	keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING) While all modern kernels (and I do mean _all_ of them -- this syscall was added in 2.6.10 before git had begun development!) have support for this syscall, LXC has a default seccomp profile that returns ENOSYS for this syscall. For most syscalls this would be a deal-breaker, and our use of session keyrings is security-based there are a few mitigating factors that make this change not-completely-insane: * We already have a flag that disables the use of session keyrings (for older kernels that had system-wide keyring limits and so on). So disabling it is not a new idea. * While the primary justification of using session keys is security-based, it's more of a security-by-obscurity protection. The main defense keyrings have is VFS credentials -- which is something that users already have better security tools for (setuid(2) and user namespaces). * Given the security justification you might argue that we shouldn't silently ignore this. However, the only way for the kernel to return -ENOSYS is either being ridiculously old (at which point we wouldn't work anyway) or that there is a seccomp profile in place blocking it. Given that the seccomp profile (if malicious) could very easily just return 0 or a silly return code (or something even more clever with seccomp-bpf) and trick us without this patch, there isn't much of a significant change in how much seccomp can trick us with or without this patch. Given all of that over-analysis, I'm pretty convinced there isn't a security problem in this very specific case and it will help out the ChromeOS folks by allowing Docker to run inside their LXC container setup. I'd be happy to be proven wrong. Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565 Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-09-17 21:38:30 +10:00

1 2 3 4 5 ...

1199 Commits