jasder/runc - runc - 军科开源项目托管

Commit Graph

Author	SHA1	Message	Date
Xiaochen Shen	27560ace2f	libcontainer: intelrdt: add support for Intel RDT/MBA in runc Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttle over memory bandwidth for the software. A user controls the resource by indicating the percentage of maximum memory bandwidth. Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \| \|-- cbm_mask \| \| \|-- min_cbm_bits \| \| \|-- num_closids \| \|-- MB \| \|-- bandwidth_gran \| \|-- delay_linear \| \|-- min_bandwidth \| \|-- num_closids \|-- ... \|-- schemata \|-- tasks \|-- <container_id> \|-- ... \|-- schemata \|-- tasks For MBA support for `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT which implemented in #1279. We could also make use of `tasks` and `schemata` configuration for memory bandwidth resource constraints. The file `tasks` has a list of tasks that belongs to this group (e.g., <container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format. Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage. Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware. For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth of 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1. "linux": { "intelRdt": { "memBwSchema": "MB:0=20;1=70" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:29 +08:00
Akihiro Suda	06f789cf26	Disable rootless mode except RootlessCgMgr when executed as the root in userns This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and `RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc. `RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in the current user namespace. `RootlessEUID` is almost identical to the former `Rootless` except cgroups stuff. `RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups. `RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace. Otherwise `RootlessCgroups` is set to true. (Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well) When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes), `RootlessEUID` is set to false but `RootlessCgroups` is set to true. So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored. This PR does not have any impact on CLI flags and `state.json`. Note about CLI: * Now `runc --rootless=(auto\|true\|false)` CLI flag is only used for setting `RootlessCgroups`. * Now `runc spec --rootless` is only required when `RootlessEUID` is set to true. For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of UID/GID are mapped. Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`): * `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility. (`/run/runc` is used) * If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`. This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`. Note about `state.json`: * `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-09-07 15:05:03 +09:00
Aleksa Sarai	823c06eae9	libcontainer: improve "kernel.{domainname,hostname}" sysctl handling These sysctls are namespaced by CLONE_NEWUTS, and we need to use "kernel.domainname" if we want users to be able to set an NIS domainname on Linux. However we disallow "kernel.hostname" because it would conflict with the "hostname" field and cause confusion (but we include a helpful message to make it clearer to the user). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-06-18 21:48:04 +10:00
Giuseppe Scrivano	cbcc85d311	runc: not require uid/gid mappings if euid()==0 When running in a new unserNS as root, don't require a mapping to be present in the configuration file. We are already skipping the test for a new userns to be present. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2018-06-12 12:45:54 +02:00
Qiang Huang	dd67ab10d7	Merge pull request #1759 from cyphar/rootless-erofs-as-eperm rootless: cgroup: treat EROFS as a skippable error	2018-05-25 09:24:16 +08:00
Tamal Saha	58415b4b12	Fix error message Signed-off-by: Tamal Saha <tamal@appscode.com>	2018-03-21 20:52:09 -07:00
Aleksa Sarai	fd3a6e6c83	libcontainer: handle unset oomScoreAdj corectly Previously if oomScoreAdj was not set in config.json we would implicitly set oom_score_adj to 0. This is not allowed according to the spec: > If oomScoreAdj is not set, the runtime MUST NOT change the value of > oom_score_adj. Change this so that we do not modify oom_score_adj if oomScoreAdj is not present in the configuration. While this modifies our internal configuration types, the on-disk format is still compatible. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-03-17 13:53:42 +11:00
Daniel Dao	8898b6b446	remove placeholder for non-linux platforms runc currently only support Linux platform, and since we dont intend to expose the support to other platform, removing all other platforms placeholder code. `libcontainer/configs` still being used in https://github.com/moby/moby/blob/master/daemon/daemon_windows.go so keeping it for now. After this, we probably should also rename files to drop linux suffices if possible. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-11-24 18:14:51 +00:00
Tobias Klauser	4d27f20db0	libcontainer: drop FreeBSD support runc is not supported on FreeBSD, so remove all FreeBSD specific bits. As suggested by @crosbymichael in #1653 Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-11-24 14:51:05 +01:00
Will Martin	ca4f427af1	Support cgroups with limits as rootless Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io> Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>	2017-10-05 11:22:54 +01:00
Giuseppe Scrivano	3282f5a7c1	tests: fix for rootless multiple uids/gids Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	d8b669400a	rootless: allow multiple user/group mappings Take advantage of the newuidmap/newgidmap tools to allow multiple users/groups to be mapped into the new user namespace in the rootless case. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [ rebased to handle intelrdt changes. ] Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Xiaochen Shen	88d22fde40	libcontainer: intelrdt: use init() to avoid race condition This is the follow-up PR of #1279 to fix remaining issues: Use init() to avoid race condition in IsIntelRdtEnabled(). Add also rename some variables and functions. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-08 17:15:31 +08:00
Xiaochen Shen	692f6e1e27	libcontainer: add support for Intel RDT/CAT in runc About Intel RDT/CAT feature: Intel platforms with new Xeon CPU support Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which currently supports L3 cache resource allocation. This feature provides a way for the software to restrict cache allocation to a defined 'subset' of L3 cache which may be overlapping with other 'subsets'. The different subsets are identified by class of service (CLOS) and each CLOS has a capacity bitmask (CBM). For more information about Intel RDT/CAT can be found in the section 17.17 of Intel Software Developer Manual. About Intel RDT/CAT kernel interface: In Linux 4.10 kernel or newer, the interface is defined and exposed via "resource control" filesystem, which is a "cgroup-like" interface. Comparing with cgroups, it has similar process management lifecycle and interfaces in a container. But unlike cgroups' hierarchy, it has single level filesystem layout. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \|-- cbm_mask \| \|-- min_cbm_bits \| \|-- num_closids \|-- cpus \|-- schemata \|-- tasks \|-- <container_id> \|-- cpus \|-- schemata \|-- tasks For runc, we can make use of `tasks` and `schemata` configuration for L3 cache resource constraints. The file `tasks` has a list of tasks that belongs to this group (e.g., <container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub group, it Is in root group. The file `schemata` has allocation bitmasks/values for L3 cache on each socket, which contains L3 cache id and capacity bitmask (CBM). Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..." For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0` which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0. The valid L3 cache CBM is a contiguous bits set and number of bits that can be set is less than the max bit. The max bits in the CBM is varied among supported Intel Xeon platforms. In Intel RDT "resource control" filesystem layout, the CBM in a group should be a subset of the CBM in root. Kernel will check if it is valid when writing. e.g., 0xfffff in root indicates the max bits of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc. For more information about Intel RDT/CAT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the default CBM is 0xfffff and the max CBM length is 20 bits. With this configuration, tasks inside the container only have access to the "upper" 80% of L3 cache id 0 and the "lower" 50% L3 cache id 1: "linux": { "intelRdt": { "l3CacheSchema": "L3:0=ffff0;1=3ff" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Daniel, Dao Quang Minh	b313a75364	Merge pull request #1477 from yummypeng/save-own-ns-path Always save own namespace paths	2017-08-02 11:24:30 +01:00
Steven Hartland	ee4f68e302	Updated logrus to v1 Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen. This includes a manual edit of the docker term package to also correct the name there too. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-07-19 15:20:56 +00:00
Yuanhong Peng	e939079acf	Always save own namespace paths fix #1476 If containerA shares namespace, say ipc namespace, with containerB, then its ipc namespace path would be the same as containerB and be stored in `state.json`. Exec into containerA will just read the namespace paths stored in this file and join these namespaces. So, if containerB has already been stopped, `docker exec containerA` will fail. To address this issue, we should always save own namespace paths no matter if we share namespaces with other containers. Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>	2017-07-13 16:13:05 +08:00
Justin Cormack	3d9074ead3	Update memory specs to use int64 not uint64 replace #1492 #1494 fix #1422 Since https://github.com/opencontainers/runtime-spec/pull/876 the memory specifications are now `int64`, as that better matches the visible interface where `-1` is a valid value. Otherwise finding the correct value was difficult as it was kernel dependent. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2017-06-27 12:16:07 +01:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the forseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Justin Cormack	4c67360296	Clean up unix vs linux usage FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported in runc anyway right now. So clean up the file naming to use `_linux` where appropriate. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2017-05-12 17:22:09 +01:00
Harshal Patil	22953c122f	Remove redundant declaraion of namespace slice Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-05-02 10:04:57 +05:30
Daniel, Dao Quang Minh	13a8c5d140	Merge pull request #1365 from hqhq/use_go_selinux Use opencontainers/selinux package	2017-04-15 14:22:32 +01:00
Aleksa Sarai	f0876b0427	libcontainer: configs: add proper HostUID and HostGID Previously Host{U,G}ID only gave you the root mapping, which isn't very useful if you are trying to do other things with the IDMaps. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:46:20 +11:00
Aleksa Sarai	d2f49696b0	runc: add support for rootless containers This enables the support for the rootless container mode. There are many restrictions on what rootless containers can do, so many different runC commands have been disabled: * runc checkpoint * runc events * runc pause * runc ps * runc restore * runc resume * runc update The following commands work: * runc create * runc delete * runc exec * runc kill * runc list * runc run * runc spec * runc state In addition, any specification options that imply joining cgroups have also been disabled. This is due to support for unprivileged subtree management not being available from Linux upstream. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:45:24 +11:00
Qiang Huang	5e7b48f7c0	Use opencontainers/selinux package It's splitted as a separate project. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-03-23 08:21:19 +08:00
Qiang Huang	8430cc4f48	Use uint64 for resources to keep consistency with runtime-spec Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-03-20 18:51:39 +08:00
Mrunal Patel	4f9cb13b64	Update runtime spec to 1.0.0.rc5 Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2017-03-15 11:38:37 -07:00
Qiang Huang	f3c16acd47	Fix go_vet errors runc/libcontainer/configs/namespaces_syscall_unsupported.go Line 7: error: unreachable code (vet) Line 14: error: unreachable code (vet) Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-01-06 10:20:27 +08:00
Zhang Wei	a344b2d6a8	sync up `HookState` with OCI spec `State` `HookState` struct should follow definition of `State` in runtime-spec: * modify json name of `version` to `ociVersion`. * Remove redundant `Rootfs` field as rootfs can be retrived from `bundlePath/config.json`. Signed-off-by: Zhang Wei <zhangwei555@huawei.com>	2016-12-20 00:00:43 +08:00
Samuel Ortiz	f19aa2d04d	validate: Check that the given namespace path is a symlink When checking if the provided networking namespace is the host one or not, we should first check if it's a symbolic link or not as in some cases we can use persistent networking namespace under e.g. /var/run/netns/. Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>	2016-12-10 11:14:49 +01:00
Qiang Huang	81d6088c8f	Unify rootfs validation Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2016-10-29 10:31:44 +08:00
Michael Crosby	6328410520	Merge pull request #1149 from cyphar/fix-sysctl-validation validator: unbreak sysctl net.* validation	2016-10-26 09:06:41 -07:00
Aleksa Sarai	1ab3c035d2	validator: actually test success Previously we only tested failures, which causes us to miss issues where setting sysctls would always fail. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-10-26 23:07:57 +11:00
Aleksa Sarai	2a94c3651b	validator: unbreak sysctl net.* validation When changing this validation, the code actually allowing the validation to pass was removed. This meant that any net.* sysctl would always fail to validate. Fixes: `bc84f83344` ("fix docker/docker#27484") Reported-by: Justin Cormack <justin.cormack@docker.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-10-26 22:58:51 +11:00
Qiang Huang	157a96a428	Merge pull request #977 from cyphar/nsenter-userns-ordering nsenter: guarantee correct user namespace ordering	2016-10-26 16:45:15 +08:00
Qiang Huang	4ec570d060	Merge pull request #1138 from gaocegege/fix-config-validator docker/docker#27484-check if sysctls are used in host network mode.	2016-10-25 11:08:51 +08:00
Aleksa Sarai	c7ed2244f4	merge branch 'pr-1125' LGTMs: @hqhq @mrunalp Closes #1125	2016-10-25 10:05:28 +11:00
Ce Gao	41c35810f2	add test cases about host ns Signed-off-by: Ce Gao <ce.gao@outlook.com>	2016-10-22 11:31:15 +08:00
Ce Gao	bc84f83344	fix docker/docker#27484 Signed-off-by: Ce Gao <ce.gao@outlook.com>	2016-10-22 11:22:52 +08:00
Alexander Morozov	1ab9d5e6f4	Merge pull request #845 from mrunalp/cp_tmpfs Add support for copying up directories into tmpfs when a tmpfs is mounted over them	2016-10-21 13:47:16 -07:00
Aleksa Sarai	f8e6b5af5e	rootfs: make pivot_root not use a temporary directory Namely, use an undocumented feature of pivot_root(2) where pivot_root(".", ".") is actually a feature and allows you to make the old_root be tied to your /proc/self/cwd in a way that makes unmounting easy. Thanks a lot to the LXC developers which came up with this idea first. This is the first step of many to allowing runC to work with a completely read-only rootfs. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-10-20 12:55:58 +11:00
Daniel Dao	1b876b0bf2	fix typos with misspell pipe the source through https://github.com/client9/misspell. typos be gone! Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2016-10-11 23:22:48 +00:00
Aleksa Sarai	ed053a740c	nsenter: specify namespace type in setns() This avoids us from running into cases where libcontainer thinks that a particular namespace file is a different type, and makes it a fatal error rather than causing broken functionality. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-10-04 16:17:55 +11:00
Mrunal Patel	f5103d311e	config: Add new Extensions flag to support custom mount options in runc Also, defines a EXT_COPYUP flag for supporting tmpfs copyup operation. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2016-09-30 09:46:46 -07:00
Michael Crosby	ad400bb093	Change netclassid json tag This allows older state files to be loaded without the unmarshal error of the string to int conversion. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-09-12 09:31:58 -07:00
xiekeyang	206fea7f50	remove unused code Signed-off-by: xiekeyang <xiekeyang@huawei.com>	2016-08-22 17:16:45 +08:00
Justin Cormack	834e53144b	Do not create /dev/fuse by default This device is not required by the OCI spec. The rationale for this was linked to https://github.com/docker/docker/issues/2393 So a non functional /dev/fuse was created, and actual fuse use still is required to add the device explicitly. However even old versions of the JVM on Ubuntu 12.04 no longer require the fuse package, and this is all not needed. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2016-08-12 13:00:24 +01:00
Alexander Morozov	7679c80be5	libcontainer/configs: make hooks run safer It's possible that `cmd.Process` is still nil when we reach timeout. Start creates `Process` field synchronously, and there is no way to such race. Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>	2016-08-08 10:16:35 -07:00
Qiang Huang	1a81e9ab1f	Merge pull request #958 from dubstack/skip-devices Skip updates on parent Devices cgroup	2016-07-29 10:31:49 +08:00
Buddha Prakash	ef4ff6a8ad	Skip updates on parent Devices cgroup Signed-off-by: Buddha Prakash <buddhap@google.com>	2016-07-25 10:30:46 -07:00

1 2 3

121 Commits