jasder/runc - runc - 军科开源项目托管

Commit Graph

Author	SHA1	Message	Date
Akihiro Suda	88e8350de2	cgroup2: split fs2 from fs split fs2 package from fs, as mixing up fs and fs2 is very likely to result in unmaintainable code. Inspired by containerd/cgroups#109 Fix #2157 Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>	2019-12-06 15:42:10 +09:00
Filipe Brandenburger	46351eb3d1	Move systemd.Manager initialization into a function in that module This will permit us to extend the internals of systemd.Manager to include further information about the system, such as whether cgroupv1, cgroupv2 or both are in effect. Furthermore, it allows a future refactor of moving more of UseSystemd() code into the factory initialization function. Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>	2019-05-01 13:22:19 -07:00
Xiaochen Shen	27560ace2f	libcontainer: intelrdt: add support for Intel RDT/MBA in runc Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature of Intel Resource Director Technology (RDT) which is supported on some Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate throttle over memory bandwidth for the software. A user controls the resource by indicating the percentage of maximum memory bandwidth. Hardware details of Intel RDT/MBA can be found in section 17.18 of Intel Software Developer Manual: https://software.intel.com/en-us/articles/intel-sdm In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and `mba` will be set in /proc/cpuinfo. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \| \|-- cbm_mask \| \| \|-- min_cbm_bits \| \| \|-- num_closids \| \|-- MB \| \|-- bandwidth_gran \| \|-- delay_linear \| \|-- min_bandwidth \| \|-- num_closids \|-- ... \|-- schemata \|-- tasks \|-- <container_id> \|-- ... \|-- schemata \|-- tasks For MBA support for `runc`, we will reuse the infrastructure and code base of Intel RDT/CAT which implemented in #1279. We could also make use of `tasks` and `schemata` configuration for memory bandwidth resource constraints. The file `tasks` has a list of tasks that belongs to this group (e.g., <container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. The file `schemata` has a list of all the resources available to this group. Each resource (L3 cache, memory bandwidth) has its own line and format. Memory bandwidth schema: It has allocation values for memory bandwidth on each socket, which contains L3 cache id and memory bandwidth percentage. Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..." The minimum bandwidth percentage value for each CPU model is predefined and can be looked up through "info/MB/min_bandwidth". The bandwidth granularity that is allocated is also dependent on the CPU model and can be looked up at "info/MB/bandwidth_gran". The available bandwidth control steps are: min_bw + N * bw_gran. Intermediate values are rounded to the next control step available on the hardware. For more information about Intel RDT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the minimum memory bandwidth of 10% with a memory bandwidth granularity of 10%. Tasks inside the container may use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1. "linux": { "intelRdt": { "memBwSchema": "MB:0=20;1=70" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2018-10-16 14:29:29 +08:00
Akihiro Suda	06f789cf26	Disable rootless mode except RootlessCgMgr when executed as the root in userns This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and `RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc. `RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in the current user namespace. `RootlessEUID` is almost identical to the former `Rootless` except cgroups stuff. `RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups. `RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace. Otherwise `RootlessCgroups` is set to true. (Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well) When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes), `RootlessEUID` is set to false but `RootlessCgroups` is set to true. So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored. This PR does not have any impact on CLI flags and `state.json`. Note about CLI: * Now `runc --rootless=(auto\|true\|false)` CLI flag is only used for setting `RootlessCgroups`. * Now `runc spec --rootless` is only required when `RootlessEUID` is set to true. For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of UID/GID are mapped. Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`): * `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility. (`/run/runc` is used) * If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`. This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`. Note about `state.json`: * `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`. Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-09-07 15:05:03 +09:00
Lifubang	4eb30fcdbe	code optimization: use securejoin.SecureJoin and CleanPath Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-09-04 09:02:18 +08:00
Lifubang	4fae8fcce2	code optimization after review Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-09-03 23:27:31 +08:00
Lifubang	d2d226e8f9	fix unexpected delete bug when container id is .. Signed-off-by: Lifubang <lifubang@acmcoder.com>	2018-08-31 11:17:42 +08:00
Aleksa Sarai	03e585985f	rootless: cgroup: treat EROFS as a skippable error In some cases, /sys/fs/cgroups is mounted read-only. In rootless containers we can consider this effectively identical to having cgroups that we don't have write permission to -- because the user isn't responsible for the read-only setup and cannot modify it. The rules are identical to when /sys/fs/cgroups is not writable by the unprivileged user. An example of this is the default configuration of Docker, where cgroups are mounted as read-only as a preventative security measure. Reported-by: Vladimir Rutsky <rutsky@google.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-03-17 13:53:42 +11:00
Vincent Demeester	3ca4c78b1a	Import docker/docker/pkg/mount into runc This will help get rid of docker/docker dependency in runc 👼 Signed-off-by: Vincent Demeester <vincent@sbr.pm>	2017-11-08 16:25:58 +01:00
Aleksa Sarai	9b13f5cc7f	merge branch 'pr-1453' propagate argv0 when re-execing from /proc/self/exe LGTMs: @crosbymichael @cyphar Closes #1453	2017-10-17 03:12:22 +11:00
Michael Crosby	ff4481dbf6	Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups Support cgroups with limits as rootless	2017-10-16 12:03:49 -04:00
Petros Angelatos	8098828680	propagate argv0 when re-execing from /proc/self/exe This allows runc to be used as a target for docker's reexec module that depends on a correct argv0 to select which process entrypoint to invoke. Without this patch, when runc re-execs argv0 is set to "/proc/self/exe" and the reexec module doesn't know what to do with it. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>	2017-10-16 14:00:26 +02:00
Will Martin	ca4f427af1	Support cgroups with limits as rootless Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io> Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>	2017-10-05 11:22:54 +01:00
Xiaochen Shen	2549545df5	intelrdt: always init IntelRdtManager if Intel RDT is enabled In current implementation: Either Intel RDT is not enabled by hardware and kernel, or intelRdt is not specified in original config, we don't init IntelRdtManager in the container to handle intelrdt constraint. It is a tradeoff that Intel RDT has hardware limitation to support only limited number of groups. This patch makes a minor change to support update command: Whether or not intelRdt is specified in config, we always init IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is not specified in original config, we just don't Apply() to create intelrdt group or attach tasks for this container. In update command, we could re-enable through IntelRdtManager.Apply() and then update intelrdt constraint. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-20 01:37:31 +08:00
Giuseppe Scrivano	d8b669400a	rootless: allow multiple user/group mappings Take advantage of the newuidmap/newgidmap tools to allow multiple users/groups to be mapped into the new user namespace in the rootless case. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [ rebased to handle intelrdt changes. ] Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Xiaochen Shen	88d22fde40	libcontainer: intelrdt: use init() to avoid race condition This is the follow-up PR of #1279 to fix remaining issues: Use init() to avoid race condition in IsIntelRdtEnabled(). Add also rename some variables and functions. Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-08 17:15:31 +08:00
Xiaochen Shen	692f6e1e27	libcontainer: add support for Intel RDT/CAT in runc About Intel RDT/CAT feature: Intel platforms with new Xeon CPU support Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which currently supports L3 cache resource allocation. This feature provides a way for the software to restrict cache allocation to a defined 'subset' of L3 cache which may be overlapping with other 'subsets'. The different subsets are identified by class of service (CLOS) and each CLOS has a capacity bitmask (CBM). For more information about Intel RDT/CAT can be found in the section 17.17 of Intel Software Developer Manual. About Intel RDT/CAT kernel interface: In Linux 4.10 kernel or newer, the interface is defined and exposed via "resource control" filesystem, which is a "cgroup-like" interface. Comparing with cgroups, it has similar process management lifecycle and interfaces in a container. But unlike cgroups' hierarchy, it has single level filesystem layout. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ \|-- info \| \|-- L3 \| \|-- cbm_mask \| \|-- min_cbm_bits \| \|-- num_closids \|-- cpus \|-- schemata \|-- tasks \|-- <container_id> \|-- cpus \|-- schemata \|-- tasks For runc, we can make use of `tasks` and `schemata` configuration for L3 cache resource constraints. The file `tasks` has a list of tasks that belongs to this group (e.g., <container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub group, it Is in root group. The file `schemata` has allocation bitmasks/values for L3 cache on each socket, which contains L3 cache id and capacity bitmask (CBM). Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..." For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0` which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0. The valid L3 cache CBM is a contiguous bits set and number of bits that can be set is less than the max bit. The max bits in the CBM is varied among supported Intel Xeon platforms. In Intel RDT "resource control" filesystem layout, the CBM in a group should be a subset of the CBM in root. Kernel will check if it is valid when writing. e.g., 0xfffff in root indicates the max bits of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc. For more information about Intel RDT/CAT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the default CBM is 0xfffff and the max CBM length is 20 bits. With this configuration, tasks inside the container only have access to the "upper" 80% of L3 cache id 0 and the "lower" 50% L3 cache id 1: "linux": { "intelRdt": { "l3CacheSchema": "L3:0=ffff0;1=3ff" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Aleksa Sarai	7d66aab77a	init: switch away from stateDirFd entirely While we have significant protections in place against CVE-2016-9962, we still were holding onto a file descriptor that referenced the host filesystem. This meant that in certain scenarios it was still possible for a semi-privileged container to gain access to the host filesystem (if they had CAP_SYS_PTRACE). Instead, open the FIFO itself using a O_PATH. This allows us to reference the FIFO directly without providing the ability for directory-level access. When opening the FIFO inside the init process, open it through procfs to re-open the actual FIFO (this is currently the only supported way to open such a file descriptor). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-25 13:19:03 +10:00
Aleksa Sarai	7cfb107f2c	factory: use e{u,g}id as the owner of /run/runc/$id It appears as though these semantics were not fully thought out when implementing them for rootless containers. It is not necessary (and could be potentially dangerous) to set the owner of /run/ctr/$id to be the root inside the container (if user namespaces are being used). Instead, just use the e{g,u}id of runc to determine the owner. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-07-12 06:30:46 +10:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the forseeable future: Errno Signal SysProcAttr Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Harshal Patil	700c74cb7e	Issue #1429 : Removing check for id string length Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-05-04 09:21:29 +05:30
Aleksa Sarai	f0876b0427	libcontainer: configs: add proper HostUID and HostGID Previously Host{U,G}ID only gave you the root mapping, which isn't very useful if you are trying to do other things with the IDMaps. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:46:20 +11:00
Aleksa Sarai	baeef29858	rootless: add rootless cgroup manager The rootless cgroup manager acts as a noop for all set and apply operations. It is just used for rootless setups. Currently this is far too simple (we need to add opportunistic cgroup management), but is good enough as a first-pass at a noop cgroup manager. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:46:20 +11:00
Michael Crosby	00a0ecf554	Add separate console socket Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-03-16 10:23:59 -07:00
Qiang Huang	805b8c73d3	Do not create exec fifo in factory.Create It should not be binded to container creation, for example, runc restore needs to create a libcontainer.Container, but it won't need exec fifo. So create exec fifo when container is started or run, where we really need it. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-02-22 10:34:48 -08:00
Qiang Huang	be33383e60	Merge pull request #1293 from stevenh/resolve-initarg Resolve InitArgs to ensure init works	2017-02-03 19:25:52 +08:00
Steven Hartland	b9dfa444c4	Resolve InitArgs to ensure init works If a relative pathed exe is used for InitArgs init will fail to run if Cwd is not set the original path. Prevent failure of init to run by ensuring that exe in InitArgs is an absolute path. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-02-01 13:42:09 +00:00
Aleksa Sarai	e034cedce7	libcontainer: init: only pass stateDirFd when creating a container If we pass a file descriptor to the host filesystem while joining a container, there is a race condition where a process inside the container can ptrace(2) the joining process and stop it from closing its file descriptor to the stateDirFd. Then the process can access the host filesystem from that file descriptor. This was fixed in part by `5d93fed3d2` ("Set init processes as non-dumpable"), but that fix is more of a hail-mary than an actual fix for the underlying issue. To fix this, don't open or pass the stateDirFd to the init process unless we're creating a new container. A proper fix for this would be to remove the need for even passing around directory file descriptors (which are quite dangerous in the context of mount namespaces). There is still an issue with containers that have CAP_SYS_PTRACE and are using the setns(2)-style of joining a container namespace. Currently I'm not really sure how to fix it without rampant layer violation. Fixes: CVE-2016-9962 Fixes: `5d93fed3d2` ("Set init processes as non-dumpable") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-02-02 00:41:11 +11:00
Steven Hartland	64aa78b762	Ensure pipe is always closed on error in StartInitialization Ensure that the pipe is always closed during the error processing of StartInitialization. Also: * Fix a comment typo. * Use newContainerInit directly as there's no need for i to be an initer. * Move the comment about the behaviour of Init() directly above it, clarifying what happens for all defers. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-01-25 12:36:40 +00:00
Aleksa Sarai	4776b4326a	libcontainer: refactor syncT handling To make the code cleaner, and more clear, refactor the syncT handling used when creating the `runc init` process. In addition, document the state changes so that people actually understand what is going on. Rather than only using syncT for the standard initProcess, use it for both initProcess and setnsProcess. This removes some special cases, as well as allowing for the use of syncT with setnsProcess. Also remove a bunch of the boilerplate around syncT handling. This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:46:04 +11:00
Michael Crosby	fcc40b7a63	Remove panic from init Print the error message to stderr if we are unable to return it back via the pipe to the parent process. Also, don't panic here as it is most likely a system or user error and not a programmer error. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-10-17 15:54:51 -07:00
Qiang Huang	3597b7b743	Merge pull request #1087 from williammartin/master Fix typo when container does not exist	2016-09-29 09:19:45 +08:00
William Martin	152169ed34	Fix typo when container does not exist Signed-off-by: William Martin <wmartin@pivotal.io>	2016-09-28 11:00:50 +00:00
Peng Gao	c5393da813	Refactor enum map range to slice range grep -r "range map" showw 3 parts use map to range enum types, use slice instead can get better performance and less memory usage. Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>	2016-09-28 15:36:29 +08:00
rajasec	714550f87c	Error handling when container not exists Signed-off-by: rajasec <rajasec79@gmail.com> Error handling when container not exists Signed-off-by: rajasec <rajasec79@gmail.com> Error handling when container not exists Signed-off-by: rajasec <rajasec79@gmail.com> Error handling when container not exists Signed-off-by: rajasec <rajasec79@gmail.com>	2016-08-26 00:00:54 +05:30
Michael Crosby	46d9535096	Merge pull request #934 from macrosheep/fix-initargs Fix and refactor init args	2016-08-24 10:06:01 -07:00
Yang Hongyang	a59d63c5d3	Fix and refactor init args 1. According to docs of Cmd.Path and Cmd.Args from package "os/exec": Path is the path of the command to run. Args holds command line arguments, including the command as Args[0]. We have mixed usage of args. In InitPath(), InitArgs only take arguments, in InitArgs(), InitArgs including the command as Args[0]. This is confusing. 2. InitArgs() already have the ability to configure a LinuxFactory with the provided absolute path to the init binary and arguements as InitPath() does. 3. exec.Command() will take care of serching executable path. 4. The default "/proc/self/exe" instead of os.Args[0] is passed to InitArgs in order to allow relative path for the runC binary. Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>	2016-07-06 23:21:02 -04:00
Yang Hongyang	9ade2cc5ce	libcontainer: Add a helper func to set CriuPath Added a helper func to set CriuPath for LinuxFactory. Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>	2016-07-06 22:58:55 -04:00
Qiang Huang	14e95b2aa9	Make state detection precise Fixes: https://github.com/opencontainers/runc/issues/871 Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2016-07-05 08:24:13 +08:00
Michael Crosby	5ce88a95f6	Fix fifo usage with userns Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-06-13 20:20:48 -07:00
Michael Crosby	3aacff695d	Use fifo for create/start This removes the use of a signal handler and SIGCONT to signal the init process to exec the users process. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-06-13 11:26:53 -07:00
Michael Crosby	efcd73fb5b	Fix signal handling for unit tests Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:10:47 -07:00
Michael Crosby	3fc929f350	Only create a buffered channel of one Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:06:41 -07:00
Michael Crosby	30f1006b33	Fix libcontainer states Move initialized to created and destoryed to stopped. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:06:41 -07:00
Michael Crosby	3fe7d7f31e	Add create and start command for container lifecycle Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:06:41 -07:00
Alexander Morozov	d57898610b	Merge pull request #675 from pankit/master Allow + in container ID	2016-05-25 10:35:08 -07:00
Tonis Tiigi	78ecdfe18e	Show proper error from init process panic Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>	2016-03-22 15:57:15 -07:00
pankit thapar	4629512d89	Allow + in container ID Signed-off-by: pankit thapar <pankit@umich.edu>	2016-03-22 11:40:55 -04:00
rajasec	d4be3405c7	Fixing valid-id in regex Signed-off-by: rajasec <rajasec79@gmail.com>	2016-03-14 08:48:41 +05:30
Michael Crosby	213c1a1a4a	Revert "Return proper exit code for exec errors" This reverts commit `6bb653a6e8`. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-10 11:00:48 -08:00

1 2

68 Commits