The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.
To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.
End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.
While at it
- add `rootless` flag which is passed on to fs2 manager
- merge getv2Path() into GetUnifiedPath(), don't overwrite
path if it is set during initialization (on Load).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.
Inspired by containerd/cgroups#109
Fix#2157
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.
Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.
An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.
Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This allows runc to be used as a target for docker's reexec module that
depends on a correct argv0 to select which process entrypoint to invoke.
Without this patch, when runc re-execs argv0 is set to "/proc/self/exe"
and the reexec module doesn't know what to do with it.
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
In current implementation:
Either Intel RDT is not enabled by hardware and kernel, or intelRdt is
not specified in original config, we don't init IntelRdtManager in the
container to handle intelrdt constraint. It is a tradeoff that Intel RDT
has hardware limitation to support only limited number of groups.
This patch makes a minor change to support update command:
Whether or not intelRdt is specified in config, we always init
IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is
not specified in original config, we just don't Apply() to create
intelrdt group or attach tasks for this container.
In update command, we could re-enable through IntelRdtManager.Apply()
and then update intelrdt constraint.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is the follow-up PR of #1279 to fix remaining issues:
Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).
Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
It appears as though these semantics were not fully thought out when
implementing them for rootless containers. It is not necessary (and
could be potentially dangerous) to set the owner of /run/ctr/$id to be
the root inside the container (if user namespaces are being used).
Instead, just use the e{g,u}id of runc to determine the owner.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
It should not be binded to container creation, for
example, runc restore needs to create a
libcontainer.Container, but it won't need exec fifo.
So create exec fifo when container is started or run,
where we really need it.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
If a relative pathed exe is used for InitArgs init will fail to run if Cwd is not set the original path.
Prevent failure of init to run by ensuring that exe in InitArgs is an absolute path.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.
To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).
There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.
Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Ensure that the pipe is always closed during the error processing of StartInitialization.
Also:
* Fix a comment typo.
* Use newContainerInit directly as there's no need for i to be an initer.
* Move the comment about the behaviour of Init() directly above it, clarifying what happens for all defers.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
To make the code cleaner, and more clear, refactor the syncT handling
used when creating the `runc init` process. In addition, document the
state changes so that people actually understand what is going on.
Rather than only using syncT for the standard initProcess, use it for
both initProcess and setnsProcess. This removes some special cases, as
well as allowing for the use of syncT with setnsProcess.
Also remove a bunch of the boilerplate around syncT handling.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Print the error message to stderr if we are unable to return it back via
the pipe to the parent process. Also, don't panic here as it is most
likely a system or user error and not a programmer error.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
grep -r "range map" showw 3 parts use map to
range enum types, use slice instead can get
better performance and less memory usage.
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
Signed-off-by: rajasec <rajasec79@gmail.com>
Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>
Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>
Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>
1. According to docs of Cmd.Path and Cmd.Args from package "os/exec":
Path is the path of the command to run. Args holds command line
arguments, including the command as Args[0]. We have mixed usage
of args. In InitPath(), InitArgs only take arguments, in InitArgs(),
InitArgs including the command as Args[0]. This is confusing.
2. InitArgs() already have the ability to configure a LinuxFactory
with the provided absolute path to the init binary and arguements as
InitPath() does.
3. exec.Command() will take care of serching executable path.
4. The default "/proc/self/exe" instead of os.Args[0] is passed to
InitArgs in order to allow relative path for the runC binary.
Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>
This removes the use of a signal handler and SIGCONT to signal the init
process to exec the users process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>