This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).
In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).
This is one of many changes needed to clean up the devices cgroup code.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The change ensures that the passed in value of NoNewPrivileges under spec.Process
is reflected in the container config generated by specconv.CreateLibcontainerConfig
Closes#2397
Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.
This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUsec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.
This turned out a bit more tricky to implement when I was
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).
This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.
Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.
Example usage: add the following to runtime spec (config.json):
```
"annotations": {
"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
"org.systemd.property.CollectMode":"'inactive-or-failed'"
},
```
and start the container (e.g. `runc --systemd-cgroup run $ID`).
The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".
The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.
In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.
NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.
[1] https://developer.gnome.org/glib/stable/gvariant-text.html
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using directly libcontainer API.
Signed-off-by: Julio Montes <julio.montes@intel.com>
We discovered in umoci that setting a dummy type of "none" would result
in file-based bind-mounts no longer working properly, which is caused by
a restriction for when specconv will change the device type to "bind" to
work around rootfs_linux.go's ... issues.
However, bind-mounts don't have a type (and Linux will ignore any type
specifier you give it) because the type is copied from the source of the
bind-mount. So we should always overwrite it to avoid user confusion.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.
This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
When joining an existing namespace, don't default to configuring a
loopback interface in that namespace.
Its creator should have done that, and we don't want to fail to create
the container when we don't have sufficient privileges to configure the
network namespace.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:
> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.
Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The function is called even if the usernamespace is not set.
This results having wrong uid/gid set on devices.
This fix add a test to check if usernamespace is set befor calling
setupUserNamespace.
Fixes#1742
Signed-off-by: Julien Lavesque <julien.lavesque@gmail.com>