This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).
In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
- /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information:
- /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
runc currently fails to build against the upstream version of
runtime-spec/specs-go.
```
# github.com/opencontainers/runc
./spec.go:189: cannot use specs.Linux literal (type specs.Linux) as type *specs.Linux in field value
```
on account of 63231576ec (diff-7f24d60f0cbb9c433e165467e3d34838R25)
This commit updates the dependency to current runtime-spec master and
fixes the type mismatch.
Fixes#1035
Signed-off-by: Adam Thomason <ad@mthomason.net>
/proc/timer_list seems to leak information about the host. Here is
an example from a busybox container running on docker+kubernetes.
# cat /proc/timer_list | grep -i -e kube
<ffff8800b8cc3db0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kubelet/2497
<ffff880129ac3db0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kube-proxy/3478
<ffff8800b1b77db0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kube-proxy/3470
<ffff8800bb6abdb0>, hrtimer_wakeup, S:01, futex_wait_queue_me, kubelet/2499
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Users of libcontainer other than runc may also require parsing and
converting specification configuration files.
Since runc cannot be imported, move the relevant functions and
definitions to a separate package, libcontainer/specconv.
Signed-off-by: Ido Yariv <ido@wizery.com>
systemd expects cgroupsPath to be of form "slice:prefix:name".
So dont call cleanPath on it anymore.
Signed-off-by: Anusha Ragunathan <anusha@docker.com>
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The error handling on the runc cli is currenly pretty messy because
messages to the user are split between regular stderr format and logrus
message format. This changes all the error reporting to the cli to only
output on stderr and exit(1) for consumers of the api.
By default logrus logs to /dev/null so that it is not seen by the user.
If the user wants extra and/or structured loggging/errors from runc they
can use the `--log` flag to provide a path to the file where they want
this information. This allows a consistent behavior on the cli but
extra power and information when debugging with logs.
This also includes a change to enable the same logging information
inside the container's init by adding an init cli command that can share
the existing flags for all other runc commands.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit adds support to libcontainer to allow caps, no new privs,
apparmor, and selinux process label to the process struct so that it can
be used together of override the base settings on the container config
per individual process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This bump of the spec includes a change to the deivce type to be a
string so that it is more readable in the json serialization.
It also includes the change were caps, no new privs, and process
labeling features are moved from the container config onto the process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This updates the current list to what we have now in docker and also
makes these always added so that these are masked out. Privileged
containers can always unmount these if they want to read from kcore or
something like that.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>