When built with nokmem we explicitly are disabling support for kmemcg,
but it is a strict specification requirement that if we cannot fulfil an
aspect of the container configuration that we error out.
Completely ignoring explicitly-requested kmemcg limits with nokmem would
undoubtably lead to problems.
Fixes: 6a2c155968 ("libcontainer: ability to compile without kmem")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Commit fe898e7862 (PR #1350) enables kernel memory accounting
for all cgroups created by libcontainer -- even if kmem limit is
not configured.
Kernel memory accounting is known to be broken in some kernels,
specifically the ones from RHEL7 (including RHEL 7.5). Those
kernels do not support kernel memory reclaim, and are prone to
oopses. Unconditionally enabling kmem acct on such kernels lead
to bugs, such as
* https://github.com/opencontainers/runc/issues/1725
* https://github.com/kubernetes/kubernetes/issues/61937
* https://github.com/moby/moby/issues/29638
This commit gives a way to compile runc without kernel memory setting
support. To do so, use something like
make BUILDTAGS="seccomp nokmem"
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts
Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.
An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.
Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Currently Manager accepts nil cgroups when calling Apply, but it will panic then trying to call Destroy with the same config.
Signed-off-by: Denys Smirnov <denys@sourced.tech>
This commit ensures we write the expected freezer cgroup state after
every state check, in case the state check does not give the expected
result. This can happen when a new task is created and prevents the
whole cgroup to be FROZEN, leaving the state into FREEZING instead.
This patch prevents the case of an infinite loop to happen.
Fixes https://github.com/opencontainers/runc/issues/1609
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.
This includes a manual edit of the docker term package to also correct the name there too.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
replace #1492#1494fix#1422
Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Runc needs to copy certain files from the top of the cgroup cpuset hierarchy
into the container's cpuset cgroup directory. Currently, runc determines
which directory is the top of the hierarchy by using the parent dir of
the first entry in /proc/self/mountinfo of type cgroup.
This creates problems when cgroup subsystems are mounted arbitrarily in
different dirs on the host.
Now, we use the most deeply nested mountpoint that contains the
container's cpuset cgroup directory.
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
Signed-off-by: Will Martin <wmartin@pivotal.io>
Fixes: #1347Fixes: #1083
The root cause of #1083 is because we're joining an
existed cgroup whose kmem accouting is not initialized,
and it has child cgroup or tasks in it.
Fix it by checking if the cgroup is first time created,
and we should enable kmem accouting if the cgroup is
craeted by libcontainer with or without kmem limit
configed. Otherwise we'll get issue like #1347
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In the cases that we got failure on a subsystem's Apply,
we'll get some subsystems' cgroup directories leftover.
On Docker's point of view, start a container failed, use
`docker rm` to remove the container, but some cgroup files
are leftover.
Sometimes we don't want to clean everyting up when something
went wrong, because we need these inter situation
information to debug what's going on, but cgroup directories
are not useful information we want to keep.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This PR fix issue in this scenario:
```
in terminal 1:
~# cd /sys/fs/cgroup/cpuset
~# mkdir test
~# cd test
~# cat cpuset.cpus
0-3
~# echo 1 > cpuset.cpu_exclusive (make sure you don't have other cgroups under root)
in terminal 2:
~# echo $$ > /sys/fs/cgroup/cpuset/test/tasks
// set resources.cpu.cpus="0-2" in config.json
~# runc run test1
back to terminal 1:
~# cd test1
~# cat cpuset.cpus
0-2
~# echo 1 > cpuset.cpu_exclusive
in terminal 3:
~# echo $$ > /sys/fs/cgroup/test/tasks
// set resources.cpu.cpus="3" in config.json
~# runc run test2
container_linux.go:247: starting container process caused "process_linux.go:258:
applying cgroup configuration for process caused \"failed to write 0-3\\n to
cpuset.cpus: write /sys/fs/cgroup/cpuset/test2/cpuset.cpus: invalid argument\""
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
cgroupData.join method using `WriteCgroupProc` to place the pid into
the proc file, it can avoid attach any pid to the cgroup if -1 is
specified as a pid.
so, replace `writeFile` with `WriteCgroupProc` like `cpuset.go`'s
ApplyDir method.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
After #1009, we don't always set `cgroup.Paths`, so
`getCgroupPath()` will return wrong cgroup path because
it'll take current process's cgroup as the parent, which
would be wrong when we try to find the cgroup path in
`runc ps` and `runc kill`.
Fix it by using `m.GetPath()` to get the true cgroup
paths.
Reported-by: Yang Shukui <yangshukui@huawei.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Revert: #935Fixes: #946
I can reproduce #946 on some machines, the problem is on
some machines, it could be very fast that modify time
of `memory.kmem.limit_in_bytes` could be the same as
before it's modified.
And now we'll call `SetKernelMemory` twice on container
creation which cause the second time failure.
Revert this before we find a better solution.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>