When built with nokmem we explicitly are disabling support for kmemcg,
but it is a strict specification requirement that if we cannot fulfil an
aspect of the container configuration that we error out.
Completely ignoring explicitly-requested kmemcg limits with nokmem would
undoubtably lead to problems.
Fixes: 6a2c155968 ("libcontainer: ability to compile without kmem")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).
And drop HookState, since there's no need for a local alias for
specs.State.
Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start(). I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.
I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2]. Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.
I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks. But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568). I've left that alone too.
I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty. Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard. I
think that ort of thing is very lo-risk, because:
* We shouldn't be copy/pasting this, right? DRY for the win :).
* There's only ever a few lines between the guard and the guarded
loop. That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these. Guarding with the wrong
slice is certainly not the only thing you can break with a sloppy
copy/paste.
But I'm not a maintainer ;).
[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570
Signed-off-by: W. Trevor King <wking@tremily.us>
MBA Software Controller feature is introduced in Linux kernel v4.18.
It is a software enhancement to mitigate some limitations in MBA which
describes in kernel documentation. It also makes the interface more user
friendly - we could specify memory bandwidth in "MBps" (Mega Bytes per
second) as well as in "percentages".
The kernel underneath would use a software feedback mechanism or a
"Software Controller" which reads the actual bandwidth using MBM
counters and adjust the memory bandwidth percentages to ensure:
"actual memory bandwidth < user specified memory bandwidth".
We could enable this feature through mount option "-o mba_MBps":
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
In runc, we handle both memory bandwidth schemata in unified format:
"MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The unit of memory bandwidth is specified in "percentages" by default,
and in "MBps" if MBA Software Controller is enabled.
For more information about Intel RDT and MBA Software Controller:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Commit fe898e7862 (PR #1350) enables kernel memory accounting
for all cgroups created by libcontainer -- even if kmem limit is
not configured.
Kernel memory accounting is known to be broken in some kernels,
specifically the ones from RHEL7 (including RHEL 7.5). Those
kernels do not support kernel memory reclaim, and are prone to
oopses. Unconditionally enabling kmem acct on such kernels lead
to bugs, such as
* https://github.com/opencontainers/runc/issues/1725
* https://github.com/kubernetes/kubernetes/issues/61937
* https://github.com/moby/moby/issues/29638
This commit gives a way to compile runc without kernel memory setting
support. To do so, use something like
make BUILDTAGS="seccomp nokmem"
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:
```
"namespaces": [
{
"type": "pid"
},
{
"type": "network"
},
{
"type": "ipc"
},
{
"type": "uts"
},
{
"type": "mount"
},
{
"type": "cgroup"
}
],
```
Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This is a regression from 06f789cf26
when the user namespace was configured without a privileged helper.
To allow a single mapping in an user namespace, it is necessary to set
/proc/self/setgroups to "deny".
For a simple reproducer, the user namespace can be created with
"unshare -r".
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.
This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):
# CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
# cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
111
# systemctl disable --now postfix
# systemctl enable --now postfix
# cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
max
This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.
Signed-off-by: Sergio Lopez <slp@redhat.com>
This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.
This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Linux kernel v4.15 introduces better diagnostics for Intel RDT operation
errors. If any error returns when making new directories or writing to
any of the control file in resctrl filesystem, reading file
/sys/fs/resctrl/info/last_cmd_status could provide more information that
can be conveyed in the error returns from file operations.
Some examples:
echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata
-bash: echo: write error: Invalid argument
cat /sys/fs/resctrl/info/last_cmd_status
mask f3 has non-consecutive 1-bits
echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata
-bash: echo: write error: Invalid argument
cat /sys/fs/resctrl/info/last_cmd_status
MB value 0 out of range [10,100]
cd /sys/fs/resctrl
mkdir 1 2 3 4 5 6 7 8
mkdir: cannot create directory '8': No space left on device
cat /sys/fs/resctrl/info/last_cmd_status
out of CLOSIDs
See 'last_cmd_status' for more details in kernel documentation:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
In runc, we could append the diagnostics information to the error
message of Intel RDT operation errors to provide more user-friendly
information.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Double check if Intel RDT sub-features are available in "resource
control" filesystem. Intel RDT sub-features can be selectively disabled
or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and
newer kernel.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.
Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm
In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| | |-- cbm_mask
| | |-- min_cbm_bits
| | |-- num_closids
| |-- MB
| |-- bandwidth_gran
| |-- delay_linear
| |-- min_bandwidth
| |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
|-- ...
|-- schemata
|-- tasks
For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.
The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.
For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.
"linux": {
"intelRdt": {
"memBwSchema": "MB:0=20;1=70"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts
Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Fix duplicate entries and missing entries in getCgroupMountsHelper
Add test for testing cgroup mounts on bedrock linux
Stop relying on number of subsystems for cgroups
LGTMs: @crosbymichael @cyphar
Closes#1817
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:
* We already have a flag that disables the use of session keyrings
(for older kernels that had system-wide keyring limits and so
on). So disabling it is not a new idea.
* While the primary justification of using session keys *is*
security-based, it's more of a security-by-obscurity protection.
The main defense keyrings have is VFS credentials -- which is
something that users already have better security tools for
(setuid(2) and user namespaces).
* Given the security justification you might argue that we
shouldn't silently ignore this. However, the only way for the
kernel to return -ENOSYS is either being ridiculously old (at
which point we wouldn't work anyway) or that there is a seccomp
profile in place blocking it.
Given that the seccomp profile (if malicious) could very easily
just return 0 or a silly return code (or something even more
clever with seccomp-bpf) and trick us without this patch, there
isn't much of a significant change in how much seccomp can trick
us with or without this patch.
Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.
Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
it is now allowed to bind mount /proc. This is useful for rootless
containers when the PID namespace is shared with the host.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
subgid is defined per user, not group (see subgid(5))
This commit also adds support for specifying subuid owner with a numeric UID.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.
If the network namespace is defined like
{
"type": "network",
"path": "/run/netns/test"
}
there is the expectation that the restored container is again running in
the network namespace specified with 'path'.
This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.
This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.
With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.
Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.
runc still falls back to the old behavior if CRIU older than 3.11 is
installed.
Fixes#1786
Related to https://github.com/projectatomic/libpod/pull/469
Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!
Signed-off-by: Adrian Reber <areber@redhat.com>
MOVE_MOUNT will fail under certain situations.
You are not allowed to MS_MOVE if the parent directory is shared.
man mount
...
The move operation
Move a mounted tree to another place (atomically). The call is:
mount --move olddir newdir
This will cause the contents which previously appeared under olddir to
now be accessible under newdir. The physical location of the files is
not changed. Note that olddir has to be a mountpoint.
Note also that moving a mount residing under a shared mount is invalid
and unsupported. Use findmnt -o TARGET,PROPAGATION to see the current
propagation flags.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This will help runc's init to not spawn many threads on large systems when
launched with max procs by the caller.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.
Without the 'all' flag, ignore duplicate subsystems after the first.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
It turns out that MIPS uses uint32 in the device number returned by
stat(2), so explicitly wrap everything to make the compiler happy. I
really wish that Go had C-like numeric type promotion.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```
Signed-off-by: Tibor Vass <tibor@docker.com>
When running in a new unserNS as root, don't require a mapping to be
present in the configuration file. We are already skipping the test
for a new userns to be present.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.
Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.
It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.
This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.
This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>