Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts
Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
ec0d23a92f ("tty: close epollConsole on errors") fixed a significant
issue, but the cleanup was not ideal (especially if the function is
changed in future to add additional error conditions to those currently
present). Using the defer-named-error trick avoids this issue and makes
the code more readable.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
make sure epollConsole is closed before returning an error. It solves
a hang when using these commands with a container that uses a
terminal:
runc run foo &
ssh root@localhost runc exec foo echo hello
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Fix duplicate entries and missing entries in getCgroupMountsHelper
Add test for testing cgroup mounts on bedrock linux
Stop relying on number of subsystems for cgroups
LGTMs: @crosbymichael @cyphar
Closes#1817
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:
* We already have a flag that disables the use of session keyrings
(for older kernels that had system-wide keyring limits and so
on). So disabling it is not a new idea.
* While the primary justification of using session keys *is*
security-based, it's more of a security-by-obscurity protection.
The main defense keyrings have is VFS credentials -- which is
something that users already have better security tools for
(setuid(2) and user namespaces).
* Given the security justification you might argue that we
shouldn't silently ignore this. However, the only way for the
kernel to return -ENOSYS is either being ridiculously old (at
which point we wouldn't work anyway) or that there is a seccomp
profile in place blocking it.
Given that the seccomp profile (if malicious) could very easily
just return 0 or a silly return code (or something even more
clever with seccomp-bpf) and trick us without this patch, there
isn't much of a significant change in how much seccomp can trick
us with or without this patch.
Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.
Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.
`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.
`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)
When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.
This PR does not have any impact on CLI flags and `state.json`.
Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
For runc-in-userns, `runc spec` without `--rootless` should work, when sufficient numbers of
UID/GID are mapped.
Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
(`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.
Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
it is now allowed to bind mount /proc. This is useful for rootless
containers when the PID namespace is shared with the host.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
subgid is defined per user, not group (see subgid(5))
This commit also adds support for specifying subuid owner with a numeric UID.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
This adds a new CRIU based checkpoint/restore test to check if
the restored container runs in the same network namespace as before.
Signed-off-by: Adrian Reber <areber@redhat.com>
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.
If the network namespace is defined like
{
"type": "network",
"path": "/run/netns/test"
}
there is the expectation that the restored container is again running in
the network namespace specified with 'path'.
This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.
This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.
With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.
Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.
runc still falls back to the old behavior if CRIU older than 3.11 is
installed.
Fixes#1786
Related to https://github.com/projectatomic/libpod/pull/469
Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!
Signed-off-by: Adrian Reber <areber@redhat.com>
MOVE_MOUNT will fail under certain situations.
You are not allowed to MS_MOVE if the parent directory is shared.
man mount
...
The move operation
Move a mounted tree to another place (atomically). The call is:
mount --move olddir newdir
This will cause the contents which previously appeared under olddir to
now be accessible under newdir. The physical location of the files is
not changed. Note that olddir has to be a mountpoint.
Note also that moving a mount residing under a shared mount is invalid
and unsupported. Use findmnt -o TARGET,PROPAGATION to see the current
propagation flags.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
since runc don't manage net device and their configuration, checkpoint
also don't dump net namespace by default, so set 'nsmask = unix.CLONE_NEWNET'
by default in restore. Or if user do not pass 'empty-ns network', criu will
cost extra time in restore.
Signed-off-by: Ace-Tang <aceapril@126.com>
For criu v3.10, a patch is needed for `@test "checkpoint --lazy-pages and restore"`.
Starting with v3.11, the patch will no longer be needed.
The issue had not been caught in Travis because the kernel is too old and the test
had not been executed in Travis.
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>