Fix duplicate entries and missing entries in getCgroupMountsHelper
Add test for testing cgroup mounts on bedrock linux
Stop relying on number of subsystems for cgroups
LGTMs: @crosbymichael @cyphar
Closes#1817
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:
* We already have a flag that disables the use of session keyrings
(for older kernels that had system-wide keyring limits and so
on). So disabling it is not a new idea.
* While the primary justification of using session keys *is*
security-based, it's more of a security-by-obscurity protection.
The main defense keyrings have is VFS credentials -- which is
something that users already have better security tools for
(setuid(2) and user namespaces).
* Given the security justification you might argue that we
shouldn't silently ignore this. However, the only way for the
kernel to return -ENOSYS is either being ridiculously old (at
which point we wouldn't work anyway) or that there is a seccomp
profile in place blocking it.
Given that the seccomp profile (if malicious) could very easily
just return 0 or a silly return code (or something even more
clever with seccomp-bpf) and trick us without this patch, there
isn't much of a significant change in how much seccomp can trick
us with or without this patch.
Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.
Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
it is now allowed to bind mount /proc. This is useful for rootless
containers when the PID namespace is shared with the host.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.
If the network namespace is defined like
{
"type": "network",
"path": "/run/netns/test"
}
there is the expectation that the restored container is again running in
the network namespace specified with 'path'.
This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.
This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.
With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.
Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.
runc still falls back to the old behavior if CRIU older than 3.11 is
installed.
Fixes#1786
Related to https://github.com/projectatomic/libpod/pull/469
Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!
Signed-off-by: Adrian Reber <areber@redhat.com>
MOVE_MOUNT will fail under certain situations.
You are not allowed to MS_MOVE if the parent directory is shared.
man mount
...
The move operation
Move a mounted tree to another place (atomically). The call is:
mount --move olddir newdir
This will cause the contents which previously appeared under olddir to
now be accessible under newdir. The physical location of the files is
not changed. Note that olddir has to be a mountpoint.
Note also that moving a mount residing under a shared mount is invalid
and unsupported. Use findmnt -o TARGET,PROPAGATION to see the current
propagation flags.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
This will help runc's init to not spawn many threads on large systems when
launched with max procs by the caller.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.
Without the 'all' flag, ignore duplicate subsystems after the first.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
It turns out that MIPS uses uint32 in the device number returned by
stat(2), so explicitly wrap everything to make the compiler happy. I
really wish that Go had C-like numeric type promotion.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```
Signed-off-by: Tibor Vass <tibor@docker.com>
When running in a new unserNS as root, don't require a mapping to be
present in the configuration file. We are already skipping the test
for a new userns to be present.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.
This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).
Signed-off-by: Aleksa Sarai <asarai@suse.de>