And use it only in local tooling that is forwarding the pseudoterminal
master. That way runC no longer has an opinion on the onlcr setting
for folks who are creating a terminal and detaching. They'll use
--console-socket and can setup the pseudoterminal however they like
without runC having an opinion. With this commit, the only cases
where runC still has applies SaneTerminal is when *it* is the process
consuming the master descriptor.
Signed-off-by: W. Trevor King <wking@tremily.us>
Use the NLA_ALIGNTO and NLA_HDRLEN constants from x/sys/unix instead of
syscall, as the syscall package shouldn't be used anymore (except for a
few exceptions).
This also makes the syscall_NLA_HDRLEN workaround for gccgo unnecessary.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Use the symlink xattr syscall wrappers Lgetxattr, Llistxattr and
Lsetxattr from x/sys/unix (introduced in
golang/sys@b90f89a1e7) instead of
providing own wrappers. Leave the functionality of system.Lgetxattr
intact with respect to the retry with a larger buffer, but switch it to
use unix.Lgetxattr.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Follow commit 3d7cb4293c ("Move libcontainer to x/sys/unix") and also
move the examples in README.md from syscall to x/sys/unix.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
* User Case:
User could use prestart hook to add block devices to container. so the
hook should have a way to set the permissions of the devices.
Just move cgroup config operation before prestart hook will work.
Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported
in runc anyway right now. So clean up the file naming to use `_linux` where appropriate.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
A freezer cgroup allows to dump processes faster.
If a user wants to checkpoint a container and its storage,
he has to pause a container, but in this case we need to pass
a path to its freezer cgroup to "criu dump".
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When C/R was implemented, it was enough to call manager.Set to apply
limits and to move a task. Now .Set() and .Apply() have to be called
separately.
Fixes: 8a740d5391 ("libcontainer: cgroups: don't Set in Apply")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of relying on version numbers it is possible to check if CRIU
actually supports certain features. This introduces an initial
implementation to check if CRIU and the underlying kernel actually
support dirty memory tracking for memory pre-dumping.
Upstream CRIU also supports the lazy-page migration feature check and
additional feature checks can be included in CRIU to reduce the version
number parsing. There are also certain CRIU features which depend on one
side on the CRIU version but also require certain kernel versions to
actually work. CRIU knows if it can do certain things on the kernel it
is running on and using the feature check RPC interface makes it easier
for runc to decide if the criu+kernel combination will support that
feature.
Feature checking was introduced with CRIU 1.8. Running with older CRIU
versions will ignore the feature check functionality and behave just
like it used to.
v2:
- Do not use reflection to compare requested and responded
features. Checking which feature is available is now hardcoded
and needs to be adapted for every new feature check. The code
is now much more readable and simpler.
v3:
- Move the variable criuFeat out of the linuxContainer struct,
as it is not container specific. Now it is a global variable.
Signed-off-by: Adrian Reber <areber@redhat.com>
The original implementation is in C, which increases cognitive load and
possibly might cause us problems in the future. Since sys/unix is better
maintained than the syscall standard library switching makes more sense.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).
In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.
!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.
This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).
In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
If we try to pause a container on the system without freezer cgroups,
we can found that runc tries to open ./freezer.state. It is obviously wrong.
$ ./runc pause test
no such directory for freezer.state
$ echo FROZEN > freezer.state
$ ./runc pause test
container not running or created: paused
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>