This allows runc to be used as a target for docker's reexec module that
depends on a correct argv0 to select which process entrypoint to invoke.
Without this patch, when runc re-execs argv0 is set to "/proc/self/exe"
and the reexec module doesn't know what to do with it.
Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate.
Instead of letting these get returned and logged, which causes confusion, silently ignore them.
Currently the test needs to be a string test as the errors are private to the runtime packages, so its our only option.
This can be seen if init fails during the setns.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Signed-off-by: Will Martin <wmartin@pivotal.io>
Signed-off-by: Petar Petrov <pppepito86@gmail.com>
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com>
Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>
After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.
This enables to reduce the downtime during process or container
migration to a minimum as the memory does not need to be transferred
during migration.
Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and if the used CRIU version supports lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.
To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:
# runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
--status-fd /tmp/postcopy-pipe
In this example CRIU will hang/wait once it has opened the network port
and wait for network connection. As runC waits for CRIU to finish it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:
#️ runc checkpoint --help | grep status
--status-fd value criu writes \0 to this FD once lazy-pages is ready
The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.
On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:
# criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
-D checkpoint
and tell runC to do a lazy restore:
# runc restore -d --image-path checkpoint --work-path checkpoint \
--lazy-pages httpd
If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and talks to 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.
This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.
Some additional background about post-copy migration can be found in
these articles:
https://lisas.de/~adrian/?p=1253https://lisas.de/~adrian/?p=1183
Signed-off-by: Adrian Reber <areber@redhat.com>
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.
Signed-off-by: Adrian Reber <areber@redhat.com>
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).
Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
state.json should be a reflection of the container's
realtime state, including resource configurations,
so we should update state.json after updating container
resources.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
With this runC also uses RPC to ask CRIU for its version. CRIU supports
a VERSION RPC since CRIU 3.0 and using the RPC interface does not
require parsing the console output of CRIU (which could change anytime).
For older CRIU versions which do not yet have the VERSION RPC runC falls
back to its old CRIU output parsing mode.
Once CRIU 3.0 is the minimum version required for runC the old code can
be removed.
v2:
* adapt to changes in the previous patches based on the review
Signed-off-by: Adrian Reber <areber@redhat.com>
To use the CRIU VERSION RPC the criuSwrk function is adapted to work
with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION
RPC.
Also do not print c.criuVersion if it is '0' as the first RPC call will
always be the VERSION call and only after that the version will be
known.
Signed-off-by: Adrian Reber <areber@redhat.com>
If the version of criu has already been determined there is no need to
ask criu for the version again. Use the value from c.criuVersion.
v2:
* reduce unnecessary code movement in the patch series
* factor out the criu version parsing into a separate function
Signed-off-by: Adrian Reber <areber@redhat.com>
The checkCriuVersion function used a string to specify the minimum
version required. This is more comfortable for an external interface
but for an internal function this added unnecessary complexity. This
changes to version string like '1.5.2' to an integer like 10502. This is
already the format used internally in the function.
Signed-off-by: Adrian Reber <areber@redhat.com>
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.
Found with honnef.co/go/tools/cmd/staticcheck
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights.
Found with honnef.co/go/tools/cmd/staticcheck
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
This moves all console code to use github.com/containerd/console library to
handle console I/O. Also move to use EpollConsole by default when user requests
a terminal so we can still cope when the other side temporarily goes away.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.
This includes a manual edit of the docker term package to also correct the name there too.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Use ParseSocketControlMessage and ParseUnixRights from
golang.org/x/sys/unix instead of their syscall equivalent.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
And Stat_t.PID and Stat_t.Name while we're at it. Then use the new
.State property in runType to distinguish between running and
zombie/dead processes, since kill(2) does not [1]. With this change
we no longer claim Running status for zombie/dead processes.
I've also removed the kill(2) call from runType. It was originally
added in 13841ef3 (new-api: return the Running state only if the init
process is alive, 2014-12-23), but we've been accessing
/proc/[pid]/stat since 14e95b2a (Make state detection precise,
2016-07-05, #930), and with the /stat access the kill(2) check is
redundant.
I also don't see much point to the previously-separate
doesInitProcessExist, so I've inlined that logic in runType.
It would be nice to distinguish between "/proc/[pid]/stat doesn't
exist" and errors parsing its contents, but I've skipped that for the
moment.
The Running -> Stopped change in checkpoint_test.go is because the
post-checkpoint process is a zombie, and with this commit zombie
processes are Stopped (and no longer Running).
[1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789
Signed-off-by: W. Trevor King <wking@tremily.us>
And convert the various start-time properties from strings to uint64s.
This removes all internal consumers of the deprecated
GetProcessStartTime function.
Signed-off-by: W. Trevor King <wking@tremily.us>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
A freezer cgroup allows to dump processes faster.
If a user wants to checkpoint a container and its storage,
he has to pause a container, but in this case we need to pass
a path to its freezer cgroup to "criu dump".
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When C/R was implemented, it was enough to call manager.Set to apply
limits and to move a task. Now .Set() and .Apply() have to be called
separately.
Fixes: 8a740d5391 ("libcontainer: cgroups: don't Set in Apply")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of relying on version numbers it is possible to check if CRIU
actually supports certain features. This introduces an initial
implementation to check if CRIU and the underlying kernel actually
support dirty memory tracking for memory pre-dumping.
Upstream CRIU also supports the lazy-page migration feature check and
additional feature checks can be included in CRIU to reduce the version
number parsing. There are also certain CRIU features which depend on one
side on the CRIU version but also require certain kernel versions to
actually work. CRIU knows if it can do certain things on the kernel it
is running on and using the feature check RPC interface makes it easier
for runc to decide if the criu+kernel combination will support that
feature.
Feature checking was introduced with CRIU 1.8. Running with older CRIU
versions will ignore the feature check functionality and behave just
like it used to.
v2:
- Do not use reflection to compare requested and responded
features. Checking which feature is available is now hardcoded
and needs to be adapted for every new feature check. The code
is now much more readable and simpler.
v3:
- Move the variable criuFeat out of the linuxContainer struct,
as it is not container specific. Now it is a global variable.
Signed-off-by: Adrian Reber <areber@redhat.com>
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.
!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.
This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).
In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
If we try to pause a container on the system without freezer cgroups,
we can found that runc tries to open ./freezer.state. It is obviously wrong.
$ ./runc pause test
no such directory for freezer.state
$ echo FROZEN > freezer.state
$ ./runc pause test
container not running or created: paused
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
In container process's Init function, we use
fd + execFifoFilename to open exec fifo, so this
field in init config is never used.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
It should not be binded to container creation, for
example, runc restore needs to create a
libcontainer.Container, but it won't need exec fifo.
So create exec fifo when container is started or run,
where we really need it.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
CRIU gets pre-dump to complete iterative migration.
pre-dump saves process memory info only. And it need parent-path
to specify the former memory files.
This patch add pre-dump and parent-path arguments to runc checkpoint
Signed-off-by: Deng Guangxing <dengguangxing@huawei.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.
To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).
There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.
Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
`HookState` struct should follow definition of `State` in runtime-spec:
* modify json name of `version` to `ociVersion`.
* Remove redundant `Rootfs` field as rootfs can be retrived from
`bundlePath/config.json`.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.
In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.
We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This allows a user to send a signal to all the processes in the
container within a single atomic action to avoid new processes being
forked off before the signal can be sent.
This is basically taking functionality that we already use being
`delete` and exposing it ok the `kill` command by adding a flag.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
In user namespaces devices are bind-mounted from the host, so
we need to add them as external mounts for CRIU.
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.
This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.
In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more than we can do.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This avoids us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
if a container state is running or created, the container.Pause()
method can set the state to pausing, and then paused.
this patch update the comment, so it can be consistent with the code.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
1. According to docs of Cmd.Path and Cmd.Args from package "os/exec":
Path is the path of the command to run. Args holds command line
arguments, including the command as Args[0]. We have mixed usage
of args. In InitPath(), InitArgs only take arguments, in InitArgs(),
InitArgs including the command as Args[0]. This is confusing.
2. InitArgs() already have the ability to configure a LinuxFactory
with the provided absolute path to the init binary and arguements as
InitPath() does.
3. exec.Command() will take care of serching executable path.
4. The default "/proc/self/exe" instead of os.Args[0] is passed to
InitArgs in order to allow relative path for the runC binary.
Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>
This removes the use of a signal handler and SIGCONT to signal the init
process to exec the users process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This is the inital port of the libcontainer.Error to added a cause to
all the existing error messages. Going forward, when an error can be
wrapped because it is not being checked at the higher levels for
something like `os.IsNotExist` we can add more information to the error
message like cause and stack file/line information. This will help
higher level tools to know what cause a container start or operation to
fail.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.
Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit adds support to libcontainer to allow caps, no new privs,
apparmor, and selinux process label to the process struct so that it can
be used together of override the base settings on the container config
per individual process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
currentState() always adds all possible namespaces to the state,
regardless of whether they are supported.
If orderNamespacePaths detects an unsupported namespace, an error is
returned that results in initialization failure.
Fix this by only adding paths of supported namespaces to the state.
Signed-off-by: Ido Yariv <ido@wizery.com>
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.
This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.
With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
This adds orderNamespacePaths to get correct order of namespaces for the
bootstrap program to join.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
Create a unique session key name for every container. Use the pattern
_ses.<postfix> with postfix being the container's Id.
This patch does not prevent containers from joining each other's session
keyring.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
Docker uses Prestart hooks to call a libnetwork hook to create
network devices and set addesses and routes.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
This options is set a namespace mask which will not be dumped and restored.
For example, we are going to use this option to restore network
for docker containers. CRIU will create a network namespace and
call a libnetwork hook to restore network devices, addresses and routes.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>