Commit Graph

235 Commits

Author SHA1 Message Date
Andrei Vagin 269ea385a4 restore: fix a race condition in process.Wait()
Adrian reported that the checkpoint test stated failing:
=== RUN   TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
    checkpoint_test.go:297: Did not restore the pipe correctly:

The problem here is when we start exec.Cmd, we don't call its wait
method. This means that we don't wait cmd.goroutines ans so we don't
know when all data will be read from process pipes.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2020-02-10 10:21:08 -08:00
Aleksa Sarai f6fb7a0338
merge branch 'pr-2133'
Julia Nedialkova (1):
  Handle ENODEV when accessing the freezer.state file

LGTMs: @crosbymichael @cyphar
Closes #2133
2020-01-17 02:07:19 +11:00
Jordan Liggitt 8541d9cf3d Fix race checking for process exit and waiting for exec fifo
Signed-off-by: Jordan Liggitt <liggitt@google.com>
2019-12-18 18:48:18 +00:00
Radostin Stoyanov a610a84821 criu: Ensure other users cannot read c/r files
No checkpoint files should be readable by
anyone else but the user creating it.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-17 07:49:38 +01:00
Radostin Stoyanov f017e0f9e1 checkpoint: Set descriptors.json file mode to 0600
Prevent unprivileged users from being able to read descriptors.json

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-12 19:29:44 +01:00
Julia Nedialkova e63b797f38 Handle ENODEV when accessing the freezer.state file
...when checking if a container is paused

Signed-off-by: Julia Nedialkova <julianedialkova@hotmail.com>
2019-09-27 17:02:56 +03:00
Michael Crosby 331692baa7 Only allow proc mount if it is procfs
Fixes #2128

This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.

```bash
> sudo docker run --rm  apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.

> sudo docker run --rm -v /proc:/proc apparmor

docker-default (enforce)        root     18989  0.9  0.0   1288     4 ?
Ss   16:47   0:00 sleep 20
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-09-24 11:00:18 -04:00
Giuseppe Scrivano 1932917b71
libcontainer: add initial support for cgroups v2
allow to set what subsystems are used by
libcontainer/cgroups/fs.Manager.

subsystemsUnified is used on a system running with cgroups v2 unified
mode.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:25 +02:00
Georgi Sabev a146081828 Write logs to stderr by default
Minor refactoring to use the filePair struct for both init sock and log pipe

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-24 15:18:14 +03:00
Georgi Sabev ba3cabf932 Improve nsexec logging
* Simplify logging function
* Logs contain __FUNCTION__:__LINE__
* Bail uses write_log

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:53:52 +03:00
Danail Branekov c486e3c406 Address comments in PR 1861
Refactor configuring logging into a reusable component
so that it can be nicely used in both main() and init process init()

Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Co-authored-by: Claudia Beresford <cberesford@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
2019-04-04 14:57:28 +03:00
Marco Vedovati 9a599f62fb Support for logging from children processes
Add support for children processes logging (including nsexec).
A pipe is used to send logs from children to parent in JSON.
The JSON format used is the same used by logrus JSON formatted,
i.e. children process can use standard logrus APIs.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:23 +03:00
Mrunal Patel 2b18fe1d88
Merge pull request #1984 from cyphar/memfd-cleanups
nsenter: cloned_binary: "memfd" cleanups
2019-03-07 10:18:33 -08:00
Michael Crosby f739110263
Merge pull request #1968 from adrianreber/podman
Create bind mount mountpoints during restore
2019-03-04 11:37:07 -06:00
Aleksa Sarai af9da0a450
nsenter: cloned_binary: use the runc statedir for O_TMPFILE
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:51 +11:00
Adrian Reber 9edb5494bb
Use vendored in CRIU Go bindings
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-14 18:20:02 +01:00
Adrian Reber 7354546cc8
Create mountpoints also on restore
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Adrian Reber e157963054
Enable CRIU configuration files
CRIU 3.11 introduces configuration files:

https://criu.org/Configuration_files
https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html

This enables the user to influence CRIU's behaviour without code changes
if using new CRIU features or if the user wants to enable certain CRIU
behaviour without always specifying certain options.

With this it is possible to write 'tcp-established' to the configuration
file:

$ echo tcp-established > /etc/criu/runc.conf

and from now on all checkpoints will preserve the state of established
TCP connections. This removes the need to always use

$ runc checkpoint --tcp-stablished

If the goal is to always checkpoint with '--tcp-established'

It also adds the possibility for unexpected CRIU behaviour if the user
created a configuration file at some point in time and forgets about it.

As a result of the discussion in https://github.com/opencontainers/runc/pull/1933
it is now also possible to define a CRIU configuration file for each
container with the annotation 'org.criu.config'.

If 'org.criu.config' does not exist, runc will tell CRIU to use
'/etc/criu/runc.conf' if it exists.

If 'org.criu.config' is set to an empty string (''), runc will tell CRIU
to not use any runc specific configuration file at all.

If 'org.criu.config' is set to a non-empty string, runc will use that
value as an additional configuration file for CRIU.

With the annotation the user can decide to use the default configuration
file ('/etc/criu/runc.conf'), none or a container specific configuration
file.

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-12-21 07:42:12 +01:00
Michael Crosby 96ec2177ae
Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers
kill: allow to signal paused containers
2018-12-03 16:55:13 -05:00
Ace-Tang dce70cdff5 cr: get pid from criu notify when restore
when restore container from a checkpoint directory, we should get
pid from criu notify, since c.initProcess has not been created.

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-12-03 13:31:20 +08:00
Giuseppe Scrivano 07d1ad44c8
kill: allow to signal paused containers
regression introduced by 87a188996e

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-11-30 23:35:47 +01:00
Michael Crosby 50e2634995
Merge pull request #1934 from lifubang/kill
fix: may kill other process when container has been stopped
2018-11-21 10:30:25 -05:00
Lifubang 87a188996e may kill other process when container has been stopped
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-11-21 17:44:52 +08:00
W. Trevor King e23868603a libcontainer: Set 'status' in hook stdin
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).

And drop HookState, since there's no need for a local alias for
specs.State.

Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start().  I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.

I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2].  Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.

I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks.  But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568).  I've left that alone too.

I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty.  Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard.  I
think that ort of thing is very lo-risk, because:

* We shouldn't be copy/pasting this, right?  DRY for the win :).
* There's only ever a few lines between the guard and the guarded
  loop.  That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these.  Guarding with the wrong
  slice is certainly not the only thing you can break with a sloppy
  copy/paste.

But I'm not a maintainer ;).

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-11-14 06:49:49 -08:00
Mrunal Patel 4769cdf607
Merge pull request #1916 from crosbymichael/cgns
Add support for cgroup namespace
2018-11-13 12:21:38 -08:00
Michael Crosby aa7917b751
Merge pull request #1911 from theSuess/linter-fixes
Various cleanups to address linter issues
2018-11-13 12:13:34 -05:00
Yuanhong Peng df3fa115f9 Add support for cgroup namespace
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-31 10:51:43 -04:00
Mrunal Patel c2ab1e656e
Merge pull request #1910 from adrianreber/tip
Fix travis Go: tip
2018-10-17 12:47:08 -07:00
Dominik Süß 0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Adrian Reber 0d01164756 Fix travis Go: tip
This fixes

 libcontainer/container_linux.go:1200: Error call has possible formatting directive %s

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-10-13 10:44:07 +00:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Adrian Reber fa43a72aba
criu: restore into existing namespace when specified
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.

If the network namespace is defined like

	{
		"type": "network",
		"path": "/run/netns/test"
	}

there is the expectation that the restored container is again running in
the network namespace specified with 'path'.

This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.

This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.

With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.

Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.

runc still falls back to the old behavior if CRIU older than 3.11 is
installed.

Fixes #1786

Related to https://github.com/projectatomic/libpod/pull/469

Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-08-22 23:27:20 +02:00
Michael Crosby 53fddb540a Pass GOMAXPROCS to init processes
This will help runc's init to not spawn many threads on large systems when
launched with max procs by the caller.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-06-26 11:23:37 -04:00
Mrunal Patel bd3c4f844a Fix race in runc exec
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.

This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-06-01 16:25:58 -07:00
Michael Crosby 0e561642f8
Merge pull request #1688 from AkihiroSuda/unshare-m-r
main: support rootless mode in userns
2018-05-29 15:41:17 -04:00
Qiang Huang dd67ab10d7
Merge pull request #1759 from cyphar/rootless-erofs-as-eperm
rootless: cgroup: treat EROFS as a skippable error
2018-05-25 09:24:16 +08:00
Akihiro Suda c93815738a libcontainer: remove extra CAP_SETGID check for SetgroupAttr
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-24 14:59:30 +09:00
Michael Crosby bdbb9fab07
Merge pull request #1693 from AkihiroSuda/leave-setgroups-allow
libcontainer: allow setgroup in rootless mode
2018-04-24 11:24:04 -04:00
Sebastien Boeuf 985628dda0 libcontainer: Don't set container state to running when exec'ing
There is no reason to set the container state to "running" as a
temporary value when exec'ing a process on a container in "created"
state. The problem doing this is that consumers of the libcontainer
library might use it by keeping pointers in memory. In this case,
the container state will indicate that the container is running, which
is wrong, and this will end up with a failure on the next action
because the check for the container state transition will complain.

Fixes #1767

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2018-03-30 09:29:18 -07:00
Akihiro Suda 73f3dc6389 libcontainer: allow setgroup in rootless mode
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-27 17:42:05 +09:00
Aleksa Sarai fd3a6e6c83
libcontainer: handle unset oomScoreAdj corectly
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:

> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.

Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
W. Trevor King 50dc7ee96c libcontainer/capabilities_linux: Drop os.Getpid() call
gocapability has supported 0 as "the current PID" since
syndtr/gocapability@5e7cce49 (Allow to use the zero value for pid to
operate with the current task, 2015-01-15, syndtr/gocapability#2).
libcontainer was ported to that approach in 444cc298 (namespaces:
allow to use pid namespace without mount namespace, 2015-01-27,
docker/libcontainer#358), but the change was clobbered by 22df5551
(Merge branch 'master' into api, 2015-02-19, docker/libcontainer#388)
which landed via 5b73860e (Merge pull request #388 from docker/api,
2015-02-19, docker/libcontainer#388).  This commit restores the
changes from 444cc298.

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-02-19 15:47:42 -08:00
Ed King 5c0af14bf8 Return from goroutine when it should terminate
Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-23 10:46:31 +00:00
Will Martin 8d3e6c9826 Avoid race when opening exec fifo
When starting a container with `runc start` or `runc run`, the stub
process (runc[2:INIT]) opens a fifo for writing. Its parent runc process
will open the same fifo for reading. In this way, they synchronize.

If the stub process exits at the wrong time, the parent runc process
will block forever.

This can happen when racing 2 runc operations against each other: `runc
run/start`, and `runc delete`. It could also happen for other reasons,
e.g. the kernel's OOM killer may select the stub process.

This commit resolves this race by racing the opening of the exec fifo
from the runc parent process against the stub process exiting. If the
stub process exits before we open the fifo, we return an error.

Another solution is to wait on the stub process. However, it seems it
would require more refactoring to avoid calling wait multiple times on
the same process, which is an error.

Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-22 17:03:02 +00:00
Antonio Murdaca cd1e7abee2
libcontainer: expose annotations in hooks
Annotations weren't passed to hooks. This patch fixes that by passing
annotations to stdin for hooks.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-01-11 16:54:01 +01:00
Qiang Huang 74a1729647 Merge pull request #1607 from crosbymichael/term-err
libcontainer: handler errors from terminate
2017-10-20 15:15:38 +08:00
Petros Angelatos 8098828680
propagate argv0 when re-execing from /proc/self/exe
This allows runc to be used as a target for docker's reexec module that
depends on a correct argv0 to select which process entrypoint to invoke.
Without this patch, when runc re-execs argv0 is set to "/proc/self/exe"
and the reexec module doesn't know what to do with it.

Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
2017-10-16 14:00:26 +02:00
Michael Crosby bfe3058fc9 Make process check more forgiving
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-10-10 15:36:19 -04:00
Steven Hartland eb68b900bc Prevent invalid errors from terminate
Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate.

Instead of letting these get returned and logged, which causes confusion, silently ignore them.

Currently the test needs to be a string test as the errors are private to the runtime packages, so its our only option.

This can be seen if init fails during the setns.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-10-10 15:32:46 -04:00
Konstantinos Karampogias 605dc5c811 Set initial console size based on process spec
Signed-off-by: Will Martin <wmartin@pivotal.io>
Signed-off-by: Petar Petrov <pppepito86@gmail.com>
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com>
Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>
2017-10-04 12:32:16 +01:00
Tobias Klauser d713652bda libcontainer: remove unnecessary type conversions
Generated using github.com/mdempsky/unconvert

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-09-25 10:41:57 +02:00
Mrunal Patel d5b43c3981 Merge pull request #1455 from dqminh/epoll-io
tty: move IO of master pty to be done with epoll
2017-09-11 11:32:42 -07:00
Aleksa Sarai 6097ce74d8
nsenter: correctly handle newgidmap path for rootless containers
After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano d8b669400a
rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Mrunal Patel 7e036aa0b0 Merge pull request #1541 from adrianreber/lazy
checkpoint: support lazy migration
2017-09-07 13:25:04 -07:00
Adrian Reber 60ae7091de checkpoint: support lazy migration
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.

This enables to reduce the downtime during process or container
migration to a minimum as the memory does not need to be transferred
during migration.

Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and if the used CRIU version supports lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.

To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:

 # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
  --status-fd /tmp/postcopy-pipe

In this example CRIU will hang/wait once it has opened the network port
and wait for network connection. As runC waits for CRIU to finish it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:

 #️ runc checkpoint --help | grep status
  --status-fd value   criu writes \0 to this FD once lazy-pages is ready

The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.

On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:

 # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
  -D checkpoint

and tell runC to do a lazy restore:

 # runc restore -d --image-path checkpoint --work-path checkpoint \
  --lazy-pages httpd

If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and talks to 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.

This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.

Some additional background about post-copy migration can be found in
these articles:

 https://lisas.de/~adrian/?p=1253
 https://lisas.de/~adrian/?p=1183

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Adrian Reber a3a632ad28 checkpoint: add support to query for lazy page support
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Xiaochen Shen 692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Nikolas Sepos da4a5a9515 Add AutoDedup option to CriuOpts
Memory image deduplication, very useful for incremental dumps.

See: https://criu.org/Memory_images_deduplication

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 01:21:42 +02:00
Qiang Huang e6e1c34a7d Update state after update
state.json should be a reflection of the container's
realtime state, including resource configurations,
so we should update state.json after updating container
resources.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-15 14:38:44 +08:00
Adrian Reber 5d386f6e2b checkpoint: use CRIU VERSION RPC if available
With this runC also uses RPC to ask CRIU for its version. CRIU supports
a VERSION RPC since CRIU 3.0 and using the RPC interface does not
require parsing the console output of CRIU (which could change anytime).

For older CRIU versions which do not yet have the VERSION RPC runC falls
back to its old CRIU output parsing mode.

Once CRIU 3.0 is the minimum version required for runC the old code can
be removed.

v2:
 * adapt to changes in the previous patches based on the review

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:08:07 +00:00
Adrian Reber c71d9cd447 criuSwrk: prepare for CRIU VERSION RPC
To use the CRIU VERSION RPC the criuSwrk function is adapted to work
with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION
RPC.

Also do not print c.criuVersion if it is '0' as the first RPC call will
always be the VERSION call and only after that the version will be
known.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:28 +00:00
Adrian Reber c5f0ce979b checkCriuVersion: only ask criu once about its version
If the version of criu has already been determined there is no need to
ask criu for the version again. Use the value from c.criuVersion.

v2:
 * reduce unnecessary code movement in the patch series
 * factor out the criu version parsing into a separate function

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:15 +00:00
Adrian Reber b6c47281db checkCriuVersion: switch to version using int
The checkCriuVersion function used a string to specify the minimum
version required. This is more comfortable for an external interface
but for an internal function this added unnecessary complexity. This
changes to version string like '1.5.2' to an integer like 10502. This is
already the format used internally in the function.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:05:27 +00:00
Michael Crosby 882d8eaba6 Merge pull request #1537 from tklauser/staticcheck
Fix issues found by staticcheck
2017-08-02 09:52:11 -04:00
Tobias Klauser e4e56cb6d8 libcontainer: remove ineffective break statements
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:39 +02:00
Tobias Klauser 24a4273cf9 libcontainer: handle error cases
Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:11 +02:00
Daniel Dao 91eafcbc65
tty: move IO of master pty to be done with epoll
This moves all console code to use github.com/containerd/console library to
handle console I/O. Also move to use EpollConsole by default when user requests
a terminal so we can still cope when the other side temporarily goes away.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2017-07-28 12:35:02 +01:00
Steven Hartland ee4f68e302 Updated logrus to v1
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.

This includes a manual edit of the docker term package to also correct the name there too.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-07-19 15:20:56 +00:00
Tobias Klauser 54d27bed7f libcontainer: use ParseSocketControlMessage/ParseUnixRights from x/sys/unix
Use ParseSocketControlMessage and ParseUnixRights from
golang.org/x/sys/unix instead of their syscall equivalent.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-13 15:02:17 +02:00
W. Trevor King 2bea4c897e libcontainer/system/proc: Add Stat_t.State
And Stat_t.PID and Stat_t.Name while we're at it.  Then use the new
.State property in runType to distinguish between running and
zombie/dead processes, since kill(2) does not [1].  With this change
we no longer claim Running status for zombie/dead processes.

I've also removed the kill(2) call from runType.  It was originally
added in 13841ef3 (new-api: return the Running state only if the init
process is alive, 2014-12-23), but we've been accessing
/proc/[pid]/stat since 14e95b2a (Make state detection precise,
2016-07-05, #930), and with the /stat access the kill(2) check is
redundant.

I also don't see much point to the previously-separate
doesInitProcessExist, so I've inlined that logic in runType.

It would be nice to distinguish between "/proc/[pid]/stat doesn't
exist" and errors parsing its contents, but I've skipped that for the
moment.

The Running -> Stopped change in checkpoint_test.go is because the
post-checkpoint process is a zombie, and with this commit zombie
processes are Stopped (and no longer Running).

[1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-20 16:26:55 -07:00
W. Trevor King 75d98b26b7 libcontainer: Replace GetProcessStartTime with Stat_t.StartTime
And convert the various start-time properties from strings to uint64s.
This removes all internal consumers of the deprecated
GetProcessStartTime function.

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-20 16:26:55 -07:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Mrunal Patel 639454475c Merge pull request #1355 from avagin/cr-console
Dump and restore containers with external terminals
2017-05-18 11:22:52 -07:00
Harshal Patil 22953c122f Remove redundant declaraion of namespace slice
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-05-02 10:04:57 +05:30
Andrei Vagin 73258813d3 cr: set a freezer cgroup for criu
A freezer cgroup allows to dump processes faster.

If a user wants to checkpoint a container and its storage,
he has to pause a container, but in this case we need to pass
a path to its freezer cgroup to "criu dump".

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-02 04:48:47 +03:00
Andrei Vagin 1c43d091a1 checkpoint: add support for containers with terminals
CRIU was extended to report about orphaned master pty-s via RPC.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-02 04:48:47 +03:00
Andrei Vagin d307e85dbb Print a criu version in a error message
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:45:23 +03:00
Harshal Patil c44d4fa6ed Optimizing looping over namespaces
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-04-26 11:54:43 +05:30
Qiang Huang 94cfb7955b Merge pull request #1387 from avagin/freezer
Don't try to read freezer.state from the current directory
2017-04-24 20:02:45 -05:00
Mrunal Patel 97db1eaad9 Merge pull request #1396 from harche/cstate
Set container state only once during start
2017-04-17 11:32:42 -07:00
Mrunal Patel 7814a0d14b Merge pull request #1399 from avagin/cr-cgroup
restore: apply resource limits
2017-04-13 11:28:28 -07:00
Andrei Vagin 57ef30a2ae restore: apply resource limits
When C/R was implemented, it was enough to call manager.Set to apply
limits and to move a task. Now .Set() and .Apply() have to be called
separately.

Fixes: 8a740d5391 ("libcontainer: cgroups: don't Set in Apply")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-04-07 02:47:43 +03:00
Adrian Reber 273b7853c8 checkpoint: check if system supports pre-dumping
Instead of relying on version numbers it is possible to check if CRIU
actually supports certain features. This introduces an initial
implementation to check if CRIU and the underlying kernel actually
support dirty memory tracking for memory pre-dumping.

Upstream CRIU also supports the lazy-page migration feature check and
additional feature checks can be included in CRIU to reduce the version
number parsing. There are also certain CRIU features which depend on one
side on the CRIU version but also require certain kernel versions to
actually work. CRIU knows if it can do certain things on the kernel it
is running on and using the feature check RPC interface makes it easier
for runc to decide if the criu+kernel combination will support that
feature.

Feature checking was introduced with CRIU 1.8. Running with older CRIU
versions will ignore the feature check functionality and behave just
like it used to.

v2:
 - Do not use reflection to compare requested and responded
   features. Checking which feature is available is now hardcoded
   and needs to be adapted for every new feature check. The code
   is now much more readable and simpler.

v3:
 - Move the variable criuFeat out of the linuxContainer struct,
   as it is not container specific. Now it is a global variable.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-04-06 11:17:52 +00:00
Harshal Patil 1be5d31da2 Set container state only once during start
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-04-04 15:08:04 +05:30
Aleksa Sarai f0876b0427
libcontainer: configs: add proper HostUID and HostGID
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Aleksa Sarai baeef29858
rootless: add rootless cgroup manager
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Aleksa Sarai d2f49696b0
runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:24 +11:00
Aleksa Sarai 6bd4bd9030
*: handle unprivileged operations and !dumpable
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.

!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.

This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).

In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:19 +11:00
Andrei Vagin 88256d646d Don't try to read freezer.state from the current directory
If we try to pause a container on the system without freezer cgroups,
we can found that runc tries to open ./freezer.state. It is obviously wrong.

$ ./runc pause test
no such directory for freezer.state

$ echo FROZEN > freezer.state
$ ./runc pause test
container not running or created: paused

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-03-23 01:58:45 +03:00
Michael Crosby 00a0ecf554 Add separate console socket
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Mrunal Patel 4f9cb13b64 Update runtime spec to 1.0.0.rc5
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2017-03-15 11:38:37 -07:00
Qiang Huang b7932a2e07 Remove unused ExecFifoPath
In container process's Init function, we use
fd + execFifoFilename to open exec fifo, so this
field in init config is never used.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-09 10:58:16 +08:00
Qiang Huang 707dd48b2f Merge pull request #1001 from x1022as/predump
add pre-dump and parent-path to checkpoint
2017-02-24 10:55:06 -08:00
Qiang Huang 733563552e Fix state when _LIBCONTAINER in environment
Fixes: #1311

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-22 10:35:14 -08:00
Qiang Huang 805b8c73d3 Do not create exec fifo in factory.Create
It should not be binded to container creation, for
example, runc restore needs to create a
libcontainer.Container, but it won't need exec fifo.

So create exec fifo when container is started or run,
where we really need it.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-22 10:34:48 -08:00
Deng Guangxing 98f004182b add pre-dump and parent-path to checkpoint
CRIU gets pre-dump to complete iterative migration.
pre-dump saves process memory info only. And it need parent-path
to specify the former memory files.

This patch add pre-dump and parent-path arguments to runc checkpoint

Signed-off-by: Deng Guangxing <dengguangxing@huawei.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
2017-02-14 19:45:07 +08:00
Aleksa Sarai e034cedce7
libcontainer: init: only pass stateDirFd when creating a container
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.

To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).

There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.

Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-02-02 00:41:11 +11:00
Qiang Huang db99936a0e Merge pull request #1110 from avagin/cpt-in-userns
checkpoint: handle config.Devices and config.MaskPaths
2017-01-10 00:34:40 -06:00