60 Commits

Author SHA1 Message Date
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, to make "runc-in-userns" more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`,
except for cgroup handling.

`RootlessCgroups` denotes that runc is unlikely to have full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost the same as "rootful" runc, except that cgroups errors are ignored.
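
The rules above can be sketched as a small decision function. This is an illustrative sketch only, not runc's actual code: `rootlessFlags` and its `inUserNS` parameter are hypothetical names, and it models only the automatic case (no explicit `--rootless` override).

```go
package main

import "fmt"

// rootlessFlags is a hypothetical helper that derives both flags from the
// effective UID and from whether the process runs in a non-initial user
// namespace. It is not taken from runc.
func rootlessFlags(euid int, inUserNS bool) (rootlessEUID, rootlessCgroups bool) {
	// Non-root in the current user namespace.
	rootlessEUID = euid != 0
	// Only root in the initial namespace is assumed to have full cgroup access.
	rootlessCgroups = !(euid == 0 && !inUserNS)
	return rootlessEUID, rootlessCgroups
}

func main() {
	cases := []struct {
		euid     int
		inUserNS bool
	}{{0, false}, {0, true}, {1000, false}}
	for _, c := range cases {
		e, cg := rootlessFlags(c.euid, c.inUserNS)
		fmt.Printf("euid=%d userns=%v -> RootlessEUID=%v RootlessCgroups=%v\n",
			c.euid, c.inUserNS, e, cg)
	}
}
```

Note that whenever `rootlessEUID` is true the second result is true as well, which matches the hint above.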

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec` without `--rootless` should work, as long as a sufficient number of
  UIDs/GIDs are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to execute runc as the root in userns, without mounting a writable `/run/runc`.
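
A minimal sketch of the directory selection described in this note. `defaultRoot` is a hypothetical helper; two details are assumptions not stated above: that a non-root euid always honors `$XDG_RUNTIME_DIR`, and that the state directory is a `runc` subdirectory of it.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// defaultRoot is a hypothetical helper that picks the state directory.
// The "runc" subdirectory and the euid != 0 branch are assumptions.
func defaultRoot(euid int, inUserNS bool, user, xdgRuntimeDir string) string {
	const systemRoot = "/run/runc"
	if xdgRuntimeDir == "" {
		return systemRoot
	}
	if euid != 0 {
		// Assumption: plain rootless runs always honor $XDG_RUNTIME_DIR.
		return filepath.Join(xdgRuntimeDir, "runc")
	}
	if !inUserNS {
		// Root in the initial namespace: ignored for backward compatibility.
		return systemRoot
	}
	// Root in a user namespace: honored only for a non-empty, non-root $USER.
	if user != "" && user != "root" {
		return filepath.Join(xdgRuntimeDir, "runc")
	}
	return systemRoot
}

func main() {
	fmt.Println(defaultRoot(0, false, "alice", "/run/user/1000"))
	fmt.Println(defaultRoot(0, true, "alice", "/run/user/1000"))
}
```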

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Ace-Tang 4803faf00e cr: don't restore net namespace by default
Since runc doesn't manage net devices and their configuration, and checkpoint
also doesn't dump the net namespace by default, set 'nsmask = unix.CLONE_NEWNET'
by default in restore. Otherwise, if the user does not pass 'empty-ns network',
criu will spend extra time in restore.
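
As a rough illustration of the default described above, the sketch below builds the mask so that the network namespace is always included, whether or not the user asked for it. `emptyNsMask` is a hypothetical helper, and the constant is declared locally (with the Linux value of CLONE_NEWNET) to keep the example self-contained.

```go
package main

import "fmt"

// cloneNewNet mirrors the Linux CLONE_NEWNET flag value.
const cloneNewNet uint32 = 0x40000000

// emptyNsMask is a hypothetical helper: it returns the mask of namespaces
// that should be created but not restored. The network namespace is part
// of the mask by default; requested names are OR-ed in on top of it.
func emptyNsMask(requested []string) (uint32, error) {
	known := map[string]uint32{"network": cloneNewNet}
	mask := cloneNewNet // default: never restore the net namespace
	for _, ns := range requested {
		flag, ok := known[ns]
		if !ok {
			return 0, fmt.Errorf("namespace %q is not supported", ns)
		}
		mask |= flag
	}
	return mask, nil
}

func main() {
	mask, err := emptyNsMask(nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("nsmask = %#x\n", mask)
}
```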

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-08-17 16:03:21 +08:00
Akihiro Suda f103de57ec main: support rootless mode in userns
Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.

Usage: (Note that `unshare --mount` requires `--map-root-user`)

  user$ mkdir lower upper work rootfs
  user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" )
  user$ unshare --mount --map-root-user
  mappedroot# runc spec --rootless
  mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json
  mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs
  mappedroot# runc run foo

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Adrian Reber 60ae7091de checkpoint: support lazy migration
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.

This reduces the downtime during process or container migration to a
minimum, as the memory does not need to be transferred during
migration.

Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and on the CRIU version in use supporting lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.

To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:

 # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
  --status-fd /tmp/postcopy-pipe

In this example CRIU will open the network port and then wait for a
network connection. As runC waits for CRIU to finish, it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:

 # runc checkpoint --help | grep status
  --status-fd value   criu writes \0 to this FD once lazy-pages is ready

The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.

On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:

 # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
  -D checkpoint

and tell runC to do a lazy restore:

 # runc restore -d --image-path checkpoint --work-path checkpoint \
  --lazy-pages httpd

If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and tells 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.

This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.

Some additional background about post-copy migration can be found in
these articles:

 https://lisas.de/~adrian/?p=1253
 https://lisas.de/~adrian/?p=1183

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Nikolas Sepos 3f234b15d0 Add auto-dedup flag for checkpoint/restore
When doing incremental dumps it is useful to use auto deduplication of
memory images to save space.

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 16:19:21 +02:00
Andrei Vagin 1c43d091a1 checkpoint: add support for containers with terminals
CRIU was extended to report about orphaned master pty-s via RPC.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-02 04:48:47 +03:00
Andrei Vagin a4fcbfb704 Prepare startContainer() to have more action
Currently startContainer() is used to create and to run a container.
In the next patch it will be used to restore a container.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:55:57 +03:00
Tim Potter 9458b39ca9 Fix misspelling of "properties" in various places
Signed-off-by: Tim Potter <tpot@hpe.com>
2017-04-21 13:29:58 +10:00
Aleksa Sarai d2f49696b0 runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:24 +11:00
Michael Crosby 00a0ecf554 Add separate console socket
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Mrunal Patel 899b0748f0 Merge pull request #1308 from giuseppe/fix-systemd-notify
fix systemd-notify when using a different PID namespace
2017-02-24 11:05:21 -08:00
Giuseppe Scrivano d5026f0e43 signals: support detach and notify socket together
let runc run until READY= is received and then proceed with
detaching the process.
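
The readiness check could look roughly like this. It is a sketch only: `isReady` is a hypothetical helper, and it relies on notify datagrams being newline-separated `KEY=value` assignments.

```go
package main

import (
	"fmt"
	"strings"
)

// isReady (hypothetical) reports whether a notify datagram contains a
// READY= assignment. Datagrams may carry several newline-separated
// assignments, so every line is inspected.
func isReady(datagram string) bool {
	for _, line := range strings.Split(datagram, "\n") {
		if strings.HasPrefix(line, "READY=") {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isReady("STATUS=starting up"))
	fmt.Println(isReady("STATUS=done\nREADY=1"))
}
```

A caller would keep reading datagrams until `isReady` returns true and only then detach.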

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2017-02-22 22:28:03 +01:00
Giuseppe Scrivano 892f2ded6f fix systemd-notify when using a different PID namespace
The current support of systemd-notify has a race condition as the
message sent to the systemd notify socket might be dropped if the sender
process is not running by the time systemd checks for the sender of the
datagram.  A proper fix of this in systemd would require changes to the
kernel to maintain the cgroup of the sender process when it is dead (but
that is probably not going to happen...)
Generally, the solution to this issue is to specify the PID in the
message itself so that systemd does not have to guess the sender, but this
wouldn't work when running in a PID namespace as the container will pass
the PID known in its namespace (something like PID=1,2,3..) and systemd
running on the host is not able to map it to the runc service.

The proposed solution is to have a proxy in runc that forwards the
messages to the host systemd.

Example of this issue:

https://github.com/projectatomic/atomic-system-containers/pull/24

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2017-02-22 22:27:59 +01:00
Deng Guangxing 98f004182b add pre-dump and parent-path to checkpoint
CRIU uses pre-dump to perform iterative migration.
pre-dump saves process memory info only, and it needs parent-path
to specify the former memory files.

This patch adds pre-dump and parent-path arguments to runc checkpoint.

Signed-off-by: Deng Guangxing <dengguangxing@huawei.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
2017-02-14 19:45:07 +08:00
Mrunal Patel c54f1495e3 Fix error shadow and error check warnings
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2017-01-06 16:21:23 -08:00
Aleksa Sarai c6d8a2f26f merge branch 'pr-1158'
Closes #1158
LGTMs: @hqhq @cyphar
2016-12-26 13:59:47 +11:00
Aleksa Sarai 244c9fc426 *: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensure that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancillary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Zhang Wei b517076907 Check args numbers before application start
Add a general args number validator for all client commands.
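
A validator of this kind might be sketched as below. The names `checkArgs`, `exactArgs`, `minArgs` and `maxArgs` are illustrative assumptions, and the sketch takes plain integers rather than a CLI context so that it stays self-contained.

```go
package main

import "fmt"

// checkType selects how the argument count is compared (hypothetical).
type checkType int

const (
	exactArgs checkType = iota
	minArgs
	maxArgs
)

// checkArgs (hypothetical) validates the number of positional arguments
// of a command before the command itself runs.
func checkArgs(command string, got, expected int, kind checkType) error {
	switch kind {
	case exactArgs:
		if got != expected {
			return fmt.Errorf("%s: requires exactly %d argument(s)", command, expected)
		}
	case minArgs:
		if got < expected {
			return fmt.Errorf("%s: requires at least %d argument(s)", command, expected)
		}
	case maxArgs:
		if got > expected {
			return fmt.Errorf("%s: requires at most %d argument(s)", command, expected)
		}
	}
	return nil
}

func main() {
	// A command that needs exactly one argument, called with none.
	if err := checkArgs("state", 0, 1, exactArgs); err != nil {
		fmt.Println(err)
	}
}
```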

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
2016-11-29 11:18:51 +08:00
xiekeyang 55e783b57a remove unused returned variables name
The names of the returned variables are unused and can be removed.

Signed-off-by: xiekeyang <xiekeyang@huawei.com>
2016-06-15 17:41:57 +08:00
Andrew Vagin acef7461a4 restore: add the empty-ns option
For example:
./runc restore --empty-ns network CTID

In this case criu creates a network namespace, but doesn't restore it.

We are going to use this option to restore docker containers and
Docker sets a hook to restore a network namespace.

https://github.com/xemul/criu/issues/165
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
2016-06-07 20:24:59 +03:00
Mrunal Patel a753b06645 Replace github.com/codegangsta/cli by github.com/urfave/cli
The package got moved to a different repository

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-06-06 11:47:20 -07:00
Qiang Huang 2503fca35d Update man pages to reflect the latest cli change
The major change is the description of options: they now follow
the latest cli help message, which specifies
a "value" after an option if it takes a value, and adds
(default: xxx) if the option has a default value.

This also includes some other minor consistency fixes.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-05-28 13:33:57 +08:00
Aleksa Sarai 1a913c7b89 *: correctly chown() consoles
In user namespaces, we need to make sure we don't chown() the console to
unmapped users. This means we need to get both the UID and GID of the
root user in the container when changing the owner.
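
The lookup of the container root's host IDs can be sketched as follows. `idMap` and `hostID` are hypothetical names; the mapping layout (container ID, host ID, size) follows the usual user-namespace ID-map triple.

```go
package main

import "fmt"

// idMap is one user-namespace mapping entry (hypothetical type).
type idMap struct {
	ContainerID, HostID, Size int
}

// hostID (hypothetical) translates an ID inside the container to the
// corresponding host ID, or fails if the ID is unmapped.
func hostID(containerID int, maps []idMap) (int, error) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), nil
		}
	}
	return -1, fmt.Errorf("container ID %d is not mapped", containerID)
}

func main() {
	uidMaps := []idMap{{ContainerID: 0, HostID: 100000, Size: 65536}}
	gidMaps := []idMap{{ContainerID: 0, HostID: 200000, Size: 65536}}

	// Look up the host-side owner for the container's root user and group.
	uid, err := hostID(0, uidMaps)
	if err != nil {
		panic(err)
	}
	gid, err := hostID(0, gidMaps)
	if err != nil {
		panic(err)
	}
	fmt.Printf("chown console to %d:%d\n", uid, gid)
}
```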

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-05-22 22:37:13 +10:00
Qiang Huang 8477638aab Update cli package
The old one has a bug when showing the help message for IntFlags.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-05-10 13:58:09 +08:00
Michael Crosby f417e993d0 Update spec to v0.5.0
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-04-12 14:11:40 -07:00
Michael Crosby 12bd4cffd0 Add --no-pivot option for containers on ramdisk
This adds a `--no-pivot` cli flag to runc so that a container's rootfs
can be located on top of ramdisk/tmpfs and not fail because you cannot
pivot root.

This should be a cli flag and not part of the spec because this is a
detail of the host/runtime environment and not an attribute of a
container.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-30 12:02:17 -07:00
Ido Yariv 28b21a5988 Export CreateLibcontainerConfig
Users of libcontainer other than runc may also require parsing and
converting specification configuration files.

Since runc cannot be imported, move the relevant functions and
definitions to a separate package, libcontainer/specconv.

Signed-off-by: Ido Yariv <ido@wizery.com>
2016-03-25 12:19:18 -04:00
Mrunal Patel 7e91a96605 Add support for systemd cgroups in runc
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-03-22 17:08:07 -07:00
Michael Crosby fdb100d247 Destroy container along with processes before stdio
We need to make sure the container is destroyed before closing the stdio
for the container.  This becomes a big issue when running in the host's
pid namespace because the other processes could have inherited the stdio
of the initial process.  The call to close will just block as they still
have the io open.

Calling destroy before closing io, especially in the host pid namespace
will cause all additional processes to be killed in the container's
cgroup.  This will allow the io to be closed successfully.

This change makes sure the order for destroy and close is correct, and
ensures that any errors encountered during start or exec are
handled by terminating the process and destroying the container.  We
cannot use defers here because we need to enforce the correct ordering
on destroy.

This also sets the subreaper setting for runc so that when running in
pid host, runc can wait on the additional processes launched by the
container, useful on destroy, but also good for reaping the additional
processes that were launched.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-15 13:17:11 -07:00
Michael Crosby 47eaa08f5a Update runc usage for new specs changes
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-10 14:18:39 -08:00
Michael Crosby 044e298507 Improve error handling in runc
The error handling on the runc cli is currently pretty messy because
messages to the user are split between regular stderr format and logrus
message format.  This changes all the error reporting to the cli to only
output on stderr and exit(1) for consumers of the api.

By default logrus logs to /dev/null so that it is not seen by the user.
If the user wants extra and/or structured logging/errors from runc they
can use the `--log` flag to provide a path to the file where they want
this information.  This allows a consistent behavior on the cli but
extra power and information when debugging with logs.

This also includes a change to enable the same logging information
inside the container's init by adding an init cli command that can share
the existing flags for all other runc commands.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-09 11:08:30 -08:00
Michael Crosby 8d0a05b8dd Wait for pipes to write all data before exit
Add a waitgroup to wait for the io.Copy of stdout/err to finish before
exiting runc.  The problem happens more in exec because it is really
fast and the pipe has data buffered but not yet read after the process
has already exited.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-26 12:14:47 -08:00
Mrunal Patel 90472aeb9e Merge pull request #546 from mikebrow/usage-updates
updating usage for runc, and all runc commands that now use <container id> as the first argument
2016-02-17 21:13:22 +05:30
Mike Brown f4e37ab63e updating usage for runc and runc commands
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2016-02-17 09:00:39 -06:00
Michael Crosby ce72f86a2b Merge pull request #558 from rajasec/tty-panic
panic during start of failed detached container
2016-02-16 16:01:08 -08:00
Julian Friedman 5fbdf6c3fc Register signal handlers earlier to avoid zombies
newSignalHandler needs to be called before the process is started, otherwise when
the process exits quickly the SIGCHLD is received (and ignored) before the
handler is set up. When this happens the reaper never runs, the
process becomes a zombie, and the exit code isn't returned to the user.
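
The ordering can be illustrated with a small Unix-only sketch: the signal channel is registered before the child is started, so a SIGCHLD from a fast-exiting child is not missed. `runAndReap` is a hypothetical helper and does not reproduce runc's own handler.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"os/signal"
	"syscall"
)

// runAndReap (hypothetical) installs the SIGCHLD handler first, then
// starts the child, waits for the signal, reaps the child and returns
// its exit code.
func runAndReap(name string, args ...string) (int, error) {
	sigs := make(chan os.Signal, 8)
	signal.Notify(sigs, syscall.SIGCHLD) // registered BEFORE the child starts
	defer signal.Stop(sigs)

	cmd := exec.Command(name, args...)
	if err := cmd.Start(); err != nil {
		return -1, err
	}

	<-sigs // the child has changed state; it is safe to reap now
	err := cmd.Wait()
	if exitErr, ok := err.(*exec.ExitError); ok {
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err
	}
	return 0, nil
}

func main() {
	code, err := runAndReap("sh", "-c", "exit 3")
	if err != nil {
		panic(err)
	}
	fmt.Println("exit code:", code)
}
```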

Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
2016-02-16 18:38:54 +00:00
rajasec 321b842404 panic during start of failed detached container
Signed-off-by: rajasec <rajasec79@gmail.com>

Adding nil check before closing tty for restore operation

Signed-off-by: rajasec <rajasec79@gmail.com>
2016-02-14 19:11:09 +05:30
rajasec a7ee55b716 Adding tty closure for restore operation
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-02-10 09:48:12 +05:30
Michael Crosby a7278cad98 Require container id as arg 1
Closes #532

This requires the container id to always be passed to all runc commands
as arg one on the cli.  This was the result of the last OCI meeting and
how operations work with the spec.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-09 11:20:55 -08:00
Mike Brown c2c0458598 merges latest spec with runc
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2016-02-05 12:47:09 -08:00
Michael Crosby fbc74c0eba Add detach and pid-file to restore
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-05 11:56:21 -08:00
Michael Crosby 4c4c9b85b7 Add --console to specify path to use from runc
This flag allows systems that are running runc to allocate tty's that
they own and provide to the container.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-01-07 15:01:36 -08:00
Michael Crosby 4415446c32 Add state pattern for container state transition
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Add state status() method

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Allow multiple checkpoint on restore

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Handle leave-running state

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Fix state transitions for inprocess

Because the tests use libcontainer in process between the various states
we need to ensure that that usecase works as well as the out of process
one.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Remove isDestroyed method

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Handling Pausing from freezer state

Signed-off-by: Rajasekaran <rajasec79@gmail.com>

freezer status

Signed-off-by: Rajasekaran <rajasec79@gmail.com>

Fixing review comments

Signed-off-by: Rajasekaran <rajasec79@gmail.com>

Added comment when freezer not available

Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Conflicts:
	libcontainer/container_linux.go

Change checkFreezer logic to isPaused()

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Remove state base and factor out destroy func

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Add unit test for state transitions

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-12-17 13:55:38 -08:00
Michael Crosby 29b139f702 Move STDIO initialization to libcontainer.Process
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-12-10 16:11:49 -08:00
Mike Brown 8b19581694 adding support for --bundle -b to start, restore, and spec; fixes issue #310
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2015-11-13 09:13:57 -06:00
Hui Kang 25da513c4b Add option to support criu manage cgroups mode for dump and restore
CRIU supports cgroup-manage mode from v1.7

Signed-off-by: Hui Kang <hkang.sunysb@gmail.com>
2015-10-11 04:42:54 +00:00
Alexander Morozov ea5032bc5e Adjust runc to new opencontainers/specs version
I removed the possibility to specify a config file from commands for now,
until we decide how it'll be done. I also changed the runc spec interface to
write config files instead of printing them.

Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2015-09-15 08:35:25 -07:00
Rajasekaran 77af09efd6 Restorefixforrunningcontainer
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
2015-08-31 22:16:38 +05:30
Alexander Morozov 37c506058d Merge pull request #221 from crosbymichael/defaults-criu
Remove hard-coded default for tcp connections
2015-08-28 11:24:36 -07:00
Michael Crosby ba56afde7b Remove hard-coded default for tcp connections
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-08-21 15:59:43 -07:00