Commit Graph

1120 Commits

Author SHA1 Message Date
W. Trevor King be16b13645 libcontainer/state_linux_test: Add a testTransitions helper
The helper DRYs up the transition tests and makes it easy to get
complete coverage for invalid transitions.

I'm also using t.Run() for subtests.  Run() is new in Go 1.7 [1], but
runc dropped support for 1.6 back in e773f96b (update go version at
travis-ci, 2017-02-20, #1335).

[1]: https://blog.golang.org/subtests

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-01-25 11:18:45 -08:00
Michael Crosby 91ca331474 chroot when no mount namespaces is provided
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-01-25 11:36:37 -05:00
Michael Crosby c4e4bb0df2
Merge pull request #1699 from AkihiroSuda/indent-c
make: validate C format
2018-01-25 10:09:09 -05:00
Aleksa Sarai 5a46c2ba8b
nsenter: move namespace creation after userns creation
Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace *before*
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC *after* the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-01-25 23:56:49 +11:00
Akihiro Suda dd5eb3b9e3 make: validate C format
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-01-24 10:49:50 +09:00
Ed King 5c0af14bf8 Return from goroutine when it should terminate
Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-23 10:46:31 +00:00
Will Martin 8d3e6c9826 Avoid race when opening exec fifo
When starting a container with `runc start` or `runc run`, the stub
process (runc[2:INIT]) opens a fifo for writing. Its parent runc process
will open the same fifo for reading. In this way, they synchronize.

If the stub process exits at the wrong time, the parent runc process
will block forever.

This can happen when racing 2 runc operations against each other: `runc
run/start`, and `runc delete`. It could also happen for other reasons,
e.g. the kernel's OOM killer may select the stub process.

This commit resolves this race by racing the opening of the exec fifo
from the runc parent process against the stub process exiting. If the
stub process exits before we open the fifo, we return an error.

Another solution is to wait on the stub process. However, it seems it
would require more refactoring to avoid calling wait multiple times on
the same process, which is an error.

Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-22 17:03:02 +00:00
Antonio Murdaca cd1e7abee2
libcontainer: expose annotations in hooks
Annotations weren't passed to hooks. This patch fixes that by passing
annotations to stdin for hooks.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-01-11 16:54:01 +01:00
vikaschoudhary16 d5b4a3eddb Fix race against systemd
- T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298)
- T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner.
- T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code

Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-01-08 09:37:26 -05:00
Mrunal Patel e6516b3d5d
Merge pull request #1678 from sboeuf/sboeuf/subreaper
libcontainer: Do not wait for signalled processes if subreaper is set
2017-12-15 08:47:07 -08:00
Michael Crosby 7f24b40cc5
Merge pull request #1675 from tklauser/apparmor-no-cgo
RFC: libcontainer: remove dependency on libapparmor
2017-12-15 11:23:35 -05:00
Tobias Klauser db093f621f libcontainer: remove dependency on libapparmor
libapparmor is integrated in libcontainer using cgo but is only used to
call a single function: aa_change_onexec. It turns out this function is
simple enough (writing a string to a file in /proc/<n>/attr/...) to be
re-implemented locally in libcontainer in plain Go.

This allows to drop the dependency on libapparmor and the corresponding
cgo integration.

Fixes #1674

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-12-15 09:59:58 +01:00
Sebastien Boeuf bb912eb00c libcontainer: Do not wait for signalled processes if subreaper is set
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. That's the reason why this patch is giving
the possibility to define the usage of a subreaper as a consumer of
libcontainer. Relying on this information, libcontainer will not
wait for signalled processes in case a subreaper has been set.

Fixes #1677

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-12-14 10:37:38 -08:00
Mrunal Patel c6e4a1ebeb
Merge pull request #1665 from Mashimiao/gidmapping-valid-fix
specconv: avoid skipping gidmappings applied when uidmappings is empty
2017-12-11 09:50:54 -08:00
Mrunal Patel b028413c35
Merge pull request #1655 from Mashimiao/add-propagation-more
support unbindable,runbindable for rootfs propagation
2017-12-11 09:21:41 -08:00
Allen Sun fec6b0fea5 Update criu_opts_linux.go
Signed-off-by: Allen Sun <shlallen1990@gmail.com>
2017-12-05 15:16:26 +08:00
Michael Crosby 91e9795013
Merge pull request #1654 from dqminh/only-linux
remove placeholder for non-linux platforms
2017-11-30 09:51:47 -05:00
Ma Shimiao 57edfbbaf2 specconv: avoid skipping gidmappings applied when uidmappings is empty
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-11-30 16:24:36 +08:00
Aleksa Sarai e8149af291
merge branch 'pr-1661'
Ensure container tests do not write on the host

LGTMs: @hqhq @cyphar
Closes #1661
2017-11-27 20:10:48 +11:00
Danail Branekov 0495fece57 Ensure container tests do not write on the host
TestGetContainerStateAfterUpdate creates its state.json file on the current
directory which turns out to be the host runc directory. Thus whenever
the test completes it leaves the state.json file behind thus
a) poluting the local git repository
b) changing the host file system violating the principle of doing
everything in an isolated container environment

This change would create a new temporary (in-container) directory and use it as
linuxContainer.root

Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
2017-11-27 10:43:10 +02:00
Daniel Dao 8898b6b446 remove placeholder for non-linux platforms
runc currently only support Linux platform, and since we dont intend to expose
the support to other platform, removing all other platforms placeholder code.

`libcontainer/configs` still being used in
https://github.com/moby/moby/blob/master/daemon/daemon_windows.go so
keeping it for now.

After this, we probably should also rename files to drop linux suffices
if possible.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2017-11-24 18:14:51 +00:00
Daniel, Dao Quang Minh fb871d9cd0
Merge pull request #1664 from tklauser/drop-freebsd
libcontainer: drop FreeBSD support
2017-11-24 18:08:21 +00:00
Tobias Klauser 4d27f20db0 libcontainer: drop FreeBSD support
runc is not supported on FreeBSD, so remove all FreeBSD specific bits.

As suggested by @crosbymichael in #1653

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-11-24 14:51:05 +01:00
Danail Branekov 38d1e6ec27 Delete xattr related code
Selinux related code has been moved to the selinux package
(https://github.com/opencontainers/selinux) and therefore xattr related
code can be deleted from libcontainer

Signed-off-by: Danail Branekov <danailster@gmail.com>
2017-11-21 12:49:28 +02:00
Ma Shimiao 17db6560be support unbindable,runbindable for rootfs propagation
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-11-17 16:14:15 +08:00
Seth Jennings bca53e7b49 systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling
Signed-off-by: Seth Jennings <sjenning@redhat.com>
2017-11-15 20:20:06 -06:00
Vincent Demeester 3ca4c78b1a
Import docker/docker/pkg/mount into runc
This will help get rid of docker/docker dependency in runc 👼

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-11-08 16:25:58 +01:00
Michael Crosby 2f010ecf19
Merge pull request #1622 from vdemeester/import-symlink-from-docker
Remove pkg/symlink from docker/docker and use cyphar/filepath-securejoin
2017-11-08 10:07:00 -05:00
Akihiro Suda 0aac2368e4 specconv.Example(): add /proc/scsi to masked paths
Port over https://github.com/moby/moby/pull/35399

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-11-04 17:38:14 +00:00
Michael Crosby 0232e38342
Merge pull request #1629 from masters-of-cats/busybox-inflation
Avoid disk usage explosion when copying busybox
2017-11-01 09:15:22 -04:00
Danail Branekov fdbb9e3e55 Avoid disk usage explosion when copying busybox
When running runc tests with temp directory with size 500M copying
busybox without preserving hardlinks causes the folder to inflate to
roughly 330M. Copying busybox twice in certain tests causes the /tmp
directory to overfill. Using `-a` preserves links which busybox uses to
implement its choice of binary to run.

Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
2017-11-01 09:52:05 +00:00
Vincent Demeester 594501475e
Use cyphar/filepath-securejoin instead of docker pkg/symlink
runc shouldn't depend on docker and be more self-contained.
Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-10-31 16:53:45 +01:00
Lorenzo Fontana 780f8ef567
Specconv: Test create command hooks and seccomp setup
Signed-off-by: Lorenzo Fontana <lo@linux.com>
2017-10-28 21:46:46 +02:00
Mrunal Patel 9a1186d128 Merge pull request #1619 from fntlnz/spec-linux-testing
WIP: Better testsuite for specconv
2017-10-25 15:23:19 -07:00
Lorenzo Fontana c0e6e12f9d
Test Cgroup creation and memory allocations
Signed-off-by: Lorenzo Fontana <lo@linux.com>
2017-10-25 01:58:10 +02:00
Aleksa Sarai ff5075c33f
init: correctly handle unmapped stdio with multiple mappings
Previously we would handle the "unmapped stdio" case by just doing a
simple check, however this didn't handle cases where the overflow_uid
was actually mapped in the user namespace. Instead of doing some
userspace checks, just try to do the fchown(2) and ignore EINVAL
(unmapped) or EPERM (lacking privilege over inode) errors.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-10-25 00:12:21 +11:00
Qiang Huang 74a1729647 Merge pull request #1607 from crosbymichael/term-err
libcontainer: handler errors from terminate
2017-10-20 15:15:38 +08:00
Qiang Huang e8b9b92f57 Merge pull request #1206 from YuPengZTE/devMD026
trailing punctuation in header
2017-10-20 14:47:09 +08:00
Mrunal Patel 80ee9e50b5 Merge pull request #1616 from mheon/seccomp_fix_breakage
Fix breaking change in Seccomp profile behavior
2017-10-19 14:15:04 -07:00
Aleksa Sarai c05f6368af
merge branch 'pr-1615'
libcontainer: intelrdt: fix a GetStats() issue

LGTMs: @crosbymichael @cyphar
Closes #1615
2017-10-19 03:41:16 +11:00
Matthew Heon e9193ba6e6 Fix breaking change in Seccomp profile behavior
Multiple conditions were previously allowed to be placed upon the
same syscall argument. Restore this behavior.

Signed-off-by: Matthew Heon <mheon@redhat.com>
2017-10-18 11:53:56 -04:00
Qiang Huang 3409d5c555 Merge pull request #1606 from cyphar/rootfs-propagation-no-pivot
specconv: emit an error when using MS_PRIVATE with --no-pivot
2017-10-18 09:52:04 +08:00
Xiaochen Shen d89217515b libcontainer: intelrdt: fix a GetStats() issue
This fixes a GetStats() issue introduced in #1590:
If Intel RDT is enabled by hardware and kernel, but intelRdt is not
specified in original config, GetStats() will return error unexpectedly
because we haven't called Apply() to create intelrdt group or attach
tasks for this container. As a result, runc events command will have no
output.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-10-17 17:37:07 +08:00
Tobias Klauser 0eed453b21 libcontainer: use Major/Minor from x/sys/unix
The Major and Minor functions were added for Linux in golang/sys@85d1495
which is already vendored in. Use these functions instead of the local
re-implementation.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-10-17 09:06:42 +02:00
Aleksa Sarai 9b13f5cc7f
merge branch 'pr-1453'
propagate argv0 when re-execing from /proc/self/exe

LGTMs: @crosbymichael @cyphar
Closes #1453
2017-10-17 03:12:22 +11:00
Michael Crosby ff4481dbf6 Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups
Support cgroups with limits as rootless
2017-10-16 12:03:49 -04:00
Petros Angelatos 8098828680
propagate argv0 when re-execing from /proc/self/exe
This allows runc to be used as a target for docker's reexec module that
depends on a correct argv0 to select which process entrypoint to invoke.
Without this patch, when runc re-execs argv0 is set to "/proc/self/exe"
and the reexec module doesn't know what to do with it.

Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
2017-10-16 14:00:26 +02:00
Tobias Klauser d2bc081420 libcontainer: merge common syscall implementations
There are essentially two possible implementations for Setuid/Setgid on
Linux, either using SYS_SETUID32/SYS_SETGID32 or SYS_SETUID/SYS_SETGID,
depending on the architecture (see golang/go#1435 for why Setuid/Setgid
aren currently implemented for Linux neither in syscall nor in
golang.org/x/sys/unix).

Reduce duplication by merging the currently implemented variants and
adjusting the build tags accordingly.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-10-16 11:11:18 +02:00
Aleksa Sarai 6d30f7a01b
merge branch 'pr-1424'
Update Travis config to use trusty-backports libseccomp
  Add integration tests for multi-argument Seccomp filters
  Vendor updated libseccomp-golang for bugfix

LGTMs: @crosbymichael @cyphar
Closes #1424
2017-10-16 03:01:37 +11:00
Aleksa Sarai d2ac52fe52
merge branch 'pr-1475'
Add support for mips/mips64
  Put signalMap in a separate file, so it may be arch-specific

LGTMs: @crosbymichael @cyphar
Closes #1475
2017-10-16 02:59:34 +11:00
Aleksa Sarai 2430a98e64
merge branch 'pr-1500'
rootfs: switch ms_private remount of oldroot to ms_slave

LGTMs: @crosbymichael @hqhq
Closes opencontainers/runc#1500
2017-10-14 09:32:59 +11:00
Sebastien Boeuf acb93c9c62 libcontainer: cgroups: Write freezer state after every state check
This commit ensures we write the expected freezer cgroup state after
every state check, in case the state check does not give the expected
result. This can happen when a new task is created and prevents the
whole cgroup to be FROZEN, leaving the state into FREEZING instead.

This patch prevents the case of an infinite loop to happen.

Fixes https://github.com/opencontainers/runc/issues/1609

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-10-12 07:07:28 -07:00
Matthew Heon bbc847a457 Add integration tests for multi-argument Seccomp filters
Signed-off-by: Matthew Heon <mheon@redhat.com>
2017-10-10 15:49:08 -04:00
Michael Crosby bfe3058fc9 Make process check more forgiving
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-10-10 15:36:19 -04:00
Steven Hartland eb68b900bc Prevent invalid errors from terminate
Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate.

Instead of letting these get returned and logged, which causes confusion, silently ignore them.

Currently the test needs to be a string test as the errors are private to the runtime packages, so its our only option.

This can be seen if init fails during the setns.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-10-10 15:32:46 -04:00
Michael Crosby 4693fae411 Merge pull request #1590 from xiaochenshen/rdt-cat-support-update-command
libcontainer: intelrdt: add update command support
2017-10-10 15:25:22 -04:00
Aleksa Sarai d4f0f9a52b
specconv: emit an error when using MS_PRIVATE with --no-pivot
Due to the semantics of chroot(2) when it comes to mount namespaces, it
is not generally safe to use MS_PRIVATE as a mount propgation when using
chroot(2). The reason for this is that this effectively results in a set
of mount references being held by the chroot'd namespace which the
namespace cannot free. pivot_root(2) does not have this issue because
the @old_root can be unmounted by the process.

Ultimately, --no-pivot is not really necessary anymore as a commonly
used option since f8e6b5af5e ("rootfs: make pivot_root not use a
temporary directory") resolved the read-only issue. But if someone
really needs to use it, MS_PRIVATE is never a good idea.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-10-08 17:50:55 +11:00
Michael Crosby f53ad9cec9 Merge pull request #1604 from AkihiroSuda/cwd
libcontainer: create Cwd when it does not exist
2017-10-05 11:15:10 -04:00
Will Martin ca4f427af1 Support cgroups with limits as rootless
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
2017-10-05 11:22:54 +01:00
Akihiro Suda 2edd36fdff libcontainer: create Cwd when it does not exist
The benefit for doing this within runc is that it works well with
userns.
Actually, runc already does the same thing for mount points.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-10-05 05:31:46 +00:00
Konstantinos Karampogias 605dc5c811 Set initial console size based on process spec
Signed-off-by: Will Martin <wmartin@pivotal.io>
Signed-off-by: Petar Petrov <pppepito86@gmail.com>
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com>
Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>
2017-10-04 12:32:16 +01:00
Daniel, Dao Quang Minh 0351df1c5a Merge pull request #1600 from crosbymichael/console
Bump console and sys deps
2017-09-26 10:15:10 +01:00
Michael Crosby f364c1a58c Set ClearONLCR in tests
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-09-25 13:35:22 -04:00
Tobias Klauser d713652bda libcontainer: remove unnecessary type conversions
Generated using github.com/mdempsky/unconvert

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-09-25 10:41:57 +02:00
Qiang Huang 79ad714374 Merge pull request #1598 from euank/ragent
libcontainer: default mount propagation correctly
2017-09-25 11:55:29 +08:00
Euan Kemp 4301b440d6 libcontainer: default mount propagation correctly
The code in prepareRoot (e385f67a0e/libcontainer/rootfs_linux.go (L599-L605))
attempts to default the rootfs mount to `rslave`. However, since the spec
conversion has already defaulted it to `rprivate`, that code doesn't
actually ever do anything.

This changes the spec conversion code to accept "" and treat it as 0.

Implicitly, this makes rootfs propagation default to `rslave`, which is
a part of fixing the moby bug https://github.com/moby/moby/issues/34672

Alternate implementatoins include changing this defaulting to be
`rslave` and removing the defaulting code in prepareRoot, or skipping
the mapping entirely for "", but I think this change is the cleanest of
those options.

Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
2017-09-22 13:36:23 -07:00
Xiaochen Shen 2549545df5 intelrdt: always init IntelRdtManager if Intel RDT is enabled
In current implementation:
Either Intel RDT is not enabled by hardware and kernel, or intelRdt is
not specified in original config, we don't init IntelRdtManager in the
container to handle intelrdt constraint. It is a tradeoff that Intel RDT
has hardware limitation to support only limited number of groups.

This patch makes a minor change to support update command:
Whether or not intelRdt is specified in config, we always init
IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is
not specified in original config, we just don't Apply() to create
intelrdt group or attach tasks for this container.

In update command, we could re-enable through IntelRdtManager.Apply()
and then update intelrdt constraint.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-20 01:37:31 +08:00
Michael Crosby 593914b8bd Merge pull request #1593 from s7v7nislands/drop_go1.5
Drop support golang 1.5
2017-09-12 15:22:00 -04:00
s7v7nislands 00ad8e1e56 Drop support golang 1.5
Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>
2017-09-12 20:56:51 +08:00
Qiang Huang 68e00e906b Merge pull request #1586 from crosbymichael/set-cgroups
Apply cgroups earlier
2017-09-12 12:13:29 +08:00
Yong Tang e9944d0f4c Disable systemd in static build
This fix tries to address the warnings caused by static build
with go 1.9. As systemd needs dlopen/dlclose, the following warnings
will be generated for static build in go 1.9:
```
root@f4b077232050:/go/src/github.com/opencontainers/runc# make static
CGO_ENABLED=1 go build  -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc .
/tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen':
/tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
```

This fix disables systemd when `static_build` flag is on (apply_nosystemd.go
is used instead).

This fix also fixes a small bug in `apply_nosystemd.go` for return value.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2017-09-11 18:38:22 +00:00
Mrunal Patel d5b43c3981 Merge pull request #1455 from dqminh/epoll-io
tty: move IO of master pty to be done with epoll
2017-09-11 11:32:42 -07:00
Aleksa Sarai 1a5fdc1c5f
init: support setting -u with rootless containers
Now that rootless containers have support for multiple uid and gid
mappings, allow --user to work as expected. If the user is not mapped,
an error occurs (as usual).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:33 +10:00
Aleksa Sarai 969bb49cc3
nsenter: do not resolve path in nsexec context
With the addition of our new{uid,gid}map support, we used to call
execvp(3) from inside nsexec. This would mean that the path resolution
for the binaries would happen in nsexec. Move the resolution to the
initial setup code, and pass the absolute path to nsexec.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:33 +10:00
Aleksa Sarai 6097ce74d8
nsenter: correctly handle newgidmap path for rootless containers
After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano 3282f5a7c1
tests: fix for rootless multiple uids/gids
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano d8b669400a
rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Mrunal Patel 13fa5d2953 Merge pull request #1588 from s7v7nislands/delete_unused
Delete unused function
2017-09-08 17:34:00 -07:00
Michael Crosby b82d07e816 Merge pull request #1587 from Mashimiao/fix-namespace-empty
Fixes #1585 config.Namespaces is empty when accessed
2017-09-08 10:50:16 -04:00
Xiaochen Shen 88d22fde40 libcontainer: intelrdt: use init() to avoid race condition
This is the follow-up PR of #1279 to fix remaining issues:

Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-08 17:15:31 +08:00
s7v7nislands c795b8690b Delete unused function
Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>
2017-09-08 10:35:46 +08:00
Ma Shimiao c3d20e7817 Fixes #1585 config.Namespaces is empty when accessed
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-09-08 09:30:07 +08:00
Mrunal Patel deb9d7fd96 Merge pull request #1569 from cyphar/delay-seccomp
init: delay seccomp application as late as possible
2017-09-07 13:27:37 -07:00
Mrunal Patel 7e036aa0b0 Merge pull request #1541 from adrianreber/lazy
checkpoint: support lazy migration
2017-09-07 13:25:04 -07:00
Michael Crosby 7062c7556b Apply cgroups earlier
This applies cgroups earlier for container creation before the init
process starts running and forking off any additional processes.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-09-07 11:27:33 -04:00
Adrian Reber 60ae7091de checkpoint: support lazy migration
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.

This enables to reduce the downtime during process or container
migration to a minimum as the memory does not need to be transferred
during migration.

Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and if the used CRIU version supports lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.

To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:

 # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
  --status-fd /tmp/postcopy-pipe

In this example CRIU will hang/wait once it has opened the network port
and wait for network connection. As runC waits for CRIU to finish it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:

 #️ runc checkpoint --help | grep status
  --status-fd value   criu writes \0 to this FD once lazy-pages is ready

The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.

On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:

 # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
  -D checkpoint

and tell runC to do a lazy restore:

 # runc restore -d --image-path checkpoint --work-path checkpoint \
  --lazy-pages httpd

If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and talks to 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.

This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.

Some additional background about post-copy migration can be found in
these articles:

 https://lisas.de/~adrian/?p=1253
 https://lisas.de/~adrian/?p=1183

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Adrian Reber a3a632ad28 checkpoint: add support to query for lazy page support
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Xiaochen Shen 4d2756c116 libcontainer: add test cases for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:35:40 +08:00
Xiaochen Shen 692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Xiaochen Shen af3b0d9dce libcontainer/SPEC.md: add documentation for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Aleksa Sarai 1f32fff46d
setns init: delay seccomp as late as possible
This mirrors the standard_init_linux.go seccomp code, which only applies
seccomp early if NoNewPrivileges is enabled. Otherwise it's done
immediately before execve to reduce the amount of syscalls necessary for
users to enable in their seccomp profiles.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:30 +10:00
Aleksa Sarai 3ddde27d7d
init: move close(stateDirFd) before seccomp apply
This further reduces the number of syscalls that a user needs to enable
in their seccomp profile.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:26 +10:00
Qiang Huang 1c81e2a794 Merge pull request #1572 from tych0/fix-readonly-userns
fix --read-only containers under --userns-remap
2017-08-26 09:38:14 +08:00
Aleksa Sarai 4d6e6720a7
Merge branch 'pr-1573'
Fix systemd cgroup after memory type changed

LGTMs: @crosbymichael @cyphar
Closes #1573
2017-08-25 23:55:27 +10:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Tycho Andersen 66eb2a3e8f fix --read-only containers under --userns-remap
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-08-24 16:43:21 -06:00
Nikolas Sepos da4a5a9515 Add AutoDedup option to CriuOpts
Memory image deduplication, very useful for incremental dumps.

See: https://criu.org/Memory_images_deduplication

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 01:21:42 +02:00
Michael Crosby ccd2c20aa4 Merge pull request #1559 from Mashimiao/panic-fix-nil-linux
fix panic when Linux is nil for rootless case
2017-08-17 09:57:35 -04:00
Ma Shimiao 2333e7dc67 fix panic when Linux is nil for rootless case
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-08-16 09:11:13 +08:00