Commit Graph

1494 Commits

Author SHA1 Message Date
Kenta Tada 65032b55b1 libcontainer: fix TestGetContainerState to check configs.NEWCGROUP
This test needs to handle the case of configs.NEWCGROUP
as Namespace's type.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-05-21 09:10:38 +09:00
Mrunal Patel 2484581dd7
Merge pull request #2035 from cyphar/bindmount-types
specconv: always set "type: bind" in case of MS_BIND
2019-05-07 15:47:58 -07:00
Mrunal Patel a0ecf749ee
Merge pull request #2047 from filbranden/systemd7
Move systemd.Manager initialization into a function in that module
2019-05-07 15:08:41 -07:00
Filipe Brandenburger 46351eb3d1 Move systemd.Manager initialization into a function in that module
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.

Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-05-01 13:22:19 -07:00
Georgi Sabev a146081828 Write logs to stderr by default
Minor refactoring to use the filePair struct for both init sock and log pipe

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-24 15:18:14 +03:00
Georgi Sabev 68b4ff5b37 Simplify bail logic & minor nsexec improvements
Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-24 15:16:11 +03:00
Xiaochen Shen 17b37ea3fa libcontainer: intelrdt: add missing destroy handler in defer func
In the exception handling of initProcess.start(), we need to add the
missing IntelRdtManager.Destroy() handler in defer func.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2019-04-24 16:41:51 +08:00
Georgi Sabev 475aef10f7 Remove redundant log function
Bump logrus so that we can use logrus.StandardLogger().Logf instead

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:54:55 +03:00
Georgi Sabev ba3cabf932 Improve nsexec logging
* Simplify logging function
* Logs contain __FUNCTION__:__LINE__
* Bail uses write_log

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:53:52 +03:00
Aleksa Sarai 8296826da5
specconv: always set "type: bind" in case of MS_BIND
We discovered in umoci that setting a dummy type of "none" would result
in file-based bind-mounts no longer working properly, which is caused by
a restriction for when specconv will change the device type to "bind" to
work around rootfs_linux.go's ... issues.

However, bind-mounts don't have a type (and Linux will ignore any type
specifier you give it) because the type is copied from the source of the
bind-mount. So we should always overwrite it to avoid user confusion.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-04-08 15:08:08 +10:00
Danail Branekov c486e3c406 Address comments in PR 1861
Refactor configuring logging into a reusable component
so that it can be nicely used in both main() and init process init()

Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Co-authored-by: Claudia Beresford <cberesford@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
2019-04-04 14:57:28 +03:00
Marco Vedovati feebfac358 Remove pipe close before exec.
Pipe close before exec is not necessary as os.Pipe() is calling pipe2
with O_CLOEXEC option.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:30 +03:00
Marco Vedovati 9a599f62fb Support for logging from children processes
Add support for children processes logging (including nsexec).
A pipe is used to send logs from children to parent in JSON.
The JSON format used is the same used by logrus JSON formatted,
i.e. children process can use standard logrus APIs.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:23 +03:00
Michael Crosby 11fc498ffa
Merge pull request #2023 from LittleLightLittleFire/2022-fix-runc-zombie-process-regression
Fixes regression causing zombie runc:[1:CHILD] processes
2019-03-22 14:06:31 -04:00
Mrunal Patel dd22a84864
Merge pull request #2012 from rhatdan/selinux
Need to setup labeling of kernel keyrings.
2019-03-20 21:17:18 -07:00
Alex Fang eab5330908 Fixes regression causing zombie runc:[1:CHILD] processes
Whenever processes are spawned using nsexec, a zombie runc:[1:CHILD]
process will always be created and will need to be reaped by the parent

Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
2019-03-21 13:43:38 +11:00
Aleksa Sarai f56b4cbead
merge branch 'pr-2015'
Use getenv not secure_getenv

LGTMs: @crosbymichael @cyphar
Closes #2015
2019-03-16 17:30:56 +11:00
Filipe Brandenburger 4b2b978291 Add cgroup name to error message
More information should help troubleshoot an issue when this error occurs.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 10:25:00 -07:00
Justin Cormack 6f714aa928
Use getenv not secure_getenv
secure_getenv is a Glibc extension and so this code does not compile
on Musl libc any more after this patch.

secure_getenv is only intended to be used in setuid binaries, in
order that they should not trust their environment. It simply returns
NULL if the binary is running setuid. If runc was installed setuid,
the user can already do anything as root, so it is game over, so this
check is not needed.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2019-03-14 10:58:10 +00:00
Daniel J Walsh cd96170c10
Need to setup labeling of kernel keyrings.
Work is ongoing in the kernel to support different kernel
keyrings per user namespace.  We want to allow SELinux to manage
kernel keyrings inside of the container.

Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.

Container running as container_t:s0:c1,c2 can manage keyrings with the same label.

This change required a revendoring or the SELinux go bindings.

github.com/opencontainers/selinux.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-03-13 17:57:30 -04:00
Mrunal Patel 2b18fe1d88
Merge pull request #1984 from cyphar/memfd-cleanups
nsenter: cloned_binary: "memfd" cleanups
2019-03-07 10:18:33 -08:00
Michael Crosby f739110263
Merge pull request #1968 from adrianreber/podman
Create bind mount mountpoints during restore
2019-03-04 11:37:07 -06:00
Aleksa Sarai 2d4a37b427
nsenter: cloned_binary: userspace copy fallback if sendfile fails
There are some circumstances where sendfile(2) can fail (one example is
that AppArmor appears to block writing to deleted files with sendfile(2)
under some circumstances) and so we need to have a userspace fallback.
It's fairly trivial (and handles short-writes).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:29:10 +11:00
Aleksa Sarai 16612d74de
nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a *very bad idea*).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked *alongside* the protection being
present.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:29:08 +11:00
Aleksa Sarai af9da0a450
nsenter: cloned_binary: use the runc statedir for O_TMPFILE
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:51 +11:00
Aleksa Sarai 2429d59352
nsenter: cloned_binary: expand and add pre-3.11 fallbacks
In order to get around the memfd_create(2) requirement, 0a8e4117e7
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

 * It required O_TMPFILE which is relatively new (having been added to
   Linux 3.11).

 * The fallback choice was made at compile-time, not runtime. This
   results in several complications when it comes to running binaries
   on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:50 +11:00
Aleksa Sarai 5b775bf297
nsenter: cloned_binary: detect and handle short copies
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-02-26 19:51:01 +11:00
Mrunal Patel 5b5130ad76
Merge pull request #1963 from adrianreber/go-criu
Vendor in go-criu and use it for CRIU's RPC definition
2019-02-23 10:44:28 -08:00
Adrian Reber 9edb5494bb
Use vendored in CRIU Go bindings
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-14 18:20:02 +01:00
Christian Brauner bb7d8b1f41
nsexec (CVE-2019-5736): avoid parsing environ
My first attempt to simplify this and make it less costly focussed on
the way constructors are called. I was under the impression that the ELF
specification mandated that arg, argv, and actually even envp need to be
passed to functions located in the .init_arry section (aka
"constructors"). Actually, the specifications is (cf. [2]):

SHT_INIT_ARRAY
This section contains an array of pointers to initialization functions,
as described in ``Initialization and Termination Functions'' in Chapter
5. Each pointer in the array is taken as a parameterless procedure with
a void return.

which means that this becomes a libc specific decision. Glibc passes
down those args, musl doesn't. So this approach can't work. However, we
can at least remove the environment parsing part based on POSIX since
[1] mandates that there should be an environ variable defined in
unistd.h which provides access to the environment. See also the relevant
Open Group specification [1].

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array

Fixes: CVE-2019-5736
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-02-14 16:06:21 +01:00
Filipe Brandenburger cd41feb46b Remove detection for scope properties, which have always been broken
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)

This can be seen in journal logs whenever a container is started with
libpod:

  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.

Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-02-11 16:05:37 -08:00
Adrian Reber 7354546cc8
Create mountpoints also on restore
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Adrian Reber f661e02343
factor out bind mount mountpoint creation
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Aleksa Sarai 0a8e4117e7
nsenter: clone /proc/self/exe to avoid exposing host binary to container
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).

We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.

This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).

Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-02-08 18:57:59 +11:00
Mrunal Patel e4fa8a4575
Merge pull request #1955 from xiaochenshen/rdt-fix-destroy-issue
libcontainer: intelrdt: fix null intelrdt path issue in Destroy()
2019-02-01 13:18:56 -08:00
Mrunal Patel 4e4c907193
Merge pull request #1950 from cloudfoundry-incubator/enter-pid-race
Resilience in adding of exec tasks to cgroups
2019-02-01 13:18:16 -08:00
Aleksa Sarai 565325fc36
integration: fix mis-use of libcontainer.Factory
For some reason, libcontainer/integration has a whole bunch of incorrect
usages of libcontainer.Factory -- causing test failures with a set of
security patches that will be published soon. Fixing ths is fairly
trivial (switch to creating a new libcontainer.Factory once in each
process, rather than creating one in TestMain globally).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-01-24 23:12:48 +13:00
Michael Crosby c1e454b2a1
Merge pull request #1960 from giuseppe/fix-kmem-systemd
systemd: fix setting kernel memory limit
2019-01-15 13:21:01 -05:00
Michael Crosby 4e9d52da54
Merge pull request #1933 from adrianreber/master
Add CRIU configuration file support
2019-01-15 11:22:38 -05:00
Giuseppe Scrivano 28a697cce3
rootfs: umount all procfs and sysfs with --no-pivot
When creating a new user namespace, the kernel doesn't allow to mount
a new procfs or sysfs file system if there is not already one instance
fully visible in the current mount namespace.

When using --no-pivot we were effectively inhibiting this protection
from the kernel, as /proc and /sys from the host are still present in
the container mount namespace.

A container without full access to /proc could then create a new user
namespace, and from there able to mount a fully visible /proc, bypassing
the limitations in the container.

A simple reproducer for this issue is:

unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger"

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-01-14 09:53:35 +01:00
Giuseppe Scrivano f01923376d
systemd: fix setting kernel memory limit
since commit df3fa115f9 it is not
possible to set a kernel memory limit when using the systemd cgroups
backend as we use cgroup.Apply twice.

Skip enabling kernel memory if there are already tasks in the cgroup.

Without this patch, runc fails with:

container_linux.go:344: starting container process caused
"process_linux.go:311: applying cgroup configuration for process
caused \"failed to set memory.kmem.limit_in_bytes, because either
tasks have already joined this cgroup or it has children\""

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-01-10 11:33:50 +01:00
Xiaochen Shen acb75d0e38 libcontainer: intelrdt: fix null intelrdt path issue in Destroy()
This patch fixes a corner case when destroy a container:

If we start a container without 'intelRdt' config set, and then we run
“runc update --l3-cache-schema/--mem-bw-schema” to add 'intelRdt' config
implicitly.

Now if we enter "exit" from the container inside, we will pass through
linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy().
But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still null
string, it hasn’t been initialized yet. As a result, the created rdt
group directory during "runc update" will not be removed as expected.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2019-01-05 00:34:25 +08:00
Adrian Reber e157963054
Enable CRIU configuration files
CRIU 3.11 introduces configuration files:

https://criu.org/Configuration_files
https://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html

This enables the user to influence CRIU's behaviour without code changes
if using new CRIU features or if the user wants to enable certain CRIU
behaviour without always specifying certain options.

With this it is possible to write 'tcp-established' to the configuration
file:

$ echo tcp-established > /etc/criu/runc.conf

and from now on all checkpoints will preserve the state of established
TCP connections. This removes the need to always use

$ runc checkpoint --tcp-stablished

If the goal is to always checkpoint with '--tcp-established'

It also adds the possibility for unexpected CRIU behaviour if the user
created a configuration file at some point in time and forgets about it.

As a result of the discussion in https://github.com/opencontainers/runc/pull/1933
it is now also possible to define a CRIU configuration file for each
container with the annotation 'org.criu.config'.

If 'org.criu.config' does not exist, runc will tell CRIU to use
'/etc/criu/runc.conf' if it exists.

If 'org.criu.config' is set to an empty string (''), runc will tell CRIU
to not use any runc specific configuration file at all.

If 'org.criu.config' is set to a non-empty string, runc will use that
value as an additional configuration file for CRIU.

With the annotation the user can decide to use the default configuration
file ('/etc/criu/runc.conf'), none or a container specific configuration
file.

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-12-21 07:42:12 +01:00
Adrian Reber 360ba8a27d
Update criurpc definition for latest features
Signed-off-by: Adrian Reber <areber@redhat.com>
2018-12-21 07:42:12 +01:00
JoeWrightss 0855bce448 Fix .Fatalf() error message
Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>
2018-12-19 20:22:48 +08:00
Tom Godkin bdf3524b34 Retry adding pids to cgroups when EINVAL occurs
The kernel will sometimes return EINVAL when writing a pid to a
cgroup.procs file. It does so when the task being added still has the
state TASK_NEW.

See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286

Co-authored-by: Danail Branekov <danailster@gmail.com>

Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
2018-12-17 15:34:47 +00:00
JoeWrightss 769d6c4a75 Fix some typos
Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>
2018-12-09 23:52:54 +08:00
Michael Crosby 25f3f893c8
Merge pull request #1939 from cyphar/nokmem-error
cgroups: nokmem: error out on explicitly-set kmemcg limits
2018-12-04 11:14:56 -05:00
Michael Crosby 96ec2177ae
Merge pull request #1943 from giuseppe/allow-to-signal-paused-containers
kill: allow to signal paused containers
2018-12-03 16:55:13 -05:00
Ace-Tang dce70cdff5 cr: get pid from criu notify when restore
when restore container from a checkpoint directory, we should get
pid from criu notify, since c.initProcess has not been created.

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-12-03 13:31:20 +08:00
Aleksa Sarai 8a4629f7b5
cgroups: nokmem: error out on explicitly-set kmemcg limits
When built with nokmem we explicitly are disabling support for kmemcg,
but it is a strict specification requirement that if we cannot fulfil an
aspect of the container configuration that we error out.

Completely ignoring explicitly-requested kmemcg limits with nokmem would
undoubtably lead to problems.

Fixes: 6a2c155968 ("libcontainer: ability to compile without kmem")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-12-01 14:31:35 +11:00
Giuseppe Scrivano 07d1ad44c8
kill: allow to signal paused containers
regression introduced by 87a188996e

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-11-30 23:35:47 +01:00
Michael Crosby 4932620b62
Merge pull request #1919 from xiaochenshen/rdt-mba-software-controller
libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc
2018-11-26 16:45:42 -05:00
Michael Crosby 50e2634995
Merge pull request #1934 from lifubang/kill
fix: may kill other process when container has been stopped
2018-11-21 10:30:25 -05:00
Lifubang 87a188996e may kill other process when container has been stopped
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-11-21 17:44:52 +08:00
Aleksa Sarai ceefc3fe4e
merge branch 'pr-1741'
libcontainer: Set 'status' in hook stdin

LGTMs: @cyphar @crosbymichael
Closes #1741
2018-11-20 06:39:30 +11:00
Michael Crosby 76520a4bf0
Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint
Respect container's cgroup path
2018-11-16 14:06:54 -05:00
W. Trevor King e23868603a libcontainer: Set 'status' in hook stdin
Finish off the work started in a344b2d6 (sync up `HookState` with OCI
spec `State`, 2016-12-19, #1201).

And drop HookState, since there's no need for a local alias for
specs.State.

Also set c.initProcess in newInitProcess to support OCIState calls
from within initProcess.start().  I think the cyclic references
between linuxContainer and initProcess are unfortunate, but didn't
want to address that here.

I've also left the timing of the Prestart hooks alone, although the
spec calls for them to happen before start (not as part of creation)
[1,2].  Once the timing gets fixed we can drop the
initProcessStartTime hacks which initProcess.start currently needs.

I'm not sure why we trigger the prestart hooks in response to both
procReady and procHooks.  But we've had two prestart rounds in
initProcess.start since 2f276498 (Move pre-start hooks after container
mounts, 2016-02-17, #568).  I've left that alone too.

I really think we should have len() guards to avoid computing the
state when .Hooks is non-nil but the particular phase we're looking at
is empty.  Aleksa, however, is adamantly against them [3] citing a
risk of sloppy copy/pastes causing the hook slice being len-guarded to
diverge from the hook slice being iterated over within the guard.  I
think that ort of thing is very lo-risk, because:

* We shouldn't be copy/pasting this, right?  DRY for the win :).
* There's only ever a few lines between the guard and the guarded
  loop.  That makes broken copy/pastes easy to catch in review.
* We should have test coverage for these.  Guarding with the wrong
  slice is certainly not the only thing you can break with a sloppy
  copy/paste.

But I'm not a maintainer ;).

[1]: https://github.com/opencontainers/runtime-spec/blob/v1.0.0/config.md#prestart
[2]: https://github.com/opencontainers/runc/issues/1710
[3]: https://github.com/opencontainers/runc/pull/1741#discussion_r233331570

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-11-14 06:49:49 -08:00
Mrunal Patel 4769cdf607
Merge pull request #1916 from crosbymichael/cgns
Add support for cgroup namespace
2018-11-13 12:21:38 -08:00
Mrunal Patel f000fe11ec
Merge pull request #1917 from slp/master
libcontainer: map PidsLimit to systemd's TasksMax property
2018-11-13 12:21:23 -08:00
Michael Crosby aa7917b751
Merge pull request #1911 from theSuess/linter-fixes
Various cleanups to address linter issues
2018-11-13 12:13:34 -05:00
Michael Crosby bd420b59f1
Merge pull request #1925 from Ace-Tang/fix_dup_ns
test: fix TestDupNamespaces fail to test dup-ns error
2018-11-13 12:11:11 -05:00
Xiaochen Shen 95af9eff82 libcontainer: intelrdt: add support for Intel RDT/MBA Software Controller in runc
MBA Software Controller feature is introduced in Linux kernel v4.18.
It is a software enhancement to mitigate some limitations in MBA which
describes in kernel documentation. It also makes the interface more user
friendly - we could specify memory bandwidth in "MBps" (Mega Bytes per
second) as well as in "percentages".

The kernel underneath would use a software feedback mechanism or a
"Software Controller" which reads the actual bandwidth using MBM
counters and adjust the memory bandwidth percentages to ensure:
"actual memory bandwidth < user specified memory bandwidth".

We could enable this feature through mount option "-o mba_MBps":
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl

In runc, we handle both memory bandwidth schemata in unified format:
"MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
The unit of memory bandwidth is specified in "percentages" by default,
and in "MBps" if MBA Software Controller is enabled.

For more information about Intel RDT and MBA Software Controller:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-11-13 23:27:08 +08:00
Ace-Tang 16d55f17a8 libcontainer: fix potential panic if spec.Process is nil
for the code logic, pointer 'spec.Process' should be judge first
to avoid panic.

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-06 11:55:30 +08:00
Ace-Tang 95d1aa1886 test: fix TestDupNamespaces
add Root in created spec, or error message is 'Root must be specified'

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-06 11:36:27 +08:00
Michael Crosby b1068fb925
Merge pull request #1814 from rhatdan/selinux
SELinux labels are tied to the thread
2018-11-05 10:00:11 -05:00
Aleksa Sarai 9f1e94488e
merge branch 'pr-1921'
libcontainer: ability to compile without kmem

LGTMs: @mrunalp @cyphar
Closes #1921
2018-11-02 09:54:16 +11:00
Michael Crosby 9e5aa7494d
Merge pull request #1918 from giuseppe/skip-setgroups
rootless: fix running with /proc/self/setgroups set to deny
2018-11-01 13:16:47 -04:00
Kir Kolyshkin 6a2c155968 libcontainer: ability to compile without kmem
Commit fe898e7862 (PR #1350) enables kernel memory accounting
for all cgroups created by libcontainer -- even if kmem limit is
not configured.

Kernel memory accounting is known to be broken in some kernels,
specifically the ones from RHEL7 (including RHEL 7.5). Those
kernels do not support kernel memory reclaim, and are prone to
oopses. Unconditionally enabling kmem acct on such kernels lead
to bugs, such as

* https://github.com/opencontainers/runc/issues/1725
* https://github.com/kubernetes/kubernetes/issues/61937
* https://github.com/moby/moby/issues/29638

This commit gives a way to compile runc without kernel memory setting
support. To do so, use something like

	make BUILDTAGS="seccomp nokmem"

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-10-31 20:35:51 -07:00
Yuanhong Peng df3fa115f9 Add support for cgroup namespace
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-31 10:51:43 -04:00
Chris Aniszczyk f3ce8221ea
Merge pull request #1913 from xiaochenshen/rdt-add-diagnostics
libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors
2018-10-25 14:27:17 -05:00
Giuseppe Scrivano 869add3318
rootless: fix running with /proc/self/setgroups set to deny
This is a regression from 06f789cf26
when the user namespace was configured without a privileged helper.
To allow a single mapping in an user namespace, it is necessary to set
/proc/self/setgroups to "deny".

For a simple reproducer, the user namespace can be created with
"unshare -r".

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-10-25 15:44:15 +02:00
Sergio Lopez 5c6b9c3c1c libcontainer: map PidsLimit to systemd's TasksMax property
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.

This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):

 # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 111
 # systemctl disable --now postfix
 # systemctl enable --now postfix
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 max

This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2018-10-24 17:20:27 +02:00
Aleksa Sarai e93996674f
merge branch 'pr-1903'
clarify license information

LGTMs: @hqhq @cyphar
Closes #1903
2018-10-24 22:03:44 +11:00
Aleksa Sarai 9a3a8a5ebf libcontainer: implement CLONE_NEWCGROUP
This is a very simple implementation because it doesn't require any
configuration unlike the other namespaces, and in its current state it
only masks paths.

This feature is available in Linux 4.6+ and is enabled by default for
kernels compiled with CONFIG_CGROUP=y.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-23 16:23:00 -04:00
Xiaochen Shen 6c307f8ff2 libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors
Linux kernel v4.15 introduces better diagnostics for Intel RDT operation
errors. If any error returns when making new directories or writing to
any of the control file in resctrl filesystem, reading file
/sys/fs/resctrl/info/last_cmd_status could provide more information that
can be conveyed in the error returns from file operations.

Some examples:
  echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata
  -bash: echo: write error: Invalid argument
  cat /sys/fs/resctrl/info/last_cmd_status
  mask f3 has non-consecutive 1-bits

  echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata
  -bash: echo: write error: Invalid argument
  cat /sys/fs/resctrl/info/last_cmd_status
  MB value 0 out of range [10,100]

  cd /sys/fs/resctrl
  mkdir 1 2 3 4 5 6 7 8
  mkdir: cannot create directory '8': No space left on device
  cat /sys/fs/resctrl/info/last_cmd_status
  out of CLOSIDs

See 'last_cmd_status' for more details in kernel documentation:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

In runc, we could append the diagnostics information to the error
message of Intel RDT operation errors to provide more user-friendly
information.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-19 00:16:08 +08:00
Mrunal Patel c2ab1e656e
Merge pull request #1910 from adrianreber/tip
Fix travis Go: tip
2018-10-17 12:47:08 -07:00
Michael Crosby 58592df567
Merge pull request #1880 from AkihiroSuda/fix-subgid
libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs
2018-10-16 15:21:51 -04:00
Xiaochen Shen d59b17d6d5 libcontainer: intelrdt: Add more check if sub-features are enabled
Double check if Intel RDT sub-features are available in "resource
control" filesystem. Intel RDT sub-features can be selectively disabled
or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and
newer kernel.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:44 +08:00
Xiaochen Shen f097339289 libcontainer: intelrdt: add test cases for Intel RDT/MBA
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:39 +08:00
Xiaochen Shen 27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Xiaochen Shen c1cece7e23 libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:28:19 +08:00
Mrunal Patel a00bf01908
Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Dominik Süß 0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Adrian Reber 0d01164756 Fix travis Go: tip
This fixes

 libcontainer/container_linux.go:1200: Error call has possible formatting directive %s

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-10-13 10:44:07 +00:00
Aleksa Sarai e40d4635c4
merge branch 'pr-1894'
Move spec.Linux.IntelRdt check to spec.Linux != nil block

LGTMs: @crosbymichael @cyphar
Closes #1894
2018-10-09 02:41:13 +11:00
Jonathan Marler 1499c746a1 Move spec.Linux.IntelRdt check to spec.Linux != nil block
Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>
2018-10-04 21:30:55 -06:00
Mike Brown 26bdc0dce7 clarify license information
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2018-10-03 10:39:44 -05:00
Mrunal Patel 2abd837c8c
Merge pull request #1893 from cyphar/keyctl-ignore-enosys
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
2018-09-25 13:35:16 -07:00
Danail Branekov a1d5398afa Respect container's cgroup path
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts

Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
2018-09-25 17:43:36 +01:00
Aleksa Sarai 578fe65e4f
merge branch 'pr-1817'
Fix duplicate entries and missing entries in getCgroupMountsHelper
  Add test for testing cgroup mounts on bedrock linux
  Stop relying on number of subsystems for cgroups

LGTMs: @crosbymichael @cyphar
Closes #1817
2018-09-19 19:48:17 +10:00
Michael Crosby cc8146cf93
Merge pull request #1858 from marcov/nsenter-README
Update outdated nsenter README content
2018-09-17 10:53:19 -04:00
Michael Crosby d77251d5fc
Merge pull request #1892 from Ace-Tang/add_clean_test
test: add more test case for CleanPath
2018-09-17 10:51:17 -04:00
Aleksa Sarai 40f1468413
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:

  * We already have a flag that disables the use of session keyrings
    (for older kernels that had system-wide keyring limits and so
    on). So disabling it is not a new idea.

  * While the primary justification of using session keys *is*
    security-based, it's more of a security-by-obscurity protection.
    The main defense keyrings have is VFS credentials -- which is
    something that users already have better security tools for
    (setuid(2) and user namespaces).

  * Given the security justification you might argue that we
    shouldn't silently ignore this. However, the only way for the
    kernel to return -ENOSYS is either being ridiculously old (at
    which point we wouldn't work anyway) or that there is a seccomp
    profile in place blocking it.

    Given that the seccomp profile (if malicious) could very easily
    just return 0 or a silly return code (or something even more
    clever with seccomp-bpf) and trick us without this patch, there
    isn't much of a significant change in how much seccomp can trick
    us with or without this patch.

Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.

Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-09-17 21:38:30 +10:00
Ace-Tang 5963cf2afc test: add more test case for CleanPath
Signed-off-by: Ace-Tang <aceapril@126.com>
2018-09-14 21:37:12 +08:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Yan Zhu feb90346e0 doc: fix typo
Signed-off-by: Yan Zhu <yanzhu@alauda.io>
2018-09-07 11:58:59 +08:00
Michael Crosby 70ca035aa6
Merge pull request #1883 from lifubang/containeridinpath
fix delete other file bug when container id is ..
2018-09-05 13:43:21 -04:00
Mrunal Patel 9cda583235
Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot
linux: drop check for /proc as invalid dest
2018-09-04 09:26:21 -07:00
Lifubang 4eb30fcdbe code optimization: use securejoin.SecureJoin and CleanPath
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-09-04 09:02:18 +08:00
Lifubang 4fae8fcce2 code optimization after review
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-09-03 23:27:31 +08:00
Lifubang d2d226e8f9 fix unexpected delete bug when container id is ..
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-08-31 11:17:42 +08:00
ChangFeng 3ce8fac7c4 libcontainer: add /proc/loadavg to the white list of bind mount
Signed-off-by: JunLi <lijun.git@gmail.com>
2018-08-30 21:30:23 +08:00
Giuseppe Scrivano 636b664027
linux: drop check for /proc as invalid dest
it is now allowed to bind mount /proc.  This is useful for rootless
containers when the PID namespace is shared with the host.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-08-30 09:56:18 +02:00
Akihiro Suda b34d6d8a7c libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs
subgid is defined per user, not group (see subgid(5))

This commit also adds support for specifying subuid owner with a numeric UID.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-08-29 07:46:03 +09:00
Michael Crosby 1555a78945
Merge pull request #1874 from mrunalp/drop_unused_code
Remove unused veth setup code
2018-08-27 11:07:25 -04:00
Qiang Huang 0228707b77
Merge pull request #1873 from rhatdan/ms_move
When doing a copyup, /tmp can not be a shared mount point
2018-08-27 10:08:53 +08:00
Mrunal Patel fe3d5c4c6e Remove unused veth setup code
Networking is setup by plugins for users of runc so it makes sense
to get rid of the veth strategy.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-08-24 15:41:52 -07:00
Adrian Reber fa43a72aba
criu: restore into existing namespace when specified
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.

If the network namespace is defined like

	{
		"type": "network",
		"path": "/run/netns/test"
	}

there is the expectation that the restored container is again running in
the network namespace specified with 'path'.

This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.

This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.

With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.

Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.

runc still falls back to the old behavior if CRIU older than 3.11 is
installed.

Fixes #1786

Related to https://github.com/projectatomic/libpod/pull/469

Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-08-22 23:27:20 +02:00
Daniel J Walsh 62a4763a7a
When doing a copyup, /tmp can not be a shared mount point
MOVE_MOUNT will fail under certain situations.

You are not allowed to MS_MOVE if the parent directory is shared.

man mount
...
   The move operation
       Move a mounted tree to another place (atomically).  The call is:

              mount --move olddir newdir

       This  will cause the contents which previously appeared under olddir to
       now be accessible under newdir.  The physical location of the files  is
       not changed.  Note that olddir has to be a mountpoint.

       Note  also that moving a mount residing under a shared mount is invalid
       and unsupported.  Use findmnt -o TARGET,PROPAGATION to see the  current
       propagation flags.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-08-20 17:41:06 -04:00
Aleksa Sarai 20aff4f048
merge branch 'pr-1867'
Revert "libcontainer/rootfs_linux: minor cleanup"

LGTMs: @hqhq @cyphar
Closes #1867
2018-08-15 15:42:56 +10:00
Mrunal Patel 26ec8a9783 Revert "libcontainer/rootfs_linux: minor cleanup"
This reverts commit 1b27db67f1.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-08-14 15:50:18 -07:00
Marco Vedovati 34ed62697b Update outdated nsenter README content
Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2018-08-07 17:53:56 +02:00
Michael Crosby 4056a41f58
Merge pull request #1830 from crosbymichael/procs
Pass GOMAXPROCS to init processes
2018-08-01 10:48:06 -04:00
Jay Kamat a2faaa1317
Fix duplicate entries and missing entries in getCgroupMountsHelper
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
2018-07-31 20:12:18 -07:00
Alban Crequy 3321aa1af7 Fix regression with mounts with non-absolute source path
PR #1753 introduced a test on the mount flags but the binary operator
was wrong, see https://github.com/opencontainers/runc/pull/1753#discussion_r203445652

This was noticed when investigating https://github.com/opencontainers/runtime-tools/issues/651

Symptoms: in the container, /proc/self/mountinfo displays some mounts as
follow:

296 279 0:67 / /tmp rw,nosuid - tmpfs /home/dpark/go/src/github.com/opencontainers/runc/tmpfs rw,size=65536k,mode=755

Signed-off-by: Alban Crequy <alban@kinvolk.io>
2018-07-18 18:30:49 +02:00
Michael Crosby 53fddb540a Pass GOMAXPROCS to init processes
This will help runc's init to not spawn many threads on large systems when
launched with max procs by the caller.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-06-26 11:23:37 -04:00
Michael Crosby 2c632d1a2d
Merge pull request #1824 from cyphar/fix-mips-build-devNumber
libcontainer: devices: fix mips builds
2018-06-25 13:21:28 -04:00
Jay Kamat e5a7c61f3c Add test for testing cgroup mounts on bedrock linux
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests

Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:01:07 +01:00
Daniel Dao 5ee0648bfb Stop relying on number of subsystems for cgroups
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.

Without the 'all' flag, ignore duplicate subsystems after the first.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:00:58 +01:00
Aleksa Sarai 823c06eae9
libcontainer: improve "kernel.{domainname,hostname}" sysctl handling
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-06-18 21:48:04 +10:00
Aleksa Sarai a0e99e7a1a
libcontainer: devices: fix mips builds
It turns out that MIPS uses uint32 in the device number returned by
stat(2), so explicitly wrap everything to make the compiler happy. I
really wish that Go had C-like numeric type promotion.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-06-17 11:22:01 +10:00
Mrunal Patel ad0f525506
Merge pull request #1819 from tiborvass/fix-arm32bit
libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)
2018-06-15 07:06:50 -07:00
Tibor Vass c205e9fb64 libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```

Signed-off-by: Tibor Vass <tibor@docker.com>
2018-06-14 18:33:14 +00:00
Giuseppe Scrivano cbcc85d311
runc: not require uid/gid mappings if euid()==0
When running in a new unserNS as root, don't require a mapping to be
present in the configuration file.  We are already skipping the test
for a new userns to be present.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-06-12 12:45:54 +02:00
Daniel J Walsh aa3fee6c80
SELinux labels are tied to the thread
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.

Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.

It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-06-11 08:34:58 -04:00
Aleksa Sarai dd56ece823
merge branch 'pr-1812'
Fix race in runc exec

LGTMs: @dqminh @cyphar
Closes #1812
2018-06-04 19:02:33 +10:00
Daniel, Dao Quang Minh 2e91544060
Merge pull request #1806 from cyphar/cgroup-ignorable-error-fixup
cgroup: clean up isIgnorableError for skippable EROFS
2018-06-02 23:57:02 +01:00
Mrunal Patel bd3c4f844a Fix race in runc exec
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.

This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-06-01 16:25:58 -07:00
Michael Crosby 0e561642f8
Merge pull request #1688 from AkihiroSuda/unshare-m-r
main: support rootless mode in userns
2018-05-29 15:41:17 -04:00
Aleksa Sarai 939d5a3753
cgroup: clean up isIgnorableError for skippable EROFS
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-05-25 11:31:41 +10:00
Qiang Huang dd67ab10d7
Merge pull request #1759 from cyphar/rootless-erofs-as-eperm
rootless: cgroup: treat EROFS as a skippable error
2018-05-25 09:24:16 +08:00
Daniel, Dao Quang Minh 2e931185f9
Merge pull request #1805 from derekwaynecarr/systemd-cpuquota-fix
fix systemd cpu quota for -1
2018-05-24 11:24:27 +01:00
Akihiro Suda c93815738a libcontainer: remove extra CAP_SETGID check for SetgroupAttr
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-24 14:59:30 +09:00
Derek Carr b515963c10 systemd cpu quota ignores -1
Signed-off-by: Derek Carr <decarr@redhat.com>
2018-05-23 14:28:39 -04:00
Michael Crosby fd0febd3ce Wrap error messages during init
Fixes #1437

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-05-10 10:28:10 -04:00
Akihiro Suda f103de57ec main: support rootless mode in userns
Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.

Usage: (Note that `unshare --mount` requires `--map-root-user`)

  user$ mkdir lower upper work rootfs
  user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" )
  user$ unshare --mount --map-root-user
  mappedroot# runc spec --rootless
  mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json
  mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs
  mappedroot# runc run foo

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Akihiro Suda 9c7d8bc1fd libcontainer: add parser for /etc/sub{u,g}id and /proc/PID/{u,g}id_map
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Mrunal Patel 0cbfd8392f
Merge pull request #1562 from cyphar/carry-975-959-ipc-uid-namespaces
nsenter: improve namespace creation and SELinux IPC handling
2018-04-26 14:12:33 -07:00
Mrunal Patel 871ba2e58e
Merge pull request #1781 from filbranden/systemd3
Make channel for StartTransientUnit buffered
2018-04-24 11:56:34 -07:00
Michael Crosby bdbb9fab07
Merge pull request #1693 from AkihiroSuda/leave-setgroups-allow
libcontainer: allow setgroup in rootless mode
2018-04-24 11:24:04 -04:00
Michael Crosby 1f11dc5dba
Merge pull request #1785 from dlorenc/seccomp
Make the setupSeccomp function public.
2018-04-19 16:00:54 -04:00
Mrunal Patel 63e6708c74
Merge pull request #1784 from pierrchen/master
libcontainer/rootfs_linux: minor cleanup
2018-04-17 17:02:10 -07:00
dlorenc 40680b2d37 Make the setupSeccomp function public.
This function is useful for converting from the OCI spec format to the one used by runC/libcontainer.

Signed-off-by: dlorenc <lorenc.d@gmail.com>
2018-04-17 10:47:22 -07:00
Michael Crosby d56f6cc202
Merge pull request #1753 from wking/do-not-require-bind-mount-type
libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts
2018-04-16 11:01:53 -04:00
Bin Chen 1b27db67f1 libcontainer/rootfs_linux: minor cleanup
move variable close to where is used

Signed-off-by: Bin Chen <nk@devicu.com>
2018-04-16 22:25:48 +10:00
Filipe Brandenburger 165ee45334 Make channel for StartTransientUnit buffered
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.

This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-14 08:49:50 -07:00
Michael Crosby f753f300ae
Merge pull request #1779 from runcom/gcc8-fix
nsexec.c: fix GCC 8 warning
2018-04-12 12:13:43 -04:00
Michael Crosby 9f0eca2a94
Merge pull request #1777 from nalind/no-config-for-extant-netns
Only configure networking when creating a net ns
2018-04-12 10:55:02 -04:00
Antonio Murdaca 1a5064622c
nsexec.c: fix GCC 8 warning
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-04-12 12:25:06 +02:00
Nalin Dahyabhai 4521d4b19c Only configure networking when creating a net ns
When joining an existing namespace, don't default to configuring a
loopback interface in that namespace.

Its creator should have done that, and we don't want to fail to create
the container when we don't have sufficient privileges to configure the
network namespace.

Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
2018-04-11 13:28:19 -04:00
Filipe Brandenburger 0e16bd9b53 Detect whether Delegate is available on both slices and scopes
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.

Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-10 11:42:55 -07:00
Filipe Brandenburger 8ab251f298 Fix systemd.Apply() to check for DBus error before waiting on a channel.
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)

Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-09 11:51:59 -07:00
Sebastien Boeuf 985628dda0 libcontainer: Don't set container state to running when exec'ing
There is no reason to set the container state to "running" as a
temporary value when exec'ing a process on a container in "created"
state. The problem doing this is that consumers of the libcontainer
library might use it by keeping pointers in memory. In this case,
the container state will indicate that the container is running, which
is wrong, and this will end up with a failure on the next action
because the check for the container state transition will complain.

Fixes #1767

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2018-03-30 09:29:18 -07:00
Akihiro Suda 73f3dc6389 libcontainer: allow setgroup in rootless mode
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-27 17:42:05 +09:00
Akihiro Suda ed58366cc8 libcontainer: fix Boolmsg alignment
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-26 14:44:03 +09:00
Tamal Saha 58415b4b12 Fix error message
Signed-off-by: Tamal Saha <tamal@appscode.com>
2018-03-21 20:52:09 -07:00
Aleksa Sarai fd3a6e6c83
libcontainer: handle unset oomScoreAdj corectly
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:

> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.

Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Aleksa Sarai 03e585985f
rootless: cgroup: treat EROFS as a skippable error
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.

An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.

Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Daniel J Walsh 43aea05946 Label the masked tmpfs with the mount label
Currently if a confined container process tries to list these directories
AVC's are generated because they are labeled with external labels.  Adding
the mountlabel will remove these AVC's.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-03-09 14:29:06 -05:00
Qiang Huang 9facb87f87
Merge pull request #1754 from vikaschoudhary16/add-timeout
Add timeout while waiting for StartTransinetUnit completion signal
2018-03-08 09:09:34 +08:00
W. Trevor King 0aa6e4e5d3 libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts
From the "Creating a bind mount" section of mount(2) [1]:

> If mountflags includes MS_BIND (available since Linux 2.4), then
> perform a bind mount...
>
> The filesystemtype and data arguments are ignored.

This commit adds support for configurations that leave the OPTIONAL
type [2] unset for bind mounts.  There's a related spec-example change
in flight with [3], although my personal preference would be a more
explicit spec for the whole mount structure [4].

[1]: http://man7.org/linux/man-pages/man2/mount.2.html
[2]: https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config.md#L102
[3]: https://github.com/opencontainers/runtime-spec/pull/954
[4]: https://github.com/opencontainers/runtime-spec/pull/771

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-03-07 10:23:42 -08:00
vikaschoudhary16 04e95b526d Add timeout while waiting for StartTransinetUnit completion signal from dbus
Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-03-07 05:11:38 -05:00
Denys Smirnov 3d26fc3fd7 cgroups/fs: fix NPE on Destroy than no cgroups are set
Currently Manager accepts nil cgroups when calling Apply, but it will panic then trying to call Destroy with the same config.

Signed-off-by: Denys Smirnov <denys@sourced.tech>
2018-03-06 23:31:31 +01:00
Vincent Batts bf74951617
libcontainer/user: platform dependent calls
This rearranges a bit of the user and group lookup, such that only a
basic subset is exposed.

Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
2018-02-28 14:14:24 -05:00
Aleksa Sarai 757e78bebd
merge branch 'pr-1743'
The setupUserNamespace function is always called.

LGTMs: @crosbymichael @mrunalp @cyphar
Closes #1743
2018-02-27 12:22:52 +11:00
Michael Crosby 8aca07289d
Merge pull request #1736 from allencloud/fix-lint-warning
fix lint error in specconv
2018-02-26 14:21:26 -05:00
ynirk 2420eb1f4d The setupUserNamespace function is always called.
The function is called even if the usernamespace is not set.
This results having wrong uid/gid set on devices.

This fix add a test to check if usernamespace is set befor calling
setupUserNamespace.

Fixes #1742

Signed-off-by: Julien Lavesque <julien.lavesque@gmail.com>
2018-02-26 14:27:11 +01:00
Allen Sun 3f32e72963 fix lint error in specconv
Signed-off-by: Allen Sun <allensun.shl@alibaba-inc.com>
2018-02-26 15:39:54 +08:00
W. Trevor King d71b3f5344 libcontainer/sync: Drop procConsole transaction from comments
These were added in 244c9fc4 (*: console rewrite, 2016-06-04, #1018)
alongside procConsole and the associated handling.  procConsole and
that handling were removed in 00a0ecf5 (Add separate console socket,
2017-03-02, #1356), but 00a0ecf5 missed this comment.

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-02-23 15:03:56 -08:00
Michael Crosby 595bea022f
Merge pull request #1722 from ravisantoshgudimetla/fix-systemd-path
fix systemd slice expansion so that it could be consumed by cAdvisor
2018-02-20 09:59:24 -05:00
W. Trevor King 50dc7ee96c libcontainer/capabilities_linux: Drop os.Getpid() call
gocapability has supported 0 as "the current PID" since
syndtr/gocapability@5e7cce49 (Allow to use the zero value for pid to
operate with the current task, 2015-01-15, syndtr/gocapability#2).
libcontainer was ported to that approach in 444cc298 (namespaces:
allow to use pid namespace without mount namespace, 2015-01-27,
docker/libcontainer#358), but the change was clobbered by 22df5551
(Merge branch 'master' into api, 2015-02-19, docker/libcontainer#388)
which landed via 5b73860e (Merge pull request #388 from docker/api,
2015-02-19, docker/libcontainer#388).  This commit restores the
changes from 444cc298.

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-02-19 15:47:42 -08:00
ravisantoshgudimetla 7019e1de7b fix systemd slice expansion so that it could be consumed by cAdvisor
Signed-off-by: ravisantoshgudimetla <ravisantoshgudimetla@gmail.com>
2018-02-18 21:32:39 -05:00
Mrunal Patel 6e15bc3f92
Merge pull request #1702 from crosbymichael/chroot
chroot when no mount namespaces is provided
2018-02-07 10:09:35 -08:00
W. Trevor King be16b13645 libcontainer/state_linux_test: Add a testTransitions helper
The helper DRYs up the transition tests and makes it easy to get
complete coverage for invalid transitions.

I'm also using t.Run() for subtests.  Run() is new in Go 1.7 [1], but
runc dropped support for 1.6 back in e773f96b (update go version at
travis-ci, 2017-02-20, #1335).

[1]: https://blog.golang.org/subtests

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-01-25 11:18:45 -08:00
Michael Crosby 91ca331474 chroot when no mount namespaces is provided
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-01-25 11:36:37 -05:00
Michael Crosby c4e4bb0df2
Merge pull request #1699 from AkihiroSuda/indent-c
make: validate C format
2018-01-25 10:09:09 -05:00
Aleksa Sarai 5a46c2ba8b
nsenter: move namespace creation after userns creation
Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace *before*
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC *after* the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-01-25 23:56:49 +11:00
Akihiro Suda dd5eb3b9e3 make: validate C format
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-01-24 10:49:50 +09:00
Ed King 5c0af14bf8 Return from goroutine when it should terminate
Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-23 10:46:31 +00:00
Will Martin 8d3e6c9826 Avoid race when opening exec fifo
When starting a container with `runc start` or `runc run`, the stub
process (runc[2:INIT]) opens a fifo for writing. Its parent runc process
will open the same fifo for reading. In this way, they synchronize.

If the stub process exits at the wrong time, the parent runc process
will block forever.

This can happen when racing 2 runc operations against each other: `runc
run/start`, and `runc delete`. It could also happen for other reasons,
e.g. the kernel's OOM killer may select the stub process.

This commit resolves this race by racing the opening of the exec fifo
from the runc parent process against the stub process exiting. If the
stub process exits before we open the fifo, we return an error.

Another solution is to wait on the stub process. However, it seems it
would require more refactoring to avoid calling wait multiple times on
the same process, which is an error.

Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-22 17:03:02 +00:00
Antonio Murdaca cd1e7abee2
libcontainer: expose annotations in hooks
Annotations weren't passed to hooks. This patch fixes that by passing
annotations to stdin for hooks.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-01-11 16:54:01 +01:00
vikaschoudhary16 d5b4a3eddb Fix race against systemd
- T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298)
- T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner.
- T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code

Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-01-08 09:37:26 -05:00
Mrunal Patel e6516b3d5d
Merge pull request #1678 from sboeuf/sboeuf/subreaper
libcontainer: Do not wait for signalled processes if subreaper is set
2017-12-15 08:47:07 -08:00
Michael Crosby 7f24b40cc5
Merge pull request #1675 from tklauser/apparmor-no-cgo
RFC: libcontainer: remove dependency on libapparmor
2017-12-15 11:23:35 -05:00
Tobias Klauser db093f621f libcontainer: remove dependency on libapparmor
libapparmor is integrated in libcontainer using cgo but is only used to
call a single function: aa_change_onexec. It turns out this function is
simple enough (writing a string to a file in /proc/<n>/attr/...) to be
re-implemented locally in libcontainer in plain Go.

This allows to drop the dependency on libapparmor and the corresponding
cgo integration.

Fixes #1674

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-12-15 09:59:58 +01:00
Sebastien Boeuf bb912eb00c libcontainer: Do not wait for signalled processes if subreaper is set
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. That's the reason why this patch is giving
the possibility to define the usage of a subreaper as a consumer of
libcontainer. Relying on this information, libcontainer will not
wait for signalled processes in case a subreaper has been set.

Fixes #1677

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-12-14 10:37:38 -08:00
Mrunal Patel c6e4a1ebeb
Merge pull request #1665 from Mashimiao/gidmapping-valid-fix
specconv: avoid skipping gidmappings applied when uidmappings is empty
2017-12-11 09:50:54 -08:00
Mrunal Patel b028413c35
Merge pull request #1655 from Mashimiao/add-propagation-more
support unbindable,runbindable for rootfs propagation
2017-12-11 09:21:41 -08:00
Allen Sun fec6b0fea5 Update criu_opts_linux.go
Signed-off-by: Allen Sun <shlallen1990@gmail.com>
2017-12-05 15:16:26 +08:00
Michael Crosby 91e9795013
Merge pull request #1654 from dqminh/only-linux
remove placeholder for non-linux platforms
2017-11-30 09:51:47 -05:00
Ma Shimiao 57edfbbaf2 specconv: avoid skipping gidmappings applied when uidmappings is empty
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-11-30 16:24:36 +08:00
Aleksa Sarai e8149af291
merge branch 'pr-1661'
Ensure container tests do not write on the host

LGTMs: @hqhq @cyphar
Closes #1661
2017-11-27 20:10:48 +11:00
Danail Branekov 0495fece57 Ensure container tests do not write on the host
TestGetContainerStateAfterUpdate creates its state.json file on the current
directory which turns out to be the host runc directory. Thus whenever
the test completes it leaves the state.json file behind thus
a) poluting the local git repository
b) changing the host file system violating the principle of doing
everything in an isolated container environment

This change would create a new temporary (in-container) directory and use it as
linuxContainer.root

Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
2017-11-27 10:43:10 +02:00
Daniel Dao 8898b6b446 remove placeholder for non-linux platforms
runc currently only support Linux platform, and since we dont intend to expose
the support to other platform, removing all other platforms placeholder code.

`libcontainer/configs` still being used in
https://github.com/moby/moby/blob/master/daemon/daemon_windows.go so
keeping it for now.

After this, we probably should also rename files to drop linux suffices
if possible.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2017-11-24 18:14:51 +00:00
Daniel, Dao Quang Minh fb871d9cd0
Merge pull request #1664 from tklauser/drop-freebsd
libcontainer: drop FreeBSD support
2017-11-24 18:08:21 +00:00
Tobias Klauser 4d27f20db0 libcontainer: drop FreeBSD support
runc is not supported on FreeBSD, so remove all FreeBSD specific bits.

As suggested by @crosbymichael in #1653

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-11-24 14:51:05 +01:00
Danail Branekov 38d1e6ec27 Delete xattr related code
Selinux related code has been moved to the selinux package
(https://github.com/opencontainers/selinux) and therefore xattr related
code can be deleted from libcontainer

Signed-off-by: Danail Branekov <danailster@gmail.com>
2017-11-21 12:49:28 +02:00
Ma Shimiao 17db6560be support unbindable,runbindable for rootfs propagation
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-11-17 16:14:15 +08:00
Seth Jennings bca53e7b49 systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling
Signed-off-by: Seth Jennings <sjenning@redhat.com>
2017-11-15 20:20:06 -06:00
Vincent Demeester 3ca4c78b1a
Import docker/docker/pkg/mount into runc
This will help get rid of docker/docker dependency in runc 👼

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-11-08 16:25:58 +01:00
Michael Crosby 2f010ecf19
Merge pull request #1622 from vdemeester/import-symlink-from-docker
Remove pkg/symlink from docker/docker and use cyphar/filepath-securejoin
2017-11-08 10:07:00 -05:00
Akihiro Suda 0aac2368e4 specconv.Example(): add /proc/scsi to masked paths
Port over https://github.com/moby/moby/pull/35399

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-11-04 17:38:14 +00:00
Michael Crosby 0232e38342
Merge pull request #1629 from masters-of-cats/busybox-inflation
Avoid disk usage explosion when copying busybox
2017-11-01 09:15:22 -04:00
Danail Branekov fdbb9e3e55 Avoid disk usage explosion when copying busybox
When running runc tests with temp directory with size 500M copying
busybox without preserving hardlinks causes the folder to inflate to
roughly 330M. Copying busybox twice in certain tests causes the /tmp
directory to overfill. Using `-a` preserves links which busybox uses to
implement its choice of binary to run.

Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
2017-11-01 09:52:05 +00:00
Vincent Demeester 594501475e
Use cyphar/filepath-securejoin instead of docker pkg/symlink
runc shouldn't depend on docker and be more self-contained.
Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore

Signed-off-by: Vincent Demeester <vincent@sbr.pm>
2017-10-31 16:53:45 +01:00
Lorenzo Fontana 780f8ef567
Specconv: Test create command hooks and seccomp setup
Signed-off-by: Lorenzo Fontana <lo@linux.com>
2017-10-28 21:46:46 +02:00
Mrunal Patel 9a1186d128 Merge pull request #1619 from fntlnz/spec-linux-testing
WIP: Better testsuite for specconv
2017-10-25 15:23:19 -07:00
Lorenzo Fontana c0e6e12f9d
Test Cgroup creation and memory allocations
Signed-off-by: Lorenzo Fontana <lo@linux.com>
2017-10-25 01:58:10 +02:00
Aleksa Sarai ff5075c33f
init: correctly handle unmapped stdio with multiple mappings
Previously we would handle the "unmapped stdio" case by just doing a
simple check, however this didn't handle cases where the overflow_uid
was actually mapped in the user namespace. Instead of doing some
userspace checks, just try to do the fchown(2) and ignore EINVAL
(unmapped) or EPERM (lacking privilege over inode) errors.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-10-25 00:12:21 +11:00
Qiang Huang 74a1729647 Merge pull request #1607 from crosbymichael/term-err
libcontainer: handler errors from terminate
2017-10-20 15:15:38 +08:00
Qiang Huang e8b9b92f57 Merge pull request #1206 from YuPengZTE/devMD026
trailing punctuation in header
2017-10-20 14:47:09 +08:00
Mrunal Patel 80ee9e50b5 Merge pull request #1616 from mheon/seccomp_fix_breakage
Fix breaking change in Seccomp profile behavior
2017-10-19 14:15:04 -07:00
Aleksa Sarai c05f6368af
merge branch 'pr-1615'
libcontainer: intelrdt: fix a GetStats() issue

LGTMs: @crosbymichael @cyphar
Closes #1615
2017-10-19 03:41:16 +11:00
Matthew Heon e9193ba6e6 Fix breaking change in Seccomp profile behavior
Multiple conditions were previously allowed to be placed upon the
same syscall argument. Restore this behavior.

Signed-off-by: Matthew Heon <mheon@redhat.com>
2017-10-18 11:53:56 -04:00
Qiang Huang 3409d5c555 Merge pull request #1606 from cyphar/rootfs-propagation-no-pivot
specconv: emit an error when using MS_PRIVATE with --no-pivot
2017-10-18 09:52:04 +08:00
Xiaochen Shen d89217515b libcontainer: intelrdt: fix a GetStats() issue
This fixes a GetStats() issue introduced in #1590:
If Intel RDT is enabled by hardware and kernel, but intelRdt is not
specified in original config, GetStats() will return error unexpectedly
because we haven't called Apply() to create intelrdt group or attach
tasks for this container. As a result, runc events command will have no
output.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-10-17 17:37:07 +08:00
Tobias Klauser 0eed453b21 libcontainer: use Major/Minor from x/sys/unix
The Major and Minor functions were added for Linux in golang/sys@85d1495
which is already vendored in. Use these functions instead of the local
re-implementation.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-10-17 09:06:42 +02:00
Aleksa Sarai 9b13f5cc7f
merge branch 'pr-1453'
propagate argv0 when re-execing from /proc/self/exe

LGTMs: @crosbymichael @cyphar
Closes #1453
2017-10-17 03:12:22 +11:00
Michael Crosby ff4481dbf6 Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups
Support cgroups with limits as rootless
2017-10-16 12:03:49 -04:00
Petros Angelatos 8098828680
propagate argv0 when re-execing from /proc/self/exe
This allows runc to be used as a target for docker's reexec module that
depends on a correct argv0 to select which process entrypoint to invoke.
Without this patch, when runc re-execs argv0 is set to "/proc/self/exe"
and the reexec module doesn't know what to do with it.

Signed-off-by: Petros Angelatos <petrosagg@gmail.com>
2017-10-16 14:00:26 +02:00
Tobias Klauser d2bc081420 libcontainer: merge common syscall implementations
There are essentially two possible implementations for Setuid/Setgid on
Linux, either using SYS_SETUID32/SYS_SETGID32 or SYS_SETUID/SYS_SETGID,
depending on the architecture (see golang/go#1435 for why Setuid/Setgid
aren currently implemented for Linux neither in syscall nor in
golang.org/x/sys/unix).

Reduce duplication by merging the currently implemented variants and
adjusting the build tags accordingly.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-10-16 11:11:18 +02:00
Aleksa Sarai 6d30f7a01b
merge branch 'pr-1424'
Update Travis config to use trusty-backports libseccomp
  Add integration tests for multi-argument Seccomp filters
  Vendor updated libseccomp-golang for bugfix

LGTMs: @crosbymichael @cyphar
Closes #1424
2017-10-16 03:01:37 +11:00
Aleksa Sarai d2ac52fe52
merge branch 'pr-1475'
Add support for mips/mips64
  Put signalMap in a separate file, so it may be arch-specific

LGTMs: @crosbymichael @cyphar
Closes #1475
2017-10-16 02:59:34 +11:00
Aleksa Sarai 2430a98e64
merge branch 'pr-1500'
rootfs: switch ms_private remount of oldroot to ms_slave

LGTMs: @crosbymichael @hqhq
Closes opencontainers/runc#1500
2017-10-14 09:32:59 +11:00
Sebastien Boeuf acb93c9c62 libcontainer: cgroups: Write freezer state after every state check
This commit ensures we write the expected freezer cgroup state after
every state check, in case the state check does not give the expected
result. This can happen when a new task is created and prevents the
whole cgroup to be FROZEN, leaving the state into FREEZING instead.

This patch prevents the case of an infinite loop to happen.

Fixes https://github.com/opencontainers/runc/issues/1609

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-10-12 07:07:28 -07:00
Matthew Heon bbc847a457 Add integration tests for multi-argument Seccomp filters
Signed-off-by: Matthew Heon <mheon@redhat.com>
2017-10-10 15:49:08 -04:00
Michael Crosby bfe3058fc9 Make process check more forgiving
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-10-10 15:36:19 -04:00
Steven Hartland eb68b900bc Prevent invalid errors from terminate
Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate.

Instead of letting these get returned and logged, which causes confusion, silently ignore them.

Currently the test needs to be a string test as the errors are private to the runtime packages, so its our only option.

This can be seen if init fails during the setns.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-10-10 15:32:46 -04:00
Michael Crosby 4693fae411 Merge pull request #1590 from xiaochenshen/rdt-cat-support-update-command
libcontainer: intelrdt: add update command support
2017-10-10 15:25:22 -04:00
Aleksa Sarai d4f0f9a52b
specconv: emit an error when using MS_PRIVATE with --no-pivot
Due to the semantics of chroot(2) when it comes to mount namespaces, it
is not generally safe to use MS_PRIVATE as a mount propgation when using
chroot(2). The reason for this is that this effectively results in a set
of mount references being held by the chroot'd namespace which the
namespace cannot free. pivot_root(2) does not have this issue because
the @old_root can be unmounted by the process.

Ultimately, --no-pivot is not really necessary anymore as a commonly
used option since f8e6b5af5e ("rootfs: make pivot_root not use a
temporary directory") resolved the read-only issue. But if someone
really needs to use it, MS_PRIVATE is never a good idea.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-10-08 17:50:55 +11:00
Michael Crosby f53ad9cec9 Merge pull request #1604 from AkihiroSuda/cwd
libcontainer: create Cwd when it does not exist
2017-10-05 11:15:10 -04:00
Will Martin ca4f427af1 Support cgroups with limits as rootless
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
2017-10-05 11:22:54 +01:00
Akihiro Suda 2edd36fdff libcontainer: create Cwd when it does not exist
The benefit for doing this within runc is that it works well with
userns.
Actually, runc already does the same thing for mount points.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-10-05 05:31:46 +00:00
Konstantinos Karampogias 605dc5c811 Set initial console size based on process spec
Signed-off-by: Will Martin <wmartin@pivotal.io>
Signed-off-by: Petar Petrov <pppepito86@gmail.com>
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com>
Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>
2017-10-04 12:32:16 +01:00
Daniel, Dao Quang Minh 0351df1c5a Merge pull request #1600 from crosbymichael/console
Bump console and sys deps
2017-09-26 10:15:10 +01:00
Michael Crosby f364c1a58c Set ClearONLCR in tests
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-09-25 13:35:22 -04:00
Tobias Klauser d713652bda libcontainer: remove unnecessary type conversions
Generated using github.com/mdempsky/unconvert

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-09-25 10:41:57 +02:00
Qiang Huang 79ad714374 Merge pull request #1598 from euank/ragent
libcontainer: default mount propagation correctly
2017-09-25 11:55:29 +08:00
Euan Kemp 4301b440d6 libcontainer: default mount propagation correctly
The code in prepareRoot (e385f67a0e/libcontainer/rootfs_linux.go (L599-L605))
attempts to default the rootfs mount to `rslave`. However, since the spec
conversion has already defaulted it to `rprivate`, that code doesn't
actually ever do anything.

This changes the spec conversion code to accept "" and treat it as 0.

Implicitly, this makes rootfs propagation default to `rslave`, which is
a part of fixing the moby bug https://github.com/moby/moby/issues/34672

Alternate implementatoins include changing this defaulting to be
`rslave` and removing the defaulting code in prepareRoot, or skipping
the mapping entirely for "", but I think this change is the cleanest of
those options.

Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
2017-09-22 13:36:23 -07:00
Xiaochen Shen 2549545df5 intelrdt: always init IntelRdtManager if Intel RDT is enabled
In current implementation:
Either Intel RDT is not enabled by hardware and kernel, or intelRdt is
not specified in original config, we don't init IntelRdtManager in the
container to handle intelrdt constraint. It is a tradeoff that Intel RDT
has hardware limitation to support only limited number of groups.

This patch makes a minor change to support update command:
Whether or not intelRdt is specified in config, we always init
IntelRdtManager in the container if Intel RDT is enabled. If intelRdt is
not specified in original config, we just don't Apply() to create
intelrdt group or attach tasks for this container.

In update command, we could re-enable through IntelRdtManager.Apply()
and then update intelrdt constraint.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-20 01:37:31 +08:00
Michael Crosby 593914b8bd Merge pull request #1593 from s7v7nislands/drop_go1.5
Drop support golang 1.5
2017-09-12 15:22:00 -04:00
s7v7nislands 00ad8e1e56 Drop support golang 1.5
Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>
2017-09-12 20:56:51 +08:00
Qiang Huang 68e00e906b Merge pull request #1586 from crosbymichael/set-cgroups
Apply cgroups earlier
2017-09-12 12:13:29 +08:00
Yong Tang e9944d0f4c Disable systemd in static build
This fix tries to address the warnings caused by static build
with go 1.9. As systemd needs dlopen/dlclose, the following warnings
will be generated for static build in go 1.9:
```
root@f4b077232050:/go/src/github.com/opencontainers/runc# make static
CGO_ENABLED=1 go build  -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc .
/tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen':
/tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
```

This fix disables systemd when `static_build` flag is on (apply_nosystemd.go
is used instead).

This fix also fixes a small bug in `apply_nosystemd.go` for return value.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2017-09-11 18:38:22 +00:00
Mrunal Patel d5b43c3981 Merge pull request #1455 from dqminh/epoll-io
tty: move IO of master pty to be done with epoll
2017-09-11 11:32:42 -07:00
Aleksa Sarai 1a5fdc1c5f
init: support setting -u with rootless containers
Now that rootless containers have support for multiple uid and gid
mappings, allow --user to work as expected. If the user is not mapped,
an error occurs (as usual).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:33 +10:00
Aleksa Sarai 969bb49cc3
nsenter: do not resolve path in nsexec context
With the addition of our new{uid,gid}map support, we used to call
execvp(3) from inside nsexec. This would mean that the path resolution
for the binaries would happen in nsexec. Move the resolution to the
initial setup code, and pass the absolute path to nsexec.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:33 +10:00
Aleksa Sarai 6097ce74d8
nsenter: correctly handle newgidmap path for rootless containers
After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano 3282f5a7c1
tests: fix for rootless multiple uids/gids
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano d8b669400a
rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Mrunal Patel 13fa5d2953 Merge pull request #1588 from s7v7nislands/delete_unused
Delete unused function
2017-09-08 17:34:00 -07:00
Michael Crosby b82d07e816 Merge pull request #1587 from Mashimiao/fix-namespace-empty
Fixes #1585 config.Namespaces is empty when accessed
2017-09-08 10:50:16 -04:00
Xiaochen Shen 88d22fde40 libcontainer: intelrdt: use init() to avoid race condition
This is the follow-up PR of #1279 to fix remaining issues:

Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-08 17:15:31 +08:00
s7v7nislands c795b8690b Delete unused function
Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>
2017-09-08 10:35:46 +08:00
Ma Shimiao c3d20e7817 Fixes #1585 config.Namespaces is empty when accessed
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-09-08 09:30:07 +08:00
Mrunal Patel deb9d7fd96 Merge pull request #1569 from cyphar/delay-seccomp
init: delay seccomp application as late as possible
2017-09-07 13:27:37 -07:00
Mrunal Patel 7e036aa0b0 Merge pull request #1541 from adrianreber/lazy
checkpoint: support lazy migration
2017-09-07 13:25:04 -07:00
Michael Crosby 7062c7556b Apply cgroups earlier
This applies cgroups earlier for container creation before the init
process starts running and forking off any additional processes.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-09-07 11:27:33 -04:00
Adrian Reber 60ae7091de checkpoint: support lazy migration
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.

This enables to reduce the downtime during process or container
migration to a minimum as the memory does not need to be transferred
during migration.

Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and if the used CRIU version supports lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.

To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:

 # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
  --status-fd /tmp/postcopy-pipe

In this example CRIU will hang/wait once it has opened the network port
and wait for network connection. As runC waits for CRIU to finish it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:

 #️ runc checkpoint --help | grep status
  --status-fd value   criu writes \0 to this FD once lazy-pages is ready

The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.

On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:

 # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
  -D checkpoint

and tell runC to do a lazy restore:

 # runc restore -d --image-path checkpoint --work-path checkpoint \
  --lazy-pages httpd

If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and talks to 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.

This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.

Some additional background about post-copy migration can be found in
these articles:

 https://lisas.de/~adrian/?p=1253
 https://lisas.de/~adrian/?p=1183

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Adrian Reber a3a632ad28 checkpoint: add support to query for lazy page support
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Xiaochen Shen 4d2756c116 libcontainer: add test cases for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:35:40 +08:00
Xiaochen Shen 692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Xiaochen Shen af3b0d9dce libcontainer/SPEC.md: add documentation for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Aleksa Sarai 1f32fff46d
setns init: delay seccomp as late as possible
This mirrors the standard_init_linux.go seccomp code, which only applies
seccomp early if NoNewPrivileges is enabled. Otherwise it's done
immediately before execve to reduce the amount of syscalls necessary for
users to enable in their seccomp profiles.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:30 +10:00
Aleksa Sarai 3ddde27d7d
init: move close(stateDirFd) before seccomp apply
This further reduces the number of syscalls that a user needs to enable
in their seccomp profile.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:26 +10:00
Qiang Huang 1c81e2a794 Merge pull request #1572 from tych0/fix-readonly-userns
fix --read-only containers under --userns-remap
2017-08-26 09:38:14 +08:00
Aleksa Sarai 4d6e6720a7
Merge branch 'pr-1573'
Fix systemd cgroup after memory type changed

LGTMs: @crosbymichael @cyphar
Closes #1573
2017-08-25 23:55:27 +10:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Tycho Andersen 66eb2a3e8f fix --read-only containers under --userns-remap
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-08-24 16:43:21 -06:00
Nikolas Sepos da4a5a9515 Add AutoDedup option to CriuOpts
Memory image deduplication, very useful for incremental dumps.

See: https://criu.org/Memory_images_deduplication

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 01:21:42 +02:00
Michael Crosby ccd2c20aa4 Merge pull request #1559 from Mashimiao/panic-fix-nil-linux
fix panic when Linux is nil for rootless case
2017-08-17 09:57:35 -04:00
Ma Shimiao 2333e7dc67 fix panic when Linux is nil for rootless case
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-08-16 09:11:13 +08:00
Mrunal Patel b31bdfc38a Merge pull request #1558 from hqhq/update_state
Update state after update
2017-08-15 10:46:44 -07:00
Qiang Huang e6e1c34a7d Update state after update
state.json should be a reflection of the container's
realtime state, including resource configurations,
so we should update state.json after updating container
resources.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-15 14:38:44 +08:00
Michael Crosby 3096b3fc85 Merge pull request #1556 from hqhq/fix_flakytest_TestNotifyOnOOM
Fix flaky test TestNotifyOnOOM
2017-08-14 10:03:23 -04:00
Qiang Huang 7726bcf0e2 Some fixes for testMemoryNotification
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-14 15:28:03 +08:00
Qiang Huang 40a1fb0e2f Fix flaky test TestNotifyOnOOM
Fixes: #1228

It can be reproduced by applying this patch:
```diff
@@ -45,6 +46,7 @@ func registerMemoryEvent(cgDir string, evName string, arg string) (<-chan struct
        go func() {
                defer func() {
                        close(ch)
+                       <-time.After(1 * time.Second)
                        eventfd.Close()
                        evFile.Close()
                }()
```

We can close channel after fds were closed.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-14 15:18:59 +08:00
Ma Shimiao 527dc5acbb fix panic when Linux is nil
Linux is not always not nil.
If Linux is nil, panic will occur.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-08-10 15:57:49 -04:00
Kenfe-Mickael Laventure 3ed492ad33
Handle non-devices correctly in DeviceFromPath
Before this change, some file type would be treated as char devices
(e.g. symlinks).

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-09 08:52:20 -07:00
Alex Fang e92add2151 Pass back the pid of runc:[1:CHILD] so we can wait on it
This allows the libcontainer to automatically clean up runc:[1:CHILD]
processes created as part of nsenter.

Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
2017-08-05 13:44:36 +10:00
Aleksa Sarai 45bde006ca
merge branch 'pr-1535'
LGTMs: @avagin @cyphar
Closes #1535
2017-08-05 13:33:07 +10:00
Aleksa Sarai 22bbec1b7f
merge branch 'pr-1548'
LGTMs: @crosbymichael @mrunalp @cyphar
Closes #1548
2017-08-05 13:02:46 +10:00
Mrunal Patel 135b9992b3 Merge pull request #1544 from mlaventure/fix-device-from-path
Fix condition to detect device type in DeviceFromPath
2017-08-04 17:36:57 -07:00
Kenfe-Mickael Laventure 6056912217
Revert "Merge pull request #1450 from vrothberg/sgid-non-numeric"
This reverts commit 5c73abbe75, reversing
changes made to 51b501dab1.

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-04 14:28:21 -07:00
Kenfe-Mickael Laventure 25f4c7e72b
Move user pkg unix specific calls to unix file
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-03 11:31:21 -07:00
Kenfe-Mickael Laventure 9ed15e94c8
Fix condition to detect device type in DeviceFromPath
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-03 11:06:54 -07:00
Adrian Reber 5d386f6e2b checkpoint: use CRIU VERSION RPC if available
With this runC also uses RPC to ask CRIU for its version. CRIU supports
a VERSION RPC since CRIU 3.0 and using the RPC interface does not
require parsing the console output of CRIU (which could change anytime).

For older CRIU versions which do not yet have the VERSION RPC runC falls
back to its old CRIU output parsing mode.

Once CRIU 3.0 is the minimum version required for runC the old code can
be removed.

v2:
 * adapt to changes in the previous patches based on the review

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:08:07 +00:00
Adrian Reber 2393692536 criurpc.proto: copy latest criurpc.proto from criu 3.3
Update criurpc.proto for the upcoming VERSION RPC.

This includes lazy_pages for the upcoming lazy migration support.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:32 +00:00
Adrian Reber c71d9cd447 criuSwrk: prepare for CRIU VERSION RPC
To use the CRIU VERSION RPC the criuSwrk function is adapted to work
with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION
RPC.

Also do not print c.criuVersion if it is '0' as the first RPC call will
always be the VERSION call and only after that the version will be
known.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:28 +00:00
Adrian Reber c5f0ce979b checkCriuVersion: only ask criu once about its version
If the version of criu has already been determined there is no need to
ask criu for the version again. Use the value from c.criuVersion.

v2:
 * reduce unnecessary code movement in the patch series
 * factor out the criu version parsing into a separate function

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:15 +00:00
Adrian Reber b6c47281db checkCriuVersion: switch to version using int
The checkCriuVersion function used a string to specify the minimum
version required. This is more comfortable for an external interface
but for an internal function this added unnecessary complexity. This
changes to version string like '1.5.2' to an integer like 10502. This is
already the format used internally in the function.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:05:27 +00:00
Michael Crosby 882d8eaba6 Merge pull request #1537 from tklauser/staticcheck
Fix issues found by staticcheck
2017-08-02 09:52:11 -04:00
Daniel, Dao Quang Minh b313a75364 Merge pull request #1477 from yummypeng/save-own-ns-path
Always save own namespace paths
2017-08-02 11:24:30 +01:00
Tobias Klauser e4e56cb6d8 libcontainer: remove ineffective break statements
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:39 +02:00
Tobias Klauser 24a4273cf9 libcontainer: handle error cases
Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:11 +02:00
Daniel Dao 91eafcbc65
tty: move IO of master pty to be done with epoll
This moves all console code to use github.com/containerd/console library to
handle console I/O. Also move to use EpollConsole by default when user requests
a terminal so we can still cope when the other side temporarily goes away.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2017-07-28 12:35:02 +01:00
Michael Crosby e775f0fba3 Merge pull request #1526 from stevenh/logrus-v1
Updated logrus to v1
2017-07-27 13:28:55 -04:00
yangshukui 5428532bdd remove the code that close negative descriptor
Signed-off-by: yangshukui <yangshukui@huawei.com>
2017-07-24 11:10:18 +08:00