Commit Graph

3825 Commits

Author SHA1 Message Date
Marco Vedovati feebfac358 Remove pipe close before exec.
Pipe close before exec is not necessary as os.Pipe() is calling pipe2
with O_CLOEXEC option.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:30 +03:00
Marco Vedovati 9a599f62fb Support for logging from children processes
Add support for children processes logging (including nsexec).
A pipe is used to send logs from children to parent in JSON.
The JSON format used is the same used by logrus JSON formatted,
i.e. children process can use standard logrus APIs.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:23 +03:00
Michael Crosby 029124da7a
Merge pull request #2031 from lifubang/selinux
Add selinux validate in runc exec
2019-04-03 16:09:19 -04:00
lifubang 3e6688f5c9 add selinux label for runc exec
Signed-off-by: lifubang <lifubang@acmcoder.com>
2019-04-03 12:09:06 +08:00
Mrunal Patel 6a3f4749b8
Merge pull request #2032 from rhatdan/selinux
Fix SELinux failures on disabled SELinux Machines
2019-04-02 13:39:48 -07:00
Daniel J Walsh dcf994b4f8
Fix SELinux failures on disabled SELinux Machines
On some machines when setting the SELinux key labels to "", we are seeing
failures that cause runc to fail.  Even if SELinux is disabled.

This check will ignore callers calling SELinux Set*Label functions with ""
when SELinux is disabled.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-04-02 10:27:27 -04:00
Aleksa Sarai da2021132b
merge branch 'pr-2026'
VERSION: back to development
  VERSION: release v1.0.0-rc7

Votes: +5 -0 /0
LGTMs: [unanimous]
2019-03-29 02:19:24 +11:00
Aleksa Sarai 6b5ee713f3
VERSION: back to development
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-28 22:46:35 +11:00
Aleksa Sarai 69ae5da6af
VERSION: release v1.0.0-rc7
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-28 22:45:53 +11:00
Michael Crosby 11fc498ffa
Merge pull request #2023 from LittleLightLittleFire/2022-fix-runc-zombie-process-regression
Fixes regression causing zombie runc:[1:CHILD] processes
2019-03-22 14:06:31 -04:00
Mrunal Patel dd22a84864
Merge pull request #2012 from rhatdan/selinux
Need to setup labeling of kernel keyrings.
2019-03-20 21:17:18 -07:00
Alex Fang eab5330908 Fixes regression causing zombie runc:[1:CHILD] processes
Whenever processes are spawned using nsexec, a zombie runc:[1:CHILD]
process will always be created and will need to be reaped by the parent

Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
2019-03-21 13:43:38 +11:00
Aleksa Sarai f56b4cbead
merge branch 'pr-2015'
Use getenv not secure_getenv

LGTMs: @crosbymichael @cyphar
Closes #2015
2019-03-16 17:30:56 +11:00
Daniel, Dao Quang Minh 7341c22d46
Merge pull request #2014 from filbranden/testing1
Add $RUNC_USE_SYSTEMD to run tests using systemd cgroup driver
2019-03-15 10:49:13 +00:00
Filipe Brandenburger 9fe7c939f8 Add a Travis-CI job for systemd cgroup driver
The additional test shows as a separate job. It sets environment
RUNC_USE_SYSTEMD=1 so it will be clear in Travis-CI that this job is
testing the systemd cgroup driver.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 18:53:27 -07:00
Filipe Brandenburger 5369f9ade3 Skip CRIU tests when $RUNC_USE_SYSTEMD for now
These tests sometimes hang, so let's skip them for now.

Tested:
  $ sudo make localintegration TESTPATH='/checkpoint.bats' RUNC_USE_SYSTEMD=1

The 5 tests in this test suite will be skipped.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 14:53:09 -07:00
Filipe Brandenburger d4586090c4 Update tests that depend on cgroupfs paths to consider systemd cgroups
When $RUNC_USE_SYSTEMD is set, then use a systemd syntax for the
cgroupsPath. Also fix $CGROUPS_PATH to look under the actual path to the
slice/scope created by systemd.

Tested:
  $ sudo make localintegration TESTPATH='/cgroups.bats' RUNC_USE_SYSTEMD=1

That test will fail without this commit.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 14:51:24 -07:00
Filipe Brandenburger a9056a348f Add $RUNC_USE_SYSTEMD to use systemd cgroup driver in tests
This allows us to test runc using libcontainer's systemd driver, by
passing an extra `--systemd-cgroup` argument to the calls to runc.

Tested:
  $ sudo make localintegration TESTPATH='/exec.bats' RUNC_USE_SYSTEMD=1

And confirmed that systemd was in use by looking at creation and removal
of libcontainer_<pid>_systemd_test_default.slice test slices. Also
introduced a breakage in systemd cgroup driver and confirmed that the
tests failed as expected.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 10:26:47 -07:00
Filipe Brandenburger 4b2b978291 Add cgroup name to error message
More information should help troubleshoot an issue when this error occurs.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 10:25:00 -07:00
Justin Cormack 6f714aa928
Use getenv not secure_getenv
secure_getenv is a Glibc extension and so this code does not compile
on Musl libc any more after this patch.

secure_getenv is only intended to be used in setuid binaries, in
order that they should not trust their environment. It simply returns
NULL if the binary is running setuid. If runc was installed setuid,
the user can already do anything as root, so it is game over, so this
check is not needed.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2019-03-14 10:58:10 +00:00
Daniel J Walsh cd96170c10
Need to setup labeling of kernel keyrings.
Work is ongoing in the kernel to support different kernel
keyrings per user namespace.  We want to allow SELinux to manage
kernel keyrings inside of the container.

Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.

Container running as container_t:s0:c1,c2 can manage keyrings with the same label.

This change required a revendoring or the SELinux go bindings.

github.com/opencontainers/selinux.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-03-13 17:57:30 -04:00
Mrunal Patel 2b18fe1d88
Merge pull request #1984 from cyphar/memfd-cleanups
nsenter: cloned_binary: "memfd" cleanups
2019-03-07 10:18:33 -08:00
Aleksa Sarai 923a8f8a9a
merge branch 'pr-2001'
README: link to /org/security/

LGTMs: @crosbymichael @cyphar
Closes #2001
2019-03-05 18:45:55 +11:00
Michael Crosby f739110263
Merge pull request #1968 from adrianreber/podman
Create bind mount mountpoints during restore
2019-03-04 11:37:07 -06:00
Michael Crosby f416cac1fa
Merge pull request #2000 from lifubang/preserve-fds-error
fix preserve-fds flag may cause runc hang
2019-03-04 10:45:17 -06:00
Vincent Batts dbf6e48d0f
README: link to /org/security/
Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
2019-03-03 15:01:08 -05:00
Aleksa Sarai 2d4a37b427
nsenter: cloned_binary: userspace copy fallback if sendfile fails
There are some circumstances where sendfile(2) can fail (one example is
that AppArmor appears to block writing to deleted files with sendfile(2)
under some circumstances) and so we need to have a userspace fallback.
It's fairly trivial (and handles short-writes).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:29:10 +11:00
Aleksa Sarai 16612d74de
nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a *very bad idea*).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked *alongside* the protection being
present.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:29:08 +11:00
Aleksa Sarai af9da0a450
nsenter: cloned_binary: use the runc statedir for O_TMPFILE
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:51 +11:00
Aleksa Sarai 2429d59352
nsenter: cloned_binary: expand and add pre-3.11 fallbacks
In order to get around the memfd_create(2) requirement, 0a8e4117e7
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

 * It required O_TMPFILE which is relatively new (having been added to
   Linux 3.11).

 * The fallback choice was made at compile-time, not runtime. This
   results in several complications when it comes to running binaries
   on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:50 +11:00
lifubang 7cb3cde1f4 fix preserve-fds flag may cause runc hang
Signed-off-by: lifubang <lifubang@acmcoder.com>
2019-03-01 17:15:17 +08:00
Aleksa Sarai 5b775bf297
nsenter: cloned_binary: detect and handle short copies
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-02-26 19:51:01 +11:00
Mrunal Patel f79e211b1d
Merge pull request #1995 from giuseppe/exec-preserve-fds
exec: expose --preserve-fds
2019-02-25 17:35:28 -08:00
Giuseppe Scrivano 52f4e0facc
exec: expose --preserve-fds
The implementation is already there, we only need to add the CLI
option and pass it down.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-02-25 17:33:04 +01:00
Mrunal Patel 5b5130ad76
Merge pull request #1963 from adrianreber/go-criu
Vendor in go-criu and use it for CRIU's RPC definition
2019-02-23 10:44:28 -08:00
Michael Crosby 8084f7611e
Merge pull request #1986 from adrianreber/master
switched travis to xenial
2019-02-21 15:36:02 -05:00
Adrian Reber f1da0d3008
switched travis to xenial
The CRIU test for lazy migration was always skipped in Travis because
the kernel was too old. This switches Travis testing to dist: xenial
which provides a newer kernel which enables CRIU lazy migration testing.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-16 19:45:22 +01:00
Aleksa Sarai 751f18de2a
merge branch 'pr-1982'
nsexec (CVE-2019-5736): avoid parsing environ

LGTMs: @cyphar @crosbymichael
Closes #1982
2019-02-15 18:40:33 +11:00
Adrian Reber 9edb5494bb
Use vendored in CRIU Go bindings
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-14 18:20:02 +01:00
Adrian Reber bfca1e6262
Vendor in go-criu
Now that CRIU has released Go bindings, this commit vendors those in.

At first it only replaces the copy of RPC interface but the goal is to
use CRIU functions from the Go bindings instead of replicating the
functionality in runc.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-14 18:20:02 +01:00
Christian Brauner bb7d8b1f41
nsexec (CVE-2019-5736): avoid parsing environ
My first attempt to simplify this and make it less costly focussed on
the way constructors are called. I was under the impression that the ELF
specification mandated that arg, argv, and actually even envp need to be
passed to functions located in the .init_arry section (aka
"constructors"). Actually, the specifications is (cf. [2]):

SHT_INIT_ARRAY
This section contains an array of pointers to initialization functions,
as described in ``Initialization and Termination Functions'' in Chapter
5. Each pointer in the array is taken as a parameterless procedure with
a void return.

which means that this becomes a libc specific decision. Glibc passes
down those args, musl doesn't. So this approach can't work. However, we
can at least remove the environment parsing part based on POSIX since
[1] mandates that there should be an environ variable defined in
unistd.h which provides access to the environment. See also the relevant
Open Group specification [1].

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array

Fixes: CVE-2019-5736
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-02-14 16:06:21 +01:00
Mrunal Patel f414f497b5
Merge pull request #1978 from filbranden/systemd5
Remove detection for scope properties, which have always been broken
2019-02-13 11:54:08 -08:00
Daniel, Dao Quang Minh 0a012df867
Merge pull request #1973 from jhowardmsft/jjh/runtimespec
Vendor opencontainers/runtime-spec 29686dbc
2019-02-12 17:07:43 +00:00
Filipe Brandenburger cd41feb46b Remove detection for scope properties, which have always been broken
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)

This can be seen in journal logs whenever a container is started with
libpod:

  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.

Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-02-11 16:05:37 -08:00
Adrian Reber 7354546cc8
Create mountpoints also on restore
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Adrian Reber f661e02343
factor out bind mount mountpoint creation
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Aleksa Sarai 6635b4f0c6
merge branch 'cve-2019-5736'
nsenter: clone /proc/self/exe to avoid exposing host binary to container

Fixes: CVE-2019-5736
LGTMs: @cyphar @crosbymichael
2019-02-08 18:58:10 +11:00
Aleksa Sarai 0a8e4117e7
nsenter: clone /proc/self/exe to avoid exposing host binary to container
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).

We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.

This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).

Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-02-08 18:57:59 +11:00
Aleksa Sarai dd023c457d
merge branch 'pr-1972'
Update vendored golang.org/x/sys to latest

LGTMs: @crosbymichael @cyphar
Closes #1972
2019-02-08 18:52:59 +11:00
John Howard ec069fe332 Vendor opencontainers/runtime-spec 29686dbc
Signed-off-by: John Howard <jhoward@microsoft.com>
2019-02-07 14:49:22 -08:00