Commit Graph

1211 Commits

Author SHA1 Message Date
Mrunal Patel a00bf01908
Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Dominik Süß 0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Adrian Reber 0d01164756 Fix travis Go: tip
This fixes

 libcontainer/container_linux.go:1200: Error call has possible formatting directive %s

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-10-13 10:44:07 +00:00
Aleksa Sarai e40d4635c4
merge branch 'pr-1894'
Move spec.Linux.IntelRdt check to spec.Linux != nil block

LGTMs: @crosbymichael @cyphar
Closes #1894
2018-10-09 02:41:13 +11:00
Jonathan Marler 1499c746a1 Move spec.Linux.IntelRdt check to spec.Linux != nil block
Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>
2018-10-04 21:30:55 -06:00
Mike Brown 26bdc0dce7 clarify license information
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2018-10-03 10:39:44 -05:00
Mrunal Patel 2abd837c8c
Merge pull request #1893 from cyphar/keyctl-ignore-enosys
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
2018-09-25 13:35:16 -07:00
Danail Branekov a1d5398afa Respect container's cgroup path
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts

Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
2018-09-25 17:43:36 +01:00
Aleksa Sarai 578fe65e4f
merge branch 'pr-1817'
Fix duplicate entries and missing entries in getCgroupMountsHelper
  Add test for testing cgroup mounts on bedrock linux
  Stop relying on number of subsystems for cgroups

LGTMs: @crosbymichael @cyphar
Closes #1817
2018-09-19 19:48:17 +10:00
Michael Crosby cc8146cf93
Merge pull request #1858 from marcov/nsenter-README
Update outdated nsenter README content
2018-09-17 10:53:19 -04:00
Michael Crosby d77251d5fc
Merge pull request #1892 from Ace-Tang/add_clean_test
test: add more test case for CleanPath
2018-09-17 10:51:17 -04:00
Aleksa Sarai 40f1468413
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:

  * We already have a flag that disables the use of session keyrings
    (for older kernels that had system-wide keyring limits and so
    on). So disabling it is not a new idea.

  * While the primary justification of using session keys *is*
    security-based, it's more of a security-by-obscurity protection.
    The main defense keyrings have is VFS credentials -- which is
    something that users already have better security tools for
    (setuid(2) and user namespaces).

  * Given the security justification you might argue that we
    shouldn't silently ignore this. However, the only way for the
    kernel to return -ENOSYS is either being ridiculously old (at
    which point we wouldn't work anyway) or that there is a seccomp
    profile in place blocking it.

    Given that the seccomp profile (if malicious) could very easily
    just return 0 or a silly return code (or something even more
    clever with seccomp-bpf) and trick us without this patch, there
    isn't much of a significant change in how much seccomp can trick
    us with or without this patch.

Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.

Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-09-17 21:38:30 +10:00
Ace-Tang 5963cf2afc test: add more test case for CleanPath
Signed-off-by: Ace-Tang <aceapril@126.com>
2018-09-14 21:37:12 +08:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Yan Zhu feb90346e0 doc: fix typo
Signed-off-by: Yan Zhu <yanzhu@alauda.io>
2018-09-07 11:58:59 +08:00
Michael Crosby 70ca035aa6
Merge pull request #1883 from lifubang/containeridinpath
fix delete other file bug when container id is ..
2018-09-05 13:43:21 -04:00
Mrunal Patel 9cda583235
Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot
linux: drop check for /proc as invalid dest
2018-09-04 09:26:21 -07:00
Lifubang 4eb30fcdbe code optimization: use securejoin.SecureJoin and CleanPath
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-09-04 09:02:18 +08:00
Lifubang 4fae8fcce2 code optimization after review
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-09-03 23:27:31 +08:00
Lifubang d2d226e8f9 fix unexpected delete bug when container id is ..
Signed-off-by: Lifubang <lifubang@acmcoder.com>
2018-08-31 11:17:42 +08:00
ChangFeng 3ce8fac7c4 libcontainer: add /proc/loadavg to the white list of bind mount
Signed-off-by: JunLi <lijun.git@gmail.com>
2018-08-30 21:30:23 +08:00
Giuseppe Scrivano 636b664027
linux: drop check for /proc as invalid dest
it is now allowed to bind mount /proc.  This is useful for rootless
containers when the PID namespace is shared with the host.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-08-30 09:56:18 +02:00
Akihiro Suda b34d6d8a7c libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs
subgid is defined per user, not group (see subgid(5))

This commit also adds support for specifying subuid owner with a numeric UID.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-08-29 07:46:03 +09:00
Michael Crosby 1555a78945
Merge pull request #1874 from mrunalp/drop_unused_code
Remove unused veth setup code
2018-08-27 11:07:25 -04:00
Qiang Huang 0228707b77
Merge pull request #1873 from rhatdan/ms_move
When doing a copyup, /tmp can not be a shared mount point
2018-08-27 10:08:53 +08:00
Mrunal Patel fe3d5c4c6e Remove unused veth setup code
Networking is setup by plugins for users of runc so it makes sense
to get rid of the veth strategy.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-08-24 15:41:52 -07:00
Adrian Reber fa43a72aba
criu: restore into existing namespace when specified
Using CRIU to checkpoint and restore a container into an existing
network namespace is not possible.

If the network namespace is defined like

	{
		"type": "network",
		"path": "/run/netns/test"
	}

there is the expectation that the restored container is again running in
the network namespace specified with 'path'.

This adds the new CRIU 'external namespace' feature to runc, where
during checkpointing that specific namespace is referenced and during
restore CRIU tries to restore the container in exactly that
namespace.

This breaks/fixes current runc behavior. If, without this patch, runc
restores a container with such a network namespace definition, it is
ignored and CRIU recreates a network namespace without a name.

With this patch runc uses the network namespace path (if available) to
checkpoint and restore the container in just that network namespace.

Restore will now fail if a container was checkpointed with a network
namespace path set and if that network namespace path does not exist
during restore.

runc still falls back to the old behavior if CRIU older than 3.11 is
installed.

Fixes #1786

Related to https://github.com/projectatomic/libpod/pull/469

Thanks to Andrei Vagin for all the help in getting the interface between
CRIU and runc right!

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-08-22 23:27:20 +02:00
Daniel J Walsh 62a4763a7a
When doing a copyup, /tmp can not be a shared mount point
MOVE_MOUNT will fail under certain situations.

You are not allowed to MS_MOVE if the parent directory is shared.

man mount
...
   The move operation
       Move a mounted tree to another place (atomically).  The call is:

              mount --move olddir newdir

       This  will cause the contents which previously appeared under olddir to
       now be accessible under newdir.  The physical location of the files  is
       not changed.  Note that olddir has to be a mountpoint.

       Note  also that moving a mount residing under a shared mount is invalid
       and unsupported.  Use findmnt -o TARGET,PROPAGATION to see the  current
       propagation flags.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-08-20 17:41:06 -04:00
Aleksa Sarai 20aff4f048
merge branch 'pr-1867'
Revert "libcontainer/rootfs_linux: minor cleanup"

LGTMs: @hqhq @cyphar
Closes #1867
2018-08-15 15:42:56 +10:00
Mrunal Patel 26ec8a9783 Revert "libcontainer/rootfs_linux: minor cleanup"
This reverts commit 1b27db67f1.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-08-14 15:50:18 -07:00
Marco Vedovati 34ed62697b Update outdated nsenter README content
Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2018-08-07 17:53:56 +02:00
Michael Crosby 4056a41f58
Merge pull request #1830 from crosbymichael/procs
Pass GOMAXPROCS to init processes
2018-08-01 10:48:06 -04:00
Jay Kamat a2faaa1317
Fix duplicate entries and missing entries in getCgroupMountsHelper
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
2018-07-31 20:12:18 -07:00
Alban Crequy 3321aa1af7 Fix regression with mounts with non-absolute source path
PR #1753 introduced a test on the mount flags but the binary operator
was wrong, see https://github.com/opencontainers/runc/pull/1753#discussion_r203445652

This was noticed when investigating https://github.com/opencontainers/runtime-tools/issues/651

Symptoms: in the container, /proc/self/mountinfo displays some mounts as
follow:

296 279 0:67 / /tmp rw,nosuid - tmpfs /home/dpark/go/src/github.com/opencontainers/runc/tmpfs rw,size=65536k,mode=755

Signed-off-by: Alban Crequy <alban@kinvolk.io>
2018-07-18 18:30:49 +02:00
Michael Crosby 53fddb540a Pass GOMAXPROCS to init processes
This will help runc's init to not spawn many threads on large systems when
launched with max procs by the caller.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-06-26 11:23:37 -04:00
Michael Crosby 2c632d1a2d
Merge pull request #1824 from cyphar/fix-mips-build-devNumber
libcontainer: devices: fix mips builds
2018-06-25 13:21:28 -04:00
Jay Kamat e5a7c61f3c Add test for testing cgroup mounts on bedrock linux
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests

Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:01:07 +01:00
Daniel Dao 5ee0648bfb Stop relying on number of subsystems for cgroups
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.

Without the 'all' flag, ignore duplicate subsystems after the first.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:00:58 +01:00
Aleksa Sarai 823c06eae9
libcontainer: improve "kernel.{domainname,hostname}" sysctl handling
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-06-18 21:48:04 +10:00
Aleksa Sarai a0e99e7a1a
libcontainer: devices: fix mips builds
It turns out that MIPS uses uint32 in the device number returned by
stat(2), so explicitly wrap everything to make the compiler happy. I
really wish that Go had C-like numeric type promotion.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-06-17 11:22:01 +10:00
Mrunal Patel ad0f525506
Merge pull request #1819 from tiborvass/fix-arm32bit
libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)
2018-06-15 07:06:50 -07:00
Tibor Vass c205e9fb64 libcontainer: fix compilation on GOARCH=arm GOARM=6 (32 bits)
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```

Signed-off-by: Tibor Vass <tibor@docker.com>
2018-06-14 18:33:14 +00:00
Giuseppe Scrivano cbcc85d311
runc: not require uid/gid mappings if euid()==0
When running in a new unserNS as root, don't require a mapping to be
present in the configuration file.  We are already skipping the test
for a new userns to be present.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-06-12 12:45:54 +02:00
Daniel J Walsh aa3fee6c80
SELinux labels are tied to the thread
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.

Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.

It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-06-11 08:34:58 -04:00
Aleksa Sarai dd56ece823
merge branch 'pr-1812'
Fix race in runc exec

LGTMs: @dqminh @cyphar
Closes #1812
2018-06-04 19:02:33 +10:00
Daniel, Dao Quang Minh 2e91544060
Merge pull request #1806 from cyphar/cgroup-ignorable-error-fixup
cgroup: clean up isIgnorableError for skippable EROFS
2018-06-02 23:57:02 +01:00
Mrunal Patel bd3c4f844a Fix race in runc exec
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.

This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2018-06-01 16:25:58 -07:00
Michael Crosby 0e561642f8
Merge pull request #1688 from AkihiroSuda/unshare-m-r
main: support rootless mode in userns
2018-05-29 15:41:17 -04:00
Aleksa Sarai 939d5a3753
cgroup: clean up isIgnorableError for skippable EROFS
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-05-25 11:31:41 +10:00
Qiang Huang dd67ab10d7
Merge pull request #1759 from cyphar/rootless-erofs-as-eperm
rootless: cgroup: treat EROFS as a skippable error
2018-05-25 09:24:16 +08:00
Daniel, Dao Quang Minh 2e931185f9
Merge pull request #1805 from derekwaynecarr/systemd-cpuquota-fix
fix systemd cpu quota for -1
2018-05-24 11:24:27 +01:00
Akihiro Suda c93815738a libcontainer: remove extra CAP_SETGID check for SetgroupAttr
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-24 14:59:30 +09:00
Derek Carr b515963c10 systemd cpu quota ignores -1
Signed-off-by: Derek Carr <decarr@redhat.com>
2018-05-23 14:28:39 -04:00
Michael Crosby fd0febd3ce Wrap error messages during init
Fixes #1437

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-05-10 10:28:10 -04:00
Akihiro Suda f103de57ec main: support rootless mode in userns
Running rootless containers in userns is useful for mounting
filesystems (e.g. overlay) with mapped euid 0, but without actual root
privilege.

Usage: (Note that `unshare --mount` requires `--map-root-user`)

  user$ mkdir lower upper work rootfs
  user$ curl http://dl-cdn.alpinelinux.org/alpine/v3.7/releases/x86_64/alpine-minirootfs-3.7.0-x86_64.tar.gz | tar Cxz ./lower || ( true; echo "mknod errors were ignored" )
  user$ unshare --mount --map-root-user
  mappedroot# runc spec --rootless
  mappedroot# sed -i 's/"readonly": true/"readonly": false/g' config.json
  mappedroot# mount -t overlay -o lowerdir=./lower,upperdir=./upper,workdir=./work overlayfs ./rootfs
  mappedroot# runc run foo

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Akihiro Suda 9c7d8bc1fd libcontainer: add parser for /etc/sub{u,g}id and /proc/PID/{u,g}id_map
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-05-10 12:16:43 +09:00
Mrunal Patel 0cbfd8392f
Merge pull request #1562 from cyphar/carry-975-959-ipc-uid-namespaces
nsenter: improve namespace creation and SELinux IPC handling
2018-04-26 14:12:33 -07:00
Mrunal Patel 871ba2e58e
Merge pull request #1781 from filbranden/systemd3
Make channel for StartTransientUnit buffered
2018-04-24 11:56:34 -07:00
Michael Crosby bdbb9fab07
Merge pull request #1693 from AkihiroSuda/leave-setgroups-allow
libcontainer: allow setgroup in rootless mode
2018-04-24 11:24:04 -04:00
Michael Crosby 1f11dc5dba
Merge pull request #1785 from dlorenc/seccomp
Make the setupSeccomp function public.
2018-04-19 16:00:54 -04:00
Mrunal Patel 63e6708c74
Merge pull request #1784 from pierrchen/master
libcontainer/rootfs_linux: minor cleanup
2018-04-17 17:02:10 -07:00
dlorenc 40680b2d37 Make the setupSeccomp function public.
This function is useful for converting from the OCI spec format to the one used by runC/libcontainer.

Signed-off-by: dlorenc <lorenc.d@gmail.com>
2018-04-17 10:47:22 -07:00
Michael Crosby d56f6cc202
Merge pull request #1753 from wking/do-not-require-bind-mount-type
libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts
2018-04-16 11:01:53 -04:00
Bin Chen 1b27db67f1 libcontainer/rootfs_linux: minor cleanup
move variable close to where is used

Signed-off-by: Bin Chen <nk@devicu.com>
2018-04-16 22:25:48 +10:00
Filipe Brandenburger 165ee45334 Make channel for StartTransientUnit buffered
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.

This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-14 08:49:50 -07:00
Michael Crosby f753f300ae
Merge pull request #1779 from runcom/gcc8-fix
nsexec.c: fix GCC 8 warning
2018-04-12 12:13:43 -04:00
Michael Crosby 9f0eca2a94
Merge pull request #1777 from nalind/no-config-for-extant-netns
Only configure networking when creating a net ns
2018-04-12 10:55:02 -04:00
Antonio Murdaca 1a5064622c
nsexec.c: fix GCC 8 warning
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-04-12 12:25:06 +02:00
Nalin Dahyabhai 4521d4b19c Only configure networking when creating a net ns
When joining an existing namespace, don't default to configuring a
loopback interface in that namespace.

Its creator should have done that, and we don't want to fail to create
the container when we don't have sufficient privileges to configure the
network namespace.

Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
2018-04-11 13:28:19 -04:00
Filipe Brandenburger 0e16bd9b53 Detect whether Delegate is available on both slices and scopes
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.

Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-10 11:42:55 -07:00
Filipe Brandenburger 8ab251f298 Fix systemd.Apply() to check for DBus error before waiting on a channel.
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)

Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-09 11:51:59 -07:00
Sebastien Boeuf 985628dda0 libcontainer: Don't set container state to running when exec'ing
There is no reason to set the container state to "running" as a
temporary value when exec'ing a process on a container in "created"
state. The problem doing this is that consumers of the libcontainer
library might use it by keeping pointers in memory. In this case,
the container state will indicate that the container is running, which
is wrong, and this will end up with a failure on the next action
because the check for the container state transition will complain.

Fixes #1767

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2018-03-30 09:29:18 -07:00
Akihiro Suda 73f3dc6389 libcontainer: allow setgroup in rootless mode
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-27 17:42:05 +09:00
Akihiro Suda ed58366cc8 libcontainer: fix Boolmsg alignment
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-26 14:44:03 +09:00
Tamal Saha 58415b4b12 Fix error message
Signed-off-by: Tamal Saha <tamal@appscode.com>
2018-03-21 20:52:09 -07:00
Aleksa Sarai fd3a6e6c83
libcontainer: handle unset oomScoreAdj corectly
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:

> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.

Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Aleksa Sarai 03e585985f
rootless: cgroup: treat EROFS as a skippable error
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.

An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.

Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Daniel J Walsh 43aea05946 Label the masked tmpfs with the mount label
Currently if a confined container process tries to list these directories
AVC's are generated because they are labeled with external labels.  Adding
the mountlabel will remove these AVC's.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-03-09 14:29:06 -05:00
Qiang Huang 9facb87f87
Merge pull request #1754 from vikaschoudhary16/add-timeout
Add timeout while waiting for StartTransinetUnit completion signal
2018-03-08 09:09:34 +08:00
W. Trevor King 0aa6e4e5d3 libcontainer/specconv/spec_linux: Support empty 'type' for bind mounts
From the "Creating a bind mount" section of mount(2) [1]:

> If mountflags includes MS_BIND (available since Linux 2.4), then
> perform a bind mount...
>
> The filesystemtype and data arguments are ignored.

This commit adds support for configurations that leave the OPTIONAL
type [2] unset for bind mounts.  There's a related spec-example change
in flight with [3], although my personal preference would be a more
explicit spec for the whole mount structure [4].

[1]: http://man7.org/linux/man-pages/man2/mount.2.html
[2]: https://github.com/opencontainers/runtime-spec/blame/v1.0.1/config.md#L102
[3]: https://github.com/opencontainers/runtime-spec/pull/954
[4]: https://github.com/opencontainers/runtime-spec/pull/771

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-03-07 10:23:42 -08:00
vikaschoudhary16 04e95b526d Add timeout while waiting for StartTransinetUnit completion signal from dbus
Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-03-07 05:11:38 -05:00
Denys Smirnov 3d26fc3fd7 cgroups/fs: fix NPE on Destroy than no cgroups are set
Currently Manager accepts nil cgroups when calling Apply, but it will panic then trying to call Destroy with the same config.

Signed-off-by: Denys Smirnov <denys@sourced.tech>
2018-03-06 23:31:31 +01:00
Vincent Batts bf74951617
libcontainer/user: platform dependent calls
This rearranges a bit of the user and group lookup, such that only a
basic subset is exposed.

Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
2018-02-28 14:14:24 -05:00
Aleksa Sarai 757e78bebd
merge branch 'pr-1743'
The setupUserNamespace function is always called.

LGTMs: @crosbymichael @mrunalp @cyphar
Closes #1743
2018-02-27 12:22:52 +11:00
Michael Crosby 8aca07289d
Merge pull request #1736 from allencloud/fix-lint-warning
fix lint error in specconv
2018-02-26 14:21:26 -05:00
ynirk 2420eb1f4d The setupUserNamespace function is always called.
The function is called even if the usernamespace is not set.
This results having wrong uid/gid set on devices.

This fix add a test to check if usernamespace is set befor calling
setupUserNamespace.

Fixes #1742

Signed-off-by: Julien Lavesque <julien.lavesque@gmail.com>
2018-02-26 14:27:11 +01:00
Allen Sun 3f32e72963 fix lint error in specconv
Signed-off-by: Allen Sun <allensun.shl@alibaba-inc.com>
2018-02-26 15:39:54 +08:00
Michael Crosby 595bea022f
Merge pull request #1722 from ravisantoshgudimetla/fix-systemd-path
fix systemd slice expansion so that it could be consumed by cAdvisor
2018-02-20 09:59:24 -05:00
W. Trevor King 50dc7ee96c libcontainer/capabilities_linux: Drop os.Getpid() call
gocapability has supported 0 as "the current PID" since
syndtr/gocapability@5e7cce49 (Allow to use the zero value for pid to
operate with the current task, 2015-01-15, syndtr/gocapability#2).
libcontainer was ported to that approach in 444cc298 (namespaces:
allow to use pid namespace without mount namespace, 2015-01-27,
docker/libcontainer#358), but the change was clobbered by 22df5551
(Merge branch 'master' into api, 2015-02-19, docker/libcontainer#388)
which landed via 5b73860e (Merge pull request #388 from docker/api,
2015-02-19, docker/libcontainer#388).  This commit restores the
changes from 444cc298.

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-02-19 15:47:42 -08:00
ravisantoshgudimetla 7019e1de7b fix systemd slice expansion so that it could be consumed by cAdvisor
Signed-off-by: ravisantoshgudimetla <ravisantoshgudimetla@gmail.com>
2018-02-18 21:32:39 -05:00
Mrunal Patel 6e15bc3f92
Merge pull request #1702 from crosbymichael/chroot
chroot when no mount namespaces is provided
2018-02-07 10:09:35 -08:00
W. Trevor King be16b13645 libcontainer/state_linux_test: Add a testTransitions helper
The helper DRYs up the transition tests and makes it easy to get
complete coverage for invalid transitions.

I'm also using t.Run() for subtests.  Run() is new in Go 1.7 [1], but
runc dropped support for 1.6 back in e773f96b (update go version at
travis-ci, 2017-02-20, #1335).

[1]: https://blog.golang.org/subtests

Signed-off-by: W. Trevor King <wking@tremily.us>
2018-01-25 11:18:45 -08:00
Michael Crosby 91ca331474 chroot when no mount namespaces is provided
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-01-25 11:36:37 -05:00
Michael Crosby c4e4bb0df2
Merge pull request #1699 from AkihiroSuda/indent-c
make: validate C format
2018-01-25 10:09:09 -05:00
Aleksa Sarai 5a46c2ba8b
nsenter: move namespace creation after userns creation
Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace *before*
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC *after* the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-01-25 23:56:49 +11:00
Akihiro Suda dd5eb3b9e3 make: validate C format
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-01-24 10:49:50 +09:00
Ed King 5c0af14bf8 Return from goroutine when it should terminate
Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-23 10:46:31 +00:00
Will Martin 8d3e6c9826 Avoid race when opening exec fifo
When starting a container with `runc start` or `runc run`, the stub
process (runc[2:INIT]) opens a fifo for writing. Its parent runc process
will open the same fifo for reading. In this way, they synchronize.

If the stub process exits at the wrong time, the parent runc process
will block forever.

This can happen when racing 2 runc operations against each other: `runc
run/start`, and `runc delete`. It could also happen for other reasons,
e.g. the kernel's OOM killer may select the stub process.

This commit resolves this race by racing the opening of the exec fifo
from the runc parent process against the stub process exiting. If the
stub process exits before we open the fifo, we return an error.

Another solution is to wait on the stub process. However, it seems it
would require more refactoring to avoid calling wait multiple times on
the same process, which is an error.

Signed-off-by: Craig Furman <cfurman@pivotal.io>
2018-01-22 17:03:02 +00:00
Antonio Murdaca cd1e7abee2
libcontainer: expose annotations in hooks
Annotations weren't passed to hooks. This patch fixes that by passing
annotations to stdin for hooks.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-01-11 16:54:01 +01:00
vikaschoudhary16 d5b4a3eddb Fix race against systemd
- T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298)
- T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner.
- T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code

Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-01-08 09:37:26 -05:00