Commit Graph

52 Commits

Author SHA1 Message Date
Mrunal Patel a00bf01908
Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except for cgroups handling.

`RootlessCgroups` denotes that runc is unlikely to have full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in a user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost the same as "rootful" runc, except that cgroups errors are ignored.
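
The rules above can be condensed into a small truth table. The following is only a sketch in C (the PR itself changes Go code); the struct, the function name `detect_rootless`, and the `in_initial_userns` parameter are hypothetical, and how runc actually detects the initial user namespace is not shown here.

```c
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical mirror of the two flags described in this commit message. */
struct rootless_flags {
	bool rootless_euid;    /* euid != 0 in the current user namespace */
	bool rootless_cgroups; /* unlikely to have full access to cgroups */
};

/*
 * in_initial_userns is assumed to be computed elsewhere; this sketch only
 * encodes how the two flags follow from it and from the effective UID.
 */
static struct rootless_flags detect_rootless(uid_t euid, bool in_initial_userns)
{
	struct rootless_flags f;

	f.rootless_euid = (euid != 0);
	/* Only root in the initial namespace gets rootless_cgroups == false. */
	f.rootless_cgroups = !(euid == 0 && in_initial_userns);
	return f;
}
```

With these inputs, root inside a user namespace yields `rootless_euid == false` and `rootless_cgroups == true`, which is the "runc-in-userns" case.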

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec` without `--rootless` should work when a sufficient number of
  UIDs/GIDs are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in a user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to execute runc as the root in userns, without mounting a writable `/run/runc`.
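
The `$XDG_RUNTIME_DIR` rules can be sketched as a single predicate. This is an illustration in C rather than the PR's Go code; `honor_xdg_runtime_dir` and its parameters are hypothetical names, and the non-root branch is an assumption, since the note above only spells out the two root cases.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/types.h>

/*
 * Returns true when $XDG_RUNTIME_DIR should be honored.
 * `user` is the value of $USER and may be NULL when it is unset.
 */
static bool honor_xdg_runtime_dir(uid_t euid, bool in_initial_userns, const char *user)
{
	if (euid != 0)
		return true;  /* assumption: non-root invocations honor it */
	if (in_initial_userns)
		return false; /* root in the initial namespace keeps /run/runc */
	/* Root inside a user namespace: $USER != "" && $USER != "root". */
	return user != NULL && user[0] != '\0' && strcmp(user, "root") != 0;
}
```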

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Yan Zhu feb90346e0 doc: fix typo
Signed-off-by: Yan Zhu <yanzhu@alauda.io>
2018-09-07 11:58:59 +08:00
Mrunal Patel 0cbfd8392f
Merge pull request #1562 from cyphar/carry-975-959-ipc-uid-namespaces
nsenter: improve namespace creation and SELinux IPC handling
2018-04-26 14:12:33 -07:00
Michael Crosby bdbb9fab07
Merge pull request #1693 from AkihiroSuda/leave-setgroups-allow
libcontainer: allow setgroup in rootless mode
2018-04-24 11:24:04 -04:00
Antonio Murdaca 1a5064622c
nsexec.c: fix GCC 8 warning
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2018-04-12 12:25:06 +02:00
Akihiro Suda 73f3dc6389 libcontainer: allow setgroup in rootless mode
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-03-27 17:42:05 +09:00
Aleksa Sarai 5a46c2ba8b
nsenter: move namespace creation after userns creation
Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.

One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace *before*
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC *after* the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.

This change matches our handling of user namespaces to be the same as
how LXC handles these problems.
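
As a rough illustration of the ordering described above, the requested flags can be split into two stages: the user namespace first, and every other namespace only after the user has been mapped. The names `clone_stages` and `split_clone_flags` are hypothetical, and this sketch only computes the flag sets; it does not perform the unshare or the ID mapping.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>

struct clone_stages {
	int first;  /* unshared up front: only the user namespace */
	int second; /* unshared after the user mapping, e.g. CLONE_NEWIPC */
};

/* Split clone flags so that CLONE_NEWUSER is handled before everything else. */
static struct clone_stages split_clone_flags(int flags)
{
	struct clone_stages s;

	s.first = flags & CLONE_NEWUSER;
	s.second = flags & ~CLONE_NEWUSER;
	return s;
}
```

A caller would unshare `first`, wait for the mapping to be written, switch to the mapped user, and only then unshare `second`.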

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-01-25 23:56:49 +11:00
Akihiro Suda dd5eb3b9e3 make: validate C format
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-01-24 10:49:50 +09:00
Aleksa Sarai 969bb49cc3
nsenter: do not resolve path in nsexec context
With the addition of our new{uid,gid}map support, we used to call
execvp(3) from inside nsexec. This would mean that the path resolution
for the binaries would happen in nsexec. Move the resolution to the
initial setup code, and pass the absolute path to nsexec.
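
A minimal sketch of resolving a binary name ahead of time, so that only a full path needs to be handed over. `resolve_binary` is a hypothetical helper, not the code from this commit (where the resolution happens in the setup code). It is simplified: empty `PATH` entries are skipped, relative entries are not made absolute, and `access(2)` does not distinguish directories from files.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Resolve `name` against the colon-separated `path` list.
 * Returns 0 and fills `out` on success, -1 otherwise.
 */
static int resolve_binary(const char *name, const char *path, char *out, size_t outlen)
{
	if (strchr(name, '/') != NULL) {
		/* Like execvp(3), names containing a slash are not searched. */
		int n = snprintf(out, outlen, "%s", name);
		return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
	}

	while (path != NULL && *path != '\0') {
		size_t dirlen = strcspn(path, ":");
		int n = snprintf(out, outlen, "%.*s/%s", (int)dirlen, path, name);

		/* Accept the first candidate that fits and is executable. */
		if (dirlen > 0 && n >= 0 && (size_t)n < outlen && access(out, X_OK) == 0)
			return 0;
		path += dirlen;
		if (*path == ':')
			path++;
	}
	return -1;
}
```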

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:33 +10:00
Aleksa Sarai 6097ce74d8
nsenter: correctly handle newgidmap path for rootless containers
After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano d8b669400a
rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.
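
For context, each mapping is a triple of "ID inside the namespace, ID outside, count", and several such ranges can be applied at once. The sketch below renders ranges in the line format of `/proc/<pid>/uid_map`; `struct id_map` and `format_id_maps` are hypothetical names and this is not the code from the patch, which passes the ranges to the newuidmap/newgidmap tools instead of writing the file itself.

```c
#include <stddef.h>
#include <stdio.h>

/* One range: `count` IDs starting at `inside` map to IDs starting at `outside`. */
struct id_map {
	unsigned int inside;
	unsigned int outside;
	unsigned int count;
};

/*
 * Render the ranges as "inside outside count" lines.
 * Returns the number of bytes written, or -1 if `buf` is too small.
 */
static int format_id_maps(const struct id_map *maps, size_t n, char *buf, size_t buflen)
{
	size_t used = 0;

	if (buflen == 0)
		return -1;
	buf[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		int w = snprintf(buf + used, buflen - used, "%u %u %u\n",
				 maps[i].inside, maps[i].outside, maps[i].count);
		if (w < 0 || (size_t)w >= buflen - used)
			return -1;
		used += (size_t)w;
	}
	return (int)used;
}
```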

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Alex Fang e92add2151 Pass back the pid of runc:[1:CHILD] so we can wait on it
This allows libcontainer to automatically clean up runc:[1:CHILD]
processes created as part of nsenter.

Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
2017-08-05 13:44:36 +10:00
yangshukui 5428532bdd remove the code that close negative descriptor
Signed-off-by: yangshukui <yangshukui@huawei.com>
2017-07-24 11:10:18 +08:00
Aleksa Sarai d2f49696b0
runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:24 +11:00
Aleksa Sarai 6bd4bd9030
*: handle unprivileged operations and !dumpable
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.

!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.

This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).

In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.
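
One way to read the "unset and then re-set" step is as a temporary toggle around a single operation. The sketch below is an assumption-laden illustration, not the code in nsexec: `run_while_dumpable` and `read_dumpable` are hypothetical names, and the callback stands in for the user-mapping step.

```c
#include <stddef.h>
#include <sys/prctl.h>

/* Stand-in for the mapping step: simply reports the current dumpable flag. */
static int read_dumpable(void *arg)
{
	(void)arg;
	return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
}

/*
 * Mark the process dumpable, run `fn`, then restore the previous setting.
 * Returns fn's result, or -1 if prctl(2) fails.
 */
static int run_while_dumpable(int (*fn)(void *), void *arg)
{
	int prev = prctl(PR_GET_DUMPABLE, 0, 0, 0, 0);
	int ret;

	if (prev < 0 || prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
		return -1;
	ret = fn(arg);
	/* PR_SET_DUMPABLE accepts only 0 or 1, so restore non-dumpable or leave 1. */
	if (prev == 0 && prctl(PR_SET_DUMPABLE, 0, 0, 0, 0) < 0)
		return -1;
	return ret;
}
```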

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:19 +11:00
Michael Crosby 8438b26e9f Merge pull request #1237 from hqhq/fix_sync_race
Fix race condition when sync with child and grandchild
2017-02-20 17:16:43 -08:00
Michael Crosby 4a164a826c Use %zu for printing of size_t values
This helps fix compile warnings on some arm systems.
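
For illustration, a minimal example of the format specifier in question. `format_size` is a hypothetical helper; the point is only that `%zu` matches `size_t` regardless of how wide `size_t` is on the target.

```c
#include <stdio.h>

/* Format a size_t portably; %lu or %d can trigger warnings where widths differ. */
static int format_size(char *buf, size_t buflen, size_t value)
{
	return snprintf(buf, buflen, "%zu", value);
}
```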

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-02-20 16:57:27 -08:00
Qiang Huang a54316bae1 Fix race condition when sync with child and grandchild
Fixes: #1236
Fixes: #1281

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-18 20:42:08 +08:00
Michael Crosby 5d93fed3d2 Set init processes as non-dumpable
This sets the init processes that join and set up the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other) namespace.

This setting is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.

This prevents parent processes, the pid 1 of the container, from
ptracing the init process before it drops caps and sets other LSMs.

This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.

https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318

The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-01-11 09:56:56 -08:00
Aleksa Sarai 244c9fc426
*: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensure that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancillary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Daniel, Dao Quang Minh f156f73c2a Merge pull request #1154 from hqhq/sync_child
Sync with grandchild
2016-11-23 09:10:00 -08:00
Qiang Huang 16a2e8ba6e Sync with grandchild
Without this, it's possible that the parent process exits with
0 before the grandchild exits with an error.
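
The idea can be sketched with a pipe shared across two forks: the parent blocks until the grandchild has reported, instead of returning as soon as the intermediate child is gone. `wait_for_grandchild` is a hypothetical name and this is not the synchronisation protocol used in nsexec, only a minimal instance of the same pattern.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that forks a grandchild. The grandchild reports `status`
 * over a pipe. Returns the reported status, or -1 if nothing was reported.
 */
static int wait_for_grandchild(int status)
{
	int fds[2];
	int reported = -1;
	pid_t child, grandchild;

	if (pipe(fds) < 0)
		return -1;

	child = fork();
	if (child < 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}
	if (child == 0) {
		close(fds[0]);
		grandchild = fork();
		if (grandchild == 0) {
			/* Grandchild: report the outcome before exiting. */
			if (write(fds[1], &status, sizeof(status)) != (ssize_t)sizeof(status))
				_exit(1);
			_exit(0);
		}
		/* Child: exits right away and does not wait for the grandchild. */
		_exit(grandchild < 0 ? 1 : 0);
	}

	/* Parent: a short read (all write ends closed) means no report arrived. */
	close(fds[1]);
	if (read(fds[0], &reported, sizeof(reported)) != (ssize_t)sizeof(reported))
		reported = -1;
	close(fds[0]);
	waitpid(child, NULL, 0);
	return reported;
}
```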

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-11-17 08:59:37 +08:00
rajasec 43287af982 Fixing error message in nsexec
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-11-10 17:06:50 +05:30
Qiang Huang 84a4218ece More fix to nsexec.c's comments
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-11-03 10:15:01 +08:00
Aleksa Sarai 9b15bf17a0
nsenter: fix up comments
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-11-01 00:21:09 +11:00
Qiang Huang f520eab891 Remove unnecessary cloneflag validation
config.cloneflag is not mandatory: when using `runc exec`,
config.cloneflag can be empty, and even then it won't be
`-1` but `0`.

So this validation is totally wrong and unneeded.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-10-27 09:34:20 +08:00
Aleksa Sarai e3cd191acc
nsenter: un-split clone(cloneflags) for RHEL
Without this patch applied, RHEL's SELinux policies cause container
creation to not really work. Unfortunately this might be an issue for
rootless containers (opencontainers/runc#774) but we'll cross that
bridge when we come to it.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-18 18:26:27 +11:00
Aleksa Sarai 2cd9c31b99
nsenter: guarantee correct user namespace ordering
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right privileges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.

This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more that we can do.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-04 16:17:55 +11:00
Aleksa Sarai ed053a740c
nsenter: specify namespace type in setns()
This prevents us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.
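
A hedged sketch of what passing the type looks like: with a non-zero `nstype`, setns(2) fails with EINVAL when the descriptor refers to a namespace of a different type. The helper names `ns_type` and `join_namespace` are hypothetical, the table omits newer namespace types, and actually joining a namespace usually requires privileges.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sched.h>
#include <string.h>
#include <unistd.h>

/* Map a namespace name to its CLONE_NEW* type, or -1 if it is unknown. */
static int ns_type(const char *name)
{
	static const struct { const char *name; int type; } table[] = {
		{ "ipc", CLONE_NEWIPC }, { "mnt", CLONE_NEWNS },
		{ "net", CLONE_NEWNET }, { "pid", CLONE_NEWPID },
		{ "user", CLONE_NEWUSER }, { "uts", CLONE_NEWUTS },
	};

	for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
		if (strcmp(name, table[i].name) == 0)
			return table[i].type;
	return -1;
}

/* Join the namespace at `path`, failing if it is not of the type `name`. */
int join_namespace(const char *path, const char *name)
{
	int type = ns_type(name);
	int fd, ret;

	if (type < 0)
		return -1;
	fd = open(path, O_RDONLY | O_CLOEXEC);
	if (fd < 0)
		return -1;
	/* The explicit type turns a mismatched file into a hard error. */
	ret = setns(fd, type);
	close(fd);
	return ret;
}
```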

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-04 16:17:55 +11:00
Shukui Yang e15af9ffbb remove redundant by in annotation(nsexec.c)
Signed-off-by: Shukui Yang <yangshukui@huawei.com>
2016-09-05 10:53:19 +08:00
Mrunal Patel 0bd675a56c Fix format specifier for size_t
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-08-17 11:40:08 -07:00
Aleksa Sarai 4e72ffc237
nsenter: simplify netlink parsing
This just moves everything to one function so we don't have to pass a
bunch of things to functions when there's no real benefit. It also makes
the API nicer.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-08-17 08:21:48 +10:00
Aleksa Sarai faa3281ce8
nsenter: major cleanup
Removed a lot of clutter, improved the style of the code, removed
unnecessary complexity. In addition, made errors unique by making bail()
exit with a unique error code. Most of this code comes from the current
state of the rootless containers branch.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-08-13 03:18:04 +10:00
Antonio Murdaca 9d14efec4c libcontainer: nsenter: nsexec.c: fix warnings
Fix the following warnings when building runc with gcc 6+:

Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:
In function ‘nsexec’:
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:322:6:
warning: ‘__s’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
      pr_perror("Failed to open %s", ns);
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:273:30:
note: ‘__s’ was declared here
 static struct nsenter_config process_nl_attributes(int pipenum, char
*data, int data_size)
                              ^~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2016-05-14 11:19:44 +02:00
Natanael Copa ac6bd95319 nsexec: fix build against musl libc
Remove a wrongly added include which was added in commit 3c2e77ee (Add a
compatibility header for CentOS/RHEL 6, 2016-01-29) apparently to
fix this compile error on centos 6:

> In file included from
> Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:20:
> /usr/include/linux/netlink.h:35: error: expected specifier-qualifier-list before 'sa_family_t'

The glibc bits/sockaddr.h says that this header should never be included
directly[1]. Instead, sys/socket.h should be used.

The problem was correctly fixed later, in commit 394fb55 (Fix build
error on centos6, 2016-03-02) so the incorrect bits/sockaddr.h can
safely be removed.

This is needed to build against musl libc.

Fixes #761

[1]: 20003c4988/bits/sockaddr.h (L20)

Signed-off-by: Natanael Copa <natanael.copa@docker.com>
2016-04-19 10:58:17 +02:00
Qiang Huang d9520aeba4 Close opened files before exit
Not that it would cause a memory leak, but it is still
good practice.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-03-28 11:16:34 +08:00
Qiang Huang 3b7e10652b Refactor nsexec.c and add some comments
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-03-28 11:16:12 +08:00
Tonis Tiigi 04da969aa8 Clear groups after entering userns
Clears supplementary groups that affect mount permissions
before joining the user-specified groups.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2016-03-10 22:23:38 -08:00
Andrey Vagin 080eac3d2a nsexec: don't use CLONE_PARENT and CLONE_NEWPID together
The rhel6 kernel returns EINVAL in this case

Known issue:
* CT with userns doesn't work

This is a copy of
d31e97fa28
to address https://github.com/opencontainers/runc/issues/613

Signed-off-by: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Fernandes <andrew@fernandes.org>
2016-03-10 14:28:10 -05:00
Michael Crosby 3af08519d0 Merge pull request #616 from hqhq/hq_remove_dup_headfile
Remove duplicated included head file
2016-03-08 10:54:31 -08:00
Phil Estes 178bad5e71 Properly setuid/setgid after entering userns
The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.

Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
2016-03-04 11:12:26 -05:00
Qiang Huang 87e05b84e2 Remove duplicated included head file
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-03-03 11:08:18 +08:00
Ye Yin 394fb55d85 Fix build error on centos6
Signed-off-by: Ye Yin <eyniy@qq.com>
2016-03-02 18:32:19 +08:00
Kenfe-Mickael Laventure 08c3c6ebe2 Refactor nsexec
Split nsexec into smaller routines to make it more readable.

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2016-02-28 12:26:53 -08:00
Daniel, Dao Quang Minh 002b6c2fe8 Reorder and remove unused imports in nsexec.c
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
2016-02-28 12:26:53 -08:00
Daniel, Dao Quang Minh 42d5d04801 Sets custom namespaces for init processes
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.

This moves all setns and cloneflags related code to the nsenter layer, which means
that we don't use Go os/exec to create a process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.

With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
2016-02-28 12:26:53 -08:00
Andrew Fernandes 3c2e77eed5 Add a compatibility header for CentOS/RHEL 6
Signed-off-by: Andrew Fernandes <andrew@fernandes.org>
2016-01-29 20:46:50 +00:00
Daniel, Dao Quang Minh 7d423cb7a1 setns: replace env with netlink for bootstrap data
Replace passing the pid and console path via environment variables with passing
them as a netlink message over an established pipe.

This change requires us to set _LIBCONTAINER_INITTYPE and
_LIBCONTAINER_INITPIPE in the environment of the bootstrap process, as we
only send the bootstrap data for the setns process right now. When the init and setns
bootstrap processes are unified (i.e., init uses nsexec instead of Go to clone the new
process), we can remove _LIBCONTAINER_INITTYPE.

Note:
- we read nlmsghdr first before reading the content so we can get the total
  length of the payload and allocate buffer properly instead of allocating
  one large buffer.

- check read bytes vs the wanted number. It's an error if we failed to read
  the desired number of bytes from the pipe into the buffer.
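
The two notes above can be sketched as follows: read the fixed-size header, size the buffer from `nlmsg_len`, then treat a short payload read as an error. `read_bootstrap_payload` is a hypothetical name and this is not the function from the commit; a production reader would likely also loop on partial reads for large payloads.

```c
#include <stdlib.h>
#include <sys/socket.h> /* before linux/netlink.h, which needs sa_family_t */
#include <linux/netlink.h>
#include <unistd.h>

/*
 * Read one netlink-framed message from `fd`.
 * Returns a malloc'd payload (caller frees) and stores its length in *len,
 * or returns NULL on a short read or a malformed header.
 */
static char *read_bootstrap_payload(int fd, size_t *len)
{
	const size_t hdrlen = NLMSG_HDRLEN;
	struct nlmsghdr hdr;
	size_t size;
	char *data;

	/* Header first: it carries the total length of the message. */
	if (read(fd, &hdr, hdrlen) != (ssize_t)hdrlen)
		return NULL;
	if (hdr.nlmsg_len < hdrlen)
		return NULL;

	/* Allocate exactly what the header announces instead of one large buffer. */
	size = hdr.nlmsg_len - hdrlen;
	data = malloc(size ? size : 1);
	if (data == NULL)
		return NULL;

	/* Fewer bytes than announced is an error. */
	if (read(fd, data, size) != (ssize_t)size) {
		free(data);
		return NULL;
	}
	*len = size;
	return data;
}
```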

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
2015-12-03 18:03:48 +00:00
Bogdan Purcareata 4c5eb45862 nsexec: Align clone child stack ptr to 16
This is required on ARM64 builds that use the clone syscall. Check [1].

[1] http://lxr.free-electrons.com/source/arch/arm64/kernel/process.c#L264
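
A minimal sketch of the alignment itself, assuming the stack grows downwards so that rounding the top address down stays inside the buffer. `align_stack_top` and `STACK_ALIGN` are hypothetical names; the patch may achieve the same effect differently (for example with an aligned buffer declaration).

```c
#include <stdint.h>

#define STACK_ALIGN 16

/* Round a stack-top address down to a 16-byte boundary. */
static void *align_stack_top(void *top)
{
	return (void *)((uintptr_t)top & ~(uintptr_t)(STACK_ALIGN - 1));
}
```

The result would then be passed as the child stack pointer to clone(2).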

Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
2015-10-06 10:41:18 +00:00