New tests for user namespaces and groups issue

These tests illustrate an issue when trying to use runc with user
namespaces in Kubernetes.

runc needs to bind mount files from /var/lib/kubelet/pods/... (such as
etc-hosts) into the container. When using user namespaces, the bind
mount no longer works when runc is started from a systemd unit.

The workaround is to start the systemd unit with SupplementaryGroups=0.
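
As an illustration of the workaround, the sketch below prints a systemd
drop-in that sets SupplementaryGroups=0. The helper name, the unit name
(example.service) and the drop-in file name are assumptions made for
this example; only the SupplementaryGroups=0 setting comes from this
commit.

```shell
#!/bin/sh
# Hypothetical helper: print a drop-in that gives the service the
# supplementary group 0.
print_supplementary_groups_dropin() {
    printf '%s\n' '[Service]' 'SupplementaryGroups=0'
}

# Possible usage (unit name and paths are assumptions):
#   mkdir -p /etc/systemd/system/example.service.d
#   print_supplementary_groups_dropin \
#       > /etc/systemd/system/example.service.d/10-supplementary-groups.conf
#   systemctl daemon-reload && systemctl restart example.service
print_supplementary_groups_dropin
```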

runc needs to have permission on the directory to stat() the source of
the bind mount. Without user namespaces, this is not a problem since
runc is running as root, so it has 'rwx' permissions over the directory:

drwxr-x---. 8 root   root   4096 May 28 18:05 /var/lib/kubelet

Moreover, runc has CAP_DAC_OVERRIDE at this point because the mount
phase happens before giving up the additional permissions.

However, when using user namespaces, the runc process belongs to a
different user than root (depending on the mapping). /var/lib/kubelet is
seen as belonging to the special unmapped user (65534, nobody). runc
no longer has 'rwx' permissions, only the empty '---' permission
for 'other'.

CAP_DAC_OVERRIDE is also not effective because the kernel performs the
capability check with capable_wrt_inode_uidgid(inode, CAP_DAC_OVERRIDE).
This checks that the owner of /var/lib/kubelet is mapped in the
current user namespace, which is not the case.
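
The mapping lookup can be approximated in userspace. The sketch below is
not the kernel code; map_host_id is a hypothetical helper that takes
uid_map-style lines ("<id-inside> <id-outside> <length>") and reports
whether a host ID has a mapping. The mapping is the one used by the
tests in this commit, and 65534 is the default overflow ID.

```shell
#!/bin/sh
# map_host_id HOST_ID MAP
# Print the ID seen inside the user namespace for HOST_ID.
# Returns 1 and prints the overflow ID (65534) when HOST_ID is unmapped.
map_host_id() {
    local host_id map inside outside length
    host_id=$1
    map=$2
    while read -r inside outside length; do
        # Skip blank or incomplete lines.
        [ -n "$length" ] || continue
        if [ "$host_id" -ge "$outside" ] &&
           [ "$host_id" -lt $((outside + length)) ]; then
            echo $((inside + host_id - outside))
            return 0
        fi
    done <<EOF
$map
EOF
    echo 65534
    return 1
}

# Container ID 0 maps to host ID 100000, for 65534 IDs.
map="0 100000 65534"
map_host_id 100000 "$map"
map_host_id 0 "$map" || echo "host root (owner of /var/lib/kubelet) is unmapped"
```

Host UID 0 falls outside the mapped range, which is why the directory
owner shows up as 65534 and the capability check fails.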

Despite that, bind mounting /var/lib/kubelet/pods/...etc-hosts worked
when runc was started manually with 'sudo' but not when started from
a systemd unit. The difference is how sudo and systemd units handle
supplementary groups: systemd does not set supplementary groups by
default.

$ sudo grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:	0

$ sudo systemd-run -t grep -E 'Groups:|Uid:|Gid:' /proc/self/status
Running as unit: run-u296886.service
Press ^] three times within 1s to disconnect TTY.
Uid:	0	0	0	0
Gid:	0	0	0	0
Groups:
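
The same check can be scripted. This is a minimal sketch, assuming a
Linux /proc; has_supplementary_group is a hypothetical helper that looks
for a GID on the "Groups:" line of a status file. When reading
/proc/self/status, the file describes the grep process itself, which
inherits its groups from the calling shell.

```shell
#!/bin/sh
# has_supplementary_group GID [STATUS_FILE]
# Returns 0 if GID is listed on the "Groups:" line, 1 if it is not,
# and 2 if the file cannot be read or has no "Groups:" line.
has_supplementary_group() {
    local wanted status_file line gid
    wanted=$1
    status_file=${2:-/proc/self/status}
    line=$(grep '^Groups:' "$status_file" 2>/dev/null) || return 2
    # Unquoted on purpose: split the list on tabs and spaces.
    for gid in ${line#Groups:}; do
        if [ "$gid" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

if has_supplementary_group 0; then
    echo "supplementary group 0 is set"
else
    echo "supplementary group 0 is not set"
fi
```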

When runc has the supplementary group 0 configured, it is retained
during the bind-mount phase, even though it is an unmapped group (runc
temporarily sees 'Groups: 65534' in its own /proc/self/status), so runc
effectively has the 'r-x' permissions over /var/lib/kubelet. This makes
the bind mount of etc-hosts work.
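
The effect of the retained group can be sketched with a simplified
model of the classic permission check (owner, then group, then other),
ignoring capabilities and ACLs. perm_class and triad are hypothetical
helpers; the IDs are host-level values, with 100000 assumed as the
mapped root from the tests in this commit.

```shell
#!/bin/sh
# triad DIGIT: print the rwx string for one octal permission digit.
triad() {
    case $1 in
        0) echo '---' ;; 1) echo '--x' ;; 2) echo '-w-' ;; 3) echo '-wx' ;;
        4) echo 'r--' ;; 5) echo 'r-x' ;; 6) echo 'rw-' ;; 7) echo 'rwx' ;;
    esac
}

# perm_class MODE OWNER_UID OWNER_GID PROC_UID PROC_GID [SUPP_GID...]
# MODE is a three-digit octal string such as 750.
perm_class() {
    local mode owner_uid owner_gid proc_uid proc_gid rest gid
    mode=$1; owner_uid=$2; owner_gid=$3; proc_uid=$4; proc_gid=$5
    shift 5
    rest=${mode#?}
    if [ "$proc_uid" -eq "$owner_uid" ]; then
        triad "${mode%??}"      # owner class
        return
    fi
    for gid in "$proc_gid" "$@"; do
        if [ "$gid" -eq "$owner_gid" ]; then
            triad "${rest%?}"   # group class
            return
        fi
    done
    triad "${mode#??}"          # other class
}

# /var/lib/kubelet is mode 750, owned by 0:0 on the host.
perm_class 750 0 0 0 0              # root without user namespaces
perm_class 750 0 0 100000 100000    # userns, no supplementary groups
perm_class 750 0 0 100000 100000 0  # userns, supplementary group 0 kept
```

The three calls print 'rwx', '---' and 'r-x' respectively, matching the
cases described above.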

After the mount phase, runc will set the credentials correctly (following
OCI's config.json specification), so the container will not retain this
unmapped supplementary group.

It is difficult to set up supplementary groups from Golang code
automatically with syscall.Setgroups() because "at the kernel level,
user IDs and group IDs are a per-thread attribute" (man setgroups) and
the way Golang uses threads makes it difficult to predict which thread
is going to be used to execute runc. glibc's setgroups() is a wrapper
that changes the credentials for all threads but Golang does not use the
glibc implementation.

Signed-off-by: Alban Crequy <alban@kinvolk.io>
Alban Crequy 2020-06-22 13:02:23 +02:00
parent 0fa097fc37
commit 67f903e941
1 changed files with 51 additions and 0 deletions

@@ -0,0 +1,51 @@
#!/usr/bin/env bats

load helpers

function setup() {
	teardown_busybox
	setup_busybox

	run mkdir -p "$BUSYBOX_BUNDLE"/source-{accessible,inaccessible}/dir
	chmod 750 "$BUSYBOX_BUNDLE"/source-inaccessible
	run mkdir -p "$BUSYBOX_BUNDLE"/rootfs/{proc,sys,tmp}
	run mkdir -p "$BUSYBOX_BUNDLE"/rootfs/tmp/{accessible,inaccessible}

	update_config ' .process.args += ["-c", "echo HelloWorld"] '
	update_config ' .linux.namespaces += [{"type": "user"}]
		| .linux.uidMappings += [{"hostID": 100000, "containerID": 0, "size": 65534}]
		| .linux.gidMappings += [{"hostID": 100000, "containerID": 0, "size": 65534}] '
}

function teardown() {
	teardown_busybox
}

@test "userns without mount" {
	# run hello-world
	runc run test_userns_without_mount
	[ "$status" -eq 0 ]

	# check expected output
	[[ "${output}" == *"HelloWorld"* ]]
}

@test "userns with simple mount" {
	update_config ' .mounts += [{"source": "source-accessible/dir", "destination": "/tmp/accessible", "options": ["bind"]}] '

	# run hello-world
	runc run test_userns_with_simple_mount
	[ "$status" -eq 0 ]

	# check expected output
	[[ "${output}" == *"HelloWorld"* ]]
}

@test "userns with difficult mount" {
	update_config ' .mounts += [{"source": "source-inaccessible/dir", "destination": "/tmp/inaccessible", "options": ["bind"]}] '

	# run hello-world
	runc run test_userns_with_difficult_mount
	[ "$status" -eq 0 ]

	# check expected output
	[[ "${output}" == *"HelloWorld"* ]]
}