If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.
To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).
There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.
Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Ensure that the pipe is always closed during the error processing of StartInitialization.
Also:
* Fix a comment typo.
* Use newContainerInit directly as there's no need for i to be an initer.
* Move the comment about the behaviour of Init() directly above it, clarifying what happens for all defers.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Add the import of nsenter to the example in libcontainer's README.md, as without it none of the example code works.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
POSIX mandates that `cmsg_len` in `struct cmsghdr` is a `socklen_t`,
which is an `unsigned int`. Musl libc as used in Alpine implements
this; Glibc ignores the spec and makes it a `size_t` ie `unsigned long`.
To avoid the `-Wformat=` warning from the `%lu` on Alpine, cast this
to an `unsigned long` always.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Add godoc links to README.md files for runc and libcontainer so its easy to access the golang documentation.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
The `bufio.Scanner.Scan` method returns false either by reaching the
end of the input or an error. After Scan returns false, the Err method
will return any error that occurred during scanning, except that if it
was io.EOF, Err will return nil.
We should check the error when Scan return false(out of the for loop).
Signed-off-by: Wang Long <long.wanglong@huawei.com>
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.
This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.
This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.
This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.
https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
On some systems, when we mount some cgroup subsystems into
a same mountpoint, the name sequence of mount options and
cgroup directory name can not be the same.
For example, the mount option is cpuacct,cpu, but
mountpoint name is /sys/fs/cgroup/cpu,cpuacct. In current
runc, we set mount destination name from combining
subsystems, which comes from mount option from
/proc/self/mountinfo, so in my case the name would be
/sys/fs/cgroup/cpuacct,cpu, which is differernt from
host, and will break some applications.
Fix it by using directory name from host mountpoint.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
`HookState` struct should follow definition of `State` in runtime-spec:
* modify json name of `version` to `ociVersion`.
* Remove redundant `Rootfs` field as rootfs can be retrived from
`bundlePath/config.json`.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
A remount of a mount point must include all the current flags or
these will be cleared:
```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```
The current code does not do this; the bug manifests in the specified
flags for `/dev` being lost on remount read only at present. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.
In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:
```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:
MS_REMOUNT | MS_BIND | MS_RDONLY
Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```
MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:
```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
When checking if the provided networking namespace is the host
one or not, we should first check if it's a symbolic link or not
as in some cases we can use persistent networking namespace under
e.g. /var/run/netns/.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Both suffered from different race conditions.
SelinuxEnabled assigned selinuxEnabledChecked before selinuxEnabled.
Thus racing callers could see the wrong selinuxEnabled.
getSelinuxMountPoint assigned selinuxfs to "" before it know the right
value. Thus racing could see "" improperly.
The gate selinuxfs, enabled, and mclist all on the same lock