Commit Graph

48 Commits

Author SHA1 Message Date
Daniel J Walsh cd96170c10
Need to setup labeling of kernel keyrings.
Work is ongoing in the kernel to support different kernel
keyrings per user namespace.  We want to allow SELinux to manage
kernel keyrings inside of the container.

Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.

Container running as container_t:s0:c1,c2 can manage keyrings with the same label.

This change required a revendoring or the SELinux go bindings.

github.com/opencontainers/selinux.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-03-13 17:57:30 -04:00
Michael Crosby aa7917b751
Merge pull request #1911 from theSuess/linter-fixes
Various cleanups to address linter issues
2018-11-13 12:13:34 -05:00
Michael Crosby b1068fb925
Merge pull request #1814 from rhatdan/selinux
SELinux labels are tied to the thread
2018-11-05 10:00:11 -05:00
Dominik Süß 0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Aleksa Sarai 40f1468413
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:

  * We already have a flag that disables the use of session keyrings
    (for older kernels that had system-wide keyring limits and so
    on). So disabling it is not a new idea.

  * While the primary justification of using session keys *is*
    security-based, it's more of a security-by-obscurity protection.
    The main defense keyrings have is VFS credentials -- which is
    something that users already have better security tools for
    (setuid(2) and user namespaces).

  * Given the security justification you might argue that we
    shouldn't silently ignore this. However, the only way for the
    kernel to return -ENOSYS is either being ridiculously old (at
    which point we wouldn't work anyway) or that there is a seccomp
    profile in place blocking it.

    Given that the seccomp profile (if malicious) could very easily
    just return 0 or a silly return code (or something even more
    clever with seccomp-bpf) and trick us without this patch, there
    isn't much of a significant change in how much seccomp can trick
    us with or without this patch.

Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.

Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-09-17 21:38:30 +10:00
Daniel J Walsh aa3fee6c80
SELinux labels are tied to the thread
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.

Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.

It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-06-11 08:34:58 -04:00
Michael Crosby fd0febd3ce Wrap error messages during init
Fixes #1437

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-05-10 10:28:10 -04:00
Daniel J Walsh 43aea05946 Label the masked tmpfs with the mount label
Currently if a confined container process tries to list these directories
AVC's are generated because they are labeled with external labels.  Adding
the mountlabel will remove these AVC's.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2018-03-09 14:29:06 -05:00
Michael Crosby 91ca331474 chroot when no mount namespaces is provided
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-01-25 11:36:37 -05:00
Akihiro Suda 2edd36fdff libcontainer: create Cwd when it does not exist
The benefit for doing this within runc is that it works well with
userns.
Actually, runc already does the same thing for mount points.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2017-10-05 05:31:46 +00:00
Aleksa Sarai 3ddde27d7d
init: move close(stateDirFd) before seccomp apply
This further reduces the number of syscalls that a user needs to enable
in their seccomp profile.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:26 +10:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Tobias Klauser 4019833d46 libcontainer: use PR_SET_NO_NEW_PRIVS from x/sys/unix
Use PR_SET_NO_NEW_PRIVS defined in golang.org/x/sys/unix instead of
manually defining it.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-13 15:31:33 +02:00
Tobias Klauser 553016d7da Use Prctl() from x/sys/unix instead of own wrapper
Use unix.Prctl() instead of reimplemnting it as system.Prctl().

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-07 15:03:15 +02:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Qiang Huang 5e7b48f7c0 Use opencontainers/selinux package
It's splitted as a separate project.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-23 08:21:19 +08:00
Michael Crosby 00a0ecf554 Add separate console socket
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Michael Crosby 5d93fed3d2 Set init processes as non-dumpable
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.

This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.

This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.

This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.

https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318

The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-01-11 09:56:56 -08:00
Justin Cormack 50acb55233 Split the code for remounting mount points and mounting paths.
A remount of a mount point must include all the current flags or
these will be cleared:

```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```

The current code does not do this; the bug manifests in the specified
flags for `/dev` being lost on remount read only at present. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.

In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:

```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:

MS_REMOUNT | MS_BIND | MS_RDONLY

Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```

MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:

```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2016-12-16 14:01:17 -08:00
Aleksa Sarai 244c9fc426
*: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Daniel Dao 1b876b0bf2 fix typos with misspell
pipe the source through https://github.com/client9/misspell. typos be gone!

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2016-10-11 23:22:48 +00:00
Akihiro Suda 53179559a1 MaskPaths: support directory
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
  - /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information:
  - /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2016-09-23 16:14:41 +00:00
Guilherme Rezende 1cdaa709f1
libcontainer: rename keyctl package to keys
This avoid the goimports tool from remove the libcontainer/keys import line due the package name is diferent from folder name

Signed-off-by: Guilherme Rezende <guilhermebr@gmail.com>
2016-07-25 20:59:26 -03:00
Mrunal Patel ec01ae5f10 Merge pull request #942 from ggaaooppeenngg/fix-typo
Fix typo
2016-07-14 11:18:06 -04:00
Peng Gao 765df7eed0 Fix typo
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2016-07-13 23:32:38 +08:00
Michael Crosby 5ce88a95f6 Fix fifo usage with userns
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-13 20:20:48 -07:00
Michael Crosby 3aacff695d Use fifo for create/start
This removes the use of a signal handler and SIGCONT to signal the init
process to exec the users process.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-13 11:26:53 -07:00
Michael Crosby 8c9db3a7a5 Add option to disable new session keys
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.

Fixes #818

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-03 11:53:07 -07:00
Michael Crosby 3fe7d7f31e Add create and start command for container lifecycle
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00
Justin Cormack e18de63108 If possible, apply seccomp rules immediately before exec
See https://github.com/docker/docker/issues/22252

Previously we would apply seccomp rules before applying
capabilities, because it requires CAP_SYS_ADMIN. This
however means that a seccomp profile needs to allow
operations such as setcap() and setuid() which you
might reasonably want to disallow.

If prctl(PR_SET_NO_NEW_PRIVS) has been applied however
setting a seccomp filter is an unprivileged operation.
Therefore if this has been set, apply the seccomp
filter as late as possible, after capabilities have
been dropped and the uid set.

Note a small number of syscalls will take place
after the filter is applied, such as `futex`,
`stat` and `execve`, so these still need to be allowed
in addition to any the program itself needs.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2016-04-27 20:06:14 +01:00
Julian Friedman e91b2b8aca Set rlimits using prlimit in parent
Fixes #680

This changes setupRlimit to use the Prlimit syscall (rather than
Setrlimit) and moves the call to the parent process. This is necessary
because Setrlimit would affect the libcontainer consumer if called in
the parent, and would fail if called from the child if the
child process is in a user namespace and the requested rlimit is higher
than that in the parent.

Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
2016-03-25 15:11:44 +00:00
Michael Crosby 20422c9bd9 Update libcontainer to support rlimit per process
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-10 14:35:16 -08:00
Phil Estes 178bad5e71 Properly setuid/setgid after entering userns
The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.

Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
2016-03-04 11:12:26 -05:00
Michael Crosby 3cc90bd2d8 Add support for process overrides of settings
This commit adds support to libcontainer to allow caps, no new privs,
apparmor, and selinux process label to the process struct so that it can
be used together of override the base settings on the container config
per individual process.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-03 11:41:33 -08:00
Daniel, Dao Quang Minh 42d5d04801 Sets custom namespaces for init processes
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.

This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.

With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
2016-02-28 12:26:53 -08:00
Mrunal Patel 4951f5821b Merge pull request #582 from stefanberger/new_session_keyring
Create unique session key name for every container
2016-02-25 17:54:14 -08:00
Stefan Berger 5fbf791e31 Create unique session key name for every container
Create a unique session key name for every container. Use the pattern
_ses.<postfix> with postfix being the container's Id.

This patch does not prevent containers from joining each other's session
keyring.

Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
2016-02-24 08:39:52 -05:00
Mrunal Patel 2f27649848 Move pre-start hooks after container mounts
Today mounts in pre-start hooks get overriden by the default mounts.
Moving the pre-start hooks to after the container mounts and before
the pivot/move root gives better flexiblity in the hooks.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-02-23 02:50:35 -08:00
Mrunal Patel 38b39645d9 Implement NoNewPrivileges support in libcontainer
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-02-16 06:57:50 -08:00
Stefan Berger ad22e23aee Create a new session key for every container
Create a new session key ring '_ses' for every container. This avoids sharing
the key structure with the process that created the container and the
container inherits from.

This patch fixes it init and exec.

Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
2016-02-04 22:05:50 -05:00
Aleksa Sarai 103853ead7 libcontainer: set cgroup config late
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.

Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).

Signed-off-by: Aleksa Sarai <asarai@suse.com>
2016-01-12 10:06:35 +11:00
Mrunal Patel 4124ba9468 Revert "cgroups: add pids controller support"
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2015-12-19 07:48:48 -08:00
Aleksa Sarai 14ed8696c1 libcontainer: set cgroup config late
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.

Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).

Signed-off-by: Aleksa Sarai <asarai@suse.com>
2015-12-19 11:30:48 +11:00
Vishnu Kannan cc232c4707 Adding oom_score_adj as a container config param.
Signed-off-by: Vishnu Kannan <vishnuk@google.com>
2015-08-31 14:02:59 -07:00
Matthew Heon 2ae581ae62 Convert Seccomp support to use Libseccomp
This removes the existing, native Go seccomp filter generation and replaces it
with Libseccomp. Libseccomp is a C library which provides architecture
independent generation of Seccomp filters for the Linux kernel.

This adds a dependency on v2.2.1 or above of Libseccomp.

Signed-off-by: Matthew Heon <mheon@redhat.com>
2015-08-13 07:56:27 -04:00
Mrunal Patel 8ea6c65d12 Rename SystemProperties to Sysctl and make it available in the runc config
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2015-07-06 19:18:08 -04:00
Michael Crosby 080df7ab88 Update import paths for new repository
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:59 -07:00
Michael Crosby 8f97d39dd2 Move libcontainer into subdirectory
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:15 -07:00