Commit Graph

84 Commits

Author SHA1 Message Date
Tycho Andersen 66eb2a3e8f fix --read-only containers under --userns-remap
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-08-24 16:43:21 -06:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Qiang Huang 96e0df7633 Fix comments about when to pivot_root
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-05-06 07:59:03 +08:00
Daniel, Dao Quang Minh 13a8c5d140 Merge pull request #1365 from hqhq/use_go_selinux
Use opencontainers/selinux package
2017-04-15 14:22:32 +01:00
Aleksa Sarai baeef29858
rootless: add rootless cgroup manager
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Qiang Huang 5e7b48f7c0 Use opencontainers/selinux package
It's splitted as a separate project.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-23 08:21:19 +08:00
Ma Shimiao 06e27471bb support create device with type p and u
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-02-10 14:45:15 +08:00
Qiang Huang 45a8341811 Small cleanup
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-08 15:09:06 +08:00
Antonio Murdaca ca14e7b463
libcontainer: rootfs_linux: support overlayfs
As the runtime-spec allows it, we want to be able to specify overlayfs
mounts with:

    {
        "destination": "/etc/pki",
        "type": "overlay",
        "source": "overlay",
        "options": [
            "lowerdir=/etc/pki:/home/amurdaca/go/src/github.com/opencontainers/runc/rootfs_fedora/etc/pki"
        ]
    },

This patch takes care of allowing overlayfs mounts. Both RO and RW
should be supported.

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2017-02-06 19:43:24 +01:00
Qiang Huang 0599ac7d93 Do not create cgroup dir name from combining subsystems
On some systems, when we mount some cgroup subsystems into
a same mountpoint, the name sequence of mount options and
cgroup directory name can not be the same.

For example, the mount option is cpuacct,cpu, but
mountpoint name is /sys/fs/cgroup/cpu,cpuacct. In current
runc, we set mount destination name from combining
subsystems, which comes from mount option from
/proc/self/mountinfo, so in my case the name would be
/sys/fs/cgroup/cpuacct,cpu, which is differernt from
host, and will break some applications.

Fix it by using directory name from host mountpoint.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-01-11 15:27:58 +08:00
Justin Cormack 50acb55233 Split the code for remounting mount points and mounting paths.
A remount of a mount point must include all the current flags or
these will be cleared:

```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```

The current code does not do this; the bug manifests in the specified
flags for `/dev` being lost on remount read only at present. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.

In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:

```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:

MS_REMOUNT | MS_BIND | MS_RDONLY

Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```

MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:

```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2016-12-16 14:01:17 -08:00
Aleksa Sarai 244c9fc426
*: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Qiang Huang b15668b36d Fix all typos found by misspell
I use the same tool (https://github.com/client9/misspell)
as Daniel used a few days ago, don't why he missed these
typos at that time.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-10-29 14:14:42 +08:00
Vivek Goyal 6c147f8649 Make parent mount private before bind mounting rootfs
This reverts part of the commit eb0a144b5e

That commit introduced two issues.

- We need to make parent mount of rootfs private before bind mounting
  rootfs. Otherwise bind mounting root can propagate in other mount
  namespaces. (If parent mount is shared).

- It broke test TestRootfsPropagationSharedMount() on Fedora.

  On fedora /tmp is a mount point with "shared" propagation. I think
  you should be able to reproduce it on other distributions as well
  as long as you mount tmpfs on /tmp and make it "shared" propagation.

  Reason for failure is that pivot_root() fails. And it fails because
  kernel does following check.

  IS_MNT_SHARED(new_mnt->mnt_parent)

  Say /tmp/foo is new rootfs, we have bind mounted rootfs, so new_mnt
  is /tmp/foo, and new_mnt->mnt_parent is /tmp which is "shared" on
  fedora and above check fails.

So this change broke few things, it is a good idea to revert part of it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2016-10-25 11:15:11 -04:00
Aleksa Sarai c7ed2244f4
merge branch 'pr-1125'
LGTMs: @hqhq @mrunalp
Closes #1125
2016-10-25 10:05:28 +11:00
Alexander Morozov 1ab9d5e6f4 Merge pull request #845 from mrunalp/cp_tmpfs
Add support for copying up directories into tmpfs when a tmpfs is mounted over them
2016-10-21 13:47:16 -07:00
Aleksa Sarai f8e6b5af5e
rootfs: make pivot_root not use a temporary directory
Namely, use an undocumented feature of pivot_root(2) where
pivot_root(".", ".") is actually a feature and allows you to make the
old_root be tied to your /proc/self/cwd in a way that makes unmounting
easy. Thanks a lot to the LXC developers which came up with this idea
first.

This is the first step of many to allowing runC to work with a
completely read-only rootfs.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-20 12:55:58 +11:00
Daniel Dao 1b876b0bf2 fix typos with misspell
pipe the source through https://github.com/client9/misspell. typos be gone!

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2016-10-11 23:22:48 +00:00
Mrunal Patel c7406f7075 Support copyup mount extension for tmpfs mounts
If copyup is specified for a tmpfs mount, then the contents of the
underlying directory are copied into the tmpfs mounted over it.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-10-04 11:26:30 -07:00
Michael Crosby 70b16a5ab9 Remove check for binding to /
In order to mount root filesystems inside the container's mount
namespace as part of the spec we need to have the ability to do a bind
mount to / as the destination.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-09-29 15:26:09 -07:00
Akihiro Suda 53179559a1 MaskPaths: support directory
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
  - /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information:
  - /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2016-09-23 16:14:41 +00:00
Mrunal Patel f557996401 Add flag to allow getting all mounts for cgroups subsystems
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-09-15 15:19:27 -04:00
Serge Hallyn 52a8873f62 checkMountDesktionation: add swaps and uptime to /proc whitelist
Signed-off-by: Serge Hallyn <serge@hallyn.com>
2016-08-14 18:32:39 -05:00
Haiyan Meng f40fbcd595 Fix the err info of mount failure
Signed-off-by: Haiyan Meng <haiyanalady@gmail.com>
2016-08-08 11:58:28 -04:00
Aleksa Sarai c29695ad0a
rootfs: don't change directory
There's no point in changing directory here. Syscalls are resolved local
to the linkpath, not to the current directory that the process was in
when creating the symlink. Changing directories just confuses people who
are trying to debug things.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-06-24 16:44:40 +10:00
Aleksa Sarai 0f1d6772c6
libcontainer: rootfs: use CleanPath when comparing paths
Comparisons with paths aren't really a good idea unless you're
guaranteed that the comparison will work will all paths that resolve to
the same lexical path as the compared path.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-06-22 01:45:32 +10:00
Aleksa Sarai e991f041a1 Revert "Need to make sure labels applied to /dev"
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-05-11 23:28:01 +10:00
Dan Walsh 77f312c51c Need to make sure labels applied to /dev
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
2016-05-02 08:17:49 -04:00
Tatsushi Inagaki eb0a144b5e Rootfs: reduce redundant parsing of mountinfo
Postpone parsing mountinfo until pivot_root() actually failed

Signed-off-by: Tatsushi Inagaki <e29253@jp.ibm.com>
2016-04-22 09:41:28 +09:00
Michael Crosby 27fd0575ee Merge pull request #763 from mrunalp/userns_cgroups_ro
Allow mounting cgroups as read-only when user namespace is configured
2016-04-19 10:36:00 -07:00
Mrunal Patel a6104c3bbe Allow mounting cgroups as read-only when user namespace is configured
We use bind mount to achieve this as other file system remounts are disallowed
in a user namespace.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-04-19 10:12:09 -07:00
Michael Crosby 6978875298 Add cause to error messages
This is the inital port of the libcontainer.Error to added a cause to
all the existing error messages.  Going forward, when an error can be
wrapped because it is not being checked at the higher levels for
something like `os.IsNotExist` we can add more information to the error
message like cause and stack file/line information.  This will help
higher level tools to know what cause a container start or operation to
fail.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-04-18 11:37:26 -07:00
Akihiro Suda 1829531241 Fix trivial style errors reported by `go vet` and `golint`
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.

Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
2016-04-12 08:13:16 +00:00
Akihiro Suda 42234a85d1 Fix setupDev logic in rootfs_linux.go
setupDev was introduced in #96, but broken since #536 because spec 0.3.0 introduced default devices.

Fix #80 again
Fix docker/docker#21808

Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2016-04-11 10:29:40 -07:00
Thomas Tanaka 55aabc142c Only perform mount labelling when necessary
Do label mqueue when mounting it with label failed/not supported.

Signed-off-by: Thomas Tanaka <thomas.tanaka@oracle.com>
2016-03-24 13:38:18 -07:00
Mrunal Patel 64d87ebdec Merge pull request #585 from crosbymichael/dev-remountro
Remount /dev as ro after it is populated
2016-02-27 00:31:40 -08:00
Michael Crosby c5a34a6fe2 Allow extra mount types
This allows the mount syscall to validate the addiontal types where we
do not have to perform extra validation and is up to the consumer to
verify the functionality of the type of device they are trying to
mount.

Fixes #572

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-26 15:21:33 -08:00
rajasec 1db7322ded Removing pivot directory in defer
Signed-off-by: rajasec <rajasec79@gmail.com>

Changing to name values for defer as per review comments

Signed-off-by: rajasec <rajasec79@gmail.com>

Fixed review comments

Signed-off-by: rajasec <rajasec79@gmail.com>
2016-02-25 13:12:40 +05:30
Michael Crosby fc98958321 Remount /dev as ro after it is populated
Because we more than likely control dev and populate devices and files
inside of it we need to make sure that we fulfil the user's request to
make it ro only after it has been populated.  This removes the need to
expose something like ReadonlyPaths in the config but still have the
same outcome but more seemless for the user.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-23 13:56:01 -08:00
Mrunal Patel 2f27649848 Move pre-start hooks after container mounts
Today mounts in pre-start hooks get overriden by the default mounts.
Moving the pre-start hooks to after the container mounts and before
the pivot/move root gives better flexiblity in the hooks.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-02-23 02:50:35 -08:00
Michael Crosby 4f33b03703 Merge pull request #561 from rajasec/kcore-link
Change softlink name to /dev/core
2016-02-16 11:03:37 -08:00
Chun Chen 2ee9cbbd12 It's /proc/stat, not /proc/stats
Also adds /proc/net/dev to the valid mount destination white list

Signed-off-by: Chun Chen <ramichen@tencent.com>
2016-02-16 15:59:27 +08:00
rajasec 4cd31f63c5 Change softlink name to /dev/core
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-02-15 17:52:19 +05:30
Kenfe-Mickael Laventure dceeb0d0df Move pathClean to libcontainer/utils.CleanPath
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2016-02-09 16:21:58 -08:00
Serge Hallyn c0ad40c5e6 Do not create devices when in user namespace
When we launch a container in a new user namespace, we cannot create
devices, so we bind mount the host's devices into place instead.

If we are running in a user namespace (i.e. nested in a container),
then we need to do the same thing.  Add a function to detect that
and check for it before doing mknod.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 Changelog - add a comment clarifying what's going on with the
	     uidmap file.
2016-01-08 12:54:08 -08:00
Qiang Huang 9c1242ecba Add white list for bind mount chec
Fixes: #400

It would be useful to use fuse to isolate proc info.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-01-06 14:48:40 +08:00
Alexander Morozov 776791463d Merge pull request #357 from ashahab-altiscale/350-container-in-container
Bind mount device nodes on EPERM
2015-11-16 14:54:02 -08:00
Qiang Huang 96f0eefa1a Fix comment to be consistent with the code
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-11-16 19:16:27 +08:00
Abin Shahab 28c9d0252c Userns container in containers
Enables launching userns containers by catching EPERM errors for writing
to devices cgroups, and for mknod invocations.

Signed-off-by: Abin Shahab <ashahab@altiscale.com>
2015-11-15 14:42:35 -08:00
Qiang Huang 34cff6f2f3 Correct intuition for setupDev
Minor fix, the former setupDev=true means not setup dev,
which is contrary to intuition, just correct it.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-10-21 16:06:26 +08:00