75 Commits

Author SHA1 Message Date
Qiang Huang 0599ac7d93 Do not create cgroup dir name from combining subsystems
On some systems, when several cgroup subsystems are mounted at
the same mountpoint, the order of names in the mount options and
in the cgroup directory name may differ.

For example, the mount option is cpuacct,cpu, but
mountpoint name is /sys/fs/cgroup/cpu,cpuacct. In current
runc, we set mount destination name from combining
subsystems, which comes from mount option from
/proc/self/mountinfo, so in my case the name would be
/sys/fs/cgroup/cpuacct,cpu, which is different from the
host and will break some applications.

Fix it by using directory name from host mountpoint.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-01-11 15:27:58 +08:00
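The difference described in 0599ac7d93 can be sketched in a few lines of Go. This is an illustration only: the helper names `nameFromOptions` and `nameFromMountpoint` are hypothetical and are not taken from runc.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// nameFromOptions mimics the old behaviour: the directory name is built by
// joining the subsystem names in the order they appear in the mount options.
func nameFromOptions(subsystems []string) string {
	return strings.Join(subsystems, ",")
}

// nameFromMountpoint mimics the fix: the directory name is taken from the
// last path element of the host mountpoint.
func nameFromMountpoint(mountpoint string) string {
	return filepath.Base(mountpoint)
}

func main() {
	subsystems := []string{"cpuacct", "cpu"}   // order from the mount options
	mountpoint := "/sys/fs/cgroup/cpu,cpuacct" // directory name on the host

	fmt.Println("from options:   ", nameFromOptions(subsystems))
	fmt.Println("from mountpoint:", nameFromMountpoint(mountpoint))
}
```

With the values from the commit message, the two helpers disagree, which is exactly the mismatch the commit removes.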
Justin Cormack 50acb55233 Split the code for remounting mount points and mounting paths.
A remount of a mount point must include all the current flags or
these will be cleared:

```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```

The current code does not do this; the bug manifests in the specified
flags for `/dev` being lost on remount read only at present. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.

In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:

```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:

MS_REMOUNT | MS_BIND | MS_RDONLY

Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```

MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:

```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2016-12-16 14:01:17 -08:00
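The two flag rules quoted in 50acb55233 can be written down as small Go helpers. This is a sketch under assumptions: the function names are hypothetical, it only computes flag values and performs no mount, and it relies on the Linux-only constants in the standard `syscall` package.

```go
package main

import (
	"fmt"
	"syscall"
)

// remountReadonlyFlags is for a real mount point: the flags of the original
// mount() call are carried over, and MS_REMOUNT|MS_RDONLY are added, so that
// nothing is cleared by accident.
func remountReadonlyFlags(original uintptr) uintptr {
	return original | syscall.MS_REMOUNT | syscall.MS_RDONLY
}

// bindReadonlyFlags is for a bind mount: only the read-only setting can be
// changed on remount, so no other flags (not even MS_REC) are passed.
func bindReadonlyFlags() uintptr {
	return syscall.MS_REMOUNT | syscall.MS_BIND | syscall.MS_RDONLY
}

func main() {
	// Example: /dev was originally mounted nosuid with strictatime.
	devFlags := uintptr(syscall.MS_NOSUID | syscall.MS_STRICTATIME)
	fmt.Printf("mount point remount: %#x\n", remountReadonlyFlags(devFlags))
	fmt.Printf("bind remount:        %#x\n", bindReadonlyFlags())
}
```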
Aleksa Sarai 244c9fc426
*: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensure that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancillary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Qiang Huang b15668b36d Fix all typos found by misspell
I used the same tool (https://github.com/client9/misspell)
that Daniel used a few days ago; I don't know why he missed these
typos at that time.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-10-29 14:14:42 +08:00
Vivek Goyal 6c147f8649 Make parent mount private before bind mounting rootfs
This reverts part of the commit eb0a144b5e

That commit introduced two issues.

- We need to make parent mount of rootfs private before bind mounting
  rootfs. Otherwise bind mounting root can propagate in other mount
  namespaces. (If parent mount is shared).

- It broke test TestRootfsPropagationSharedMount() on Fedora.

  On fedora /tmp is a mount point with "shared" propagation. I think
  you should be able to reproduce it on other distributions as well
  as long as you mount tmpfs on /tmp and make it "shared" propagation.

  Reason for failure is that pivot_root() fails. And it fails because
  the kernel does the following check:

  IS_MNT_SHARED(new_mnt->mnt_parent)

  Say /tmp/foo is new rootfs, we have bind mounted rootfs, so new_mnt
  is /tmp/foo, and new_mnt->mnt_parent is /tmp which is "shared" on
  fedora and above check fails.

So this change broke a few things; it is a good idea to revert part of it.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2016-10-25 11:15:11 -04:00
Aleksa Sarai c7ed2244f4
merge branch 'pr-1125'
LGTMs: @hqhq @mrunalp
Closes #1125
2016-10-25 10:05:28 +11:00
Alexander Morozov 1ab9d5e6f4 Merge pull request #845 from mrunalp/cp_tmpfs
Add support for copying up directories into tmpfs when a tmpfs is mounted over them
2016-10-21 13:47:16 -07:00
Aleksa Sarai f8e6b5af5e
rootfs: make pivot_root not use a temporary directory
Namely, use an undocumented feature of pivot_root(2) where
pivot_root(".", ".") is actually a feature and allows you to make the
old_root be tied to your /proc/self/cwd in a way that makes unmounting
easy. Thanks a lot to the LXC developers who came up with this idea
first.

This is the first step of many toward allowing runC to work with a
completely read-only rootfs.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-20 12:55:58 +11:00
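A minimal sketch of the pivot_root(".", ".") sequence from f8e6b5af5e follows. It is not the runc code: it assumes Linux, sufficient privileges, and a private mount namespace, and the propagation change before the unmount is an assumption added here so that the unmount does not propagate outward.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pivotRoot switches the root to rootfs without creating a temporary
// directory for the old root.
func pivotRoot(rootfs string) error {
	// Keep handles on both roots so we can fchdir between them.
	oldroot, err := syscall.Open("/", syscall.O_DIRECTORY|syscall.O_RDONLY, 0)
	if err != nil {
		return fmt.Errorf("open old root: %w", err)
	}
	defer syscall.Close(oldroot)

	newroot, err := syscall.Open(rootfs, syscall.O_DIRECTORY|syscall.O_RDONLY, 0)
	if err != nil {
		return fmt.Errorf("open new root: %w", err)
	}
	defer syscall.Close(newroot)

	if err := syscall.Fchdir(newroot); err != nil {
		return fmt.Errorf("fchdir new root: %w", err)
	}
	// The old root ends up stacked on the same location as the new root.
	if err := syscall.PivotRoot(".", "."); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	// Go back to the old root through the saved descriptor.
	if err := syscall.Fchdir(oldroot); err != nil {
		return fmt.Errorf("fchdir old root: %w", err)
	}
	// Assumption: stop unmount events from propagating before detaching.
	if err := syscall.Mount("", ".", "", syscall.MS_SLAVE|syscall.MS_REC, ""); err != nil {
		return fmt.Errorf("make old root rslave: %w", err)
	}
	if err := syscall.Unmount(".", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	return syscall.Chdir("/")
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: pivot <rootfs> (run inside a new mount namespace)")
		return
	}
	if err := pivotRoot(os.Args[1]); err != nil {
		fmt.Println("error:", err)
	}
}
```

Because nothing is written under the new root, this approach also works when the rootfs is read-only, which is the motivation the commit message gives.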
Daniel Dao 1b876b0bf2 fix typos with misspell
pipe the source through https://github.com/client9/misspell. typos be gone!

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2016-10-11 23:22:48 +00:00
Mrunal Patel c7406f7075 Support copyup mount extension for tmpfs mounts
If copyup is specified for a tmpfs mount, then the contents of the
underlying directory are copied into the tmpfs mounted over it.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-10-04 11:26:30 -07:00
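The copy step behind the copyup extension in c7406f7075 can be illustrated without any mounting. The sketch below copies a directory tree into another directory that stands in for the tmpfs. The names `copyTree` and `demoCopyup` are hypothetical, and symlinks, device nodes, and ownership are deliberately skipped for brevity.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// copyTree copies directories and regular files from src into dst.
// Anything else (symlinks, devices, sockets) is skipped in this sketch.
func copyTree(src, dst string) error {
	return filepath.Walk(src, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		switch {
		case info.IsDir():
			return os.MkdirAll(target, info.Mode().Perm())
		case info.Mode().IsRegular():
			data, err := os.ReadFile(path)
			if err != nil {
				return err
			}
			return os.WriteFile(target, data, info.Mode().Perm())
		default:
			return nil
		}
	})
}

// demoCopyup builds a small "underlying" directory, copies it into a second
// directory standing in for the tmpfs, and returns the copied file content.
func demoCopyup() (string, error) {
	lower, err := os.MkdirTemp("", "lower")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(lower)
	upper, err := os.MkdirTemp("", "upper")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(upper)

	if err := os.MkdirAll(filepath.Join(lower, "etc"), 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(lower, "etc", "motd"), []byte("hello"), 0o644); err != nil {
		return "", err
	}
	if err := copyTree(lower, upper); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(upper, "etc", "motd"))
	return string(data), err
}

func main() {
	fmt.Println(demoCopyup())
}
```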
Michael Crosby 70b16a5ab9 Remove check for binding to /
In order to mount root filesystems inside the container's mount
namespace as part of the spec we need to have the ability to do a bind
mount to / as the destination.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-09-29 15:26:09 -07:00
Akihiro Suda 53179559a1 MaskPaths: support directory
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
  - /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information
  - /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2016-09-23 16:14:41 +00:00
Mrunal Patel f557996401 Add flag to allow getting all mounts for cgroups subsystems
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-09-15 15:19:27 -04:00
Serge Hallyn 52a8873f62 checkMountDestination: add swaps and uptime to /proc whitelist
Signed-off-by: Serge Hallyn <serge@hallyn.com>
2016-08-14 18:32:39 -05:00
Haiyan Meng f40fbcd595 Fix the err info of mount failure
Signed-off-by: Haiyan Meng <haiyanalady@gmail.com>
2016-08-08 11:58:28 -04:00
Aleksa Sarai c29695ad0a
rootfs: don't change directory
There's no point in changing directory here. Syscalls are resolved local
to the linkpath, not to the current directory that the process was in
when creating the symlink. Changing directories just confuses people who
are trying to debug things.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-06-24 16:44:40 +10:00
Aleksa Sarai 0f1d6772c6
libcontainer: rootfs: use CleanPath when comparing paths
Comparisons with paths aren't really a good idea unless you're
guaranteed that the comparison will work with all paths that resolve to
the same lexical path as the compared path.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-06-22 01:45:32 +10:00
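The point of 0f1d6772c6 is easy to see in a small example. The sketch below uses the standard library's `filepath.Clean` as a stand-in for runc's `CleanPath`, which is a separate helper whose exact behaviour is not shown in this log. The name `samePath` is hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// samePath compares two paths after lexical cleaning, so that spellings
// such as "/dev/" and "/dev/../dev" compare equal to "/dev".
func samePath(a, b string) bool {
	return filepath.Clean(a) == filepath.Clean(b)
}

func main() {
	fmt.Println("/dev/" == "/dev")                // naive string comparison
	fmt.Println(samePath("/dev/", "/dev"))        // cleaned comparison
	fmt.Println(samePath("/dev/../dev", "/dev"))  // cleaned comparison
	fmt.Println(samePath("/dev/null", "/dev"))    // different paths stay different
}
```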
Aleksa Sarai e991f041a1 Revert "Need to make sure labels applied to /dev"
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-05-11 23:28:01 +10:00
Dan Walsh 77f312c51c Need to make sure labels applied to /dev
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
2016-05-02 08:17:49 -04:00
Tatsushi Inagaki eb0a144b5e Rootfs: reduce redundant parsing of mountinfo
Postpone parsing mountinfo until pivot_root() actually failed

Signed-off-by: Tatsushi Inagaki <e29253@jp.ibm.com>
2016-04-22 09:41:28 +09:00
Michael Crosby 27fd0575ee Merge pull request #763 from mrunalp/userns_cgroups_ro
Allow mounting cgroups as read-only when user namespace is configured
2016-04-19 10:36:00 -07:00
Mrunal Patel a6104c3bbe Allow mounting cgroups as read-only when user namespace is configured
We use bind mount to achieve this as other file system remounts are disallowed
in a user namespace.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-04-19 10:12:09 -07:00
Michael Crosby 6978875298 Add cause to error messages
This is the initial port of libcontainer.Error to add a cause to
all the existing error messages.  Going forward, when an error can be
wrapped because it is not being checked at the higher levels for
something like `os.IsNotExist`, we can add more information to the error
message, like cause and stack file/line information.  This will help
higher level tools know what caused a container start or operation to
fail.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-04-18 11:37:26 -07:00
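The caveat in 6978875298 about `os.IsNotExist` can be demonstrated with a generic wrapper. This is not the libcontainer.Error API; `causeError`, `withCause`, and `demoCause` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// causeError attaches a human-readable cause to an underlying error.
type causeError struct {
	cause string
	err   error
}

func (e *causeError) Error() string { return e.cause + ": " + e.err.Error() }
func (e *causeError) Unwrap() error { return e.err }

// withCause wraps err with a cause; a nil error stays nil.
func withCause(err error, cause string) error {
	if err == nil {
		return nil
	}
	return &causeError{cause: cause, err: err}
}

// demoCause wraps os.ErrNotExist and reports how the wrapped error behaves
// under os.IsNotExist (which does not unwrap custom types) and errors.Is.
func demoCause() (msg string, isNotExist bool, matches bool) {
	err := withCause(os.ErrNotExist, "mounting /proc")
	return err.Error(), os.IsNotExist(err), errors.Is(err, os.ErrNotExist)
}

func main() {
	fmt.Println(demoCause())
}
```

The wrapped error carries more context, but a caller that still uses `os.IsNotExist` no longer recognises it, which is why the commit limits wrapping to errors that are not inspected higher up.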
Akihiro Suda 1829531241 Fix trivial style errors reported by `go vet` and `golint`
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.

Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
2016-04-12 08:13:16 +00:00
Akihiro Suda 42234a85d1 Fix setupDev logic in rootfs_linux.go
setupDev was introduced in #96, but broken since #536 because spec 0.3.0 introduced default devices.

Fix #80 again
Fix docker/docker#21808

Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2016-04-11 10:29:40 -07:00
Thomas Tanaka 55aabc142c Only perform mount labelling when necessary
Do label mqueue when mounting it with label failed/not supported.

Signed-off-by: Thomas Tanaka <thomas.tanaka@oracle.com>
2016-03-24 13:38:18 -07:00
Mrunal Patel 64d87ebdec Merge pull request #585 from crosbymichael/dev-remountro
Remount /dev as ro after it is populated
2016-02-27 00:31:40 -08:00
Michael Crosby c5a34a6fe2 Allow extra mount types
This allows the mount syscall to validate the additional types where we
do not have to perform extra validation, and it is up to the consumer to
verify the functionality of the type of device they are trying to
mount.

Fixes #572

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-26 15:21:33 -08:00
rajasec 1db7322ded Removing pivot directory in defer
Signed-off-by: rajasec <rajasec79@gmail.com>

Changing to name values for defer as per review comments

Signed-off-by: rajasec <rajasec79@gmail.com>

Fixed review comments

Signed-off-by: rajasec <rajasec79@gmail.com>
2016-02-25 13:12:40 +05:30
Michael Crosby fc98958321 Remount /dev as ro after it is populated
Because we more than likely control dev and populate devices and files
inside of it, we need to make sure that we fulfil the user's request to
make it ro only after it has been populated.  This removes the need to
expose something like ReadonlyPaths in the config while giving the
same outcome in a way that is more seamless for the user.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-23 13:56:01 -08:00
Mrunal Patel 2f27649848 Move pre-start hooks after container mounts
Today mounts in pre-start hooks get overridden by the default mounts.
Moving the pre-start hooks to after the container mounts and before
the pivot/move root gives better flexibility in the hooks.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-02-23 02:50:35 -08:00
Michael Crosby 4f33b03703 Merge pull request #561 from rajasec/kcore-link
Change softlink name to /dev/core
2016-02-16 11:03:37 -08:00
Chun Chen 2ee9cbbd12 It's /proc/stat, not /proc/stats
Also adds /proc/net/dev to the valid mount destination white list

Signed-off-by: Chun Chen <ramichen@tencent.com>
2016-02-16 15:59:27 +08:00
rajasec 4cd31f63c5 Change softlink name to /dev/core
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-02-15 17:52:19 +05:30
Kenfe-Mickael Laventure dceeb0d0df Move pathClean to libcontainer/utils.CleanPath
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2016-02-09 16:21:58 -08:00
Serge Hallyn c0ad40c5e6 Do not create devices when in user namespace
When we launch a container in a new user namespace, we cannot create
devices, so we bind mount the host's devices into place instead.

If we are running in a user namespace (i.e. nested in a container),
then we need to do the same thing.  Add a function to detect that
and check for it before doing mknod.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 Changelog - add a comment clarifying what's going on with the
	     uidmap file.
2016-01-08 12:54:08 -08:00
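One way to detect a user namespace, in the spirit of c0ad40c5e6, is to look at the uid map of the current process. The sketch below is an assumption about how such a check could look, not the function added by the commit: the initial namespace is taken to be the one whose map is the single identity line `0 0 4294967295`, and the name `inUserNS` is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// inUserNS reports whether the given uid_map content looks like it belongs
// to a process inside a user namespace. An empty or malformed map is treated
// as "not in a user namespace" in this sketch.
func inUserNS(uidMap string) bool {
	fields := strings.Fields(uidMap)
	if len(fields) < 3 {
		return false
	}
	inside, err1 := strconv.ParseInt(fields[0], 10, 64)
	outside, err2 := strconv.ParseInt(fields[1], 10, 64)
	length, err3 := strconv.ParseInt(fields[2], 10, 64)
	if err1 != nil || err2 != nil || err3 != nil {
		return false
	}
	// A single full-range identity mapping means the initial namespace.
	if len(fields) == 3 && inside == 0 && outside == 0 && length == 4294967295 {
		return false
	}
	return true
}

func main() {
	data, err := os.ReadFile("/proc/self/uid_map")
	if err != nil {
		// No uid_map file: user namespaces are not available here.
		fmt.Println(false)
		return
	}
	fmt.Println(inUserNS(string(data)))
}
```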
Qiang Huang 9c1242ecba Add white list for bind mount chec
Fixes: #400

It would be useful to use fuse to isolate proc info.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-01-06 14:48:40 +08:00
Alexander Morozov 776791463d Merge pull request #357 from ashahab-altiscale/350-container-in-container
Bind mount device nodes on EPERM
2015-11-16 14:54:02 -08:00
Qiang Huang 96f0eefa1a Fix comment to be consistent with the code
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-11-16 19:16:27 +08:00
Abin Shahab 28c9d0252c Userns container in containers
Enables launching userns containers by catching EPERM errors for writing
to devices cgroups, and for mknod invocations.

Signed-off-by: Abin Shahab <ashahab@altiscale.com>
2015-11-15 14:42:35 -08:00
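The EPERM fallback from 28c9d0252c can be sketched with the two operations injected as functions, so the control flow can be exercised without privileges. `createDevice` and `errPerm` are hypothetical names; the real mknod and bind mount calls are left out.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// errPerm is the error an unprivileged mknod is expected to return.
var errPerm error = syscall.EPERM

// createDevice tries mknod first and falls back to a bind mount of the
// host's device node only when mknod is refused with EPERM. Any other
// error is returned unchanged.
func createDevice(mknod, bindMount func() error) error {
	err := mknod()
	if errors.Is(err, syscall.EPERM) {
		return bindMount()
	}
	return err
}

func main() {
	usedBind := false
	err := createDevice(
		func() error { return errPerm },              // simulated mknod failure
		func() error { usedBind = true; return nil }, // simulated bind mount
	)
	fmt.Println("error:", err, "bind mount used:", usedBind)
}
```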
Qiang Huang 34cff6f2f3 Correct intuition for setupDev
Minor fix, the former setupDev=true means not setup dev,
which is contrary to intuition, just correct it.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-10-21 16:06:26 +08:00
Antonio Murdaca c5b80bddf1 bump docker pkgs
Docker pkgs were updated while golinting the whole docker code base.
Now when trying to bump libcontainer/runc in docker, it fails compiling
with the following error:
```
vendor/src/github.com/opencontainers/runc/libcontainer/rootfs_linux.go:424:
undefined: mount.MountInfo
```
This is because, for instance, the mount pkg was updated here
0f5c9d301b (diff-49294d05afa48e2f7c0d2f02c6f7614c)
and now that type is only `mount.Info`.
This patch bump docker pkgs commit and adapt code to it.

Signed-off-by: Antonio Murdaca <amurdaca@redhat.com>
2015-10-06 10:48:12 +02:00
Vivek Goyal da8d776c08 Make pivotDir rprivate
pivotDir is the one where pivot_root() call puts the old root. We will
unmount pivotDir() and delete it.

Previously we were making / always rslave or rprivate. That will mean 
that pivotDir() could never have mounts which would be shared with
parent mount namespace. That also means that unmounting pivotDir() was
safe and none of the unmount will propagate to parent namespace and
unmount things which we did not want to.

But now the user can ask for private, shared, or slave propagation on /. That
means some of the mounts we inherited from parent could be shared and that
also means if we umount pivotDir/, those mounts will get unmounted in
parent too. That's not what we want.

Instead make pivotDir rprivate so that unmounts don't propagate back to
parent.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-10-01 17:03:02 -04:00
Vivek Goyal 23ec72a426 Make parent mount of container root private if it is shared.
pivot_root() imposes a number of restrictions and fails if they are not met. The
parent mount of the container root cannot be shared, otherwise pivot_root() will
fail.

So far parent could not be shared as we marked everything either private
or slave. But now we have introduced new propagation modes where parent
mount of container rootfs could be shared and pivot_root() will fail.

So check if parent mount is shared and if yes, make it private. This will
make sure pivot_root() works.

Also it will make sure that when we bind mount container rootfs, it does
not propagate to parent mount namespace. Otherwise cleanup becomes a 
problem.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-10-01 17:03:02 -04:00
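The "is the parent mount shared" check from 23ec72a426 can be approximated by reading the optional fields of a /proc/self/mountinfo line. The sketch below is an assumption about one way to do it, not the runc code; `isShared` is a hypothetical name and the sample lines are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// isShared reports whether a mountinfo line describes a mount with shared
// propagation. The optional fields start at index 6 and end at the "-"
// separator; a shared mount carries a "shared:N" tag there.
func isShared(line string) bool {
	fields := strings.Fields(line)
	for i := 6; i < len(fields) && fields[i] != "-"; i++ {
		if strings.HasPrefix(fields[i], "shared:") {
			return true
		}
	}
	return false
}

func main() {
	lines := []string{
		"22 1 0:20 / /tmp rw,nosuid,nodev shared:5 - tmpfs tmpfs rw",
		"36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue",
	}
	for _, l := range lines {
		fmt.Println(isShared(l), l)
	}
}
```

If the parent of the rootfs turns out to be shared, the commit makes it private before the bind mount so that pivot_root() can succeed.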
Vivek Goyal 5dd6caf6cf Replace config.Privatefs with config.RootPropagation
Right now config.Privatefs is a boolean which determines if / is applied
with propagation flag syscall.MS_PRIVATE | syscall.MS_REC or not.

Soon we want to represent other propagation states like private, [r]slave,
and [r]shared. So either we can introduce more boolean variable or keep
track of propagation flags in an integer variable. Keeping an integer
variable is more versatile and can allow various kind of propagation flags
to be specified. So replace Privatefs with RootPropagation which is an
integer.

Note, this will require changes in docker. Instead of setting Privatefs
to true, they will need to set:

config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC
 
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-10-01 17:03:02 -04:00
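Keeping the propagation state as an integer, as 5dd6caf6cf proposes, means each mode is just a combination of mount flags. The table below is a sketch: the string keys and the `rootPropagation` helper are hypothetical, and the values come from the Linux-only constants in the standard `syscall` package.

```go
package main

import (
	"fmt"
	"syscall"
)

// propagationFlags maps an illustrative mode name to the flag combination
// that would be stored in an integer field such as config.RootPropagation.
var propagationFlags = map[string]int{
	"private":  syscall.MS_PRIVATE,
	"rprivate": syscall.MS_PRIVATE | syscall.MS_REC,
	"slave":    syscall.MS_SLAVE,
	"rslave":   syscall.MS_SLAVE | syscall.MS_REC,
	"shared":   syscall.MS_SHARED,
	"rshared":  syscall.MS_SHARED | syscall.MS_REC,
}

// rootPropagation looks up a mode name and reports whether it is known.
func rootPropagation(mode string) (int, bool) {
	flags, ok := propagationFlags[mode]
	return flags, ok
}

func main() {
	// Equivalent of the old Privatefs = true.
	flags, _ := rootPropagation("rprivate")
	fmt.Printf("rprivate = %#x\n", flags)
}
```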
Alexander Morozov 4d5079b9dc Merge pull request #309 from chenchun/fix_reOpenDevNull
Fix reOpenDevNull
2015-09-30 19:06:43 -07:00
Alexander Morozov fba07bce72 Merge pull request #307 from estesp/no-remount-if-unecessary
Only remount if requested flags differ from current
2015-09-30 11:40:06 -07:00
Chun Chen 06d91f546f Fix reOpenDevNull
We should open /dev/null with os.O_RDWR, otherwise it won't be
possible to write to it.

Signed-off-by: Chun Chen <ramichen@tencent.com>
2015-09-30 16:05:49 +08:00
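The fix in 06d91f546f comes down to the open flag. A minimal sketch, assuming a Linux system with /dev/null present; `openDevNull` is a hypothetical helper name, not the function from the commit.

```go
package main

import (
	"fmt"
	"os"
)

// openDevNull opens /dev/null for both reading and writing, so the same
// descriptor can replace stdin, stdout, and stderr.
func openDevNull() (*os.File, error) {
	return os.OpenFile("/dev/null", os.O_RDWR, 0)
}

func main() {
	f, err := openDevNull()
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	// Writing succeeds because the file was opened read-write. With
	// os.O_RDONLY the write would be rejected by the kernel.
	n, err := f.Write([]byte("discarded"))
	fmt.Println(n, err)
}
```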
Phil Estes 97f5ee4e6a Only remount if requested flags differ from current
Do not remount a bind mount to enable flags unless non-default flags are
provided for the requested mount. This solves a problem with user
namespaces and remount of bind mount permissions.

Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
2015-09-29 23:13:04 -04:00
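The rule in 97f5ee4e6a can be expressed as a single mask test: a bind mount needs a remount only if the requested flags contain something beyond the bind-related defaults. The sketch below assumes those defaults are MS_BIND, MS_REC, and MS_REMOUNT; `needsRemount` is a hypothetical name.

```go
package main

import (
	"fmt"
	"syscall"
)

// needsRemount reports whether the requested flags contain anything other
// than the flags every bind mount already implies.
func needsRemount(flags uintptr) bool {
	return flags&^(syscall.MS_REC|syscall.MS_REMOUNT|syscall.MS_BIND) != 0
}

func main() {
	fmt.Println(needsRemount(syscall.MS_BIND | syscall.MS_REC))    // plain rbind
	fmt.Println(needsRemount(syscall.MS_BIND | syscall.MS_RDONLY)) // read-only bind
}
```

Skipping the remount for plain bind mounts avoids the permission problem under user namespaces that the commit message mentions.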
Dan Walsh cab342f0de Check for failure on /dev/mqueue and try again without labeling
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
2015-09-28 12:31:52 -04:00