Commit Graph

40 Commits

Author SHA1 Message Date
Serge Hallyn c0ad40c5e6 Do not create devices when in user namespace
When we launch a container in a new user namespace, we cannot create
devices, so we bind mount the host's devices into place instead.

If we are running in a user namespace (i.e. nested in a container),
then we need to do the same thing.  Add a function to detect that
and check for it before doing mknod.

Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
 Changelog - add a comment clarifying what's going on with the
	     uidmap file.
2016-01-08 12:54:08 -08:00
Qiang Huang 9c1242ecba Add white list for bind mount chec
Fixes: #400

It would be useful to use fuse to isolate proc info.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-01-06 14:48:40 +08:00
Alexander Morozov 776791463d Merge pull request #357 from ashahab-altiscale/350-container-in-container
Bind mount device nodes on EPERM
2015-11-16 14:54:02 -08:00
Qiang Huang 96f0eefa1a Fix comment to be consistent with the code
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-11-16 19:16:27 +08:00
Abin Shahab 28c9d0252c Userns container in containers
Enables launching userns containers by catching EPERM errors for writing
to devices cgroups, and for mknod invocations.

Signed-off-by: Abin Shahab <ashahab@altiscale.com>
2015-11-15 14:42:35 -08:00
Qiang Huang 34cff6f2f3 Correct intuition for setupDev
Minor fix, the former setupDev=true means not setup dev,
which is contrary to intuition, just correct it.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-10-21 16:06:26 +08:00
Antonio Murdaca c5b80bddf1 bump docker pkgs
Docker pkgs were updated while golinting the whole docker code base.
Now when trying to bump libcontainer/runc in docker, it fails compiling
with the following error:
``
vendor/src/github.com/opencontainers/runc/libcontainer/rootfs_linux.go:424:
undefined: mount.MountInfo
``
This is because, for instance, the mount pkg was updated here
0f5c9d301b (diff-49294d05afa48e2f7c0d2f02c6f7614c)
and now that type is only `mount.Info`.
This patch bump docker pkgs commit and adapt code to it.

Signed-off-by: Antonio Murdaca <amurdaca@redhat.com>
2015-10-06 10:48:12 +02:00
Vivek Goyal da8d776c08 Make pivotDir rprivate
pivotDir is the one where pivot_root() call puts the old root. We will
unmount pivotDir() and delete it.

Previously we were making / always rslave or rprivate. That will mean 
that pivotDir() could never have mounts which would be shared with
parent mount namespace. That also means that unmounting pivotDir() was
safe and none of the unmount will propagate to parent namespace and
unmount things which we did not want to.

But now user can specify that apply private, shared, slave on /. That
means some of the mounts we inherited from parent could be shared and that
also means if we umount pivotDir/, those mounts will get unmounted in
parent too. That's not what we want.

Instead make pivotDir rprivate so that unmounts don't propagate back to
parent.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-10-01 17:03:02 -04:00
Vivek Goyal 23ec72a426 Make parent mount of container root private if it is shared.
pivot_root() introduces bunch of restrictions otherwise it fails. parent
mount of container root can not be shared otherwise pivot_root() will
fail. 

So far parent could not be shared as we marked everything either private
or slave. But now we have introduced new propagation modes where parent
mount of container rootfs could be shared and pivot_root() will fail.

So check if parent mount is shared and if yes, make it private. This will
make sure pivot_root() works.

Also it will make sure that when we bind mount container rootfs, it does
not propagate to parent mount namespace. Otherwise cleanup becomes a 
problem.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-10-01 17:03:02 -04:00
Vivek Goyal 5dd6caf6cf Replace config.Privatefs with config.RootPropagation
Right now config.Privatefs is a boolean which determines if / is applied
with propagation flag syscall.MS_PRIVATE | syscall.MS_REC or not.

Soon we want to represent other propagation states like private, [r]slave,
and [r]shared. So either we can introduce more boolean variable or keep
track of propagation flags in an integer variable. Keeping an integer
variable is more versatile and can allow various kind of propagation flags
to be specified. So replace Privatefs with RootPropagation which is an
integer.

Note, this will require changes in docker. Instead of setting Privatefs
to true, they will need to set.

config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC
 
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-10-01 17:03:02 -04:00
Alexander Morozov 4d5079b9dc Merge pull request #309 from chenchun/fix_reOpenDevNull
Fix reOpenDevNull
2015-09-30 19:06:43 -07:00
Alexander Morozov fba07bce72 Merge pull request #307 from estesp/no-remount-if-unecessary
Only remount if requested flags differ from current
2015-09-30 11:40:06 -07:00
Chun Chen 06d91f546f Fix reOpenDevNull
We should open /dev/null with os.O_RDWR, otherwise it won't be
possible writen to it

Signed-off-by: Chun Chen <ramichen@tencent.com>
2015-09-30 16:05:49 +08:00
Phil Estes 97f5ee4e6a Only remount if requested flags differ from current
Do not remount a bind mount to enable flags unless non-default flags are
provided for the requested mount. This solves a problem with user
namespaces and remount of bind mount permissions.

Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
2015-09-29 23:13:04 -04:00
Dan Walsh cab342f0de Check for failure on /dev/mqueue and try again without labeling
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
2015-09-28 12:31:52 -04:00
Dan Walsh b4dcb75503 /proc and /sys do not support labeling
This is causing docker to crash when --selinux-enforcing mode is set.

Signed-off-by: Dan Walsh <dwalsh@redhat.com>
2015-09-28 12:31:52 -04:00
Michael Crosby 203d3e258e Move mount methods out of configs pkg
Do not have methods and actions that require syscalls in the configs
package because it breaks cross compile.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-09-24 09:43:12 -07:00
Michael Crosby 5765dcd086 Merge pull request #296 from crosbymichael/mount-resolv-symlink
Change mount dest after resolving symlinks
2015-09-23 10:21:25 -07:00
Michael Crosby b3bb606513 Change mount dest after resolving symlinks
We need to update the mount's destination after we resolve symlinks so
that it properly creates and mounts the correct location.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-09-23 10:07:18 -07:00
Antonio Murdaca d6e6462478 Cleanup unused func arguments
Signed-off-by: Antonio Murdaca <runcom@linux.com>
2015-09-21 11:50:29 +02:00
Vivek Goyal d1f4a5b8b5 libcontainer: Allow passing mount propagation flags
Right now if one passes a mount propagation flag in spec file, it
does not take effect. For example, try following in spec json file.

{
  "type": "bind",
  "source": "/root/mnt-source",
  "destination": "/root/mnt-dest",
  "options": "rbind,shared"
}

One would expect that /root/mnt-dest will be shared inside the container
but that's not the case.

#findmnt -o TARGET,PROPAGATION
`-/root/mnt-dest                      private

Reason being that propagation flags can't be passed in along with other
regular flags. They need to be passed in a separate call to mount syscall.
That too, one propagation flag at a time. (from mount man page).

Hence, store propagation flags separately in a slice and apply these
in that order after the mount call wherever appropriate. This allows
user to control the propagation property of mount point inside
the container.

Storing them separately also solves another problem where recursive flag
(syscall.MS_REC) can get mixed up. For example, options "rbind,private"
and "bind,rprivate" will be same and there will be no way to differentiate
between these if all the flags are stored in a single integer.

This patch would allow one to pass propagation flags "[r]shared,[r]slave,
[r]private,[r]unbindable" in spec file as per mount property.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
2015-09-16 15:53:23 -04:00
Mrunal Patel 486ac97618 Merge pull request #236 from hqhq/hq_fix_cgroup_rw
Always remount for bind mount
2015-09-14 12:08:34 -07:00
Andrey Vagin da2535f2d1 mount: don't read /proc/self/cgroup many times
Signed-off-by: Andrey Vagin <avagin@openvz.org>
2015-09-10 21:00:22 +03:00
Qiang Huang b7385e291c Always remount for bind mount
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-08-31 11:10:34 +08:00
Michael Crosby b1e7041957 Merge pull request #165 from calavera/context_labels
Make label.Relabel safer.
2015-08-28 14:20:00 -07:00
rajasec 24f7a10a93 Adding securityfs mount
Signed-off-by: rajasec <rajasec79@gmail.com>
2015-08-05 16:50:08 +05:30
Mrunal Patel c9d5850629 Don't make modifications to /dev there are no devices in the configuration
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2015-08-04 16:57:29 -04:00
David Calavera 4bd4d462af Make label.Relabel safer.
- Check if Selinux is enabled before relabeling. This is a bug.
- Make exclusion detection constant time. Kinda buggy too, imo.
- Do not depend on a magic string to create a new Selinux context.

Signed-off-by: David Calavera <david.calavera@gmail.com>
2015-07-31 10:37:32 -07:00
Kir Kolyshkin 6f82d4b544 Simplify and fix os.MkdirAll() usage
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.

Quoting MkdirAll documentation:

> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.

This means two things:

1. If a directory to be created already exists, no error is
returned.

2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll need to use for
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.

The above is a theory, based on quoted documentation and my UNIX
knowledge.

3. In practice, though, current MkdirAll implementation [1] returns
ENOTDIR in most of cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.

Because of #1, IsExist check after MkdirAll is not needed.

Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.

Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.

[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go

Signed-off-by: Kir Kolyshkin <kir@openvz.org>
2015-07-29 18:03:27 -07:00
Alexander Morozov d89964eed3 Remount /sys/fs/cgroup as RO if MS_RDONLY was passed in m.Flags
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2015-07-22 11:05:40 -07:00
Alexander Morozov d3217084b5 Create symlinks for merged cgroups
This allows software be not aware about existence of merged cgroups.

Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2015-07-20 16:12:28 -07:00
Andrey Vagin af4a5e708a ct: give criu informations about cgroup mounts
Actually cgroup mounts are bind-mounts, so they should be
handled by the same way.

Reported-by: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
2015-07-20 22:56:07 +03:00
Mrunal Patel 5b805276c2 Revert "Remount /sys/fs/cgroup as readonly always"
This reverts commit 18de1a273e.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2015-07-17 17:50:46 -04:00
Alexander Morozov 18de1a273e Remount /sys/fs/cgroup as readonly always
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2015-07-17 12:45:09 -07:00
Alexander Morozov 40b9b89107 Substract bindmount path from cgroup dir
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2015-07-15 10:41:25 -07:00
Mrunal Patel 42aa891a6b Merge pull request #91 from hqhq/hq_add_cgroup_mount
Add cgroup mount in the recommended config
2015-07-15 09:51:24 -07:00
Qiang Huang d7181a73e4 Add cgroup mount in the recommended config
And allow cgroup mount take flags from user configs.
As we show ro in the recommendation, so hard-coded
read-only flag should be removed.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-07-15 09:31:39 +08:00
Qiang Huang b1fd78346e Correct tmpfs mount for cgroup
Fixes: https://github.com/docker/docker/issues/14543
Fixes: https://github.com/docker/docker/pull/14610

Before this, we got mount info in container:
```
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
 /sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
```

It has no mount source, so in `parseInfoFile` in Docker code,
we'll get:
```
Error found less than 3 fields post '-' in "84 83 0:41 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs  rw,seclabel"
```

After this fix, we have mount info corrected:
```
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
```

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2015-07-15 09:09:09 +08:00
Michael Crosby 080df7ab88 Update import paths for new repository
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:59 -07:00
Michael Crosby 8f97d39dd2 Move libcontainer into subdirectory
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:15 -07:00