pivotDir is the directory where the pivot_root() call puts the old root.
We unmount pivotDir and then delete it.
Previously we always made / rslave or rprivate. That meant pivotDir
could never contain mounts shared with the parent mount namespace, so
unmounting pivotDir was safe: none of the unmounts would propagate to
the parent namespace and unmount things we did not want unmounted.
But now the user can specify private, shared, or slave propagation on /.
That means some of the mounts we inherited from the parent could be
shared, so if we unmount pivotDir, those mounts will get unmounted in
the parent too. That's not what we want.
Instead make pivotDir rprivate so that unmounts don't propagate back to
parent.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
pivot_root() imposes a number of restrictions and fails if they are not
met. One of them is that the parent mount of the container root cannot
be shared.
So far the parent could not be shared because we marked everything either
private or slave. But now we have introduced new propagation modes where
the parent mount of the container rootfs could be shared, and pivot_root()
would fail.
So check whether the parent mount is shared and, if so, make it private.
This makes sure pivot_root() works.
It also makes sure that when we bind mount the container rootfs, the mount
does not propagate to the parent mount namespace. Otherwise cleanup becomes
a problem.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Right now config.Privatefs is a boolean which determines whether the
propagation flags syscall.MS_PRIVATE | syscall.MS_REC are applied to /.
Soon we want to represent other propagation states like private, [r]slave,
and [r]shared. We could either introduce more boolean variables or keep
track of the propagation flags in an integer variable. An integer
variable is more versatile and allows various kinds of propagation flags
to be specified. So replace Privatefs with RootPropagation, which is an
integer.
Note that this will require changes in docker. Instead of setting
Privatefs to true, it will need to set:
config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC
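A minimal sketch of the idea, not the actual libcontainer code: the Config
struct and the fromPrivatefs helper are hypothetical, the flag values are
local copies of the Linux mount(2) constants so the example builds anywhere,
and treating zero as "leave propagation unchanged" is an assumption.

```go
package main

import "fmt"

// Local copies of the Linux mount(2) flag values (assumed stand-ins for
// the syscall.MS_* constants).
const (
	msRec     = 0x4000  // MS_REC
	msPrivate = 1 << 18 // MS_PRIVATE
	msSlave   = 1 << 19 // MS_SLAVE
	msShared  = 1 << 20 // MS_SHARED
)

// Config is a hypothetical, trimmed-down stand-in for the real config.
type Config struct {
	// RootPropagation holds the propagation flags to apply to /.
	// In this sketch, zero means "leave propagation unchanged".
	RootPropagation int
}

// fromPrivatefs shows how the old boolean maps onto the integer field.
func fromPrivatefs(privatefs bool) Config {
	if privatefs {
		return Config{RootPropagation: msPrivate | msRec}
	}
	return Config{}
}

func main() {
	fmt.Printf("Privatefs=true -> %#x\n", fromPrivatefs(true).RootPropagation)

	// States that a single boolean could not express.
	rslave := Config{RootPropagation: msSlave | msRec}
	shared := Config{RootPropagation: msShared}
	fmt.Printf("rslave -> %#x, shared -> %#x\n",
		rslave.RootPropagation, shared.RootPropagation)
}
```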
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Do not remount a bind mount to enable flags unless non-default flags are
provided for the requested mount. This solves a problem with user
namespaces and remount of bind mount permissions.
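The check could look roughly like the sketch below. The needsRemount name
is hypothetical, the flag values are local copies of the Linux mount(2)
constants, and treating bind, rec, and remount as the "default" flags is
an assumption about what counts as non-default.

```go
package main

import "fmt"

// Local copies of the Linux mount(2) flag values.
const (
	msRdonly  = 0x1    // MS_RDONLY
	msRemount = 0x20   // MS_REMOUNT
	msBind    = 0x1000 // MS_BIND
	msRec     = 0x4000 // MS_REC
)

// needsRemount reports whether a bind mount carries any flag beyond the
// ones that merely describe the bind itself. Only then is a second,
// remounting mount call worth issuing.
func needsRemount(flags int) bool {
	return flags&^(msBind|msRec|msRemount) != 0
}

func main() {
	fmt.Println("rbind:   ", needsRemount(msBind|msRec))
	fmt.Println("bind,ro: ", needsRemount(msBind|msRdonly))
}
```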
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Do not have methods and actions that require syscalls in the configs
package because it breaks cross compile.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit allows additional architectures to be added to Seccomp filters
created by containers. This allows containers to make syscalls using these
architectures. For example, in a container on an AMD64 system, only AMD64
syscalls would be usable unless x86 was added to the filter using this patch,
which would allow both 32-bit and 64-bit syscalls to be used.
Signed-off-by: Matthew Heon <mheon@redhat.com>
We need to update the mount's destination after we resolve symlinks so
that it properly creates and mounts the correct location.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Whenever /dev/null is used as one of the main process's STDIO streams, do
not try to change its ownership via fchown: we should not do that in the
first place, and it will also fail if the container is supposed to be
read-only.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When executing an additional process in a container, all namespaces are
entered but the user namespace. As a result, the process may be
executed as the host's root user. This has both functionality and
security implications.
Fix this by adding the missing user namespace to the array of
namespaces. Since joining a user namespace in which the caller is
already a member yields an error, skip namespaces we're already in.
Last, remove a needless and buggy AT_SYMLINK_NOFOLLOW in the code.
Signed-off-by: Ido Yariv <ido@wizery.com>
Fix the permissions of the container's main process's STDIO when the
process is not run as the root user. This changes the permissions right
before switching to the specified user so that its STDIO matches its
UID and GID.
Add a test for checking that the STDIO of the process is owned by the
specified user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Right now if one passes a mount propagation flag in the spec file, it
does not take effect. For example, try the following in the spec JSON file:
{
"type": "bind",
"source": "/root/mnt-source",
"destination": "/root/mnt-dest",
"options": "rbind,shared"
}
One would expect that /root/mnt-dest will be shared inside the container
but that's not the case.
#findmnt -o TARGET,PROPAGATION
`-/root/mnt-dest private
The reason is that propagation flags can't be passed in along with other
regular flags. They need to be passed in a separate call to the mount
syscall, one propagation flag at a time (see the mount man page).
Hence, store propagation flags separately in a slice and apply them in
order after the mount call wherever appropriate. This allows the user to
control the propagation property of a mount point inside the container.
Storing them separately also solves another problem where the recursive
flag (syscall.MS_REC) can get mixed up. For example, the options
"rbind,private" and "bind,rprivate" would be the same, and there would be
no way to differentiate between them if all the flags were stored in a
single integer.
This patch allows one to pass the propagation flags "[r]shared, [r]slave,
[r]private, [r]unbindable" in the spec file as a per-mount property.
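The split between regular and propagation flags can be sketched as follows.
This is an illustration only: parseOptions and the two lookup tables are
hypothetical names, the flag values are local copies of the Linux mount(2)
constants, and rejecting unknown options is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the Linux mount(2) flag values.
const (
	msRdonly     = 0x1
	msBind       = 0x1000
	msRec        = 0x4000
	msUnbindable = 1 << 17
	msPrivate    = 1 << 18
	msSlave      = 1 << 19
	msShared     = 1 << 20
)

var regularFlags = map[string]int{
	"bind":  msBind,
	"rbind": msBind | msRec,
	"ro":    msRdonly,
}

var propagationFlags = map[string]int{
	"private":     msPrivate,
	"rprivate":    msPrivate | msRec,
	"slave":       msSlave,
	"rslave":      msSlave | msRec,
	"shared":      msShared,
	"rshared":     msShared | msRec,
	"unbindable":  msUnbindable,
	"runbindable": msUnbindable | msRec,
}

// parseOptions ORs regular flags into one integer and keeps propagation
// flags in a slice, in the order given, so each can be applied with its
// own mount call after the initial mount.
func parseOptions(options string) (flags int, propagation []int, err error) {
	for _, o := range strings.Split(options, ",") {
		if f, ok := regularFlags[o]; ok {
			flags |= f
		} else if p, ok := propagationFlags[o]; ok {
			propagation = append(propagation, p)
		} else {
			return 0, nil, fmt.Errorf("unknown mount option %q", o)
		}
	}
	return flags, propagation, nil
}

func main() {
	for _, opts := range []string{"rbind,private", "bind,rprivate"} {
		flags, prop, err := parseOptions(opts)
		fmt.Printf("%-14s flags=%#x propagation=%#x err=%v\n", opts, flags, prop, err)
	}
}
```

With a single integer, both option strings in main would collapse to the
same value; kept apart, the recursive bit stays attached to the option
that asked for it.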
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Here are two reasons:
* If we use systemd, we need to ask it to create cgroups
* If a container is restored with another ID, we need to
change paths to cgroups.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Bug was introduced in #250
According to: http://man7.org/linux/man-pages/man5/proc.5.html
36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
...
(7) optional fields: zero or more fields of the form
"tag[:value]".
The 7th field is optional. We should skip it when parsing mount info.
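A sketch of parsing that tolerates any number of optional fields, by
locating the "-" separator instead of assuming a fixed position. The
mountEntry type and parseMountInfoLine name are hypothetical and only a
few fields are extracted.

```go
package main

import (
	"fmt"
	"strings"
)

// mountEntry holds the handful of mountinfo fields this sketch cares about.
type mountEntry struct {
	MountPoint string   // field (5)
	Optional   []string // field (7): zero or more "tag[:value]" items
	FSType     string   // first field after the "-" separator
	Source     string   // second field after the "-" separator
}

// parseMountInfoLine splits one /proc/<pid>/mountinfo line. Everything
// between field (6) and the "-" separator is treated as optional fields.
func parseMountInfoLine(line string) (mountEntry, error) {
	fields := strings.Fields(line)
	sep := -1
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			sep = i
			break
		}
	}
	if sep < 0 {
		return mountEntry{}, fmt.Errorf("no '-' separator in %q", line)
	}
	post := fields[sep+1:]
	if len(post) < 3 {
		return mountEntry{}, fmt.Errorf("fewer than 3 fields after '-' in %q", line)
	}
	return mountEntry{
		MountPoint: fields[4],
		Optional:   fields[6:sep],
		FSType:     post[0],
		Source:     post[1],
	}, nil
}

func main() {
	lines := []string{
		"36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue",
		"36 35 98:0 /mnt1 /mnt2 rw,noatime - ext3 /dev/root rw,errors=continue",
	}
	for _, l := range lines {
		e, err := parseMountInfoLine(l)
		fmt.Printf("%+v err=%v\n", e, err)
	}
}
```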
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
I got:
```
exec_test.go:823: Mode expected to contain 'ro,nosuid,nodev,noexec': tmpfs on /sys/fs/cgroup type tmpfs (ro,seclabel,nosuid,nodev,noexec,relatime,mode=755
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Again. It looks like a build tag was somehow dropped between
the PR here: https://github.com/docker/libcontainer/pull/625
and the move to runc.
Signed-off-by: Christy Perez <clnperez@linux.vnet.ibm.com>
As v2.1.0 is no longer required for successful testing, do not build it in the
Dockerfile - instead just use the version Ubuntu ships.
Signed-off-by: Matthew Heon <mheon@redhat.com>
This removes the existing, native Go seccomp filter generation and replaces it
with Libseccomp. Libseccomp is a C library which provides architecture
independent generation of Seccomp filters for the Linux kernel.
This adds a dependency on v2.2.1 or above of Libseccomp.
Signed-off-by: Matthew Heon <mheon@redhat.com>
Simplify the code introduced by the commit d1f0d5705deb:
Return actual ProcessState on Wait error
Cc: Alexander Morozov <lk4d4@docker.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
This adds a `Signal()` method to the container interface so that the
initial process can be signaled after a Load or operation. It also
implements signaling the init process from a nonChildProcess.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
A boolean field named GidMappingsEnableSetgroups was added to
SysProcAttr in Go1.5. This field determines the value of the process's
setgroups proc entry.
Since the default is to set the entry to 'deny', calling setgroups will
fail on systems running kernels 3.19+.
Set GidMappingsEnableSetgroups to true so setgroups won't be set to
'deny'.
Signed-off-by: Ido Yariv <ido@wizery.com>
- Check if Selinux is enabled before relabeling. This is a bug.
- Make exclusion detection constant time. Kinda buggy too, imo.
- Do not depend on a magic string to create a new Selinux context.
Signed-off-by: David Calavera <david.calavera@gmail.com>
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as one MkdirAll needs to use for a
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, the current MkdirAll implementation [1] returns
ENOTDIR in most of the cases described in #2, with the exception of a
race between MkdirAll and someone else creating the last component of
the MkdirAll argument as a file. In that case MkdirAll() will indeed
return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
When the copyBusybox() fails, the error message should be
propagated to the caller of newRootfs().
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Cgroup mounts are actually bind mounts, so they should be handled the
same way.
Reported-by: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Sometimes a subsystem can be mounted at a path like
"subsystem1,subsystem2", so we need to handle this.
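One way to handle such paths, sketched under the assumption that the last
path component names the co-mounted subsystems; mountpointHasSubsystem is
a hypothetical helper name.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// mountpointHasSubsystem reports whether the last component of a cgroup
// mountpoint names the given subsystem, either alone or as one entry of a
// comma-separated list.
func mountpointHasSubsystem(mountpoint, subsystem string) bool {
	for _, s := range strings.Split(path.Base(mountpoint), ",") {
		if s == subsystem {
			return true
		}
	}
	return false
}

func main() {
	mp := "/sys/fs/cgroup/cpu,cpuacct"
	fmt.Println(mountpointHasSubsystem(mp, "cpu"))
	fmt.Println(mountpointHasSubsystem(mp, "cpuacct"))
	fmt.Println(mountpointHasSubsystem("/sys/fs/cgroup/cpuset", "cpu"))
}
```

Comparing whole list entries rather than using a substring match keeps
"cpu" from matching "cpuset".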
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
This is needed for the cgroups of nested containers. Without this patch
they create an unnecessary intermediate cgroup like:
/sys/fs/cgroup/memory/system.slice/docker-9409d9f0b68fb9e9d7d532d5b3f35e7c7f9cca1312af392ae3b28436f1f2998f.scope/system.slice/docker-9409d9f0b68fb9e9d7d532d5b3f35e7c7f9cca1312af392ae3b28436f1f2998f.scope/docker/908ebcc9c13584a14322ec070bd971e0de62f126c0cd95c079acdb99990ad3a3
This is because in /proc/self/cgroup we see paths from the host, and they
don't exist in the container.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
Previously the name=systemd cgroup was mounted inside the container at
/sys/fs/cgroup/name=systemd, which is wrong; it should be
/sys/fs/cgroup/systemd
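The mapping can be sketched as below. cgroupMountDir is a hypothetical
helper, and stripping a "name=" prefix is an assumption about how the
directory name is derived for named hierarchies.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// cgroupMountDir returns the directory a cgroup hierarchy should be
// mounted at inside the container. Named hierarchies such as
// "name=systemd" are mounted under the bare name.
func cgroupMountDir(root, subsystem string) string {
	return path.Join(root, strings.TrimPrefix(subsystem, "name="))
}

func main() {
	fmt.Println(cgroupMountDir("/sys/fs/cgroup", "name=systemd"))
	fmt.Println(cgroupMountDir("/sys/fs/cgroup", "memory"))
}
```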
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
Also allow the cgroup mount to take flags from user configs.
Since we show ro in the recommendation, the hard-coded
read-only flag should be removed.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Fixes: https://github.com/docker/docker/issues/14543
Fixes: https://github.com/docker/docker/pull/14610
Before this, we got the following mount info in the container:
```
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
/sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
```
The tmpfs entry has no mount source, so `parseInfoFile` in the Docker
code fails with:
```
Error found less than 3 fields post '-' in "84 83 0:41 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs rw,seclabel"
```
After this fix, the mount info is correct:
```
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In some older kernels setting swappiness fails. This happens even
when nobody tries to configure swappiness from the docker UI, because
we still get some default value from the host config.
With this we treat a value of -1 as the default (set implicitly) and skip
the enforcement of swappiness.
However, setting an invalid value from the docker UI (anything outside
0-100, including -1) should fail. This patch enables that fix in the docker
UI. Without this fix, container creation with an invalid value succeeds
with a default value (60), which is incorrect.
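The two roles of the value can be sketched as follows. Both function
names are hypothetical, and the split between a user-facing validation
step and a runtime "should we write it" step is an assumption about how
the callers are organized.

```go
package main

import "fmt"

// swappinessUnset marks a value that was never set explicitly.
const swappinessUnset int64 = -1

// validateRequestedSwappiness is the user-facing check: an explicitly
// requested value must lie in 0-100, so -1 is rejected here.
func validateRequestedSwappiness(v int64) error {
	if v < 0 || v > 100 {
		return fmt.Errorf("invalid swappiness %d: valid range is 0-100", v)
	}
	return nil
}

// shouldWriteSwappiness is the runtime-side check: the implicit default
// is skipped so older kernels are never asked to set swappiness.
func shouldWriteSwappiness(v int64) bool {
	return v != swappinessUnset
}

func main() {
	for _, v := range []int64{-1, 0, 60, 101} {
		fmt.Printf("value=%d write=%v validate=%v\n",
			v, shouldWriteSwappiness(v), validateRequestedSwappiness(v))
	}
}
```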
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
The creation of the profile should be handled outside of libcontainer so
that it can be customized and packaged.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Update the tests to use the test-friendly GetAdditionalGroups API,
rather than making random files for no good reason.
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
The old GetAdditionalGroups* API didn't match the rest of
libcontainer/user, where we make functions that take io.Readers and then
make wrappers around them. Otherwise we have to do dodgy stuff when
testing our code.
Fixes: d4ece29c0b ("refactor GetAdditionalGroupsPath")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
This moves much of the documentation on contributing to and maintaining
the codebase from the libcontainer subdirectory to the root of the repo.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
A directory with a hyphen currently generates an InvalidId error because
of the regex in libcontainer. I don't believe there is any reason a
hyphen should be disallowed.
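A sketch of an ID check that accepts hyphens. The pattern and the validID
name are assumptions for illustration, not necessarily the exact regex
used in libcontainer.

```go
package main

import (
	"fmt"
	"regexp"
)

// idRegex accepts word characters, underscores, and hyphens. The hyphen
// is placed last in the class so it is read as a literal character.
var idRegex = regexp.MustCompile(`^[\w_-]+$`)

// validID reports whether id is acceptable as a container ID.
func validID(id string) bool {
	return idRegex.MatchString(id)
}

func main() {
	for _, id := range []string{"my-container", "my_container1", "bad/id", ""} {
		fmt.Printf("%q valid=%v\n", id, validID(id))
	}
}
```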
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
It can happen if newContainer fails. Now the test shows the real error
from newContainer instead of a trace.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>