Whenever dev/null is used as one of the main processes STDIO, do not try
to change the permissions on it via fchown because we should not do it
in the first place and also this will fail if the container is supposed
to be readonly.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When executing an additional process in a container, all namespaces are
entered but the user namespace. As a result, the process may be
executed as the host's root user. This has both functionality and
security implications.
Fix this by adding the missing user namespace to the array of
namespaces. Since joining a user namespace in which the caller is
already a member yields an error, skip namespaces we're already in.
Last, remove a needless and buggy AT_SYMLINK_NOFOLLOW in the code.
Signed-off-by: Ido Yariv <ido@wizery.com>
Fix the permissions of the container's main processes STDIO when the
process is not run as the root user. This changes the permissions right
before switching to the specified user so that it's STDIO matches it's
UID and GID.
Add a test for checking that the STDIO of the process is owned by the
specified user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Right now if one passes a mount propagation flag in spec file, it
does not take effect. For example, try following in spec json file.
{
"type": "bind",
"source": "/root/mnt-source",
"destination": "/root/mnt-dest",
"options": "rbind,shared"
}
One would expect that /root/mnt-dest will be shared inside the container
but that's not the case.
#findmnt -o TARGET,PROPAGATION
`-/root/mnt-dest private
Reason being that propagation flags can't be passed in along with other
regular flags. They need to be passed in a separate call to mount syscall.
That too, one propagation flag at a time. (from mount man page).
Hence, store propagation flags separately in a slice and apply these
in that order after the mount call wherever appropriate. This allows
user to control the propagation property of mount point inside
the container.
Storing them separately also solves another problem where recursive flag
(syscall.MS_REC) can get mixed up. For example, options "rbind,private"
and "bind,rprivate" will be same and there will be no way to differentiate
between these if all the flags are stored in a single integer.
This patch would allow one to pass propagation flags "[r]shared,[r]slave,
[r]private,[r]unbindable" in spec file as per mount property.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Here are two reasons:
* If we use systemd, we need to ask it to create cgroups
* If a container is restored with another ID, we need to
change paths to cgroups.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Bug was introduced in #250
According to: http://man7.org/linux/man-pages/man5/proc.5.html
36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue
(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11)
...
(7) optional fields: zero or more fields of the form
"tag[:value]".
The 7th field is optional. We should skip it when parsing mount info.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
I got:
```
exec_test.go:823: Mode expected to contain 'ro,nosuid,nodev,noexec': tmpfs on /sys/fs/cgroup type tmpfs (ro,seclabel,nosuid,nodev,noexec,relatime,mode=755
```wq
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Again. It looks like a build tag was somehow dropped between
the PR here: https://github.com/docker/libcontainer/pull/625
and the move to runc.
Signed-off-by: Christy Perez <clnperez@linux.vnet.ibm.com>
As v2.1.0 is no longer required for successful testing, do not build it in the
Dockerfile - instead just use the version Ubuntu ships.
Signed-off-by: Matthew Heon <mheon@redhat.com>
This removes the existing, native Go seccomp filter generation and replaces it
with Libseccomp. Libseccomp is a C library which provides architecture
independent generation of Seccomp filters for the Linux kernel.
This adds a dependency on v2.2.1 or above of Libseccomp.
Signed-off-by: Matthew Heon <mheon@redhat.com>
Simplify the code introduced by the commit d1f0d5705deb:
Return actual ProcessState on Wait error
Cc: Alexander Morozov <lk4d4@docker.com>
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
This adds a `Signal()` method to the container interface so that the
initial process can be signaled after a Load or operation. It also
implements signaling the init process from a nonChildProcess.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
A boolean field named GidMappingsEnableSetgroups was added to
SysProcAttr in Go1.5. This field determines the value of the process's
setgroups proc entry.
Since the default is to set the entry to 'deny', calling setgroups will
fail on systems running kernels 3.19+.
Set GidMappingsEnableSetgroups to true so setgroups wont be set to
'deny'.
Signed-off-by: Ido Yariv <ido@wizery.com>
- Check if Selinux is enabled before relabeling. This is a bug.
- Make exclusion detection constant time. Kinda buggy too, imo.
- Do not depend on a magic string to create a new Selinux context.
Signed-off-by: David Calavera <david.calavera@gmail.com>
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll need to use for
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, current MkdirAll implementation [1] returns
ENOTDIR in most of cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
When the copyBusybox() fails, the error message should be
propagated to the caller of newRootfs().
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Actually cgroup mounts are bind-mounts, so they should be
handled by the same way.
Reported-by: Ross Boucher <rboucher@gmail.com>
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Sometimes subsystem can be mounted to path like "subsystem1,subsystem2",
so we need to handle this.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
This is needed because for nested containers cgroups. Without this patch
they creating unnecessary intermediate cgroup like:
/sys/fs/cgroup/memory/system.slice/docker-9409d9f0b68fb9e9d7d532d5b3f35e7c7f9cca1312af392ae3b28436f1f2998f.scope/system.slice/docker-9409d9f0b68fb9e9d7d532d5b3f35e7c7f9cca1312af392ae3b28436f1f2998f.scope/docker/908ebcc9c13584a14322ec070bd971e0de62f126c0cd95c079acdb99990ad3a3
It is because in /proc/self/cgroup we see paths from host, and they don't
exist in container.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
Before name=systemd cgroup was mounted inside container to
/sys/fs/cgroup/name=systemd, which is wrong, it should be
/sys/fs/cgroup/systemd
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
And allow cgroup mount take flags from user configs.
As we show ro in the recommendation, so hard-coded
read-only flag should be removed.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Fixes: https://github.com/docker/docker/issues/14543
Fixes: https://github.com/docker/docker/pull/14610
Before this, we got mount info in container:
```
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
/sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
```
It has no mount source, so in `parseInfoFile` in Docker code,
we'll get:
```
Error found less than 3 fields post '-' in "84 83 0:41 / /sys/fs/cgroup rw,nosuid,nodev,noexec,relatime - tmpfs rw,seclabel"
```
After this fix, we have mount info corrected:
```
sysfs /sys sysfs ro,seclabel,nosuid,nodev,noexec,relatime 0 0
tmpfs /sys/fs/cgroup tmpfs rw,seclabel,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,relatime,cpuset 0 0
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>