If Apply fails on one subsystem, the cgroup directories already
created for other subsystems are left behind.
From Docker's point of view, starting the container failed and
`docker rm` removes the container, but some cgroup directories
are left over.
Sometimes we don't want to clean everything up when something
goes wrong, because we need the intermediate state to debug
what happened, but cgroup directories are not useful
information worth keeping.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This PR fixes an issue in the following scenario:
```
in terminal 1:
~# cd /sys/fs/cgroup/cpuset
~# mkdir test
~# cd test
~# cat cpuset.cpus
0-3
~# echo 1 > cpuset.cpu_exclusive (make sure you don't have other cgroups under root)
in terminal 2:
~# echo $$ > /sys/fs/cgroup/cpuset/test/tasks
// set resources.cpu.cpus="0-2" in config.json
~# runc run test1
back to terminal 1:
~# cd test1
~# cat cpuset.cpus
0-2
~# echo 1 > cpuset.cpu_exclusive
in terminal 3:
~# echo $$ > /sys/fs/cgroup/cpuset/test/tasks
// set resources.cpu.cpus="3" in config.json
~# runc run test2
container_linux.go:247: starting container process caused "process_linux.go:258:
applying cgroup configuration for process caused \"failed to write 0-3\\n to
cpuset.cpus: write /sys/fs/cgroup/cpuset/test2/cpuset.cpus: invalid argument\""
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This allows a user to send a signal to all the processes in the
container within a single atomic action to avoid new processes being
forked off before the signal can be sent.
This basically takes functionality that we already use behind
`delete` and exposes it on the `kill` command by adding a flag.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This moves the ambient capability support behind an `ambient` build tag
so that it is only compiled upon request.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The default terminal setting for a new pty on Linux (unix98) has +ONLCR,
which causes '\n' writes by a container process to be converted into
'\r\n' reads by the managing process. This is quite unexpected, and
causes multiple issues with things like bats testing. To fix it, make
the terminal sane after opening it by setting -ONLCR.
This patch might need to be rewritten after the console rewrite patchset
is merged.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
I used the same tool (https://github.com/client9/misspell)
that Daniel used a few days ago; I don't know why it missed
these typos at that time.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
config.cloneflag is not mandatory. When using `runc exec`,
config.cloneflag can be empty, and even then it is `0`,
not `-1`.
So this validation is wrong and unneeded.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In user namespaces devices are bind-mounted from the host, so
we need to add them as external mounts for CRIU.
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When the spec file contains duplicated namespaces, e.g.
    specs: specs.Spec{
        Linux: &specs.Linux{
            Namespaces: []specs.Namespace{
                {
                    Type: "pid",
                },
                {
                    Type: "pid",
                    Path: "/proc/1/ns/pid",
                },
            },
        },
    }
runc should report a malformed spec instead of silently using the
last entry, because such a spec is quite confusing.
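A minimal sketch of such a check, assuming a simplified local
`namespace` type in place of the real spec type; the function name
and the error text are illustrative only:

```go
package main

import "fmt"

// namespace is a local, simplified stand-in for the spec's namespace entry.
type namespace struct {
	Type string
	Path string
}

// checkDuplicateNamespaces returns an error if any namespace type is listed
// more than once, instead of silently letting the last entry win.
func checkDuplicateNamespaces(namespaces []namespace) error {
	seen := make(map[string]bool, len(namespaces))
	for _, ns := range namespaces {
		if seen[ns.Type] {
			return fmt.Errorf("malformed spec file: duplicated namespace %q", ns.Type)
		}
		seen[ns.Type] = true
	}
	return nil
}
```

With the two "pid" entries from the example above, the check fails
on the second entry.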
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Previously we only tested failures, which caused us to miss issues where
setting sysctls would *always* fail.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When changing this validation, the code actually allowing the validation
to pass was removed. This meant that any net.* sysctl would always fail
to validate.
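The shape of the bug can be illustrated with a simplified check.
This is a sketch under assumptions, not the runc validator:
`validateSysctl` is a hypothetical name, only net.* keys are
modelled, and the rule "net.* needs a separate network namespace"
is the assumed condition for passing.

```go
package main

import (
	"fmt"
	"strings"
)

// validateSysctl is a hypothetical, simplified check: a net.* sysctl is
// accepted only when the container has its own network namespace; every
// other key is rejected in this sketch.
func validateSysctl(key string, hasNetNS bool) error {
	if strings.HasPrefix(key, "net.") {
		if hasNetNS {
			// This is the "pass" branch; without it every net.* key
			// falls through to an error.
			return nil
		}
		return fmt.Errorf("sysctl %q requires a separate network namespace", key)
	}
	return fmt.Errorf("sysctl %q is not in a separate kernel namespace", key)
}
```

A test that only checks the failing cases would not notice if the
`return nil` branch were missing, which is why a success case is
needed as well.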
Fixes: bc84f83344 ("fix docker/docker#27484")
Reported-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This reverts part of commit eb0a144b5e.
That commit introduced two issues.
- We need to make the parent mount of rootfs private before bind mounting
  rootfs. Otherwise the bind mount of rootfs can propagate to other mount
  namespaces (if the parent mount is shared).
- It broke the test TestRootfsPropagationSharedMount() on Fedora.
  On Fedora /tmp is a mount point with "shared" propagation. I think
  you should be able to reproduce it on other distributions as well,
  as long as you mount tmpfs on /tmp and give it "shared" propagation.
  The reason for the failure is that pivot_root() fails, and it fails
  because the kernel does the following check:
      IS_MNT_SHARED(new_mnt->mnt_parent)
  Say /tmp/foo is the new rootfs. We have bind mounted rootfs, so new_mnt
  is /tmp/foo, and new_mnt->mnt_parent is /tmp, which is "shared" on
  Fedora, so the above check fails.
So this change broke a few things, and it is a good idea to revert part of it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Namely, use an undocumented feature of pivot_root(2) where
pivot_root(".", ".") is allowed and ties the old_root to your
/proc/self/cwd in a way that makes unmounting easy. Thanks a lot to
the LXC developers, who came up with this idea first.
This is the first step of many toward allowing runC to work with a
completely read-only rootfs.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Without this patch applied, RHEL's SELinux policies cause container
creation to not really work. Unfortunately this might be an issue for
rootless containers (opencontainers/runc#774) but we'll cross that
bridge when we come to it.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Print the error message to stderr if we are unable to return it back via
the pipe to the parent process. Also, don't panic here as it is most
likely a system or user error and not a programmer error.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
We need support for read-only mounts in SELinux to allow a bunch of
containers to share the same read-only image. In order to do this
we need a new label which allows container processes to read/execute
all files but not write them.
The existing mount labels are either shared write or private write.
This label is shared read/execute.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
At some point InitLabels was changed to look for SecurityOptions
separated by a ":" rather than an "=", but DupSecOpt was never
changed to match this default.
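To illustrate why both sides must agree on the separator, here is
a small sketch. `dupSecOpt` and `parseSecOpt` are hypothetical
simplifications, not the functions in the label package; the
"user:role:type:level" context layout is the only thing assumed
about SELinux here.

```go
package main

import "strings"

// dupSecOpt is a hypothetical sketch: it splits an SELinux context of the
// form "user:role:type:level" and emits one option per field, using ":" as
// the key/value separator.
func dupSecOpt(context string) []string {
	// SplitN keeps an MCS level such as "s0:c1,c2" intact in the last field.
	parts := strings.SplitN(context, ":", 4)
	if len(parts) != 4 {
		return nil
	}
	keys := []string{"user", "role", "type", "level"}
	opts := make([]string, 0, len(keys))
	for i, k := range keys {
		opts = append(opts, k+":"+parts[i])
	}
	return opts
}

// parseSecOpt splits one option at the first ":" into key and value, which
// is the separator the producing side above must also use.
func parseSecOpt(opt string) (key, value string, ok bool) {
	i := strings.Index(opt, ":")
	if i < 0 {
		return "", "", false
	}
	return opt[:i], opt[i+1:], true
}
```

If the producer emitted "level=s0:c1,c2" instead, `parseSecOpt`
would split at the wrong place and yield the key "level=s0".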
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
If copyup is specified for a tmpfs mount, then the contents of the
underlying directory are copied into the tmpfs mounted over it.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right privileges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.
This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.
In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more that we can do.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This keeps us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In order to mount root filesystems inside the container's mount
namespace as part of the spec we need to have the ability to do a bind
mount to / as the destination.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Ambient capabilities have been available since Linux 4.3. If set, they
allow unprivileged child processes to inherit capabilities. At present
there is no means to set capabilities on non-root processes other than
via filesystem capabilities, which are not usually supported in image
formats.
With ambient capabilities, non-root processes can be given capabilities
as well, so the main reason to use root in containers goes away and
capabilities work as expected.
The code falls back to the existing behaviour if ambient capabilities
are not supported.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
`grep -r "range map"` shows 3 places that use a map to
range over enum types; using a slice instead gives
better performance and lower memory usage.
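A small sketch of the slice form. The `propagation` type and its
values are hypothetical and only serve as an example of an enum
type:

```go
package main

// propagation is a hypothetical enum type used only for this illustration.
type propagation int

const (
	private propagation = iota
	shared
	slave
)

// propagationNames pairs names with values in a slice: iteration order is
// fixed and no hashing is involved when ranging over it.
var propagationNames = []struct {
	name  string
	value propagation
}{
	{"private", private},
	{"shared", shared},
	{"slave", slave},
}

// lookupPropagation scans the slice linearly, which is cheap for a handful
// of entries.
func lookupPropagation(name string) (propagation, bool) {
	for _, p := range propagationNames {
		if p.name == name {
			return p.value, true
		}
	}
	return 0, false
}
```

Ranging over the slice also visits the entries in a stable order,
which ranging over a map does not guarantee.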
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
- /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information
- /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
Make the cgroupData.join method use `WriteCgroupProc` to place the pid
into the proc file; this avoids attaching any pid to the cgroup if -1
is specified as the pid.
So, replace `writeFile` with `WriteCgroupProc`, like `cpuset.go`'s
ApplyDir method.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
If a container's state is running or created, the container.Pause()
method can set the state to pausing, and then paused.
This patch updates the comment so that it is consistent with the code.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Currently, if a user runs a command like
    docker run -ti -v /proc:/proc:Z fedora sh
they see
    docker: Error response from daemon: operation not supported.
With this fix they should see a much more informative error message:
    docker run -ti -v /proc:/proc:Z fedora sh
    docker: Error response from daemon: SELinux Relabeling of /proc is not allowed: operation not supported.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
The error sent from the child process is already a genericError. If
we don't allow recursive generic errors, we won't get any
cause information from the parent process.
Before, we got:
    WARN[0000] exit status 1
    ERRO[0000] operation not permitted
After, we got:
    WARN[0000] exit status 1
    ERRO[0000] container_linux.go:247: starting container process caused "process_linux.go:359: container init caused \"operation not permitted\""
It's not pretty, but it is useful for detecting root causes.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This allows older state files to be loaded without the unmarshal error
of the string to int conversion.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
After #1009, we don't always set `cgroup.Paths`, so
`getCgroupPath()` returns the wrong cgroup path: it takes
the current process's cgroup as the parent, which is wrong
when we try to find the cgroup path in `runc ps` and
`runc kill`.
Fix it by using `m.GetPath()` to get the true cgroup
paths.
Reported-by: Yang Shukui <yangshukui@huawei.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>