We need support for read/only mounts in SELinux to allow a bunch of
containers to share the same read/only image. In order to do this
we need a new label which allows container processes to read/execute
all files but not write them.
Existing mount label is either shared write or private write. This
label is shared read/execute.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
At some point InitLabels was changed to look for SecuritOptions
separated by a ":" rather then an "=", but DupSecOpt was never
changed to match this default.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
If copyup is specified for a tmpfs mount, then the contents of the
underlying directory are copied into the tmpfs mounted over it.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.
This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.
In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more than we can do.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This avoids us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In order to mount root filesystems inside the container's mount
namespace as part of the spec we need to have the ability to do a bind
mount to / as the destination.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Since Linux 4.3 ambient capabilities are available. If set these allow unprivileged child
processes to inherit capabilities, while at present there is no means to set capabilities
on non root processes, other than via filesystem capabilities which are not usually
supported in image formats.
With ambient capabilities non root processes can be given capabilities as well, and so
the main reason to use root in containers goes away, and capabilities work as expected.
The code falls back to the existing behaviour if ambient capabilities are not supported.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
grep -r "range map" showw 3 parts use map to
range enum types, use slice instead can get
better performance and less memory usage.
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
For example, the /sys/firmware directory should be masked because it can contain some sensitive files:
- /sys/firmware/acpi/tables/{SLIC,MSDM}: Windows license information:
- /sys/firmware/ibft/target0/chap-secret: iSCSI CHAP secret
Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
cgroupData.join method using `WriteCgroupProc` to place the pid into
the proc file, it can avoid attach any pid to the cgroup if -1 is
specified as a pid.
so, replace `writeFile` with `WriteCgroupProc` like `cpuset.go`'s
ApplyDir method.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
if a container state is running or created, the container.Pause()
method can set the state to pausing, and then paused.
this patch update the comment, so it can be consistent with the code.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
Currently if a user does a command like
docker: Error response from daemon: operation not supported.
With this fix they should see a much more informative error message.
docker run -ti -v /proc:/proc:Z fedora sh
docker: Error response from daemon: SELinux Relabeling of /proc is not allowed: operation not supported.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Error sent from child process is already genericError, if
we don't allow recrusive generic error, we won't get any
cause infomation from parent process.
Before, we got:
WARN[0000] exit status 1
ERRO[0000] operation not permitted
After, we got:
WARN[0000] exit status 1
ERRO[0000] container_linux.go:247: starting container process caused "process_linux.go:359: container init caused \"operation not permitted\""
it's not pretty but useful for detecting root causes.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This allows older state files to be loaded without the unmarshal error
of the string to int conversion.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
After #1009, we don't always set `cgroup.Paths`, so
`getCgroupPath()` will return wrong cgroup path because
it'll take current process's cgroup as the parent, which
would be wrong when we try to find the cgroup path in
`runc ps` and `runc kill`.
Fix it by using `m.GetPath()` to get the true cgroup
paths.
Reported-by: Yang Shukui <yangshukui@huawei.com>
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Alternative of #895 , part of #892
The intension of current behavior if to create cgroup in
parent cgroup of current process, but we did this in a
wrong way, we used devices cgroup path of current process
as the default parent path for all subsystems, this is
wrong because we don't always have the same cgroup path
for all subsystems.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Signed-off-by: rajasec <rajasec79@gmail.com>
Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>
Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>
Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>
This just moves everything to one function so we don't have to pass a
bunch of things to functions when there's no real benefit. It also makes
the API nicer.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The original SETUID takes a 16 bit UID. Linux 2.4 introduced a new
syscall, SETUID32, with support for 32 bit UIDs. The setgid wrapper
already uses SETGID32.
Signed-off-by: Carl Henrik Lunde <chlunde@ifi.uio.no>
TestCaptureTestFunc failed in localunittest:
# make localunittest
=== RUN TestCaptureTestFunc
--- FAIL: TestCaptureTestFunc (0.00s)
capture_test.go:26: expected package "github.com/opencontainers/runc/libcontainer/stacktrace" but received "_/root/runc/libcontainer/stacktrace"
#
Reason: the path for stacktrace is a fixed string which
only valid for container environment.
And we can switch to relative path to make both in-container
and out-of-container test works.
After patch:
# make localunittest
=== RUN TestCaptureTestFunc
--- PASS: TestCaptureTestFunc (0.00s)
#
Signed-off-by: Zhao Lei <zhaolei@cn.fujitsu.com>
Removed a lot of clutter, improved the style of the code, removed
unnecessary complexity. In addition, made errors unique by making bail()
exit with a unique error code. Most of this code comes from the current
state of the rootless containers branch.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This device is not required by the OCI spec.
The rationale for this was linked to https://github.com/docker/docker/issues/2393
So a non functional /dev/fuse was created, and actual fuse use still is
required to add the device explicitly. However even old versions of the JVM
on Ubuntu 12.04 no longer require the fuse package, and this is all not
needed.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
It's possible that `cmd.Process` is still nil when we reach timeout.
Start creates `Process` field synchronously, and there is no way to such
race.
Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>
This avoid the goimports tool from remove the libcontainer/keys import line due the package name is diferent from folder name
Signed-off-by: Guilherme Rezende <guilhermebr@gmail.com>