When in a non-initial user namespace you cannot update the devices
cgroup whitelist (or blacklist). The kernel won't allow it. So
detect that case and don't try.
This is a step to being able to run docker/runc containers inside a user
namespaced container.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
It's handled in `destroy()`, no need to do this in
`Apply()`. I found this because systemd cgroup didn't
do this removal and it works well.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In order to avoid problems with security regressions going unnoticed,
add some unit tests that should make sure security regressions in cgroup
path safety cause tests to fail in runC.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Ensure that path safety is maintained, this essentially reapplies
c0cad6aa5e ("cgroups: fs: fix cgroup.Parent path sanitisation"), which
was accidentally removed in 256f3a8ebc ("Add support for CgroupsPath
field").
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Modify the memory cgroup code such that kmem is not managed by Set(), in
order to allow updating of memory constraints for containers by Docker.
This also removes the need to make memory a special case cgroup.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.
One of the weird cases to deal with is systemd. Systemd sets some of the
cgroup configuration options, but not all of them. Because memory is a
special case, we need to explicitly set memory in the systemd Apply().
Otherwise, the rest can be safely re-applied in .Set() as usual.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.
Unfortunately, due to the init process being written in Go, it can spawn
an an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work properly.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Properly sanitise the --cgroup-parent path, to avoid potential issues
(as it starts creating directories and writing to files as root). In
addition, fix an infinite recursion due to incomplete base cases.
It might be a good idea to move pathClean to a separate library (which
deals with path safety concerns, so all of runC and Docker can take
advantage of it).
Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.
Unfortunately, due to the init process being written in Go, it can spawn
an an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work properly.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
This allows us to distinguish cases where a container
needs to just join the paths or also additionally
set cgroups settings. This will help in implementing
cgroupsPath support in the spec.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
The former cgroup entry is confusing, separate it to parent
and name.
Rename entry `c` to `config`.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
'parent' function is confusing with parent cgroup, it's actually
parent path, so rename it to parentPath.
The name 'data' is too common to be identified, rename it to cgroupData
which is exactly what it is.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
We have a rule that for optional cgroups, don't fail if some
of them are not mounted, but we want it fail hard when a
user specifies an option and we are unable to fulfill the
request.
Memory cgroup should also follow this rule.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Also add cpuset as the first in the list to address issues setting the
pid in any cgroup before the cpuset is populated.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
It can avoid unnecessary task migrataion, see this scenario:
- container init task is on cpu 1, and we assigned it to cpu 1,
but parent cgroup's cpuset.cpus=2
- we created the cgroup dir and inherited cpuset.cpus from parent as 2
- write container init task's pid to cgroup.procs
- [it's possibile the container init task migrated to cpu 2 here]
- set cpuset.cpus as assigned to cpu 1
- [the container init task has to be migrated back to cpu 1]
So we should set cpuset.cpus and cpuset.mems before writing pids
to cgroup.procs to aviod such problem.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This is meant to be used in retrieving the paths so an exec
process enters all the cgroup paths correctly.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Godeps: Vendor opencontainers/specs 96bcd043aa
Fix a bug where it's impossible to pass multiple devices to blkio
cgroup controller files. See https://github.com/opencontainers/runc/issues/274
Signed-off-by: Antonio Murdaca <runcom@linux.com>
TL;DR: check for IsExist(err) after a failed MkdirAll() is both
redundant and wrong -- so two reasons to remove it.
Quoting MkdirAll documentation:
> MkdirAll creates a directory named path, along with any necessary
> parents, and returns nil, or else returns an error. If path
> is already a directory, MkdirAll does nothing and returns nil.
This means two things:
1. If a directory to be created already exists, no error is
returned.
2. If the error returned is IsExist (EEXIST), it means there exists
a non-directory with the same name as MkdirAll need to use for
directory. Example: we want to MkdirAll("a/b"), but file "a"
(or "a/b") already exists, so MkdirAll fails.
The above is a theory, based on quoted documentation and my UNIX
knowledge.
3. In practice, though, current MkdirAll implementation [1] returns
ENOTDIR in most of cases described in #2, with the exception when
there is a race between MkdirAll and someone else creating the
last component of MkdirAll argument as a file. In this very case
MkdirAll() will indeed return EEXIST.
Because of #1, IsExist check after MkdirAll is not needed.
Because of #2 and #3, ignoring IsExist error is just plain wrong,
as directory we require is not created. It's cleaner to report
the error now.
Note this error is all over the tree, I guess due to copy-paste,
or trying to follow the same usage pattern as for Mkdir(),
or some not quite correct examples on the Internet.
[1] https://github.com/golang/go/blob/f9ed2f75/src/os/path.go
Signed-off-by: Kir Kolyshkin <kir@openvz.org>
Sometimes subsystem can be mounted to path like "subsystem1,subsystem2",
so we need to handle this.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
This is needed because for nested containers cgroups. Without this patch
they creating unnecessary intermediate cgroup like:
/sys/fs/cgroup/memory/system.slice/docker-9409d9f0b68fb9e9d7d532d5b3f35e7c7f9cca1312af392ae3b28436f1f2998f.scope/system.slice/docker-9409d9f0b68fb9e9d7d532d5b3f35e7c7f9cca1312af392ae3b28436f1f2998f.scope/docker/908ebcc9c13584a14322ec070bd971e0de62f126c0cd95c079acdb99990ad3a3
It is because in /proc/self/cgroup we see paths from host, and they don't
exist in container.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
In some older kernels setting swappiness fails. This happens even
when nobody tries to configure swappiness from docker UI because
we would still get some default value from host config.
With this we treat -1 value as default value (set implicitly) and skip
the enforcement of swappiness.
However from the docker UI setting an invalid value anything other than
0-100 including -1 should fail. This patch enables that fix in docker UI.
without this fix container creation with invalid value succeeds with a
default value (60) which in incorrect.
Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>