Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.
Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).
Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.
Unfortunately, due to the init process being written in Go, it can spawn
an an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work properly.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
syscall.NLA_HDRLEN is not in gccgo (as of 5.3), so in the meantime
use the #defines taken from linux/netlink.h.
See https://github.com/golang/go/issues/13629
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
This allows us to distinguish cases where a container
needs to just join the paths or also additionally
set cgroups settings. This will help in implementing
cgroupsPath support in the spec.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
replace passing of pid and console path via environment variable with passing
them with netlink message via an established pipe.
this change requires us to set _LIBCONTAINER_INITTYPE and
_LIBCONTAINER_INITPIPE as the env environment of the bootstrap process as we
only send the bootstrap data for setns process right now. When init and setns
bootstrap process are unified (i.e., init use nsexec instead of Go to clone new
process), we can remove _LIBCONTAINER_INITTYPE.
Note:
- we read nlmsghdr first before reading the content so we can get the total
length of the payload and allocate buffer properly instead of allocating
one large buffer.
- check read bytes vs the wanted number. It's an error if we failed to read
the desired number of bytes from the pipe into the buffer.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
This patch fixes the following go vet warnings:
```
libcontainer/network_linux.go:96: github.com/vishvananda/netlink.Device
composite literal uses unkeyed fields
libcontainer/network_linux.go:114: github.com/vishvananda/netlink.Device
composite literal uses unkeyed fields
```
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
add bootstrap data to setns process. If we have any bootstrap data then copy it
to the bootstrap process (i.e. nsexec) using the sync pipe. This will allow us
to eventually replace environment variable usage with more structured data
to setup namespaces, write pid/gid map, setgroup etc.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
Enables launching userns containers by catching EPERM errors for writing
to devices cgroups, and for mknod invocations.
Signed-off-by: Abin Shahab <ashahab@altiscale.com>
When starting and quering for pids a container can start and exit before
this is set. So set the opts after the process is started and while
libcontainer still has the container's process blocking on the pipe.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The former cgroup entry is confusing, separate it to parent
and name.
Rename entry `c` to `config`.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
'parent' function is confusing with parent cgroup, it's actually
parent path, so rename it to parentPath.
The name 'data' is too common to be identified, rename it to cgroupData
which is exactly what it is.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>