Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.
Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.
Unfortunately, due to the init process being written in Go, it can spawn
an an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work properly.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
This allows us to distinguish cases where a container
needs to just join the paths or also additionally
set cgroups settings. This will help in implementing
cgroupsPath support in the spec.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
This rather naively fixes an error observed where a processes stdio
streams are not written to when there is an error upon starting up the
process, such as when the executable doesn't exist within the
container's rootfs.
Before the "fix", when an error occurred on start, `terminate` is called
immediately, which calls `cmd.Process.Kill()`, then calling `Wait()` on
the process. In some cases when this `Kill` is called the stdio stream
have not yet been written to, causing non-deterministic output. The
error itself is properly preserved but users attached to the process
will not see this error.
With the fix it is just calling `Wait()` when an error occurs rather
than trying to `Kill()` the process first. This seems to preserve stdio.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Fix the permissions of the container's main processes STDIO when the
process is not run as the root user. This changes the permissions right
before switching to the specified user so that it's STDIO matches it's
UID and GID.
Add a test for checking that the STDIO of the process is owned by the
specified user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
I got:
```
exec_test.go:823: Mode expected to contain 'ro,nosuid,nodev,noexec': tmpfs on /sys/fs/cgroup type tmpfs (ro,seclabel,nosuid,nodev,noexec,relatime,mode=755
```wq
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
As v2.1.0 is no longer required for successful testing, do not build it in the
Dockerfile - instead just use the version Ubuntu ships.
Signed-off-by: Matthew Heon <mheon@redhat.com>
This removes the existing, native Go seccomp filter generation and replaces it
with Libseccomp. Libseccomp is a C library which provides architecture
independent generation of Seccomp filters for the Linux kernel.
This adds a dependency on v2.2.1 or above of Libseccomp.
Signed-off-by: Matthew Heon <mheon@redhat.com>
When the copyBusybox() fails, the error message should be
propagated to the caller of newRootfs().
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
It can happen if newContainer is failed. Now test shows real error from
newContainer instead of trace.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>