This will help runc's init to not spawn many threads on large systems when
launched with max procs by the caller.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.
Without the 'all' flag, ignore duplicate subsystems after the first.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
These sysctls are namespaced by CLONE_NEWUTS, and we need to use
"kernel.domainname" if we want users to be able to set an NIS domainname
on Linux. However we disallow "kernel.hostname" because it would
conflict with the "hostname" field and cause confusion (but we include a
helpful message to make it clearer to the user).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
It turns out that MIPS uses uint32 in the device number returned by
stat(2), so explicitly wrap everything to make the compiler happy. I
really wish that Go had C-like numeric type promotion.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This fixes the following compilation error on 32bit ARM:
```
$ GOARCH=arm GOARCH=6 go build ./libcontainer/system/
libcontainer/system/linux.go:119:89: constant 4294967295 overflows int
```
Signed-off-by: Tibor Vass <tibor@docker.com>
When running in a new unserNS as root, don't require a mapping to be
present in the configuration file. We are already skipping the test
for a new userns to be present.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
We need to lock the threads for the SetProcessLabel to work,
should also call SetProcessLabel("") after the container starts
to go back to the default SELinux behaviour.
Once you call SetProcessLabel, then any process executed by runc
will run with this label, even if the process is for setup rather
then the container.
It is always safest to call the SELinux calls just before the exec of the
container, so that other processes do not get started with the incorrect label.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
There is a race in runc exec when the init process stops just before
the check for the container status. It is then wrongly assumed that
we are trying to start an init process instead of an exec process.
This commit add an Init field to libcontainer Process to distinguish
between init and exec processes to prevent this race.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.
This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
When joining an existing namespace, don't default to configuring a
loopback interface in that namespace.
Its creator should have done that, and we don't want to fail to create
the container when we don't have sufficient privileges to configure the
network namespace.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.
Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)
Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.
The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
There is no reason to set the container state to "running" as a
temporary value when exec'ing a process on a container in "created"
state. The problem doing this is that consumers of the libcontainer
library might use it by keeping pointers in memory. In this case,
the container state will indicate that the container is running, which
is wrong, and this will end up with a failure on the next action
because the check for the container state transition will complain.
Fixes#1767
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Previously if oomScoreAdj was not set in config.json we would implicitly
set oom_score_adj to 0. This is not allowed according to the spec:
> If oomScoreAdj is not set, the runtime MUST NOT change the value of
> oom_score_adj.
Change this so that we do not modify oom_score_adj if oomScoreAdj is not
present in the configuration. While this modifies our internal
configuration types, the on-disk format is still compatible.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.
An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.
Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Currently if a confined container process tries to list these directories
AVC's are generated because they are labeled with external labels. Adding
the mountlabel will remove these AVC's.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Currently Manager accepts nil cgroups when calling Apply, but it will panic then trying to call Destroy with the same config.
Signed-off-by: Denys Smirnov <denys@sourced.tech>
The function is called even if the usernamespace is not set.
This results having wrong uid/gid set on devices.
This fix add a test to check if usernamespace is set befor calling
setupUserNamespace.
Fixes#1742
Signed-off-by: Julien Lavesque <julien.lavesque@gmail.com>
gocapability has supported 0 as "the current PID" since
syndtr/gocapability@5e7cce49 (Allow to use the zero value for pid to
operate with the current task, 2015-01-15, syndtr/gocapability#2).
libcontainer was ported to that approach in 444cc298 (namespaces:
allow to use pid namespace without mount namespace, 2015-01-27,
docker/libcontainer#358), but the change was clobbered by 22df5551
(Merge branch 'master' into api, 2015-02-19, docker/libcontainer#388)
which landed via 5b73860e (Merge pull request #388 from docker/api,
2015-02-19, docker/libcontainer#388). This commit restores the
changes from 444cc298.
Signed-off-by: W. Trevor King <wking@tremily.us>
The helper DRYs up the transition tests and makes it easy to get
complete coverage for invalid transitions.
I'm also using t.Run() for subtests. Run() is new in Go 1.7 [1], but
runc dropped support for 1.6 back in e773f96b (update go version at
travis-ci, 2017-02-20, #1335).
[1]: https://blog.golang.org/subtests
Signed-off-by: W. Trevor King <wking@tremily.us>
Technically, this change should not be necessary, as the kernel
documentation claims that if you call clone(flags|CLONE_NEWUSER), the
new user namespace will be the owner of all other namespaces created in
@flags. Unfortunately this isn't always the case, due to various
additional semantics and kernel bugs.
One particular instance is SELinux, which acts very strangely towards
the IPC namespace and mqueue. If you unshare the IPC namespace *before*
you map a user in the user namespace, the IPC namespace's internal
kern-mount for mqueue will be labelled incorrectly and the container
won't be able to access it. The only way of solving this is to unshare
IPC *after* the user has been mapped and we have changed to that user.
I've also heard of this happening to the NET namespace while talking to
some LXC folks, though I haven't personally seen that issue.
This change matches our handling of user namespaces to be the same as
how LXC handles these problems.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When starting a container with `runc start` or `runc run`, the stub
process (runc[2:INIT]) opens a fifo for writing. Its parent runc process
will open the same fifo for reading. In this way, they synchronize.
If the stub process exits at the wrong time, the parent runc process
will block forever.
This can happen when racing 2 runc operations against each other: `runc
run/start`, and `runc delete`. It could also happen for other reasons,
e.g. the kernel's OOM killer may select the stub process.
This commit resolves this race by racing the opening of the exec fifo
from the runc parent process against the stub process exiting. If the
stub process exits before we open the fifo, we return an error.
Another solution is to wait on the stub process. However, it seems it
would require more refactoring to avoid calling wait multiple times on
the same process, which is an error.
Signed-off-by: Craig Furman <cfurman@pivotal.io>
Annotations weren't passed to hooks. This patch fixes that by passing
annotations to stdin for hooks.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
libapparmor is integrated in libcontainer using cgo but is only used to
call a single function: aa_change_onexec. It turns out this function is
simple enough (writing a string to a file in /proc/<n>/attr/...) to be
re-implemented locally in libcontainer in plain Go.
This allows to drop the dependency on libapparmor and the corresponding
cgo integration.
Fixes#1674
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
When a subreaper is enabled, it might expect to reap a process and
retrieve its exit code. That's the reason why this patch is giving
the possibility to define the usage of a subreaper as a consumer of
libcontainer. Relying on this information, libcontainer will not
wait for signalled processes in case a subreaper has been set.
Fixes#1677
Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
TestGetContainerStateAfterUpdate creates its state.json file on the current
directory which turns out to be the host runc directory. Thus whenever
the test completes it leaves the state.json file behind thus
a) poluting the local git repository
b) changing the host file system violating the principle of doing
everything in an isolated container environment
This change would create a new temporary (in-container) directory and use it as
linuxContainer.root
Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
runc currently only support Linux platform, and since we dont intend to expose
the support to other platform, removing all other platforms placeholder code.
`libcontainer/configs` still being used in
https://github.com/moby/moby/blob/master/daemon/daemon_windows.go so
keeping it for now.
After this, we probably should also rename files to drop linux suffices
if possible.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
runc is not supported on FreeBSD, so remove all FreeBSD specific bits.
As suggested by @crosbymichael in #1653
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Selinux related code has been moved to the selinux package
(https://github.com/opencontainers/selinux) and therefore xattr related
code can be deleted from libcontainer
Signed-off-by: Danail Branekov <danailster@gmail.com>
When running runc tests with temp directory with size 500M copying
busybox without preserving hardlinks causes the folder to inflate to
roughly 330M. Copying busybox twice in certain tests causes the /tmp
directory to overfill. Using `-a` preserves links which busybox uses to
implement its choice of binary to run.
Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
runc shouldn't depend on docker and be more self-contained.
Removing github.com/pkg/symlink dep is the first step to not depend on docker anymore
Signed-off-by: Vincent Demeester <vincent@sbr.pm>
Previously we would handle the "unmapped stdio" case by just doing a
simple check, however this didn't handle cases where the overflow_uid
was actually mapped in the user namespace. Instead of doing some
userspace checks, just try to do the fchown(2) and ignore EINVAL
(unmapped) or EPERM (lacking privilege over inode) errors.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Multiple conditions were previously allowed to be placed upon the
same syscall argument. Restore this behavior.
Signed-off-by: Matthew Heon <mheon@redhat.com>