Commit Graph

226 Commits

Author SHA1 Message Date
Filipe Brandenburger 46351eb3d1 Move systemd.Manager initialization into a function in that module
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.

Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-05-01 13:22:19 -07:00
Filipe Brandenburger cd41feb46b Remove detection for scope properties, which have always been broken
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)

This can be seen in journal logs whenever a container is started with
libpod:

  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.

Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-02-11 16:05:37 -08:00
Mrunal Patel 4e4c907193
Merge pull request #1950 from cloudfoundry-incubator/enter-pid-race
Resilience in adding of exec tasks to cgroups
2019-02-01 13:18:16 -08:00
Giuseppe Scrivano f01923376d
systemd: fix setting kernel memory limit
since commit df3fa115f9 it is not
possible to set a kernel memory limit when using the systemd cgroups
backend as we use cgroup.Apply twice.

Skip enabling kernel memory if there are already tasks in the cgroup.

Without this patch, runc fails with:

container_linux.go:344: starting container process caused
"process_linux.go:311: applying cgroup configuration for process
caused \"failed to set memory.kmem.limit_in_bytes, because either
tasks have already joined this cgroup or it has children\""

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-01-10 11:33:50 +01:00
Tom Godkin bdf3524b34 Retry adding pids to cgroups when EINVAL occurs
The kernel will sometimes return EINVAL when writing a pid to a
cgroup.procs file. It does so when the task being added still has the
state TASK_NEW.

See: https://elixir.bootlin.com/linux/v4.8/source/kernel/sched/core.c#L8286

Co-authored-by: Danail Branekov <danailster@gmail.com>

Signed-off-by: Tom Godkin <tgodkin@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
2018-12-17 15:34:47 +00:00
JoeWrightss 769d6c4a75 Fix some typos
Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>
2018-12-09 23:52:54 +08:00
Aleksa Sarai 8a4629f7b5
cgroups: nokmem: error out on explicitly-set kmemcg limits
When built with nokmem we explicitly are disabling support for kmemcg,
but it is a strict specification requirement that if we cannot fulfil an
aspect of the container configuration that we error out.

Completely ignoring explicitly-requested kmemcg limits with nokmem would
undoubtably lead to problems.

Fixes: 6a2c155968 ("libcontainer: ability to compile without kmem")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-12-01 14:31:35 +11:00
Michael Crosby 76520a4bf0
Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint
Respect container's cgroup path
2018-11-16 14:06:54 -05:00
Mrunal Patel 4769cdf607
Merge pull request #1916 from crosbymichael/cgns
Add support for cgroup namespace
2018-11-13 12:21:38 -08:00
Mrunal Patel f000fe11ec
Merge pull request #1917 from slp/master
libcontainer: map PidsLimit to systemd's TasksMax property
2018-11-13 12:21:23 -08:00
Michael Crosby aa7917b751
Merge pull request #1911 from theSuess/linter-fixes
Various cleanups to address linter issues
2018-11-13 12:13:34 -05:00
Kir Kolyshkin 6a2c155968 libcontainer: ability to compile without kmem
Commit fe898e7862 (PR #1350) enables kernel memory accounting
for all cgroups created by libcontainer -- even if kmem limit is
not configured.

Kernel memory accounting is known to be broken in some kernels,
specifically the ones from RHEL7 (including RHEL 7.5). Those
kernels do not support kernel memory reclaim, and are prone to
oopses. Unconditionally enabling kmem acct on such kernels lead
to bugs, such as

* https://github.com/opencontainers/runc/issues/1725
* https://github.com/kubernetes/kubernetes/issues/61937
* https://github.com/moby/moby/issues/29638

This commit gives a way to compile runc without kernel memory setting
support. To do so, use something like

	make BUILDTAGS="seccomp nokmem"

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-10-31 20:35:51 -07:00
Yuanhong Peng df3fa115f9 Add support for cgroup namespace
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-31 10:51:43 -04:00
Sergio Lopez 5c6b9c3c1c libcontainer: map PidsLimit to systemd's TasksMax property
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.

This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):

 # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 111
 # systemctl disable --now postfix
 # systemctl enable --now postfix
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 max

This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2018-10-24 17:20:27 +02:00
Mrunal Patel a00bf01908
Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Dominik Süß 0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Danail Branekov a1d5398afa Respect container's cgroup path
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts

Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
2018-09-25 17:43:36 +01:00
Aleksa Sarai 578fe65e4f
merge branch 'pr-1817'
Fix duplicate entries and missing entries in getCgroupMountsHelper
  Add test for testing cgroup mounts on bedrock linux
  Stop relying on number of subsystems for cgroups

LGTMs: @crosbymichael @cyphar
Closes #1817
2018-09-19 19:48:17 +10:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Yan Zhu feb90346e0 doc: fix typo
Signed-off-by: Yan Zhu <yanzhu@alauda.io>
2018-09-07 11:58:59 +08:00
Jay Kamat a2faaa1317
Fix duplicate entries and missing entries in getCgroupMountsHelper
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
2018-07-31 20:12:18 -07:00
Jay Kamat e5a7c61f3c Add test for testing cgroup mounts on bedrock linux
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests

Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:01:07 +01:00
Daniel Dao 5ee0648bfb Stop relying on number of subsystems for cgroups
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.

Without the 'all' flag, ignore duplicate subsystems after the first.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:00:58 +01:00
Aleksa Sarai 939d5a3753
cgroup: clean up isIgnorableError for skippable EROFS
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-05-25 11:31:41 +10:00
Qiang Huang dd67ab10d7
Merge pull request #1759 from cyphar/rootless-erofs-as-eperm
rootless: cgroup: treat EROFS as a skippable error
2018-05-25 09:24:16 +08:00
Derek Carr b515963c10 systemd cpu quota ignores -1
Signed-off-by: Derek Carr <decarr@redhat.com>
2018-05-23 14:28:39 -04:00
Filipe Brandenburger 165ee45334 Make channel for StartTransientUnit buffered
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.

This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-14 08:49:50 -07:00
Filipe Brandenburger 0e16bd9b53 Detect whether Delegate is available on both slices and scopes
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.

Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-10 11:42:55 -07:00
Filipe Brandenburger 8ab251f298 Fix systemd.Apply() to check for DBus error before waiting on a channel.
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)

Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-09 11:51:59 -07:00
Aleksa Sarai 03e585985f
rootless: cgroup: treat EROFS as a skippable error
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.

An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.

Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Qiang Huang 9facb87f87
Merge pull request #1754 from vikaschoudhary16/add-timeout
Add timeout while waiting for StartTransinetUnit completion signal
2018-03-08 09:09:34 +08:00
vikaschoudhary16 04e95b526d Add timeout while waiting for StartTransinetUnit completion signal from dbus
Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-03-07 05:11:38 -05:00
Denys Smirnov 3d26fc3fd7 cgroups/fs: fix NPE on Destroy than no cgroups are set
Currently Manager accepts nil cgroups when calling Apply, but it will panic then trying to call Destroy with the same config.

Signed-off-by: Denys Smirnov <denys@sourced.tech>
2018-03-06 23:31:31 +01:00
Michael Crosby 595bea022f
Merge pull request #1722 from ravisantoshgudimetla/fix-systemd-path
fix systemd slice expansion so that it could be consumed by cAdvisor
2018-02-20 09:59:24 -05:00
ravisantoshgudimetla 7019e1de7b fix systemd slice expansion so that it could be consumed by cAdvisor
Signed-off-by: ravisantoshgudimetla <ravisantoshgudimetla@gmail.com>
2018-02-18 21:32:39 -05:00
vikaschoudhary16 d5b4a3eddb Fix race against systemd
- T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298)
- T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner.
- T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code

Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-01-08 09:37:26 -05:00
Seth Jennings bca53e7b49 systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling
Signed-off-by: Seth Jennings <sjenning@redhat.com>
2017-11-15 20:20:06 -06:00
Michael Crosby ff4481dbf6 Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups
Support cgroups with limits as rootless
2017-10-16 12:03:49 -04:00
Sebastien Boeuf acb93c9c62 libcontainer: cgroups: Write freezer state after every state check
This commit ensures we write the expected freezer cgroup state after
every state check, in case the state check does not give the expected
result. This can happen when a new task is created and prevents the
whole cgroup to be FROZEN, leaving the state into FREEZING instead.

This patch prevents the case of an infinite loop to happen.

Fixes https://github.com/opencontainers/runc/issues/1609

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-10-12 07:07:28 -07:00
Will Martin ca4f427af1 Support cgroups with limits as rootless
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
2017-10-05 11:22:54 +01:00
Yong Tang e9944d0f4c Disable systemd in static build
This fix tries to address the warnings caused by static build
with go 1.9. As systemd needs dlopen/dlclose, the following warnings
will be generated for static build in go 1.9:
```
root@f4b077232050:/go/src/github.com/opencontainers/runc# make static
CGO_ENABLED=1 go build  -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc .
/tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen':
/tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
```

This fix disables systemd when `static_build` flag is on (apply_nosystemd.go
is used instead).

This fix also fixes a small bug in `apply_nosystemd.go` for return value.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2017-09-11 18:38:22 +00:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00
Michael Crosby 882d8eaba6 Merge pull request #1537 from tklauser/staticcheck
Fix issues found by staticcheck
2017-08-02 09:52:11 -04:00
Tobias Klauser e4e56cb6d8 libcontainer: remove ineffective break statements
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:39 +02:00
Steven Hartland ee4f68e302 Updated logrus to v1
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.

This includes a manual edit of the docker term package to also correct the name there too.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-07-19 15:20:56 +00:00
Daniel, Dao Quang Minh 7139b61f7f Merge pull request #1378 from derekwaynecarr/expose_use_hierarchy
Expose memory.use_hierarchy in MemoryStats
2017-06-30 16:08:21 +01:00
Justin Cormack 3d9074ead3 Update memory specs to use int64 not uint64
replace #1492 #1494
fix #1422

Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-06-27 12:16:07 +01:00
Daniel, Dao Quang Minh 67bd2ab554 Merge pull request #1442 from clnperez/libcontainer-sys-unix
Move libcontainer to x/sys/unix
2017-05-26 12:18:33 +01:00
Michael Crosby 18cd7e06f7 Merge pull request #1372 from cloudfoundry-incubator/cpuset-mount-root
Handle container creation when cgroups have already been mounted in another location
2017-05-25 09:53:57 -07:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00