Commit Graph

101 Commits

Author SHA1 Message Date
Kir Kolyshkin 4b4bc995ad CreateCgroupPath: only enable needed controllers
1. Instead of enabling all available controllers, figure out which
   ones are required, and only enable those.

2. Amend all setFoo() functions to call isFooSet(). While this might
   seem unnecessary, it might actually help to uncover a bug.
   Imagine someone:
    - adds a cgroup.Resources.CpuFoo setting;
    - modifies setCpu() to apply the new setting;
    - but forgets to amend isCpuSet() accordingly <-- BUG

   In this case, a test case modifying CpuFoo will help
   to uncover the BUG. This is the reason why it's added.

This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :

 - do not return an error early if we can't enable some controllers;
 - if we fail to enable all controllers at once (usually because one
   of them can't be enabled), try enabling them one by one.

Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.

[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin bb47e35843 cgroup/systemd: reorganize
1. Rename the files
  - v1.go: cgroupv1 aka legacy;
  - v2.go: cgroupv2 aka unified hierarchy;
  - unsupported.go: when systemd is not available.

2. Move the code that is common between v1 and v2 to common.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 813cb3eb94 cgroupv2: fix fs2 cgroup init
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().

Also, cgroup should be created in Apply(), not while creating a new
manager instance.

To fix:

1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
   renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
   systemd drivers.

2. Delay available controllers map initialization to until it is needed.

With this patch:
 - NewManager() only performs minimal initialization (initializin
   m.dirPath, if not provided);
 - Apply() properly creates cgroup path, enabling the controllers;
 - m.controllers is initialized lazily on demand.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin dbeff89491 cgroupv2/systemd: privatize UnifiedManager
... and its Cgroup field. There is no sense to keep it public.

This was generated by gorename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 88c13c0713 cgroupv2: use SecureJoin in systemd driver
It seems that some paths are coming from user and are therefore
untrusted.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:20:22 -07:00
Kir Kolyshkin 9c80cd672d cgroupv2: rm legacy Paths from systemd driver
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.

To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.

End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.

While at it
 - add `rootless` flag which is passed on to fs2 manager
 - merge getv2Path() into GetUnifiedPath(), don't overwrite
   path if it is set during initialization (on Load).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:19:51 -07:00
Ted Yu 614bb96676 cgroupv2/systemd: Properly remove intermediate directory
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-13 08:32:08 -07:00
Kir Kolyshkin c86be8a2c1 cgroupv2: fix setting MemorySwap
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).

Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.

[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-07 20:45:53 -07:00
Tobias Klauser 3e678c08f9 Remove unused consts testScopeWait and testSliceWait
These are unused since commit 518c855833 ("Remove libcontainer
detection for systemd features")

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2020-04-03 21:09:43 +02:00
Mrunal Patel d05e5728aa systemd: Lazy initialize the systemd dbus connection
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Mrunal Patel 33c6125da6 systemd: Export IsSystemdRunning() function
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Kir Kolyshkin a949e4f22f cgroupv2: UnifiedManager.Apply: simplify
Remove joinCgroupsV2() function, as its name and second parameter
are misleading. Use createCgroupsv2Path() directly, do not call
getv2Path() twice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:20:00 -07:00
Kir Kolyshkin 5406833a65 cgroupv2/systemd: add getv2Path
Function getSubsystemPath(), while works for v2 unified case, is
suboptimal, as it does a few unnecessary calls.

Add a simplified version of getSubsystemPath(), called getv2Path(),
and use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:17:09 -07:00
Kir Kolyshkin ec1f957b23 cgroupv2: don't use getSubsystemPath in Apply
This code is a copy-paste from cgroupv1 systemd code. Its aim
is to check whether a subsystem is available, and skip those
that are not.

In case v2 unified hierarchy is used, getSubsystemPath never
returns "not found" error, so calling it is useless.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 13:32:34 -07:00
Kir Kolyshkin a675b5ebea cgroupv2: don't try to set kmem for systemd case
To the best of my knowledge, it has been decided to drop the kernel
memory controller from the cgroupv2 hierarchy, so "kernel memory limits"
do not exist if we're using v2 unified.

So, we need to ignore kernel memory setting. This was already done in
non-systemd case (see commit 88e8350de), let's do the same for systemd.

This fixes the following error:

> container_linux.go:349: starting container process caused "process_linux.go:306: applying cgroup configuration for process caused \"open /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/tasks: no such file or directory\""

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-25 20:00:23 -07:00
Mrunal Patel 7de5db3dad
Merge pull request #2263 from kolyshkin/nits
Assorted minor nits in libcontainer
2020-03-24 14:17:22 -07:00
Kir Kolyshkin 12dc475dd6 libcontainer: simplify createCgroupsv2Path
fmt.Sprintf is slow and is not needed here, string concatenation would
be sufficient. It is also redundant to convert []byte from string and
back, since `bytes` package now provides the same functions as `strings`.

Use Fields() instead of TrimSpace() and Split(), mainly for readability
(note Fields() is somewhat slower than Split() but here it doesn't
matter much).

Use Join() to prepend the plus signs.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 11:51:55 -07:00
Akihiro Suda 492d525e55 vendor: update go-systemd and godbus
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-16 13:26:03 +09:00
Qiang Huang 3b7e32feba
Merge pull request #2210 from Zyqsempai/2164-remove-deprecated-systemd-resources
Exchange deprecated systemd resources with the appropriate for cgroupv2
2020-02-29 10:13:55 +08:00
Kir Kolyshkin 4c5c3fb960 Support for setting systemd properties via annotations
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).

This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.

Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.

Example usage: add the following to runtime spec (config.json):

```
	"annotations": {
		"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
		"org.systemd.property.CollectMode":"'inactive-or-failed'"
	},
```

and start the container (e.g. `runc --systemd-cgroup run $ID`).

The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".

The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.

In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.

NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.

[1] https://developer.gnome.org/glib/stable/gvariant-text.html

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Boris Popovschi 5b96f314ba Exchanged deprecated systemd resources with the appropriate for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 18:09:33 +02:00
Mrunal Patel 5cc0deaf7a
Merge pull request #2169 from AkihiroSuda/split-fs
cgroup2: split fs2 from fs
2020-01-13 16:23:27 -08:00
Julio Montes 8ddd892072 libcontainer: add method to get cgroup config from cgroup Manager
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add method to get cgroup configuration from cgroup Manager to allow API users
save it to disk and restore a cgroup manager later.

fixes #2176

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
Akihiro Suda 88e8350de2 cgroup2: split fs2 from fs
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.

Inspired by containerd/cgroups#109

Fix #2157

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:42:10 +09:00
Mrunal Patel 46def4cc4c
Merge pull request #2154 from jpeach/2008-remove-static-build-tag
Remove the static_build build tag.
2019-11-04 17:10:59 -08:00
Akihiro Suda faf673ee45 cgroup2: port over eBPF device controller from crun
The implementation is based on https://github.com/containers/crun/blob/0.10.2/src/libcrun/ebpf.c

Although ebpf.c is originally licensed under LGPL-3.0-or-later, the author
Giuseppe Scrivano agreed to relicense the file in Apache License 2.0:
https://github.com/opencontainers/runc/issues/2144#issuecomment-543116397

See libcontainer/cgroups/ebpf/devicefilter/devicefilter_test.go for tested configurations.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 14:01:46 +09:00
James Peach 13919f5dfd Remove the static_build build tag.
The `static_build` build tag was introduced in e9944d0f
to remove build warnings related to systemd cgroup driver
dependencies. Since then, those dependencies have changed and
building the systemd cgroup driver no longer imports dlopen.

After this change, runc builds will always include the systemd
cgroup driver.

This fixes #2008.

Signed-off-by: James Peach <jpeach@apache.org>
2019-10-26 08:28:45 +11:00
Akihiro Suda dbd771e475 cgroup2: implement `runc ps`
Implemented `runc ps` for cgroup v2 , using a newly added method `m.GetUnifiedPath()`.
Unlike the v1  implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 01:59:24 +09:00
Giuseppe Scrivano 524cb7c318
libcontainer: add systemd.UnifiedManager
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:27 +02:00
Giuseppe Scrivano ec11136828
libcontainer, cgroups: rename systemd.Manager to LegacyManager
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:26 +02:00
Mrunal Patel 3525eddec5
Merge pull request #2117 from filbranden/detection1
Remove libcontainer detection for systemd features
2019-08-25 13:15:15 -07:00
Filipe Brandenburger 518c855833 Remove libcontainer detection for systemd features
Transient units (and transient slice units) have been available for quite a
long time and RHEL 7 with systemd v219 (likely the oldest OS we care about at
this point) supports that. A system running a systemd without these features is
likely to break a lot of other stuff that runc/libcontainer care about.

Regarding delegated slices, modern systemd doesn't allow it and
runc/libcontainer run fine on it, so we might as well just stop requesting it
on older versions of systemd which allowed it. (Those versions never really
changed behavior significantly when that option was passed anyways.)

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-08-22 21:53:24 -07:00
Filipe Brandenburger 588f040a77 Avoid the dependency on cgo through go-systemd/util package
This dependency is only needed in package "github.com/coreos/go-systemd/util"
and we only use it for IsRunningSystemd(), which is a simple Go function that
just stats a file.

Let's just borrow it here, so we remove the dependency and can remove that
package from vendored build.

This also removes dependencies on dlopen and on trying to find libsystemd.so
or libsystemd-login.so in the system.

Tested that this still builds and works as expected.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-08-22 21:07:24 -07:00
Filipe Brandenburger 46351eb3d1 Move systemd.Manager initialization into a function in that module
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.

Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-05-01 13:22:19 -07:00
Filipe Brandenburger cd41feb46b Remove detection for scope properties, which have always been broken
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)

This can be seen in journal logs whenever a container is started with
libpod:

  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.

Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-02-11 16:05:37 -08:00
Giuseppe Scrivano f01923376d
systemd: fix setting kernel memory limit
since commit df3fa115f9 it is not
possible to set a kernel memory limit when using the systemd cgroups
backend as we use cgroup.Apply twice.

Skip enabling kernel memory if there are already tasks in the cgroup.

Without this patch, runc fails with:

container_linux.go:344: starting container process caused
"process_linux.go:311: applying cgroup configuration for process
caused \"failed to set memory.kmem.limit_in_bytes, because either
tasks have already joined this cgroup or it has children\""

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-01-10 11:33:50 +01:00
Michael Crosby 76520a4bf0
Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint
Respect container's cgroup path
2018-11-16 14:06:54 -05:00
Sergio Lopez 5c6b9c3c1c libcontainer: map PidsLimit to systemd's TasksMax property
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.

This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):

 # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 111
 # systemctl disable --now postfix
 # systemctl enable --now postfix
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 max

This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2018-10-24 17:20:27 +02:00
Danail Branekov a1d5398afa Respect container's cgroup path
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts

Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
2018-09-25 17:43:36 +01:00
Derek Carr b515963c10 systemd cpu quota ignores -1
Signed-off-by: Derek Carr <decarr@redhat.com>
2018-05-23 14:28:39 -04:00
Filipe Brandenburger 165ee45334 Make channel for StartTransientUnit buffered
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.

This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-14 08:49:50 -07:00
Filipe Brandenburger 0e16bd9b53 Detect whether Delegate is available on both slices and scopes
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.

Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-10 11:42:55 -07:00
Filipe Brandenburger 8ab251f298 Fix systemd.Apply() to check for DBus error before waiting on a channel.
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)

Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-09 11:51:59 -07:00
vikaschoudhary16 04e95b526d Add timeout while waiting for StartTransinetUnit completion signal from dbus
Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-03-07 05:11:38 -05:00
Michael Crosby 595bea022f
Merge pull request #1722 from ravisantoshgudimetla/fix-systemd-path
fix systemd slice expansion so that it could be consumed by cAdvisor
2018-02-20 09:59:24 -05:00
ravisantoshgudimetla 7019e1de7b fix systemd slice expansion so that it could be consumed by cAdvisor
Signed-off-by: ravisantoshgudimetla <ravisantoshgudimetla@gmail.com>
2018-02-18 21:32:39 -05:00
vikaschoudhary16 d5b4a3eddb Fix race against systemd
- T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298)
- T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner.
- T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code

Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-01-08 09:37:26 -05:00
Seth Jennings bca53e7b49 systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling
Signed-off-by: Seth Jennings <sjenning@redhat.com>
2017-11-15 20:20:06 -06:00
Yong Tang e9944d0f4c Disable systemd in static build
This fix tries to address the warnings caused by static build
with go 1.9. As systemd needs dlopen/dlclose, the following warnings
will be generated for static build in go 1.9:
```
root@f4b077232050:/go/src/github.com/opencontainers/runc# make static
CGO_ENABLED=1 go build  -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc .
/tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen':
/tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
```

This fix disables systemd when `static_build` flag is on (apply_nosystemd.go
is used instead).

This fix also fixes a small bug in `apply_nosystemd.go` for return value.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2017-09-11 18:38:22 +00:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00