The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).
Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.
[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove joinCgroupsV2() function, as its name and second parameter
are misleading. Use createCgroupsv2Path() directly, do not call
getv2Path() twice.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Function getSubsystemPath(), while works for v2 unified case, is
suboptimal, as it does a few unnecessary calls.
Add a simplified version of getSubsystemPath(), called getv2Path(),
and use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This code is a copy-paste from cgroupv1 systemd code. Its aim
is to check whether a subsystem is available, and skip those
that are not.
In case v2 unified hierarchy is used, getSubsystemPath never
returns "not found" error, so calling it is useless.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
To the best of my knowledge, it has been decided to drop the kernel
memory controller from the cgroupv2 hierarchy, so "kernel memory limits"
do not exist if we're using v2 unified.
So, we need to ignore kernel memory setting. This was already done in
non-systemd case (see commit 88e8350de), let's do the same for systemd.
This fixes the following error:
> container_linux.go:349: starting container process caused "process_linux.go:306: applying cgroup configuration for process caused \"open /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/tasks: no such file or directory\""
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
fmt.Sprintf is slow and is not needed here, string concatenation would
be sufficient. It is also redundant to convert []byte from string and
back, since `bytes` package now provides the same functions as `strings`.
Use Fields() instead of TrimSpace() and Split(), mainly for readability
(note Fields() is somewhat slower than Split() but here it doesn't
matter much).
Use Join() to prepend the plus signs.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).
This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.
Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.
Example usage: add the following to runtime spec (config.json):
```
"annotations": {
"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
"org.systemd.property.CollectMode":"'inactive-or-failed'"
},
```
and start the container (e.g. `runc --systemd-cgroup run $ID`).
The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".
The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.
In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.
NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.
[1] https://developer.gnome.org/glib/stable/gvariant-text.html
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add method to get cgroup configuration from cgroup Manager to allow API users
save it to disk and restore a cgroup manager later.
fixes#2176
Signed-off-by: Julio Montes <julio.montes@intel.com>
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.
Inspired by containerd/cgroups#109
Fix#2157
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The `static_build` build tag was introduced in e9944d0f
to remove build warnings related to systemd cgroup driver
dependencies. Since then, those dependencies have changed and
building the systemd cgroup driver no longer imports dlopen.
After this change, runc builds will always include the systemd
cgroup driver.
This fixes#2008.
Signed-off-by: James Peach <jpeach@apache.org>
Implemented `runc ps` for cgroup v2 , using a newly added method `m.GetUnifiedPath()`.
Unlike the v1 implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Transient units (and transient slice units) have been available for quite a
long time and RHEL 7 with systemd v219 (likely the oldest OS we care about at
this point) supports that. A system running a systemd without these features is
likely to break a lot of other stuff that runc/libcontainer care about.
Regarding delegated slices, modern systemd doesn't allow it and
runc/libcontainer run fine on it, so we might as well just stop requesting it
on older versions of systemd which allowed it. (Those versions never really
changed behavior significantly when that option was passed anyways.)
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
This dependency is only needed in package "github.com/coreos/go-systemd/util"
and we only use it for IsRunningSystemd(), which is a simple Go function that
just stats a file.
Let's just borrow it here, so we remove the dependency and can remove that
package from vendored build.
This also removes dependencies on dlopen and on trying to find libsystemd.so
or libsystemd-login.so in the system.
Tested that this still builds and works as expected.
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.
Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)
This can be seen in journal logs whenever a container is started with
libpod:
Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
since commit df3fa115f9 it is not
possible to set a kernel memory limit when using the systemd cgroups
backend as we use cgroup.Apply twice.
Skip enabling kernel memory if there are already tasks in the cgroup.
Without this patch, runc fails with:
container_linux.go:344: starting container process caused
"process_linux.go:311: applying cgroup configuration for process
caused \"failed to set memory.kmem.limit_in_bytes, because either
tasks have already joined this cgroup or it has children\""
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.
This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):
# CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
# cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
111
# systemctl disable --now postfix
# systemctl enable --now postfix
# cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
max
This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.
Signed-off-by: Sergio Lopez <slp@redhat.com>
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts
Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.
This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.
Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)
Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.
The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
This fix tries to address the warnings caused by static build
with go 1.9. As systemd needs dlopen/dlclose, the following warnings
will be generated for static build in go 1.9:
```
root@f4b077232050:/go/src/github.com/opencontainers/runc# make static
CGO_ENABLED=1 go build -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc .
/tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen':
/tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
```
This fix disables systemd when `static_build` flag is on (apply_nosystemd.go
is used instead).
This fix also fixes a small bug in `apply_nosystemd.go` for return value.
Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
Fixes: #1557
I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.
Found with honnef.co/go/tools/cmd/staticcheck
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>