Some hierarchies were created directly by .Apply() on top of
systemd-managed cgroups. systemd doesn't manage these and, as a result,
we leak these cgroups.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The result of cgroupv1.FindCgroupMountpoint() call (which is relatively
expensive) is only used in case raw.innerPath is absolute, so it only
makes sense to call it in that case.
This drastically reduces the number of calls to FindCgroupMountpoint
during container start (from 116 to 62 in my setup).
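A minimal sketch of the idea, not the actual runc code: findMountpoint
and subsystemPath are hypothetical stand-ins (the former for
cgroupv1.FindCgroupMountpoint), and the path layout is an assumption
made for illustration. The lookup is performed only in the branch that
uses its result.

```go
package main

import (
	"fmt"
	"path"
)

// lookups counts calls to the stand-in lookup, to make the saving visible.
var lookups int

// findMountpoint is a hypothetical stand-in for the relatively expensive
// cgroupv1.FindCgroupMountpoint() call; it just returns a fixed path.
func findMountpoint(subsystem string) (string, error) {
	lookups++
	return "/sys/fs/cgroup/" + subsystem, nil
}

// subsystemPath resolves the cgroup path for a subsystem. The mountpoint
// lookup happens only when innerPath is absolute, which is the one case
// where its result is used.
func subsystemPath(root, innerPath, subsystem string) (string, error) {
	if path.IsAbs(innerPath) {
		mnt, err := findMountpoint(subsystem)
		if err != nil {
			return "", err
		}
		return path.Join(mnt, innerPath), nil
	}
	// Relative path: no mountpoint lookup is needed.
	return path.Join(root, subsystem, innerPath), nil
}

func main() {
	p, _ := subsystemPath("/sys/fs/cgroup", "kubepods/pod1", "cpu")
	fmt.Println(p, "lookups:", lookups)
	p, _ = subsystemPath("/sys/fs/cgroup", "/kubepods/pod1", "cpu")
	fmt.Println(p, "lookups:", lookups)
}
```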
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In here, defer looks like overkill, since the code is very simple and
we already have an error path.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.
The problem (for cgroupv2, which uses eBPF for device configuration) is
that the hard requirement to have the devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.
Work around this by adding a SkipDevices flag to Resources.
A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).
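A rough sketch of how such a flag and check could fit together. Only
the SkipDevices name comes from this commit; the pared-down Resources
type, applyDevices, checkStart, and the error text are assumptions made
for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Resources is a pared-down, hypothetical stand-in for the real struct;
// only the SkipDevices flag is taken from the commit message.
type Resources struct {
	SkipDevices bool
}

// errSkipDevices uses an illustrative message, not necessarily the real one.
var errSkipDevices = errors.New("can't start container with SkipDevices set")

// applyDevices is a placeholder for the device cgroup setup.
// It reports whether device rules were applied.
func applyDevices(r *Resources) bool {
	if r.SkipDevices {
		// Parent cgroups such as kubepods skip device configuration,
		// so no eBPF program is attached (or leaked) here.
		return false
	}
	// ... device rules would be configured here ...
	return true
}

// checkStart refuses to start a "container" that has SkipDevices set,
// so the flag can only be used for non-containers.
func checkStart(r *Resources) error {
	if r.SkipDevices {
		return errSkipDevices
	}
	return nil
}

func main() {
	parent := &Resources{SkipDevices: true}
	fmt.Println(applyDevices(parent), checkStart(parent))
}
```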
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
full diff: a9f01edf17...1c8d4c9ef7
drops support for go1.12, and removes dependency on the golang.org/x/xerrors
transitional package.
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The loop in question always executes at least twice, and every close()
call after the first one returns an EBADF error, so move the operations
that do not need to be repeated outside the loop.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
This patch adds a test based on real world usage of runc hooks
(libnvidia-container). We verify that mounting a library inside
a container and running ldconfig succeeds.
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
There have been cases observed where, instead of `v$VER.0-$OS`, the systemdVersion returned is just `$VER` or `$VER-1`.
Handle these cases.
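One way to tolerate all of these forms is to drop an optional "v"
prefix and take the leading run of digits. The sketch below assumes
exactly that; parseSystemdVersion is a hypothetical name, the sample
inputs are made up, and any quoting around the value is not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// verRe matches an optional "v" prefix followed by the major version digits.
var verRe = regexp.MustCompile(`^v?([0-9]+)`)

// parseSystemdVersion extracts the major version from strings shaped like
// "v$VER.0-$OS", "$VER", or "$VER-1".
func parseSystemdVersion(s string) (int, error) {
	m := verRe.FindStringSubmatch(s)
	if m == nil {
		return 0, fmt.Errorf("can't parse systemd version %q", s)
	}
	return strconv.Atoi(m[1])
}

func main() {
	// Sample inputs are illustrative only.
	for _, s := range []string{"v245.0-fc32", "245", "245-1"} {
		v, err := parseSystemdVersion(s)
		fmt.Println(s, v, err)
	}
}
```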
Signed-off-by: Peter Hunt <pehunt@redhat.com>
Not sure why but the errors from scanner were ignored. Such errors
can happen if open(2) has succeeded but the subsequent read(2) fails.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. When using `runc`, we should check `$status` and not `$?`.
2. Before exit code check, let's (try to) show errors from CRIU log.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).
Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity", but
when set to the same (default) value, it is shown as "100ms". So, in case
we expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".
[v2: add systemd version checks since CPUQuotaPeriod requires v242+]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When testing GetCgroupMounts, the map data is supposed to be obtained
from /proc/self/cgroup, but since we're mocking things, we provide
our own map.
Unfortunately, not all controllers existing in mountinfos were listed.
Also, "name=systemd" needs special handling, so add it.
The controllers added were:
* for fedoraMountinfo case: name=systemd
* for systemdMountinfo case: name=systemd, net_prio
* for bedrockMountinfo case: name=systemd, net_prio, pids
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In most projects, "utils" is a big mess, and this is not an exception.
Try to clean it up a bit by moving cgroup v1 specific code to a separate
source file.
There are no code changes in this commit, just moving it from one file
to another.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is cgroupv1-specific, is only used once, and its name
is very close to the name of another function, FindCgroupMountpoint.
Inline it into the (only) caller.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is only called from cgroupv1 code, so there is no need
for it to implement cgroupv2 stuff.
Make it v1-specific, and panic if it is called from v2 code (since this
is an internal function, the panic would mean incorrect runc code).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It's bad and wrong to use these functions for any cgroupv2 code,
and there are no existing users (in runc, at least).
Make them return an error in such a case.
Also, remove the cgroupv2-specific handling from
findCgroupMountpointAndRootFromReader().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function should not really be used for cgroupv2 code.
Currently it is used in kubernetes code, so we can't remove
the v2 case yet.
Add a TODO item to remove v2 code once kubernetes is converted
to not use it, and separate out v1 code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>