If the CRIU binary is in a non $PATH location and passed to runc via
'--criu /path/to/criu', this information has not been passed to go-criu
and since the switch to use go-criu for CRIU version detection, non
$PATH CRIU usage was broken. This uses the newly added go-criu interface
to pass the location of the binary to go-criu.
Signed-off-by: Adrian Reber <areber@redhat.com>
...by checking the default path first.
Quick benchmark shows it's about 5x faster on an idle system, and the
gain should be much more on a system doing mounts etc.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(mode&S_IFCHR == S_IFCHR) is the wrong way of checking the type of an
inode because the S_IF* bits are actually not a bitmask and instead must
be checked using S_IF*. This bug was neatly hidden behind a (major == 0)
sanity-check but that was removed by [1].
In addition, add a test that makes sure that HostDevices() doesn't give
rubbish results -- because we broke this and fixed this before[2].
[1]: 24388be71e ("configs: use different types for .Devices and .Resources.Devices")
[2]: 3ed492ad33 ("Handle non-devices correctly in DeviceFromPath")
Fixes: b0d014d0e1 ("libcontainer: one more switch from syscall to x/sys/unix")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Trying to checkpoint a container out of pod in cri-o fails with:
Error (criu/namespaces.c:1081): Can't dump a pid namespace without the process init
Starting with the upcoming CRIU release 3.15, CRIU can be told to ignore
the PID namespace during checkpointing and to restore processes into an
existing network namespace.
With the changes from this commit and CRIU 3.15 it is possible to
checkpoint a container out of a pod in cri-o.
Signed-off-by: Adrian Reber <areber@redhat.com>
To checkpoint and restore a container with an external network namespace
(like with Podman and CNI), runc tells CRIU to ignore the network
namespace during checkpoint and restore.
This commit moves that code to their own functions to be able to reuse
the same code path for external PID namespaces which are necessary for
checkpointing and restoring containers out of a pod in cri-o.
Signed-off-by: Adrian Reber <areber@redhat.com>
In case cgroupPath is under the default cgroup prefix, let's try to
guess the mount point by adding the subsystem name to the default
prefix, and resolving the resulting path in case it's a symlink.
In most cases, given the default cgroup setup, this trick
should result in returning the same result faster, and avoiding
/proc/self/mountinfo parsing which is relatively slow and problematic.
Be very careful with the default path, checking it is
- a directory;
- a mount point;
- has cgroup fstype.
If something is not right, fall back to parsing mountinfo.
While at it, remove the obsoleted comment about mountinfo parsing. The
comment belongs to findCgroupMountpointAndRootFromReader(), but rather
than moving it there, let's just remove it, since it does not add any
value in understanding the current code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In manager.Apply() method, a path to each subsystem is obtained by
calling d.path(sys.Name()), and the sys.Apply() is called that does
the same call to d.path() again.
d.path() is an expensive call, so rather than to call it twice, let's
reuse the result.
This results the number of times we parse mountinfo during container
start from 62 to 34 on my setup.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of iterating over m.paths, iterate over subsystems and look up
the path for each. This is faster since a map lookup is faster than
iterating over the names in Get. A quick benchmark shows that the new
way is 2.5x faster than the old one.
Note though that this is not done to make things faster, as savings are
negligible, but to make things simpler by removing some code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Half of controllers' GetStats just return nil, and most of the others
ignore ENOENT on files, so it will be cheaper to not check that the
path exists in the main GetStats method, offloading that to the
controllers.
Drop PathExists check from GetStats, add it to those controllers'
GetStats where it was missing.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
some hierarchies were created directly by .Apply() on top of systemd
managed cgroups. systemd doesn't manage these and as a result we leak
these cgroups.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The result of cgroupv1.FindCgroupMountpoint() call (which is relatively
expensive) is only used in case raw.innerPath is absolute, so it only
makes sense to call it in that case.
This drastically reduces the number of calls to FindCgroupMountpoint
during container start (from 116 to 62 in my setup).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In here, defer looks like an overkill, since the code is very simple and
we already have an error path.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.
The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.
Work around this by adding a SkipDevices flag to Resources.
A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It is obvious that the loop at the first place executes at least
twice, and the close() call after the first time always returns
an EBADF error, so move these operations outside the loop that
do not need to be repeated.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
This patch adds a test based on real world usage of runc hooks
(libnvidia-container). We verify that mounting a library inside
a container and running ldconfig succeeds.
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
there have been cases observed where instead of `v$VER.0-$OS` the systemdVersion returned is just `$VER`, or `$VER-1`.
handle these cases
Signed-off-by: Peter Hunt <pehunt@redhat.com>
Not sure why but the errors from scanner were ignored. Such errors
can happen if open(2) has succeeded but the subsequent read(2) fails.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).
Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".
[v2: add systemd version checks since CPUQuotaPeriod requires v242+]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>