1608 Commits

Author SHA1 Message Date
Akihiro Suda f668854938
Merge pull request #2499 from kolyshkin/find-cgroup-mountpoint-fastpath
cgroupv1/FindCgroupMountpoint: add a fast path
2020-08-04 14:06:41 +09:00
Akihiro Suda 234d15ecd0
Merge pull request #2520 from thaJeztah/bump_runtime_spec
vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6
2020-08-04 14:05:33 +09:00
Akihiro Suda 78d02e8563
Merge pull request #2534 from adrianreber/go-criu-4-1-0
Pass location of CRIU binary to go-criu
2020-08-03 16:21:50 +09:00
Kir Kolyshkin 3de3112c61
Merge pull request #2525 from adrianreber/external-pidns
Tell CRIU to use an external pid namespace if necessary
2020-07-31 17:50:27 -07:00
Adrian Reber 6f4616dd73
Pass location of CRIU binary to go-criu
If the CRIU binary is in a non-$PATH location and passed to runc via
'--criu /path/to/criu', this information was not passed on to go-criu.
Since the switch to go-criu for CRIU version detection, non-$PATH CRIU
usage has therefore been broken. This uses the newly added go-criu
interface to pass the location of the binary to go-criu.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-07-31 11:14:15 +02:00
Akihiro Suda d6f5641c20
Merge pull request #2507 from kolyshkin/alt-to-2497
libct/cgroups/GetCgroupRoot: make it faster
2020-07-31 11:43:38 +09:00
Mrunal Patel 46243fcea1
Merge pull request #2500 from kolyshkin/fs-apply
libct/cgroups/fs: rework Apply()
2020-07-30 16:39:53 -07:00
Kir Kolyshkin e0c0b0cf32 libct/cgroups/GetCgroupRoot: make it faster
...by checking the default path first.

Quick benchmark shows it's about 5x faster on an idle system, and the
gain should be much more on a system doing mounts etc.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-30 13:45:21 -07:00
Sebastiaan van Stijn 901dccf05d
vendor: update runtime-spec v1.0.3-0.20200728170252-4d89ac9fbff6
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-07-30 22:08:54 +02:00
Aleksa Sarai 95a59bf206
devices: correctly check device types
(mode&S_IFCHR == S_IFCHR) is the wrong way of checking the type of an
inode because the S_IF* bits are actually not a bitmask; instead, the
type must be checked by masking the mode with S_IFMT and comparing the
result against S_IF*. This bug was neatly hidden behind a (major == 0)
sanity-check but that was removed by [1].

In addition, add a test that makes sure that HostDevices() doesn't give
rubbish results -- because we broke this and fixed this before[2].

[1]: 24388be71e ("configs: use different types for .Devices and .Resources.Devices")
[2]: 3ed492ad33 ("Handle non-devices correctly in DeviceFromPath")

Fixes: b0d014d0e1 ("libcontainer: one more switch from syscall to x/sys/unix")
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2020-07-28 19:04:30 +10:00
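The difference described in the commit above can be shown with a minimal Go sketch. The constants are the Linux values of S_IFMT, S_IFCHR and S_IFBLK, declared locally so the snippet stands alone; the function names are hypothetical and are not taken from runc.

```go
package main

import "fmt"

// Linux file-type constants from <sys/stat.h>, declared locally (assumption:
// the real code would take them from golang.org/x/sys/unix).
const (
	sIFMT  = 0o170000 // mask that selects the file-type field
	sIFCHR = 0o020000 // character device
	sIFBLK = 0o060000 // block device
)

// isCharDeviceWrong mirrors the buggy pattern: it treats S_IFCHR as if it
// were a single flag bit that can be tested on its own.
func isCharDeviceWrong(mode uint32) bool {
	return mode&sIFCHR == sIFCHR
}

// isCharDevice first extracts the file-type field with S_IFMT and then
// compares it for equality.
func isCharDevice(mode uint32) bool {
	return mode&sIFMT == sIFCHR
}

func main() {
	// S_IFBLK contains every bit of S_IFCHR, so a block device
	// passes the buggy check.
	blk := uint32(sIFBLK | 0o660)
	fmt.Println("buggy check says char device:", isCharDeviceWrong(blk))
	fmt.Println("masked check says char device:", isCharDevice(blk))
}
```

The block device mode is the interesting input here: its type bits are a superset of the character device bits, which is why a plain AND is not enough.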
Adrian Reber 09e103b01e
Tell CRIU to use an external pid namespace if necessary
Trying to checkpoint a container out of a pod in cri-o fails with:

  Error (criu/namespaces.c:1081): Can't dump a pid namespace without the process init

Starting with the upcoming CRIU release 3.15, CRIU can be told to ignore
the PID namespace during checkpointing and to restore processes into an
existing PID namespace.

With the changes from this commit and CRIU 3.15 it is possible to
checkpoint a container out of a pod in cri-o.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-07-27 10:14:08 +02:00
Adrian Reber 610c5ad75c
Factor out checkpointing with external namespace code
To checkpoint and restore a container with an external network namespace
(like with Podman and CNI), runc tells CRIU to ignore the network
namespace during checkpoint and restore.

This commit moves that code into its own functions to be able to reuse
the same code path for external PID namespaces, which are necessary for
checkpointing and restoring containers out of a pod in cri-o.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-07-27 10:14:07 +02:00
Xiaodong Liu af283b3f47 remove the redundant parameter of the chroot function
Signed-off-by: Xiaodong Liu <liuxiaodong@loongson.cn>
2020-07-15 16:22:07 +08:00
Mrunal Patel cf1273abf4
Merge pull request #2498 from kolyshkin/v1-code-cleanups
libct/cgroups/fs: code cleanups
2020-07-09 15:58:06 -07:00
Kir Kolyshkin fbf047bf2f
Merge pull request #2501 from XiaodongLoong/systemderror-fix
fix TestPidsSystemd and TestRunWithKernelMemorySystemd test error
2020-07-08 20:39:39 -07:00
Xiaodong Liu f57bb2fe3d fix TestPidsSystemd and TestRunWithKernelMemorySystemd test error
Signed-off-by: Xiaodong Liu <liuxiaodong@loongson.cn>
2020-07-09 09:36:03 +08:00
Daniel J Walsh d78ee47154
Allow libcontainer/configs to be imported on Windows
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2020-07-08 15:20:37 -04:00
Kir Kolyshkin a73ce38d16 cgroupv1/FindCgroupMountpoint: add a fast path
In case cgroupPath is under the default cgroup prefix, let's try to
guess the mount point by adding the subsystem name to the default
prefix, and resolving the resulting path in case it's a symlink.

In most cases, given the default cgroup setup, this trick
should result in returning the same result faster, and avoiding
/proc/self/mountinfo parsing which is relatively slow and problematic.

Be very careful with the default path, checking that it
 - is a directory;
 - is a mount point;
 - has cgroup fstype.

If something is not right, fall back to parsing mountinfo.

While at it, remove the obsoleted comment about mountinfo parsing.  The
comment belongs to findCgroupMountpointAndRootFromReader(), but rather
than moving it there, let's just remove it, since it does not add any
value in understanding the current code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-07 13:57:33 -07:00
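The three checks listed in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not runc's code: guessMountpoint is a hypothetical name, the standard library syscall package stands in for golang.org/x/sys/unix, the snippet targets Linux only, and the mount point test (comparing the device number with that of the parent directory) is one common heuristic.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// cgroupSuperMagic is CGROUP_SUPER_MAGIC from linux/magic.h (cgroup v1).
const cgroupSuperMagic = 0x27e0eb

// guessMountpoint (hypothetical helper) tries prefix/subsystem and reports
// ok == false whenever the caller should fall back to parsing mountinfo.
func guessMountpoint(prefix, subsystem string) (string, bool) {
	// Resolve a possible symlink such as cpu -> cpu,cpuacct.
	path, err := filepath.EvalSymlinks(filepath.Join(prefix, subsystem))
	if err != nil {
		return "", false
	}
	// (1) The path must be a directory.
	st, err := os.Stat(path)
	if err != nil || !st.IsDir() {
		return "", false
	}
	// (2) The path must be a mount point: its device number has to differ
	// from the one of its parent directory.
	pst, err := os.Stat(filepath.Dir(path))
	if err != nil {
		return "", false
	}
	sys, ok1 := st.Sys().(*syscall.Stat_t)
	psys, ok2 := pst.Sys().(*syscall.Stat_t)
	if !ok1 || !ok2 || sys.Dev == psys.Dev {
		return "", false
	}
	// (3) The filesystem type must be cgroup.
	var fst syscall.Statfs_t
	if err := syscall.Statfs(path, &fst); err != nil || int64(fst.Type) != cgroupSuperMagic {
		return "", false
	}
	return path, true
}

func main() {
	if mnt, ok := guessMountpoint("/sys/fs/cgroup", "cpu"); ok {
		fmt.Println("fast path hit:", mnt)
	} else {
		fmt.Println("fast path missed; a real caller would parse mountinfo here")
	}
}
```

Every failed check returns false rather than an error, so a caller only ever gains speed from the guess and never loses correctness.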
Kir Kolyshkin c1adc99a20 cgroup/fs: rework Apply()
In the manager.Apply() method, the path to each subsystem is obtained by
calling d.path(sys.Name()), and then sys.Apply() is called, which makes
the same call to d.path() again.

d.path() is an expensive call, so rather than calling it twice, let's
reuse the result.

This reduces the number of times we parse mountinfo during container
start from 62 to 34 on my setup.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-07 10:58:37 -07:00
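The idea in the commit above, resolving the path once and handing the result to the callee, might look like this in a reduced form. All names here (lookupPath, applyLookingUp, applyWithPath) are hypothetical stand-ins; the counter only exists to make the saved calls visible.

```go
package main

import "fmt"

// pathCalls counts how often the pretend-expensive lookup runs.
var pathCalls int

// lookupPath is a hypothetical stand-in for d.path().
func lookupPath(subsystem string) string {
	pathCalls++
	return "/sys/fs/cgroup/" + subsystem + "/demo"
}

// applyLookingUp resolves the path itself, repeating the caller's work.
func applyLookingUp(subsystem string) string {
	return lookupPath(subsystem)
}

// applyWithPath receives the already-resolved path as an argument.
func applyWithPath(path string) string {
	return path
}

// applyAllOld resolves each path twice: once here, once in the callee.
func applyAllOld(subsystems []string) map[string]string {
	paths := make(map[string]string, len(subsystems))
	for _, name := range subsystems {
		paths[name] = lookupPath(name)
		applyLookingUp(name)
	}
	return paths
}

// applyAllNew resolves each path once and reuses the result.
func applyAllNew(subsystems []string) map[string]string {
	paths := make(map[string]string, len(subsystems))
	for _, name := range subsystems {
		p := lookupPath(name)
		paths[name] = applyWithPath(p)
	}
	return paths
}

func main() {
	subsystems := []string{"cpu", "memory", "pids"}

	pathCalls = 0
	applyAllOld(subsystems)
	fmt.Println("lookups, old shape:", pathCalls)

	pathCalls = 0
	applyAllNew(subsystems)
	fmt.Println("lookups, new shape:", pathCalls)
}
```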
Aleksa Sarai 819fcc687e
merge branch 'pr-2495'
Kir Kolyshkin (1):
  cgroups/fs/path: optimize

LGTMs: @mrunalp @cyphar
Closes #2495
2020-07-07 11:51:06 +10:00
Kir Kolyshkin 2a322e91ec cgroupv1: remove subsystemSet.Get()
Instead of iterating over m.paths, iterate over subsystems and look up
the path for each. This is faster since a map lookup is faster than
iterating over the names in Get. A quick benchmark shows that the new
way is 2.5x faster than the old one.

Note though that this is not done to make things faster, as savings are
negligible, but to make things simpler by removing some code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 18:31:46 -07:00
Kir Kolyshkin daf30cb7ca cgroups/fs: rm getSubsystems
It does not add any value.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 18:29:14 -07:00
Kir Kolyshkin 2e22579946 libct/cgroups/fs.GetStats: drop PathExists check
Half of controllers' GetStats just return nil, and most of the others
ignore ENOENT on files, so it will be cheaper to not check that the
path exists in the main GetStats method, offloading that to the
controllers.

Drop PathExists check from GetStats, add it to those controllers'
GetStats where it was missing.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 18:02:17 -07:00
Kir Kolyshkin 11fb94965c cgroups/fs: rm Remove method from controllers
To my surprise, those are not used anywhere in the code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 18:02:17 -07:00
Mrunal Patel 30dc54a995
Merge pull request #2503 from giuseppe/cgroup-fixes
cgroup, systemd: cleanup cgroups
2020-07-06 15:14:29 -07:00
Mrunal Patel 3f81131845
Merge pull request #2490 from kolyshkin/dev-opt
libct/cgroups: add SkipDevices to Resources
2020-07-06 14:28:30 -07:00
Giuseppe Scrivano 32034481ea
cgroup, systemd: cleanup cgroups
Some hierarchies were created directly by .Apply() on top of
systemd-managed cgroups. systemd doesn't manage these, and as a result
we leak these cgroups.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-07-06 23:06:16 +02:00
Mrunal Patel 46a304b592
Merge pull request #2502 from tjucoder/master
make sure pty.Close() will be called and fix comment
2020-07-06 11:49:20 -07:00
Giuseppe Scrivano 2deaeab08f
cgroup: store the result of IsRunningSystemd
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-07-05 12:42:27 +02:00
tjucoder ab35cfe23c make sure pty.Close() will be called and fix comment
Signed-off-by: tjucoder <chinesecoder@foxmail.com>
2020-07-05 16:37:21 +08:00
Kir Kolyshkin 62a30709d2 cgroups/fs/path: optimize
The result of cgroupv1.FindCgroupMountpoint() call (which is relatively
expensive) is only used in case raw.innerPath is absolute, so it only
makes sense to call it in that case.

This drastically reduces the number of calls to FindCgroupMountpoint
during container start (from 116 to 62 in my setup).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-03 14:07:27 -07:00
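A reduced sketch of the optimization in the commit above: the expensive lookup is moved inside the branch that actually needs it. findMountpoint and subsystemPath are hypothetical names, and the path construction is simplified compared to what runc does.

```go
package main

import (
	"fmt"
	"path"
)

// mountpointCalls counts invocations of the pretend-expensive lookup.
var mountpointCalls int

// findMountpoint is a hypothetical stand-in for
// cgroupv1.FindCgroupMountpoint().
func findMountpoint(subsystem string) string {
	mountpointCalls++
	return "/sys/fs/cgroup/" + subsystem
}

// subsystemPath only pays for the mount point lookup when innerPath is
// absolute; relative paths are joined onto the given parent path.
func subsystemPath(subsystem, innerPath, parentPath string) string {
	if path.IsAbs(innerPath) {
		return path.Join(findMountpoint(subsystem), innerPath)
	}
	return path.Join(parentPath, innerPath)
}

func main() {
	fmt.Println(subsystemPath("cpu", "demo", "/sys/fs/cgroup/cpu/user.slice"))
	fmt.Println("lookups so far:", mountpointCalls)
	fmt.Println(subsystemPath("cpu", "/demo", "/sys/fs/cgroup/cpu/user.slice"))
	fmt.Println("lookups so far:", mountpointCalls)
}
```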
Kir Kolyshkin 46b26bc05d cgroups/fs/Freeze: simplify
In here, defer looks like an overkill, since the code is very simple and
we already have an error path.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-03 14:02:57 -07:00
Kir Kolyshkin cd479f9d14 cgroupv1/freezer: don't use subsystemSet.Get()
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-03 14:00:44 -07:00
Kir Kolyshkin 108ee85b82 libct/cgroups: add SkipDevices to Resources
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2, which uses eBPF for device configuration) is
that the hard requirement to have the devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-02 15:19:31 -07:00
Aleksa Sarai 0fa097fc37
merge branch 'pr-2481'
Tianjia Zhang (1):
  nsenter: fix repeat close() operations

LGTMs: @kolyshkin @cyphar
Closes #2481
2020-06-20 12:18:31 +10:00
Kir Kolyshkin dff7685c18
Merge pull request #2459 from tedyu/linux-cont-set-cfg
Set configs back when intelrdt configs cannot be set

LGTMs: @AkihiroSuda @kolyshkin
2020-06-19 12:57:53 -07:00
Kir Kolyshkin e643db6e0f
Merge pull request #2479 from haircommander/fix-systemd-version
systemd: parse systemdVersion when only an int is returned

LGTMs: @mrunalp @kolyshkin
2020-06-19 12:19:16 -07:00
Tianjia Zhang 04806abd39 nsenter: fix repeat close() operations
The first loop obviously executes at least twice, and every close()
call after the first iteration returns an EBADF error, so move the
operations that do not need to be repeated outside the loop.

Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
2020-06-19 19:28:39 +08:00
Akihiro Suda 9748b48742
Merge pull request #2229 from RenaudWasTaken/create-container
Add CreateRuntime, CreateContainer and StartContainer Hooks
2020-06-19 12:27:51 +09:00
Renaud Gaubert 861afa7509 Add integration tests for the new runc hooks
This patch adds a test based on real world usage of runc hooks
(libnvidia-container). We verify that mounting a library inside
a container and running ldconfig succeeds.

Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-19 02:39:20 +00:00
Renaud Gaubert 2f7bdf9d3b Tests the new Hook
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-19 02:39:20 +00:00
Peter Hunt 6a0f64e7c9 systemd: add unit tests for systemdVersion
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2020-06-18 22:30:50 -04:00
Peter Hunt 6369e38871 systemd: parse systemdVersion in more situations
There have been cases observed where, instead of `v$VER.0-$OS`, the
systemdVersion returned is just `$VER` or `$VER-1`. Handle these cases.

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2020-06-18 22:30:50 -04:00
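One way to accept all the version shapes named in the commit above is to drop optional decoration and read only the leading digits. The sketch below assumes the input may also be wrapped in double quotes; parseSystemdVersion is a hypothetical name and this is not necessarily how runc implements it.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSystemdVersion (hypothetical helper) extracts the major version from
// strings such as "v245.4-1.fc32", "245" or "245-1".
func parseSystemdVersion(s string) (int, error) {
	// Drop optional surrounding quotes and an optional leading "v".
	trimmed := strings.TrimPrefix(strings.Trim(s, `"`), "v")

	// Take the run of ASCII digits at the start; ignore whatever follows.
	end := 0
	for end < len(trimmed) && trimmed[end] >= '0' && trimmed[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, fmt.Errorf("no leading version number in %q", s)
	}
	return strconv.Atoi(trimmed[:end])
}

func main() {
	for _, in := range []string{"v245.4-1.fc32", "245", "245-1", "unknown"} {
		ver, err := parseSystemdVersion(in)
		fmt.Println(in, ver, err)
	}
}
```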
Kir Kolyshkin 89516d17dd libct/cgroups/readProcsFile: return error if scan failed
Not sure why but the errors from scanner were ignored. Such errors
can happen if open(2) has succeeded but the subsequent read(2) fails.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-17 12:33:01 -07:00
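The pattern the commit above refers to is checking bufio.Scanner's Err() once the scan loop ends, since Scan() returning false does not distinguish end of input from a failed read. A minimal sketch follows; readPids and stubReader are hypothetical names, with stubReader simulating a source that delivers some data and then fails.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// readPids parses one PID per line and reports scanner errors instead of
// silently returning a partial list.
func readPids(r io.Reader) ([]int, error) {
	var pids []int
	s := bufio.NewScanner(r)
	for s.Scan() {
		line := strings.TrimSpace(s.Text())
		if line == "" {
			continue
		}
		pid, err := strconv.Atoi(line)
		if err != nil {
			return nil, err
		}
		pids = append(pids, pid)
	}
	// Scan() stops on both EOF and read errors; only Err() tells them apart.
	if err := s.Err(); err != nil {
		return nil, err
	}
	return pids, nil
}

// stubReader (test double) returns data once, then either EOF or an error.
type stubReader struct {
	data string
	fail bool
	sent bool
}

func (r *stubReader) Read(p []byte) (int, error) {
	if !r.sent {
		r.sent = true
		return copy(p, r.data), nil
	}
	if r.fail {
		return 0, errors.New("simulated read failure")
	}
	return 0, io.EOF
}

func main() {
	pids, err := readPids(&stubReader{data: "1\n2\n"})
	fmt.Println(pids, err)

	pids, err = readPids(&stubReader{data: "1\n2\n", fail: true})
	fmt.Println(pids, err)
}
```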
Mrunal Patel 406298fdf0
Merge pull request #2466 from kolyshkin/systemd-cpu-quota-period
cgroups/systemd: add setting CPUQuotaPeriod prop
2020-06-17 12:03:30 -07:00
Mrunal Patel 12a7c8fc2b
Merge pull request #2411 from kolyshkin/v1-specific
libct/cgroups/utils: fix/separate cgroupv1 code
2020-06-17 06:45:19 -07:00
Renaud Gaubert ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Kir Kolyshkin e751a168dc cgroups/systemd: add setting CPUQuotaPeriod prop
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).

Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".

[v2: add systemd version checks since CPUQuotaPeriod requires v242+]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 15:48:06 -07:00
Kir Kolyshkin 8c5a19f79b libct/cgroups/fs: rename some files
no changes, just a few git renames

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:45:54 -07:00
Kir Kolyshkin cec5ae7c2d libct/cgroupv1/getCgroupMountsHelper: minor nit
It is easy to just use TrimPrefix which does nothing in case the prefix
does not exist.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:45:50 -07:00
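The point made in the commit above is that strings.TrimPrefix already returns its input unchanged when the prefix is absent, so an explicit HasPrefix guard is redundant. A small sketch, assuming a "name=" prefix purely for illustration; both helper names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// stripGuarded checks for the prefix before slicing it off by hand.
func stripGuarded(s, prefix string) string {
	if strings.HasPrefix(s, prefix) {
		return s[len(prefix):]
	}
	return s
}

// stripSimple relies on TrimPrefix being a no-op without the prefix.
func stripSimple(s, prefix string) string {
	return strings.TrimPrefix(s, prefix)
}

func main() {
	for _, opt := range []string{"name=systemd", "cpu"} {
		fmt.Println(stripGuarded(opt, "name="), stripSimple(opt, "name="))
	}
}
```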