Commit Graph

1620 Commits

Author SHA1 Message Date
Akihiro Suda 9748b48742
Merge pull request #2229 from RenaudWasTaken/create-container
Add CreateRuntime, CreateContainer and StartContainer Hooks
2020-06-19 12:27:51 +09:00
Renaud Gaubert 861afa7509 Add integration tests for the new runc hooks
This patch adds a test based on real world usage of runc hooks
(libnvidia-container). We verify that mounting a library inside
a container and running ldconfig succeeds.

Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-19 02:39:20 +00:00
Renaud Gaubert 2f7bdf9d3b Tests the new Hook
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-19 02:39:20 +00:00
Peter Hunt 6a0f64e7c9 systemd: add unit tests for systemdVersion
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2020-06-18 22:30:50 -04:00
Peter Hunt 6369e38871 systemd: parse systemdVersion in more situations
there have been cases observed where instead of `v$VER.0-$OS` the systemdVersion returned is just `$VER`, or `$VER-1`.
handle these cases

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2020-06-18 22:30:50 -04:00
Kir Kolyshkin 89516d17dd libct/cgroups/readProcsFile: ret errorr if scan failed
Not sure why but the errors from scanner were ignored. Such errors
can happen if open(2) has succeeded but the subsequent read(2) fails.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-17 12:33:01 -07:00
Mrunal Patel 406298fdf0
Merge pull request #2466 from kolyshkin/systemd-cpu-quota-period
cgroups/systemd: add setting CPUQuotaPeriod prop
2020-06-17 12:03:30 -07:00
Mrunal Patel 12a7c8fc2b
Merge pull request #2411 from kolyshkin/v1-specific
libct/cgroups/utils: fix/separate cgroupv1 code
2020-06-17 06:45:19 -07:00
Renaud Gaubert ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Kir Kolyshkin e751a168dc cgroups/systemd: add setting CPUQuotaPeriod prop
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).

Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".

[v2: add systemd version checks since CPUQuotaPeriod requires v242+]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 15:48:06 -07:00
Kir Kolyshkin 8c5a19f79b libct/cgroups/fs: rename some files
no changes, just a few git renames

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:45:54 -07:00
Kir Kolyshkin cec5ae7c2d libct/cgroupv1/getCgroupMountsHelper: minor nit
It is easy to just use TrimPrefix which does nothing in case the prefix
does not exist.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:45:50 -07:00
Kir Kolyshkin 0626c150c1 libct/cgroupv1: fix TestGetCgroupMounts test cases
When testing GetCgroupMounts, the map data is supposed to be obtained
from /proc/self/cgroup, but since we're mocking things, we provide
our own map.

Unfortunately, not all controllers existing in mountinfos were listed.
Also, "name=systemd" needs special handling, so add it.

The controllers added were:

 * for fedoraMountinfo case: name=systemd
 * for systemdMountinfo case: name=systemd, net_prio
 * for bedrockMountinfo case: name=systemd, net_prio, pids

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:45:30 -07:00
Kir Kolyshkin 0681d456fc libct/cgroups/utils: move cgroup v1 code to separate file
In most project, "utils" is a big mess, and this is not an exception.
Try to clean it up a bit by moving cgroup v1 specific code to a separate
source file.

There are no code changes in this commit, just moving it from one file
to another.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:45:07 -07:00
Kir Kolyshkin 7db2d3e146 libcontainer/cgroups: rm FindCgroupMountpointDir
This function is cgroupv1-specific, is only used once, and its name
is very close to the name of another function, FindCgroupMountpoint.

Inline it into the (only) caller.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:15 -07:00
Kir Kolyshkin d244b4058e libct/cgroups: improve ParseCgroupFile docs
In particular, state that for cgroup v2 the result is very different.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:08 -07:00
Kir Kolyshkin 5785aabc13 libct/cgroups: make isSubsystemAvailable v1-specific
This function is only called from cgroupv1 code, so there is no need
for it to implement cgroupv2 stuff.

Make it v1-specific, and panic if it is called from v2 code (since this
is an internal function, the panic would mean incorrect runc code).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:04 -07:00
Kir Kolyshkin d5c57dcea6 libct/criuApplyCgroups: don't set cgroup paths for v2
There is no need to have cgroupv1-specific controller paths on restore
in case of cgroupv2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:02 -07:00
Kir Kolyshkin 52b56bc28e libc/criuSwrk: remove applyCgroups param
Its value can be easily deduced from the request type.

While at it, simplify the call logic.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:01 -07:00
Kir Kolyshkin 142d0f2d5d libct/cgroups/utils: make FindCgroupMountpoint* v1-specific
It's bad and wrong to use these functions for any cgroupv2 code,
and there are no existing users (in runc, at least).

Make them return an error in such case.

Also, remove the cgroupv2-specific handling from
findCgroupMountpointAndRootFromReader().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:39:58 -07:00
Kir Kolyshkin 44b75e760e libct/cgroups: separate getCgroupMountsV1
This function should not really be used for cgroupv2 code.
Currently it is used in kubernetes code, so we can't remove
the v2 case yet.

Add a TODO item to remove v2 code once kubernetes is converted
to not use it, and separate out v1 code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:39:06 -07:00
Mrunal Patel 82d2fa4eb0
Merge pull request #2453 from AkihiroSuda/vagrant-centos7
CI: add CentOS 7 (kernel 3.10, systemd 219)
2020-06-15 21:09:43 -07:00
Kir Kolyshkin 3834222d88 libct/cgroups/utils: getControllerPath return err for v2
This function is not used and were never used in any cgroupv2 code.

To have it stay that way, let it return error in case it's called
for v2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-15 20:23:59 -07:00
Kir Kolyshkin dd2426d067 libct/cgroups: fix m.paths map access
This fixes a few cases of accessing m.paths map directly without holding
the mutex lock.

Fixes: 9087f2e82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-15 18:30:16 -07:00
Kir Kolyshkin a77d7b1d0f libct: don't use GetPaths
Since commit 714c91e9f7, method GetPaths() should only be used
for saving container state. For other uses, we have a new method,
Path(), which is cleaner.

Fix GetPaths() usage introduced by recent commits 859a780d6f and 9087f2e82.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-15 18:27:34 -07:00
Kir Kolyshkin 5b247e739c
Merge pull request #2338 from lifubang/systemdcgroupv2
fix path error in systemd when stopped

LGTMs: @mrunalp @AkihiroSuda
2020-06-15 18:01:13 -07:00
Akihiro Suda c76af1d2ac
Merge pull request #2470 from katarzyna-z/kk-fix-numa-stats
Fix #2469 omit memory.numa_stat when not available
2020-06-16 09:32:49 +09:00
Akihiro Suda 601fa557c0
Merge pull request #2414 from kolyshkin/criu-notif
use lazy-pages ready notification for criu >= 3.15
2020-06-16 09:31:12 +09:00
Katarzyna Kujawa 71e63de4a3 Fix #2469 omit memory.numa_stat when not available
Signed-off-by: Katarzyna Kujawa <katarzyna.kujawa@intel.com>
2020-06-15 11:39:34 +02:00
Akihiro Suda fdc48376d1
Merge pull request #2458 from kolyshkin/cpu-quota-II
Cpu quota fixes, try II
2020-06-12 07:46:56 +09:00
Mrunal Patel a4a306d2a2 Write state.json atomically
We want to make sure that the state file is syned and cannot be
read partially or truncated.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-06-10 20:21:04 -07:00
Akihiro Suda bd236e50a5
integration: skip checkpoint tests if criu binary not found
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-06-10 13:18:58 +09:00
Kir Kolyshkin a92b0327ce cgroups/systemd: fix set CPU quota if period is unset
systemd drivers ignore --cpu-quota during update if the CPU
period was not set earlier.

Fixed by adding the default for the period.

The test will be added by the following commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:32:17 -07:00
Kir Kolyshkin 4189cb65f8 cgroups: remove cgroup.Resources.CpuMax
This (and the converting function) is only used by one of the four
cgroup drivers. The other three do some checking and conversion in
place, so let the fs2 do the same.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:15:38 -07:00
Kir Kolyshkin 8b9646775e cgroups/systemd: unify adding CpuQuota
The code that adds CpuQuotaPerSecUSec is the same in v1 and v2
systemd cgroup driver. Move it to common.

No functional change.

Note that the comment telling that we always set this property
contradicts with the current code, and therefore it is removed.

[v2: drop cgroupv1-specific comment]
[v3: drop returning error as it's not used]
[v4: remove an obsoleted comment]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:14:43 -07:00
Kir Kolyshkin 2ce20ed158 cgroups/systemd: simplify gen*ResourcesProperties
Use r instead of c.Resources for readability. No functional change.

This commit has been brought to you by '<,'>s/c\.Resources\./r./g

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-08 13:42:09 -07:00
Ted Yu 9d275d326c Set configs back when intelrdt configs cannot be set
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-06-06 15:10:45 -07:00
Aleksa Sarai 2a0466958d
merge branch 'pr-2446'
Kir Kolyshkin (1):
  (*initProcess).start: rm second Apply

LGTMs: @mrunalp @cyphar
Closes #2446
2020-06-05 08:30:51 +10:00
lifubang 9087f2e827 fix path error in systemd when stopped
When we use cgroup with systemd driver, the cgroup path will be auto removed
by systemd when all processes exited. So we should check cgroup path exists
when we access the cgroup path, for example in `kill/ps`, or else we will
got an error.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-06-02 18:17:43 +08:00
Katarzyna Kujawa 92f831bf0c Fix #2440 omit cpuacct.usage_all when not available
Signed-off-by: Katarzyna Kujawa <katarzyna.kujawa@intel.com>
2020-06-02 09:24:11 +02:00
Kir Kolyshkin d1ba8e39f8 (*initProcess).start: rm second Apply
Apply() determines and creates cgroup path(s), configures parent cgroups
(for some v1 controllers), and creates a systemd unit (in case of a
systemd cgroup manager), then adds a pid specified to the cgroup
for all configured controllers.

This is a relatively heavy procedure (in particular, for cgroups v1 it
involves parsing /proc/self/mountinfo about a dozen times), and it seems
there is no need to do it twice.

More to say, even merely adding the child pid to the same cgroup seems
redundant, as we added the parent pid to the cgroup before sending the
data to the child (runc init process), and it waits for the data before
doing clone(), so its children will be in the same cgroup anyway.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-01 19:51:19 -07:00
Mrunal Patel 332a84581e
Merge pull request #2443 from kolyshkin/kmem-fixup
cgroupv1/systemd.Set: don't enable kernel memory acct
2020-05-31 10:04:45 -07:00
Mrunal Patel 0f7ffbebeb
Merge pull request #2416 from AkihiroSuda/exec-join-init-cgroup
cgroup2: exec: join the cgroup of the init process on EBUSY
2020-05-31 08:50:41 -07:00
Aleksa Sarai a30f2556d9
merge branch 'pr-2018'
Lifubang (1):
  add prompt when rootless users have no read access to runc bin

LGTMs: @AkihiroSuda @cyphar
Closes #2018
2020-05-31 18:41:37 +10:00
Akihiro Suda c91fe9aeba cgroup2: exec: join the cgroup of the init process on EBUSY
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-31 13:09:43 +09:00
Kir Kolyshkin 3fe6e04510 cgroupv1/systemd.Set: don't enable kernel memory acct
This is a regression from commit 1d4ccc8e0. We only need to enable
kernel memory accounting once, from the (*legacyManager*).Apply(),
and there is no need to do it in (*legacyManager*).Set().

While at it, rename the method to better reflect what it's doing.

This saves 1 call to mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-29 17:54:50 -07:00
Kir Kolyshkin 3249e2379c cgroupv1: check cpu shares in place
Commit 4e65e0e90a added a check for cpu shares. Apparently, the
kernel allows to set a value higher than max or lower than min without
an error, but the value read back is always within the limits.

The check (which was later moved out to a separate CheckCpushares()
function) is always performed after setting the cpu shares, so let's
move it to the very place where it is set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-29 16:46:28 -07:00
Ted Yu 3ba3d9b1bd Wait for criuProcess once
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-29 15:50:37 -07:00
Kir Kolyshkin 0ac92aab3f cgroups/fs2: make removeCgroupPath faster
1. In cases there are no sub-cgroups, a single rmdir should be faster
than iterating through the list of files.

2. Use unix.Rmdir() to save one more syscall since os.Remove() tries
unlink(2) first which fails on a directory, and only then tries
rmdir(2).

3. Re-use rmdir.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-28 11:15:31 -07:00
Mrunal Patel 4f0bdafc8a
Merge pull request #2412 from lifubang/removecgpath
remove cgroup path recursively in cgroup v2
2020-05-27 15:50:14 -07:00
Kir Kolyshkin be5467872d cgroupv1: minimal fix for cpu quota regression
This is a quick-n-dirty fix the regression introduced by commit
06d7c1d, which made it impossible to only set CpuQuota
(without the CpuPeriod). It partially reverts the above commit,
and adds a test case.

The proper fix will follow.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-26 11:02:16 -07:00
lifubang 82fa194179 remove cgroup path recursively in cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-26 23:35:20 +08:00
Akihiro Suda 1f737eebaa
Merge pull request #2426 from kolyshkin/mem-swap-unlim
Fix some cases of swap setting
2020-05-26 14:48:59 +09:00
Akihiro Suda 7673bee6bf
Merge pull request #2395 from lifubang/updateCgroupv2
Partially revert "CreateCgroupPath: only enable needed controllers"
2020-05-25 13:56:23 +09:00
Kir Kolyshkin 68391c0e96 use lazy-pages ready notification for criu >= 3.15
This relies on https://github.com/checkpoint-restore/criu/pull/1069
and emulates the previous behavior by writing \0 and closing status
fd (as it was done by criu).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-23 11:37:28 -07:00
Kir Kolyshkin 7ab1329835 libct/criuNotifications: simplify switch
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-23 11:37:28 -07:00
Kir Kolyshkin 3c6e8ac4d2 cgroupv2: set mem+swap to max if mem set to max
... and mem+swap is not explicitly set otherwise.

This ensures compatibility with cgroupv1 controller which interprets
things this way.

With this fixed, we can finally enable swap tests for cgroupv2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-22 21:32:16 -07:00
lifubang a67dab0ac2 Revert "CreateCgroupPath: only enable needed controllers"
1. Partially revert "CreateCgroupPath: only enable needed controllers"
If we update a resource which did not limited in the beginning,
it will have no effective.
2. Returns err if we use an non enabled controller,
or else the user may feel success, but actually there are no effective.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-21 12:17:46 +08:00
Kir Kolyshkin d57f5bb286 cgroupv1: don't ignore MemorySwap if Memory==-1
Commit 18ebc51b3cc3 "Reset Swap when memory is set to unlimited (-1)"
added handling of the case when a user updates the container limits
to set memory to unlimited (-1) but do not set any other limits.
Apparently, in this case, if swap limit was previously set, kernel fails
to set memory.limit_in_bytes to -1 if memory.memsw.limit_in_bytes is
not set to -1.

What the above commit fails to handle correctly is the request when
Memory is set to -1 and MemorySwap is set to some specific limit N
(where N > 0). In this case, the value of N is silently discarded
and MemorySwap is set to -1 instead.

This is wrong thing to do, as the limit set, even if incorrectly,
should not be ignored.

Fix this by only assigning MemorySwap == -1 in case it was not
explicitly set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 17:23:40 -07:00
Kir Kolyshkin 59897367c4 cgroups/systemd: allow to set -1 as pids.limit
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.

Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:

 # runc --systemd-cgroup update $CT --pids-limit 42
 # systemctl show runc-$CT.scope | grep TasksMax
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 42

 # ./runc --systemd-cgroup update $CT --pids-limit -1
 # systemctl show runc-$CT.scope | grep TasksMax=
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
 max

Fix by changing the condition to allow -1 as a valid value.

NOTE other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
should we return an error.

A test case is added.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:20:04 -07:00
Kir Kolyshkin 06d7c1d261 systemd+cgroupv1: fix updating CPUQuotaPerSecUSec
1. do not allow to set quota without period or period without quota, as we
   won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.

2. do not ignore setting quota to -1 when a period is not set.

3. update the test case accordingly.

Note that systemd value checks will be added in the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:17:18 -07:00
Kir Kolyshkin e4a84bea99 cgroupv2+systemd: set MemoryLow
For some reason, this was not set before.

Test case is added by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:15:29 -07:00
Kir Kolyshkin 716079f95b
Merge pull request #2406 from cyphar/devices-cgroup-header
cgroups: add copyright header to devices.Emulator implementation
2020-05-20 13:01:34 -07:00
Akihiro Suda cd4b71c27a
Merge pull request #2409 from adrianreber/go-criu-4-0-0
Update to latest go-criu
2020-05-21 01:39:09 +09:00
Adrian Reber 944e057025
Update to latest go-criu (4.0.2)
This updates to the latest version of go-criu (4.0.2) which is based on
CRIU 3.14.

As go-criu provides an existing way to query the CRIU binary for its
version this also removes all the code from runc to handle CRIU version
checking and now relies on go-criu.

An important side effect of this change is that this raises the minimum
CRIU version to 3.0.0 as that is the first CRIU version that supports
CRIU version queries via RPC in contrast to parsing the output of
'criu --version'

CRIU 3.0 has been released in April of 2017.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-05-20 13:49:38 +02:00
Giuseppe Scrivano 41aa19662b
libcontainer: honor seccomp errnoRet
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:55 +02:00
Akihiro Suda 2fa3c286b5 fix "libcontainer/cgroups/fs/cpuset.go:63:14: undefined: fmt"
The compilation error had ocurred because of a bad rebase during #2401 and #2413

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-19 23:38:20 +09:00
Akihiro Suda f369199ff6
Merge pull request #2413 from JFHwang/2392-spec-check
Add nil check of spec.Process in validateProcessSpec()
2020-05-19 08:11:22 +09:00
Mrunal Patel 53a4649776
Merge pull request #2401 from kolyshkin/fs-cpuset-mountinfo
libct/cgroup: rm GetClosestMountpointAncestor using moby/sys/mountinfo parser
2020-05-18 10:43:55 -07:00
Mrunal Patel 825e91ada6
Merge pull request #2341 from kolyshkin/test-cpt-lazy
runc checkpoint: fix --status-fd to accept fd
2020-05-18 10:43:24 -07:00
John Hwang 7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
lifubang 9ad1beb40f never write empty string to memory.swap.max
Because the empty string means set swap to 0.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-16 06:52:14 +08:00
Aleksa Sarai dc9a7879f9
cgroups: add copyright header to devices.Emulator implementation
I forgot to include this in the original patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-15 11:29:51 +10:00
Akihiro Suda 3f1e886991
Merge pull request #2391 from cyphar/devices-cgroup
cgroup: devices: major cleanups and minimal transition rules
2020-05-14 09:57:06 +09:00
Kir Kolyshkin 2db3240f35 libct/cgroups: rm GetClosestMountpointAncestor
The function GetClosestMountpointAncestor is not very efficient,
does not really belong to cgroup package, and is only used once
(from fs/cpuset.go).

Remove it, replacing with the implementation based on moby/sys/mountinfo
parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-13 17:32:06 -07:00
Kir Kolyshkin f160352682 libct/cgroup: prep to rm GetClosestMountpointAncestor
This function is not very efficient, does not really belong to cgroup
package, and is only used once (from fs/cpuset.go).

Prepare to remove it by replacing with the implementation based on
the parser from github.com/moby/sys/mountinfo parser.

This commit is here to make sure the proposed replacement passes the
unit test.

Funny, but the unit test need to be slightly modified since it
supplies the wrong mountinfo (space as the first character, empty line
at the end).

Validated by

 $ go test -v -run Ance
 === RUN   TestGetClosestMountpointAncestor
 --- PASS: TestGetClosestMountpointAncestor (0.00s)
 PASS
 ok  	github.com/opencontainers/runc/libcontainer/cgroups	0.002s

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-13 16:26:16 -07:00
Kir Kolyshkin 85d4264d8a
Merge pull request #2390 from lifubang/threadedordomain
cgroupv2: don't enable threaded mode by default

LGTMs: AkihiroSuda, cyphar, kolyshkin
2020-05-13 14:30:25 -07:00
Kir Kolyshkin 4b71877f99
Merge pull request #2292 from Creatone/creatone/extend-intelrdt
Add RDT Memory Bandwidth Monitoring (MBM) and Cache Monitoring Technology (CMT) statistics.
2020-05-13 13:33:55 -07:00
Kir Kolyshkin 41855317b6
Merge pull request #2271 from katarzyna-z/kk-cpuacct-usage-all
Add reading of information from cpuacct.usage_all
2020-05-13 13:33:05 -07:00
lifubang fe0669b26d don't enable threaded mode by default
Because in threaded mode, we can't enable the memory controller -- it isn't thread-aware.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-13 16:27:36 +08:00
Aleksa Sarai b810da1490
cgroups: systemd: make use of Device*= properties
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurrious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:43:56 +10:00
Aleksa Sarai afe83489d4
cgroupv1: devices: use minimal transition rules with devices.Emulator
Now that all of the infrastructure for devices.Emulator is in place, we
can finally implement minimal transition rules for devices cgroups. This
allows for minimal disruption to running containers if a rule update is
requested. Only in very rare circumstances (black-list cgroups and mode
switching) will a clear-all rule be written. As a result, containers
should no longer see spurious errors.

A similar issue affects the cgroupv2 devices setup, but that is a topic
for another time (as the solution is drastically different).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:42:43 +10:00
Aleksa Sarai 2353ffec2b
cgroups: implement a devices cgroupv1 emulator
Okay, this requires a bit of explanation.

The reason for this emulation is to allow us to have seamless updates of
the devices cgroup for running containers. This was triggered by several
users having issues where our initial writing of a deny-all rule (in all
cases) results in spurrious errors.

The obvious solution would be to just remove the deny-all rule, right?
Well, it turns out that runc doesn't actually control the deny-all rule
because all users of runc have explicitly specified their own deny-all
rule for many years. This appears to have been done to work around a bug
in runc (which this series has fixed in [1]) where we would actually act
as a black-list despite this being a violation of the OCI spec.

This means that not adding our own deny-all rule in the case of updates
won't solve the issue. However, it will also not solve the issue in
several other cases (the most notable being where a container is being
switched between default-permission modes).

So in order to handle all of these cases, a way of tracking the relevant
internal cgroup state (given a certain state of "cgroups.list" and a set
of rules to apply) is necessary. That is the purpose of DevicesEmulator.
Reading "devices.list" is quite important because that's the only way we
can tell if it's safe to skip the troublesome deny-all rules without
making potentially-dangerous assumptions about the container.

We also are currently bug-compatible with the devices cgroup (namely,
removing rules that don't exist or having superfluous rules all works as
with the in-kernel implementation). The only exception to this is that
we give an error if a user requests to revoke part of a wildcard
exception, because allowing such configurations could result in security
holes (cgroupv1 silently ignores such rules, meaning in white-list mode
that the access is still permitted).

[1]: b2bec9806f ("cgroup: devices: eradicate the Allow/Deny lists")

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:42:20 +10:00
Aleksa Sarai 24388be71e
configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai 60e21ec26e
specconv: remove default /dev/console access
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai b2bec9806f
cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai 859a780d6f
cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Akihiro Suda 2b9a36ee8c
Merge pull request #2398 from pkagrawal/master
Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig
2020-05-12 15:05:55 +09:00
Kir Kolyshkin ca1d135bd4 runc checkpoint: fix --status-fd to accept fd
1. The command `runc checkpoint --lazy-server --status-fd $FD` actually
accepts a file name as an $FD. Make it accept a file descriptor,
like its name implies and the documentation states.

In addition, since runc itself does not use the result of CRIU status
fd, remove the code which relays it, and pass the FD directly to CRIU.

Note 1: runc should close this file descriptor itself after passing it
to criu, otherwise whoever waits on it might wait forever.

Note 2: due to the way criu swrk consumes the fd (it reopens
/proc/$SENDER_PID/fd/$FD), runc can't close it as soon as criu swrk has
started. There is no good way to know when criu swrk has reopened the
fd, so we assume that as soon as we have received something back, the
fd is already reopened.

2. Since the meaning of --status-fd has changed, the test case using
it needs to be fixed as well.

Modify the lazy migration test to remove "sleep 2", actually waiting
for the the lazy page server to be ready.

While at it,

 - remove the double fork (using shell's background process is
   sufficient here);

 - check the exit code for "runc checkpoint" and "criu lazy-pages";

 - remove the check for no errors in dump.log after restore, as we
   are already checking its exit code.

[v2: properly close status fd after spawning criu]
[v3: move close status fd to after the first read]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-11 15:36:50 -07:00
Pradyumna Agrawal 4aa9101477 Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig
The change ensures that the passed in value of NoNewPrivileges under spec.Process
is reflected in the container config generated by specconv.CreateLibcontainerConfig

Closes #2397

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-05-11 13:38:14 -07:00
Kir Kolyshkin 714c91e9f7 Simplify cgroup path handing in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
    enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 -  - repeated writing of the PID into the same cgroup for v2;

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
Kir Kolyshkin 2c8d668eee
Merge pull request #2387 from kolyshkin/g-knot-prepare
cgroup refactoring

LGTMs: AkihiroSuda, mrunalp.
2020-05-08 12:03:22 -07:00
Kir Kolyshkin 1d143562d2 libct/cgroups/fs: access m.paths under lock
1. Prevent theoretical "concurrent map access" error to m.paths.

2. There is no need to call m.Paths -- we can access m.paths directly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:55 -07:00
Kir Kolyshkin 51e1a0842d libct/cgroups/systemd/v1: privatize v1 manager
This patch was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:48 -07:00
Kir Kolyshkin d827e323b0 libct/cgroups/systemd/v1: add NewLegacyManager
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:40 -07:00
Kir Kolyshkin fc620fdf81 libct/cgroups/fs: privatize Manager and its fields
This was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:00 -07:00
Kir Kolyshkin 5935bf8c21 libct/cgroups/fs: introduce NewManager()
...and use it from libcontainer/factory_linux.go.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:05 -07:00
Kir Kolyshkin 24f945e08d libct/cgroups/systemd/v2: return a public interface
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:02 -07:00
Kir Kolyshkin 63854b0ea8 newSetnsProcess: reuse state.CgroupPaths
c.cgroupManager.GetPaths() are called twice here: once in currentState()
and then in newSetnsProcess(). Reuse the result of the first call, which
is stored into state.CgroupPaths.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:59 -07:00
Kir Kolyshkin 9a3e632625 notify: simplify usage
Instead of passing the whole map of paths, pass the path to the memory
controller which these functions actually require.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:58 -07:00