Commit Graph

138 Commits

Author SHA1 Message Date
Kir Kolyshkin 2a322e91ec cgroupv1: remove subsystemSet.Get()
Instead of iterating over m.paths, iterate over subsystems and look up
the path for each. This is faster since a map lookup is faster than
iterating over the names in Get. A quick benchmark shows that the new
way is 2.5x faster than the old one.

Note though that this is not done to make things faster, as savings are
negligible, but to make things simpler by removing some code.
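
The difference between the two iteration strategies can be sketched roughly as follows. This is an illustration only, not the runc code: the `subsystem` type and the `get`, `pathsOld`, and `pathsNew` helpers are hypothetical names invented for the example.

```go
package main

import "fmt"

// subsystem is a hypothetical stand-in for a cgroup v1 controller.
type subsystem struct{ name string }

// get mimics a name-based lookup: a linear scan over the subsystem list.
func get(subsystems []subsystem, name string) (subsystem, bool) {
	for _, s := range subsystems {
		if s.name == name {
			return s, true
		}
	}
	return subsystem{}, false
}

// pathsOld ranges over the paths map and scans the list for every key.
func pathsOld(subsystems []subsystem, paths map[string]string) []string {
	var out []string
	for name, p := range paths {
		if s, ok := get(subsystems, name); ok {
			out = append(out, s.name+"="+p)
		}
	}
	return out
}

// pathsNew ranges over the subsystems and does one map lookup per entry.
func pathsNew(subsystems []subsystem, paths map[string]string) []string {
	var out []string
	for _, s := range subsystems {
		if p, ok := paths[s.name]; ok {
			out = append(out, s.name+"="+p)
		}
	}
	return out
}

func main() {
	subsystems := []subsystem{{"cpu"}, {"memory"}, {"pids"}}
	paths := map[string]string{
		"cpu":  "/sys/fs/cgroup/cpu/ct",
		"pids": "/sys/fs/cgroup/pids/ct",
	}
	fmt.Println(len(pathsOld(subsystems, paths)), pathsNew(subsystems, paths))
}
```

Besides dropping the helper, the second form visits the subsystems in a fixed order, whereas ranging over a Go map does not.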

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-06 18:31:46 -07:00
Mrunal Patel 30dc54a995
Merge pull request #2503 from giuseppe/cgroup-fixes
cgroup, systemd: cleanup cgroups
2020-07-06 15:14:29 -07:00
Mrunal Patel 3f81131845
Merge pull request #2490 from kolyshkin/dev-opt
libct/cgroups: add SkipDevices to Resources
2020-07-06 14:28:30 -07:00
Giuseppe Scrivano 32034481ea
cgroup, systemd: cleanup cgroups
some hierarchies were created directly by .Apply() on top of systemd
managed cgroups.  systemd doesn't manage these and as a result we leak
these cgroups.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-07-06 23:06:16 +02:00
Giuseppe Scrivano 2deaeab08f
cgroup: store the result of IsRunningSystemd
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-07-05 12:42:27 +02:00
Kir Kolyshkin cd479f9d14 cgroupv1/freezer: don't use subsystemSet.Get()
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-03 14:00:44 -07:00
Kir Kolyshkin 108ee85b82 libct/cgroups: add SkipDevices to Resources
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2, which uses eBPF for device configuration) is
that the hard requirement to have the devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).
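
A minimal sketch of how such a flag and its guard might fit together. The `resources` struct here is a hypothetical cut-down stand-in, and `configureDevices` and `checkStartable` are invented names, not the functions touched by this commit.

```go
package main

import (
	"errors"
	"fmt"
)

// resources is a hypothetical, cut-down stand-in for the real Resources type.
type resources struct {
	SkipDevices bool
}

// configureDevices reports whether device rules would be applied at all.
func configureDevices(r *resources) bool {
	return !r.SkipDevices
}

// checkStartable refuses to start anything configured with SkipDevices,
// so the flag stays limited to parent ("non-container") cgroups.
func checkStartable(r *resources) error {
	if r.SkipDevices {
		return errors.New("can't start a container with SkipDevices set")
	}
	return nil
}

func main() {
	parent := &resources{SkipDevices: true}
	fmt.Println(configureDevices(parent), checkStartable(parent))
}
```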

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-02 15:19:31 -07:00
Peter Hunt 6a0f64e7c9 systemd: add unit tests for systemdVersion
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2020-06-18 22:30:50 -04:00
Peter Hunt 6369e38871 systemd: parse systemdVersion in more situations
There have been cases observed where, instead of `v$VER.0-$OS`, the systemdVersion returned is just `$VER` or `$VER-1`.
Handle these cases.
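
One way to accept all three shapes is to take the leading run of digits after an optional `v`. The sketch below assumes that approach; `parseSystemdVersion` and `versionRe` are hypothetical names and the sample strings are made up, so this is not necessarily how the commit implements it.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// versionRe matches an optional leading "v" followed by the major version.
var versionRe = regexp.MustCompile(`^v?([0-9]+)`)

// parseSystemdVersion extracts the major version from strings shaped like
// "v245.4-1.fc32", "245", or "245-1".
func parseSystemdVersion(s string) (int, error) {
	m := versionRe.FindStringSubmatch(s)
	if m == nil {
		return 0, fmt.Errorf("can't parse systemd version %q", s)
	}
	return strconv.Atoi(m[1])
}

func main() {
	for _, s := range []string{"v245.4-1.fc32", "245", "245-1"} {
		v, err := parseSystemdVersion(s)
		fmt.Println(s, v, err)
	}
}
```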

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2020-06-18 22:30:50 -04:00
Mrunal Patel 406298fdf0
Merge pull request #2466 from kolyshkin/systemd-cpu-quota-period
cgroups/systemd: add setting CPUQuotaPeriod prop
2020-06-17 12:03:30 -07:00
Kir Kolyshkin e751a168dc cgroups/systemd: add setting CPUQuotaPeriod prop
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).

Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".

[v2: add systemd version checks since CPUQuotaPeriod requires v242+]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 15:48:06 -07:00
Kir Kolyshkin dd2426d067 libct/cgroups: fix m.paths map access
This fixes a few cases of accessing m.paths map directly without holding
the mutex lock.
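
The pattern being restored is the usual one of funnelling every map access through methods that take the lock. A generic sketch, with a hypothetical `manager` type and accessor names that are not taken from the runc source:

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical cgroup manager whose paths map is shared
// between goroutines.
type manager struct {
	mu    sync.Mutex
	paths map[string]string
}

// path returns the path for one subsystem while holding the lock.
func (m *manager) path(subsys string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.paths[subsys]
}

// setPath updates the map while holding the lock.
func (m *manager) setPath(subsys, p string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.paths[subsys] = p
}

func main() {
	m := &manager{paths: make(map[string]string)}
	var wg sync.WaitGroup
	for _, name := range []string{"cpu", "memory", "pids"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			m.setPath(name, "/sys/fs/cgroup/"+name+"/ct")
		}(name)
	}
	wg.Wait()
	fmt.Println(m.path("cpu"))
}
```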

Fixes: 9087f2e82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-15 18:30:16 -07:00
Kir Kolyshkin 5b247e739c
Merge pull request #2338 from lifubang/systemdcgroupv2
fix path error in systemd when stopped

LGTMs: @mrunalp @AkihiroSuda
2020-06-15 18:01:13 -07:00
Kir Kolyshkin a92b0327ce cgroups/systemd: fix set CPU quota if period is unset
systemd drivers ignore --cpu-quota during update if the CPU
period was not set earlier.

Fixed by adding the default for the period.
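
As a rough illustration of why a default period makes the quota usable: the per-second value is the quota scaled by one second over the period, so a missing period has to be replaced by something. The helper name `quotaPerSecUSec` is hypothetical, the 100000 us default is assumed to be the usual 100ms CFS period, and non-positive quotas (unset or unlimited) are deliberately left out of the sketch.

```go
package main

import "fmt"

// defaultPeriodUs is the assumed default CFS period of 100ms, in microseconds.
const defaultPeriodUs = 100000

// quotaPerSecUSec converts a quota/period pair (both in microseconds) into
// microseconds of CPU time per second of wall-clock time. The boolean is
// false when there is no positive quota to convert.
func quotaPerSecUSec(quota int64, period uint64) (uint64, bool) {
	if quota <= 0 {
		return 0, false
	}
	if period == 0 {
		// Period was never set: fall back to the default instead of
		// ignoring the quota.
		period = defaultPeriodUs
	}
	return uint64(quota) * 1000000 / period, true
}

func main() {
	v, ok := quotaPerSecUSec(50000, 0)
	fmt.Println(v, ok)
}
```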

The test will be added by the following commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:32:17 -07:00
Kir Kolyshkin 8b9646775e cgroups/systemd: unify adding CpuQuota
The code that adds CpuQuotaPerSecUSec is the same in v1 and v2
systemd cgroup driver. Move it to common.

No functional change.

Note that the comment saying that we always set this property
contradicts the current code, and therefore it is removed.

[v2: drop cgroupv1-specific comment]
[v3: drop returning error as it's not used]
[v4: remove an obsoleted comment]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-09 17:14:43 -07:00
Kir Kolyshkin 2ce20ed158 cgroups/systemd: simplify gen*ResourcesProperties
Use r instead of c.Resources for readability. No functional change.

This commit has been brought to you by '<,'>s/c\.Resources\./r./g

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-08 13:42:09 -07:00
lifubang 9087f2e827 fix path error in systemd when stopped
When we use cgroups with the systemd driver, the cgroup path is automatically removed
by systemd once all processes have exited. So we should check that the cgroup path exists
before we access it, for example in `kill/ps`, or else we will
get an error.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-06-02 18:17:43 +08:00
Mrunal Patel 332a84581e
Merge pull request #2443 from kolyshkin/kmem-fixup
cgroupv1/systemd.Set: don't enable kernel memory acct
2020-05-31 10:04:45 -07:00
Kir Kolyshkin 3fe6e04510 cgroupv1/systemd.Set: don't enable kernel memory acct
This is a regression from commit 1d4ccc8e0. We only need to enable
kernel memory accounting once, from the (*legacyManager).Apply(),
and there is no need to do it in (*legacyManager).Set().

While at it, rename the method to better reflect what it's doing.

This saves 1 call to mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-29 17:54:50 -07:00
Kir Kolyshkin 3249e2379c cgroupv1: check cpu shares in place
Commit 4e65e0e90a added a check for cpu shares. Apparently, the
kernel allows setting a value higher than max or lower than min without
an error, but the value read back is always within the limits.

The check (which was later moved out to a separate CheckCpushares()
function) is always performed after setting the cpu shares, so let's
move it to the very place where it is set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-29 16:46:28 -07:00
Kir Kolyshkin be5467872d cgroupv1: minimal fix for cpu quota regression
This is a quick-n-dirty fix for the regression introduced by commit
06d7c1d, which made it impossible to only set CpuQuota
(without the CpuPeriod). It partially reverts the above commit,
and adds a test case.

The proper fix will follow.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-26 11:02:16 -07:00
Kir Kolyshkin 3c6e8ac4d2 cgroupv2: set mem+swap to max if mem set to max
... and mem+swap is not explicitly set otherwise.

This ensures compatibility with cgroupv1 controller which interprets
things this way.
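
The rule can be written down in a few lines. In this sketch, -1 stands for "max" and 0 for "not set"; those conventions and the `effectiveMemorySwap` name are assumptions made for the example.

```go
package main

import "fmt"

// effectiveMemorySwap returns the mem+swap setting to apply.
// Assumed conventions: -1 means unlimited ("max"), 0 means "not set".
func effectiveMemorySwap(memory, memorySwap int64) int64 {
	if memory == -1 && memorySwap == 0 {
		// Memory is unlimited and mem+swap was not given explicitly:
		// treat the request as "both unlimited".
		return -1
	}
	return memorySwap
}

func main() {
	fmt.Println(effectiveMemorySwap(-1, 0))
}
```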

With this fixed, we can finally enable swap tests for cgroupv2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-22 21:32:16 -07:00
Kir Kolyshkin 59897367c4 cgroups/systemd: allow to set -1 as pids.limit
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.

Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:

 # runc --systemd-cgroup update $CT --pids-limit 42
 # systemctl show runc-$CT.scope | grep TasksMax
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 42

 # ./runc --systemd-cgroup update $CT --pids-limit -1
 # systemctl show runc-$CT.scope | grep TasksMax=
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
 max

Fix by changing the condition to allow -1 as a valid value.

NOTE: other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
whether we should return an error.
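
A sketch of the changed condition. The `tasksMax` helper is a hypothetical name, and representing "unlimited" as the maximum uint64 is an assumption about how the value is passed to systemd, not something stated in the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// tasksMax decides whether a TasksMax property should be set for a given
// pids limit, and with which value.
func tasksMax(pidsLimit int64) (uint64, bool) {
	switch {
	case pidsLimit > 0:
		return uint64(pidsLimit), true
	case pidsLimit == -1:
		// Assumption: unlimited is expressed as the largest uint64.
		return math.MaxUint64, true
	}
	// Zero and other negative values are ignored, as before.
	return 0, false
}

func main() {
	fmt.Println(tasksMax(42))
	fmt.Println(tasksMax(-1))
	fmt.Println(tasksMax(-2))
}
```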

A test case is added.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:20:04 -07:00
Kir Kolyshkin 06d7c1d261 systemd+cgroupv1: fix updating CPUQuotaPerSecUSec
1. do not allow to set quota without period or period without quota, as we
   won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.

2. do not ignore setting quota to -1 when a period is not set.

3. update the test case accordingly.

Note that systemd value checks will be added in the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:17:18 -07:00
Kir Kolyshkin e4a84bea99 cgroupv2+systemd: set MemoryLow
For some reason, this was not set before.

Test case is added by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:15:29 -07:00
John Hwang 7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
Aleksa Sarai b810da1490
cgroups: systemd: make use of Device*= properties
It seems we missed that systemd added support for the devices cgroup; as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:43:56 +10:00
Aleksa Sarai 859a780d6f
cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Kir Kolyshkin 714c91e9f7 Simplify cgroup path handling in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
    enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 - repeated writing of the PID into the same cgroup for v2.

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.
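
The shape of the resulting API can be sketched like this. The `manager` interface and both implementations are simplified, hypothetical versions; in particular, the empty-string key used by the v2 `GetPaths()` is an assumption made for the example.

```go
package main

import "fmt"

// manager is a simplified version of the unified API described above.
type manager interface {
	// Path returns the path for one subsystem.
	Path(subsys string) string
	// GetPaths returns all paths, for state saving and setns.
	GetPaths() map[string]string
}

// v1Manager keeps one path per controller.
type v1Manager struct{ paths map[string]string }

func (m *v1Manager) Path(subsys string) string   { return m.paths[subsys] }
func (m *v1Manager) GetPaths() map[string]string { return m.paths }

// v2Manager has a single unified path and ignores the subsystem argument.
type v2Manager struct{ dirPath string }

func (m *v2Manager) Path(_ string) string { return m.dirPath }
func (m *v2Manager) GetPaths() map[string]string {
	return map[string]string{"": m.dirPath}
}

// devicesPath works for either version without an if/else on v1 vs v2.
func devicesPath(m manager) string { return m.Path("devices") }

func main() {
	v1 := &v1Manager{paths: map[string]string{"devices": "/sys/fs/cgroup/devices/ct"}}
	v2 := &v2Manager{dirPath: "/sys/fs/cgroup/ct"}
	fmt.Println(devicesPath(v1), devicesPath(v2))
}
```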

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
Kir Kolyshkin 51e1a0842d libct/cgroups/systemd/v1: privatize v1 manager
This patch was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:48 -07:00
Kir Kolyshkin d827e323b0 libct/cgroups/systemd/v1: add NewLegacyManager
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:40 -07:00
Kir Kolyshkin 24f945e08d libct/cgroups/systemd/v2: return a public interface
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:02 -07:00
Akihiro Suda bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master) on Ubuntu 19.10.

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
lifubang a70f354680 let runc disable swap in cgroup v2
In cgroup v2, when memory and memorySwap are set to the same value, which is greater than zero,
runc should write zero to `memory.swap.max` to disable swap.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-03 20:57:36 +08:00
lifubang bfa1b2aab3 check that StartTransientUnit and StopUnit succeed
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-28 15:46:28 +08:00
lifubang 1d4ccc8e0c fix data inconsistent when runc update in systemd driven cgroup v1
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:57 +08:00
lifubang 7682a2b2a5 fix data inconsistent when runc update in systemd driven cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:07 +08:00
Kir Kolyshkin 4b4bc995ad CreateCgroupPath: only enable needed controllers
1. Instead of enabling all available controllers, figure out which
   ones are required, and only enable those.

2. Amend all setFoo() functions to call isFooSet(). While this might
   seem unnecessary, it might actually help to uncover a bug.
   Imagine someone:
    - adds a cgroup.Resources.CpuFoo setting;
    - modifies setCpu() to apply the new setting;
    - but forgets to amend isCpuSet() accordingly <-- BUG

   In this case, a test case modifying CpuFoo will help
   to uncover the BUG. This is the reason why it's added.
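
The pairing of `setFoo()` with `isFooSet()` might look roughly like the sketch below. The `resources` fields, the `write` callback, and the bodies of these functions are hypothetical; only the idea of deriving the controller list from the same predicates the setters use comes from the text above. The `+name` syntax is what the kernel expects in `cgroup.subtree_control`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resources is a hypothetical, cut-down set of cgroup v2 settings.
type resources struct {
	CpuWeight uint64
	Memory    int64
	PidsLimit int64
}

func isCpuSet(r *resources) bool    { return r.CpuWeight != 0 }
func isMemorySet(r *resources) bool { return r.Memory != 0 }
func isPidsSet(r *resources) bool   { return r.PidsLimit != 0 }

// setCpu consults isCpuSet first, so a setting that setCpu applies but
// isCpuSet does not know about shows up as a test failure.
func setCpu(r *resources, write func(file, value string)) {
	if !isCpuSet(r) {
		return
	}
	write("cpu.weight", strconv.FormatUint(r.CpuWeight, 10))
}

// neededControllers lists only the controllers that the resources require.
func neededControllers(r *resources) []string {
	var out []string
	if isCpuSet(r) {
		out = append(out, "cpu")
	}
	if isMemorySet(r) {
		out = append(out, "memory")
	}
	if isPidsSet(r) {
		out = append(out, "pids")
	}
	return out
}

// subtreeControl formats the list for writing to cgroup.subtree_control.
func subtreeControl(controllers []string) string {
	parts := make([]string, 0, len(controllers))
	for _, c := range controllers {
		parts = append(parts, "+"+c)
	}
	return strings.Join(parts, " ")
}

func main() {
	r := &resources{CpuWeight: 100, PidsLimit: 42}
	fmt.Println(subtreeControl(neededControllers(r)))
}
```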

This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :

 - do not return an error early if we can't enable some controllers;
 - if we fail to enable all controllers at once (usually because one
   of them can't be enabled), try enabling them one by one.

Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.

[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin bb47e35843 cgroup/systemd: reorganize
1. Rename the files
  - v1.go: cgroupv1 aka legacy;
  - v2.go: cgroupv2 aka unified hierarchy;
  - unsupported.go: when systemd is not available.

2. Move the code that is common between v1 and v2 to common.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 813cb3eb94 cgroupv2: fix fs2 cgroup init
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().

Also, cgroup should be created in Apply(), not while creating a new
manager instance.

To fix:

1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
   renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
   systemd drivers.

2. Delay initialization of the available controllers map until it is needed.

With this patch:
 - NewManager() only performs minimal initialization (initializing
   m.dirPath, if not provided);
 - Apply() properly creates cgroup path, enabling the controllers;
 - m.controllers is initialized lazily on demand.
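
Lazy initialization of this kind is commonly done with `sync.Once`. The sketch below assumes that technique; the `manager` type, `newManager`, and the injected `readControllers` hook (standing in for reading the `cgroup.controllers` file) are all hypothetical and exist only to keep the example runnable anywhere.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// manager is a hypothetical fs2-style manager.
type manager struct {
	dirPath string
	// readControllers stands in for reading <dirPath>/cgroup.controllers.
	readControllers func(dirPath string) (string, error)

	once        sync.Once
	controllers map[string]struct{}
	err         error
}

// newManager does minimal work: no directories created, no files read.
func newManager(dirPath string, read func(string) (string, error)) *manager {
	return &manager{dirPath: dirPath, readControllers: read}
}

// getControllers fills the controllers map on first use only.
func (m *manager) getControllers() (map[string]struct{}, error) {
	m.once.Do(func() {
		data, err := m.readControllers(m.dirPath)
		if err != nil {
			m.err = err
			return
		}
		m.controllers = make(map[string]struct{})
		for _, name := range strings.Fields(data) {
			m.controllers[name] = struct{}{}
		}
	})
	return m.controllers, m.err
}

func main() {
	m := newManager("/sys/fs/cgroup/ct", func(string) (string, error) {
		return "cpu memory pids\n", nil
	})
	c, err := m.getControllers()
	fmt.Println(len(c), err)
}
```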

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin dbeff89491 cgroupv2/systemd: privatize UnifiedManager
... and its Cgroup field. There is no sense in keeping it public.

This was generated by gorename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 88c13c0713 cgroupv2: use SecureJoin in systemd driver
It seems that some paths come from the user and are therefore
untrusted.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:20:22 -07:00
Kir Kolyshkin 9c80cd672d cgroupv2: rm legacy Paths from systemd driver
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.

To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.

End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.

While at it
 - add `rootless` flag which is passed on to fs2 manager
 - merge getv2Path() into GetUnifiedPath(), don't overwrite
   path if it is set during initialization (on Load).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:19:51 -07:00
Ted Yu 614bb96676 cgroupv2/systemd: Properly remove intermediate directory
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-13 08:32:08 -07:00
Kir Kolyshkin c86be8a2c1 cgroupv2: fix setting MemorySwap
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).
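
The subtraction and the sanity checks can be sketched as below. `swapLimit` is a hypothetical name, the error texts are invented, and treating 0 as "not set" is an assumption; the handling of -1 and of other negative values follows the [v3] note.

```go
package main

import (
	"errors"
	"fmt"
)

// swapLimit converts an OCI-style memory+swap limit into a swap-only limit.
// Assumed conventions: 0 means "not set", -1 means unlimited.
func swapLimit(memorySwap, memory int64) (int64, error) {
	switch {
	case memorySwap == 0 || memorySwap == -1:
		// Unset or unlimited: pass through unchanged.
		return memorySwap, nil
	case memorySwap < 0:
		return 0, fmt.Errorf("invalid memory+swap value: %d", memorySwap)
	case memory <= 0:
		return 0, errors.New("a swap limit needs a memory limit")
	case memorySwap < memory:
		return 0, errors.New("memory+swap must be >= memory")
	}
	return memorySwap - memory, nil
}

func main() {
	fmt.Println(swapLimit(300, 100))
	fmt.Println(swapLimit(100, 100))
}
```

With equal values the result is 0, which corresponds to swap being disabled.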

Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.

[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-07 20:45:53 -07:00
Tobias Klauser 3e678c08f9 Remove unused consts testScopeWait and testSliceWait
These are unused since commit 518c855833 ("Remove libcontainer
detection for systemd features")

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2020-04-03 21:09:43 +02:00
Mrunal Patel d05e5728aa systemd: Lazy initialize the systemd dbus connection
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Mrunal Patel 33c6125da6 systemd: Export IsSystemdRunning() function
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Kir Kolyshkin a949e4f22f cgroupv2: UnifiedManager.Apply: simplify
Remove joinCgroupsV2() function, as its name and second parameter
are misleading. Use createCgroupsv2Path() directly, do not call
getv2Path() twice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:20:00 -07:00
Kir Kolyshkin 5406833a65 cgroupv2/systemd: add getv2Path
Function getSubsystemPath(), while it works for the v2 unified case, is
suboptimal, as it does a few unnecessary calls.

Add a simplified version of getSubsystemPath(), called getv2Path(),
and use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:17:09 -07:00