Instead of iterating over m.paths, iterate over subsystems and look up
the path for each. This is faster since a map lookup is faster than
iterating over the names in Get. A quick benchmark shows that the new
way is 2.5x faster than the old one.
Note though that this is not done to make things faster, as savings are
negligible, but to make things simpler by removing some code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
some hierarchies were created directly by .Apply() on top of systemd
managed cgroups. systemd doesn't manage these and as a result we leak
these cgroups.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Iterating over the list of subsystems and comparing their names to get an
instance of fs.cgroupFreezer is useless and a waste of time, since it is
a shallow type (i.e. does not have any data/state) and we can create an
instance in place.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.
The problem (for cgroupv2 that uses eBPF for device configuration) is
the hard requirement to have devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. program. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.
Work around this by adding a SkipDevices flag to Resources.
A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
there have been cases observed where instead of `v$VER.0-$OS` the systemdVersion returned is just `$VER`, or `$VER-1`.
handle these cases
Signed-off-by: Peter Hunt <pehunt@redhat.com>
For some reason, runc systemd drivers (both v1 and v2) never set
systemd unit property named `CPUQuotaPeriod` (known as
`CPUQuotaPeriodUSec` on dbus and in `systemctl show` output).
Set it, and add a check to all the integration tests. The check is less
than trivial because, when not set, the value is shown as "infinity" but
when set to the same (default) value, shown as "100ms", so in case we
expect 100ms (period = 100000 us), we have to _also_ check for
"infinity".
[v2: add systemd version checks since CPUQuotaPeriod requires v242+]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This fixes a few cases of accessing m.paths map directly without holding
the mutex lock.
Fixes: 9087f2e82
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
systemd drivers ignore --cpu-quota during update if the CPU
period was not set earlier.
Fixed by adding the default for the period.
The test will be added by the following commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The code that adds CpuQuotaPerSecUSec is the same in v1 and v2
systemd cgroup driver. Move it to common.
No functional change.
Note that the comment telling that we always set this property
contradicts with the current code, and therefore it is removed.
[v2: drop cgroupv1-specific comment]
[v3: drop returning error as it's not used]
[v4: remove an obsoleted comment]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Use r instead of c.Resources for readability. No functional change.
This commit has been brought to you by '<,'>s/c\.Resources\./r./g
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When we use cgroup with systemd driver, the cgroup path will be auto removed
by systemd when all processes exited. So we should check cgroup path exists
when we access the cgroup path, for example in `kill/ps`, or else we will
got an error.
Signed-off-by: lifubang <lifubang@acmcoder.com>
This is a regression from commit 1d4ccc8e0. We only need to enable
kernel memory accounting once, from the (*legacyManager*).Apply(),
and there is no need to do it in (*legacyManager*).Set().
While at it, rename the method to better reflect what it's doing.
This saves 1 call to mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 4e65e0e90a added a check for cpu shares. Apparently, the
kernel allows to set a value higher than max or lower than min without
an error, but the value read back is always within the limits.
The check (which was later moved out to a separate CheckCpushares()
function) is always performed after setting the cpu shares, so let's
move it to the very place where it is set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a quick-n-dirty fix the regression introduced by commit
06d7c1d, which made it impossible to only set CpuQuota
(without the CpuPeriod). It partially reverts the above commit,
and adds a test case.
The proper fix will follow.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
... and mem+swap is not explicitly set otherwise.
This ensures compatibility with cgroupv1 controller which interprets
things this way.
With this fixed, we can finally enable swap tests for cgroupv2.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.
Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:
# runc --systemd-cgroup update $CT --pids-limit 42
# systemctl show runc-$CT.scope | grep TasksMax
TasksMax=42
# cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
42
# ./runc --systemd-cgroup update $CT --pids-limit -1
# systemctl show runc-$CT.scope | grep TasksMax=
TasksMax=42
# cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
max
Fix by changing the condition to allow -1 as a valid value.
NOTE other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
should we return an error.
A test case is added.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. do not allow to set quota without period or period without quota, as we
won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.
2. do not ignore setting quota to -1 when a period is not set.
3. update the test case accordingly.
Note that systemd value checks will be added in the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.
So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurrious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.
However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.
[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")
Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This unties the Gordian Knot of using GetPaths in cgroupv2 code.
The problem is, the current code uses GetPaths for three kinds of things:
1. Get all the paths to cgroup v1 controllers to save its state (see
(*linuxContainer).currentState(), (*LinuxFactory).loadState()
methods).
2. Get all the paths to cgroup v1 controllers to have the setns process
enter the proper cgroups in `(*setnsProcess).start()`.
3. Get the path to a specific controller (for example,
`m.GetPaths()["devices"]`).
Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.
This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:
- multiple if/else code blocks that have to treat v1 and v2 separately;
- backward-compatible GetPaths() methods in v2 controllers;
- - repeated writing of the PID into the same cgroup for v2;
Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.
The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:
1. Use `GetPaths()` for state saving and setns process cgroups entering.
2. Introduce and use Path(subsys string) to obtain a path to a
subsystem. For v2, the argument is ignored and the unified path is
returned.
This commit converts all the controllers to the new API, and modifies
all the users to use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In cgroup v2, when memory and memorySwap set to the same value which is greater than zero,
runc should write zero in `memory.swap.max` to disable swap.
Signed-off-by: lifubang <lifubang@acmcoder.com>
1. Instead of enabling all available controllers, figure out which
ones are required, and only enable those.
2. Amend all setFoo() functions to call isFooSet(). While this might
seem unnecessary, it might actually help to uncover a bug.
Imagine someone:
- adds a cgroup.Resources.CpuFoo setting;
- modifies setCpu() to apply the new setting;
- but forgets to amend isCpuSet() accordingly <-- BUG
In this case, a test case modifying CpuFoo will help
to uncover the BUG. This is the reason why it's added.
This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :
- do not return an error early if we can't enable some controllers;
- if we fail to enable all controllers at once (usually because one
of them can't be enabled), try enabling them one by one.
Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.
[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Rename the files
- v1.go: cgroupv1 aka legacy;
- v2.go: cgroupv2 aka unified hierarchy;
- unsupported.go: when systemd is not available.
2. Move the code that is common between v1 and v2 to common.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().
Also, cgroup should be created in Apply(), not while creating a new
manager instance.
To fix:
1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
systemd drivers.
2. Delay available controllers map initialization to until it is needed.
With this patch:
- NewManager() only performs minimal initialization (initializin
m.dirPath, if not provided);
- Apply() properly creates cgroup path, enabling the controllers;
- m.controllers is initialized lazily on demand.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.
To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.
End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.
While at it
- add `rootless` flag which is passed on to fs2 manager
- merge getv2Path() into GetUnifiedPath(), don't overwrite
path if it is set during initialization (on Load).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).
Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.
[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove joinCgroupsV2() function, as its name and second parameter
are misleading. Use createCgroupsv2Path() directly, do not call
getv2Path() twice.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Function getSubsystemPath(), while works for v2 unified case, is
suboptimal, as it does a few unnecessary calls.
Add a simplified version of getSubsystemPath(), called getv2Path(),
and use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>