When we use cgroup with systemd driver, the cgroup path will be auto removed
by systemd when all processes exited. So we should check cgroup path exists
when we access the cgroup path, for example in `kill/ps`, or else we will
got an error.
Signed-off-by: lifubang <lifubang@acmcoder.com>
This is a regression from commit 1d4ccc8e0. We only need to enable
kernel memory accounting once, from the (*legacyManager*).Apply(),
and there is no need to do it in (*legacyManager*).Set().
While at it, rename the method to better reflect what it's doing.
This saves 1 call to mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 4e65e0e90a added a check for cpu shares. Apparently, the
kernel allows to set a value higher than max or lower than min without
an error, but the value read back is always within the limits.
The check (which was later moved out to a separate CheckCpushares()
function) is always performed after setting the cpu shares, so let's
move it to the very place where it is set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. In cases there are no sub-cgroups, a single rmdir should be faster
than iterating through the list of files.
2. Use unix.Rmdir() to save one more syscall since os.Remove() tries
unlink(2) first which fails on a directory, and only then tries
rmdir(2).
3. Re-use rmdir.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is a quick-n-dirty fix the regression introduced by commit
06d7c1d, which made it impossible to only set CpuQuota
(without the CpuPeriod). It partially reverts the above commit,
and adds a test case.
The proper fix will follow.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
... and mem+swap is not explicitly set otherwise.
This ensures compatibility with cgroupv1 controller which interprets
things this way.
With this fixed, we can finally enable swap tests for cgroupv2.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Partially revert "CreateCgroupPath: only enable needed controllers"
If we update a resource which did not limited in the beginning,
it will have no effective.
2. Returns err if we use an non enabled controller,
or else the user may feel success, but actually there are no effective.
Signed-off-by: lifubang <lifubang@acmcoder.com>
Commit 18ebc51b3cc3 "Reset Swap when memory is set to unlimited (-1)"
added handling of the case when a user updates the container limits
to set memory to unlimited (-1) but do not set any other limits.
Apparently, in this case, if swap limit was previously set, kernel fails
to set memory.limit_in_bytes to -1 if memory.memsw.limit_in_bytes is
not set to -1.
What the above commit fails to handle correctly is the request when
Memory is set to -1 and MemorySwap is set to some specific limit N
(where N > 0). In this case, the value of N is silently discarded
and MemorySwap is set to -1 instead.
This is wrong thing to do, as the limit set, even if incorrectly,
should not be ignored.
Fix this by only assigning MemorySwap == -1 in case it was not
explicitly set.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.
Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:
# runc --systemd-cgroup update $CT --pids-limit 42
# systemctl show runc-$CT.scope | grep TasksMax
TasksMax=42
# cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
42
# ./runc --systemd-cgroup update $CT --pids-limit -1
# systemctl show runc-$CT.scope | grep TasksMax=
TasksMax=42
# cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
max
Fix by changing the condition to allow -1 as a valid value.
NOTE other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
should we return an error.
A test case is added.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. do not allow to set quota without period or period without quota, as we
won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.
2. do not ignore setting quota to -1 when a period is not set.
3. update the test case accordingly.
Note that systemd value checks will be added in the next commit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The function GetClosestMountpointAncestor is not very efficient,
does not really belong to cgroup package, and is only used once
(from fs/cpuset.go).
Remove it, replacing with the implementation based on moby/sys/mountinfo
parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This function is not very efficient, does not really belong to cgroup
package, and is only used once (from fs/cpuset.go).
Prepare to remove it by replacing with the implementation based on
the parser from github.com/moby/sys/mountinfo parser.
This commit is here to make sure the proposed replacement passes the
unit test.
Funny, but the unit test need to be slightly modified since it
supplies the wrong mountinfo (space as the first character, empty line
at the end).
Validated by
$ go test -v -run Ance
=== RUN TestGetClosestMountpointAncestor
--- PASS: TestGetClosestMountpointAncestor (0.00s)
PASS
ok github.com/opencontainers/runc/libcontainer/cgroups 0.002s
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.
So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurrious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.
However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.
[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")
Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Now that all of the infrastructure for devices.Emulator is in place, we
can finally implement minimal transition rules for devices cgroups. This
allows for minimal disruption to running containers if a rule update is
requested. Only in very rare circumstances (black-list cgroups and mode
switching) will a clear-all rule be written. As a result, containers
should no longer see spurious errors.
A similar issue affects the cgroupv2 devices setup, but that is a topic
for another time (as the solution is drastically different).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Okay, this requires a bit of explanation.
The reason for this emulation is to allow us to have seamless updates of
the devices cgroup for running containers. This was triggered by several
users having issues where our initial writing of a deny-all rule (in all
cases) results in spurrious errors.
The obvious solution would be to just remove the deny-all rule, right?
Well, it turns out that runc doesn't actually control the deny-all rule
because all users of runc have explicitly specified their own deny-all
rule for many years. This appears to have been done to work around a bug
in runc (which this series has fixed in [1]) where we would actually act
as a black-list despite this being a violation of the OCI spec.
This means that not adding our own deny-all rule in the case of updates
won't solve the issue. However, it will also not solve the issue in
several other cases (the most notable being where a container is being
switched between default-permission modes).
So in order to handle all of these cases, a way of tracking the relevant
internal cgroup state (given a certain state of "cgroups.list" and a set
of rules to apply) is necessary. That is the purpose of DevicesEmulator.
Reading "devices.list" is quite important because that's the only way we
can tell if it's safe to skip the troublesome deny-all rules without
making potentially-dangerous assumptions about the container.
We also are currently bug-compatible with the devices cgroup (namely,
removing rules that don't exist or having superfluous rules all works as
with the in-kernel implementation). The only exception to this is that
we give an error if a user requests to revoke part of a wildcard
exception, because allowing such configurations could result in security
holes (cgroupv1 silently ignores such rules, meaning in white-list mode
that the access is still permitted).
[1]: b2bec9806f ("cgroup: devices: eradicate the Allow/Deny lists")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...
Signed-off-by: Aleksa Sarai <asarai@suse.de>
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).
In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).
This is one of many changes needed to clean up the devices cgroup code.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This unties the Gordian Knot of using GetPaths in cgroupv2 code.
The problem is, the current code uses GetPaths for three kinds of things:
1. Get all the paths to cgroup v1 controllers to save its state (see
(*linuxContainer).currentState(), (*LinuxFactory).loadState()
methods).
2. Get all the paths to cgroup v1 controllers to have the setns process
enter the proper cgroups in `(*setnsProcess).start()`.
3. Get the path to a specific controller (for example,
`m.GetPaths()["devices"]`).
Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.
This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:
- multiple if/else code blocks that have to treat v1 and v2 separately;
- backward-compatible GetPaths() methods in v2 controllers;
- - repeated writing of the PID into the same cgroup for v2;
Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.
The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:
1. Use `GetPaths()` for state saving and setns process cgroups entering.
2. Introduce and use Path(subsys string) to obtain a path to a
subsystem. For v2, the argument is ignored and the unified path is
returned.
This commit converts all the controllers to the new API, and modifies
all the users to use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Prevent theoretical "concurrent map access" error to m.paths.
2. There is no need to call m.Paths -- we can access m.paths directly.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The test cases need to take into account the assembly modifications.
The instruction:
LdXMemH dst: r2 src: r1 off: 0 imm: 0
has been replaced with:
LdXMemW dst: r2 src: r1 off: 0 imm: 0
And32Imm dst: r2 imm: 65535
Signed-off-by: Alice Frosi <afrosi@de.ibm.com>
Load the full 32 bits word and take the lower 16 bits, instead of
reading just 16 bits.
Same fix as 07bae05e61
Signed-off-by: Alice Frosi <afrosi@de.ibm.com>
In cgroup v2, when memory and memorySwap set to the same value which is greater than zero,
runc should write zero in `memory.swap.max` to disable swap.
Signed-off-by: lifubang <lifubang@acmcoder.com>
Function defaultPath always parses /proc/self/cgroup, but
the resulting value is not always used.
Avoid unnecessary reading/parsing by moving the code
to just before its use.
Modify the test case accordingly.
[v2: test: use UnifiedMountpoint, skip test if not on v2]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. There is no need to try removing it recursively.
2. Do not treat ENOENT as an error (similar to fs
and systemd v1 drivers).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Instead of enabling all available controllers, figure out which
ones are required, and only enable those.
2. Amend all setFoo() functions to call isFooSet(). While this might
seem unnecessary, it might actually help to uncover a bug.
Imagine someone:
- adds a cgroup.Resources.CpuFoo setting;
- modifies setCpu() to apply the new setting;
- but forgets to amend isCpuSet() accordingly <-- BUG
In this case, a test case modifying CpuFoo will help
to uncover the BUG. This is the reason why it's added.
This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :
- do not return an error early if we can't enable some controllers;
- if we fail to enable all controllers at once (usually because one
of them can't be enabled), try enabling them one by one.
Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.
[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Rename the files
- v1.go: cgroupv1 aka legacy;
- v2.go: cgroupv2 aka unified hierarchy;
- unsupported.go: when systemd is not available.
2. Move the code that is common between v1 and v2 to common.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This check was removed in commit 5406833a65. Now, when this
function is called from a few places, it is no longer obvious
that the path always starts with /sys/fs/cgroup/, so reinstate
the check just to be on the safe side.
This check also ensures that elements[3:] can be used safely.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().
Also, cgroup should be created in Apply(), not while creating a new
manager instance.
To fix:
1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
systemd drivers.
2. Delay available controllers map initialization to until it is needed.
With this patch:
- NewManager() only performs minimal initialization (initializin
m.dirPath, if not provided);
- Apply() properly creates cgroup path, enabling the controllers;
- m.controllers is initialized lazily on demand.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.
To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.
End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.
While at it
- add `rootless` flag which is passed on to fs2 manager
- merge getv2Path() into GetUnifiedPath(), don't overwrite
path if it is set during initialization (on Load).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In this very case, the code is writing to cgroup2 filesystem,
and the file name is well known and can't possibly be a symlink.
So, using securejoin is redundant.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When the cgroupv2 fs driver is used without setting cgroupsPath,
it picks up a path from /proc/self/cgroup. On a host with systemd,
such a path can look like (examples from my machines):
- /user.slice/user-1000.slice/session-4.scope
- /user.slice/user-1000.slice/user@1000.service/gnome-launched-xfce4-terminal.desktop-4260.scope
- /user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
This cgroup already contains processes in it, which prevents to enable
controllers for a sub-cgroup (writing to cgroup.subtree_control fails
with EBUSY or EOPNOTSUPP).
Obviously, a parent cgroup (which does not contain tasks) should be used.
Fixes opencontainers/runc/issues/2298
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 6905b72154 treats all negative values as "max",
citing cgroup v1 compatibility as a reason. In fact, in
cgroup v1 only -1 is treated as "unlimited", and other
negative values usually calse an error.
Treat -1 as "max", pass other negative values as is
(the error will be returned from the kernel).
Fixes: 6905b72154
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).
Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.
[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
otherwise the memoryStats and hugetlbStats maps are not initialized
and GetStats() segfaults when using them.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Make use of errors.Is() and errors.As() where appropriate to check
the underlying error. The biggest motivation is to simplify the code.
The feature requires go 1.13 but since merging #2256 we are already
not supporting go 1.12 (which is an unsupported release anyway).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using errors.Unwrap() is not the best thing to do, since it returns
nil in case of an error which was not wrapped. More to say,
errors package provides more elegant ways to check for underlying
errors, such as errors.As() and errors.Is().
This reverts commit f8e138855d, reversing
changes made to 6ca9d8e6da.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Tested that the EINTR is still being detected:
> $ go1.14 test -c # 1.14 is needed for EINTR to happen
> $ sudo ./fscommon.test
> INFO[0000] interrupted while writing 1063068 to /sys/fs/cgroup/memory/test-eint-89293785/memory.limit_in_bytes
> PASS
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>