To make a bind mount read-only, it needs to be remounted. This is what
the code removed does, but it is not needed here.
We have to deal with three cases here:
1. cgroup v2 unified mode. In this case the mount is real mount with
fstype=cgroup2, and there is no need to have a bind mount on top,
as we pass readonly flag to the mount as is.
2. cgroup v1 + cgroupns (enableCgroupns == true). In this case the
"mount" is in fact a set of real mounts with fstype=cgroup, and
they are all performed in mountCgroupV1, with readonly flag
added if needed.
3. cgroup v1 as is (enableCgroupns == false). In this case
mountCgroupV1() calls mountToRootfs() again with an argument
from the list obtained from getCgroupMounts(), i.e. a bind
mount with the same flags as the original mount has (plus
unix.MS_BIND | unix.MS_REC), and mountToRootfs() does remounting
(under the case "bind":).
So, the code which this patch is removing is not needed -- it
essentially does nothing in case 3 above (since the bind mount
is already remounted readonly), and in cases 1 and 2 it
creates an unneeded extra bind mount on top of a real one (or set of
real ones).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).
So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.
Without this fix, checkpoint integration tests fail on cgroup v2.
Also, same is true for cgroup v1 + cgroupns.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. There is no need to try removing it recursively.
2. Do not treat ENOENT as an error (similar to fs
and systemd v1 drivers).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Instead of enabling all available controllers, figure out which
ones are required, and only enable those.
2. Amend all setFoo() functions to call isFooSet(). While this might
seem unnecessary, it might actually help to uncover a bug.
Imagine someone:
- adds a cgroup.Resources.CpuFoo setting;
- modifies setCpu() to apply the new setting;
- but forgets to amend isCpuSet() accordingly <-- BUG
In this case, a test case modifying CpuFoo will help
to uncover the BUG. This is the reason why it's added.
This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :
- do not return an error early if we can't enable some controllers;
- if we fail to enable all controllers at once (usually because one
of them can't be enabled), try enabling them one by one.
Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.
[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Rename the files
- v1.go: cgroupv1 aka legacy;
- v2.go: cgroupv2 aka unified hierarchy;
- unsupported.go: when systemd is not available.
2. Move the code that is common between v1 and v2 to common.go
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This check was removed in commit 5406833a65. Now, when this
function is called from a few places, it is no longer obvious
that the path always starts with /sys/fs/cgroup/, so reinstate
the check just to be on the safe side.
This check also ensures that elements[3:] can be used safely.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().
Also, cgroup should be created in Apply(), not while creating a new
manager instance.
To fix:
1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
systemd drivers.
2. Delay available controllers map initialization to until it is needed.
With this patch:
- NewManager() only performs minimal initialization (initializin
m.dirPath, if not provided);
- Apply() properly creates cgroup path, enabling the controllers;
- m.controllers is initialized lazily on demand.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.
To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.
End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.
While at it
- add `rootless` flag which is passed on to fs2 manager
- merge getv2Path() into GetUnifiedPath(), don't overwrite
path if it is set during initialization (on Load).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In this very case, the code is writing to cgroup2 filesystem,
and the file name is well known and can't possibly be a symlink.
So, using securejoin is redundant.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.
In particular, x/sys/unix defines:
```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr
const ENODEV = syscall.Errno(0x13)
```
and unix.Exec() calls syscall.Exec().
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 9a0184b10f meant to enable using cgroup v2 freezer
for criu >= 3.14, but it looks like it is doing something else
instead.
The logic here is:
- for cgroup v1, set FreezeCgroup, if available
- for cgroup v2, only set it for criu >= 3.14
- do not use GetPaths() in case v2 is used
(this method is obsoleted for v2 and will be removed)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When the cgroupv2 fs driver is used without setting cgroupsPath,
it picks up a path from /proc/self/cgroup. On a host with systemd,
such a path can look like (examples from my machines):
- /user.slice/user-1000.slice/session-4.scope
- /user.slice/user-1000.slice/user@1000.service/gnome-launched-xfce4-terminal.desktop-4260.scope
- /user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service
This cgroup already contains processes in it, which prevents to enable
controllers for a sub-cgroup (writing to cgroup.subtree_control fails
with EBUSY or EOPNOTSUPP).
Obviously, a parent cgroup (which does not contain tasks) should be used.
Fixes opencontainers/runc/issues/2298
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Commit 6905b72154 treats all negative values as "max",
citing cgroup v1 compatibility as a reason. In fact, in
cgroup v1 only -1 is treated as "unlimited", and other
negative values usually calse an error.
Treat -1 as "max", pass other negative values as is
(the error will be returned from the kernel).
Fixes: 6905b72154
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).
Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.
[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
otherwise the memoryStats and hugetlbStats maps are not initialized
and GetStats() segfaults when using them.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Make use of errors.Is() and errors.As() where appropriate to check
the underlying error. The biggest motivation is to simplify the code.
The feature requires go 1.13 but since merging #2256 we are already
not supporting go 1.12 (which is an unsupported release anyway).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using errors.Unwrap() is not the best thing to do, since it returns
nil in case of an error which was not wrapped. More to say,
errors package provides more elegant ways to check for underlying
errors, such as errors.As() and errors.Is().
This reverts commit f8e138855d, reversing
changes made to 6ca9d8e6da.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
CI was failing on cgroup v2 because mockCgroupManager.GetUnifiedPath()
was returning an error.
Now the function returns the value of mockCgroupManager.unifiedPath,
but the value is currently not used in the tests.
Fix#2286
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Using GetPaths from cgroupv2 unified hierarchy code is deprecated
and this function will (hopefully) be removed.
Use GetUnifiedPath() for v2 case.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Tested that the EINTR is still being detected:
> $ go1.14 test -c # 1.14 is needed for EINTR to happen
> $ sudo ./fscommon.test
> INFO[0000] interrupted while writing 1063068 to /sys/fs/cgroup/memory/test-eint-89293785/memory.limit_in_bytes
> PASS
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When performing checkpoint or restore of cgroupv2 unified hierarchy,
there is no need to call getCgroupMounts() / cgroups.GetCgroupMounts()
as there's only a single mount in there.
This eliminates the last internal (i.e. runc) use case of
cgroups.GetCgroupMounts() for v2 unified. Unfortunately, there
are external ones (e.g. moby/moby) so we can't yet let it
return an error.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It turns out that ioutil.Readfile wraps the error in a *os.PathError.
Since we cannot guarantee compilation with golang >= v1.13, we are
manually unwrapping the error.
Signed-off-by: Kieron Browne <kbrowne@pivotal.io>
The Travis tests running on Fedora 31 with cgroup2 on Vagrant had the
CRIU parts disabled because of a couple of problems.
One problem was a bug in runc and CRIU handling that Andrei fixed.
In addition four patches from the upcoming CRIU 3.14 are needed for
minimal cgroup2 support (freezer and mounting of cgroup2). With Andrei's
fix and the CRIU cgroup2 support and the runc CRIU cgroup2 integration
it is now possible the checkpoint integration tests again on the Fedora
Vagrant cgroup2 based integration test.
To run CRIU based tests the modules of Fedora 31 (the test host system)
are mounted inside of the container used to test runc in the buster
based container with -v /lib/modules:/lib/modules.
Signed-off-by: Adrian Reber <areber@redhat.com>
The newest CRIU version supports freezer v2 and this tells runc
to use it if new enough or fall back to non-freezer based process
freezing on cgroup v2 system.
Signed-off-by: Adrian Reber <areber@redhat.com>
The line we are parsing looks like this
> flags : fpu vme de pse <...>
so look for "flags" as a prefix, not substring.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The Err() method should be called after the Scan() loop, not inside it.
Found by
git grep -A3 -F '.Scan()' | grep Err
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Remove joinCgroupsV2() function, as its name and second parameter
are misleading. Use createCgroupsv2Path() directly, do not call
getv2Path() twice.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Function getSubsystemPath(), while works for v2 unified case, is
suboptimal, as it does a few unnecessary calls.
Add a simplified version of getSubsystemPath(), called getv2Path(),
and use it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This code is a copy-paste from cgroupv1 systemd code. Its aim
is to check whether a subsystem is available, and skip those
that are not.
In case v2 unified hierarchy is used, getSubsystemPath never
returns "not found" error, so calling it is useless.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Cgroup v1 kernel doc [1] says:
> We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.
and cgroup v2 kernel documentation [2] says:
> - If a controller implements an absolute resource guarantee and/or
> limit, the interface files should be named "min" and "max"
> respectively. If a controller implements best effort resource
> guarantee and/or limit, the interface files should be named "low"
> and "high" respectively.
>
> In the above four control files, the special token "max" should be
> used to represent upward infinity for both reading and writing.
Allow -1 value to still be used for v2, converting it to "max"
where it makes sense to do so.
This fixes the following issue:
> runc update test_update --memory-swap -1:
> error while setting cgroup v2: [write /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/memory.swap.max: invalid argument
> failed to write "-1" to "/sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/memory.swap.max"
> github.com/opencontainers/runc/libcontainer/cgroups/fscommon.WriteFile
> /home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/fscommon/fscommon.go:21
> github.com/opencontainers/runc/libcontainer/cgroups/fs2.setMemory
> /home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/fs2/memory.go:20
> github.com/opencontainers/runc/libcontainer/cgroups/fs2.(*manager).Set
> /home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/fs2/fs2.go:175
> github.com/opencontainers/runc/libcontainer/cgroups/systemd.(*UnifiedManager).Set
> /home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/systemd/unified_hierarchy.go:290
> github.com/opencontainers/runc/libcontainer.(*linuxContainer).Set
> /home/kir/go/src/github.com/opencontainers/runc/libcontainer/container_linux.go:211
[1] linux/Documentation/admin-guide/cgroup-v1/memory.rst
[2] linux/Documentation/admin-guide/cgroup-v2.rst
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
To the best of my knowledge, it has been decided to drop the kernel
memory controller from the cgroupv2 hierarchy, so "kernel memory limits"
do not exist if we're using v2 unified.
So, we need to ignore kernel memory setting. This was already done in
non-systemd case (see commit 88e8350de), let's do the same for systemd.
This fixes the following error:
> container_linux.go:349: starting container process caused "process_linux.go:306: applying cgroup configuration for process caused \"open /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/tasks: no such file or directory\""
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. rootfs is already validated to be kosher by (*ConfigValidator).rootfs()
2. mount points from /proc/self/mountinfo are absolute and clean, too
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Return earlier if there is an error.
2. Do not use filepath.Split on every entry, use info.Name() instead.
3. Make readProcsFile() accept file name as an argument, to avoid
unnecessary file name and directory splitting and merging.
4. Skip on info.IsDir() -- this avoids an error when cgroup name is
set to "cgroup.procs".
This is still not very good since filepath.Walk() performs an unnecessary
stat(2) on every entry, but better than before.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
fmt.Sprintf is slow and is not needed here, string concatenation would
be sufficient. It is also redundant to convert []byte from string and
back, since `bytes` package now provides the same functions as `strings`.
Use Fields() instead of TrimSpace() and Split(), mainly for readability
(note Fields() is somewhat slower than Split() but here it doesn't
matter much).
Use Join() to prepend the plus signs.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Golang 1.14 introduces asynchronous preemption which results into
applications getting frequent EINTR (syscall interrupted) errors when
invoking slow syscalls, e.g. when writing to cgroup files.
As writing to cgroups is idempotent, it is safe to retry writing to the
file whenever the write syscall is interrupted.
Signed-off-by: Mario Nitchev <marionitchev@gmail.com>
* TestConvertCPUSharesToCgroupV2Value(0) was returning 70369281052672, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(0) was returning 32, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(1000) was returning 4, while the correct value is 10000
Fix#2244
Follow-up to #2212#2213
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
linuxContainer.Signal() can race with another call to say Destroy()
which clears the container's initProcess. This can cause a nil pointer
dereference in Signal().
This patch will synchronize Signal() and Destroy() by grabbing the
container's mutex as part of the Signal() call.
Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.
This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUsec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.
This turned out a bit more tricky to implement when I was
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).
This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.
Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.
Example usage: add the following to runtime spec (config.json):
```
"annotations": {
"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
"org.systemd.property.CollectMode":"'inactive-or-failed'"
},
```
and start the container (e.g. `runc --systemd-cgroup run $ID`).
The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".
The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.
In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.
NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.
[1] https://developer.gnome.org/glib/stable/gvariant-text.html
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Adrian reported that the checkpoint test stated failing:
=== RUN TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
checkpoint_test.go:297: Did not restore the pipe correctly:
The problem here is when we start exec.Cmd, we don't call its wait
method. This means that we don't wait cmd.goroutines ans so we don't
know when all data will be read from process pipes.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).
This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypased by someone having "/" be a volume controlled
by another container.
Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
A new method was added to the cgroup interface when #2177 was merged.
After #2177 got merged, #2169 was merged without rebase (sorry!) and compilation was failing:
libcontainer/cgroups/fs2/fs2.go:208:22: container.Cgroup undefined (type *configs.Config has no field or method Cgroup)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add method to get cgroup configuration from cgroup Manager to allow API users
save it to disk and restore a cgroup manager later.
fixes#2176
Signed-off-by: Julio Montes <julio.montes@intel.com>
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using directly libcontainer API.
Signed-off-by: Julio Montes <julio.montes@intel.com>
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.
Inspired by containerd/cgroups#109
Fix#2157
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The libcontainer network statistics are unreachable without manually
creating a libcontainer instance. To retrieve them via the CLI interface
of runc, we now expose them as well.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>