Commit Graph

1563 Commits

Author SHA1 Message Date
lifubang a67dab0ac2 Revert "CreateCgroupPath: only enable needed controllers"
1. Partially revert "CreateCgroupPath: only enable needed controllers"
If we update a resource which did not limited in the beginning,
it will have no effective.
2. Returns err if we use an non enabled controller,
or else the user may feel success, but actually there are no effective.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-21 12:17:46 +08:00
Kir Kolyshkin d57f5bb286 cgroupv1: don't ignore MemorySwap if Memory==-1
Commit 18ebc51b3cc3 "Reset Swap when memory is set to unlimited (-1)"
added handling of the case when a user updates the container limits
to set memory to unlimited (-1) but do not set any other limits.
Apparently, in this case, if swap limit was previously set, kernel fails
to set memory.limit_in_bytes to -1 if memory.memsw.limit_in_bytes is
not set to -1.

What the above commit fails to handle correctly is the request when
Memory is set to -1 and MemorySwap is set to some specific limit N
(where N > 0). In this case, the value of N is silently discarded
and MemorySwap is set to -1 instead.

This is wrong thing to do, as the limit set, even if incorrectly,
should not be ignored.

Fix this by only assigning MemorySwap == -1 in case it was not
explicitly set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 17:23:40 -07:00
Kir Kolyshkin 59897367c4 cgroups/systemd: allow to set -1 as pids.limit
Currently, both systemd cgroup drivers (v1 and v2) only set
"TasksMax" unit property if the value > 0, so there is no
way to update the limit to -1 / unlimited / infinity / max.

Since systemd driver is backed by fs driver, and both fs and fs2
set the limit of -1 properly, it works, but systemd still has
the old value:

 # runc --systemd-cgroup update $CT --pids-limit 42
 # systemctl show runc-$CT.scope | grep TasksMax
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-$CT.scope/pids.max
 42

 # ./runc --systemd-cgroup update $CT --pids-limit -1
 # systemctl show runc-$CT.scope | grep TasksMax=
 TasksMax=42
 # cat /sys/fs/cgroup/system.slice/runc-xx77.scope/pids.max
 max

Fix by changing the condition to allow -1 as a valid value.

NOTE other negative values are still being ignored by systemd drivers
(as it was done before). I am not sure whether this is correct, or
should we return an error.

A test case is added.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:20:04 -07:00
Kir Kolyshkin 06d7c1d261 systemd+cgroupv1: fix updating CPUQuotaPerSecUSec
1. do not allow to set quota without period or period without quota, as we
   won't be able to calculate new value for CPUQuotaPerSecUSec otherwise.

2. do not ignore setting quota to -1 when a period is not set.

3. update the test case accordingly.

Note that systemd value checks will be added in the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:17:18 -07:00
Kir Kolyshkin e4a84bea99 cgroupv2+systemd: set MemoryLow
For some reason, this was not set before.

Test case is added by the next commit.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-20 13:15:29 -07:00
Kir Kolyshkin 716079f95b
Merge pull request #2406 from cyphar/devices-cgroup-header
cgroups: add copyright header to devices.Emulator implementation
2020-05-20 13:01:34 -07:00
Akihiro Suda cd4b71c27a
Merge pull request #2409 from adrianreber/go-criu-4-0-0
Update to latest go-criu
2020-05-21 01:39:09 +09:00
Adrian Reber 944e057025
Update to latest go-criu (4.0.2)
This updates to the latest version of go-criu (4.0.2) which is based on
CRIU 3.14.

As go-criu provides an existing way to query the CRIU binary for its
version this also removes all the code from runc to handle CRIU version
checking and now relies on go-criu.

An important side effect of this change is that this raises the minimum
CRIU version to 3.0.0 as that is the first CRIU version that supports
CRIU version queries via RPC in contrast to parsing the output of
'criu --version'

CRIU 3.0 has been released in April of 2017.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-05-20 13:49:38 +02:00
Giuseppe Scrivano 41aa19662b
libcontainer: honor seccomp errnoRet
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-05-20 09:11:55 +02:00
Akihiro Suda 2fa3c286b5 fix "libcontainer/cgroups/fs/cpuset.go:63:14: undefined: fmt"
The compilation error had ocurred because of a bad rebase during #2401 and #2413

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-19 23:38:20 +09:00
Akihiro Suda f369199ff6
Merge pull request #2413 from JFHwang/2392-spec-check
Add nil check of spec.Process in validateProcessSpec()
2020-05-19 08:11:22 +09:00
Mrunal Patel 53a4649776
Merge pull request #2401 from kolyshkin/fs-cpuset-mountinfo
libct/cgroup: rm GetClosestMountpointAncestor using moby/sys/mountinfo parser
2020-05-18 10:43:55 -07:00
Mrunal Patel 825e91ada6
Merge pull request #2341 from kolyshkin/test-cpt-lazy
runc checkpoint: fix --status-fd to accept fd
2020-05-18 10:43:24 -07:00
John Hwang 7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
lifubang 9ad1beb40f never write empty string to memory.swap.max
Because the empty string means set swap to 0.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-16 06:52:14 +08:00
Aleksa Sarai dc9a7879f9
cgroups: add copyright header to devices.Emulator implementation
I forgot to include this in the original patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-15 11:29:51 +10:00
Akihiro Suda 3f1e886991
Merge pull request #2391 from cyphar/devices-cgroup
cgroup: devices: major cleanups and minimal transition rules
2020-05-14 09:57:06 +09:00
Kir Kolyshkin 2db3240f35 libct/cgroups: rm GetClosestMountpointAncestor
The function GetClosestMountpointAncestor is not very efficient,
does not really belong to cgroup package, and is only used once
(from fs/cpuset.go).

Remove it, replacing with the implementation based on moby/sys/mountinfo
parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-13 17:32:06 -07:00
Kir Kolyshkin f160352682 libct/cgroup: prep to rm GetClosestMountpointAncestor
This function is not very efficient, does not really belong to cgroup
package, and is only used once (from fs/cpuset.go).

Prepare to remove it by replacing with the implementation based on
the parser from github.com/moby/sys/mountinfo parser.

This commit is here to make sure the proposed replacement passes the
unit test.

Funny, but the unit test need to be slightly modified since it
supplies the wrong mountinfo (space as the first character, empty line
at the end).

Validated by

 $ go test -v -run Ance
 === RUN   TestGetClosestMountpointAncestor
 --- PASS: TestGetClosestMountpointAncestor (0.00s)
 PASS
 ok  	github.com/opencontainers/runc/libcontainer/cgroups	0.002s

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-13 16:26:16 -07:00
Kir Kolyshkin 85d4264d8a
Merge pull request #2390 from lifubang/threadedordomain
cgroupv2: don't enable threaded mode by default

LGTMs: AkihiroSuda, cyphar, kolyshkin
2020-05-13 14:30:25 -07:00
Kir Kolyshkin 4b71877f99
Merge pull request #2292 from Creatone/creatone/extend-intelrdt
Add RDT Memory Bandwidth Monitoring (MBM) and Cache Monitoring Technology (CMT) statistics.
2020-05-13 13:33:55 -07:00
Kir Kolyshkin 41855317b6
Merge pull request #2271 from katarzyna-z/kk-cpuacct-usage-all
Add reading of information from cpuacct.usage_all
2020-05-13 13:33:05 -07:00
lifubang fe0669b26d don't enable threaded mode by default
Because in threaded mode, we can't enable the memory controller -- it isn't thread-aware.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-13 16:27:36 +08:00
Aleksa Sarai b810da1490
cgroups: systemd: make use of Device*= properties
It seems we missed that systemd added support for the devices cgroup, as
a result systemd would actually *write an allow-all rule each time you
did 'runc update'* if you used the systemd cgroup driver. This is
obviously ... bad and was a clear security bug. Luckily the commits which
introduced this were never in an actual runc release.

So we simply generate the cgroupv1-style rules (which is what systemd's
DeviceAllow wants) and default to a deny-all ruleset. Unfortunately it
turns out that systemd is susceptible to the same spurrious error
failure that we were, so that problem is out of our hands for systemd
cgroup users.

However, systemd has a similar bug to the one fixed in [1]. It will
happily write a disruptive deny-all rule when it is not necessary.
Unfortunately, we cannot even use devices.Emulator to generate a minimal
set of transition rules because the DBus API is limited (you can only
clear or append to the DeviceAllow= list -- so we are forced to always
clear it). To work around this, we simply freeze the container during
SetUnitProperties.

[1]: afe83489d4 ("cgroupv1: devices: use minimal transition rules with devices.Emulator")

Fixes: 1d4ccc8e0c ("fix data inconsistent when runc update in systemd driven cgroup v1")
Fixes: 7682a2b2a5 ("fix data inconsistent when runc update in systemd driven cgroup v2")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:43:56 +10:00
Aleksa Sarai afe83489d4
cgroupv1: devices: use minimal transition rules with devices.Emulator
Now that all of the infrastructure for devices.Emulator is in place, we
can finally implement minimal transition rules for devices cgroups. This
allows for minimal disruption to running containers if a rule update is
requested. Only in very rare circumstances (black-list cgroups and mode
switching) will a clear-all rule be written. As a result, containers
should no longer see spurious errors.

A similar issue affects the cgroupv2 devices setup, but that is a topic
for another time (as the solution is drastically different).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:42:43 +10:00
Aleksa Sarai 2353ffec2b
cgroups: implement a devices cgroupv1 emulator
Okay, this requires a bit of explanation.

The reason for this emulation is to allow us to have seamless updates of
the devices cgroup for running containers. This was triggered by several
users having issues where our initial writing of a deny-all rule (in all
cases) results in spurrious errors.

The obvious solution would be to just remove the deny-all rule, right?
Well, it turns out that runc doesn't actually control the deny-all rule
because all users of runc have explicitly specified their own deny-all
rule for many years. This appears to have been done to work around a bug
in runc (which this series has fixed in [1]) where we would actually act
as a black-list despite this being a violation of the OCI spec.

This means that not adding our own deny-all rule in the case of updates
won't solve the issue. However, it will also not solve the issue in
several other cases (the most notable being where a container is being
switched between default-permission modes).

So in order to handle all of these cases, a way of tracking the relevant
internal cgroup state (given a certain state of "cgroups.list" and a set
of rules to apply) is necessary. That is the purpose of DevicesEmulator.
Reading "devices.list" is quite important because that's the only way we
can tell if it's safe to skip the troublesome deny-all rules without
making potentially-dangerous assumptions about the container.

We also are currently bug-compatible with the devices cgroup (namely,
removing rules that don't exist or having superfluous rules all works as
with the in-kernel implementation). The only exception to this is that
we give an error if a user requests to revoke part of a wildcard
exception, because allowing such configurations could result in security
holes (cgroupv1 silently ignores such rules, meaning in white-list mode
that the access is still permitted).

[1]: b2bec9806f ("cgroup: devices: eradicate the Allow/Deny lists")

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:42:20 +10:00
Aleksa Sarai 24388be71e
configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai 60e21ec26e
specconv: remove default /dev/console access
/dev/console is a host resouce which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai b2bec9806f
cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai 859a780d6f
cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Akihiro Suda 2b9a36ee8c
Merge pull request #2398 from pkagrawal/master
Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig
2020-05-12 15:05:55 +09:00
Kir Kolyshkin ca1d135bd4 runc checkpoint: fix --status-fd to accept fd
1. The command `runc checkpoint --lazy-server --status-fd $FD` actually
accepts a file name as an $FD. Make it accept a file descriptor,
like its name implies and the documentation states.

In addition, since runc itself does not use the result of CRIU status
fd, remove the code which relays it, and pass the FD directly to CRIU.

Note 1: runc should close this file descriptor itself after passing it
to criu, otherwise whoever waits on it might wait forever.

Note 2: due to the way criu swrk consumes the fd (it reopens
/proc/$SENDER_PID/fd/$FD), runc can't close it as soon as criu swrk has
started. There is no good way to know when criu swrk has reopened the
fd, so we assume that as soon as we have received something back, the
fd is already reopened.

2. Since the meaning of --status-fd has changed, the test case using
it needs to be fixed as well.

Modify the lazy migration test to remove "sleep 2", actually waiting
for the the lazy page server to be ready.

While at it,

 - remove the double fork (using shell's background process is
   sufficient here);

 - check the exit code for "runc checkpoint" and "criu lazy-pages";

 - remove the check for no errors in dump.log after restore, as we
   are already checking its exit code.

[v2: properly close status fd after spawning criu]
[v3: move close status fd to after the first read]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-11 15:36:50 -07:00
Pradyumna Agrawal 4aa9101477 Honor spec.Process.NoNewPrivileges in specconv.CreateLibcontainerConfig
The change ensures that the passed in value of NoNewPrivileges under spec.Process
is reflected in the container config generated by specconv.CreateLibcontainerConfig

Closes #2397

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-05-11 13:38:14 -07:00
Kir Kolyshkin 714c91e9f7 Simplify cgroup path handing in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
    enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 -  - repeated writing of the PID into the same cgroup for v2;

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
Kir Kolyshkin 2c8d668eee
Merge pull request #2387 from kolyshkin/g-knot-prepare
cgroup refactoring

LGTMs: AkihiroSuda, mrunalp.
2020-05-08 12:03:22 -07:00
Kir Kolyshkin 1d143562d2 libct/cgroups/fs: access m.paths under lock
1. Prevent theoretical "concurrent map access" error to m.paths.

2. There is no need to call m.Paths -- we can access m.paths directly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:55 -07:00
Kir Kolyshkin 51e1a0842d libct/cgroups/systemd/v1: privatize v1 manager
This patch was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:48 -07:00
Kir Kolyshkin d827e323b0 libct/cgroups/systemd/v1: add NewLegacyManager
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:40 -07:00
Kir Kolyshkin fc620fdf81 libct/cgroups/fs: privatize Manager and its fields
This was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:00 -07:00
Kir Kolyshkin 5935bf8c21 libct/cgroups/fs: introduce NewManager()
...and use it from libcontainer/factory_linux.go.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:05 -07:00
Kir Kolyshkin 24f945e08d libct/cgroups/systemd/v2: return a public interface
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:02 -07:00
Kir Kolyshkin 63854b0ea8 newSetnsProcess: reuse state.CgroupPaths
c.cgroupManager.GetPaths() are called twice here: once in currentState()
and then in newSetnsProcess(). Reuse the result of the first call, which
is stored into state.CgroupPaths.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:59 -07:00
Kir Kolyshkin 9a3e632625 notify: simplify usage
Instead of passing the whole map of paths, pass the path to the memory
controller which these functions actually require.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:58 -07:00
Alice Frosi b18a9650f8 test: update devicefilter tests
The test cases need to take into account the assembly modifications.
The instruction:
	LdXMemH dst: r2 src: r1 off: 0 imm: 0
has been replaced with:
        LdXMemW dst: r2 src: r1 off: 0 imm: 0
        And32Imm dst: r2 imm: 65535

Signed-off-by: Alice Frosi <afrosi@de.ibm.com>
2020-05-08 07:31:05 +01:00
Alice Frosi 128cb60f58 ebpf: fix big endian issue for s390x
Load the full 32 bits word and take the lower 16 bits, instead of
reading just 16 bits.

Same fix as 07bae05e61

Signed-off-by: Alice Frosi <afrosi@de.ibm.com>
2020-05-08 07:31:05 +01:00
Akihiro Suda bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master), on Ubuntu 19.10 .

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
Akihiro Suda 492cfd8bf9
Merge pull request #2352 from lifubang/eventsv2
fix runc events error in cgroup v2
2020-05-08 12:51:05 +09:00
lifubang 657407ff23 fix runc events error in cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-07 22:18:46 +08:00
Sebastiaan van Stijn b48bbdd08d
vendor: opencontainers/selinux v1.5.1, update deprecated uses
full diff: https://github.com/opencontainers/selinux/v1.4.0...v1.5.1

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-05 15:53:40 +02:00
Katarzyna Kujawa 407e9f9d0d Add reading of information from cpuacct.usage_all
Remove logrus logs from tests

Signed-off-by: Katarzyna Kujawa <katarzyna.kujawa@intel.com>
2020-05-05 08:51:12 +02:00
Mrunal Patel a57358e016
Merge pull request #2370 from lifubang/swap0
let runc disable swap in cgroup v2
2020-05-04 16:57:12 -07:00
Sebastiaan van Stijn 402d645c5c
Simplify ticks, as the value is a constant
See for example in the Musl libc source code https://git.musl-libc.org/cgit/musl/tree/src/conf/sysconf.c#n29

This removes the cgo dependency for the system package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-04 23:05:46 +02:00
Sebastiaan van Stijn 9df0b5e268
libcontainer: RunningInUserNS() use sync.Once
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-04 15:53:33 +02:00
Mrunal Patel 6161d255b6
Merge pull request #2375 from tedyu/wait-lazy-close
Close fd in case fd.Write() returns error
2020-05-03 09:03:40 -07:00
lifubang a70f354680 let runc disable swap in cgroup v2
In cgroup v2, when memory and memorySwap set to the same value which is greater than zero,
runc should write zero in `memory.swap.max` to disable swap.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-03 20:57:36 +08:00
Ted Yu db29dce076 Close fd in case fd.Write() returns error
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-02 20:06:08 -07:00
Sebastiaan van Stijn 64ca54816c
libcontainer: simplify error message
The error message was including both the rootfs path, and the full
mount path, which also includes the path of the rootfs.

This patch removes the rootfs path from the error message, as it
was redundant, and made the error message overly verbose

Before this patch (errors wrapped for readability):

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged"
at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

With this patch applied:

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-03 02:59:46 +02:00
Sebastiaan van Stijn 2adfd20ac9
libcontainer: don't double-quote errors
genericError.Error() was formatting the underlying error using `%q`; as a
result, quotes in underlying errors were escaped multiple times, which
caused the output to become hard to read, for example (wrapped for readability):

```
container_linux.go:345: starting container process caused "process_linux.go:430:
container init caused \"rootfs_linux.go:58: mounting \\\"/foo.txt\\\"
to rootfs \\\"/var/lib/docker/overlay2/f49a0ae0ec6646c818dcf05dbcbbdd79fc7c42561f3684fbb1fc5d2b9d3ad192/merged\\\"
at \\\"/var/lib/docker/overlay2/f49a0ae0ec6646c818dcf05dbcbbdd79fc7c42561f3684fbb1fc5d2b9d3ad192/merged/usr/share/nginx/html\\\"
caused \\\"not a directory\\\"\"": unknown
```

With this patch applied:

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged"
at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-03 02:55:15 +02:00
Kir Kolyshkin c3b0b13fe9 cgroups/fs2: don't always parse /proc/self/cgroup
Function defaultPath always parses /proc/self/cgroup, but
the resulting value is not always used.

Avoid unnecessary reading/parsing by moving the code
to just before its use.

Modify the test case accordingly.

[v2: test: use UnifiedMountpoint, skip test if not on v2]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-28 22:16:36 -07:00
Kir Kolyshkin 0a4dcc0203
Merge pull request #2331 from lifubang/StartTransientUnit
check that StartTransientUnit/StopUnit succeeds

LGTMs: @AkihiroSuda @kolyshkin 
Closes #2313, #2309
2020-04-28 10:47:52 -07:00
lifubang bfa1b2aab3 check that StartTransientUnit and StopUnit succeeds
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-28 15:46:28 +08:00
Akihiro Suda 60c647e3b8 fs2: fix cgroup.subtree_control EPERM on rootless + add CI
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-27 13:30:15 +09:00
Paweł Szulik 799d94818d intelrdt: Add Cache Monitoring Technology stats
Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>
2020-04-25 09:43:48 +02:00
Kir Kolyshkin b19f9cecfe
Merge pull request #2343 from lifubang/updateSystemdScope
fix data inconsistency when using runc update in systemd driven cgroup
2020-04-24 23:34:19 -07:00
Akihiro Suda 0fd8d468ea
Merge pull request #2318 from lifubang/linuxResources
cgroupv2: use default allowed devices when linux resources is null
2020-04-25 09:00:23 +09:00
Mrunal Patel 634e51b52c
Merge pull request #2335 from kolyshkin/cgroupv2-cpt
Fix cgroupv2 checkpoint/restore
2020-04-24 08:47:36 -07:00
Akihiro Suda 49ca1fd074
Merge pull request #2347 from kolyshkin/v2-allow-all-devs
cgroupv2: allow to set EnableAllDevices=true
2020-04-24 16:09:40 +09:00
Mrunal Patel c420a3ec7f
Merge pull request #2324 from kolyshkin/criu-freezer
libcontainer: fix Checkpoint wrt cgroupv2
2020-04-23 19:24:38 -07:00
Kir Kolyshkin 440244268b
Merge pull request #2330 from KentaTada/use-linuxnamespace-const
libcontainer: use consts of Namespace from runtime-spec
2020-04-23 18:58:29 -07:00
Kir Kolyshkin 55d5c99ca7 libct/mountToRootfs: rm useless code
To make a bind mount read-only, it needs to be remounted. This is what
the code removed does, but it is not needed here.

We have to deal with three cases here:

1. cgroup v2 unified mode. In this case the mount is real mount with
   fstype=cgroup2, and there is no need to have a bind mount on top,
   as we pass readonly flag to the mount as is.

2. cgroup v1 + cgroupns (enableCgroupns == true). In this case the
   "mount" is in fact a set of real mounts with fstype=cgroup, and
   they are all performed in mountCgroupV1, with readonly flag
   added if needed.

3. cgroup v1 as is (enableCgroupns == false). In this case
   mountCgroupV1() calls mountToRootfs() again with an argument
   from the list obtained from getCgroupMounts(), i.e. a bind
   mount with the same flags as the original mount has (plus
   unix.MS_BIND | unix.MS_REC), and mountToRootfs() does remounting
   (under the case "bind":).

So, the code which this patch is removing is not needed -- it
essentially does nothing in case 3 above (since the bind mount
is already remounted readonly), and in cases 1 and 2 it
creates an unneeded extra bind mount on top of a real one (or set of
real ones).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-23 16:49:12 -07:00
Kir Kolyshkin 20959b1666 libcontainer/integration/checkpoint_test: simplify
Since commit 9280e3566d it is not longer needed to have `cgroup2'
mount.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-23 15:22:32 -07:00
lifubang 1d4ccc8e0c fix data inconsistent when runc update in systemd driven cgroup v1
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:57 +08:00
lifubang 7682a2b2a5 fix data inconsistent when runc update in systemd driven cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:07 +08:00
Kenta Tada 4474795388 libcontainer: use x/sys/unix instead of the hardcoded value
PR_SET_CHILD_SUBREAPER is defined in x/sys/unix.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-23 10:49:51 +09:00
Kir Kolyshkin 9280e3566d checkpoint/restore: fix cgroupv2 handling
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).

So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.

Without this fix, checkpoint integration tests fail on cgroup v2.

Also, same is true for cgroup v1 + cgroupns.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:26:43 -07:00
Kir Kolyshkin 75a92ea615 cgroupv2: allow to set EnableAllDevices=true
In this case we just do not install any eBPF rules
checking the devices.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:05:36 -07:00
Mrunal Patel 46be7b612e
Merge pull request #2299 from kolyshkin/fs2-init-ctrl
cgroupv2: fix fs2 driver initialization
2020-04-20 21:27:42 -07:00
Kir Kolyshkin ab276b1c09 cgroups/fs2/Destroy: use Remove, ignore ENOENT
1. There is no need to try removing it recursively.

2. Do not treat ENOENT as an error (similar to fs
   and systemd v1 drivers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 4b4bc995ad CreateCgroupPath: only enable needed controllers
1. Instead of enabling all available controllers, figure out which
   ones are required, and only enable those.

2. Amend all setFoo() functions to call isFooSet(). While this might
   seem unnecessary, it might actually help to uncover a bug.
   Imagine someone:
    - adds a cgroup.Resources.CpuFoo setting;
    - modifies setCpu() to apply the new setting;
    - but forgets to amend isCpuSet() accordingly <-- BUG

   In this case, a test case modifying CpuFoo will help
   to uncover the BUG. This is the reason why it's added.

This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :

 - do not return an error early if we can't enable some controllers;
 - if we fail to enable all controllers at once (usually because one
   of them can't be enabled), try enabling them one by one.

Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.

[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin bb47e35843 cgroup/systemd: reorganize
1. Rename the files
  - v1.go: cgroupv1 aka legacy;
  - v2.go: cgroupv2 aka unified hierarchy;
  - unsupported.go: when systemd is not available.

2. Move the code that is common between v1 and v2 to common.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin de1134156b cgroups/fs2/CreateCgroupPath: nit
This slightly improves code readability.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin b5c1949f2a cgroups/fs2/CreateCgroupPath: reinstate check
This check was removed in commit 5406833a65. Now, when this
function is called from a few places, it is no longer obvious
that the path always starts with /sys/fs/cgroup/, so reinstate
the check just to be on the safe side.

This check also ensures that elements[3:] can be used safely.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 813cb3eb94 cgroupv2: fix fs2 cgroup init
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().

Also, cgroup should be created in Apply(), not while creating a new
manager instance.

To fix:

1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
   renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
   systemd drivers.

2. Delay available controllers map initialization to until it is needed.

With this patch:
 - NewManager() only performs minimal initialization (initializin
   m.dirPath, if not provided);
 - Apply() properly creates cgroup path, enabling the controllers;
 - m.controllers is initialized lazily on demand.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 60eaed2ed6 cgroupv2: move sanity path check to common code
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin dbeff89491 cgroupv2/systemd: privatize UnifiedManager
... and its Cgroup field. There is no sense to keep it public.

This was generated by gorename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 88c13c0713 cgroupv2: use SecureJoin in systemd driver
It seems that some paths are coming from user and are therefore
untrusted.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:20:22 -07:00
Kir Kolyshkin 9c80cd672d cgroupv2: rm legacy Paths from systemd driver
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.

To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.

End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.

While at it
 - add `rootless` flag which is passed on to fs2 manager
 - merge getv2Path() into GetUnifiedPath(), don't overwrite
   path if it is set during initialization (on Load).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:19:51 -07:00
Kenta Tada 3de8613327 libcontainer: use consts of Namespace from runtime-spec
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-19 23:21:40 +09:00
Kir Kolyshkin 480bca91be cgroups/fs2: move type decl to beginning
It was weird having it somewhere in the middle.

No code change, just moving it around.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 18:43:41 -07:00
Kir Kolyshkin 353e91770b cgroups/fs2: do not use securejoin
In this very case, the code is writing to cgroup2 filesystem,
and the file name is well known and can't possibly be a symlink.
So, using securejoin is redundant.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 18:43:41 -07:00
Kir Kolyshkin 58f970a01f cgroups/fscommon: use errors.Is
This is a forgotten hunk from PR #2291.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin af6b9e7fa9 nit: do not use syscall package
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are indentical.

In particular, x/sys/unix defines:

```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr

const ENODEV      = syscall.Errno(0x13)
```

and unix.Exec() calls syscall.Exec().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin b3a481eb77 libcontainer: fix Checkpoint wrt cgroupv2
Commit 9a0184b10f meant to enable using cgroup v2 freezer
for criu >= 3.14, but it looks like it is doing something else
instead.

The logic here is:

 - for cgroup v1, set FreezeCgroup, if available
 - for cgroup v2, only set it for criu >= 3.14
 - do not use GetPaths() in case v2 is used
   (this method is obsoleted for v2 and will be removed)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-17 16:17:00 -07:00
lifubang d0f9b9ce42 default join cgroup namespace in runc example
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-17 21:37:50 +08:00
Aleksa Sarai e4981c91b5
merge branch 'pr-2317'
Ted Yu (1):
  Defer netns.Close() after error check

LGTMs: @AkihiroSuda @cyphar
Closes #2317
2020-04-16 23:35:07 +10:00
lifubang d2a9c5da37 using default allowed devices when linux resources is null
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-16 11:40:44 +08:00
Ted Yu 7a978e354a Defer netns.Close() after error check
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-15 18:33:20 -07:00
Akihiro Suda 9f6a2d4ddc
Merge pull request #2305 from kolyshkin/fs2-fix-default
cgroupv2: fix fs2 driver default path
2020-04-16 10:16:48 +09:00
Paweł Szulik d1e4c7b803 intelrdt: add mbm stats
Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>
2020-04-15 13:53:56 +02:00
Michael Crosby 5c6216b1ed
Merge pull request #2278 from iwankgb/memory.numa_stats
Exposing memory.numa_stats
2020-04-14 11:32:51 -04:00