352 Commits

Author SHA1 Message Date
Aleksa Sarai 24388be71e
configs: use different types for .Devices and .Resources.Devices
Making them the same type is simply confusing, but also means that you
could accidentally use one in the wrong context. This eliminates that
problem. This also includes a whole bunch of cleanups for the types
within DeviceRule, so that they can be used more ergonomically.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai 60e21ec26e
specconv: remove default /dev/console access
/dev/console is a host resource which gives a bunch of permissions that
we really shouldn't be giving to containers, not to mention that
/dev/console in containers is actually /dev/pts/$n. Drop this since
arguably this is a fairly scary thing to allow...

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Aleksa Sarai b2bec9806f
cgroup: devices: eradicate the Allow/Deny lists
These lists have been in the codebase for a very long time, and have
been unused for a large portion of that time -- specconv doesn't
generate them and the only user of these flags has been tests (which
doesn't inspire much confidence).

In addition, we had an incorrect implementation of a white-list policy.
This wasn't exploitable because all of our users explicitly specify
"deny all" as the first rule, but it was a pretty glaring issue that
came from the "feature" that users can select whether they prefer a
white- or black- list. Fix this by always writing a deny-all rule (which
is what our users were doing anyway, to work around this bug).

This is one of many changes needed to clean up the devices cgroup code.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
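
The "always write a deny-all rule first" behaviour described in b2bec9806f can be pictured with a minimal sketch. It assumes the cgroup v1 control files `devices.deny` and `devices.allow`; the `write` type and the `deviceWrites` helper are hypothetical and are not taken from runc.

```go
package main

import "fmt"

// write is one pending write to a cgroup v1 devices control file.
type write struct {
	File  string
	Value string
}

// deviceWrites is a hypothetical helper (not runc's API). It always emits a
// deny-all entry first, so the allow rules that follow form a strict
// whitelist regardless of what the caller passed in.
func deviceWrites(allow []string) []write {
	out := []write{{File: "devices.deny", Value: "a"}}
	for _, rule := range allow {
		out = append(out, write{File: "devices.allow", Value: rule})
	}
	return out
}

func main() {
	for _, w := range deviceWrites([]string{"c 1:3 rwm", "c 1:5 rwm"}) {
		fmt.Printf("%s <- %q\n", w.File, w.Value)
	}
}
```

Because the deny-all entry is produced by the helper itself, callers can no longer forget it or choose a black-list policy by accident.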
Aleksa Sarai 859a780d6f
cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Kir Kolyshkin 714c91e9f7 Simplify cgroup path handling in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
   enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 - repeated writing of the PID into the same cgroup for v2.

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
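
A rough sketch of the `Path(subsys string)` idea from 714c91e9f7: callers ask for a subsystem path the same way on v1 and v2, and the v2 side ignores the argument. The `pathGetter` interface and the two manager types below are hypothetical stand-ins written for illustration, not the actual runc types.

```go
package main

import "fmt"

// pathGetter is a hypothetical slice of the manager interface.
type pathGetter interface {
	Path(subsys string) string
}

// v1Manager keeps one path per controller, as cgroup v1 does.
type v1Manager struct{ paths map[string]string }

// Path returns the path of the named controller ("" if unknown).
func (m *v1Manager) Path(subsys string) string { return m.paths[subsys] }

// v2Manager has a single unified path.
type v2Manager struct{ dirPath string }

// Path ignores its argument and returns the unified path.
func (m *v2Manager) Path(_ string) string { return m.dirPath }

func main() {
	managers := []pathGetter{
		&v1Manager{paths: map[string]string{"devices": "/sys/fs/cgroup/devices/demo"}},
		&v2Manager{dirPath: "/sys/fs/cgroup/demo"},
	}
	// The call site no longer needs a v1/v2 branch.
	for _, m := range managers {
		fmt.Println(m.Path("devices"))
	}
}
```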
Kir Kolyshkin 2c8d668eee
Merge pull request #2387 from kolyshkin/g-knot-prepare
cgroup refactoring

LGTMs: AkihiroSuda, mrunalp.
2020-05-08 12:03:22 -07:00
Kir Kolyshkin 1d143562d2 libct/cgroups/fs: access m.paths under lock
1. Prevent theoretical "concurrent map access" error to m.paths.

2. There is no need to call m.Paths -- we can access m.paths directly.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:55 -07:00
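
The locking pattern mentioned in 1d143562d2 might look roughly like the following. This is a sketch only: the `manager` type, its `mu` field, and the `path`/`setPath` methods are assumed names, not the ones used in libcontainer.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in with a mutex-protected paths map.
type manager struct {
	mu    sync.Mutex
	paths map[string]string
}

// path reads m.paths directly, but only while holding the lock.
func (m *manager) path(subsys string) string {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.paths[subsys]
}

// setPath writes to m.paths under the same lock.
func (m *manager) setPath(subsys, p string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.paths == nil {
		m.paths = make(map[string]string)
	}
	m.paths[subsys] = p
}

func main() {
	m := &manager{}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.setPath(fmt.Sprintf("ctrl%d", i), "/sys/fs/cgroup/demo")
			_ = m.path("ctrl0") // concurrent read, safe under the lock
		}(i)
	}
	wg.Wait()
	fmt.Println("entries:", len(m.paths))
}
```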
Kir Kolyshkin 51e1a0842d libct/cgroups/systemd/v1: privatize v1 manager
This patch was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:09:48 -07:00
Kir Kolyshkin d827e323b0 libct/cgroups/systemd/v1: add NewLegacyManager
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:40 -07:00
Kir Kolyshkin fc620fdf81 libct/cgroups/fs: privatize Manager and its fields
This was generated entirely by gorename -- nothing to review here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:07:00 -07:00
Kir Kolyshkin 5935bf8c21 libct/cgroups/fs: introduce NewManager()
...and use it from libcontainer/factory_linux.go.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:05 -07:00
Kir Kolyshkin 24f945e08d libct/cgroups/systemd/v2: return a public interface
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:06:02 -07:00
Alice Frosi b18a9650f8 test: update devicefilter tests
The test cases need to take into account the assembly modifications.
The instruction:
    LdXMemH dst: r2 src: r1 off: 0 imm: 0
has been replaced with:
    LdXMemW dst: r2 src: r1 off: 0 imm: 0
    And32Imm dst: r2 imm: 65535

Signed-off-by: Alice Frosi <afrosi@de.ibm.com>
2020-05-08 07:31:05 +01:00
Alice Frosi 128cb60f58 ebpf: fix big endian issue for s390x
Load the full 32 bits word and take the lower 16 bits, instead of
reading just 16 bits.

Same fix as 07bae05e61

Signed-off-by: Alice Frosi <afrosi@de.ibm.com>
2020-05-08 07:31:05 +01:00
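
To see why 128cb60f58 loads the whole 32-bit word and masks it, the sketch below simulates both loads in plain Go. It assumes a 32-bit field whose low 16 bits hold the value of interest; the sample word `0x00060002` and all helper names are made up for the demonstration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// byteOrder picks the in-memory layout to simulate.
func byteOrder(bigEndian bool) binary.ByteOrder {
	if bigEndian {
		return binary.BigEndian
	}
	return binary.LittleEndian
}

// encode stores word the way a machine of the given endianness would.
func encode(word uint32, bigEndian bool) []byte {
	buf := make([]byte, 4)
	byteOrder(bigEndian).PutUint32(buf, word)
	return buf
}

// loadHalf mimics a 16-bit load at offset 0 (the old behaviour).
func loadHalf(buf []byte, bigEndian bool) uint16 {
	return byteOrder(bigEndian).Uint16(buf[:2])
}

// loadWordAndMask mimics a 32-bit load followed by an AND with 0xFFFF.
func loadWordAndMask(buf []byte, bigEndian bool) uint16 {
	return uint16(byteOrder(bigEndian).Uint32(buf) & 0xFFFF)
}

func main() {
	const word = 0x00060002 // high half 0x0006, low half 0x0002
	for _, be := range []bool{false, true} {
		buf := encode(word, be)
		fmt.Printf("bigEndian=%v half=%#x masked=%#x\n",
			be, loadHalf(buf, be), loadWordAndMask(buf, be))
	}
}
```

On the little-endian layout both loads see the low half; on the big-endian layout the 16-bit load at offset 0 picks up the high half instead, while the masked 32-bit load still yields the low half.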
Akihiro Suda bf15cc99b1 cgroup v2: support rootless systemd
Tested with both Podman (master) and Moby (master), on Ubuntu 19.10.

$ podman --cgroup-manager=systemd run -it --rm --runtime=runc \
  --cgroupns=host --memory 42m --cpus 0.42 --pids-limit 42 alpine
/ # cat /proc/self/cgroup
0::/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/memory.max
44040192
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/cpu.max
42000 100000
/ # cat /sys/fs/cgroup/user.slice/user-1001.slice/user@1001.service/user.slice/libpod-132ff0d72245e6f13a3bbc6cdc5376886897b60ac59eaa8dea1df7ab959cbf1c.scope/pids.max
42

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-08 12:39:20 +09:00
Mrunal Patel a57358e016
Merge pull request #2370 from lifubang/swap0
let runc disable swap in cgroup v2
2020-05-04 16:57:12 -07:00
Sebastiaan van Stijn 402d645c5c
Simplify ticks, as the value is a constant
See for example in the Musl libc source code https://git.musl-libc.org/cgit/musl/tree/src/conf/sysconf.c#n29

This removes the cgo dependency for the system package.

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-04 23:05:46 +02:00
lifubang a70f354680 let runc disable swap in cgroup v2
In cgroup v2, when memory and memorySwap are set to the same value, which is greater than zero,
runc should write zero to `memory.swap.max` to disable swap.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-03 20:57:36 +08:00
Kir Kolyshkin c3b0b13fe9 cgroups/fs2: don't always parse /proc/self/cgroup
Function defaultPath always parses /proc/self/cgroup, but
the resulting value is not always used.

Avoid unnecessary reading/parsing by moving the code
to just before its use.

Modify the test case accordingly.

[v2: test: use UnifiedMountpoint, skip test if not on v2]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-28 22:16:36 -07:00
Kir Kolyshkin 0a4dcc0203
Merge pull request #2331 from lifubang/StartTransientUnit
check that StartTransientUnit/StopUnit succeeds

LGTMs: @AkihiroSuda @kolyshkin 
Closes #2313, #2309
2020-04-28 10:47:52 -07:00
lifubang bfa1b2aab3 check that StartTransientUnit and StopUnit succeeds
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-28 15:46:28 +08:00
Akihiro Suda 60c647e3b8 fs2: fix cgroup.subtree_control EPERM on rootless + add CI
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-27 13:30:15 +09:00
Kir Kolyshkin b19f9cecfe
Merge pull request #2343 from lifubang/updateSystemdScope
fix data inconsistency when using runc update in systemd driven cgroup
2020-04-24 23:34:19 -07:00
lifubang 1d4ccc8e0c fix data inconsistent when runc update in systemd driven cgroup v1
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:57 +08:00
lifubang 7682a2b2a5 fix data inconsistent when runc update in systemd driven cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:07 +08:00
Kir Kolyshkin 75a92ea615 cgroupv2: allow to set EnableAllDevices=true
In this case we just do not install any eBPF rules
checking the devices.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:05:36 -07:00
Mrunal Patel 46be7b612e
Merge pull request #2299 from kolyshkin/fs2-init-ctrl
cgroupv2: fix fs2 driver initialization
2020-04-20 21:27:42 -07:00
Kir Kolyshkin ab276b1c09 cgroups/fs2/Destroy: use Remove, ignore ENOENT
1. There is no need to try removing it recursively.

2. Do not treat ENOENT as an error (similar to fs
   and systemd v1 drivers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
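
The two points of ab276b1c09 (a single non-recursive removal, and ENOENT treated as success) can be sketched with the standard library alone. `removeDir` is a hypothetical helper name; the temporary directory merely stands in for a cgroup directory.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// removeDir is a hypothetical helper: one non-recursive removal that
// treats an already-missing directory as success.
func removeDir(path string) error {
	err := os.Remove(path)
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "cgroup-demo")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("first removal:", removeDir(dir))  // directory exists
	fmt.Println("second removal:", removeDir(dir)) // already gone, ignored
}
```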
Kir Kolyshkin 4b4bc995ad CreateCgroupPath: only enable needed controllers
1. Instead of enabling all available controllers, figure out which
   ones are required, and only enable those.

2. Amend all setFoo() functions to call isFooSet(). While this might
   seem unnecessary, it might actually help to uncover a bug.
   Imagine someone:
    - adds a cgroup.Resources.CpuFoo setting;
    - modifies setCpu() to apply the new setting;
    - but forgets to amend isCpuSet() accordingly <-- BUG

   In this case, a test case modifying CpuFoo will help
   to uncover the BUG. This is the reason why it's added.

This patch *could be* amended by enabling controllers on a best-effort
basis, i.e.:

 - do not return an error early if we can't enable some controllers;
 - if we fail to enable all controllers at once (usually because one
   of them can't be enabled), try enabling them one by one.

Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.

[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
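
One way to picture the `isFooSet()` pattern from 4b4bc995ad is sketched below. The `resources` struct and its fields are a trimmed-down, hypothetical stand-in for the real configuration type, and `subtreeControl` is an assumed helper name; only the general shape (one predicate per controller, then enable just those) comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// resources is a hypothetical, trimmed-down resources config.
type resources struct {
	CpuWeight   uint64
	MemoryLimit int64
	PidsLimit   int64
}

// One predicate per controller; the matching setFoo() would call it too.
func isCpuSet(r *resources) bool    { return r.CpuWeight != 0 }
func isMemorySet(r *resources) bool { return r.MemoryLimit != 0 }
func isPidsSet(r *resources) bool   { return r.PidsLimit != 0 }

// neededControllers lists only the controllers that r configures.
func neededControllers(r *resources) []string {
	var out []string
	if isCpuSet(r) {
		out = append(out, "cpu")
	}
	if isMemorySet(r) {
		out = append(out, "memory")
	}
	if isPidsSet(r) {
		out = append(out, "pids")
	}
	return out
}

// subtreeControl formats the list for a write to cgroup.subtree_control.
func subtreeControl(ctrls []string) string {
	parts := make([]string, 0, len(ctrls))
	for _, c := range ctrls {
		parts = append(parts, "+"+c)
	}
	return strings.Join(parts, " ")
}

func main() {
	r := &resources{MemoryLimit: 1 << 20, PidsLimit: 42}
	fmt.Println(subtreeControl(neededControllers(r)))
}
```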
Kir Kolyshkin bb47e35843 cgroup/systemd: reorganize
1. Rename the files
  - v1.go: cgroupv1 aka legacy;
  - v2.go: cgroupv2 aka unified hierarchy;
  - unsupported.go: when systemd is not available.

2. Move the code that is common between v1 and v2 to common.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin de1134156b cgroups/fs2/CreateCgroupPath: nit
This slightly improves code readability.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin b5c1949f2a cgroups/fs2/CreateCgroupPath: reinstate check
This check was removed in commit 5406833a65. Now that this
function is called from several places, it is no longer obvious
that the path always starts with /sys/fs/cgroup/, so reinstate
the check just to be on the safe side.

This check also ensures that elements[3:] can be used safely.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 813cb3eb94 cgroupv2: fix fs2 cgroup init
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().

Also, cgroup should be created in Apply(), not while creating a new
manager instance.

To fix:

1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
   renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
   systemd drivers.

2. Delay initialization of the available controllers map until it is needed.

With this patch:
 - NewManager() only performs minimal initialization (initializing
   m.dirPath, if not provided);
 - Apply() properly creates cgroup path, enabling the controllers;
 - m.controllers is initialized lazily on demand.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
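
The lazy initialization of `m.controllers` described in 813cb3eb94 could be sketched as follows. The `readControllers` field is an assumption added so the example runs without a cgroup2 mount; the real driver reads from the cgroup filesystem, and its method names may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// manager is a hypothetical stand-in for the fs2 manager.
type manager struct {
	dirPath     string
	controllers map[string]struct{}
	// readControllers is injected for the sketch (an assumption); it
	// returns a space-separated controller list for the given directory.
	readControllers func(dir string) (string, error)
}

// getControllers fills m.controllers on first use and is a no-op afterwards.
func (m *manager) getControllers() error {
	if m.controllers != nil {
		return nil
	}
	data, err := m.readControllers(m.dirPath)
	if err != nil {
		return err
	}
	m.controllers = make(map[string]struct{})
	for _, c := range strings.Fields(data) {
		m.controllers[c] = struct{}{}
	}
	return nil
}

func main() {
	m := &manager{
		dirPath:         "/sys/fs/cgroup/demo",
		readControllers: func(string) (string, error) { return "cpu memory pids", nil },
	}
	fmt.Println("initialized at creation:", m.controllers != nil)
	if err := m.getControllers(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("controllers after first use:", len(m.controllers))
}
```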
Kir Kolyshkin 60eaed2ed6 cgroupv2: move sanity path check to common code
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin dbeff89491 cgroupv2/systemd: privatize UnifiedManager
... and its Cgroup field. There is no sense in keeping it public.

This was generated by gorename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 88c13c0713 cgroupv2: use SecureJoin in systemd driver
It seems that some paths are coming from user and are therefore
untrusted.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:20:22 -07:00
Kir Kolyshkin 9c80cd672d cgroupv2: rm legacy Paths from systemd driver
Having a map of per-subsystem paths in the systemd unified cgroups
driver does not make sense and makes the code less readable.

To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.

End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.

While at it
 - add `rootless` flag which is passed on to fs2 manager
 - merge getv2Path() into GetUnifiedPath(), don't overwrite
   path if it is set during initialization (on Load).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:19:51 -07:00
Kir Kolyshkin 480bca91be cgroups/fs2: move type decl to beginning
It was weird having it somewhere in the middle.

No code change, just moving it around.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 18:43:41 -07:00
Kir Kolyshkin 353e91770b cgroups/fs2: do not use securejoin
In this very case, the code is writing to cgroup2 filesystem,
and the file name is well known and can't possibly be a symlink.
So, using securejoin is redundant.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 18:43:41 -07:00
Kir Kolyshkin 58f970a01f cgroups/fscommon: use errors.Is
This is a forgotten hunk from PR #2291.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin af6b9e7fa9 nit: do not use syscall package
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are identical.

In particular, x/sys/unix defines:

```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr

const ENODEV      = syscall.Errno(0x13)
```

and unix.Exec() calls syscall.Exec().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Akihiro Suda 9f6a2d4ddc
Merge pull request #2305 from kolyshkin/fs2-fix-default
cgroupv2: fix fs2 driver default path
2020-04-16 10:16:48 +09:00
Michael Crosby 5c6216b1ed
Merge pull request #2278 from iwankgb/memory.numa_stats
Exposing memory.numa_stats
2020-04-14 11:32:51 -04:00
Ted Yu 614bb96676 cgroupv2/systemd: Properly remove intermediate directory
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-13 08:32:08 -07:00
Kir Kolyshkin ea36045fe1 cgroupv2: fix fs2 driver default path
When the cgroupv2 fs driver is used without setting cgroupsPath,
it picks up a path from /proc/self/cgroup. On a host with systemd,
such a path can look like (examples from my machines):

 - /user.slice/user-1000.slice/session-4.scope
 - /user.slice/user-1000.slice/user@1000.service/gnome-launched-xfce4-terminal.desktop-4260.scope
 - /user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service

This cgroup already contains processes, which prevents enabling
controllers for a sub-cgroup (writing to cgroup.subtree_control fails
with EBUSY or EOPNOTSUPP).

Obviously, a parent cgroup (which does not contain tasks) should be used.

Fixes opencontainers/runc/issues/2298

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-09 10:47:19 -07:00
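
A minimal sketch of the idea in ea36045fe1: find the caller's cgroup v2 entry in `/proc/self/cgroup` and step up to its parent. Both function names are hypothetical, and using the immediate parent via `path.Dir` is an assumption made for the example; the commit itself only says that a parent cgroup without tasks should be used.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"path"
	"strings"
)

// unifiedCgroup extracts the cgroup v2 entry ("0::<path>") from the
// contents of /proc/self/cgroup.
func unifiedCgroup(procSelfCgroup string) (string, error) {
	sc := bufio.NewScanner(strings.NewReader(procSelfCgroup))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "0::") {
			return strings.TrimPrefix(line, "0::"), nil
		}
	}
	return "", errors.New("no cgroup v2 entry found")
}

// parentCgroup returns the parent of the caller's own cgroup.
func parentCgroup(own string) string {
	return path.Dir(own)
}

func main() {
	sample := "0::/user.slice/user-1000.slice/session-4.scope\n"
	own, err := unifiedCgroup(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(parentCgroup(own))
}
```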
Kenta Tada e58a406b77 libcontainer: remove unneeded import
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-09 20:14:39 +09:00
Michael Crosby 9a93b7378c
Merge pull request #2288 from kolyshkin/mem-swap
cgroupv2: fix setting MemorySwap
2020-04-08 14:54:22 -04:00
iwankgb 7fe0a98e79
Exposing memory.numa_stats
Making information on page usage by type and NUMA node available

Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@intel.com>
2020-04-08 17:40:09 +02:00
Kir Kolyshkin 568cd62fa1 cgroupv2: only treat -1 as "max"
Commit 6905b72154 treats all negative values as "max",
citing cgroup v1 compatibility as a reason. In fact, in
cgroup v1 only -1 is treated as "unlimited", and other
negative values usually cause an error.

Treat -1 as "max", pass other negative values as is
(the error will be returned from the kernel).

Fixes: 6905b72154
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-08 04:08:49 -07:00
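
The rule from 568cd62fa1 (only -1 becomes "max", everything else is passed through) fits in a few lines. `limitToString` is a hypothetical name chosen for this sketch and is not necessarily how the driver spells it.

```go
package main

import (
	"fmt"
	"strconv"
)

// limitToString is a hypothetical helper: -1 maps to "max"; every other
// value, including other negatives, is formatted as is so that the kernel
// can reject invalid input itself.
func limitToString(v int64) string {
	if v == -1 {
		return "max"
	}
	return strconv.FormatInt(v, 10)
}

func main() {
	for _, v := range []int64{-1, -2, 4096} {
		fmt.Printf("%d -> %q\n", v, limitToString(v))
	}
}
```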
Kir Kolyshkin c86be8a2c1 cgroupv2: fix setting MemorySwap
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).

Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.

[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-07 20:45:53 -07:00
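
The memory+swap to swap-only conversion described in c86be8a2c1 can be sketched as below. `swapLimit` and its error messages are hypothetical, and the sketch assumes the caller only invokes it when a memory+swap value has actually been provided.

```go
package main

import (
	"errors"
	"fmt"
)

// swapLimit is a hypothetical helper that turns an OCI-style memory+swap
// limit into a cgroup v2 swap-only limit.
func swapLimit(memorySwap, memory int64) (int64, error) {
	switch {
	case memorySwap == -1:
		return -1, nil // unlimited swap
	case memorySwap == 0:
		return 0, errors.New("memory+swap limit is not set")
	case memorySwap < 0:
		return 0, errors.New("invalid negative memory+swap limit")
	case memory <= 0:
		return 0, errors.New("memory limit must be set when memory+swap is set")
	case memorySwap < memory:
		return 0, errors.New("memory+swap limit is smaller than memory limit")
	}
	// cgroup v2 counts swap separately, so subtract the memory part.
	return memorySwap - memory, nil
}

func main() {
	for _, c := range [][2]int64{{300, 100}, {100, 100}, {-1, 100}, {50, 100}} {
		v, err := swapLimit(c[0], c[1])
		fmt.Printf("memorySwap=%d memory=%d -> %d, err=%v\n", c[0], c[1], v, err)
	}
}
```

When both limits are equal the result is zero, which corresponds to the "disable swap" case handled by a70f354680.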