1608 Commits

Author SHA1 Message Date
Ted Yu db29dce076 Close fd in case fd.Write() returns error
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-02 20:06:08 -07:00
Sebastiaan van Stijn 64ca54816c
libcontainer: simplify error message
The error message was including both the rootfs path, and the full
mount path, which also includes the path of the rootfs.

This patch removes the rootfs path from the error message, as it
was redundant, and made the error message overly verbose

Before this patch (errors wrapped for readability):

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged"
at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

With this patch applied:

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-03 02:59:46 +02:00
Sebastiaan van Stijn 2adfd20ac9
libcontainer: don't double-quote errors
genericError.Error() was formatting the underlying error using `%q`; as a
result, quotes in underlying errors were escaped multiple times, which
caused the output to become hard to read, for example (wrapped for readability):

```
container_linux.go:345: starting container process caused "process_linux.go:430:
container init caused \"rootfs_linux.go:58: mounting \\\"/foo.txt\\\"
to rootfs \\\"/var/lib/docker/overlay2/f49a0ae0ec6646c818dcf05dbcbbdd79fc7c42561f3684fbb1fc5d2b9d3ad192/merged\\\"
at \\\"/var/lib/docker/overlay2/f49a0ae0ec6646c818dcf05dbcbbdd79fc7c42561f3684fbb1fc5d2b9d3ad192/merged/usr/share/nginx/html\\\"
caused \\\"not a directory\\\"\"": unknown
```

With this patch applied:

```
container_linux.go:348: starting container process caused: process_linux.go:438:
container init caused: rootfs_linux.go:58: mounting "/foo.txt"
to rootfs "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged"
at "/var/lib/docker/overlay2/de506d67da606b807009e23b548fec60d72359c77eec88785d8c7ecd54a6e4b2/merged/usr/share/nginx/html"
caused: not a directory: unknown
```

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2020-05-03 02:55:15 +02:00
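The escaping described in the commit above can be reproduced with a minimal sketch. The helper names `wrapQuoted` and `wrapPlain` and the error text are made up for illustration and are not runc's actual functions; the sketch only assumes that each wrapping layer formats its cause with `%q` (old behavior) or `%v` (new behavior).

```go
package main

import (
	"errors"
	"fmt"
)

// base stands in for the innermost error; the message is illustrative only.
var base = errors.New(`mounting "/foo.txt"`)

// wrapQuoted (hypothetical) formats the cause with %q, so every layer
// re-escapes the quotes and backslashes added by the layer below it.
func wrapQuoted(msg string, cause error) error {
	return fmt.Errorf("%s caused %q", msg, cause.Error())
}

// wrapPlain (hypothetical) formats the cause with %v, so the inner
// message is embedded unchanged.
func wrapPlain(msg string, cause error) error {
	return fmt.Errorf("%s caused: %v", msg, cause)
}

func main() {
	// Two layers of %q already produce \\\" sequences.
	fmt.Println(wrapQuoted("start", wrapQuoted("init", base)))
	// With %v the quotes survive as written.
	fmt.Println(wrapPlain("start", wrapPlain("init", base)))
}
```

Each extra `%q` layer escapes the backslashes of the previous one, which is why the number of backslashes grows with nesting depth.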
Kir Kolyshkin c3b0b13fe9 cgroups/fs2: don't always parse /proc/self/cgroup
Function defaultPath always parses /proc/self/cgroup, but
the resulting value is not always used.

Avoid unnecessary reading/parsing by moving the code
to just before its use.

Modify the test case accordingly.

[v2: test: use UnifiedMountpoint, skip test if not on v2]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-28 22:16:36 -07:00
Kir Kolyshkin 0a4dcc0203
Merge pull request #2331 from lifubang/StartTransientUnit
check that StartTransientUnit/StopUnit succeeds

LGTMs: @AkihiroSuda @kolyshkin 
Closes #2313, #2309
2020-04-28 10:47:52 -07:00
lifubang bfa1b2aab3 check that StartTransientUnit and StopUnit succeeds
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-28 15:46:28 +08:00
Akihiro Suda 60c647e3b8 fs2: fix cgroup.subtree_control EPERM on rootless + add CI
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-27 13:30:15 +09:00
Paweł Szulik 799d94818d intelrdt: Add Cache Monitoring Technology stats
Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>
2020-04-25 09:43:48 +02:00
Kir Kolyshkin b19f9cecfe
Merge pull request #2343 from lifubang/updateSystemdScope
fix data inconsistency when using runc update in systemd driven cgroup
2020-04-24 23:34:19 -07:00
Akihiro Suda 0fd8d468ea
Merge pull request #2318 from lifubang/linuxResources
cgroupv2: use default allowed devices when linux resources is null
2020-04-25 09:00:23 +09:00
Mrunal Patel 634e51b52c
Merge pull request #2335 from kolyshkin/cgroupv2-cpt
Fix cgroupv2 checkpoint/restore
2020-04-24 08:47:36 -07:00
Akihiro Suda 49ca1fd074
Merge pull request #2347 from kolyshkin/v2-allow-all-devs
cgroupv2: allow to set EnableAllDevices=true
2020-04-24 16:09:40 +09:00
Mrunal Patel c420a3ec7f
Merge pull request #2324 from kolyshkin/criu-freezer
libcontainer: fix Checkpoint wrt cgroupv2
2020-04-23 19:24:38 -07:00
Kir Kolyshkin 440244268b
Merge pull request #2330 from KentaTada/use-linuxnamespace-const
libcontainer: use consts of Namespace from runtime-spec
2020-04-23 18:58:29 -07:00
Kir Kolyshkin 55d5c99ca7 libct/mountToRootfs: rm useless code
To make a bind mount read-only, it needs to be remounted. This is what
the code removed does, but it is not needed here.

We have to deal with three cases here:

1. cgroup v2 unified mode. In this case the mount is real mount with
   fstype=cgroup2, and there is no need to have a bind mount on top,
   as we pass readonly flag to the mount as is.

2. cgroup v1 + cgroupns (enableCgroupns == true). In this case the
   "mount" is in fact a set of real mounts with fstype=cgroup, and
   they are all performed in mountCgroupV1, with readonly flag
   added if needed.

3. cgroup v1 as is (enableCgroupns == false). In this case
   mountCgroupV1() calls mountToRootfs() again with an argument
   from the list obtained from getCgroupMounts(), i.e. a bind
   mount with the same flags as the original mount has (plus
   unix.MS_BIND | unix.MS_REC), and mountToRootfs() does remounting
   (under the case "bind":).

So, the code which this patch is removing is not needed -- it
essentially does nothing in case 3 above (since the bind mount
is already remounted readonly), and in cases 1 and 2 it
creates an unneeded extra bind mount on top of a real one (or set of
real ones).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-23 16:49:12 -07:00
Kir Kolyshkin 20959b1666 libcontainer/integration/checkpoint_test: simplify
Since commit 9280e3566d it is no longer necessary to have a `cgroup2`
mount.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-23 15:22:32 -07:00
lifubang 1d4ccc8e0c fix data inconsistent when runc update in systemd driven cgroup v1
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:57 +08:00
lifubang 7682a2b2a5 fix data inconsistent when runc update in systemd driven cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-23 19:32:07 +08:00
Kenta Tada 4474795388 libcontainer: use x/sys/unix instead of the hardcoded value
PR_SET_CHILD_SUBREAPER is defined in x/sys/unix.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-23 10:49:51 +09:00
Kir Kolyshkin 9280e3566d checkpoint/restore: fix cgroupv2 handling
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).

So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.

Without this fix, checkpoint integration tests fail on cgroup v2.

Also, same is true for cgroup v1 + cgroupns.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:26:43 -07:00
Kir Kolyshkin 75a92ea615 cgroupv2: allow to set EnableAllDevices=true
In this case we just do not install any eBPF rules
checking the devices.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:05:36 -07:00
Mrunal Patel 46be7b612e
Merge pull request #2299 from kolyshkin/fs2-init-ctrl
cgroupv2: fix fs2 driver initialization
2020-04-20 21:27:42 -07:00
Kir Kolyshkin ab276b1c09 cgroups/fs2/Destroy: use Remove, ignore ENOENT
1. There is no need to try removing it recursively.

2. Do not treat ENOENT as an error (similar to fs
   and systemd v1 drivers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
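The two points of the commit above (a plain, non-recursive remove, and tolerating a missing directory) can be sketched as follows. `removeIfExists` and `demo` are hypothetical names, and the temporary directory merely stands in for a cgroup directory; this is not the fs2 driver's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// removeIfExists (hypothetical) removes a single directory without
// recursion and treats "does not exist" as success.
func removeIfExists(path string) error {
	err := os.Remove(path)
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return err
	}
	return nil
}

// demo removes a temporary directory twice; the second call hits
// ENOENT, which is ignored.
func demo() error {
	dir, err := os.MkdirTemp("", "cgroup-demo-")
	if err != nil {
		return err
	}
	if err := removeIfExists(dir); err != nil {
		return err
	}
	return removeIfExists(dir)
}

func main() {
	fmt.Println(demo())
}
```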
Kir Kolyshkin 4b4bc995ad CreateCgroupPath: only enable needed controllers
1. Instead of enabling all available controllers, figure out which
   ones are required, and only enable those.

2. Amend all setFoo() functions to call isFooSet(). While this might
   seem unnecessary, it might actually help to uncover a bug.
   Imagine someone:
    - adds a cgroup.Resources.CpuFoo setting;
    - modifies setCpu() to apply the new setting;
    - but forgets to amend isCpuSet() accordingly <-- BUG

   In this case, a test case modifying CpuFoo will help
   to uncover the BUG. This is the reason why it's added.

This patch *could be* amended by enabling controllers on a best-effort
basis, i.e. :

 - do not return an error early if we can't enable some controllers;
 - if we fail to enable all controllers at once (usually because one
   of them can't be enabled), try enabling them one by one.

Currently this is not implemented, and it's not clear whether this
would be a good way to go or not.

[v2: add/use is${Controller}Set() functions]
[v3: document neededControllers()]
[v4: drop "best-effort" part]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
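One way to picture the `isFooSet()` idea from the commit above is the sketch below. The `Resources` fields and the body of `neededControllers` are simplified assumptions made up for this example; they are not the actual runc definitions.

```go
package main

import "fmt"

// Resources is a cut-down, hypothetical stand-in for the real
// cgroup resources struct.
type Resources struct {
	CpuWeight uint64
	Memory    int64
	PidsLimit int64
}

// Each isFooSet reports whether any setting handled by the matching
// setFoo() is present. A new field that setFoo() applies must also
// be listed here, or the controller will not get enabled.
func isCpuSet(r *Resources) bool    { return r.CpuWeight != 0 }
func isMemorySet(r *Resources) bool { return r.Memory != 0 }
func isPidsSet(r *Resources) bool   { return r.PidsLimit != 0 }

// neededControllers lists only the controllers the resources require.
func neededControllers(r *Resources) []string {
	var out []string
	if isCpuSet(r) {
		out = append(out, "cpu")
	}
	if isMemorySet(r) {
		out = append(out, "memory")
	}
	if isPidsSet(r) {
		out = append(out, "pids")
	}
	return out
}

func main() {
	fmt.Println(neededControllers(&Resources{Memory: 1 << 20}))
}
```

If a test sets a field that `isCpuSet()` forgets about, the `cpu` controller is never enabled and writing the setting fails, which exposes the omission.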
Kir Kolyshkin bb47e35843 cgroup/systemd: reorganize
1. Rename the files
  - v1.go: cgroupv1 aka legacy;
  - v2.go: cgroupv2 aka unified hierarchy;
  - unsupported.go: when systemd is not available.

2. Move the code that is common between v1 and v2 to common.go

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin de1134156b cgroups/fs2/CreateCgroupPath: nit
This slightly improves code readability.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin b5c1949f2a cgroups/fs2/CreateCgroupPath: reinstate check
This check was removed in commit 5406833a65. Now, when this
function is called from a few places, it is no longer obvious
that the path always starts with /sys/fs/cgroup/, so reinstate
the check just to be on the safe side.

This check also ensures that elements[3:] can be used safely.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
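Why the prefix check makes `elements[3:]` safe can be shown with a short sketch. The function name `cgroupElements` and the error text are assumptions for illustration; only the `/sys/fs/cgroup/` prefix and the `elements[3:]` slice come from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

const unifiedMountpoint = "/sys/fs/cgroup"

// cgroupElements (hypothetical) splits a cgroup path after verifying
// that it lives under the unified mountpoint.
func cgroupElements(path string) ([]string, error) {
	if !strings.HasPrefix(path, unifiedMountpoint+"/") {
		return nil, fmt.Errorf("invalid cgroup path %s", path)
	}
	// For a checked path, Split yields at least
	// ["", "sys", "fs", "cgroup", <rest>], so index 3 always exists.
	elements := strings.Split(path, "/")
	return elements[3:], nil
}

func main() {
	fmt.Println(cgroupElements("/sys/fs/cgroup/machine.slice/demo.scope"))
	fmt.Println(cgroupElements("/tmp/not-a-cgroup"))
}
```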
Kir Kolyshkin 813cb3eb94 cgroupv2: fix fs2 cgroup init
fs2 cgroup driver was not working because it did not enable controllers
while creating cgroup directory; instead it was merely doing MkdirAll()
and gathered the list of available controllers in NewManager().

Also, cgroup should be created in Apply(), not while creating a new
manager instance.

To fix:

1. Move the createCgroupsv2Path function from systemd driver to fs2 driver,
   renaming it to CreateCgroupPath. Use in Apply() from both fs2 and
   systemd drivers.

2. Delay available controllers map initialization to until it is needed.

With this patch:
 - NewManager() only performs minimal initialization (initializing
   m.dirPath, if not provided);
 - Apply() properly creates cgroup path, enabling the controllers;
 - m.controllers is initialized lazily on demand.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 60eaed2ed6 cgroupv2: move sanity path check to common code
The fs2 cgroup driver has a sanity check for path.
Since systemd driver is relying on the same path,
it makes sense to move this check to the common code.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin dbeff89491 cgroupv2/systemd: privatize UnifiedManager
... and its Cgroup field. There is no reason to keep it public.

This was generated by gorename.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:27:40 -07:00
Kir Kolyshkin 88c13c0713 cgroupv2: use SecureJoin in systemd driver
It seems that some paths are coming from user and are therefore
untrusted.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:20:22 -07:00
Kir Kolyshkin 9c80cd672d cgroupv2: rm legacy Paths from systemd driver
Having map of per-subsystem paths in systemd unified cgroups
driver does not make sense and makes the code less readable.

To get rid of it, move the systemd v1-or-v2 init code to
libcontainer/factory_linux.go which already has a function
to deduce unified path out of paths map.

End result is much cleaner code. Besides, we no longer write pid
to the same cgroup file 7 times in Apply() like we did before.

While at it
 - add `rootless` flag which is passed on to fs2 manager
 - merge getv2Path() into GetUnifiedPath(), don't overwrite
   path if it is set during initialization (on Load).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-19 16:19:51 -07:00
Kenta Tada 3de8613327 libcontainer: use consts of Namespace from runtime-spec
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-19 23:21:40 +09:00
Kir Kolyshkin 480bca91be cgroups/fs2: move type decl to beginning
It was weird having it somewhere in the middle.

No code change, just moving it around.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 18:43:41 -07:00
Kir Kolyshkin 353e91770b cgroups/fs2: do not use securejoin
In this very case, the code is writing to cgroup2 filesystem,
and the file name is well known and can't possibly be a symlink.
So, using securejoin is redundant.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 18:43:41 -07:00
Kir Kolyshkin 58f970a01f cgroups/fscommon: use errors.Is
This is a forgotten hunk from PR #2291.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin af6b9e7fa9 nit: do not use syscall package
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are identical.

In particular, x/sys/unix defines:

```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr

const ENODEV      = syscall.Errno(0x13)
```

and unix.Exec() calls syscall.Exec().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin b3a481eb77 libcontainer: fix Checkpoint wrt cgroupv2
Commit 9a0184b10f meant to enable using cgroup v2 freezer
for criu >= 3.14, but it looks like it is doing something else
instead.

The logic here is:

 - for cgroup v1, set FreezeCgroup, if available
 - for cgroup v2, only set it for criu >= 3.14
 - do not use GetPaths() in case v2 is used
   (this method is obsoleted for v2 and will be removed)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-17 16:17:00 -07:00
lifubang d0f9b9ce42 default join cgroup namespace in runc example
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-17 21:37:50 +08:00
Aleksa Sarai e4981c91b5
merge branch 'pr-2317'
Ted Yu (1):
  Defer netns.Close() after error check

LGTMs: @AkihiroSuda @cyphar
Closes #2317
2020-04-16 23:35:07 +10:00
lifubang d2a9c5da37 using default allowed devices when linux resources is null
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-04-16 11:40:44 +08:00
Ted Yu 7a978e354a Defer netns.Close() after error check
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-15 18:33:20 -07:00
Akihiro Suda 9f6a2d4ddc
Merge pull request #2305 from kolyshkin/fs2-fix-default
cgroupv2: fix fs2 driver default path
2020-04-16 10:16:48 +09:00
Paweł Szulik d1e4c7b803 intelrdt: add mbm stats
Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>
2020-04-15 13:53:56 +02:00
Michael Crosby 5c6216b1ed
Merge pull request #2278 from iwankgb/memory.numa_stats
Exposing memory.numa_stats
2020-04-14 11:32:51 -04:00
Ted Yu 614bb96676 cgroupv2/systemd: Properly remove intermediate directory
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-13 08:32:08 -07:00
Kir Kolyshkin ea36045fe1 cgroupv2: fix fs2 driver default path
When the cgroupv2 fs driver is used without setting cgroupsPath,
it picks up a path from /proc/self/cgroup. On a host with systemd,
such a path can look like (examples from my machines):

 - /user.slice/user-1000.slice/session-4.scope
 - /user.slice/user-1000.slice/user@1000.service/gnome-launched-xfce4-terminal.desktop-4260.scope
 - /user.slice/user-1000.slice/user@1000.service/gnome-terminal-server.service

This cgroup already contains processes, which prevents enabling
controllers for a sub-cgroup (writing to cgroup.subtree_control fails
with EBUSY or EOPNOTSUPP).

Obviously, a parent cgroup (which does not contain tasks) should be used.

Fixes opencontainers/runc/issues/2298

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-09 10:47:19 -07:00
Kenta Tada e58a406b77 libcontainer: remove unneeded import
Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2020-04-09 20:14:39 +09:00
Paweł Szulik 7fa13b2773 intelrdt: change parseCpuInfoFile to return struct
Signed-off-by: Paweł Szulik <pawel.szulik@intel.com>
2020-04-08 23:03:36 +02:00
Michael Crosby 9a93b7378c
Merge pull request #2288 from kolyshkin/mem-swap
cgroupv2: fix setting MemorySwap
2020-04-08 14:54:22 -04:00
iwankgb 7fe0a98e79
Exposing memory.numa_stats
Making information on page usage by type and NUMA node available

Signed-off-by: Maciej "Iwan" Iwanowski <maciej.iwanowski@intel.com>
2020-04-08 17:40:09 +02:00
Kir Kolyshkin 568cd62fa1 cgroupv2: only treat -1 as "max"
Commit 6905b72154 treats all negative values as "max",
citing cgroup v1 compatibility as a reason. In fact, in
cgroup v1 only -1 is treated as "unlimited", and other
negative values usually cause an error.

Treat -1 as "max", pass other negative values as is
(the error will be returned from the kernel).

Fixes: 6905b72154
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-08 04:08:49 -07:00
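The rule from the commit above (only -1 maps to "max", everything else is passed through) might look roughly like this. `limitToString` is a hypothetical name chosen for this sketch and may differ from the helper runc actually uses.

```go
package main

import (
	"fmt"
	"strconv"
)

// limitToString (hypothetical) converts a numeric limit to the string
// written to a cgroup v2 file. Only -1 becomes "max"; other negative
// values are passed through so that the kernel rejects them.
func limitToString(value int64) string {
	if value == -1 {
		return "max"
	}
	return strconv.FormatInt(value, 10)
}

func main() {
	fmt.Println(limitToString(-1))   // max
	fmt.Println(limitToString(-2))   // -2
	fmt.Println(limitToString(4096)) // 4096
}
```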
Kir Kolyshkin c86be8a2c1 cgroupv2: fix setting MemorySwap
The resources.MemorySwap field from OCI is memory+swap, while cgroupv2
has a separate swap limit, so subtract memory from the limit (and make
sure values are set and sane).

Make sure to set MemorySwapMax for systemd, too. Since systemd does not
have MemorySwapMax for cgroupv1, it is only needed for v2 driver.

[v2: return -1 on any negative value, add unit test]
[v3: treat any negative value other than -1 as error]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-07 20:45:53 -07:00
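The conversion described in the commit above can be sketched as below. The function name `swapOnly`, the exact validation rules, and the error messages are assumptions made for illustration; the commit message only states that memory is subtracted from memory+swap and that values must be set and sane.

```go
package main

import (
	"errors"
	"fmt"
)

// swapOnly (hypothetical) converts an OCI-style memory+swap limit into
// the swap-only limit that cgroup v2 expects.
func swapOnly(memorySwap, memory int64) (int64, error) {
	switch {
	case memorySwap == -1:
		return -1, nil // unlimited swap
	case memorySwap == 0:
		return 0, nil // not set
	case memorySwap < 0:
		return 0, errors.New("invalid memory+swap value")
	case memory <= 0:
		return 0, errors.New("memory+swap is set but memory is not")
	case memorySwap < memory:
		return 0, errors.New("memory+swap is smaller than memory")
	}
	return memorySwap - memory, nil
}

func main() {
	fmt.Println(swapOnly(300, 100)) // 200 <nil>
	fmt.Println(swapOnly(-1, 100))  // -1 <nil>
	fmt.Println(swapOnly(50, 100))  // error: limit below memory
}
```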
Giuseppe Scrivano 8b7ac5f4a5
libcontainer: use cgroups.NewStats
otherwise the memoryStats and hugetlbStats maps are not initialized
and GetStats() segfaults when using them.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2020-04-07 09:45:57 +02:00
Michael Crosby d5e91b1c22
Merge pull request #2289 from AkihiroSuda/fix-TestGetContainerStateAfterUpdate
Fix TestGetContainerStateAfterUpdate on cgroup v2
2020-04-06 17:30:11 -04:00
Mrunal Patel 0c7a9c0267
Merge pull request #2294 from tklauser/unused-consts
Remove unused consts testScopeWait and testSliceWait
2020-04-06 13:26:42 -07:00
Ted Yu 21d7bb95eb Close criuServer so that even if CRIU crashes or unexpectedly exits, runc will not hang
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-03 15:27:27 -07:00
Tobias Klauser 3e678c08f9 Remove unused consts testScopeWait and testSliceWait
These are unused since commit 518c855833 ("Remove libcontainer
detection for systemd features")

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2020-04-03 21:09:43 +02:00
Michael Crosby e4363b0387
Merge pull request #2291 from kolyshkin/errors-unwrap-v2
Use errors.As() and errors.Is() to unwrap errors
2020-04-03 11:46:11 -04:00
Michael Crosby ec8c6950c7
Merge pull request #2235 from Zyqsempai/add-hugetlb-controller-to-cgroupv2
Added HugeTlb controller for cgroupv2
2020-04-03 11:15:06 -04:00
Kir Kolyshkin b2272b2cba libcontainer: use errors.Is() and errors.As()
Make use of errors.Is() and errors.As() where appropriate to check
the underlying error. The biggest motivation is to simplify the code.

The feature requires go 1.13 but since merging #2256 we are already
not supporting go 1.12 (which is an unsupported release anyway).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-02 20:34:01 -07:00
Kir Kolyshkin c39f87a47a Revert "Merge pull request #2280 from kolyshkin/errors-unwrap"
Using errors.Unwrap() is not the best thing to do, since it returns
nil for an error that was not wrapped. Moreover, the errors
package provides more elegant ways to check for underlying
errors, such as errors.As() and errors.Is().

This reverts commit f8e138855d, reversing
changes made to 6ca9d8e6da.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-02 19:41:11 -07:00
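The difference the revert above refers to can be seen in a small sketch. The file path and the helper names `viaUnwrap` and `viaIs` are made up for the example; `errors.Unwrap`, `errors.Is`, and `os.PathError` are standard library.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// wrapped mimics what a failed file read returns: a *os.PathError
// around the errno. plain is the bare errno with no wrapper.
var (
	wrapped error = &os.PathError{Op: "read", Path: "/hypothetical/file", Err: syscall.ENODEV}
	plain   error = syscall.ENODEV
)

// viaUnwrap compares the unwrapped error with ENODEV. For an error
// that was never wrapped, errors.Unwrap returns nil and the check fails.
func viaUnwrap(err error) bool {
	return errors.Unwrap(err) == syscall.ENODEV
}

// viaIs walks the whole chain, starting with err itself.
func viaIs(err error) bool {
	return errors.Is(err, syscall.ENODEV)
}

func main() {
	fmt.Println(viaUnwrap(wrapped), viaUnwrap(plain)) // true false
	fmt.Println(viaIs(wrapped), viaIs(plain))         // true true
}
```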
Akihiro Suda 4540b596b8 Fix TestGetContainerStateAfterUpdate on cgroup v2
CI was failing on cgroup v2 because mockCgroupManager.GetUnifiedPath()
was returning an error.

Now the function returns the value of mockCgroupManager.unifiedPath,
but the value is currently not used in the tests.

Fix #2286

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-04-03 09:12:38 +09:00
Michael Crosby f8e138855d
Merge pull request #2280 from kolyshkin/errors-unwrap
Use errors.Unwrap() where possible
2020-04-02 14:39:06 -04:00
Michael Crosby 6ca9d8e6da
Merge pull request #2283 from tedyu/runc-path-in-prefix
isPathInPrefixList return value should be reverted
2020-04-02 14:09:49 -04:00
Michael Crosby b26e4f27c1
Merge pull request #2284 from tedyu/criu-svr-close
Avoid double close of criuServer
2020-04-02 14:07:35 -04:00
Mrunal Patel e3e26cafe9
Merge pull request #2276 from kolyshkin/criu-v2
cgroupv2: don't use GetCgroupMounts for criu c/r
2020-04-01 17:36:24 -07:00
Ted Yu 49896ab0f4 Avoid double close of criuServer
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-01 16:15:23 -07:00
Ted Yu d02fc48422 isPathInPrefixList return value should be reverted
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-01 15:45:31 -07:00
Kir Kolyshkin 8d7977ee6e libct/isPaused: don't use GetPaths from v2 code
Using GetPaths from cgroupv2 unified hierarchy code is deprecated
and this function will (hopefully) be removed.

Use GetUnifiedPath() for v2 case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:24:28 -07:00
Kir Kolyshkin 12e156f076 libct.isPaused: use errors.Unwrap
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin 272c83e169 libct/cgroups: use errors.Unwrap
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin bd737f1e94 libct/cgroups/fs: use errors.Unwrap
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin d2dfc635ea libct/cgroups/fs2: use errors.Unwrap
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin e4e35b8de8 libct/cgroups/fscommon.WriteFile: use errors.Unwrap
Tested that the EINTR is still being detected:

> $ go1.14 test -c # 1.14 is needed for EINTR to happen
> $ sudo ./fscommon.test
> INFO[0000] interrupted while writing 1063068 to /sys/fs/cgroup/memory/test-eint-89293785/memory.limit_in_bytes
> PASS

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin 66778b3c28 libct/setKernelMemory: use errors.Unwrap
This simplifies code a lot.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin fc840f199f cgroupv2: don't use GetCgroupMounts for criu c/r
When performing checkpoint or restore of cgroupv2 unified hierarchy,
there is no need to call getCgroupMounts() / cgroups.GetCgroupMounts()
as there's only a single mount in there.

This eliminates the last internal (i.e. runc) use case of
cgroups.GetCgroupMounts() for v2 unified. Unfortunately, there
are external ones (e.g. moby/moby) so we can't yet let it
return an error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 17:05:11 -07:00
Michael Crosby 9ec5b03e5a
Merge pull request #2259 from adrianreber/v2-test
Add minimal cgroup2 checkpoint/restore support
2020-03-31 15:01:18 -04:00
Michael Crosby 8221d999f3
Merge pull request #2279 from masters-of-cats/freezer
Actually check for syscall.ENODEV when checking if a container is paused
2020-03-31 14:59:20 -04:00
Yulia Nedyalkova 2abc6a3605 Actually check for syscall.ENODEV when checking if a container is paused
It turns out that ioutil.ReadFile wraps the error in a *os.PathError.
Since we cannot guarantee compilation with golang >= v1.13, we are
manually unwrapping the error.

Signed-off-by: Kieron Browne <kbrowne@pivotal.io>
2020-03-31 15:52:20 +01:00
Adrian Reber 3e99aa3628
Fix checkpoint/restore tests on Fedora 31
The Travis tests running on Fedora 31 with cgroup2 on Vagrant had the
CRIU parts disabled because of a couple of problems.

One problem was a bug in runc and CRIU handling that Andrei fixed.

In addition, four patches from the upcoming CRIU 3.14 are needed for
minimal cgroup2 support (freezer and mounting of cgroup2). With Andrei's
fix, the CRIU cgroup2 support, and the runc CRIU cgroup2 integration,
it is now possible to run the checkpoint integration tests again on the
Fedora Vagrant cgroup2-based integration test.

To run CRIU based tests the modules of Fedora 31 (the test host system)
are mounted inside of the container used to test runc in the buster
based container with -v /lib/modules:/lib/modules.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-03-31 16:36:36 +02:00
Adrian Reber 9a0184b10f
cgroup2: use CRIU's new freezer v2 support
The newest CRIU version supports freezer v2 and this tells runc
to use it if new enough or fall back to non-freezer based process
freezing on cgroup v2 system.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-03-31 16:36:35 +02:00
Mrunal Patel d05e5728aa systemd: Lazy initialize the systemd dbus connection
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Mrunal Patel 33c6125da6 systemd: Export IsSystemdRunning() function
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-03-30 15:24:06 -07:00
Mrunal Patel f1eea9051c
Merge pull request #2275 from kolyshkin/scan-nits
bufio.Scan.Err usage nits
2020-03-27 11:41:06 -07:00
Mrunal Patel 53ad1d5100
Merge pull request #2256 from kolyshkin/mountinfo-alt
Use faster mountinfo parser (part 1)
2020-03-27 11:36:51 -07:00
Mrunal Patel 75ff40cd28
Merge pull request #2273 from kolyshkin/v2-untangle
cgroup v2 cleanups
2020-03-27 11:21:36 -07:00
Kir Kolyshkin aab2c8ba52 libcontainer/intelrdt: optimize parseCpuInfoFile
The line we are parsing looks like this

> flags		: fpu vme de pse <...>

so look for "flags" as a prefix, not substring.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-27 00:41:11 -07:00
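The prefix-versus-substring distinction from the commit above can be sketched like this. `flagsFromLine` is a hypothetical helper, not the actual `parseCpuInfoFile` code, and the second sample line is invented to show a false substring match.

```go
package main

import (
	"fmt"
	"strings"
)

// flagsFromLine (hypothetical) returns the CPU flags if the line is the
// "flags" line of a cpuinfo-style file. Matching on the prefix skips
// lines that merely contain the word "flags" somewhere else.
func flagsFromLine(line string) ([]string, bool) {
	if !strings.HasPrefix(line, "flags") {
		return nil, false
	}
	i := strings.IndexByte(line, ':')
	if i < 0 {
		return nil, false
	}
	// Everything after the colon is a whitespace-separated flag list.
	return strings.Fields(line[i+1:]), true
}

func main() {
	fmt.Println(flagsFromLine("flags\t\t: fpu vme de pse"))
	fmt.Println(flagsFromLine("note\t\t: no flags here"))
}
```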
Kir Kolyshkin 0af5cd2041 Nit: fix use of bufio.Scanner.Err
The Err() method should be called after the Scan() loop, not inside it.

Found by

 git grep -A3 -F '.Scan()' | grep Err

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-27 00:12:17 -07:00
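The pattern named in the commit above, calling `Err()` once after the `Scan()` loop, looks like this in a minimal sketch. The `countLines` and `countLinesString` helpers are made up for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// countLines (hypothetical) counts lines and checks the scanner error
// only after the loop has finished.
func countLines(r io.Reader) (int, error) {
	n := 0
	s := bufio.NewScanner(r)
	for s.Scan() {
		n++
	}
	// Scan() returned false: either end of input or a read error.
	// Err() is nil for a clean end of input.
	if err := s.Err(); err != nil {
		return n, err
	}
	return n, nil
}

// countLinesString is a convenience wrapper for string input.
func countLinesString(text string) (int, error) {
	return countLines(strings.NewReader(text))
}

func main() {
	fmt.Println(countLinesString("a\nb\nc"))
}
```

Inside the loop `Scan()` has just returned true, so a check there cannot observe the error that ends the loop.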
Qiang Huang d4a6a1d998
Merge pull request #2258 from masters-of-cats/eintr-retry
Retry writing to cgroup files on EINTR error
2020-03-27 11:21:41 +08:00
Kir Kolyshkin b45db5d3b2 libcontainer/cgroup: obsolete Get*Cgroup for v2
These functions should not be called from any code handling
the cgroup2 unified hierarchy.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:20:00 -07:00
Kir Kolyshkin a949e4f22f cgroupv2: UnifiedManager.Apply: simplify
Remove joinCgroupsV2() function, as its name and second parameter
are misleading. Use createCgroupsv2Path() directly, do not call
getv2Path() twice.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:20:00 -07:00
Kir Kolyshkin 5406833a65 cgroupv2/systemd: add getv2Path
Function getSubsystemPath(), while it works for the v2 unified case, is
suboptimal, as it does a few unnecessary calls.

Add a simplified version of getSubsystemPath(), called getv2Path(),
and use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 19:17:09 -07:00
Kir Kolyshkin ec1f957b23 cgroupv2: don't use getSubsystemPath in Apply
This code is a copy-paste from cgroupv1 systemd code. Its aim
is to check whether a subsystem is available, and skip those
that are not.

In case v2 unified hierarchy is used, getSubsystemPath never
returns "not found" error, so calling it is useless.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 13:32:34 -07:00
Kir Kolyshkin 6905b72154 cgroupv2: use "max" for negative values
Cgroup v1 kernel doc [1] says:

> We can write "-1" to reset the ``*.limit_in_bytes(unlimited)``.

and cgroup v2 kernel documentation [2] says:

> - If a controller implements an absolute resource guarantee and/or
>  limit, the interface files should be named "min" and "max"
>  respectively.  If a controller implements best effort resource
>  guarantee and/or limit, the interface files should be named "low"
>  and "high" respectively.
>
>  In the above four control files, the special token "max" should be
>  used to represent upward infinity for both reading and writing.

Allow -1 value to still be used for v2, converting it to "max"
where it makes sense to do so.

This fixes the following issue:

> runc update test_update --memory-swap -1:
> error while setting cgroup v2: [write /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/memory.swap.max: invalid argument
> failed to write "-1" to "/sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/memory.swap.max"
> github.com/opencontainers/runc/libcontainer/cgroups/fscommon.WriteFile
> 	/home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/fscommon/fscommon.go:21
> github.com/opencontainers/runc/libcontainer/cgroups/fs2.setMemory
> 	/home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/fs2/memory.go:20
> github.com/opencontainers/runc/libcontainer/cgroups/fs2.(*manager).Set
> 	/home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/fs2/fs2.go:175
> github.com/opencontainers/runc/libcontainer/cgroups/systemd.(*UnifiedManager).Set
> 	/home/kir/go/src/github.com/opencontainers/runc/libcontainer/cgroups/systemd/unified_hierarchy.go:290
> github.com/opencontainers/runc/libcontainer.(*linuxContainer).Set
> 	/home/kir/go/src/github.com/opencontainers/runc/libcontainer/container_linux.go:211

[1] linux/Documentation/admin-guide/cgroup-v1/memory.rst
[2] linux/Documentation/admin-guide/cgroup-v2.rst

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-26 11:14:32 -07:00
Mrunal Patel 96596cbbec
Merge pull request #2270 from kolyshkin/systemd-no-kmem
cgroupv2: don't try to set kmem for systemd case
2020-03-25 21:39:52 -07:00
Kir Kolyshkin a675b5ebea cgroupv2: don't try to set kmem for systemd case
To the best of my knowledge, it has been decided to drop the kernel
memory controller from the cgroupv2 hierarchy, so "kernel memory limits"
do not exist if we're using v2 unified.

So, we need to ignore kernel memory setting. This was already done in
non-systemd case (see commit 88e8350de), let's do the same for systemd.

This fixes the following error:

> container_linux.go:349: starting container process caused "process_linux.go:306: applying cgroup configuration for process caused \"open /sys/fs/cgroup/machine.slice/runc-cgroups-integration-test.scope/tasks: no such file or directory\""

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-25 20:00:23 -07:00
Mrunal Patel be51398a8a
Merge pull request #2193 from milkwine/fix-readSync
fix readSync
2020-03-24 14:29:42 -07:00
Mrunal Patel 7de5db3dad
Merge pull request #2263 from kolyshkin/nits
Assorted minor nits in libcontainer
2020-03-24 14:17:22 -07:00
Akihiro Suda cc183ca662
Merge pull request #2242 from AkihiroSuda/vendor-systemd
vendor: update go-systemd and godbus
2020-03-25 02:40:22 +09:00
Mrunal Patel 3087d43bc8
Merge pull request #1826 from jingxiaolu/fix_specconv_process_nil
specconv: fix null spec.Process making runc panic
2020-03-23 21:07:06 -07:00
Kir Kolyshkin dd7b34618f libct/msMoveRoot: benefit from GetMounts filter
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin fc4357a8b0 libct/msMoveRoot: rm redundant filepath.Abs() calls
1. rootfs is already validated to be kosher by (*ConfigValidator).rootfs()

2. mount points from /proc/self/mountinfo are absolute and clean, too

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin dce0de8975 getParentMount: benefit from GetMounts filter
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin 81d8452e30 libct/TestFactoryNewTmpfs: benefit from GetMounts
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin c7ab2c036b libcontainer: switch to moby/sys/mountinfo package
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is a fast mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin a572216f74 libcontainer/intelrdt: rm fmt.Sprintf
It is not needed as it does nothing here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 12:33:24 -07:00
Kir Kolyshkin 5542a2c77d libcontainer/cgroups: GetAllPids: optimize
1. Return earlier if there is an error.

2. Do not use filepath.Split on every entry, use info.Name() instead.

3. Make readProcsFile() accept file name as an argument, to avoid
   unnecessary file name and directory splitting and merging.

4. Skip on info.IsDir() -- this avoids an error when cgroup name is
   set to "cgroup.procs".

This is still not very good since filepath.Walk() performs an unnecessary
stat(2) on every entry, but better than before.
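
The four points above can be sketched roughly as follows. This is an illustration only, not the code from the commit: `getAllPids`, `readProcsFile` and `demoTree` are stand-in names, and the temporary directory merely imitates a cgroup hierarchy.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// readProcsFile takes the full file name, so the caller does not need
// to split and re-join directory and file name.
func readProcsFile(file string) ([]int, error) {
	f, err := os.Open(file)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	var pids []int
	s := bufio.NewScanner(f)
	for s.Scan() {
		if t := s.Text(); t != "" {
			pid, err := strconv.Atoi(t)
			if err != nil {
				return nil, err
			}
			pids = append(pids, pid)
		}
	}
	return pids, s.Err()
}

// getAllPids collects PIDs from every cgroup.procs file below root.
func getAllPids(root string) ([]int, error) {
	var pids []int
	err := filepath.Walk(root, func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err // return early if there is an error
		}
		// info.Name() avoids filepath.Split; the IsDir check skips a
		// cgroup directory that happens to be named "cgroup.procs".
		if info.IsDir() || info.Name() != "cgroup.procs" {
			return nil
		}
		cPids, err := readProcsFile(p)
		if err != nil {
			return err
		}
		pids = append(pids, cPids...)
		return nil
	})
	return pids, err
}

// demoTree builds a hypothetical hierarchy in which one cgroup is
// itself named "cgroup.procs".
func demoTree() (string, error) {
	root, err := os.MkdirTemp("", "cgroup-demo")
	if err != nil {
		return "", err
	}
	files := map[string]string{
		filepath.Join(root, "cgroup.procs", "cgroup.procs"): "1\n2\n",
		filepath.Join(root, "other", "cgroup.procs"):        "3\n",
	}
	for name, data := range files {
		if err := os.MkdirAll(filepath.Dir(name), 0o755); err != nil {
			return "", err
		}
		if err := os.WriteFile(name, []byte(data), 0o644); err != nil {
			return "", err
		}
	}
	return root, nil
}

func main() {
	root, err := demoTree()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(root)
	fmt.Println(getAllPids(root))
}
```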

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 12:27:36 -07:00
Kir Kolyshkin 12dc475dd6 libcontainer: simplify createCgroupsv2Path
fmt.Sprintf is slow and is not needed here, string concatenation would
be sufficient. It is also redundant to convert []byte from string and
back, since `bytes` package now provides the same functions as `strings`.

Use Fields() instead of TrimSpace() and Split(), mainly for readability
(note Fields() is somewhat slower than Split() but here it doesn't
matter much).

Use Join() to prepend the plus signs.
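
A minimal sketch of the Fields()/Join() idea, assuming the input is the whitespace-separated content of a controllers file; `subtreeControl` is a hypothetical name, not the function from the commit.

```go
package main

import (
	"fmt"
	"strings"
)

// subtreeControl turns a controller list such as "cpu io memory\n"
// into "+cpu +io +memory" using plain string concatenation.
func subtreeControl(controllers string) string {
	// Fields both trims surrounding whitespace and splits on it.
	fields := strings.Fields(controllers)
	if len(fields) == 0 {
		return ""
	}
	// Join supplies the plus sign for every entry but the first.
	return "+" + strings.Join(fields, " +")
}

func main() {
	fmt.Println(subtreeControl("cpu io memory\n"))
}
```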

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 11:51:55 -07:00
Mario Nitchev 648295be98 Skip test for cgroups v2
Signed-off-by: Yulia Nedyalkova <julianedialkova@hotmail.com>
2020-03-19 12:54:54 +02:00
Danail Branekov f34eb2c003 Retry writing to cgroup files on EINTR error
Golang 1.14 introduces asynchronous preemption, which results in
applications getting frequent EINTR (syscall interrupted) errors when
invoking slow syscalls, e.g. when writing to cgroup files.

As writing to cgroups is idempotent, it is safe to retry writing to the
file whenever the write syscall is interrupted.
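
The retry idea can be sketched as below. This is an assumption-laden illustration rather than the patch itself: `retryOnEINTR` and the `flaky` type are hypothetical, and `flaky` only simulates a write that is interrupted a fixed number of times.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// retryOnEINTR repeats write until it returns anything other than EINTR.
// This is only safe because the operation is idempotent.
func retryOnEINTR(write func() error) error {
	for {
		err := write()
		if !errors.Is(err, syscall.EINTR) {
			return err
		}
	}
}

// flaky simulates a write that is interrupted `failures` times.
type flaky struct {
	failures int
	calls    int
}

func (f *flaky) write() error {
	f.calls++
	if f.calls <= f.failures {
		return syscall.EINTR
	}
	return nil
}

func main() {
	f := &flaky{failures: 2}
	err := retryOnEINTR(f.write)
	fmt.Println(err, f.calls)
}
```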

Signed-off-by: Mario Nitchev <marionitchev@gmail.com>
2020-03-18 13:00:05 +02:00
SiYu Zhao 34d471769b fix readSync
Signed-off-by: SiYu Zhao <d.chaser.zsy@gmail.com>
2020-03-17 11:26:46 +08:00
Michael Crosby 939cd0b734
Merge pull request #1737 from wking/remove-procConsole-comment
libcontainer/sync: Drop procConsole transaction from comments
2020-03-16 14:00:00 -04:00
Michael Crosby 88474967d3
Merge pull request #1974 from openSUSE/unreachable-code
Remove unreachable code paths
2020-03-16 13:56:05 -04:00
Akihiro Suda 525b9f311c
Merge pull request #2248 from AkihiroSuda/fix-cgroupv2-conversion
cgroup2: fix conversion
2020-03-16 14:00:02 +09:00
Akihiro Suda 492d525e55 vendor: update go-systemd and godbus
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-16 13:26:03 +09:00
Mrunal Patel 981dbef514
Merge pull request #2226 from avagin/runsc-restore-cmd-wait
restore: fix a race condition in process.Wait()
2020-03-15 18:48:16 -07:00
Akihiro Suda aa269315a4 cgroup2: add CpuMax conversion
Fix #2243

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:58:39 +09:00
Akihiro Suda 64e9a97981 cgroup2: fix conversion
* TestConvertCPUSharesToCgroupV2Value(0) was returning 70369281052672, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(0) was returning 32, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(1000) was returning 4, while the correct value is 10000
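
For illustration, one way to get the values listed above is a linear mapping with an explicit zero check. The ranges used here (cpu shares 2..262144 and blkio weight 10..1000, both mapped onto 1..10000) are assumptions of this sketch, and the lower-case function names are not the ones from the patch.

```go
package main

import "fmt"

// convertCPUShares maps v1 cpu.shares (assumed range 2..262144) onto the
// v2 weight range 1..10000. Zero means "unset" and must stay zero;
// without the check, the unsigned subtraction below would wrap around.
func convertCPUShares(shares uint64) uint64 {
	if shares == 0 {
		return 0
	}
	return 1 + ((shares-2)*9999)/262142
}

// convertBlkIO maps v1 blkio weight (assumed range 10..1000) onto the
// v2 io weight range 1..10000, again keeping zero as "unset".
func convertBlkIO(weight uint16) uint64 {
	if weight == 0 {
		return 0
	}
	return 1 + (uint64(weight)-10)*9999/990
}

func main() {
	fmt.Println(convertCPUShares(0), convertCPUShares(262144))
	fmt.Println(convertBlkIO(0), convertBlkIO(1000))
}
```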

Fix #2244
Follow-up to #2212 #2213

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:57:07 +09:00
Sascha Grunert b477a159db
Remove unreachable code paths
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2020-03-12 09:13:03 +01:00
Mrunal Patel 0ff53526a4
Merge pull request #2252 from pkagrawal/2251-fix
Synchronize the call to linuxContainer.Signal()
2020-03-11 11:11:56 -07:00
Akihiro Suda 71dfb559d6
Merge pull request #2238 from tedyu/init-proc-err-ret
Use named error return for initProcess#start
2020-03-11 01:03:13 +09:00
Boris Popovschi 89a87adb38 Changed hugetlb pagesizes info source
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-03-10 15:28:45 +02:00
Boris Popovschi d804611d05 Added failcnt stats
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-03-10 15:19:44 +02:00
l00397676 62cfad97ca specconv: add a test case to check null spec.Process
Signed-off-by: l00397676 <lujingxiao@huawei.com>
2020-03-10 11:43:51 +08:00
Pradyumna Agrawal 5b2b138d24 Synchronize the call to linuxContainer.Signal()
linuxContainer.Signal() can race with another call to say Destroy()
which clears the container's initProcess. This can cause a nil pointer
dereference in Signal().

This patch will synchronize Signal() and Destroy() by grabbing the
container's mutex as part of the Signal() call.
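
A simplified sketch of the locking pattern, using stand-in types rather than the real linuxContainer; the `container` and `process` types and the error text are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// process stands in for the container's init process.
type process struct{ pid int }

func (p *process) signal(sig int) error { return nil }

type container struct {
	m           sync.Mutex
	initProcess *process
}

// Signal takes the same mutex as Destroy, so it sees initProcess either
// fully set or already cleared, never in the middle of a change.
func (c *container) Signal(sig int) error {
	c.m.Lock()
	defer c.m.Unlock()
	if c.initProcess == nil {
		return errors.New("container is not running")
	}
	return c.initProcess.signal(sig)
}

// Destroy clears initProcess under the lock.
func (c *container) Destroy() {
	c.m.Lock()
	defer c.m.Unlock()
	c.initProcess = nil
}

func main() {
	c := &container{initProcess: &process{pid: 1}}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.Destroy() }()
	go func() { defer wg.Done(); _ = c.Signal(15) }()
	wg.Wait()
	fmt.Println(c.Signal(15))
}
```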

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-03-09 11:15:22 -07:00
zyu 957da1f9ab Use named error return for initProcess#start
Signed-off-by: zyu <yuzhihong@gmail.com>
2020-03-09 09:29:03 -07:00
Akihiro Suda 6503438fd6
Merge pull request #2212 from Zyqsempai/2211-convert-blkio-weight-properly
Convert blkioWeight to io.weight properly
2020-03-05 09:32:45 +09:00
Aleksa Sarai 93e5c4d320
merge branch 'pr-2232'
Aleksa Sarai (1):
  libcontainer: dual-license nsenter/cloned_binary.c

LGTMs: @mrunalp @AkihiroSuda
Closes #2232
2020-03-04 11:10:49 +11:00
Qiang Huang 3b7e32feba
Merge pull request #2210 from Zyqsempai/2164-remove-deprecated-systemd-resources
Exchange deprecated systemd resources with the appropriate for cgroupv2
2020-02-29 10:13:55 +08:00
Boris Popovschi 7f37afa892 Added HugeTlb controller for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-25 14:50:55 +02:00
Aleksa Sarai 98de84265d
libcontainer: dual-license nsenter/cloned_binary.c
The new license is Apache-2.0 OR LGPL-2.1-or-later. This is necessary
for libcrun to be relicensed under the LGPL-2.1[1], and all of the
relevant copyright holders have agreed to relicense this code under the
dual license:

  * Aleksa Sarai [2]
  * Christian Brauner [3]
  * Justin Cormack [4]

Because it is still dual-licensed as an Apache-2.0 work, this doesn't
affect its usability within runc or any other dependent projects.

[1]: https://github.com/containers/crun/issues/256
[2]: https://github.com/containers/crun/issues/256#issuecomment-589498088
[3]: https://github.com/containers/crun/issues/256#issuecomment-589605034
[4]: https://github.com/containers/crun/issues/256#issuecomment-589504231

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-02-22 00:17:07 +11:00
Aleksa Sarai 0f32b03dda
merge branch 'pr-2192'
Boris Popovschi (2):
  Fix skip message for cgroupv2
  Fix MAJ:MIN io.stat parsing order

LGTMs: @hqhq @cyphar
Closes #2192
2020-02-21 16:00:17 +11:00
Boris Popovschi 4b8134f63b Convert blkioWeight to io.weight properly
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-18 15:44:07 +02:00
Kir Kolyshkin 1cd71dfd71 systemd properties: support for *Sec values
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.

This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUSec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.
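
The conversion could look roughly like this for a single float input. `secToUSec` is a hypothetical helper; the real code also has to handle the various dbus numeric types, which this sketch leaves out.

```go
package main

import (
	"fmt"
	"strings"
)

// secToUSec renames a "...Sec" property to "...USec" and scales the
// value from seconds to microseconds. Names that do not end in "Sec",
// or that already end in "USec", are reported as not converted.
func secToUSec(name string, sec float64) (string, uint64, bool) {
	if !strings.HasSuffix(name, "Sec") || strings.HasSuffix(name, "USec") {
		return name, 0, false
	}
	return strings.TrimSuffix(name, "Sec") + "USec", uint64(sec * 1e6), true
}

func main() {
	fmt.Println(secToUSec("TimeoutStopSec", 1.5))
}
```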

This turned out to be a bit trickier to implement than I had
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Kir Kolyshkin 4c5c3fb960 Support for setting systemd properties via annotations
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).

This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.

Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.

Example usage: add the following to runtime spec (config.json):

```
	"annotations": {
		"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
		"org.systemd.property.CollectMode":"'inactive-or-failed'"
	},
```

and start the container (e.g. `runc --systemd-cgroup run $ID`).

The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".
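
As a rough sketch of the selection step only: pick out the annotations carrying the `org.systemd.property.` prefix and keep the raw gvariant text as the value. `systemdProperties` and `usecToDuration` are hypothetical helpers, and parsing the gvariant values is deliberately omitted.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const propPrefix = "org.systemd.property."

// systemdProperties returns property name -> raw gvariant text for all
// annotations that carry the prefix.
func systemdProperties(annotations map[string]string) map[string]string {
	props := make(map[string]string)
	for k, v := range annotations {
		name := strings.TrimPrefix(k, propPrefix)
		if name != k && name != "" {
			props[name] = v
		}
	}
	return props
}

// usecToDuration shows how a USec value reads as a duration.
func usecToDuration(usec uint64) time.Duration {
	return time.Duration(usec) * time.Microsecond
}

func main() {
	props := systemdProperties(map[string]string{
		"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
		"org.systemd.property.CollectMode":     "'inactive-or-failed'",
		"com.example.unrelated":                "ignored",
	})
	fmt.Println(len(props), props["TimeoutStopUSec"])
	fmt.Println(usecToDuration(123456789))
}
```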

The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.

In particular, parameters with the `USec` suffix require a `uint64`
typed argument, while gvariant assumes int32 for numeric values,
therefore the explicit type is required.

NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.

[1] https://developer.gnome.org/glib/stable/gvariant-text.html

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Mrunal Patel 81ef5024f8
Merge pull request #2213 from Zyqsempai/2166-convert-cpu-weight-poperly
Added conversion for cpu.weight v2
2020-02-17 07:49:39 -08:00
Boris Popovschi 7c439cc6f6 Added conversion for cpu.weight v2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-12 11:32:34 +02:00
Andrei Vagin 269ea385a4 restore: fix a race condition in process.Wait()
Adrian reported that the checkpoint test started failing:
=== RUN   TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
    checkpoint_test.go:297: Did not restore the pipe correctly:

The problem here is that when we start exec.Cmd, we don't call its Wait
method. This means that we don't wait for cmd.goroutines and so we don't
know when all data will be read from process pipes.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2020-02-10 10:21:08 -08:00
Boris Popovschi 3b992087b8 Fix skip message for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-03 14:27:12 +02:00
Mrunal Patel 2fc03cc11c
Merge pull request #2207 from cyphar/fix-double-volume-attack
rootfs: do not permit /proc mounts to non-directories
2020-01-22 08:06:10 -08:00
Aleksa Sarai 3291d66b98
rootfs: do not permit /proc mounts to non-directories
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).

This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be to finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypassed by someone having "/" be a volume controlled
by another container.
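
The kind of check described can be sketched as follows, under the assumption that a missing target is acceptable because it would later be created as a directory. `checkProcMountDest` and `demoPaths` are hypothetical names, and like the hotfix itself this sketch is not race-free.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// checkProcMountDest refuses a /proc mount target that exists but is
// not a real directory. Lstat does not follow symlinks, so a symlink
// to a directory is rejected as well.
func checkProcMountDest(dest string) error {
	fi, err := os.Lstat(dest)
	if err != nil {
		if os.IsNotExist(err) {
			return nil // assumed to be created as a directory later
		}
		return err
	}
	if !fi.IsDir() {
		return fmt.Errorf("%q is not a directory, refusing to mount /proc on it", dest)
	}
	return nil
}

// demoPaths creates a real directory and a symlink inside it.
func demoPaths() (dir, link string, err error) {
	dir, err = os.MkdirTemp("", "procdest")
	if err != nil {
		return "", "", err
	}
	link = filepath.Join(dir, "proc")
	err = os.Symlink("/tmp", link)
	return dir, link, err
}

func main() {
	dir, link, err := demoPaths()
	if err != nil {
		fmt.Println(err)
		return
	}
	defer os.RemoveAll(dir)
	fmt.Println(checkProcMountDest(dir))
	fmt.Println(checkProcMountDest(link))
}
```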

Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-01-17 14:00:30 +11:00
Aleksa Sarai f6fb7a0338
merge branch 'pr-2133'
Julia Nedialkova (1):
  Handle ENODEV when accessing the freezer.state file

LGTMs: @crosbymichael @cyphar
Closes #2133
2020-01-17 02:07:19 +11:00
Boris Popovschi 5b96f314ba Exchanged deprecated systemd resources with the appropriate for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 18:09:33 +02:00
Boris Popovschi cf9b7c33e1 Fix MAJ:MIN io.stat parsing order
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 14:39:14 +02:00
Akihiro Suda 55f8c254be temporarily disable CRIU tests
Ubuntu kernel is temporarily broken: https://github.com/opencontainers/runc/pull/2198#issuecomment-571124087

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-14 11:18:44 +09:00
Akihiro Suda 5c20ea1472 fix merging #2177 and #2169
A new method was added to the cgroup interface when #2177 was merged.

After #2177 got merged, #2169 was merged without rebase (sorry!) and compilation was failing:

  libcontainer/cgroups/fs2/fs2.go:208:22: container.Cgroup undefined (type *configs.Config has no field or method Cgroup)

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-14 11:13:25 +09:00
Mrunal Patel 5cc0deaf7a
Merge pull request #2169 from AkihiroSuda/split-fs
cgroup2: split fs2 from fs
2020-01-13 16:23:27 -08:00
Michael Crosby 2b52db7527
Merge pull request #2177 from devimc/topic/libcontainer/kata-containers
libcontainer: export and add new methods to allow cgroups manipulation
2020-01-02 11:47:12 -05:00
Jordan Liggitt 8541d9cf3d Fix race checking for process exit and waiting for exec fifo
Signed-off-by: Jordan Liggitt <liggitt@google.com>
2019-12-18 18:48:18 +00:00
Julio Montes 8ddd892072 libcontainer: add method to get cgroup config from cgroup Manager
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add a method to get the cgroup configuration from the cgroup Manager to allow
API users to save it to disk and restore a cgroup manager later.

fixes #2176

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
Julio Montes cd7c59d042 libcontainer: export createCgroupConfig
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using the libcontainer API directly.

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
Aleksa Sarai 7496a96825
merge branch 'pr-2086'
* Kurnia D Win (1):
  fix permission denied

LGTMs: @crosbymichael @cyphar
Closes #2086
2019-12-17 20:49:52 +11:00
Aleksa Sarai 201b063745
merge branch 'pr-2141'
Radostin Stoyanov (1):
  criu: Ensure other users cannot read c/r files

LGTMs: @crosbymichael @cyphar
Closes #2141
2019-12-07 09:32:58 +11:00
Akihiro Suda ec49f98d72 fs2: support legacy device spec (to pass CI)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:53:07 +09:00
Akihiro Suda 88e8350de2 cgroup2: split fs2 from fs
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.

Inspired by containerd/cgroups#109

Fix #2157

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:42:10 +09:00
Aleksa Sarai 5e63695384
merge branch 'pr-2174'
Sascha Grunert (1):
  Expose network interfaces via runc events

LGTMs: @cyphar @mrunalp
Closes #2174
2019-12-06 13:07:44 +11:00
Michael Crosby 8bb10af481
Merge pull request #2165 from AkihiroSuda/travis-f31
.travis.yml: add Fedora 31 vagrant box (for cgroup2)
2019-12-05 16:26:51 -05:00
Sascha Grunert 41a20b5852
Expose network interfaces via runc events
The libcontainer network statistics are unreachable without manually
creating a libcontainer instance. To retrieve them via the CLI interface
of runc, we now expose them as well.

Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2019-12-05 13:20:51 +01:00
Akihiro Suda faf1e44ea9 cgroup2: ebpf: increase RLIM_MEMLOCK to avoid BPF_PROG_LOAD error
Fix #2167

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-11-07 15:43:27 +09:00
Mrunal Patel 46def4cc4c
Merge pull request #2154 from jpeach/2008-remove-static-build-tag
Remove the static_build build tag.
2019-11-04 17:10:59 -08:00
Akihiro Suda ccd4436fc4 .travis.yml: add Fedora 31 vagrant box (for cgroup2)
As a first baby step, only unit tests are executed.

Failing tests are currently skipped and will be fixed in follow-up PRs.

Fix #2124

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 16:53:01 +09:00
Akihiro Suda faf673ee45 cgroup2: port over eBPF device controller from crun
The implementation is based on https://github.com/containers/crun/blob/0.10.2/src/libcrun/ebpf.c

Although ebpf.c is originally licensed under LGPL-3.0-or-later, the author
Giuseppe Scrivano agreed to relicense the file in Apache License 2.0:
https://github.com/opencontainers/runc/issues/2144#issuecomment-543116397

See libcontainer/cgroups/ebpf/devicefilter/devicefilter_test.go for tested configurations.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 14:01:46 +09:00
Qiang Huang e57a774066
Merge pull request #2149 from AkihiroSuda/cgroup2-ps
cgroup2: implement `runc ps`
2019-10-31 09:44:39 +08:00
Qiang Huang d239ca8425
Merge pull request #2148 from AkihiroSuda/cg2-ignore-cpuset-when-no-config
cgroup2: cpuset_v2: skip Apply when no limit is specified
2019-10-29 21:57:58 +08:00
Mrunal Patel 03cf145f5a
Merge pull request #2159 from AkihiroSuda/cgroup2-mount-in-userns
cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
2019-10-28 19:19:09 -07:00
Akihiro Suda 74a3fe5d1b cgroup2: do not parse /proc/cgroups
/proc/cgroups is meaningless for v2 and should be ignored.

https://github.com/torvalds/linux/blob/v5.3/Documentation/admin-guide/cgroup-v2.rst#deprecated-v1-core-features

* Now GetAllSubsystems() parses /sys/fs/cgroup/cgroup.controllers, not /proc/cgroups.
  The function result also contains "pseudo" controllers: {"devices", "freezer"}.
  As it is hard to detect availability of pseudo controllers, pseudo controllers
  are always assumed to be available.

* Now IOGroupV2.Name() returns "io", not "blkio"

Fix #2155 #2156

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-28 00:00:33 +09:00
Akihiro Suda 9c81440fb5 cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared,
because we cannot mount cgroup2.

This behavior corresponds to crun v0.10.2.

Fix #2158

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-27 23:09:41 +09:00
James Peach 13919f5dfd Remove the static_build build tag.
The `static_build` build tag was introduced in e9944d0f
to remove build warnings related to systemd cgroup driver
dependencies. Since then, those dependencies have changed and
building the systemd cgroup driver no longer imports dlopen.

After this change, runc builds will always include the systemd
cgroup driver.

This fixes #2008.

Signed-off-by: James Peach <jpeach@apache.org>
2019-10-26 08:28:45 +11:00
Michael Crosby c4d8e1688c
Merge pull request #2140 from crosbymichael/fs-unified
Set unified mountpoint in find mnt func
2019-10-24 15:20:47 -04:00
Akihiro Suda dbd771e475 cgroup2: implement `runc ps`
Implemented `runc ps` for cgroup v2, using a newly added method `m.GetUnifiedPath()`.
Unlike the v1 implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 01:59:24 +09:00
Akihiro Suda d918e7f408 cpuset_v2: skip Apply when no limit is specified
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 00:33:31 +09:00
Akihiro Suda 033936ef76 io_v2.go: remove blkio v1 code
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-18 21:33:48 +09:00
Radostin Stoyanov a610a84821 criu: Ensure other users cannot read c/r files
No checkpoint files should be readable by
anyone else but the user creating it.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-17 07:49:38 +01:00
Michael Crosby b28f58f31b
Set unified mountpoint in find mnt func
This is needed for the fsv2 cgroups to work when there is a unified mountpoint.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-10-15 15:40:03 -04:00
Radostin Stoyanov f017e0f9e1 checkpoint: Set descriptors.json file mode to 0600
Prevent unprivileged users from being able to read descriptors.json
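
A minimal sketch of creating such a file with owner-only permissions; `writePrivate` and `demo` are hypothetical helpers, and the file name is only reused from the commit title for illustration. Note that the mode passed to `os.WriteFile` only applies when the file is newly created.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writePrivate writes data to path with mode 0600, so only the owning
// user can read or write a newly created file.
func writePrivate(path string, data []byte) error {
	return os.WriteFile(path, data, 0o600)
}

// demo writes a file into a fresh temp dir and reports its permissions.
func demo() (os.FileMode, error) {
	dir, err := os.MkdirTemp("", "checkpoint")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "descriptors.json")
	if err := writePrivate(path, []byte("[]")); err != nil {
		return 0, err
	}
	fi, err := os.Stat(path)
	if err != nil {
		return 0, err
	}
	return fi.Mode().Perm(), nil
}

func main() {
	fmt.Println(demo())
}
```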

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-12 19:29:44 +01:00
Aleksa Sarai 1b8a1eeec3
merge branch 'pr-2132'
Support different field counts of cpuacct.stat

LGTMs: @crosbymichael @cyphar
Closes #2132
2019-10-02 01:50:47 +10:00
Aleksa Sarai d463f6485b
*: verify that operations on /proc/... are on procfs
This is an additional mitigation for CVE-2019-16884. The primary problem
is that Docker can be coerced into bind-mounting a file system on top of
/proc (resulting in label-related writes to /proc no longer happening).

While we are working on mitigations against permitting the mounts, this
helps keep our code from being tricked into writing to non-procfs
files. This is not a perfect solution (after all, there might be a
bind-mount of a different procfs file over the target) but in order to
exploit that you would need to be able to tweak a config.json pretty
specifically (which thankfully Docker doesn't allow).

Specifically this stops AppArmor from not labeling a process silently
due to /proc/self/attr/... being incorrectly set, and stops any
accidental fd leaks because /proc/self/fd/... is not real.
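
One way to express such a check on Linux is to compare the filesystem type reported by statfs(2) with the procfs magic number. This is a sketch under assumptions: `isProcfs` is a hypothetical helper working on a path, whereas a hardened version would more likely check an already opened file descriptor.

```go
package main

import (
	"fmt"
	"syscall"
)

// procSuperMagic is PROC_SUPER_MAGIC from linux/magic.h.
const procSuperMagic = 0x9fa0

// isProcfs reports whether path lives on a procfs filesystem.
// Linux only: it relies on syscall.Statfs and the Type field.
func isProcfs(path string) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return false, err
	}
	// The width of Type differs between architectures, hence the cast.
	return int64(st.Type) == procSuperMagic, nil
}

func main() {
	for _, p := range []string{"/proc/self/attr", "/"} {
		ok, err := isProcfs(p)
		fmt.Println(p, ok, err)
	}
}
```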

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-09-30 09:06:48 +10:00
tianye15 28e58a0f6a Support different field counts of cpuacct.stat
Signed-off-by: skilxnTL <tylxltt@gmail.com>
2019-09-29 10:20:58 +08:00
Julia Nedialkova e63b797f38 Handle ENODEV when accessing the freezer.state file
...when checking if a container is paused

Signed-off-by: Julia Nedialkova <julianedialkova@hotmail.com>
2019-09-27 17:02:56 +03:00
blacktop 84373aaa56 Add SCMP_ACT_LOG as a valid Seccomp action (#1951)
Signed-off-by: blacktop <blacktop@users.noreply.github.com>
2019-09-26 11:03:03 -04:00
Michael Crosby 331692baa7 Only allow proc mount if it is procfs
Fixes #2128

This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.

```bash
> sudo docker run --rm  apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.

> sudo docker run --rm -v /proc:/proc apparmor

docker-default (enforce)        root     18989  0.9  0.0   1288     4 ?
Ss   16:47   0:00 sleep 20
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-09-24 11:00:18 -04:00
Jonathan Rudenberg af7b6547ec libcontainer/nsenter: Don't import C in non-cgo file
Signed-off-by: Jonathan Rudenberg <jonathan@titanous.com>
2019-09-11 17:03:07 +00:00
Giuseppe Scrivano 718a566e02
cgroup: support mount of cgroup2
convert a "cgroup" mount to "cgroup2" when the system uses cgroups v2
unified hierarchy.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-06 17:57:14 +02:00
Sebastiaan van Stijn eb86f6037e
bump syndtr/gocapability d98352740cb2c55f81556b63d4a1ec64c5a319c2
relevant changes:

  - syndtr/gocapability#14 capability: Deprecate NewPid and NewFile for NewPid2 and NewFile2
  - syndtr/gocapability#16 Fix capHeader.pid type

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-06 01:44:26 +02:00
Mrunal Patel 92ac8e3f84
Merge pull request #2113 from giuseppe/cgroupv2
libcontainer: initial support for cgroups v2
2019-09-05 13:14:29 -07:00
Giuseppe Scrivano 524cb7c318
libcontainer: add systemd.UnifiedManager
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:27 +02:00
Giuseppe Scrivano ec11136828
libcontainer, cgroups: rename systemd.Manager to LegacyManager
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:26 +02:00
Giuseppe Scrivano 1932917b71
libcontainer: add initial support for cgroups v2
Allow setting which subsystems are used by
libcontainer/cgroups/fs.Manager.

subsystemsUnified is used on a system running with cgroups v2 unified
mode.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:25 +02:00
Mrunal Patel 92d851e03b
Merge pull request #2123 from carlosedp/riscv64
Bump x/sys and update syscall for initial Risc-V support
2019-09-04 14:10:26 -07:00
Carlos de Paula 4316e4d047 Bump x/sys and update syscall to start Risc-V support
Signed-off-by: Carlos de Paula <me@carlosedp.com>
2019-08-29 12:09:08 -03:00
Akihiro Suda 0bc069d795 nsenter: fix clang-tidy warning
nsexec.c:148:3: warning: Initialized va_list 'args' is leaked [clang-analyzer-valist.Unterminated]

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-08-29 00:18:02 +09:00
Akihiro Suda b225ef58fb nsenter: minor clean up
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-08-28 19:50:35 +09:00
Daniel J Walsh e4aa73424b
Rename cgroups_windows.go to cgroups_unsupported.go
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-08-26 18:13:52 -04:00
Mrunal Patel c61c7370f9
Merge pull request #2103 from sipsma/cgnil
cgroups/fs: check nil pointers in cgroup manager
2019-08-26 14:05:44 -07:00
Mrunal Patel 68d73f0a2e
Merge pull request #2107 from sashayakovtseva/public-get-devices
Make get devices function public
2019-08-26 09:58:10 -07:00
Kenta Tada c740965a18 libcontainer: update masked paths of /proc
This commit updates the masked paths of /proc.

Related issues:
* https://github.com/moby/moby/pull/37404
* https://github.com/moby/moby/pull/38299
* https://github.com/moby/moby/pull/36368

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-08-26 12:25:56 +09:00
Mrunal Patel 3525eddec5
Merge pull request #2117 from filbranden/detection1
Remove libcontainer detection for systemd features
2019-08-25 13:15:15 -07:00
Filipe Brandenburger 518c855833 Remove libcontainer detection for systemd features
Transient units (and transient slice units) have been available for quite a
long time and RHEL 7 with systemd v219 (likely the oldest OS we care about at
this point) supports that. A system running a systemd without these features is
likely to break a lot of other stuff that runc/libcontainer care about.

Regarding delegated slices, modern systemd doesn't allow it and
runc/libcontainer run fine on it, so we might as well just stop requesting it
on older versions of systemd which allowed it. (Those versions never really
changed behavior significantly when that option was passed anyways.)

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-08-22 21:53:24 -07:00
Filipe Brandenburger 588f040a77 Avoid the dependency on cgo through go-systemd/util package
This dependency is only needed in package "github.com/coreos/go-systemd/util"
and we only use it for IsRunningSystemd(), which is a simple Go function that
just stats a file.

Let's just borrow it here, so we remove the dependency and can remove that
package from vendored build.

This also removes dependencies on dlopen and on trying to find libsystemd.so
or libsystemd-login.so in the system.

Tested that this still builds and works as expected.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-08-22 21:07:24 -07:00