Commit Graph

236 Commits

Author SHA1 Message Date
Kir Kolyshkin 108ee85b82 libct/cgroups: add SkipDevices to Resources
The kubelet uses libct/cgroups code to set up cgroups. It creates a
parent cgroup (kubepods) to put the containers into.

The problem (for cgroupv2, which uses eBPF for device configuration) is
that the hard requirement to have the devices cgroup configured results in
leaking an eBPF program upon every kubelet restart. If kubelet
is restarted 64+ times, the cgroup can't be configured anymore.

Work around this by adding a SkipDevices flag to Resources.

A check was added so that if SkipDevices is set, such a "container"
can't be started (to make sure it is only used for non-containers).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-07-02 15:19:31 -07:00
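
The guard described in 108ee85b82 could look roughly like the sketch below. The `Resources` type here is a trimmed-down stand-in, and `validateStart` and `errSkipDevices` are hypothetical names chosen for illustration; they are not taken from the actual patch.

```go
package main

import (
	"errors"
	"fmt"
)

// Resources is a hypothetical, reduced stand-in for the cgroup resources
// struct; only the field relevant to this commit is shown.
type Resources struct {
	// SkipDevices asks the cgroup manager not to configure the devices
	// controller, so no device eBPF program is attached on cgroup v2.
	SkipDevices bool
}

// errSkipDevices is an illustrative error value, not the real message.
var errSkipDevices = errors.New("cannot start a container with SkipDevices set")

// validateStart (hypothetical) refuses to start anything whose resources
// have SkipDevices set, so the flag is only usable for parent cgroups.
func validateStart(r *Resources) error {
	if r != nil && r.SkipDevices {
		return errSkipDevices
	}
	return nil
}

func main() {
	parent := &Resources{SkipDevices: true} // e.g. a kubepods-style parent cgroup
	fmt.Println("parent:", validateStart(parent))
	fmt.Println("container:", validateStart(&Resources{}))
}
```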
Kir Kolyshkin dff7685c18
Merge pull request #2459 from tedyu/linux-cont-set-cfg
Set configs back when intelrdt configs cannot be set

LGTMs: @AkihiroSuda @kolyshkin
2020-06-19 12:57:53 -07:00
Akihiro Suda 9748b48742
Merge pull request #2229 from RenaudWasTaken/create-container
Add CreateRuntime, CreateContainer and StartContainer Hooks
2020-06-19 12:27:51 +09:00
Renaud Gaubert ccdd75760c Add the CreateRuntime, CreateContainer and StartContainer Hooks
Signed-off-by: Renaud Gaubert <rgaubert@nvidia.com>
2020-06-17 02:10:00 +00:00
Kir Kolyshkin d5c57dcea6 libct/criuApplyCgroups: don't set cgroup paths for v2
There is no need to have cgroupv1-specific controller paths on restore
in case of cgroupv2.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:02 -07:00
Kir Kolyshkin 52b56bc28e libct/criuSwrk: remove applyCgroups param
Its value can be easily deduced from the request type.

While at it, simplify the call logic.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-06-16 12:40:01 -07:00
Kir Kolyshkin 5b247e739c
Merge pull request #2338 from lifubang/systemdcgroupv2
fix path error in systemd when stopped

LGTMs: @mrunalp @AkihiroSuda
2020-06-15 18:01:13 -07:00
Akihiro Suda 601fa557c0
Merge pull request #2414 from kolyshkin/criu-notif
use lazy-pages ready notification for criu >= 3.15
2020-06-16 09:31:12 +09:00
Mrunal Patel a4a306d2a2 Write state.json atomically
We want to make sure that the state file is synced and cannot be
read partially or truncated.

Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2020-06-10 20:21:04 -07:00
Ted Yu 9d275d326c Set configs back when intelrdt configs cannot be set
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-06-06 15:10:45 -07:00
lifubang 9087f2e827 fix path error in systemd when stopped
When we use cgroups with the systemd driver, the cgroup path is automatically
removed by systemd once all processes have exited. So we should check that the
cgroup path exists before accessing it, for example in `kill/ps`; otherwise we
will get an error.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-06-02 18:17:43 +08:00
Akihiro Suda c91fe9aeba cgroup2: exec: join the cgroup of the init process on EBUSY
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-05-31 13:09:43 +09:00
Ted Yu 3ba3d9b1bd Wait for criuProcess once
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-29 15:50:37 -07:00
Kir Kolyshkin 68391c0e96 use lazy-pages ready notification for criu >= 3.15
This relies on https://github.com/checkpoint-restore/criu/pull/1069
and emulates the previous behavior by writing \0 and closing status
fd (as it was done by criu).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-23 11:37:28 -07:00
Kir Kolyshkin 7ab1329835 libct/criuNotifications: simplify switch
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-23 11:37:28 -07:00
Adrian Reber 944e057025
Update to latest go-criu (4.0.2)
This updates to the latest version of go-criu (4.0.2) which is based on
CRIU 3.14.

As go-criu provides an existing way to query the CRIU binary for its
version this also removes all the code from runc to handle CRIU version
checking and now relies on go-criu.

An important side effect of this change is that this raises the minimum
CRIU version to 3.0.0 as that is the first CRIU version that supports
CRIU version queries via RPC in contrast to parsing the output of
'criu --version'

CRIU 3.0 was released in April of 2017.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-05-20 13:49:38 +02:00
Akihiro Suda f369199ff6
Merge pull request #2413 from JFHwang/2392-spec-check
Add nil check of spec.Process in validateProcessSpec()
2020-05-19 08:11:22 +09:00
Mrunal Patel 825e91ada6
Merge pull request #2341 from kolyshkin/test-cpt-lazy
runc checkpoint: fix --status-fd to accept fd
2020-05-18 10:43:24 -07:00
John Hwang 7fc291fd45 Replace formatted errors when unneeded
Signed-off-by: John Hwang <John.F.Hwang@gmail.com>
2020-05-16 18:13:21 -07:00
Aleksa Sarai 859a780d6f
cgroups: add GetFreezerState() helper to Manager
This is effectively a nicer implementation of the container.isPaused()
helper, but to be used within the cgroup code for handling some fun
issues we have to fix with the systemd cgroup driver.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-05-13 17:38:45 +10:00
Kir Kolyshkin ca1d135bd4 runc checkpoint: fix --status-fd to accept fd
1. The command `runc checkpoint --lazy-server --status-fd $FD` actually
accepts a file name as an $FD. Make it accept a file descriptor,
like its name implies and the documentation states.

In addition, since runc itself does not use the result of CRIU status
fd, remove the code which relays it, and pass the FD directly to CRIU.

Note 1: runc should close this file descriptor itself after passing it
to criu, otherwise whoever waits on it might wait forever.

Note 2: due to the way criu swrk consumes the fd (it reopens
/proc/$SENDER_PID/fd/$FD), runc can't close it as soon as criu swrk has
started. There is no good way to know when criu swrk has reopened the
fd, so we assume that as soon as we have received something back, the
fd is already reopened.

2. Since the meaning of --status-fd has changed, the test case using
it needs to be fixed as well.

Modify the lazy migration test to remove "sleep 2", actually waiting
for the lazy page server to be ready.

While at it,

 - remove the double fork (using shell's background process is
   sufficient here);

 - check the exit code for "runc checkpoint" and "criu lazy-pages";

 - remove the check for no errors in dump.log after restore, as we
   are already checking its exit code.

[v2: properly close status fd after spawning criu]
[v3: move close status fd to after the first read]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-11 15:36:50 -07:00
Kir Kolyshkin 714c91e9f7 Simplify cgroup path handling in v2 via unified API
This unties the Gordian Knot of using GetPaths in cgroupv2 code.

The problem is, the current code uses GetPaths for three kinds of things:

1. Get all the paths to cgroup v1 controllers to save its state (see
   (*linuxContainer).currentState(), (*LinuxFactory).loadState()
   methods).

2. Get all the paths to cgroup v1 controllers to have the setns process
   enter the proper cgroups in `(*setnsProcess).start()`.

3. Get the path to a specific controller (for example,
   `m.GetPaths()["devices"]`).

Now, for cgroup v2 instead of a set of per-controller paths, we have only
one single unified path, and a dedicated function `GetUnifiedPath()` to get it.

This discrepancy between v1 and v2 cgroupManager API leads to the
following problems with the code:

 - multiple if/else code blocks that have to treat v1 and v2 separately;

 - backward-compatible GetPaths() methods in v2 controllers;

 - repeated writing of the PID into the same cgroup for v2.

Overall, it's hard to write the right code with all this, and the code
that is written is kinda hard to follow.

The solution is to slightly change the API to do the 3 things outlined
above in the same manner for v1 and v2:

1. Use `GetPaths()` for state saving and setns process cgroups entering.

2. Introduce and use Path(subsys string) to obtain a path to a
   subsystem. For v2, the argument is ignored and the unified path is
   returned.

This commit converts all the controllers to the new API, and modifies
all the users to use it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 12:04:06 -07:00
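
The API shape described in 714c91e9f7 might be sketched like this. The `Manager` interface, the `v1Manager` and `v2Manager` types, and the empty-string key returned by the v2 `GetPaths()` are simplified assumptions made for illustration, not the real libcontainer definitions.

```go
package main

import "fmt"

// Manager is a hypothetical, reduced cgroup manager interface with the two
// accessors the commit message talks about.
type Manager interface {
	// GetPaths is used for state saving and for the setns process.
	GetPaths() map[string]string
	// Path returns the path to one subsystem.
	Path(subsys string) string
}

// v1Manager keeps one path per controller.
type v1Manager struct{ paths map[string]string }

func (m *v1Manager) GetPaths() map[string]string { return m.paths }
func (m *v1Manager) Path(subsys string) string   { return m.paths[subsys] }

// v2Manager has a single unified path.
type v2Manager struct{ unified string }

// GetPaths returns the unified path under an empty key (an assumption here).
func (m *v2Manager) GetPaths() map[string]string {
	return map[string]string{"": m.unified}
}

// Path ignores its argument and returns the unified path.
func (m *v2Manager) Path(string) string { return m.unified }

// devicesPath works for both versions without an if/else on the version.
func devicesPath(m Manager) string { return m.Path("devices") }

func main() {
	v1 := &v1Manager{paths: map[string]string{
		"devices": "/sys/fs/cgroup/devices/demo",
		"memory":  "/sys/fs/cgroup/memory/demo",
	}}
	v2 := &v2Manager{unified: "/sys/fs/cgroup/demo"}
	fmt.Println(devicesPath(v1))
	fmt.Println(devicesPath(v2))
}
```

Callers that need one controller's path no longer index into a map, which is what removes the per-version branches.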
Kir Kolyshkin 63854b0ea8 newSetnsProcess: reuse state.CgroupPaths
c.cgroupManager.GetPaths() are called twice here: once in currentState()
and then in newSetnsProcess(). Reuse the result of the first call, which
is stored into state.CgroupPaths.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:59 -07:00
Kir Kolyshkin 9a3e632625 notify: simplify usage
Instead of passing the whole map of paths, pass the path to the memory
controller which these functions actually require.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-05-08 10:05:58 -07:00
lifubang 657407ff23 fix runc events error in cgroup v2
Signed-off-by: lifubang <lifubang@acmcoder.com>
2020-05-07 22:18:46 +08:00
Ted Yu db29dce076 Close fd in case fd.Write() returns error
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-05-02 20:06:08 -07:00
Mrunal Patel 634e51b52c
Merge pull request #2335 from kolyshkin/cgroupv2-cpt
Fix cgroupv2 checkpoint/restore
2020-04-24 08:47:36 -07:00
Mrunal Patel c420a3ec7f
Merge pull request #2324 from kolyshkin/criu-freezer
libcontainer: fix Checkpoint wrt cgroupv2
2020-04-23 19:24:38 -07:00
Kir Kolyshkin 9280e3566d checkpoint/restore: fix cgroupv2 handling
In case of cgroupv2 unified hierarchy, the /sys/fs/cgroup mount
is the real mount with fstype of cgroup2 (rather than a set of
external bind mounts like for cgroupv1).

So, we should not add it to the list of "external bind mounts"
on both checkpoint and restore.

Without this fix, checkpoint integration tests fail on cgroup v2.

Also, the same is true for cgroup v1 + cgroupns.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-22 11:26:43 -07:00
Kir Kolyshkin af6b9e7fa9 nit: do not use syscall package
In many places (not all of them though) we can use `unix.`
instead of `syscall.` as these are identical.

In particular, x/sys/unix defines:

```go
type Signal = syscall.Signal
type Errno = syscall.Errno
type SysProcAttr = syscall.SysProcAttr

const ENODEV      = syscall.Errno(0x13)
```

and unix.Exec() calls syscall.Exec().

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-18 16:16:49 -07:00
Kir Kolyshkin b3a481eb77 libcontainer: fix Checkpoint wrt cgroupv2
Commit 9a0184b10f meant to enable using cgroup v2 freezer
for criu >= 3.14, but it looks like it is doing something else
instead.

The logic here is:

 - for cgroup v1, set FreezeCgroup, if available
 - for cgroup v2, only set it for criu >= 3.14
 - do not use GetPaths() in case v2 is used
   (this method is obsoleted for v2 and will be removed)

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-17 16:17:00 -07:00
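
The decision logic listed in b3a481eb77 can be written down as a small function. This is a sketch only: `freezeCgroupPath` is a hypothetical name, and encoding CRIU 3.14 as the integer 31400 (major*10000 + minor*100 + sublevel) is an assumption made here.

```go
package main

import "fmt"

// criuFreezerV2Version is the assumed integer encoding of CRIU 3.14.
const criuFreezerV2Version = 31400

// freezeCgroupPath (hypothetical) returns the path to hand to CRIU as the
// freeze cgroup, or "" when it should not be set at all.
func freezeCgroupPath(isV2 bool, criuVersion int, freezerPath string) string {
	if freezerPath == "" {
		return "" // no freezer available
	}
	if isV2 && criuVersion < criuFreezerV2Version {
		return "" // CRIU too old for the cgroup v2 freezer
	}
	return freezerPath
}

func main() {
	fmt.Printf("%q\n", freezeCgroupPath(false, 30000, "/sys/fs/cgroup/freezer/demo"))
	fmt.Printf("%q\n", freezeCgroupPath(true, 31300, "/sys/fs/cgroup/demo"))
	fmt.Printf("%q\n", freezeCgroupPath(true, 31400, "/sys/fs/cgroup/demo"))
}
```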
Ted Yu 7a978e354a Defer netns.Close() after error check
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-15 18:33:20 -07:00
Ted Yu 21d7bb95eb Close criuServer so that even if CRIU crashes or unexpectedly exits, runc will not hang
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-03 15:27:27 -07:00
Kir Kolyshkin b2272b2cba libcontainer: use errors.Is() and errors.As()
Make use of errors.Is() and errors.As() where appropriate to check
the underlying error. The biggest motivation is to simplify the code.

The feature requires go 1.13 but since merging #2256 we are already
not supporting go 1.12 (which is an unsupported release anyway).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-02 20:34:01 -07:00
Kir Kolyshkin c39f87a47a Revert "Merge pull request #2280 from kolyshkin/errors-unwrap"
Using errors.Unwrap() is not the best thing to do, since it returns
nil in case of an error which was not wrapped. Moreover, the
errors package provides more elegant ways to check for underlying
errors, such as errors.As() and errors.Is().

This reverts commit f8e138855d, reversing
changes made to 6ca9d8e6da.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-04-02 19:41:11 -07:00
Michael Crosby f8e138855d
Merge pull request #2280 from kolyshkin/errors-unwrap
Use errors.Unwrap() where possible
2020-04-02 14:39:06 -04:00
Michael Crosby 6ca9d8e6da
Merge pull request #2283 from tedyu/runc-path-in-prefix
isPathInPrefixList return value should be reverted
2020-04-02 14:09:49 -04:00
Michael Crosby b26e4f27c1
Merge pull request #2284 from tedyu/criu-svr-close
Avoid double close of criuServer
2020-04-02 14:07:35 -04:00
Mrunal Patel e3e26cafe9
Merge pull request #2276 from kolyshkin/criu-v2
cgroupv2: don't use GetCgroupMounts for criu c/r
2020-04-01 17:36:24 -07:00
Ted Yu 49896ab0f4 Avoid double close of criuServer
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-01 16:15:23 -07:00
Ted Yu d02fc48422 isPathInPrefixList return value should be reverted
Signed-off-by: Ted Yu <yuzhihong@gmail.com>
2020-04-01 15:45:31 -07:00
Kir Kolyshkin 8d7977ee6e libct/isPaused: don't use GetPaths from v2 code
Using GetPaths from cgroupv2 unified hierarchy code is deprecated
and this function will (hopefully) be removed.

Use GetUnifiedPath() for v2 case.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:24:28 -07:00
Kir Kolyshkin 12e156f076 libct.isPaused: use errors.Unwrap
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 20:07:04 -07:00
Kir Kolyshkin fc840f199f cgroupv2: don't use GetCgroupMounts for criu c/r
When performing checkpoint or restore of cgroupv2 unified hierarchy,
there is no need to call getCgroupMounts() / cgroups.GetCgroupMounts()
as there's only a single mount in there.

This eliminates the last internal (i.e. runc) use case of
cgroups.GetCgroupMounts() for v2 unified. Unfortunately, there
are external ones (e.g. moby/moby) so we can't yet let it
return an error.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-31 17:05:11 -07:00
Michael Crosby 9ec5b03e5a
Merge pull request #2259 from adrianreber/v2-test
Add minimal cgroup2 checkpoint/restore support
2020-03-31 15:01:18 -04:00
Yulia Nedyalkova 2abc6a3605 Actually check for syscall.ENODEV when checking if a container is paused
It turns out that ioutil.ReadFile wraps the error in a *os.PathError.
Since we cannot guarantee compilation with golang >= v1.13, we are
manually unwrapping the error.

Signed-off-by: Kieron Browne <kbrowne@pivotal.io>
2020-03-31 15:52:20 +01:00
Adrian Reber 9a0184b10f
cgroup2: use CRIU's new freezer v2 support
The newest CRIU version supports freezer v2. This tells runc
to use it if CRIU is new enough, or to fall back to non-freezer based
process freezing on cgroup v2 systems.

Signed-off-by: Adrian Reber <areber@redhat.com>
2020-03-31 16:36:35 +02:00
Michael Crosby 88474967d3
Merge pull request #1974 from openSUSE/unreachable-code
Remove unreachable code paths
2020-03-16 13:56:05 -04:00
Mrunal Patel 981dbef514
Merge pull request #2226 from avagin/runsc-restore-cmd-wait
restore: fix a race condition in process.Wait()
2020-03-15 18:48:16 -07:00
Sascha Grunert b477a159db
Remove unreachable code paths
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2020-03-12 09:13:03 +01:00