1. Return earlier if there is an error.
2. Do not use filepath.Split on every entry, use info.Name() instead.
3. Make readProcsFile() accept file name as an argument, to avoid
unnecessary file name and directory splitting and merging.
4. Skip on info.IsDir() -- this avoids an error when cgroup name is
set to "cgroup.procs".
This is still not very good since filepath.Walk() performs an unnecessary
stat(2) on every entry, but better than before.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
fmt.Sprintf is slow and is not needed here, string concatenation would
be sufficient. It is also redundant to convert []byte from string and
back, since `bytes` package now provides the same functions as `strings`.
Use Fields() instead of TrimSpace() and Split(), mainly for readability
(note Fields() is somewhat slower than Split() but here it doesn't
matter much).
Use Join() to prepend the plus signs.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Golang 1.14 introduces asynchronous preemption which results into
applications getting frequent EINTR (syscall interrupted) errors when
invoking slow syscalls, e.g. when writing to cgroup files.
As writing to cgroups is idempotent, it is safe to retry writing to the
file whenever the write syscall is interrupted.
Signed-off-by: Mario Nitchev <marionitchev@gmail.com>
* TestConvertCPUSharesToCgroupV2Value(0) was returning 70369281052672, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(0) was returning 32, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(1000) was returning 4, while the correct value is 10000
Fix#2244
Follow-up to #2212#2213
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
linuxContainer.Signal() can race with another call to say Destroy()
which clears the container's initProcess. This can cause a nil pointer
dereference in Signal().
This patch will synchronize Signal() and Destroy() by grabbing the
container's mutex as part of the Signal() call.
Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.
This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUsec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.
This turned out a bit more tricky to implement when I was
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).
This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.
Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.
Example usage: add the following to runtime spec (config.json):
```
"annotations": {
"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
"org.systemd.property.CollectMode":"'inactive-or-failed'"
},
```
and start the container (e.g. `runc --systemd-cgroup run $ID`).
The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".
The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.
In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.
NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.
[1] https://developer.gnome.org/glib/stable/gvariant-text.html
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Adrian reported that the checkpoint test stated failing:
=== RUN TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
checkpoint_test.go:297: Did not restore the pipe correctly:
The problem here is when we start exec.Cmd, we don't call its wait
method. This means that we don't wait cmd.goroutines ans so we don't
know when all data will be read from process pipes.
Signed-off-by: Andrei Vagin <avagin@gmail.com>
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).
This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypased by someone having "/" be a volume controlled
by another container.
Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
A new method was added to the cgroup interface when #2177 was merged.
After #2177 got merged, #2169 was merged without rebase (sorry!) and compilation was failing:
libcontainer/cgroups/fs2/fs2.go:208:22: container.Cgroup undefined (type *configs.Config has no field or method Cgroup)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add method to get cgroup configuration from cgroup Manager to allow API users
save it to disk and restore a cgroup manager later.
fixes#2176
Signed-off-by: Julio Montes <julio.montes@intel.com>
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using directly libcontainer API.
Signed-off-by: Julio Montes <julio.montes@intel.com>
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.
Inspired by containerd/cgroups#109
Fix#2157
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The libcontainer network statistics are unreachable without manually
creating a libcontainer instance. To retrieve them via the CLI interface
of runc, we now expose them as well.
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
As the baby step, only unit tests are executed.
Failing tests are currently skipped and will be fixed in follow-up PRs.
Fix#2124
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
/proc/cgroups is meaningless for v2 and should be ignored.
https://github.com/torvalds/linux/blob/v5.3/Documentation/admin-guide/cgroup-v2.rst#deprecated-v1-core-features
* Now GetAllSubsystems() parses /sys/fs/cgroup/cgroup.controller, not /proc/cgroups.
The function result also contains "pseudo" controllers: {"devices", "freezer"}.
As it is hard to detect availability of pseudo controllers, pseudo controllers
are always assumed to be available.
* Now IOGroupV2.Name() returns "io", not "blkio"
Fix#2155#2156
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared,
because we cannot mount cgroup2.
This behavior correspond to crun v0.10.2.
Fix#2158
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
The `static_build` build tag was introduced in e9944d0f
to remove build warnings related to systemd cgroup driver
dependencies. Since then, those dependencies have changed and
building the systemd cgroup driver no longer imports dlopen.
After this change, runc builds will always include the systemd
cgroup driver.
This fixes#2008.
Signed-off-by: James Peach <jpeach@apache.org>
Implemented `runc ps` for cgroup v2 , using a newly added method `m.GetUnifiedPath()`.
Unlike the v1 implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
This is an additional mitigation for CVE-2019-16884. The primary problem
is that Docker can be coerced into bind-mounting a file system on top of
/proc (resulting in label-related writes to /proc no longer happening).
While we are working on mitigations against permitting the mounts, this
helps avoid our code from being tricked into writing to non-procfs
files. This is not a perfect solution (after all, there might be a
bind-mount of a different procfs file over the target) but in order to
exploit that you would need to be able to tweak a config.json pretty
specifically (which thankfully Docker doesn't allow).
Specifically this stops AppArmor from not labeling a process silently
due to /proc/self/attr/... being incorrectly set, and stops any
accidental fd leaks because /proc/self/fd/... is not real.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Fixes#2128
This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.
```bash
> sudo docker run --rm apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.
> sudo docker run --rm -v /proc:/proc apparmor
docker-default (enforce) root 18989 0.9 0.0 1288 4 ?
Ss 16:47 0:00 sleep 20
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
relevant changes:
- syndtr/gocapability#14 capability: Deprecate NewPid and NewFile for NewPid2 and NewFile2
- syndtr/gocapability#16 Fix capHeader.pid type
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
allow to set what subsystems are used by
libcontainer/cgroups/fs.Manager.
subsystemsUnified is used on a system running with cgroups v2 unified
mode.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Transient units (and transient slice units) have been available for quite a
long time and RHEL 7 with systemd v219 (likely the oldest OS we care about at
this point) supports that. A system running a systemd without these features is
likely to break a lot of other stuff that runc/libcontainer care about.
Regarding delegated slices, modern systemd doesn't allow it and
runc/libcontainer run fine on it, so we might as well just stop requesting it
on older versions of systemd which allowed it. (Those versions never really
changed behavior significantly when that option was passed anyways.)
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
This dependency is only needed in package "github.com/coreos/go-systemd/util"
and we only use it for IsRunningSystemd(), which is a simple Go function that
just stats a file.
Let's just borrow it here, so we remove the dependency and can remove that
package from vendored build.
This also removes dependencies on dlopen and on trying to find libsystemd.so
or libsystemd-login.so in the system.
Tested that this still builds and works as expected.
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
`Init` on the `Process` struct specifies whether the process is the first process in the container. This needs to be set to `true` when running the container.
Signed-off-by: Andreas Stocker <astocker@anexia-it.com>
when exec as root and config.Cwd is not owned by root, exec will fail
because root doesn't have the caps.
So, Chdir should be done before setting the caps.
Signed-off-by: Kurnia D Win <kurnia.d.win@gmail.com>
The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).
The behavior in the kernel has not changed since the introduction, and
the current code using "kB" will therefore fail on devices with small
amounts of ram (see
https://github.com/kubernetes/kubernetes/issues/77169) running a kernel
with config flag CONFIG_HUGETLBFS=y
As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.
Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:
- "hugepages-64kB"
- "hugepages-2048kB"
- "hugepages-32768kB"
- "hugepages-1048576kB"
And the corresponding cgroup files:
- "hugetlb.64KB._____"
- "hugetlb.2MB._____"
- "hugetlb.32MB._____"
- "hugetlb.1GB._____"
Signed-off-by: Odin Ugedal <odin@ugedal.com>
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.
Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.
Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
Minor refactoring to use the filePair struct for both init sock and log pipe
Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
In the exception handling of initProcess.start(), we need to add the
missing IntelRdtManager.Destroy() handler in defer func.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
Bump logrus so that we can use logrus.StandardLogger().Logf instead
Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
We discovered in umoci that setting a dummy type of "none" would result
in file-based bind-mounts no longer working properly, which is caused by
a restriction for when specconv will change the device type to "bind" to
work around rootfs_linux.go's ... issues.
However, bind-mounts don't have a type (and Linux will ignore any type
specifier you give it) because the type is copied from the source of the
bind-mount. So we should always overwrite it to avoid user confusion.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Refactor configuring logging into a reusable component
so that it can be nicely used in both main() and init process init()
Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Co-authored-by: Claudia Beresford <cberesford@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
Add support for children processes logging (including nsexec).
A pipe is used to send logs from children to parent in JSON.
The JSON format used is the same used by logrus JSON formatted,
i.e. children process can use standard logrus APIs.
Signed-off-by: Marco Vedovati <mvedovati@suse.com>
Whenever processes are spawned using nsexec, a zombie runc:[1:CHILD]
process will always be created and will need to be reaped by the parent
Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
secure_getenv is a Glibc extension and so this code does not compile
on Musl libc any more after this patch.
secure_getenv is only intended to be used in setuid binaries, in
order that they should not trust their environment. It simply returns
NULL if the binary is running setuid. If runc was installed setuid,
the user can already do anything as root, so it is game over, so this
check is not needed.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Work is ongoing in the kernel to support different kernel
keyrings per user namespace. We want to allow SELinux to manage
kernel keyrings inside of the container.
Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.
Container running as container_t:s0:c1,c2 can manage keyrings with the same label.
This change required a revendoring or the SELinux go bindings.
github.com/opencontainers/selinux.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
There are some circumstances where sendfile(2) can fail (one example is
that AppArmor appears to block writing to deleted files with sendfile(2)
under some circumstances) and so we need to have a userspace fallback.
It's fairly trivial (and handles short-writes).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).
The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a *very bad idea*).
Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked *alongside* the protection being
present.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).
Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In order to get around the memfd_create(2) requirement, 0a8e4117e7
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:
* It required O_TMPFILE which is relatively new (having been added to
Linux 3.11).
* The fallback choice was made at compile-time, not runtime. This
results in several complications when it comes to running binaries
on different machines to the ones they were built on.
The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.
Signed-off-by: Adrian Reber <areber@redhat.com>
My first attempt to simplify this and make it less costly focussed on
the way constructors are called. I was under the impression that the ELF
specification mandated that arg, argv, and actually even envp need to be
passed to functions located in the .init_arry section (aka
"constructors"). Actually, the specifications is (cf. [2]):
SHT_INIT_ARRAY
This section contains an array of pointers to initialization functions,
as described in ``Initialization and Termination Functions'' in Chapter
5. Each pointer in the array is taken as a parameterless procedure with
a void return.
which means that this becomes a libc specific decision. Glibc passes
down those args, musl doesn't. So this approach can't work. However, we
can at least remove the environment parsing part based on POSIX since
[1] mandates that there should be an environ variable defined in
unistd.h which provides access to the environment. See also the relevant
Open Group specification [1].
[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array
Fixes: CVE-2019-5736
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)
This can be seen in journal logs whenever a container is started with
libpod:
Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.
Signed-off-by: Filipe Brandenburger <filbranden@google.com>
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.
Signed-off-by: Adrian Reber <areber@redhat.com>
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.
Signed-off-by: Adrian Reber <areber@redhat.com>
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).
We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.
This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).
Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
For some reason, libcontainer/integration has a whole bunch of incorrect
usages of libcontainer.Factory -- causing test failures with a set of
security patches that will be published soon. Fixing ths is fairly
trivial (switch to creating a new libcontainer.Factory once in each
process, rather than creating one in TestMain globally).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When creating a new user namespace, the kernel doesn't allow to mount
a new procfs or sysfs file system if there is not already one instance
fully visible in the current mount namespace.
When using --no-pivot we were effectively inhibiting this protection
from the kernel, as /proc and /sys from the host are still present in
the container mount namespace.
A container without full access to /proc could then create a new user
namespace, and from there able to mount a fully visible /proc, bypassing
the limitations in the container.
A simple reproducer for this issue is:
unshare -mrfp sh -c "mount -t proc none /proc && echo c > /proc/sysrq-trigger"
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
since commit df3fa115f9 it is not
possible to set a kernel memory limit when using the systemd cgroups
backend as we use cgroup.Apply twice.
Skip enabling kernel memory if there are already tasks in the cgroup.
Without this patch, runc fails with:
container_linux.go:344: starting container process caused
"process_linux.go:311: applying cgroup configuration for process
caused \"failed to set memory.kmem.limit_in_bytes, because either
tasks have already joined this cgroup or it has children\""
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
This patch fixes a corner case when destroy a container:
If we start a container without 'intelRdt' config set, and then we run
“runc update --l3-cache-schema/--mem-bw-schema” to add 'intelRdt' config
implicitly.
Now if we enter "exit" from the container inside, we will pass through
linuxContainer.Destroy() -> state.destroy() -> intelRdtManager.Destroy().
But in IntelRdtManager.Destroy(), IntelRdtManager.Path is still null
string, it hasn’t been initialized yet. As a result, the created rdt
group directory during "runc update" will not be removed as expected.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
CRIU 3.11 introduces configuration files:
https://criu.org/Configuration_fileshttps://lisas.de/~adrian/posts/2018-Nov-08-criu-configuration-files.html
This enables the user to influence CRIU's behaviour without code changes
if using new CRIU features or if the user wants to enable certain CRIU
behaviour without always specifying certain options.
With this it is possible to write 'tcp-established' to the configuration
file:
$ echo tcp-established > /etc/criu/runc.conf
and from now on all checkpoints will preserve the state of established
TCP connections. This removes the need to always use
$ runc checkpoint --tcp-stablished
If the goal is to always checkpoint with '--tcp-established'
It also adds the possibility for unexpected CRIU behaviour if the user
created a configuration file at some point in time and forgets about it.
As a result of the discussion in https://github.com/opencontainers/runc/pull/1933
it is now also possible to define a CRIU configuration file for each
container with the annotation 'org.criu.config'.
If 'org.criu.config' does not exist, runc will tell CRIU to use
'/etc/criu/runc.conf' if it exists.
If 'org.criu.config' is set to an empty string (''), runc will tell CRIU
to not use any runc specific configuration file at all.
If 'org.criu.config' is set to a non-empty string, runc will use that
value as an additional configuration file for CRIU.
With the annotation the user can decide to use the default configuration
file ('/etc/criu/runc.conf'), none or a container specific configuration
file.
Signed-off-by: Adrian Reber <areber@redhat.com>