Commit Graph

1455 Commits

Author SHA1 Message Date
Kir Kolyshkin fc4357a8b0 libct/msMoveRoot: rm redundant filepath.Abs() calls
1. rootfs is already validated to be kosher by (*ConfigValidator).rootfs()

2. mount points from /proc/self/mountinfo are absolute and clean, too

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin dce0de8975 getParentMount: benefit from GetMounts filter
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin 81d8452e30 libct/TestFactoryNewTmpfs: benefit from GetMounts
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin c7ab2c036b libcontainer: switch to moby/sys/mountinfo package
Delete libcontainer/mount in favor of github.com/moby/sys/mountinfo,
which is fast mountinfo parser.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-21 10:33:43 -07:00
Kir Kolyshkin a572216f74 libcontainer/intelrdt: rm fmt.Sprintf
It it not needed as it does nothing here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 12:33:24 -07:00
Kir Kolyshkin 5542a2c77d libcontainer/cgroups: GetAllPids: optimize
1. Return earlier if there is an error.

2. Do not use filepath.Split on every entry, use info.Name() instead.

3. Make readProcsFile() accept file name as an argument, to avoid
   unnecessary file name and directory splitting and merging.

4. Skip on info.IsDir() -- this avoids an error when cgroup name is
   set to "cgroup.procs".

This is still not very good since filepath.Walk() performs an unnecessary
stat(2) on every entry, but better than before.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 12:27:36 -07:00
Kir Kolyshkin 12dc475dd6 libcontainer: simplify createCgroupsv2Path
fmt.Sprintf is slow and is not needed here, string concatenation would
be sufficient. It is also redundant to convert []byte from string and
back, since `bytes` package now provides the same functions as `strings`.

Use Fields() instead of TrimSpace() and Split(), mainly for readability
(note Fields() is somewhat slower than Split() but here it doesn't
matter much).

Use Join() to prepend the plus signs.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-03-20 11:51:55 -07:00
Mario Nitchev 648295be98 Skip test for cgroups v2
Signed-off-by: Yulia Nedyalkova <julianedialkova@hotmail.com>
2020-03-19 12:54:54 +02:00
Danail Branekov f34eb2c003 Retry writing to cgroup files on EINTR error
Golang 1.14 introduces asynchronous preemption which results into
applications getting frequent EINTR (syscall interrupted) errors when
invoking slow syscalls, e.g. when writing to cgroup files.

As writing to cgroups is idempotent, it is safe to retry writing to the
file whenever the write syscall is interrupted.

Signed-off-by: Mario Nitchev <marionitchev@gmail.com>
2020-03-18 13:00:05 +02:00
SiYu Zhao 34d471769b fix readSync
Signed-off-by: SiYu Zhao <d.chaser.zsy@gmail.com>
2020-03-17 11:26:46 +08:00
Michael Crosby 939cd0b734
Merge pull request #1737 from wking/remove-procConsole-comment
libcontainer/sync: Drop procConsole transaction from comments
2020-03-16 14:00:00 -04:00
Michael Crosby 88474967d3
Merge pull request #1974 from openSUSE/unreachable-code
Remove unreachable code paths
2020-03-16 13:56:05 -04:00
Akihiro Suda 525b9f311c
Merge pull request #2248 from AkihiroSuda/fix-cgroupv2-conversion
cgroup2: fix conversion
2020-03-16 14:00:02 +09:00
Akihiro Suda 492d525e55 vendor: update go-systemd and godbus
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-16 13:26:03 +09:00
Mrunal Patel 981dbef514
Merge pull request #2226 from avagin/runsc-restore-cmd-wait
restore: fix a race condition in process.Wait()
2020-03-15 18:48:16 -07:00
Akihiro Suda aa269315a4 cgroup2: add CpuMax conversion
Fix #2243

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:58:39 +09:00
Akihiro Suda 64e9a97981 cgroup2: fix conversion
* TestConvertCPUSharesToCgroupV2Value(0) was returning 70369281052672, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(0) was returning 32, while the correct value is 0
* ConvertBlkIOToCgroupV2Value(1000) was returning 4, while the correct value is 10000

Fix #2244
Follow-up to #2212 #2213

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-03-13 02:57:07 +09:00
Sascha Grunert b477a159db
Remove unreachable code paths
Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2020-03-12 09:13:03 +01:00
Mrunal Patel 0ff53526a4
Merge pull request #2252 from pkagrawal/2251-fix
Synchronize the call to linuxContainer.Signal()
2020-03-11 11:11:56 -07:00
Akihiro Suda 71dfb559d6
Merge pull request #2238 from tedyu/init-proc-err-ret
Use named error return for initProcess#start
2020-03-11 01:03:13 +09:00
Boris Popovschi 89a87adb38 Changed hugetlb pagesizes info source
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-03-10 15:28:45 +02:00
Boris Popovschi d804611d05 Added failcnt stats
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-03-10 15:19:44 +02:00
l00397676 62cfad97ca specconv: add a test case to check null spec.Process
Signed-off-by: l00397676 <lujingxiao@huawei.com>
2020-03-10 11:43:51 +08:00
Pradyumna Agrawal 5b2b138d24 Synchronize the call to linuxContainer.Signal()
linuxContainer.Signal() can race with another call to say Destroy()
which clears the container's initProcess. This can cause a nil pointer
dereference in Signal().

This patch will synchronize Signal() and Destroy() by grabbing the
container's mutex as part of the Signal() call.

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
2020-03-09 11:15:22 -07:00
zyu 957da1f9ab Use named error return for initProcess#start
Signed-off-by: zyu <yuzhihong@gmail.com>
2020-03-09 09:29:03 -07:00
Akihiro Suda 6503438fd6
Merge pull request #2212 from Zyqsempai/2211-convert-blkio-weight-properly
Convert blkioWeight to io.weight properly
2020-03-05 09:32:45 +09:00
Aleksa Sarai 93e5c4d320
merge branch 'pr-2232'
Aleksa Sarai (1):
  libcontainer: dual-license nsenter/cloned_binary.c

LGTMs: @mrunalp @AkihiroSuda
Closes #2232
2020-03-04 11:10:49 +11:00
Qiang Huang 3b7e32feba
Merge pull request #2210 from Zyqsempai/2164-remove-deprecated-systemd-resources
Exchange deprecated systemd resources with the appropriate for cgroupv2
2020-02-29 10:13:55 +08:00
Boris Popovschi 7f37afa892 Added HugeTlb controller for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-25 14:50:55 +02:00
Aleksa Sarai 98de84265d
libcontainer: dual-license nsenter/cloned_binary.c
The new license is Apache-2.0 OR LPGL-2.1-or-later. This is necessary
for libcrun to be relicensed under the LGPL-2.1[1], and all of the
relevant copyright holders have agreed to relicense this code under the
dual license:

  * Aleksa Sarai [2]
  * Christian Brauner [3]
  * Justin Cormack [4]

Because it is still dual-licensed as an Apache-2.0 work, this doesn't
affect it's usability within runc or any other dependent projects.

[1]: https://github.com/containers/crun/issues/256
[2]: https://github.com/containers/crun/issues/256#issuecomment-589498088
[3]: https://github.com/containers/crun/issues/256#issuecomment-589605034
[4]: https://github.com/containers/crun/issues/256#issuecomment-589504231

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-02-22 00:17:07 +11:00
Aleksa Sarai 0f32b03dda
merge branch 'pr-2192'
Boris Popovschi (2):
  Fix skip message for cgroupv2
  Fix MAJ:MIN io.stat parsing order

LGTMs: @hqhq @cyphar
Closes #2192
2020-02-21 16:00:17 +11:00
Boris Popovschi 4b8134f63b Convert blkioWeight to io.weight properly
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-18 15:44:07 +02:00
Kir Kolyshkin 1cd71dfd71 systemd properties: support for *Sec values
Some systemd properties are documented as having "Sec" suffix
(e.g. "TimeoutStopSec") but are expected to have "USec" suffix
when passed over dbus, so let's provide appropriate conversion
to improve compatibility.

This means, one can specify TimeoutStopSec with a numeric argument,
in seconds, and it will be properly converted to TimeoutStopUsec
with the argument in microseconds. As a side bonus, even float
values are converted, so e.g. TimeoutStopSec=1.5 is possible.

This turned out a bit more tricky to implement when I was
originally expected, since there are a handful of numeric
types in dbus and each one requires explicit conversion.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Kir Kolyshkin 4c5c3fb960 Support for setting systemd properties via annotations
In case systemd is used to set cgroups for the container,
it creates a scope unit dedicated to it (usually named
`runc-$ID.scope`).

This patch adds an ability to set arbitrary systemd properties
for the systemd unit via runtime spec annotations.

Initially this was developed as an ability to specify the
`TimeoutStopUSec` property, but later generalized to work with
arbitrary ones.

Example usage: add the following to runtime spec (config.json):

```
	"annotations": {
		"org.systemd.property.TimeoutStopUSec": "uint64 123456789",
		"org.systemd.property.CollectMode":"'inactive-or-failed'"
	},
```

and start the container (e.g. `runc --systemd-cgroup run $ID`).

The above will set the following systemd parameters:
* `TimeoutStopSec` to 2 minutes and 3 seconds,
* `CollectMode` to "inactive-or-failed".

The values are in the gvariant format (see [1]). To figure out
which type systemd expects for a particular parameter, see
systemd sources.

In particular, parameters with `USec` suffix require an `uint64`
typed argument, while gvariant assumes int32 for a numeric values,
therefore the explicit type is required.

NOTE that systemd receives the time-typed parameters as *USec
but shows them (in `systemctl show`) as *Sec. For example,
the stop timeout should be set as `TimeoutStopUSec` but
is shown as `TimeoutStopSec`.

[1] https://developer.gnome.org/glib/stable/gvariant-text.html

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2020-02-17 16:07:19 -08:00
Mrunal Patel 81ef5024f8
Merge pull request #2213 from Zyqsempai/2166-convert-cpu-weight-poperly
Added conversion for cpu.weight v2
2020-02-17 07:49:39 -08:00
Boris Popovschi 7c439cc6f6 Added conversion for cpu.weight v2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-12 11:32:34 +02:00
Andrei Vagin 269ea385a4 restore: fix a race condition in process.Wait()
Adrian reported that the checkpoint test stated failing:
=== RUN   TestCheckpoint
--- FAIL: TestCheckpoint (0.38s)
    checkpoint_test.go:297: Did not restore the pipe correctly:

The problem here is when we start exec.Cmd, we don't call its wait
method. This means that we don't wait cmd.goroutines ans so we don't
know when all data will be read from process pipes.

Signed-off-by: Andrei Vagin <avagin@gmail.com>
2020-02-10 10:21:08 -08:00
Boris Popovschi 3b992087b8 Fix skip message for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-02-03 14:27:12 +02:00
Mrunal Patel 2fc03cc11c
Merge pull request #2207 from cyphar/fix-double-volume-attack
rootfs: do not permit /proc mounts to non-directories
2020-01-22 08:06:10 -08:00
Aleksa Sarai 3291d66b98
rootfs: do not permit /proc mounts to non-directories
mount(2) will blindly follow symlinks, which is a problem because it
allows a malicious container to trick runc into mounting /proc to an
entirely different location (and thus within the attacker's control for
a rename-exchange attack).

This is just a hotfix (to "stop the bleeding"), and the more complete
fix would be finish libpathrs and port runc to it (to avoid these types
of attacks entirely, and defend against a variety of other /proc-related
attacks). It can be bypased by someone having "/" be a volume controlled
by another container.

Fixes: CVE-2019-19921
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2020-01-17 14:00:30 +11:00
Aleksa Sarai f6fb7a0338
merge branch 'pr-2133'
Julia Nedialkova (1):
  Handle ENODEV when accessing the freezer.state file

LGTMs: @crosbymichael @cyphar
Closes #2133
2020-01-17 02:07:19 +11:00
Boris Popovschi 5b96f314ba Exchanged deprecated systemd resources with the appropriate for cgroupv2
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 18:09:33 +02:00
Boris Popovschi cf9b7c33e1 Fix MAJ:MIN io.stat parsing order
Signed-off-by: Boris Popovschi <zyqsempai@mail.ru>
2020-01-15 14:39:14 +02:00
Akihiro Suda 55f8c254be temporarily disable CRIU tests
Ubuntu kernel is temporarily broken: https://github.com/opencontainers/runc/pull/2198#issuecomment-571124087

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-14 11:18:44 +09:00
Akihiro Suda 5c20ea1472 fix merging #2177 and #2169
A new method was added to the cgroup interface when #2177 was merged.

After #2177 got merged, #2169 was merged without rebase (sorry!) and compilation was failing:

  libcontainer/cgroups/fs2/fs2.go:208:22: container.Cgroup undefined (type *configs.Config has no field or method Cgroup)

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2020-01-14 11:13:25 +09:00
Mrunal Patel 5cc0deaf7a
Merge pull request #2169 from AkihiroSuda/split-fs
cgroup2: split fs2 from fs
2020-01-13 16:23:27 -08:00
Michael Crosby 2b52db7527
Merge pull request #2177 from devimc/topic/libcontainer/kata-containers
libcontainer: export and add new methods to allow cgroups manipulation
2020-01-02 11:47:12 -05:00
Jordan Liggitt 8541d9cf3d Fix race checking for process exit and waiting for exec fifo
Signed-off-by: Jordan Liggitt <liggitt@google.com>
2019-12-18 18:48:18 +00:00
Julio Montes 8ddd892072 libcontainer: add method to get cgroup config from cgroup Manager
`configs.Cgroup` contains the configuration used to create cgroups. This
configuration must be saved to disk, since it's required to restore the
cgroup manager that was used to create the cgroups.
Add method to get cgroup configuration from cgroup Manager to allow API users
save it to disk and restore a cgroup manager later.

fixes #2176

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
Julio Montes cd7c59d042 libcontainer: export createCgroupConfig
A `config.Cgroups` object is required to manipulate cgroups v1 and v2 using
libcontainer.
Export `createCgroupConfig` to allow API users to create `config.Cgroups`
objects using directly libcontainer API.

Signed-off-by: Julio Montes <julio.montes@intel.com>
2019-12-17 22:46:03 +00:00
Aleksa Sarai 7496a96825
merge branch 'pr-2086'
* Kurnia D Win (1):
  fix permission denied

LGTMs: @crosbymichael @cyphar
Closes #2086
2019-12-17 20:49:52 +11:00
Aleksa Sarai 201b063745
merge branch 'pr-2141'
Radostin Stoyanov (1):
  criu: Ensure other users cannot read c/r files

LGTMs: @crosbymichael @cyphar
Closes #2141
2019-12-07 09:32:58 +11:00
Akihiro Suda ec49f98d72 fs2: support legacy device spec (to pass CI)
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:53:07 +09:00
Akihiro Suda 88e8350de2 cgroup2: split fs2 from fs
split fs2 package from fs, as mixing up fs and fs2 is very likely to result in
unmaintainable code.

Inspired by containerd/cgroups#109

Fix #2157

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-12-06 15:42:10 +09:00
Aleksa Sarai 5e63695384
merge branch 'pr-2174'
Sascha Grunert (1):
  Expose network interfaces via runc events

LGTMs: @cyphar @mrunalp
Closes #2174
2019-12-06 13:07:44 +11:00
Michael Crosby 8bb10af481
Merge pull request #2165 from AkihiroSuda/travis-f31
.travis.yml: add Fedora 31 vagrant box (for cgroup2)
2019-12-05 16:26:51 -05:00
Sascha Grunert 41a20b5852
Expose network interfaces via runc events
The libcontainer network statistics are unreachable without manually
creating a libcontainer instance. To retrieve them via the CLI interface
of runc, we now expose them as well.

Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2019-12-05 13:20:51 +01:00
Akihiro Suda faf1e44ea9 cgroup2: ebpf: increase RLIM_MEMLOCK to avoid BPF_PROG_LOAD error
Fix #2167

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-11-07 15:43:27 +09:00
Mrunal Patel 46def4cc4c
Merge pull request #2154 from jpeach/2008-remove-static-build-tag
Remove the static_build build tag.
2019-11-04 17:10:59 -08:00
Akihiro Suda ccd4436fc4 .travis.yml: add Fedora 31 vagrant box (for cgroup2)
As the baby step, only unit tests are executed.

Failing tests are currently skipped and will be fixed in follow-up PRs.

Fix #2124

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 16:53:01 +09:00
Akihiro Suda faf673ee45 cgroup2: port over eBPF device controller from crun
The implementation is based on https://github.com/containers/crun/blob/0.10.2/src/libcrun/ebpf.c

Although ebpf.c is originally licensed under LGPL-3.0-or-later, the author
Giuseppe Scrivano agreed to relicense the file in Apache License 2.0:
https://github.com/opencontainers/runc/issues/2144#issuecomment-543116397

See libcontainer/cgroups/ebpf/devicefilter/devicefilter_test.go for tested configurations.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-31 14:01:46 +09:00
Qiang Huang e57a774066
Merge pull request #2149 from AkihiroSuda/cgroup2-ps
cgroup2: implement `runc ps`
2019-10-31 09:44:39 +08:00
Qiang Huang d239ca8425
Merge pull request #2148 from AkihiroSuda/cg2-ignore-cpuset-when-no-config
cgroup2: cpuset_v2: skip Apply when no limit is specified
2019-10-29 21:57:58 +08:00
Mrunal Patel 03cf145f5a
Merge pull request #2159 from AkihiroSuda/cgroup2-mount-in-userns
cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
2019-10-28 19:19:09 -07:00
Akihiro Suda 74a3fe5d1b cgroup2: do not parse /proc/cgroups
/proc/cgroups is meaningless for v2 and should be ignored.

https://github.com/torvalds/linux/blob/v5.3/Documentation/admin-guide/cgroup-v2.rst#deprecated-v1-core-features

* Now GetAllSubsystems() parses /sys/fs/cgroup/cgroup.controller, not /proc/cgroups.
  The function result also contains "pseudo" controllers: {"devices", "freezer"}.
  As it is hard to detect availability of pseudo controllers, pseudo controllers
  are always assumed to be available.

* Now IOGroupV2.Name() returns "io", not "blkio"

Fix #2155 #2156

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-28 00:00:33 +09:00
Akihiro Suda 9c81440fb5 cgroup2: allow mounting /sys/fs/cgroup in UserNS without unsharing CgroupNS
Bind-mount /sys/fs/cgroup when we are in UserNS but CgroupNS is not unshared,
because we cannot mount cgroup2.

This behavior correspond to crun v0.10.2.

Fix #2158

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-27 23:09:41 +09:00
James Peach 13919f5dfd Remove the static_build build tag.
The `static_build` build tag was introduced in e9944d0f
to remove build warnings related to systemd cgroup driver
dependencies. Since then, those dependencies have changed and
building the systemd cgroup driver no longer imports dlopen.

After this change, runc builds will always include the systemd
cgroup driver.

This fixes #2008.

Signed-off-by: James Peach <jpeach@apache.org>
2019-10-26 08:28:45 +11:00
Michael Crosby c4d8e1688c
Merge pull request #2140 from crosbymichael/fs-unified
Set unified mountpoint in find mnt func
2019-10-24 15:20:47 -04:00
Akihiro Suda dbd771e475 cgroup2: implement `runc ps`
Implemented `runc ps` for cgroup v2 , using a newly added method `m.GetUnifiedPath()`.
Unlike the v1  implementation that checks `m.GetPaths()["devices"]`, the v2 implementation does not require the device controller to be available.

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 01:59:24 +09:00
Akihiro Suda d918e7f408 cpuset_v2: skip Apply when no limit is specified
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-19 00:33:31 +09:00
Akihiro Suda 033936ef76 io_v2.go: remove blkio v1 code
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-10-18 21:33:48 +09:00
Radostin Stoyanov a610a84821 criu: Ensure other users cannot read c/r files
No checkpoint files should be readable by
anyone else but the user creating it.

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-17 07:49:38 +01:00
Michael Crosby b28f58f31b
Set unified mountpoint in find mnt func
This is needed for the fsv2 cgroups to work when there is a unified mountpoint.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-10-15 15:40:03 -04:00
Radostin Stoyanov f017e0f9e1 checkpoint: Set descriptors.json file mode to 0600
Prevent unprivileged users from being able to read descriptors.json

Signed-off-by: Radostin Stoyanov <rstoyanov1@gmail.com>
2019-10-12 19:29:44 +01:00
Aleksa Sarai 1b8a1eeec3
merge branch 'pr-2132'
Support different field counts of cpuaact.stats

LGTMs: @crosbymichael @cyphar
Closes #2132
2019-10-02 01:50:47 +10:00
Aleksa Sarai d463f6485b
*: verify that operations on /proc/... are on procfs
This is an additional mitigation for CVE-2019-16884. The primary problem
is that Docker can be coerced into bind-mounting a file system on top of
/proc (resulting in label-related writes to /proc no longer happening).

While we are working on mitigations against permitting the mounts, this
helps avoid our code from being tricked into writing to non-procfs
files. This is not a perfect solution (after all, there might be a
bind-mount of a different procfs file over the target) but in order to
exploit that you would need to be able to tweak a config.json pretty
specifically (which thankfully Docker doesn't allow).

Specifically this stops AppArmor from not labeling a process silently
due to /proc/self/attr/... being incorrectly set, and stops any
accidental fd leaks because /proc/self/fd/... is not real.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-09-30 09:06:48 +10:00
tianye15 28e58a0f6a Support different field counts of cpuaact.stats
Signed-off-by: skilxnTL <tylxltt@gmail.com>
2019-09-29 10:20:58 +08:00
Julia Nedialkova e63b797f38 Handle ENODEV when accessing the freezer.state file
...when checking if a container is paused

Signed-off-by: Julia Nedialkova <julianedialkova@hotmail.com>
2019-09-27 17:02:56 +03:00
blacktop 84373aaa56 Add SCMP_ACT_LOG as a valid Seccomp action (#1951)
Signed-off-by: blacktop <blacktop@users.noreply.github.com>
2019-09-26 11:03:03 -04:00
Michael Crosby 331692baa7 Only allow proc mount if it is procfs
Fixes #2128

This allows proc to be bind mounted for host and rootless namespace usecases but
it removes the ability to mount over the top of proc with a directory.

```bash
> sudo docker run --rm  apparmor
docker: Error response from daemon: OCI runtime create failed:
container_linux.go:346: starting container process caused "process_linux.go:449:
container init caused \"rootfs_linux.go:58: mounting
\\\"/var/lib/docker/volumes/aae28ea068c33d60e64d1a75916cf3ec2dc3634f97571854c9ed30c8401460c1/_data\\\"
to rootfs
\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged\\\"
at \\\"/proc\\\" caused
\\\"\\\\\\\"/var/lib/docker/overlay2/a6be5ae911bf19f8eecb23a295dec85be9a8ee8da66e9fb55b47c841d1e381b7/merged/proc\\\\\\\"
cannot be mounted because it is not of type proc\\\"\"": unknown.

> sudo docker run --rm -v /proc:/proc apparmor

docker-default (enforce)        root     18989  0.9  0.0   1288     4 ?
Ss   16:47   0:00 sleep 20
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2019-09-24 11:00:18 -04:00
Jonathan Rudenberg af7b6547ec libcontainer/nsenter: Don't import C in non-cgo file
Signed-off-by: Jonathan Rudenberg <jonathan@titanous.com>
2019-09-11 17:03:07 +00:00
Giuseppe Scrivano 718a566e02
cgroup: support mount of cgroup2
convert a "cgroup" mount to "cgroup2" when the system uses cgroups v2
unified hierarchy.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-06 17:57:14 +02:00
Sebastiaan van Stijn eb86f6037e
bump syndtr/gocapability d98352740cb2c55f81556b63d4a1ec64c5a319c2
relevant changes:

  - syndtr/gocapability#14 capability: Deprecate NewPid and NewFile for NewPid2 and NewFile2
  - syndtr/gocapability#16 Fix capHeader.pid type

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2019-09-06 01:44:26 +02:00
Mrunal Patel 92ac8e3f84
Merge pull request #2113 from giuseppe/cgroupv2
libcontainer: initial support for cgroups v2
2019-09-05 13:14:29 -07:00
Giuseppe Scrivano 524cb7c318
libcontainer: add systemd.UnifiedManager
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:27 +02:00
Giuseppe Scrivano ec11136828
libcontainer, cgroups: rename systemd.Manager to LegacyManager
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:26 +02:00
Giuseppe Scrivano 1932917b71
libcontainer: add initial support for cgroups v2
allow to set what subsystems are used by
libcontainer/cgroups/fs.Manager.

subsystemsUnified is used on a system running with cgroups v2 unified
mode.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-09-05 13:02:25 +02:00
Mrunal Patel 92d851e03b
Merge pull request #2123 from carlosedp/riscv64
Bump x/sys and update syscall for initial Risc-V support
2019-09-04 14:10:26 -07:00
Carlos de Paula 4316e4d047 Bump x/sys and update syscall to start Risc-V support
Signed-off-by: Carlos de Paula <me@carlosedp.com>
2019-08-29 12:09:08 -03:00
Akihiro Suda 0bc069d795 nsenter: fix clang-tidy warning
nsexec.c:148:3: warning: Initialized va_list 'args' is leaked [clang-analyzer-valist.Unterminated]

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-08-29 00:18:02 +09:00
Akihiro Suda b225ef58fb nsenter: minor clean up
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
2019-08-28 19:50:35 +09:00
Daniel J Walsh e4aa73424b
Rename cgroups_windows.go to cgroups_unsupported.go
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-08-26 18:13:52 -04:00
Mrunal Patel c61c7370f9
Merge pull request #2103 from sipsma/cgnil
cgroups/fs: check nil pointers in cgroup manager
2019-08-26 14:05:44 -07:00
Mrunal Patel 68d73f0a2e
Merge pull request #2107 from sashayakovtseva/public-get-devices
Make get devices function public
2019-08-26 09:58:10 -07:00
Kenta Tada c740965a18 libcontainer: update masked paths of /proc
This commit updates the masked paths of /proc.

Related issues:
* https://github.com/moby/moby/pull/37404
* https://github.com/moby/moby/pull/38299
* https://github.com/moby/moby/pull/36368

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-08-26 12:25:56 +09:00
Mrunal Patel 3525eddec5
Merge pull request #2117 from filbranden/detection1
Remove libcontainer detection for systemd features
2019-08-25 13:15:15 -07:00
Filipe Brandenburger 518c855833 Remove libcontainer detection for systemd features
Transient units (and transient slice units) have been available for quite a
long time and RHEL 7 with systemd v219 (likely the oldest OS we care about at
this point) supports that. A system running a systemd without these features is
likely to break a lot of other stuff that runc/libcontainer care about.

Regarding delegated slices, modern systemd doesn't allow it and
runc/libcontainer run fine on it, so we might as well just stop requesting it
on older versions of systemd which allowed it. (Those versions never really
changed behavior significantly when that option was passed anyways.)

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-08-22 21:53:24 -07:00
Filipe Brandenburger 588f040a77 Avoid the dependency on cgo through go-systemd/util package
This dependency is only needed in package "github.com/coreos/go-systemd/util"
and we only use it for IsRunningSystemd(), which is a simple Go function that
just stats a file.

Let's just borrow it here, so we remove the dependency and can remove that
package from vendored build.

This also removes dependencies on dlopen and on trying to find libsystemd.so
or libsystemd-login.so in the system.

Tested that this still builds and works as expected.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-08-22 21:07:24 -07:00
sashayakovtseva afc24792dc Make get devices function public
Signed-off-by: sashayakovtseva <sasha@sylabs.io>
2019-08-15 17:16:47 +03:00
Erik Sipsma 9c822e4847 cgroups/fs: check nil pointers in cgroup manager
Signed-off-by: Erik Sipsma <sipsma@amazon.com>
2019-08-14 09:50:45 -07:00
Mrunal Patel 2e94378464
Merge pull request #2094 from sipsma/2093-nodotudev
Skip searching /dev/.udev for device nodes.
2019-08-05 10:41:54 -07:00
Erik Sipsma f08cdaeec9 Skip searching /dev/.udev for device nodes.
Closes: #2093

Signed-off-by: Erik Sipsma <sipsma@amazon.com>
2019-07-31 19:41:33 +00:00
Andreas Stocker 808e809f8a doc: First process in container needs `Init: true`
`Init` on the `Process` struct specifies whether the process is the first process in the container. This needs to be set to `true` when running the container.

Signed-off-by: Andreas Stocker <astocker@anexia-it.com>
2019-07-29 22:24:28 +02:00
Kurnia D Win 5e0e67d76c
fix permission denied
when exec as root and config.Cwd is not owned by root, exec will fail
because root doesn't have the caps.

So, Chdir should be done before setting the caps.

Signed-off-by: Kurnia D Win <kurnia.d.win@gmail.com>
2019-07-18 12:49:36 +07:00
Mrunal Patel b4a0b1d737
Merge pull request #2065 from odinuge/master
Fix cgroup hugetlb size prefix for kB
2019-06-06 12:38:57 -07:00
Kenta Tada b54fd85bbf libcontainer: change seccomp test for clone syscall
This commit changes the value of seccomp test for clone syscall.
Also hardcoded values should be changed because it is unclear to
understand what flags are tested.

Related issues:

* https://github.com/containerd/containerd/pull/3314
* https://github.com/moby/moby/pull/39308
* https://github.com/opencontainers/runtime-tools/pull/694

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-06-04 18:52:00 +09:00
Odin Ugedal 6f77e35daf
Export list of HugePageSizeUnits
This will allow others to import it instead of copying it.

Signed-off-by: Odin Ugedal <odin@ugedal.com>
2019-05-30 20:17:30 +02:00
Odin Ugedal c6445b1c1c
Add tests for GetHugePageSize
Add tests to avoid regressions

Signed-off-by: Odin Ugedal <odin@ugedal.com>
2019-05-30 17:27:32 +02:00
Odin Ugedal 273e7b74a7
Fix cgroup hugetlb size prefix for kB
The hugetlb cgroup control files (introduced here in 2012:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abb8206cb0773)
use "KB" and not "kB"
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/hugetlb_cgroup.c?h=v5.0#n349).

The behavior in the kernel has not changed since the introduction, and
the current code using "kB" will therefore fail on devices with small
amounts of ram (see
https://github.com/kubernetes/kubernetes/issues/77169) running a kernel
with config flag CONFIG_HUGETLBFS=y

As seen from the code in "mem_fmt" inside hugetlb_cgroup.c, only "KB",
"MB" and "GB" are used, so the others may be removed as well.

Here is a real world example of the files inside the
"/sys/kernel/mm/hugepages/" directory:
- "hugepages-64kB"
- "hugepages-2048kB"
- "hugepages-32768kB"
- "hugepages-1048576kB"

And the corresponding cgroup files:
- "hugetlb.64KB._____"
- "hugetlb.2MB._____"
- "hugetlb.32MB._____"
- "hugetlb.1GB._____"

Signed-off-by: Odin Ugedal <odin@ugedal.com>
2019-05-29 21:52:43 +02:00
Mrunal Patel 5ef781c2e7
Merge pull request #2061 from KentaTada/add-cgroup-namespace-test
libcontainer: fix TestGetContainerState to check configs.NEWCGROUP
2019-05-22 16:09:38 -07:00
Qiang Huang c8337777b6
Merge pull request #2042 from xiaochenshen/rdt-add-missing-destroy
libcontainer: intelrdt: add missing destroy handler in defer func
2019-05-21 09:48:00 +08:00
Kenta Tada 65032b55b1 libcontainer: fix TestGetContainerState to check configs.NEWCGROUP
This test needs to handle the case of configs.NEWCGROUP
as Namespace's type.

Signed-off-by: Kenta Tada <Kenta.Tada@sony.com>
2019-05-21 09:10:38 +09:00
Mrunal Patel 2484581dd7
Merge pull request #2035 from cyphar/bindmount-types
specconv: always set "type: bind" in case of MS_BIND
2019-05-07 15:47:58 -07:00
Mrunal Patel a0ecf749ee
Merge pull request #2047 from filbranden/systemd7
Move systemd.Manager initialization into a function in that module
2019-05-07 15:08:41 -07:00
Filipe Brandenburger 46351eb3d1 Move systemd.Manager initialization into a function in that module
This will permit us to extend the internals of systemd.Manager to include
further information about the system, such as whether cgroupv1, cgroupv2 or
both are in effect.

Furthermore, it allows a future refactor of moving more of UseSystemd() code
into the factory initialization function.

Signed-off-by: Filipe Brandenburger <filbranden@gmail.com>
2019-05-01 13:22:19 -07:00
Georgi Sabev a146081828 Write logs to stderr by default
Minor refactoring to use the filePair struct for both init sock and log pipe

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-24 15:18:14 +03:00
Georgi Sabev 68b4ff5b37 Simplify bail logic & minor nsexec improvements
Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-24 15:16:11 +03:00
Xiaochen Shen 17b37ea3fa libcontainer: intelrdt: add missing destroy handler in defer func
In the exception handling of initProcess.start(), we need to add the
missing IntelRdtManager.Destroy() handler in defer func.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2019-04-24 16:41:51 +08:00
Georgi Sabev 475aef10f7 Remove redundant log function
Bump logrus so that we can use logrus.StandardLogger().Logf instead

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:54:55 +03:00
Georgi Sabev ba3cabf932 Improve nsexec logging
* Simplify logging function
* Logs contain __FUNCTION__:__LINE__
* Bail uses write_log

Co-authored-by: Julia Nedialkova <julianedialkova@hotmail.com>
Co-authored-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>
2019-04-22 17:53:52 +03:00
Aleksa Sarai 8296826da5
specconv: always set "type: bind" in case of MS_BIND
We discovered in umoci that setting a dummy type of "none" would result
in file-based bind-mounts no longer working properly, which is caused by
a restriction for when specconv will change the device type to "bind" to
work around rootfs_linux.go's ... issues.

However, bind-mounts don't have a type (and Linux will ignore any type
specifier you give it) because the type is copied from the source of the
bind-mount. So we should always overwrite it to avoid user confusion.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-04-08 15:08:08 +10:00
Danail Branekov c486e3c406 Address comments in PR 1861
Refactor configuring logging into a reusable component
so that it can be nicely used in both main() and init process init()

Co-authored-by: Georgi Sabev <georgethebeatle@gmail.com>
Co-authored-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
Co-authored-by: Claudia Beresford <cberesford@pivotal.io>
Signed-off-by: Danail Branekov <danailster@gmail.com>
2019-04-04 14:57:28 +03:00
Marco Vedovati feebfac358 Remove pipe close before exec.
Pipe close before exec is not necessary as os.Pipe() is calling pipe2
with O_CLOEXEC option.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:30 +03:00
Marco Vedovati 9a599f62fb Support for logging from children processes
Add support for children processes logging (including nsexec).
A pipe is used to send logs from children to parent in JSON.
The JSON format used is the same used by logrus JSON formatted,
i.e. children process can use standard logrus APIs.

Signed-off-by: Marco Vedovati <mvedovati@suse.com>
2019-04-04 14:53:23 +03:00
Michael Crosby 11fc498ffa
Merge pull request #2023 from LittleLightLittleFire/2022-fix-runc-zombie-process-regression
Fixes regression causing zombie runc:[1:CHILD] processes
2019-03-22 14:06:31 -04:00
Mrunal Patel dd22a84864
Merge pull request #2012 from rhatdan/selinux
Need to setup labeling of kernel keyrings.
2019-03-20 21:17:18 -07:00
Alex Fang eab5330908 Fixes regression causing zombie runc:[1:CHILD] processes
Whenever processes are spawned using nsexec, a zombie runc:[1:CHILD]
process will always be created and will need to be reaped by the parent

Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
2019-03-21 13:43:38 +11:00
Aleksa Sarai f56b4cbead
merge branch 'pr-2015'
Use getenv not secure_getenv

LGTMs: @crosbymichael @cyphar
Closes #2015
2019-03-16 17:30:56 +11:00
Filipe Brandenburger 4b2b978291 Add cgroup name to error message
More information should help troubleshoot an issue when this error occurs.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-03-14 10:25:00 -07:00
Justin Cormack 6f714aa928
Use getenv not secure_getenv
secure_getenv is a Glibc extension and so this code does not compile
on Musl libc any more after this patch.

secure_getenv is only intended to be used in setuid binaries, in
order that they should not trust their environment. It simply returns
NULL if the binary is running setuid. If runc was installed setuid,
the user can already do anything as root, so it is game over, so this
check is not needed.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2019-03-14 10:58:10 +00:00
Daniel J Walsh cd96170c10
Need to setup labeling of kernel keyrings.
Work is ongoing in the kernel to support different kernel
keyrings per user namespace.  We want to allow SELinux to manage
kernel keyrings inside of the container.

Currently when runc creates the kernel keyring it gets the label which runc is
running with ususally `container_runtime_t`, with this change the kernel keyring
will be labeled with the container process label container_t:s0:C1,c2.

Container running as container_t:s0:c1,c2 can manage keyrings with the same label.

This change required a revendoring or the SELinux go bindings.

github.com/opencontainers/selinux.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-03-13 17:57:30 -04:00
Mrunal Patel 2b18fe1d88
Merge pull request #1984 from cyphar/memfd-cleanups
nsenter: cloned_binary: "memfd" cleanups
2019-03-07 10:18:33 -08:00
Michael Crosby f739110263
Merge pull request #1968 from adrianreber/podman
Create bind mount mountpoints during restore
2019-03-04 11:37:07 -06:00
Aleksa Sarai 2d4a37b427
nsenter: cloned_binary: userspace copy fallback if sendfile fails
There are some circumstances where sendfile(2) can fail (one example is
that AppArmor appears to block writing to deleted files with sendfile(2)
under some circumstances) and so we need to have a userspace fallback.
It's fairly trivial (and handles short-writes).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:29:10 +11:00
Aleksa Sarai 16612d74de
nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying
The usage of memfd_create(2) and other copying techniques is quite
wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR.
memfd_create(2) added ~10M of memory usage to the cgroup associated with
the container, which can result in some setups getting OOM'd (or just
hogging the hosts' memory when you have lots of created-but-not-started
containers sticking around).

The easiest way of solving this is by creating a read-only bind-mount of
the binary, opening that read-only bindmount, and then umounting it to
ensure that the host won't accidentally be re-mounted read-write. This
avoids all copying and cleans up naturally like the other techniques
used. Unfortunately, like the O_TMPFILE fallback, this requires being
able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting
over the most obvious path -- /proc/self/exe -- is a *very bad idea*).

Unfortunately detecting this isn't fool-proof -- on a system with a
read-only root filesystem (that might become read-write during "runc
init" execution), we cannot tell whether we have already done an ro
remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY
environment variable which is checked *alongside* the protection being
present.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:29:08 +11:00
Aleksa Sarai af9da0a450
nsenter: cloned_binary: use the runc statedir for O_TMPFILE
Writing a file to tmpfs actually incurs a memcg penalty, and thus the
benefit of being able to disable memfd_create(2) with
_LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should
be noted that quite a few distributions don't use tmpfs for /tmp (and
instead have it as a regular directory or subvolume of the host
filesystem).

Since runc must have write access to the state directory anyway (and the
state directory is usually not on a tmpfs) we can use that instead of
/tmp -- avoiding potential memcg costs with no real downside.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:51 +11:00
Aleksa Sarai 2429d59352
nsenter: cloned_binary: expand and add pre-3.11 fallbacks
In order to get around the memfd_create(2) requirement, 0a8e4117e7
("nsenter: clone /proc/self/exe to avoid exposing host binary to
container") added an O_TMPFILE fallback. However, this fallback was
flawed in two ways:

 * It required O_TMPFILE which is relatively new (having been added to
   Linux 3.11).

 * The fallback choice was made at compile-time, not runtime. This
   results in several complications when it comes to running binaries
   on different machines to the ones they were built on.

The easiest way to resolve these things is to have fallbacks work in a
more procedural way (though it does make the code unfortunately more
complicated) and to add a new fallback that uses mkotemp(3).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-03-01 23:28:50 +11:00
Aleksa Sarai 5b775bf297
nsenter: cloned_binary: detect and handle short copies
For a variety of reasons, sendfile(2) can end up doing a short-copy so
we need to just loop until we hit the binary size. Since /proc/self/exe
is tautologically our own binary, there's no chance someone is going to
modify it underneath us (or changing the size).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-02-26 19:51:01 +11:00
Mrunal Patel 5b5130ad76
Merge pull request #1963 from adrianreber/go-criu
Vendor in go-criu and use it for CRIU's RPC definition
2019-02-23 10:44:28 -08:00
Adrian Reber 9edb5494bb
Use vendored in CRIU Go bindings
This makes use of the vendored in Go bindings and removes the copy of
the CRIU RPC interface definition. runc now relies on go-criu for RPC
definition and hopefully more CRIU functions can be used in the future
from the CRIU Go bindings.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-14 18:20:02 +01:00
Christian Brauner bb7d8b1f41
nsexec (CVE-2019-5736): avoid parsing environ
My first attempt to simplify this and make it less costly focussed on
the way constructors are called. I was under the impression that the ELF
specification mandated that arg, argv, and actually even envp need to be
passed to functions located in the .init_arry section (aka
"constructors"). Actually, the specifications is (cf. [2]):

SHT_INIT_ARRAY
This section contains an array of pointers to initialization functions,
as described in ``Initialization and Termination Functions'' in Chapter
5. Each pointer in the array is taken as a parameterless procedure with
a void return.

which means that this becomes a libc specific decision. Glibc passes
down those args, musl doesn't. So this approach can't work. However, we
can at least remove the environment parsing part based on POSIX since
[1] mandates that there should be an environ variable defined in
unistd.h which provides access to the environment. See also the relevant
Open Group specification [1].

[1]: http://pubs.opengroup.org/onlinepubs/9699919799/
[2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array

Fixes: CVE-2019-5736
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2019-02-14 16:06:21 +01:00
Filipe Brandenburger cd41feb46b Remove detection for scope properties, which have always been broken
The detection for scope properties (whether scope units support
DefaultDependencies= or Delegate=) has always been broken, since systemd
refuses to create scopes unless at least one PID is attached to it (and
this has been so since scope units were introduced in systemd v205.)

This can be seen in journal logs whenever a container is started with
libpod:

  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.
  Feb 11 15:08:07 myhost systemd[1]: libcontainer-12345-systemd-test-default-dependencies.scope: Scope has no PIDs. Refusing.

Since this logic never worked, just assume both attributes are supported
(which is what the code does when detection fails for this reason, since
it's looking for an "unknown attribute" or "read-only attribute" to mark
them as false) and skip the detection altogether.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2019-02-11 16:05:37 -08:00
Adrian Reber 7354546cc8
Create mountpoints also on restore
runc creates all missing mountpoints when it starts a container, this
commit also creates those mountpoints during restore. Now it is possible
to restore a container using the same, but newly created rootfs just as
during container start.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Adrian Reber f661e02343
factor out bind mount mountpoint creation
During rootfs setup all mountpoints (directory and files) are created
before bind mounting the bind mounts. This does not happen during
container restore via CRIU. If restoring in an identical but newly created
rootfs, the restore fails right now. This just factors out the code to
create the bind mount mountpoints so that it also can be used during
restore.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-02-08 15:59:51 +01:00
Aleksa Sarai 0a8e4117e7
nsenter: clone /proc/self/exe to avoid exposing host binary to container
There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).

We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.

This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).

Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-02-08 18:57:59 +11:00
Mrunal Patel e4fa8a4575
Merge pull request #1955 from xiaochenshen/rdt-fix-destroy-issue
libcontainer: intelrdt: fix null intelrdt path issue in Destroy()
2019-02-01 13:18:56 -08:00
Mrunal Patel 4e4c907193
Merge pull request #1950 from cloudfoundry-incubator/enter-pid-race
Resilience in adding of exec tasks to cgroups
2019-02-01 13:18:16 -08:00
Aleksa Sarai 565325fc36
integration: fix mis-use of libcontainer.Factory
For some reason, libcontainer/integration has a whole bunch of incorrect
usages of libcontainer.Factory -- causing test failures with a set of
security patches that will be published soon. Fixing ths is fairly
trivial (switch to creating a new libcontainer.Factory once in each
process, rather than creating one in TestMain globally).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2019-01-24 23:12:48 +13:00
Michael Crosby c1e454b2a1
Merge pull request #1960 from giuseppe/fix-kmem-systemd
systemd: fix setting kernel memory limit
2019-01-15 13:21:01 -05:00
Michael Crosby 4e9d52da54
Merge pull request #1933 from adrianreber/master
Add CRIU configuration file support
2019-01-15 11:22:38 -05:00