Commit Graph

219 Commits

Author SHA1 Message Date
Michael Crosby 76520a4bf0
Merge pull request #1872 from masters-of-cats/better-find-cgroup-mountpoint
Respect container's cgroup path
2018-11-16 14:06:54 -05:00
Mrunal Patel 4769cdf607
Merge pull request #1916 from crosbymichael/cgns
Add support for cgroup namespace
2018-11-13 12:21:38 -08:00
Mrunal Patel f000fe11ec
Merge pull request #1917 from slp/master
libcontainer: map PidsLimit to systemd's TasksMax property
2018-11-13 12:21:23 -08:00
Michael Crosby aa7917b751
Merge pull request #1911 from theSuess/linter-fixes
Various cleanups to address linter issues
2018-11-13 12:13:34 -05:00
Kir Kolyshkin 6a2c155968 libcontainer: ability to compile without kmem
Commit fe898e7862 (PR #1350) enables kernel memory accounting
for all cgroups created by libcontainer -- even if kmem limit is
not configured.

Kernel memory accounting is known to be broken in some kernels,
specifically the ones from RHEL7 (including RHEL 7.5). Those
kernels do not support kernel memory reclaim, and are prone to
oopses. Unconditionally enabling kmem acct on such kernels lead
to bugs, such as

* https://github.com/opencontainers/runc/issues/1725
* https://github.com/kubernetes/kubernetes/issues/61937
* https://github.com/moby/moby/issues/29638

This commit gives a way to compile runc without kernel memory setting
support. To do so, use something like

	make BUILDTAGS="seccomp nokmem"

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-10-31 20:35:51 -07:00
Yuanhong Peng df3fa115f9 Add support for cgroup namespace
Cgroup namespace can be configured in `config.json` as other
namespaces. Here is an example:

```
"namespaces": [
	{
		"type": "pid"
	},
	{
		"type": "network"
	},
	{
		"type": "ipc"
	},
	{
		"type": "uts"
	},
	{
		"type": "mount"
	},
	{
		"type": "cgroup"
	}
],

```

Note that if you want to run a container which has shared cgroup ns with
another container, then it's strongly recommended that you set
proper `CgroupsPath` of both containers(the second container's cgroup
path must be the subdirectory of the first one). Or there might be
some unexpected results.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2018-10-31 10:51:43 -04:00
Sergio Lopez 5c6b9c3c1c libcontainer: map PidsLimit to systemd's TasksMax property
Currently runc applies PidsLimit restriction by writing directly to
cgroup's pids.max, without notifying systemd. As a consequence, when the
later updates the context of the corresponding scope, pids.max is reset
to the value of systemd's TasksMax property.

This can be easily reproduced this way (I'm using "postfix" here just an
example, any unrelated but existing service will do):

 # CTR=`docker run --pids-limit 111 --detach --rm busybox /bin/sleep 8h`
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 111
 # systemctl disable --now postfix
 # systemctl enable --now postfix
 # cat /sys/fs/cgroup/pids/system.slice/docker-${CTR}.scope/pids.max
 max

This patch adds TasksAccounting=true and TasksMax=PidsLimit to the
properties sent to systemd.

Signed-off-by: Sergio Lopez <slp@redhat.com>
2018-10-24 17:20:27 +02:00
Mrunal Patel a00bf01908
Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Dominik Süß 0b412e9482 various cleanups to address linter issues
Signed-off-by: Dominik Süß <dominik@suess.wtf>
2018-10-13 21:14:03 +02:00
Danail Branekov a1d5398afa Respect container's cgroup path
Respect the container's cgroup path when finding the container's
cgroup mount point, which is useful in multi-tenant environments, where
containers have their own unique cgroup mounts

Signed-off-by: Danail Branekov <danailster@gmail.com>
Signed-off-by: Oliver Stenbom <ostenbom@pivotal.io>
Signed-off-by: Giuseppe Capizzi <gcapizzi@pivotal.io>
2018-09-25 17:43:36 +01:00
Aleksa Sarai 578fe65e4f
merge branch 'pr-1817'
Fix duplicate entries and missing entries in getCgroupMountsHelper
  Add test for testing cgroup mounts on bedrock linux
  Stop relying on number of subsystems for cgroups

LGTMs: @crosbymichael @cyphar
Closes #1817
2018-09-19 19:48:17 +10:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Yan Zhu feb90346e0 doc: fix typo
Signed-off-by: Yan Zhu <yanzhu@alauda.io>
2018-09-07 11:58:59 +08:00
Jay Kamat a2faaa1317
Fix duplicate entries and missing entries in getCgroupMountsHelper
Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
2018-07-31 20:12:18 -07:00
Jay Kamat e5a7c61f3c Add test for testing cgroup mounts on bedrock linux
Add a mountinfo from a bedrock linux system with 4 strata, and include
it for tests

Signed-off-by: Jay Kamat <jaygkamat@gmail.com>
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:01:07 +01:00
Daniel Dao 5ee0648bfb Stop relying on number of subsystems for cgroups
When there are complicated mount setups, there can be multiple mount
points which have the subsystem we are looking for. Instead of
counting the mountpoints, tick off subsystems until we have found them
all.

Without the 'all' flag, ignore duplicate subsystems after the first.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2018-06-24 00:00:58 +01:00
Aleksa Sarai 939d5a3753
cgroup: clean up isIgnorableError for skippable EROFS
Include a rootless argument for isIgnorableError to avoid people
accidentally using isIgnorableError when they shouldn't (we don't ignore
any errors when running as root as that really isn't safe).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-05-25 11:31:41 +10:00
Qiang Huang dd67ab10d7
Merge pull request #1759 from cyphar/rootless-erofs-as-eperm
rootless: cgroup: treat EROFS as a skippable error
2018-05-25 09:24:16 +08:00
Derek Carr b515963c10 systemd cpu quota ignores -1
Signed-off-by: Derek Carr <decarr@redhat.com>
2018-05-23 14:28:39 -04:00
Filipe Brandenburger 165ee45334 Make channel for StartTransientUnit buffered
So that, if a timeout happens and we decide to stop blocking on the
operation, the writer will not block when they try to report the result
of the operation.

This should address Issue #1780 and it's a follow up for PR #1683,
PR #1754 and PR #1772.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-14 08:49:50 -07:00
Filipe Brandenburger 0e16bd9b53 Detect whether Delegate is available on both slices and scopes
Starting with systemd 237, in preparation for cgroup v2, delegation is
only now available for scopes, not slices.

Update libcontainer code to detect whether delegation is available on
both and use that information when creating new slices.

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-10 11:42:55 -07:00
Filipe Brandenburger 8ab251f298 Fix systemd.Apply() to check for DBus error before waiting on a channel.
The channel was introduced in #1683 to work around a race condition.
However, the check for error in StartTransientUnit ignores the error for
an already existing unit, and in that case there will be no notification
from DBus (so waiting on the channel will make it hang.)

Later PR #1754 added a timeout, which worked around the issue, but we
can fix this correctly by only waiting on the channel when there is no
error. Fix the code to do so.

The timeout handling was kept, since there might be other cases where
this situation occurs (https://bugzilla.redhat.com/show_bug.cgi?id=1548358
mentions calling this code from inside a container, it's unclear whether
an existing container was in use or not, so not sure whether this would
have fixed that bug as well.)

Signed-off-by: Filipe Brandenburger <filbranden@google.com>
2018-04-09 11:51:59 -07:00
Aleksa Sarai 03e585985f
rootless: cgroup: treat EROFS as a skippable error
In some cases, /sys/fs/cgroups is mounted read-only. In rootless
containers we can consider this effectively identical to having cgroups
that we don't have write permission to -- because the user isn't
responsible for the read-only setup and cannot modify it. The rules are
identical to when /sys/fs/cgroups is not writable by the unprivileged
user.

An example of this is the default configuration of Docker, where cgroups
are mounted as read-only as a preventative security measure.

Reported-by: Vladimir Rutsky <rutsky@google.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-03-17 13:53:42 +11:00
Qiang Huang 9facb87f87
Merge pull request #1754 from vikaschoudhary16/add-timeout
Add timeout while waiting for StartTransinetUnit completion signal
2018-03-08 09:09:34 +08:00
vikaschoudhary16 04e95b526d Add timeout while waiting for StartTransinetUnit completion signal from dbus
Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-03-07 05:11:38 -05:00
Denys Smirnov 3d26fc3fd7 cgroups/fs: fix NPE on Destroy than no cgroups are set
Currently Manager accepts nil cgroups when calling Apply, but it will panic then trying to call Destroy with the same config.

Signed-off-by: Denys Smirnov <denys@sourced.tech>
2018-03-06 23:31:31 +01:00
Michael Crosby 595bea022f
Merge pull request #1722 from ravisantoshgudimetla/fix-systemd-path
fix systemd slice expansion so that it could be consumed by cAdvisor
2018-02-20 09:59:24 -05:00
ravisantoshgudimetla 7019e1de7b fix systemd slice expansion so that it could be consumed by cAdvisor
Signed-off-by: ravisantoshgudimetla <ravisantoshgudimetla@gmail.com>
2018-02-18 21:32:39 -05:00
vikaschoudhary16 d5b4a3eddb Fix race against systemd
- T0: runc triggers a systemd unit creation asynchronously from [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L298)
- T1: runc then moves ahead and starts creating cgroup paths(.scope directories), [here](https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L348). Kernel creates .scope directory and cgroup.procs file(along with other default files) in the directory automatically, in an atomic manner.
- T3: systemd execution thread which was invoked at time `T0`, is still in the process of unit creation. systemd also trying to create cgroup paths and deletes the `.scope` directory which is created at time `T1` by runc from [here](https://github.com/systemd/systemd/blob/v219/src/shared/cgroup-util.c#L1630) in the code

Signed-off-by: vikaschoudhary16 <choudharyvikas16@gmail.com>
2018-01-08 09:37:26 -05:00
Seth Jennings bca53e7b49 systemd: adjust CPUQuotaPerSecUSec to compensate for systemd internal handling
Signed-off-by: Seth Jennings <sjenning@redhat.com>
2017-11-15 20:20:06 -06:00
Michael Crosby ff4481dbf6 Merge pull request #1540 from cloudfoundry-incubator/rootless-cgroups
Support cgroups with limits as rootless
2017-10-16 12:03:49 -04:00
Sebastien Boeuf acb93c9c62 libcontainer: cgroups: Write freezer state after every state check
This commit ensures we write the expected freezer cgroup state after
every state check, in case the state check does not give the expected
result. This can happen when a new task is created and prevents the
whole cgroup to be FROZEN, leaving the state into FREEZING instead.

This patch prevents the case of an infinite loop to happen.

Fixes https://github.com/opencontainers/runc/issues/1609

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
2017-10-12 07:07:28 -07:00
Will Martin ca4f427af1 Support cgroups with limits as rootless
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
2017-10-05 11:22:54 +01:00
Yong Tang e9944d0f4c Disable systemd in static build
This fix tries to address the warnings caused by static build
with go 1.9. As systemd needs dlopen/dlclose, the following warnings
will be generated for static build in go 1.9:
```
root@f4b077232050:/go/src/github.com/opencontainers/runc# make static
CGO_ENABLED=1 go build  -tags "seccomp cgo static_build" -ldflags "-w -extldflags -static -X main.gitCommit="1c81e2a794c6e26a4c650142ae8893c47f619764" -X main.version=1.0.0-rc4+dev " -o runc .
/tmp/go-link-113476657/000007.o: In function `_cgo_a5acef59ed3f_Cfunc_dlopen':
/tmp/go-build/github.com/opencontainers/runc/vendor/github.com/coreos/pkg/dlopen/_obj/cgo-gcc-prolog:76: warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
```

This fix disables systemd when `static_build` flag is on (apply_nosystemd.go
is used instead).

This fix also fixes a small bug in `apply_nosystemd.go` for return value.

Signed-off-by: Yong Tang <yong.tang.github@outlook.com>
2017-09-11 18:38:22 +00:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00
Michael Crosby 882d8eaba6 Merge pull request #1537 from tklauser/staticcheck
Fix issues found by staticcheck
2017-08-02 09:52:11 -04:00
Tobias Klauser e4e56cb6d8 libcontainer: remove ineffective break statements
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:39 +02:00
Steven Hartland ee4f68e302 Updated logrus to v1
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.

This includes a manual edit of the docker term package to also correct the name there too.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-07-19 15:20:56 +00:00
Daniel, Dao Quang Minh 7139b61f7f Merge pull request #1378 from derekwaynecarr/expose_use_hierarchy
Expose memory.use_hierarchy in MemoryStats
2017-06-30 16:08:21 +01:00
Justin Cormack 3d9074ead3 Update memory specs to use int64 not uint64
replace #1492 #1494
fix #1422

Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-06-27 12:16:07 +01:00
Daniel, Dao Quang Minh 67bd2ab554 Merge pull request #1442 from clnperez/libcontainer-sys-unix
Move libcontainer to x/sys/unix
2017-05-26 12:18:33 +01:00
Michael Crosby 18cd7e06f7 Merge pull request #1372 from cloudfoundry-incubator/cpuset-mount-root
Handle container creation when cgroups have already been mounted in another location
2017-05-25 09:53:57 -07:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Derek Carr 4d6225aec2 Expose memory.use_hierarchy in MemoryStats
Signed-off-by: Derek Carr <decarr@redhat.com>
2017-03-31 13:40:34 -04:00
Aleksa Sarai baeef29858
rootless: add rootless cgroup manager
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Qiang Huang 8430cc4f48 Use uint64 for resources to keep consistency with runtime-spec
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-20 18:51:39 +08:00
Craig Furman f5c5aac958 Create containers when cgroups already mounted
Runc needs to copy certain files from the top of the cgroup cpuset hierarchy
into the container's cpuset cgroup directory. Currently, runc determines
which directory is the top of the hierarchy by using the parent dir of
the first entry in /proc/self/mountinfo of type cgroup.

This creates problems when cgroup subsystems are mounted arbitrarily in
different dirs on the host.

Now, we use the most deeply nested mountpoint that contains the
container's cpuset cgroup directory.

Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
Signed-off-by: Will Martin <wmartin@pivotal.io>
2017-03-15 10:10:30 +00:00
Qiang Huang 8773c5f9a6 Remove unused function in systemd cgroup
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-07 15:11:37 +08:00
Michael Crosby 49a33c41f8 Merge pull request #1344 from xuxinkun/fixCPUQuota20170224
fix cpu.cfs_quota_us changed when systemd daemon-reload using systemd.
2017-03-06 10:02:28 -08:00
xuxinkun c44aec9b23 fix cpu.cfs_quota_us changed when systemd daemon-reload using systemd.
Signed-off-by: xuxinkun <xuxinkun@gmail.com>
2017-03-06 20:08:30 +11:00