Commit Graph

3715 Commits

Author SHA1 Message Date
Michael Crosby bd420b59f1
Merge pull request #1925 from Ace-Tang/fix_dup_ns
test: fix TestDupNamespaces fail to test dup-ns error
2018-11-13 12:11:11 -05:00
Aleksa Sarai bb522d6eca
merge branch 'pr-1928'
rootless: fix potential panic in shouldUseRootlessCgroupManager

LGTMs: @crosbymichael @cyphar
Closes #1928
2018-11-13 16:04:44 +11:00
Ace-Tang 714a4d466a rootless: fix potential panic in shouldUseRootlessCgroupManager
Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-10 21:04:36 +08:00
Michael Crosby 079817cc26
Merge pull request #1926 from Ace-Tang/fix_spec_proc
libcontainer: fix potential panic if spec.Process is nil
2018-11-07 10:07:30 -05:00
Ace-Tang 16d55f17a8 libcontainer: fix potential panic if spec.Process is nil
for the code logic, pointer 'spec.Process' should be judge first
to avoid panic.

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-06 11:55:30 +08:00
Ace-Tang 95d1aa1886 test: fix TestDupNamespaces
add Root in created spec, or error message is 'Root must be specified'

Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-06 11:36:27 +08:00
Michael Crosby b1068fb925
Merge pull request #1814 from rhatdan/selinux
SELinux labels are tied to the thread
2018-11-05 10:00:11 -05:00
Mrunal Patel 15b24b70df
Merge pull request #1922 from kolyshkin/cgo
Makefile: rm cgo tag
2018-11-02 09:35:33 -07:00
Aleksa Sarai df8dd8d940
merge branch 'pr-1923'
readme: add nokmem build tag

LGTMs: @crosbymichael @cyphar
Closes #1923
2018-11-03 03:04:40 +11:00
Ace-Tang f1b1407e1b readme: add nokmem build tag
Signed-off-by: Ace-Tang <aceapril@126.com>
2018-11-02 11:56:54 +08:00
Kir Kolyshkin 1e0d04c642 Makefile: rm cgo tag
There is no need to explicitly add `cgo` build tag, it is set by
by go tools if cgo is enabled.

Fixes: ecd6463101

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-11-01 17:01:12 -07:00
Aleksa Sarai 9f1e94488e
merge branch 'pr-1921'
libcontainer: ability to compile without kmem

LGTMs: @mrunalp @cyphar
Closes #1921
2018-11-02 09:54:16 +11:00
Michael Crosby 9e5aa7494d
Merge pull request #1918 from giuseppe/skip-setgroups
rootless: fix running with /proc/self/setgroups set to deny
2018-11-01 13:16:47 -04:00
Kir Kolyshkin 6a2c155968 libcontainer: ability to compile without kmem
Commit fe898e7862 (PR #1350) enables kernel memory accounting
for all cgroups created by libcontainer -- even if kmem limit is
not configured.

Kernel memory accounting is known to be broken in some kernels,
specifically the ones from RHEL7 (including RHEL 7.5). Those
kernels do not support kernel memory reclaim, and are prone to
oopses. Unconditionally enabling kmem acct on such kernels lead
to bugs, such as

* https://github.com/opencontainers/runc/issues/1725
* https://github.com/kubernetes/kubernetes/issues/61937
* https://github.com/moby/moby/issues/29638

This commit gives a way to compile runc without kernel memory setting
support. To do so, use something like

	make BUILDTAGS="seccomp nokmem"

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2018-10-31 20:35:51 -07:00
Chris Aniszczyk f3ce8221ea
Merge pull request #1913 from xiaochenshen/rdt-add-diagnostics
libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors
2018-10-25 14:27:17 -05:00
Giuseppe Scrivano 869add3318
rootless: fix running with /proc/self/setgroups set to deny
This is a regression from 06f789cf26
when the user namespace was configured without a privileged helper.
To allow a single mapping in an user namespace, it is necessary to set
/proc/self/setgroups to "deny".

For a simple reproducer, the user namespace can be created with
"unshare -r".

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-10-25 15:44:15 +02:00
Aleksa Sarai e93996674f
merge branch 'pr-1903'
clarify license information

LGTMs: @hqhq @cyphar
Closes #1903
2018-10-24 22:03:44 +11:00
Michael Crosby 7ca079fdeb
Merge pull request #1915 from HaraldNordgren/go_versions
Bump Travis versions
2018-10-23 16:22:37 -04:00
Harald Nordgren 630fb5b802 Bump Travis versions
Signed-off-by: Harald Nordgren <haraldnordgren@gmail.com>
2018-10-21 22:06:36 +02:00
Xiaochen Shen 6c307f8ff2 libcontainer: intelrdt: add user-friendly diagnostics for Intel RDT operation errors
Linux kernel v4.15 introduces better diagnostics for Intel RDT operation
errors. If any error returns when making new directories or writing to
any of the control file in resctrl filesystem, reading file
/sys/fs/resctrl/info/last_cmd_status could provide more information that
can be conveyed in the error returns from file operations.

Some examples:
  echo "L3:0=f3;1=ff" > /sys/fs/resctrl/container_id/schemata
  -bash: echo: write error: Invalid argument
  cat /sys/fs/resctrl/info/last_cmd_status
  mask f3 has non-consecutive 1-bits

  echo "MB:0=0;1=110" > /sys/fs/resctrl/container_id/schemata
  -bash: echo: write error: Invalid argument
  cat /sys/fs/resctrl/info/last_cmd_status
  MB value 0 out of range [10,100]

  cd /sys/fs/resctrl
  mkdir 1 2 3 4 5 6 7 8
  mkdir: cannot create directory '8': No space left on device
  cat /sys/fs/resctrl/info/last_cmd_status
  out of CLOSIDs

See 'last_cmd_status' for more details in kernel documentation:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

In runc, we could append the diagnostics information to the error
message of Intel RDT operation errors to provide more user-friendly
information.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-19 00:16:08 +08:00
Mrunal Patel c2ab1e656e
Merge pull request #1910 from adrianreber/tip
Fix travis Go: tip
2018-10-17 12:47:08 -07:00
Michael Crosby 58592df567
Merge pull request #1880 from AkihiroSuda/fix-subgid
libcontainer: CurrentGroupSubGIDs -> CurrentUserSubGIDs
2018-10-16 15:21:51 -04:00
Michael Crosby 78ef28e63b
Merge pull request #1632 from xiaochenshen/rdt-mba
libcontainer: intelrdt: add support for Intel RDT/MBA in runc
2018-10-16 09:54:22 -04:00
Xiaochen Shen d59b17d6d5 libcontainer: intelrdt: Add more check if sub-features are enabled
Double check if Intel RDT sub-features are available in "resource
control" filesystem. Intel RDT sub-features can be selectively disabled
or enabled by kernel command line (e.g., rdt=!l3cat,mba) in 4.14 and
newer kernel.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:44 +08:00
Xiaochen Shen f097339289 libcontainer: intelrdt: add test cases for Intel RDT/MBA
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:39 +08:00
Xiaochen Shen 1ed597bfe6 libcontainer: intelrdt: add update command support for Intel RDT/MBA
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:34 +08:00
Xiaochen Shen 27560ace2f libcontainer: intelrdt: add support for Intel RDT/MBA in runc
Memory Bandwidth Allocation (MBA) is a resource allocation sub-feature
of Intel Resource Director Technology (RDT) which is supported on some
Intel Xeon platforms. Intel RDT/MBA provides indirect and approximate
throttle over memory bandwidth for the software. A user controls the
resource by indicating the percentage of maximum memory bandwidth.

Hardware details of Intel RDT/MBA can be found in section 17.18 of
Intel Software Developer Manual:
https://software.intel.com/en-us/articles/intel-sdm

In Linux 4.12 kernel and newer, Intel RDT/MBA is enabled by kernel
config CONFIG_INTEL_RDT. If hardware support, CPU flags `rdt_a` and
`mba` will be set in /proc/cpuinfo.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks

For MBA support for `runc`, we will reuse the infrastructure and code
base of Intel RDT/CAT which implemented in #1279. We could also make
use of `tasks` and `schemata` configuration for memory bandwidth
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the
task ID to the "tasks" file (which will automatically remove them from
the previous group to which they belonged). New tasks created by
fork(2) and clone(2) are added to the same group as their parent.

The file `schemata` has a list of all the resources available to this
group. Each resource (L3 cache, memory bandwidth) has its own line and
format.

Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, which
contains L3 cache id and memory bandwidth percentage.
    Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."

The minimum bandwidth percentage value for each CPU model is predefined
and can be looked up through "info/MB/min_bandwidth". The bandwidth
granularity that is allocated is also dependent on the CPU model and
can be looked up at "info/MB/bandwidth_gran". The available bandwidth
control steps are: min_bw + N * bw_gran. Intermediate values are
rounded to the next control step available on the hardware.

For more information about Intel RDT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the minimum
memory bandwidth of 10% with a memory bandwidth granularity of 10%.
Tasks inside the container may use a maximum memory bandwidth of 20%
on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "memBwSchema": "MB:0=20;1=70"
    }
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:29:29 +08:00
Xiaochen Shen c1cece7e23 libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 14:28:19 +08:00
Xiaochen Shen bd90541666 vendor: bump runtime-spec to 5684b8af48c1
Update runtime-spec to get Intel RDT/MBA Linux configs which will be
used in successive commits.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2018-10-16 13:18:25 +08:00
Mrunal Patel a00bf01908
Merge pull request #1862 from AkihiroSuda/decompose-rootless-pr
Disable rootless mode except RootlessCgMgr when executed as the root in userns (fix Docker-in-LXD regression)
2018-10-15 17:32:15 -07:00
Adrian Reber 0d01164756 Fix travis Go: tip
This fixes

 libcontainer/container_linux.go:1200: Error call has possible formatting directive %s

Signed-off-by: Adrian Reber <areber@redhat.com>
2018-10-13 10:44:07 +00:00
Aleksa Sarai 398f670bcb
merge branch 'pr-1908'
fix build break

LGTMs: @crosbymichael @cyphar
Closes #1908
2018-10-13 18:32:37 +11:00
Mike Brown 36f8472053 fix build break
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2018-10-12 09:22:35 -05:00
Aleksa Sarai e40d4635c4
merge branch 'pr-1894'
Move spec.Linux.IntelRdt check to spec.Linux != nil block

LGTMs: @crosbymichael @cyphar
Closes #1894
2018-10-09 02:41:13 +11:00
Jonathan Marler 1499c746a1 Move spec.Linux.IntelRdt check to spec.Linux != nil block
Signed-off-by: Jonathan Marler <johnnymarler@gmail.com>
2018-10-04 21:30:55 -06:00
Mike Brown 26bdc0dce7 clarify license information
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2018-10-03 10:39:44 -05:00
Mrunal Patel 2abd837c8c
Merge pull request #1893 from cyphar/keyctl-ignore-enosys
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
2018-09-25 13:35:16 -07:00
Mrunal Patel 00dc70017d
Merge pull request #1895 from giuseppe/fix-tty-hang
tty: close epollConsole on errors
2018-09-20 10:02:08 -07:00
Giuseppe Scrivano ec0d23a92f
tty: close epollConsole on errors
make sure epollConsole is closed before returning an error.  It solves
a hang when using these commands with a container that uses a
terminal:

runc run foo &
ssh root@localhost runc exec foo echo hello

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2018-09-20 16:51:51 +02:00
Aleksa Sarai 578fe65e4f
merge branch 'pr-1817'
Fix duplicate entries and missing entries in getCgroupMountsHelper
  Add test for testing cgroup mounts on bedrock linux
  Stop relying on number of subsystems for cgroups

LGTMs: @crosbymichael @cyphar
Closes #1817
2018-09-19 19:48:17 +10:00
Michael Crosby cc8146cf93
Merge pull request #1858 from marcov/nsenter-README
Update outdated nsenter README content
2018-09-17 10:53:19 -04:00
Michael Crosby d77251d5fc
Merge pull request #1892 from Ace-Tang/add_clean_test
test: add more test case for CleanPath
2018-09-17 10:51:17 -04:00
Michael Crosby 8facd6d2d5
Merge pull request #1886 from halfcrazy/fix/typo
doc: fix typo
2018-09-17 10:44:23 -04:00
Aleksa Sarai 40f1468413
keyring: handle ENOSYS with keyctl(KEYCTL_JOIN_SESSION_KEYRING)
While all modern kernels (and I do mean _all_ of them -- this syscall
was added in 2.6.10 before git had begun development!) have support for
this syscall, LXC has a default seccomp profile that returns ENOSYS for
this syscall. For most syscalls this would be a deal-breaker, and our
use of session keyrings is security-based there are a few mitigating
factors that make this change not-completely-insane:

  * We already have a flag that disables the use of session keyrings
    (for older kernels that had system-wide keyring limits and so
    on). So disabling it is not a new idea.

  * While the primary justification of using session keys *is*
    security-based, it's more of a security-by-obscurity protection.
    The main defense keyrings have is VFS credentials -- which is
    something that users already have better security tools for
    (setuid(2) and user namespaces).

  * Given the security justification you might argue that we
    shouldn't silently ignore this. However, the only way for the
    kernel to return -ENOSYS is either being ridiculously old (at
    which point we wouldn't work anyway) or that there is a seccomp
    profile in place blocking it.

    Given that the seccomp profile (if malicious) could very easily
    just return 0 or a silly return code (or something even more
    clever with seccomp-bpf) and trick us without this patch, there
    isn't much of a significant change in how much seccomp can trick
    us with or without this patch.

Given all of that over-analysis, I'm pretty convinced there isn't a
security problem in this very specific case and it will help out the
ChromeOS folks by allowing Docker to run inside their LXC container
setup. I'd be happy to be proven wrong.

Ref: https://bugs.chromium.org/p/chromium/issues/detail?id=860565
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2018-09-17 21:38:30 +10:00
Ace-Tang 5963cf2afc test: add more test case for CleanPath
Signed-off-by: Ace-Tang <aceapril@126.com>
2018-09-14 21:37:12 +08:00
Akihiro Suda 06f789cf26 Disable rootless mode except RootlessCgMgr when executed as the root in userns
This PR decomposes `libcontainer/configs.Config.Rootless bool` into `RootlessEUID bool` and
`RootlessCgroups bool`, so as to make "runc-in-userns" to be more compatible with "rootful" runc.

`RootlessEUID` denotes that runc is being executed as a non-root user (euid != 0) in
the current user namespace. `RootlessEUID` is almost identical to the former `Rootless`
except cgroups stuff.

`RootlessCgroups` denotes that runc is unlikely to have the full access to cgroups.
`RootlessCgroups` is set to false if runc is executed as the root (euid == 0) in the initial namespace.
Otherwise `RootlessCgroups` is set to true.
(Hint: if `RootlessEUID` is true, `RootlessCgroups` becomes true as well)

When runc is executed as the root (euid == 0) in an user namespace (e.g. by Docker-in-LXD, Podman, Usernetes),
`RootlessEUID` is set to false but `RootlessCgroups` is set to true.
So, "runc-in-userns" behaves almost same as "rootful" runc except that cgroups errors are ignored.

This PR does not have any impact on CLI flags and `state.json`.

Note about CLI:
* Now `runc --rootless=(auto|true|false)` CLI flag is only used for setting `RootlessCgroups`.
* Now `runc spec --rootless` is only required when `RootlessEUID` is set to true.
  For runc-in-userns, `runc spec`  without `--rootless` should work, when sufficient numbers of
  UID/GID are mapped.

Note about `$XDG_RUNTIME_DIR` (e.g. `/run/user/1000`):
* `$XDG_RUNTIME_DIR` is ignored if runc is being executed as the root (euid == 0) in the initial namespace, for backward compatibility.
  (`/run/runc` is used)
* If runc is executed as the root (euid == 0) in an user namespace, `$XDG_RUNTIME_DIR` is honored if `$USER != "" && $USER != "root"`.
  This allows unprivileged users to allow execute runc as the root in userns, without mounting writable `/run/runc`.

Note about `state.json`:
* `rootless` is set to true when `RootlessEUID == true && RootlessCgroups == true`.

Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>
2018-09-07 15:05:03 +09:00
Yan Zhu feb90346e0 doc: fix typo
Signed-off-by: Yan Zhu <yanzhu@alauda.io>
2018-09-07 11:58:59 +08:00
Michael Crosby 70ca035aa6
Merge pull request #1883 from lifubang/containeridinpath
fix delete other file bug when container id is ..
2018-09-05 13:43:21 -04:00
Mrunal Patel 9cda583235
Merge pull request #1832 from giuseppe/runc-drop-invalid-proc-destination-with-chroot
linux: drop check for /proc as invalid dest
2018-09-04 09:26:21 -07:00
Michael Crosby 784b601f68
Merge pull request #1882 from accepting/dev
libcontainer: add /proc/loadavg to the white list of bind mount
2018-09-04 11:27:45 -04:00