Commit Graph

3391 Commits

Author SHA1 Message Date
Aleksa Sarai 969bb49cc3
nsenter: do not resolve path in nsexec context
With the addition of our new{uid,gid}map support, we used to call
execvp(3) from inside nsexec. This would mean that the path resolution
for the binaries would happen in nsexec. Move the resolution to the
initial setup code, and pass the absolute path to nsexec.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:33 +10:00
Aleksa Sarai 6097ce74d8
nsenter: correctly handle newgidmap path for rootless containers
After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano 3282f5a7c1
tests: fix for rootless multiple uids/gids
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano d8b669400a
rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Giuseppe Scrivano fdf85e35b3
main: honor XDG_RUNTIME_DIR for rootless containers
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2017-09-09 12:44:34 +10:00
Mrunal Patel 13fa5d2953 Merge pull request #1588 from s7v7nislands/delete_unused
Delete unused function
2017-09-08 17:34:00 -07:00
Michael Crosby b82d07e816 Merge pull request #1587 from Mashimiao/fix-namespace-empty
Fixes #1585 config.Namespaces is empty when accessed
2017-09-08 10:50:16 -04:00
Michael Crosby 9755e0065f Merge pull request #1589 from xiaochenshen/rdt-cat-bug-fix
libcontainer: intelrdt: use init() to avoid race condition
2017-09-08 10:41:45 -04:00
Xiaochen Shen 88d22fde40 libcontainer: intelrdt: use init() to avoid race condition
This is the follow-up PR of #1279 to fix remaining issues:

Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-08 17:15:31 +08:00
s7v7nislands c795b8690b Delete unused function
Signed-off-by: Xiaobing Jiang <s7v7nislands@gmail.com>
2017-09-08 10:35:46 +08:00
Ma Shimiao c3d20e7817 Fixes #1585 config.Namespaces is empty when accessed
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-09-08 09:30:07 +08:00
Mrunal Patel deb9d7fd96 Merge pull request #1569 from cyphar/delay-seccomp
init: delay seccomp application as late as possible
2017-09-07 13:27:37 -07:00
Mrunal Patel 7e036aa0b0 Merge pull request #1541 from adrianreber/lazy
checkpoint: support lazy migration
2017-09-07 13:25:04 -07:00
Mrunal Patel 5274430fee Merge pull request #1279 from xiaochenshen/rdt-cat-resource-manager-v1
libcontainer: add support for Intel RDT/CAT in runc
2017-09-06 14:36:02 -07:00
Adrian Reber ec260653b7 lazy-migration: add test case
The lazy-pages test case is not as straight forward as the other test
cases. This is related to the fact that restoring requires a different
name if restored on the same host. During 'runc checkpoint' the
container is not destroyed before all memory pages have been transferred
to the destination and thus the same container name cannot be used.

As real world usage will rather migrate a container from one system to
another than lazy migrate a container on the same host this is only
problematic for this test case.

Another reason is that it requires starting 'runc checkpoint' and 'criu
lazy-pages' in the background as those process need to be running to
start the final restore 'runc restore'.

CRIU upstream is currently discussing to automatically start 'criu
lazy-pages' which would simplify the lazy-pages test case a bit.

The handling and checking of the background processes make the test case
not the most elegant as at one point a 'sleep 2' is required to make
sure that 'runc checkpoint' had time to do its thing before looking at
log files.

Before running the actual test criu is called in feature checking mode
to make sure lazy migration is in the test case criu enabled. If not,
the test is skipped.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:39 +00:00
Adrian Reber 60ae7091de checkpoint: support lazy migration
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.

This enables to reduce the downtime during process or container
migration to a minimum as the memory does not need to be transferred
during migration.

Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and if the used CRIU version supports lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.

To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:

 # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
  --status-fd /tmp/postcopy-pipe

In this example CRIU will hang/wait once it has opened the network port
and wait for network connection. As runC waits for CRIU to finish it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:

 #️ runc checkpoint --help | grep status
  --status-fd value   criu writes \0 to this FD once lazy-pages is ready

The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.

On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:

 # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
  -D checkpoint

and tell runC to do a lazy restore:

 # runc restore -d --image-path checkpoint --work-path checkpoint \
  --lazy-pages httpd

If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and talks to 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.

This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.

Some additional background about post-copy migration can be found in
these articles:

 https://lisas.de/~adrian/?p=1253
 https://lisas.de/~adrian/?p=1183

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Adrian Reber a3a632ad28 checkpoint: add support to query for lazy page support
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Mrunal Patel aea4f21eec Merge pull request #1575 from cyphar/tty-resize-ignore-errors
signal: ignore tty.resize errors
2017-09-01 11:20:26 -07:00
Xiaochen Shen 4d2756c116 libcontainer: add test cases for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:35:40 +08:00
Xiaochen Shen 692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Xiaochen Shen af3b0d9dce libcontainer/SPEC.md: add documentation for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Michael Crosby 84a082bfef Merge pull request #1578 from cyphar/remove-shfmt-from-ci
travis: drop shfmt install
2017-08-31 09:46:39 -04:00
Aleksa Sarai ace083b650
travis: drop shfmt install
It looks like we missed this in 5930d5b427 ("Remove shfmt"), which was
causing CI to break (since it looks like the repo has moved or something
like that). Since we're no longer using shfmt, drop it completely from
the repo.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-31 20:49:51 +10:00
Aleksa Sarai 10b175ce49
signal: ignore tty.resize errors
Fixes a race that occurred very frequently in testing where the tty of
the container may be closed by the time that runc gets to sending
SIGWINCH. This failure mode is not fatal, but it would cause test
failures due to expected outputs not matching. On further review it
appears that the original addition of these checks in 4c5bf649d0
("Check error return values") was actually not necessary, so partially
revert that change.

The particular failure mode this resolves would manifest as error logs
of the form:

  time="2017-08-24T07:59:50Z" level=error msg="bad file descriptor"

Fixes: 4c5bf649d0 ("Check error return values")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-27 00:44:17 +10:00
Aleksa Sarai 1f32fff46d
setns init: delay seccomp as late as possible
This mirrors the standard_init_linux.go seccomp code, which only applies
seccomp early if NoNewPrivileges is enabled. Otherwise it's done
immediately before execve to reduce the amount of syscalls necessary for
users to enable in their seccomp profiles.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:30 +10:00
Aleksa Sarai 3ddde27d7d
init: move close(stateDirFd) before seccomp apply
This further reduces the number of syscalls that a user needs to enable
in their seccomp profile.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:26 +10:00
Qiang Huang 1c81e2a794 Merge pull request #1572 from tych0/fix-readonly-userns
fix --read-only containers under --userns-remap
2017-08-26 09:38:14 +08:00
Aleksa Sarai 4d6e6720a7
Merge branch 'pr-1573'
Fix systemd cgroup after memory type changed

LGTMs: @crosbymichael @cyphar
Closes #1573
2017-08-25 23:55:27 +10:00
Michael Crosby 4e33faefa7 Merge pull request #1570 from cyphar/close-statedirfd-hole
init: switch away from stateDirFd entirely
2017-08-25 09:52:16 -04:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Tycho Andersen 66eb2a3e8f fix --read-only containers under --userns-remap
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-08-24 16:43:21 -06:00
Michael Crosby ae2948042b Merge pull request #1561 from nseps/master
Add AutoDedup option to CriuOpts
2017-08-18 12:50:27 -04:00
Nikolas Sepos 3f234b15d0 Add auto-dedup flag for checkpoint/restore
When doing incremental dumps is useful to use auto deduplication of
memory images to save space.

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 16:19:21 +02:00
Nikolas Sepos da4a5a9515 Add AutoDedup option to CriuOpts
Memory image deduplication, very useful for incremental dumps.

See: https://criu.org/Memory_images_deduplication

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 01:21:42 +02:00
Aleksa Sarai 59bbdc41a3
merge branch 'pr-1560'
Check error return values

LGTMs: @crosbymichael @cyphar
Closes #1560
2017-08-18 01:31:18 +10:00
Michael Crosby ccd2c20aa4 Merge pull request #1559 from Mashimiao/panic-fix-nil-linux
fix panic when Linux is nil for rootless case
2017-08-17 09:57:35 -04:00
Tobias Klauser 4c5bf649d0 Check error return values
Both tty.resize and notifySocket.setupSocket return an error which isn't
handled in the caller. Fix this and either log or propagate the errors.

Found using https://github.com/mvdan/unparam

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-08-17 11:41:19 +02:00
Michael Crosby c6126b2141 Merge pull request #1554 from cyphar/use-umoci-release-script
release: import umoci's release.sh script
2017-08-16 09:46:56 -04:00
Aleksa Sarai c24f602407
ci: smoke-test the release script
To make sure that `make release` doesn't suddenly break after we've cut
a release, smoke-test the release scripts. The script won't fail if GPG
keys aren't found, so running in CI shouldn't be a huge issue.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-16 14:44:45 +10:00
Aleksa Sarai ed68ee1e10
release: import umoci's release.sh script
This script is far easier to use than the previous `make release`
target, not to mention that it also automatically signs all of the
artefacts and makes everything really easy to do for maintainers.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-16 14:35:52 +10:00
Ma Shimiao 2333e7dc67 fix panic when Linux is nil for rootless case
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-08-16 09:11:13 +08:00
Mrunal Patel b31bdfc38a Merge pull request #1558 from hqhq/update_state
Update state after update
2017-08-15 10:46:44 -07:00
Qiang Huang e6e1c34a7d Update state after update
state.json should be a reflection of the container's
realtime state, including resource configurations,
so we should update state.json after updating container
resources.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-15 14:38:44 +08:00
Qiang Huang eb464f7e43 Merge pull request #1542 from cyphar/buildmode-pic
makefile: enable -buildmode=pie
2017-08-15 09:30:40 +08:00
Aleksa Sarai b45e243f8b
*: enable -buildmode=pie
Go has supported PIC builds for a while now, and given the security
benefits of using PIC binaries we should really enable them. There also
appears to be some indication that non-PIC builds have been interacting
oddly on ppc64le (the linker cannot load some shared libraries), and
using PIC builds appears to solve this problem.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-15 00:12:27 +10:00
Michael Crosby 760c67744b Merge pull request #1555 from cyphar/remove-install-flag-makefile
makefile: drop usage of --install
2017-08-14 10:04:33 -04:00
Michael Crosby 3096b3fc85 Merge pull request #1556 from hqhq/fix_flakytest_TestNotifyOnOOM
Fix flaky test TestNotifyOnOOM
2017-08-14 10:03:23 -04:00
Qiang Huang 9aa46c1e66 Merge pull request #1551 from crosbymichael/linux-nil
fix panic when Linux is nil
2017-08-14 19:35:31 +08:00
Qiang Huang 7726bcf0e2 Some fixes for testMemoryNotification
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-14 15:28:03 +08:00