After quite a bit of debugging, I found that previous versions of this
patchset did not include newgidmap in a rootless setting. Fix this by
passing it whenever group mappings are applied, and also providing some
better checking for try_mapping_tool. This commit also includes some
stylistic improvements.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is the follow-up PR of #1279 to fix remaining issues:
Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This applies cgroups earlier for container creation before the init
process starts running and forking off any additional processes.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
With the help of userfaultfd CRIU supports lazy migration. Lazy
migration means that memory pages are only transferred from the
migration source to the migration destination on page fault.
This enables to reduce the downtime during process or container
migration to a minimum as the memory does not need to be transferred
during migration.
Lazy migration currently depends on userfaultfd being available on the
current Linux kernel and if the used CRIU version supports lazy
migration. Both dependencies can be checked by querying CRIU via RPC if
the lazy migration feature is available. Using feature checking instead
of version comparison enables runC to use CRIU features from the
criu-dev branch. This way the user can decide if lazy migration should
be available by choosing the right kernel and CRIU branch.
To use lazy migration the CRIU process during dump needs to dump
everything besides the memory pages and then it opens a network port
waiting for remote page fault requests:
# runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \
--status-fd /tmp/postcopy-pipe
In this example CRIU will hang/wait once it has opened the network port
and wait for network connection. As runC waits for CRIU to finish it
will also hang until the lazy migration has finished. To know when the
restore on the destination side can start the '--status-fd' parameter is
used:
#️ runc checkpoint --help | grep status
--status-fd value criu writes \0 to this FD once lazy-pages is ready
The parameter '--status-fd' is directly from CRIU and this way the
process outside of runC which controls the migration knows exactly when
to transfer the checkpoint (without memory pages) to the destination and
that the restore can be started.
On the destination side it is necessary to start CRIU in 'lazy-pages'
mode like this:
# criu lazy-pages --page-server --address 192.168.122.3 --port 27 \
-D checkpoint
and tell runC to do a lazy restore:
# runc restore -d --image-path checkpoint --work-path checkpoint \
--lazy-pages httpd
If both processes on the restore side have the same working directory
'criu lazy-pages' creates a unix domain socket where it waits for
requests from the actual restore. runC starts CRIU restore in lazy
restore mode and talks to 'criu lazy-pages' that it wants to restore
memory pages on demand. CRIU continues to restore the process and once
the process is running and accesses the first non-existing memory page
the 'criu lazy-pages' server will request the page from the source
system. Thus all pages from the source system will be transferred to the
destination system. Once all pages have been transferred runC on the
source system will end and the container will have finished migration.
This can also be combined with CRIU's pre-copy support. The combination
of pre-copy and post-copy (lazy migration) provides the possibility to
migrate containers with minimal downtimes.
Some additional background about post-copy migration can be found in
these articles:
https://lisas.de/~adrian/?p=1253https://lisas.de/~adrian/?p=1183
Signed-off-by: Adrian Reber <areber@redhat.com>
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.
Signed-off-by: Adrian Reber <areber@redhat.com>
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This mirrors the standard_init_linux.go seccomp code, which only applies
seccomp early if NoNewPrivileges is enabled. Otherwise it's done
immediately before execve to reduce the amount of syscalls necessary for
users to enable in their seccomp profiles.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Fixes: #1557
I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).
Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations
says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.
This takes us from:
ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".
to:
ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system
Signed-off-by: Tycho Andersen <tycho@docker.com>
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
state.json should be a reflection of the container's
realtime state, including resource configurations,
so we should update state.json after updating container
resources.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Fixes: #1228
It can be reproduced by applying this patch:
```diff
@@ -45,6 +46,7 @@ func registerMemoryEvent(cgDir string, evName string, arg string) (<-chan struct
go func() {
defer func() {
close(ch)
+ <-time.After(1 * time.Second)
eventfd.Close()
evFile.Close()
}()
```
We can close channel after fds were closed.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Linux is not always not nil.
If Linux is nil, panic will occur.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Before this change, some file type would be treated as char devices
(e.g. symlinks).
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
This allows the libcontainer to automatically clean up runc:[1:CHILD]
processes created as part of nsenter.
Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
With this runC also uses RPC to ask CRIU for its version. CRIU supports
a VERSION RPC since CRIU 3.0 and using the RPC interface does not
require parsing the console output of CRIU (which could change anytime).
For older CRIU versions which do not yet have the VERSION RPC runC falls
back to its old CRIU output parsing mode.
Once CRIU 3.0 is the minimum version required for runC the old code can
be removed.
v2:
* adapt to changes in the previous patches based on the review
Signed-off-by: Adrian Reber <areber@redhat.com>
Update criurpc.proto for the upcoming VERSION RPC.
This includes lazy_pages for the upcoming lazy migration support.
Signed-off-by: Adrian Reber <areber@redhat.com>
To use the CRIU VERSION RPC the criuSwrk function is adapted to work
with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION
RPC.
Also do not print c.criuVersion if it is '0' as the first RPC call will
always be the VERSION call and only after that the version will be
known.
Signed-off-by: Adrian Reber <areber@redhat.com>
If the version of criu has already been determined there is no need to
ask criu for the version again. Use the value from c.criuVersion.
v2:
* reduce unnecessary code movement in the patch series
* factor out the criu version parsing into a separate function
Signed-off-by: Adrian Reber <areber@redhat.com>
The checkCriuVersion function used a string to specify the minimum
version required. This is more comfortable for an external interface
but for an internal function this added unnecessary complexity. This
changes to version string like '1.5.2' to an integer like 10502. This is
already the format used internally in the function.
Signed-off-by: Adrian Reber <areber@redhat.com>
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.
Found with honnef.co/go/tools/cmd/staticcheck
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights.
Found with honnef.co/go/tools/cmd/staticcheck
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
This moves all console code to use github.com/containerd/console library to
handle console I/O. Also move to use EpollConsole by default when user requests
a terminal so we can still cope when the other side temporarily goes away.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
Refactor DeviceFromPath in order to get rid of package syscall and
directly use the functions from x/sys/unix. This also allows to get rid
of the conversion from the OS-independent file mode values (from the os
package) to Linux specific values and instead let's us use the raw
file mode value directly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.
This includes a manual edit of the docker term package to also correct the name there too.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Use ParseSocketControlMessage and ParseUnixRights from
golang.org/x/sys/unix instead of their syscall equivalent.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
fix#1476
If containerA shares namespace, say ipc namespace, with containerB, then
its ipc namespace path would be the same as containerB and be stored in
`state.json`. Exec into containerA will just read the namespace paths
stored in this file and join these namespaces. So, if containerB has
already been stopped, `docker exec containerA` will fail.
To address this issue, we should always save own namespace paths no
matter if we share namespaces with other containers.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
It appears as though these semantics were not fully thought out when
implementing them for rootless containers. It is not necessary (and
could be potentially dangerous) to set the owner of /run/ctr/$id to be
the root inside the container (if user namespaces are being used).
Instead, just use the e{g,u}id of runc to determine the owner.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Use IoctlGetInt and IoctlGetTermios/IoctlSetTermios instead of manually
reimplementing them.
Because of unlockpt, the ioctl wrapper is still needed as it needs to
pass a pointer to a value, which is not supported by any ioctl function
in x/sys/unix yet.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Use unix.Prctl() instead of manually reimplementing it using
unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally
defining it.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently has a live reference to a
mountpoint that existed on the host (which the host cannot kill
implicitly through an unmount and peer sharing).
In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because it is "in use" during the race. While the race is _very_ small
(and libdm actually retries to avoid these sorts of cases) this appears
to manifest in various cases.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
replace #1492#1494fix#1422
Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Use unix.Eventfd() instead of calling manually reimplementing it using
the raw syscall. Also use the correct corresponding unix.EFD_CLOEXEC
flag instead of unix.FD_CLOEXEC (which can have a different value on
some architectures and thus might lead to unexpected behavior).
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
And Stat_t.PID and Stat_t.Name while we're at it. Then use the new
.State property in runType to distinguish between running and
zombie/dead processes, since kill(2) does not [1]. With this change
we no longer claim Running status for zombie/dead processes.
I've also removed the kill(2) call from runType. It was originally
added in 13841ef3 (new-api: return the Running state only if the init
process is alive, 2014-12-23), but we've been accessing
/proc/[pid]/stat since 14e95b2a (Make state detection precise,
2016-07-05, #930), and with the /stat access the kill(2) check is
redundant.
I also don't see much point to the previously-separate
doesInitProcessExist, so I've inlined that logic in runType.
It would be nice to distinguish between "/proc/[pid]/stat doesn't
exist" and errors parsing its contents, but I've skipped that for the
moment.
The Running -> Stopped change in checkpoint_test.go is because the
post-checkpoint process is a zombie, and with this commit zombie
processes are Stopped (and no longer Running).
[1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789
Signed-off-by: W. Trevor King <wking@tremily.us>
And convert the various start-time properties from strings to uint64s.
This removes all internal consumers of the deprecated
GetProcessStartTime function.
Signed-off-by: W. Trevor King <wking@tremily.us>
Use KeyctlJoinSessionKeyring, KeyctlString and KeyctlSetperm from
golang.org/x/sys/unix instead of manually reimplementing them.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
And use it only in local tooling that is forwarding the pseudoterminal
master. That way runC no longer has an opinion on the onlcr setting
for folks who are creating a terminal and detaching. They'll use
--console-socket and can setup the pseudoterminal however they like
without runC having an opinion. With this commit, the only cases
where runC still has applies SaneTerminal is when *it* is the process
consuming the master descriptor.
Signed-off-by: W. Trevor King <wking@tremily.us>
Use the NLA_ALIGNTO and NLA_HDRLEN constants from x/sys/unix instead of
syscall, as the syscall package shouldn't be used anymore (except for a
few exceptions).
This also makes the syscall_NLA_HDRLEN workaround for gccgo unnecessary.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Use the symlink xattr syscall wrappers Lgetxattr, Llistxattr and
Lsetxattr from x/sys/unix (introduced in
golang/sys@b90f89a1e7) instead of
providing own wrappers. Leave the functionality of system.Lgetxattr
intact with respect to the retry with a larger buffer, but switch it to
use unix.Lgetxattr.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Follow commit 3d7cb4293c ("Move libcontainer to x/sys/unix") and also
move the examples in README.md from syscall to x/sys/unix.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
* User Case:
User could use prestart hook to add block devices to container. so the
hook should have a way to set the permissions of the devices.
Just move cgroup config operation before prestart hook will work.
Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported
in runc anyway right now. So clean up the file naming to use `_linux` where appropriate.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>