Due to the semantics of chroot(2) when it comes to mount namespaces, it
is not generally safe to use MS_PRIVATE as a mount propgation when using
chroot(2). The reason for this is that this effectively results in a set
of mount references being held by the chroot'd namespace which the
namespace cannot free. pivot_root(2) does not have this issue because
the @old_root can be unmounted by the process.
Ultimately, --no-pivot is not really necessary anymore as a commonly
used option since f8e6b5af5e ("rootfs: make pivot_root not use a
temporary directory") resolved the read-only issue. But if someone
really needs to use it, MS_PRIVATE is never a good idea.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The code in prepareRoot (e385f67a0e/libcontainer/rootfs_linux.go (L599-L605))
attempts to default the rootfs mount to `rslave`. However, since the spec
conversion has already defaulted it to `rprivate`, that code doesn't
actually ever do anything.
This changes the spec conversion code to accept "" and treat it as 0.
Implicitly, this makes rootfs propagation default to `rslave`, which is
a part of fixing the moby bug https://github.com/moby/moby/issues/34672
Alternate implementatoins include changing this defaulting to be
`rslave` and removing the defaulting code in prepareRoot, or skipping
the mapping entirely for "", but I think this change is the cleanest of
those options.
Signed-off-by: Euan Kemp <euan.kemp@coreos.com>
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Linux is not always not nil.
If Linux is nil, panic will occur.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).
In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When spec file contains duplicated namespaces, e.g.
specs: specs.Spec{
Linux: &specs.Linux{
Namespaces: []specs.Namespace{
{
Type: "pid",
},
{
Type: "pid",
Path: "/proc/1/ns/pid",
},
},
},
}
runc should report malformed spec instead of using latest one by
default, because this spec could be quite confusing.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Alternative of #895 , part of #892
The intension of current behavior if to create cgroup in
parent cgroup of current process, but we did this in a
wrong way, we used devices cgroup path of current process
as the default parent path for all subsystems, this is
wrong because we don't always have the same cgroup path
for all subsystems.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Setting classid of net_cls cgroup failed:
ERRO[0000] process_linux.go:291: setting cgroup config for ready process caused "failed to write 𐀁 to net_cls.classid: write /sys/fs/cgroup/net_cls,net_prio/user.slice/abc/net_cls.classid: invalid argument"
process_linux.go:291: setting cgroup config for ready process caused "failed to write 𐀁 to net_cls.classid: write /sys/fs/cgroup/net_cls,net_prio/user.slice/abc/net_cls.classid: invalid argument"
The spec has classid as a *uint32, the libcontainer configs should match the type.
Signed-off-by: Hushan Jia <hushan.jia@gmail.com>
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.
Fixes#818
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>