About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.
This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).
For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.
About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.
Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.
Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
| |-- L3
| |-- cbm_mask
| |-- min_cbm_bits
| |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
|-- cpus
|-- schemata
|-- tasks
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.
The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.
The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:
"linux": {
"intelRdt": {
"l3CacheSchema": "L3:0=ffff0;1=3ff"
}
}
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Linux is not always not nil.
If Linux is nil, panic will occur.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).
In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When spec file contains duplicated namespaces, e.g.
specs: specs.Spec{
Linux: &specs.Linux{
Namespaces: []specs.Namespace{
{
Type: "pid",
},
{
Type: "pid",
Path: "/proc/1/ns/pid",
},
},
},
}
runc should report malformed spec instead of using latest one by
default, because this spec could be quite confusing.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Alternative of #895 , part of #892
The intension of current behavior if to create cgroup in
parent cgroup of current process, but we did this in a
wrong way, we used devices cgroup path of current process
as the default parent path for all subsystems, this is
wrong because we don't always have the same cgroup path
for all subsystems.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Setting classid of net_cls cgroup failed:
ERRO[0000] process_linux.go:291: setting cgroup config for ready process caused "failed to write 𐀁 to net_cls.classid: write /sys/fs/cgroup/net_cls,net_prio/user.slice/abc/net_cls.classid: invalid argument"
process_linux.go:291: setting cgroup config for ready process caused "failed to write 𐀁 to net_cls.classid: write /sys/fs/cgroup/net_cls,net_prio/user.slice/abc/net_cls.classid: invalid argument"
The spec has classid as a *uint32, the libcontainer configs should match the type.
Signed-off-by: Hushan Jia <hushan.jia@gmail.com>
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.
Fixes#818
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This adds a `--no-pivot` cli flag to runc so that a container's rootfs
can be located ontop of ramdisk/tmpfs and not fail because you cannot
pivot root.
This should be a cli flag and not part of the spec because this is a
detail of the host/runtime environment and not an attribute of a
container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Users of libcontainer other than runc may also require parsing and
converting specification configuration files.
Since runc cannot be imported, move the relevant functions and
definitions to a separate package, libcontainer/specconv.
Signed-off-by: Ido Yariv <ido@wizery.com>