libcontainer: intelrdt: add Intel RDT/MBA docs in SPEC.md
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
This commit is contained in:
parent
bd90541666
commit
c1cece7e23
|
@ -156,17 +156,21 @@ init process will block waiting for the parent to finish setup.
|
|||
|
||||
### IntelRdt
|
||||
|
||||
Intel platforms with new Xeon CPU support Intel Resource Director Technology
|
||||
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
|
||||
currently supports L3 cache resource allocation.
|
||||
Intel platforms with new Xeon CPU support Resource Director Technology (RDT).
|
||||
Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA) are
|
||||
two sub-features of RDT.
|
||||
|
||||
This feature provides a way for the software to restrict cache allocation to a
|
||||
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
|
||||
The different subsets are identified by class of service (CLOS) and each CLOS
|
||||
has a capacity bitmask (CBM).
|
||||
Cache Allocation Technology (CAT) provides a way for the software to restrict
|
||||
cache allocation to a defined 'subset' of L3 cache which may be overlapping
|
||||
with other 'subsets'. The different subsets are identified by class of
|
||||
service (CLOS) and each CLOS has a capacity bitmask (CBM).
|
||||
|
||||
It can be used to handle L3 cache resource allocation for containers if
|
||||
hardware and kernel support Intel RDT/CAT.
|
||||
Memory Bandwidth Allocation (MBA) provides indirect and approximate throttle
|
||||
over memory bandwidth for the software. A user controls the resource by
|
||||
indicating the percentage of maximum memory bandwidth.
|
||||
|
||||
It can be used to handle L3 cache and memory bandwidth resources allocation
|
||||
for containers if hardware and kernel support Intel RDT CAT and MBA features.
|
||||
|
||||
In Linux 4.10 kernel or newer, the interface is defined and exposed via
|
||||
"resource control" filesystem, which is a "cgroup-like" interface.
|
||||
|
@ -175,6 +179,9 @@ Comparing with cgroups, it has similar process management lifecycle and
|
|||
interfaces in a container. But unlike cgroups' hierarchy, it has single level
|
||||
filesystem layout.
|
||||
|
||||
CAT and MBA features are introduced in Linux 4.10 and 4.12 kernel via
|
||||
"resource control" filesystem.
|
||||
|
||||
Intel RDT "resource control" filesystem hierarchy:
|
||||
```
|
||||
mount -t resctrl resctrl /sys/fs/resctrl
|
||||
|
@ -182,59 +189,85 @@ tree /sys/fs/resctrl
|
|||
/sys/fs/resctrl/
|
||||
|-- info
|
||||
| |-- L3
|
||||
| |-- cbm_mask
|
||||
| |-- min_cbm_bits
|
||||
| | |-- cbm_mask
|
||||
| | |-- min_cbm_bits
|
||||
| | |-- num_closids
|
||||
| |-- MB
|
||||
| |-- bandwidth_gran
|
||||
| |-- delay_linear
|
||||
| |-- min_bandwidth
|
||||
| |-- num_closids
|
||||
|-- cpus
|
||||
|-- ...
|
||||
|-- schemata
|
||||
|-- tasks
|
||||
|-- <container_id>
|
||||
|-- cpus
|
||||
|-- ...
|
||||
|-- schemata
|
||||
|-- tasks
|
||||
|
||||
```
|
||||
|
||||
For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
|
||||
resource constraints.
|
||||
For runc, we can make use of `tasks` and `schemata` configuration for L3
|
||||
cache and memory bandwidth resources constraints.
|
||||
|
||||
The file `tasks` has a list of tasks that belongs to this group (e.g.,
|
||||
<container_id>" group). Tasks can be added to a group by writing the task ID
|
||||
to the "tasks" file (which will automatically remove them from the previous
|
||||
to the "tasks" file (which will automatically remove them from the previous
|
||||
group to which they belonged). New tasks created by fork(2) and clone(2) are
|
||||
added to the same group as their parent. If a pid is not in any sub group, it
|
||||
is in root group.
|
||||
added to the same group as their parent.
|
||||
|
||||
The file `schemata` has allocation masks/values for L3 cache on each socket,
|
||||
which contains L3 cache id and capacity bitmask (CBM).
|
||||
The file `schemata` has a list of all the resources available to this group.
|
||||
Each resource (L3 cache, memory bandwidth) has its own line and format.
|
||||
|
||||
L3 cache schema:
|
||||
It has allocation bitmasks/values for L3 cache on each socket, which
|
||||
contains L3 cache id and capacity bitmask (CBM).
|
||||
```
|
||||
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
|
||||
```
|
||||
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
|
||||
Which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
|
||||
For example, on a two-socket machine, the schema line could be "L3:0=ff;1=c0"
|
||||
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
|
||||
|
||||
The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
|
||||
be set is less than the max bit. The max bits in the CBM is varied among
|
||||
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
|
||||
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
|
||||
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
|
||||
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
|
||||
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
|
||||
supported Intel CPU models. Kernel will check if it is valid when writing.
|
||||
e.g., default value 0xfffff in root indicates the max bits of CBM is 20
|
||||
bits, which mapping to entire L3 cache capacity. Some valid CBM values to
|
||||
set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.
|
||||
|
||||
For more information about Intel RDT/CAT kernel interface:
|
||||
Memory bandwidth schema:
|
||||
It has allocation values for memory bandwidth on each socket, which contains
|
||||
L3 cache id and memory bandwidth percentage.
|
||||
```
|
||||
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
|
||||
```
|
||||
For example, on a two-socket machine, the schema line could be "MB:0=20;1=70"
|
||||
|
||||
The minimum bandwidth percentage value for each CPU model is predefined and
|
||||
can be looked up through "info/MB/min_bandwidth". The bandwidth granularity
|
||||
that is allocated is also dependent on the CPU model and can be looked up at
|
||||
"info/MB/bandwidth_gran". The available bandwidth control steps are:
|
||||
min_bw + N * bw_gran. Intermediate values are rounded to the next control
|
||||
step available on the hardware.
|
||||
|
||||
For more information about Intel RDT kernel interface:
|
||||
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
|
||||
|
||||
An example for runc:
|
||||
```
|
||||
An example for runc:
|
||||
Consider a two-socket machine with two L3 caches where the default CBM is
|
||||
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
|
||||
inside the container only have access to the "upper" 80% of L3 cache id 0 and
|
||||
the "lower" 50% L3 cache id 1:
|
||||
0x7ff and the max CBM length is 11 bits, and minimum memory bandwidth of 10%
|
||||
with a memory bandwidth granularity of 10%.
|
||||
|
||||
Tasks inside the container only have access to the "upper" 7/11 of L3 cache
|
||||
on socket 0 and the "lower" 5/11 L3 cache on socket 1, and may use a
|
||||
maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.
|
||||
|
||||
"linux": {
|
||||
"intelRdt": {
|
||||
"l3CacheSchema": "L3:0=ffff0;1=3ff"
|
||||
}
|
||||
"intelRdt": {
|
||||
"closID": "guaranteed_group",
|
||||
"l3CacheSchema": "L3:0=7f0;1=1f",
|
||||
"memBwSchema": "MB:0=20;1=70"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
|
Loading…
Reference in New Issue