This partially revert #648 , after a second thought, I think we
should use specs value the same as kernel API input, see:
https://github.com/opencontainers/runtime-spec/issues/692#issuecomment-281889852
For memory and hugetlb limits *.limit_in_bytes, cgroup APIs take the values
as string, but the parsed values are unsigned long, see:
https://github.com/torvalds/linux/blob/v4.10/mm/page_counter.c#L175-L193
For `cpu.cfs_quota_us` and `cpu.rt_runtime_us`, cgroup APIs take the input
value as signed long long, while `cpu.cfs_period_us` and `cpu.rt_periof_us`
take the input value as unsigned long long.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
I think runtime should generate an error, if devices has
duplicated device path.
Because we don't know which one is really needed.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
This restriction originally landed via 02b456e9 (Clarify behavior
around namespaces paths, 2015-09-08, #158). The hostname case landed
via 66a0543e (config: Require a new UTS namespace for config.json's
hostname, 2015-10-05, #214) citing the namespace restriction. The
restriciton extended to runtime namespaces in 01c2d55f (config-linux:
Extend no-tweak requirement to runtime namespaces, 2016-08-24, #538).
There was a proposal in-flight to get config-wide consistency around
the no-tweaking concept [1].
In today's meeting, the maintainer consensus was to strike the
no-tweaking restriction [2], which is what I've done here. I've
removed the ROADMAP entry because this gives folks a way to adjust
existing containers (launch a new container which joins and tweaks the
original).
The hostname entry still mentions the UTS namespace to provide a guard
against accidental foot-gunning. There was no no-tweaking language
for properties related to other namespaces (e.g. 'mounts').
Maybe the other namespaces have more obvious names.
[1]: https://github.com/opencontainers/runtime-spec/pull/540
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2017/opencontainers.2017-01-11-22.04.log.html#l-117
Signed-off-by: W. Trevor King <wking@tremily.us>
Carry #499
For these values, cgroup kernal APIs accept -1 to set
them as unlimited, as docker and runc all support
update resources, we should not set drawbacks in spec.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In all of these cases we want to use the RFC 2119 semantics.
Generated with:
$ sed -i 's/required/REQUIRED/g' config*.md
after which I rolled back the change for:
...controllers required to fulfill...
since that was already MUSTed.
Signed-off-by: W. Trevor King <wking@tremily.us>
In all of these cases we want to use the RFC 2119 semantics.
Generated with:
$ sed -i 's/optional/OPTIONAL/g' config*.md
Signed-off-by: W. Trevor King <wking@tremily.us>
Since [1] we've required runtimes to error out if a configuration
joins an existing namespace and adjusts it somehow (e.g. joining an
existing UTC namespace and setting 'hostname', [2]). However, the
wording from [1] (which survives untouched in the current master) only
talked about "when a path is specified". I see two possible
approaches for internal consistency:
a. Lift the OCI restriction and allow join-and-tweak [3] where the
kernel supports it. When we landed the current restriction, the
main issues seemed to be "we don't have a clear use-case for join
and tweak" [4] (although see [5]) and "this is a foot gun [6,7]"
(I'd rather leave policy to higher-level config linters).
b. Extend the OCI restriction to all cases where the runtime does not
create a new namespace. Besides the already covered "namespace
entry exists and includes 'path'", we'd also want to forbid configs
that were missing the relevant namespace(s) entirely (in which case
the container inherits the host namespace(s)).
I'm partial to (a) in the long run, but (b) is less of a shift from
the current spec and likely a better choice for a pending 1.0.
This commit implements (b).
It also makes it explicit that not listing a namespace type will cause
the container to inherit the runtime namespace of that type.
[1]: https://github.com/opencontainers/runtime-spec/pull/158
Subject: Clarify behavior around namespaces paths
[2]: https://github.com/opencontainers/runtime-spec/pull/214
Subject: config: Require a new UTS namespace for config.json's hostname
[3]: https://github.com/opencontainers/runtime-spec/pull/158#issuecomment-138687129
[4]: https://github.com/opencontainers/runtime-spec/pull/158#issuecomment-138997548
[5]: https://github.com/opencontainers/runtime-spec/pull/305
Subject: [Tracker] Live Container Updates
[6]: https://github.com/opencontainers/runtime-spec/pull/158#issuecomment-139106987
[7]: https://github.com/opencontainers/runtime-spec/issues/537#issuecomment-242132288
Subject: [linux] Tweaking host namespaces?
Signed-off-by: W. Trevor King <wking@tremily.us>
And mark it omitempty to avoid:
$ ocitools generate --template <(echo '{"linux": {"resources": {}}}') | jq .linux
{
"resources": {
"devices": null
}
}
Signed-off-by: W. Trevor King <wking@tremily.us>
To match the omitempty which the Go property has had since 28cc4239
(add omitempty to 'Device' and 'Namespace', 2016-03-10, #340).
Signed-off-by: W. Trevor King <wking@tremily.us>
Make explicit that runtimes only have to attach to the bare minimum
number of cgroups in order to fulfil the users' requirements. However,
runtimes are of course allowed to attach to more than the bare minimum.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Clarify some of the confusion with cgroupsPath. Due to systemd, we
cannot require that relative paths be treated in any specific way. In
addition, add a line stating that not all values of cgroupsPath are
required to be valid (and that runtimes must error out if they have an
invalid cgroup path). However, any given value of cgroupsPath should
provide consistent results.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Some of the wording was a bit clumsy (and incorrect, by conflating
different concepts in control groups as "cgroups").
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The cgroup namespace is a new kernel feature available in 4.6+ that
allows a container to isolate its cgroup hierarchy. This currently only
allows for hiding information from /proc/self/cgroup, and mounting
cgroupfs as an unprivileged user. In the future, this namespace may
allow for subtree management by a container.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The user-namespace restriction isn't about the root filesystem in
particular. For example, if you bind mount in a second filesystem,
the runtime shouldn't adjust ownership on that filesystem either.
I've also adjusted the old "permissions" to "ownership", since that
more clearly reflects the fields (user and group) that you would
modify if you wanted to adjust for user namespacing.
Signed-off-by: W. Trevor King <wking@tremily.us>
Avoid the dangling 'using' from e9a6d948 (cgroup: Add support for
memory.kmem.tcp.limit_in_bytes, 2015-10-26, #235). I've tried to echo
the kernel docs by mentioning buffer memory [1]. I'd personally
prefer linking to the kernel docs and mentioning
memory.kmem.tcp.limit_in_bytes, but that seemed like too big of a
break from the existing style for this commit.
[1]: https://kernel.org/doc/Documentation/cgroup-v1/memory.txt
Signed-off-by: W. Trevor King <wking@tremily.us>
Fixes#320
This adds the maskedPaths and readonlyPaths fields to the spec so that
proper masking and setting of files in /proc can be configured.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Avoid trouble with situations like:
# mount --bind /mnt/test /mnt/test
# mount --make-rprivate /mnt/test
# touch /mnt/test/mnt /mnt/test/user
# mount --bind /proc/123/ns/mnt /mnt/test/mnt
# mount --bind /proc/123/ns/user /mnt/test/user
# nsenter --mount=/proc/123/ns/mnt --user /proc/123/ns/user sh
which uses the required private mount for binding mount namespace
references [1,2,3]. We want to avoid:
1. Runtime opens /mnt/test/mnt as fd 3.
2. Runtime joins the mount namespace referenced by fd 3.
3. Runtime fails to open /mnt/test/user, because /mnt/test is not
visible in the current mount namespace.
and instead get runtime authors to setup flows like:
1. Runtime opens /mnt/test/mnt as fd 3.
2. Runtime opens /mnt/test/user as fd 4.
3. Runtime joins the mount namespace referenced by fd 3.
4. Runtime joins the user namespace referenced by fd 4.
This also applies to new namespace creation. We want to avoid:
1. Runtime clones a container process with a new mount namespace.
2c. Container process fails to open /mnt/test/user, because /mnt/test
is not visible in the current mount namespace.
in favor of something like:
1. Runtime opens /mnt/test/user as fd 3.
2. Runtime clones a container process with a new mount namespace.
3h. Host process closes unneeded fd 3.
3c. Container process joins the user namespace referenced by fd 3.
I also define runtime and container namespaces, so we have consistent
terminology. I prefer:
* host namespace: a namespace you are in when you invoke the runtime
* host process: the runtime process invoked by the user
* container process: the process created by a clone call in the host
process which will eventually execute the user-configured process.
Both the host and container processes are running runtime code
(although the container process eventually transitions to
user-configured code), so I find "runtime process", "runtime
namespace", etc. to be imprecise. However, the maintainer consensus
is for "runtime namespace" [4,5], so that's what we're going with
here.
[1]: http://karelzak.blogspot.com/2015/04/persistent-namespaces.html
[2]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4ce5d2b1a8fde84c0eebe70652cf28b9beda6b4e
[3]: http://mid.gmane.org/87haeahkzc.fsf@xmission.com
[4]: https://github.com/opencontainers/specs/pull/275#discussion_r48057211
[5]: https://github.com/opencontainers/specs/pull/275#discussion_r48324264
Signed-off-by: W. Trevor King <wking@tremily.us>
This moves process specific settings like caps, apparmor, and selinux
process label onto the process structure to allow the same settings to
be changed at exec time.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>