I think runtime should generate an error, if devices has
duplicated device path.
Because we don't know which one is really needed.
Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
This restriction originally landed via 02b456e9 (Clarify behavior
around namespaces paths, 2015-09-08, #158). The hostname case landed
via 66a0543e (config: Require a new UTS namespace for config.json's
hostname, 2015-10-05, #214) citing the namespace restriction. The
restriciton extended to runtime namespaces in 01c2d55f (config-linux:
Extend no-tweak requirement to runtime namespaces, 2016-08-24, #538).
There was a proposal in-flight to get config-wide consistency around
the no-tweaking concept [1].
In today's meeting, the maintainer consensus was to strike the
no-tweaking restriction [2], which is what I've done here. I've
removed the ROADMAP entry because this gives folks a way to adjust
existing containers (launch a new container which joins and tweaks the
original).
The hostname entry still mentions the UTS namespace to provide a guard
against accidental foot-gunning. There was no no-tweaking language
for properties related to other namespaces (e.g. 'mounts').
Maybe the other namespaces have more obvious names.
[1]: https://github.com/opencontainers/runtime-spec/pull/540
[2]: http://ircbot.wl.linuxfoundation.org/meetings/opencontainers/2017/opencontainers.2017-01-11-22.04.log.html#l-117
Signed-off-by: W. Trevor King <wking@tremily.us>
Carry #499
For these values, cgroup kernal APIs accept -1 to set
them as unlimited, as docker and runc all support
update resources, we should not set drawbacks in spec.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In all of these cases we want to use the RFC 2119 semantics.
Generated with:
$ sed -i 's/required/REQUIRED/g' config*.md
after which I rolled back the change for:
...controllers required to fulfill...
since that was already MUSTed.
Signed-off-by: W. Trevor King <wking@tremily.us>
In all of these cases we want to use the RFC 2119 semantics.
Generated with:
$ sed -i 's/optional/OPTIONAL/g' config*.md
Signed-off-by: W. Trevor King <wking@tremily.us>
Since [1] we've required runtimes to error out if a configuration
joins an existing namespace and adjusts it somehow (e.g. joining an
existing UTC namespace and setting 'hostname', [2]). However, the
wording from [1] (which survives untouched in the current master) only
talked about "when a path is specified". I see two possible
approaches for internal consistency:
a. Lift the OCI restriction and allow join-and-tweak [3] where the
kernel supports it. When we landed the current restriction, the
main issues seemed to be "we don't have a clear use-case for join
and tweak" [4] (although see [5]) and "this is a foot gun [6,7]"
(I'd rather leave policy to higher-level config linters).
b. Extend the OCI restriction to all cases where the runtime does not
create a new namespace. Besides the already covered "namespace
entry exists and includes 'path'", we'd also want to forbid configs
that were missing the relevant namespace(s) entirely (in which case
the container inherits the host namespace(s)).
I'm partial to (a) in the long run, but (b) is less of a shift from
the current spec and likely a better choice for a pending 1.0.
This commit implements (b).
It also makes it explicit that not listing a namespace type will cause
the container to inherit the runtime namespace of that type.
[1]: https://github.com/opencontainers/runtime-spec/pull/158
Subject: Clarify behavior around namespaces paths
[2]: https://github.com/opencontainers/runtime-spec/pull/214
Subject: config: Require a new UTS namespace for config.json's hostname
[3]: https://github.com/opencontainers/runtime-spec/pull/158#issuecomment-138687129
[4]: https://github.com/opencontainers/runtime-spec/pull/158#issuecomment-138997548
[5]: https://github.com/opencontainers/runtime-spec/pull/305
Subject: [Tracker] Live Container Updates
[6]: https://github.com/opencontainers/runtime-spec/pull/158#issuecomment-139106987
[7]: https://github.com/opencontainers/runtime-spec/issues/537#issuecomment-242132288
Subject: [linux] Tweaking host namespaces?
Signed-off-by: W. Trevor King <wking@tremily.us>
And mark it omitempty to avoid:
$ ocitools generate --template <(echo '{"linux": {"resources": {}}}') | jq .linux
{
"resources": {
"devices": null
}
}
Signed-off-by: W. Trevor King <wking@tremily.us>
To match the omitempty which the Go property has had since 28cc4239
(add omitempty to 'Device' and 'Namespace', 2016-03-10, #340).
Signed-off-by: W. Trevor King <wking@tremily.us>
Make explicit that runtimes only have to attach to the bare minimum
number of cgroups in order to fulfil the users' requirements. However,
runtimes are of course allowed to attach to more than the bare minimum.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Clarify some of the confusion with cgroupsPath. Due to systemd, we
cannot require that relative paths be treated in any specific way. In
addition, add a line stating that not all values of cgroupsPath are
required to be valid (and that runtimes must error out if they have an
invalid cgroup path). However, any given value of cgroupsPath should
provide consistent results.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Some of the wording was a bit clumsy (and incorrect, by conflating
different concepts in control groups as "cgroups").
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The cgroup namespace is a new kernel feature available in 4.6+ that
allows a container to isolate its cgroup hierarchy. This currently only
allows for hiding information from /proc/self/cgroup, and mounting
cgroupfs as an unprivileged user. In the future, this namespace may
allow for subtree management by a container.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The user-namespace restriction isn't about the root filesystem in
particular. For example, if you bind mount in a second filesystem,
the runtime shouldn't adjust ownership on that filesystem either.
I've also adjusted the old "permissions" to "ownership", since that
more clearly reflects the fields (user and group) that you would
modify if you wanted to adjust for user namespacing.
Signed-off-by: W. Trevor King <wking@tremily.us>
Avoid the dangling 'using' from e9a6d948 (cgroup: Add support for
memory.kmem.tcp.limit_in_bytes, 2015-10-26, #235). I've tried to echo
the kernel docs by mentioning buffer memory [1]. I'd personally
prefer linking to the kernel docs and mentioning
memory.kmem.tcp.limit_in_bytes, but that seemed like too big of a
break from the existing style for this commit.
[1]: https://kernel.org/doc/Documentation/cgroup-v1/memory.txt
Signed-off-by: W. Trevor King <wking@tremily.us>
Fixes#320
This adds the maskedPaths and readonlyPaths fields to the spec so that
proper masking and setting of files in /proc can be configured.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Avoid trouble with situations like:
# mount --bind /mnt/test /mnt/test
# mount --make-rprivate /mnt/test
# touch /mnt/test/mnt /mnt/test/user
# mount --bind /proc/123/ns/mnt /mnt/test/mnt
# mount --bind /proc/123/ns/user /mnt/test/user
# nsenter --mount=/proc/123/ns/mnt --user /proc/123/ns/user sh
which uses the required private mount for binding mount namespace
references [1,2,3]. We want to avoid:
1. Runtime opens /mnt/test/mnt as fd 3.
2. Runtime joins the mount namespace referenced by fd 3.
3. Runtime fails to open /mnt/test/user, because /mnt/test is not
visible in the current mount namespace.
and instead get runtime authors to setup flows like:
1. Runtime opens /mnt/test/mnt as fd 3.
2. Runtime opens /mnt/test/user as fd 4.
3. Runtime joins the mount namespace referenced by fd 3.
4. Runtime joins the user namespace referenced by fd 4.
This also applies to new namespace creation. We want to avoid:
1. Runtime clones a container process with a new mount namespace.
2c. Container process fails to open /mnt/test/user, because /mnt/test
is not visible in the current mount namespace.
in favor of something like:
1. Runtime opens /mnt/test/user as fd 3.
2. Runtime clones a container process with a new mount namespace.
3h. Host process closes unneeded fd 3.
3c. Container process joins the user namespace referenced by fd 3.
I also define runtime and container namespaces, so we have consistent
terminology. I prefer:
* host namespace: a namespace you are in when you invoke the runtime
* host process: the runtime process invoked by the user
* container process: the process created by a clone call in the host
process which will eventually execute the user-configured process.
Both the host and container processes are running runtime code
(although the container process eventually transitions to
user-configured code), so I find "runtime process", "runtime
namespace", etc. to be imprecise. However, the maintainer consensus
is for "runtime namespace" [4,5], so that's what we're going with
here.
[1]: http://karelzak.blogspot.com/2015/04/persistent-namespaces.html
[2]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4ce5d2b1a8fde84c0eebe70652cf28b9beda6b4e
[3]: http://mid.gmane.org/87haeahkzc.fsf@xmission.com
[4]: https://github.com/opencontainers/specs/pull/275#discussion_r48057211
[5]: https://github.com/opencontainers/specs/pull/275#discussion_r48324264
Signed-off-by: W. Trevor King <wking@tremily.us>
This moves process specific settings like caps, apparmor, and selinux
process label onto the process structure to allow the same settings to
be changed at exec time.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
In json, os.FileMode would be presented as a uint32, which
is decimal. Otherwise we'll get error:
`invalid character '6' after object key:value pair`
when unmarshal the json file.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Fixes: opencontainers/runc#566
For type rune, we can assign char as 'c' in struct, but after
marshal, it'll be presented as int32. So in json config it needs
to be presented as a number which is not friendly to be identified.
Change it to string so that you can actually write "b", "c" in json
spec and you can easily know what type of device it is.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
With 34a9304a (Merge branch 'for-4.5' of
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup, 2016-01-13,
[1]), Linux restructured their cgroups documentation. This updated
all of our Documentation/cgroups references to match the new layout,
using reference-style links [2] which let us collect link label
definitions at the bottom of the file. That makes the spec source
easier to read (no distracting URLs in the middle of a sentence) and
makes the URLs easier to update (only one place to check / fix).
[1]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8
[2]: http://daringfireball.net/projects/markdown/syntax#link
Signed-off-by: W. Trevor King <wking@tremily.us>
With mknod entries in linux.devices and cgroups entries in
linux.resources.devices. Background discussion in [1].
For specifying device cgroups independent of device creation. This
makes it easy to distinguish between configs that call for cgroup
adjustments (which have linux.resources entries) from those that
don't. Without this split, folks interested in making that
distinction would have to parse the device section to determine if it
included cgroup changes. This will also make it easy to drop either
portion (mknod [2] or cgroups [3]) independently of the other if the
project decides to do so.
Using seperate sections for mknod and cgroups also allows us to avoid
the complicated validation rules needed for the combined format
mknod/cgroup [4].
Now that there is a section specific to supplying devices, I shifted
the default device listing over from config-linux [5]. The /dev/ptmx
entry is a bit awkward, since it's not a device, but it seemed to fit
better over here. But I would also be fine leaving it with the other
mounts in config-linux.
fileMode, uid, and gid are optional, because mknod(2) doesn't need
them and specifies the handling when they aren't set [6,7].
Similarly, major/minor numbers are only required for S_IFCHR and
S_IFBLK [6]. I've left off wording about required runtime behavior
for unset values, because I'd rather address that with a blanket rule
[8].
For the cgroup, access is optional because the kernel docs show an
example that doesn't write an access field to the devices.deny file
[9]. The current kernel docs don't go into much detail on this
behavior (I expect unset and 'rwm' are equivalent), but if the kernel
doesn't need a value written, the spec should get out of the way and
allow users to not specify a value.
The reference links are sorted into two blocks, with kernel-doc links
sorted alphabetically followed by man pages sorted alphabetically by
section. The cgroup link is new since 2016-01-13 [10].
[1]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/y_Fsa2_jJaM
Subject: Separate config entries for device mknod and cgroups?
Date: Mon, 5 Oct 2015 12:46:55 -0700
Message-ID: <20151005194655.GN28418@odin.tremily.us>
[2]: https://github.com/opencontainers/specs/pull/98
[3]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/qWHoKs8Fsrk
Subject: removal of cgroups from the OCI Linux spec
Date: Wed, 28 Oct 2015 17:01:59 +0000
Message-ID: <CAD2oYtO1RMCcUp52w-xXemzDTs+J6t4hS5Mm4mX+uBnVONGDfA@mail.gmail.com>
[4]: https://github.com/opencontainers/specs/pull/101
[5]: https://github.com/opencontainers/specs/pull/171#discussion_r41190655
[6]: http://man7.org/linux/man-pages/man2/mknod.2.html#DESCRIPTION
[7]: https://github.com/opencontainers/specs/pull/298/files#r51053835
[8]: https://github.com/opencontainers/specs/pull/285#issuecomment-167823651
[9]: https://kernel.org/doc/Documentation/cgroup-v1/devices.txt
[10]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8
Signed-off-by: W. Trevor King <wking@tremily.us>
Reverting 7232e4b1 (specs: introduce the concept of a runtime.json,
2015-07-30, #88) after discussion on the mailing list [1]. The main
reason is that it's hard to draw a clear line around "inherently
runtime-specific" or "non-portable", so we shouldn't try to do that in
the spec. Folks who want to flag settings as non-portable for their
own system are welcome to do so (e.g. "we will clobber 'hooks' in
bundles we run") are welcome to do so, but we don't have to have
to split the config into multiple files to do that.
There have been a number of additional changes since #88, so this
isn't a pure Git reversion. Besides copy-pasting and the associated
link-target updates, I've:
* Restored path -> destination, now that the mount type contains both
source and target paths again. I'd prefer 'target' to 'destination'
to match mount(2), but the pre-7232e4b1 phrasing was 'destination'
(possibly due to Windows using 'target' for the source?).
* Restored the Windows mount example to its pre-7232e4b1 content.
* Removed required mounts from the config example (requirements landed
in 3848a238, config-linux: specify the default devices/filesystems
available, 2015-09-09, #164), because specifying those mounts in the
config is now redundant.
* Used headers (vs. bold paragraphs) to set off mount examples so we
get link anchors in the rendered Markdown.
* Replaced references to runtime.json with references to config.json.
[1]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/0QbyJDM9fWY
Subject: Single, unified config file (i.e. rolling back specs#88)
Date: Wed, 4 Nov 2015 09:53:20 -0800
Message-ID: <20151104175320.GC24652@odin.tremily.us>
Signed-off-by: W. Trevor King <wking@tremily.us>