Clarify some of the confusion with cgroupsPath. Due to systemd, we
cannot require that relative paths be treated in any specific way. In
addition, add a line stating that not all values of cgroupsPath are
required to be valid (and that runtimes must error out if they have an
invalid cgroup path). However, any given value of cgroupsPath should
provide consistent results.
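
As a rough illustration of "error out on invalid values, but stay consistent for any given value", here is a minimal Go sketch. The function name resolveCgroupsPath, the runtimeParent argument, and the rule that ".." elements are invalid are all assumptions made for this example; they are not taken from the spec.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveCgroupsPath is a hypothetical helper. It maps a cgroupsPath value
// to a cgroup location, and returns an error for values that this
// (assumed) runtime treats as invalid.
func resolveCgroupsPath(cgroupsPath, runtimeParent string) (string, error) {
	// Assumed policy: any ".." element makes the value invalid.
	for _, elem := range strings.Split(cgroupsPath, "/") {
		if elem == ".." {
			return "", fmt.Errorf("invalid cgroupsPath %q: contains \"..\"", cgroupsPath)
		}
	}
	// Absolute paths are used as given (after cleaning).
	if path.IsAbs(cgroupsPath) {
		return path.Clean(cgroupsPath), nil
	}
	// Relative paths are interpreted however the runtime likes; this sketch
	// places them under a runtime-chosen parent. The same input always
	// yields the same output.
	return path.Join(runtimeParent, cgroupsPath), nil
}

func main() {
	for _, p := range []string{"/machines/c1", "c1", "../escape"} {
		resolved, err := resolveCgroupsPath(p, "/runtime")
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s\n", p, resolved)
	}
}
```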
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Some of the wording was a bit clumsy (and incorrect, by conflating
different concepts in control groups as "cgroups").
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The cgroup namespace is a new kernel feature available in 4.6+ that
allows a container to isolate its cgroup hierarchy. This currently only
allows for hiding information from /proc/self/cgroup, and mounting
cgroupfs as an unprivileged user. In the future, this namespace may
allow for subtree management by a container.
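
To make the /proc/self/cgroup point more concrete, the following Go sketch parses lines in the hierarchy-ID:controller-list:cgroup-path format. The sample lines and the type and function names are hypothetical; the second line shows roughly what a process might see for the same hierarchy from inside its own cgroup namespace.

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupEntry is one line of /proc/self/cgroup.
type cgroupEntry struct {
	HierarchyID string
	Controllers []string
	Path        string
}

// parseCgroupLine splits a line into its three colon-separated fields.
// SplitN with a limit of 3 keeps any colons inside the path intact.
func parseCgroupLine(line string) (cgroupEntry, error) {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return cgroupEntry{}, fmt.Errorf("malformed cgroup line: %q", line)
	}
	var controllers []string
	if parts[1] != "" {
		controllers = strings.Split(parts[1], ",")
	}
	return cgroupEntry{HierarchyID: parts[0], Controllers: controllers, Path: parts[2]}, nil
}

func main() {
	// Hypothetical sample data, not captured from a real system.
	for _, line := range []string{
		"4:memory:/example/container1", // view from the host
		"4:memory:/",                   // view from inside the namespace
	} {
		e, err := parseCgroupLine(line)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("controllers=%v path=%s\n", e.Controllers, e.Path)
	}
}
```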
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The user-namespace restriction isn't about the root filesystem in
particular. For example, if you bind mount in a second filesystem,
the runtime shouldn't adjust ownership on that filesystem either.
I've also adjusted the old "permissions" to "ownership", since that
more clearly reflects the fields (user and group) that you would
modify if you wanted to adjust for user namespacing.
Signed-off-by: W. Trevor King <wking@tremily.us>
Avoid the dangling 'using' from e9a6d948 (cgroup: Add support for
memory.kmem.tcp.limit_in_bytes, 2015-10-26, #235). I've tried to echo
the kernel docs by mentioning buffer memory [1]. I'd personally
prefer linking to the kernel docs and mentioning
memory.kmem.tcp.limit_in_bytes, but that seemed like too big of a
break from the existing style for this commit.
[1]: https://kernel.org/doc/Documentation/cgroup-v1/memory.txt
Signed-off-by: W. Trevor King <wking@tremily.us>
Fixes #320
This adds the maskedPaths and readonlyPaths fields to the spec so that
proper masking and setting of files in /proc can be configured.
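
A minimal Go sketch of how the two fields might be decoded. The struct name, the helper, and the /proc entries are illustrative assumptions; where the fields sit in the full config is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// linuxPaths holds only the two fields named in this commit.
// The type name is hypothetical.
type linuxPaths struct {
	MaskedPaths   []string `json:"maskedPaths,omitempty"`
	ReadonlyPaths []string `json:"readonlyPaths,omitempty"`
}

// parsePaths decodes a JSON fragment containing the two fields.
func parsePaths(data []byte) (linuxPaths, error) {
	var p linuxPaths
	err := json.Unmarshal(data, &p)
	return p, err
}

func main() {
	// The paths below are example values only.
	config := []byte(`{
		"maskedPaths": ["/proc/kcore"],
		"readonlyPaths": ["/proc/sys"]
	}`)
	p, err := parsePaths(config)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("masked:", p.MaskedPaths)
	fmt.Println("read-only:", p.ReadonlyPaths)
}
```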
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Avoid trouble with situations like:
# mount --bind /mnt/test /mnt/test
# mount --make-rprivate /mnt/test
# touch /mnt/test/mnt /mnt/test/user
# mount --bind /proc/123/ns/mnt /mnt/test/mnt
# mount --bind /proc/123/ns/user /mnt/test/user
# nsenter --mount=/proc/123/ns/mnt --user=/proc/123/ns/user sh
which uses the required private mount for binding mount namespace
references [1,2,3]. We want to avoid:
1. Runtime opens /mnt/test/mnt as fd 3.
2. Runtime joins the mount namespace referenced by fd 3.
3. Runtime fails to open /mnt/test/user, because /mnt/test is not
visible in the current mount namespace.
and instead get runtime authors to set up flows like:
1. Runtime opens /mnt/test/mnt as fd 3.
2. Runtime opens /mnt/test/user as fd 4.
3. Runtime joins the mount namespace referenced by fd 3.
4. Runtime joins the user namespace referenced by fd 4.
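
The open-everything-first ordering can be sketched in Go as below. This is only an illustration: joinNamespaces is a hypothetical name, and the open and join steps are injected as functions so the ordering can be shown without privileges. In a real runtime, open would wrap something like os.Open and join would wrap setns(2).

```go
package main

import "fmt"

// joinNamespaces opens every namespace path before joining any of them,
// so later paths are resolved while they are still visible.
// Closing the descriptors afterwards is left out of this sketch.
func joinNamespaces(paths []string, open func(string) (int, error), join func(int) error) error {
	fds := make([]int, 0, len(paths))
	// Phase 1: resolve all paths in the current mount namespace.
	for _, p := range paths {
		fd, err := open(p)
		if err != nil {
			return fmt.Errorf("open %s: %w", p, err)
		}
		fds = append(fds, fd)
	}
	// Phase 2: join in order; no path lookups happen from here on.
	for i, fd := range fds {
		if err := join(fd); err != nil {
			return fmt.Errorf("join %s: %w", paths[i], err)
		}
	}
	return nil
}

func main() {
	next := 3 // stand-in for the next free file descriptor
	open := func(p string) (int, error) {
		fd := next
		next++
		fmt.Printf("open %s as fd %d\n", p, fd)
		return fd, nil
	}
	join := func(fd int) error {
		fmt.Printf("join namespace referenced by fd %d\n", fd)
		return nil
	}
	if err := joinNamespaces([]string{"/mnt/test/mnt", "/mnt/test/user"}, open, join); err != nil {
		fmt.Println("error:", err)
	}
}
```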
This also applies to new namespace creation. We want to avoid:
1. Runtime clones a container process with a new mount namespace.
2c. Container process fails to open /mnt/test/user, because /mnt/test
is not visible in the current mount namespace.
in favor of something like:
1. Runtime opens /mnt/test/user as fd 3.
2. Runtime clones a container process with a new mount namespace.
3h. Host process closes unneeded fd 3.
3c. Container process joins the user namespace referenced by fd 3.
I also define runtime and container namespaces, so we have consistent
terminology. I prefer:
* host namespace: a namespace you are in when you invoke the runtime
* host process: the runtime process invoked by the user
* container process: the process created by a clone call in the host
process which will eventually execute the user-configured process.
Both the host and container processes are running runtime code
(although the container process eventually transitions to
user-configured code), so I find "runtime process", "runtime
namespace", etc. to be imprecise. However, the maintainer consensus
is for "runtime namespace" [4,5], so that's what we're going with
here.
[1]: http://karelzak.blogspot.com/2015/04/persistent-namespaces.html
[2]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4ce5d2b1a8fde84c0eebe70652cf28b9beda6b4e
[3]: http://mid.gmane.org/87haeahkzc.fsf@xmission.com
[4]: https://github.com/opencontainers/specs/pull/275#discussion_r48057211
[5]: https://github.com/opencontainers/specs/pull/275#discussion_r48324264
Signed-off-by: W. Trevor King <wking@tremily.us>
This moves process specific settings like caps, apparmor, and selinux
process label onto the process structure to allow the same settings to
be changed at exec time.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
In JSON, os.FileMode is represented as a uint32, which must be written
in decimal. Otherwise (for example, with an octal literal) we get the error:
`invalid character '6' after object key:value pair`
when unmarshaling the JSON file.
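
A small Go sketch of the failure and the decimal form that avoids it. The struct and the fileMode key are assumptions for illustration; 438 is the decimal value of octal 0666.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// device is a hypothetical struct with a single os.FileMode field.
type device struct {
	FileMode os.FileMode `json:"fileMode"`
}

// parseDevice decodes a JSON document into a device.
func parseDevice(data string) (device, error) {
	var d device
	err := json.Unmarshal([]byte(data), &d)
	return d, err
}

func main() {
	// JSON numbers may not have a leading zero, so an octal-looking
	// literal is rejected by the parser.
	if _, err := parseDevice(`{"fileMode": 0666}`); err != nil {
		fmt.Println("octal rejected:", err)
	}
	// The same mode written in decimal decodes cleanly.
	d, err := parseDevice(`{"fileMode": 438}`)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("decimal accepted: %#o\n", uint32(d.FileMode))
}
```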
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Fixes: opencontainers/runc#566
For type rune, we can assign a char such as 'c' in the struct, but after
marshaling it is represented as an int32. So in the JSON config it has
to be written as a number, which is hard to recognize.
Change it to string so that you can actually write "b" or "c" in the JSON
spec and easily tell what type of device it is.
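
The difference is easy to see with encoding/json. The two struct types below are hypothetical stand-ins for the before and after shapes of the field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// runeDevice mimics the old shape: the type is a rune.
type runeDevice struct {
	Type rune `json:"type"`
}

// stringDevice mimics the new shape: the type is a string.
type stringDevice struct {
	Type string `json:"type"`
}

// marshal returns the JSON encoding of v, or the error text on failure.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(runeDevice{Type: 'c'}))   // {"type":99}
	fmt.Println(marshal(stringDevice{Type: "c"})) // {"type":"c"}
}
```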
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
With 34a9304a (Merge branch 'for-4.5' of
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup, 2016-01-13,
[1]), Linux restructured its cgroups documentation. This commit updates
all of our Documentation/cgroups references to match the new layout,
using reference-style links [2] which let us collect link label
definitions at the bottom of the file. That makes the spec source
easier to read (no distracting URLs in the middle of a sentence) and
makes the URLs easier to update (only one place to check / fix).
[1]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8
[2]: http://daringfireball.net/projects/markdown/syntax#link
Signed-off-by: W. Trevor King <wking@tremily.us>
With mknod entries in linux.devices and cgroups entries in
linux.resources.devices. Background discussion in [1].
For specifying device cgroups independent of device creation. This
makes it easy to distinguish between configs that call for cgroup
adjustments (which have linux.resources entries) from those that
don't. Without this split, folks interested in making that
distinction would have to parse the device section to determine if it
included cgroup changes. This will also make it easy to drop either
portion (mknod [2] or cgroups [3]) independently of the other if the
project decides to do so.
Using separate sections for mknod and cgroups also allows us to avoid
the complicated validation rules needed for the combined format
mknod/cgroup [4].
Now that there is a section specific to supplying devices, I shifted
the default device listing over from config-linux [5]. The /dev/ptmx
entry is a bit awkward, since it's not a device, but it seemed to fit
better over here. But I would also be fine leaving it with the other
mounts in config-linux.
fileMode, uid, and gid are optional, because mknod(2) doesn't need
them and specifies the handling when they aren't set [6,7].
Similarly, major/minor numbers are only required for S_IFCHR and
S_IFBLK [6]. I've left off wording about required runtime behavior
for unset values, because I'd rather address that with a blanket rule
[8].
For the cgroup, access is optional because the kernel docs show an
example that doesn't write an access field to the devices.deny file
[9]. The current kernel docs don't go into much detail on this
behavior (I expect unset and 'rwm' are equivalent), but if the kernel
doesn't need a value written, the spec should get out of the way and
allow users to not specify a value.
The reference links are sorted into two blocks, with kernel-doc links
sorted alphabetically followed by man pages sorted alphabetically by
section. The cgroup link is new since 2016-01-13 [10].
[1]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/y_Fsa2_jJaM
Subject: Separate config entries for device mknod and cgroups?
Date: Mon, 5 Oct 2015 12:46:55 -0700
Message-ID: <20151005194655.GN28418@odin.tremily.us>
[2]: https://github.com/opencontainers/specs/pull/98
[3]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/qWHoKs8Fsrk
Subject: removal of cgroups from the OCI Linux spec
Date: Wed, 28 Oct 2015 17:01:59 +0000
Message-ID: <CAD2oYtO1RMCcUp52w-xXemzDTs+J6t4hS5Mm4mX+uBnVONGDfA@mail.gmail.com>
[4]: https://github.com/opencontainers/specs/pull/101
[5]: https://github.com/opencontainers/specs/pull/171#discussion_r41190655
[6]: http://man7.org/linux/man-pages/man2/mknod.2.html#DESCRIPTION
[7]: https://github.com/opencontainers/specs/pull/298/files#r51053835
[8]: https://github.com/opencontainers/specs/pull/285#issuecomment-167823651
[9]: https://kernel.org/doc/Documentation/cgroup-v1/devices.txt
[10]: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=34a9304a96d6351c2d35dcdc9293258378fc0bd8
Signed-off-by: W. Trevor King <wking@tremily.us>
Reverting 7232e4b1 (specs: introduce the concept of a runtime.json,
2015-07-30, #88) after discussion on the mailing list [1]. The main
reason is that it's hard to draw a clear line around "inherently
runtime-specific" or "non-portable", so we shouldn't try to do that in
the spec. Folks who want to flag settings as non-portable for their
own system (e.g. "we will clobber 'hooks' in bundles we run") are
welcome to do so, but we don't have to split the config into multiple
files to do that.
There have been a number of additional changes since #88, so this
isn't a pure Git reversion. Besides copy-pasting and the associated
link-target updates, I've:
* Restored path -> destination, now that the mount type contains both
source and target paths again. I'd prefer 'target' to 'destination'
to match mount(2), but the pre-7232e4b1 phrasing was 'destination'
(possibly due to Windows using 'target' for the source?).
* Restored the Windows mount example to its pre-7232e4b1 content.
* Removed required mounts from the config example (requirements landed
in 3848a238, config-linux: specify the default devices/filesystems
available, 2015-09-09, #164), because specifying those mounts in the
config is now redundant.
* Used headers (vs. bold paragraphs) to set off mount examples so we
get link anchors in the rendered Markdown.
* Replaced references to runtime.json with references to config.json.
[1]: https://groups.google.com/a/opencontainers.org/forum/#!topic/dev/0QbyJDM9fWY
Subject: Single, unified config file (i.e. rolling back specs#88)
Date: Wed, 4 Nov 2015 09:53:20 -0800
Message-ID: <20151104175320.GC24652@odin.tremily.us>
Signed-off-by: W. Trevor King <wking@tremily.us>
There are two RootfsPropagation fields: Linux.RootfsPropagation and
LinuxRuntime.RootfsPropagation. They are duplicates, so one of them
should be removed.
RootfsPropagation is definitely a runtime-specific configuration, so
we remove Linux.RootfsPropagation and keep the LinuxRuntime one.
Its description is moved from config-linux.md to
runtime-config-linux.md.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Based on our discussion in-person yesterday it seems necessary to
separate the concept of runtime configuration from application
configuration. There are a few motivators:
- To support runtime updates of things like cgroups, rlimits, etc., we
should separate things that are inherently runtime specific from
things that are static to the application running in the container.
- To support the goal of being able to move a bundle between hosts we
should make it clear what parts of the spec are and are not portable
between hosts so that upon landing on a new host the non-portable
options may be rewritten or removed.
- In order to attach a cryptographic identity to a bundle we must not
include details in the bundle that are host specific.
'From' and 'To' are potentially ambiguous for a one-to-one map like
this, and there's already an established name convention in
SysProcIDMap [1]. This commit removes the mental overhead of two
separate naming schemes for the same information. I'd like to drop
IDMapping entirely in favor of SysProcIDMap, but SysProcIDMap doesn't
give the JSON hints we need for (de)serializing.
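
The "JSON hints" point can be illustrated with a short Go sketch. The idMapping type and its JSON keys are assumptions; untagged is a local mirror of the field names in syscall.SysProcIDMap, defined here so the example does not depend on a Linux-only type.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// idMapping uses the SysProcIDMap-style names plus JSON tags.
// The exact key spellings are an assumption for this sketch.
type idMapping struct {
	HostID      uint32 `json:"hostID"`
	ContainerID uint32 `json:"containerID"`
	Size        uint32 `json:"size"`
}

// untagged has the same field names as syscall.SysProcIDMap,
// which carries no JSON tags.
type untagged struct {
	ContainerID int
	HostID      int
	Size        int
}

// marshal returns the JSON encoding of v, or the error text on failure.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(b)
}

func main() {
	// With tags, the keys follow the config's naming.
	fmt.Println(marshal(idMapping{HostID: 1000, ContainerID: 0, Size: 1}))
	// Without tags, the Go field names leak into the JSON.
	fmt.Println(marshal(untagged{ContainerID: 0, HostID: 1000, Size: 1}))
}
```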
[1]: https://golang.org/pkg/syscall/#SysProcIDMap
- link to official SemVer page
- link between config.md and config-linux.md and explain relationship
- fix typo (arch -> os)
- tweak formatting of some special characters