pivotDir is the directory where the pivot_root() call puts the old root.
We unmount pivotDir and delete it.
Previously we always made / either rslave or rprivate. That meant
pivotDir could never contain mounts shared with the parent mount
namespace, so unmounting pivotDir was safe: none of the unmounts would
propagate to the parent namespace and unmount things we did not want
unmounted.
But now the user can ask for private, shared, or slave propagation on /.
That means some of the mounts we inherited from the parent could be
shared, and if we unmount pivotDir those mounts get unmounted in the
parent too. That's not what we want.
Instead, make pivotDir rprivate so that unmounts don't propagate back to
the parent.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
pivot_root() imposes a number of restrictions and fails if they are not
met. One of them is that the parent mount of the container root cannot
be shared.
So far the parent could not be shared, as we marked everything either
private or slave. But now we have introduced new propagation modes where
the parent mount of the container rootfs could be shared, and
pivot_root() would fail.
So check whether the parent mount is shared and, if so, make it private.
This makes sure pivot_root() works.
It also makes sure that when we bind mount the container rootfs, the
mount does not propagate to the parent mount namespace. Otherwise cleanup
becomes a problem.
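
The "is the parent mount shared" check can be sketched as a scan of the
optional fields in a /proc/self/mountinfo line. This is only a minimal
sketch: the helper name isSharedMount and the sample lines are made up
for illustration and are not taken from this patch.

```go
package main

import (
	"fmt"
	"strings"
)

// isSharedMount (hypothetical helper) reports whether one line of
// /proc/self/mountinfo describes a mount with shared propagation.
// Optional fields sit between the sixth field (mount options) and the
// "-" separator; a shared mount carries a "shared:N" tag there.
func isSharedMount(line string) bool {
	fields := strings.Fields(line)
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			break // end of optional fields
		}
		if strings.HasPrefix(fields[i], "shared:") {
			return true
		}
	}
	return false
}

func main() {
	// Sample lines in mountinfo format; the values are invented.
	shared := "36 35 98:0 / /mnt rw,noatime shared:1 - ext3 /dev/root rw"
	private := "37 35 98:1 / /data rw,noatime - ext3 /dev/sdb1 rw"
	fmt.Println(isSharedMount(shared), isSharedMount(private))
}
```

If the check returns true for the parent mount, the caller would then
remount that parent with MS_PRIVATE before calling pivot_root().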
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
The spec introduced a new field, rootfsPropagation. Right now that field
is not parsed by runc and has no effect. Start parsing it and, for now,
allow only a limited set of propagation flags. More can be opened up as
new use cases show up.
We apply the propagation flags to / and not to rootfs, so ideally we
should introduce another field in the spec, say rootPropagation. For
now I am parsing rootfsPropagation. Once we agree on the design, we
can discuss whether we need another field in the spec.
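
Parsing of the field could look roughly like the sketch below. The set
of accepted strings, the function name parsePropagation, and the error
text are assumptions for illustration; the message above only says that
a limited set of flags is allowed. The constants mirror the Linux
values of syscall.MS_* so the sketch builds on any platform.

```go
package main

import "fmt"

// Linux mount flag values from <linux/mount.h>; on Linux these match
// syscall.MS_REC, syscall.MS_PRIVATE, syscall.MS_SLAVE, syscall.MS_SHARED.
const (
	msRec     = 0x4000
	msPrivate = 1 << 18
	msSlave   = 1 << 19
	msShared  = 1 << 20
)

// propagationFlags lists the strings accepted by this sketch.
// Which values are really accepted is an assumption.
var propagationFlags = map[string]int{
	"private":  msPrivate,
	"rprivate": msPrivate | msRec,
	"slave":    msSlave,
	"rslave":   msSlave | msRec,
	"shared":   msShared,
	"rshared":  msShared | msRec,
}

// parsePropagation (hypothetical helper) maps a rootfsPropagation
// string to mount flags and rejects anything not in the table.
func parsePropagation(name string) (int, error) {
	flags, ok := propagationFlags[name]
	if !ok {
		return 0, fmt.Errorf("unsupported rootfsPropagation %q", name)
	}
	return flags, nil
}

func main() {
	flags, err := parsePropagation("rslave")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("rslave -> %#x\n", flags)
}
```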
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Right now config.Privatefs is a boolean which determines whether / gets
the propagation flags syscall.MS_PRIVATE | syscall.MS_REC.
Soon we want to represent other propagation states like private, [r]slave,
and [r]shared. So we can either introduce more boolean variables or keep
track of the propagation flags in an integer variable. An integer
variable is more versatile and allows various kinds of propagation flags
to be specified. So replace Privatefs with RootPropagation, which is an
integer.
Note, this will require changes in docker. Instead of setting Privatefs
to true, they will need to set:
config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC
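
For callers migrating from the boolean, the mapping could be sketched as
below. The Config struct is cut down to the one field discussed here and
legacyPropagation is a hypothetical helper; the rslave fallback for
Privatefs == false is an assumption based on / previously being made
either rslave or rprivate. The constants mirror the Linux values of
syscall.MS_*.

```go
package main

import "fmt"

// Linux mount flag values from <linux/mount.h>; on Linux these match
// syscall.MS_REC, syscall.MS_PRIVATE and syscall.MS_SLAVE.
const (
	msRec     = 0x4000
	msPrivate = 1 << 18
	msSlave   = 1 << 19
)

// Config is a reduced stand-in for the real config type.
type Config struct {
	// RootPropagation holds the mount propagation flags applied to /.
	RootPropagation int
}

// legacyPropagation (hypothetical) translates the old Privatefs boolean
// into the equivalent integer flags.
func legacyPropagation(privatefs bool) int {
	if privatefs {
		return msPrivate | msRec
	}
	// Assumed previous default when Privatefs was false.
	return msSlave | msRec
}

func main() {
	config := Config{RootPropagation: legacyPropagation(true)}
	fmt.Printf("RootPropagation = %#x\n", config.RootPropagation)
}
```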
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Do not remount a bind mount to enable flags unless non-default flags are
provided for the requested mount. This solves a problem with user
namespaces and remount of bind mount permissions.
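
The decision can be sketched as a mask test on the requested flags. This
is a minimal sketch, assuming that "default" means only the bind,
recursive, and remount bits are set; the helper name needsRemount is
hypothetical. The constants mirror the Linux values of syscall.MS_*.

```go
package main

import "fmt"

// Linux mount flag values from <linux/mount.h>; on Linux these match
// the corresponding syscall.MS_* constants.
const (
	msRdonly  = 0x1
	msRemount = 0x20
	msBind    = 0x1000
	msRec     = 0x4000
)

// needsRemount (hypothetical helper) reports whether a bind mount was
// requested with any flag beyond the ones a plain bind mount carries.
// Only then is a second, remounting mount call needed to apply them.
func needsRemount(flags int) bool {
	return flags&^(msRec|msRemount|msBind) != 0
}

func main() {
	fmt.Println(needsRemount(msBind | msRec))    // plain recursive bind
	fmt.Println(needsRemount(msBind | msRdonly)) // read-only bind
}
```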
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Do not have methods and actions that require syscalls in the configs
package because it breaks cross-compilation.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit allows additional architectures to be added to Seccomp filters
created by containers. This allows containers to make syscalls using these
architectures. For example, in a container on an AMD64 system, only AMD64
syscalls would be usable unless x86 was added to the filter using this patch,
which would allow both 32-bit and 64-bit syscalls to be used.
Signed-off-by: Matthew Heon <mheon@redhat.com>
We need to update the mount's destination after we resolve symlinks so
that it properly creates and mounts the correct location.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Whenever /dev/null is used as one of the main process's STDIO streams,
do not try to change its ownership via fchown: we should not do it
in the first place, and it also fails if the container is supposed
to be read-only.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When executing an additional process in a container, all namespaces are
entered except the user namespace. As a result, the process may be
executed as the host's root user. This has both functionality and
security implications.
Fix this by adding the missing user namespace to the array of
namespaces. Since joining a user namespace in which the caller is
already a member yields an error, skip namespaces we're already in.
Lastly, remove a needless and buggy AT_SYMLINK_NOFOLLOW in the code.
Signed-off-by: Ido Yariv <ido@wizery.com>
* version in the config example is advanced to 0.1.0
* rootfsPropagation in config.json is removed
  (the same field is already in runtime.json)
* rlimit type is changed from a magic number to a name (string)
* add pids cgroup
* add cgroup path
After this change is applied, the example config in this README.md
is consistent with the result of `runc spec`.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fix the ownership of the container's main process's STDIO when the
process is not run as the root user. This changes the ownership right
before switching to the specified user so that its STDIO matches its
UID and GID.
Add a test for checking that the STDIO of the process is owned by the
specified user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When we are using user namespaces and do not have a TTY, we need to
change the ownership of the pipes created with pipe() for the process
to the root user within the container, so that calling open() on any of
the /proc/self/fd/* entries does not return EPERM.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>