pivot_root() imposes a number of restrictions and fails if they are not
met. One of them is that the parent mount of the container root must not
be shared.
Until now the parent could not be shared, because we marked everything
either private or slave. With the newly introduced propagation modes, the
parent mount of the container rootfs can be shared, and pivot_root() will
then fail.
So check whether the parent mount is shared and, if so, make it private.
This ensures that pivot_root() works.
It also ensures that when we bind mount the container rootfs, the mount
does not propagate to the parent mount namespace. Otherwise cleanup
becomes a problem.
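A minimal sketch of the "is the parent mount shared" check, assuming the
information comes from a line in /proc/self/mountinfo format, where a shared
mount carries a "shared:N" tag among its optional fields. The helper name
isSharedMount and the sample lines are made up for illustration and are not
taken from the patch. Making the mount private afterwards would be a
separate mount call with MS_PRIVATE, which is left out because it needs
privileges.

```go
package main

import (
	"fmt"
	"strings"
)

// isSharedMount reports whether one line in /proc/self/mountinfo format
// describes a mount with shared propagation. The optional fields start
// after the six fixed fields and end at the "-" separator; a shared mount
// has an optional field of the form "shared:N".
func isSharedMount(line string) bool {
	fields := strings.Fields(line)
	for i := 6; i < len(fields); i++ {
		if fields[i] == "-" {
			break // end of the optional fields
		}
		if strings.HasPrefix(fields[i], "shared:") {
			return true
		}
	}
	return false
}

func main() {
	// Hypothetical sample lines, not captured from a real system.
	shared := "36 35 98:0 / /mnt rw,noatime shared:1 - ext3 /dev/root rw"
	private := "37 35 98:1 / /var rw,noatime - ext3 /dev/sda2 rw"
	fmt.Println(isSharedMount(shared), isSharedMount(private)) // true false
}
```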
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
The spec introduced a new field, rootfsPropagation. Right now runc does
not parse that field, so it has no effect. Start parsing it and, for now,
allow only a limited set of propagation flags. More can be allowed as new
use cases show up.
We apply the propagation flags to / and not to the rootfs. So ideally we
should introduce another field in the spec, say rootPropagation. For now
I am parsing rootfsPropagation. Once we agree on the design, we can
discuss whether the spec needs another field.
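A rough sketch of what parsing the field could look like, assuming Linux
(the syscall.MS_* constants). The table and the name parseRootfsPropagation
are hypothetical, and which values belong to the "limited" set is an
assumption made here for illustration; the commit message does not list
them.

```go
package main

import (
	"fmt"
	"syscall"
)

// rootPropagationFlags is a hypothetical table. The accepted values are an
// assumption; the commit only says that the set is limited for now.
var rootPropagationFlags = map[string]int{
	"private":  syscall.MS_PRIVATE,
	"rprivate": syscall.MS_PRIVATE | syscall.MS_REC,
	"slave":    syscall.MS_SLAVE,
	"rslave":   syscall.MS_SLAVE | syscall.MS_REC,
}

// parseRootfsPropagation maps the spec string to mount(2) propagation
// flags and rejects any value that is not in the table.
func parseRootfsPropagation(value string) (int, error) {
	flags, ok := rootPropagationFlags[value]
	if !ok {
		return 0, fmt.Errorf("rootfsPropagation=%q is not supported", value)
	}
	return flags, nil
}

func main() {
	for _, v := range []string{"rslave", "shared"} {
		flags, err := parseRootfsPropagation(v)
		fmt.Printf("%s: flags=%#x err=%v\n", v, flags, err)
	}
}
```

Rejecting unknown values with an error, rather than silently falling back
to a default, keeps a typo in the spec from changing the propagation mode
unnoticed.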
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Right now config.Privatefs is a boolean that determines whether / is given
the propagation flags syscall.MS_PRIVATE | syscall.MS_REC.
Soon we want to represent other propagation states such as private,
[r]slave, and [r]shared. We can either introduce more boolean variables or
keep track of the propagation flags in an integer variable. An integer is
more versatile and allows various kinds of propagation flags to be
specified. So replace Privatefs with RootPropagation, which is an integer.
Note that this will require changes in docker. Instead of setting
Privatefs to true, they will need to set:
config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC
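To illustrate why one integer is enough, here is a small sketch, assuming
Linux. Config is a trimmed-down hypothetical stand-in for the real config
struct, and describe is a made-up helper that decodes the mode and the
recursive bit from the same integer.

```go
package main

import (
	"fmt"
	"syscall"
)

// Config is a hypothetical stand-in; only the field discussed is shown.
type Config struct {
	RootPropagation int // mount(2) propagation flags applied to /
}

// describe decodes the propagation mode and the recursive bit that are
// packed together in one integer.
func describe(flags int) string {
	var name string
	switch {
	case flags&syscall.MS_SHARED != 0:
		name = "shared"
	case flags&syscall.MS_SLAVE != 0:
		name = "slave"
	case flags&syscall.MS_PRIVATE != 0:
		name = "private"
	default:
		return "unset"
	}
	if flags&syscall.MS_REC != 0 {
		return "r" + name
	}
	return name
}

func main() {
	cfg := Config{RootPropagation: syscall.MS_PRIVATE | syscall.MS_REC}
	fmt.Println(describe(cfg.RootPropagation)) // rprivate
}
```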
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Do not remount a bind mount to enable flags unless non-default flags are
provided for the requested mount. This solves a problem with user
namespaces and remount of bind mount permissions.
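A sketch of the decision described above, assuming Linux. Which flags count
as "default" for a bind mount is an assumption made here (MS_BIND, MS_REC
and MS_REMOUNT); the names bindDefaults and needsRemount are hypothetical.

```go
package main

import (
	"fmt"
	"syscall"
)

// bindDefaults is the assumed set of flags that do not, on their own,
// justify a second mount call with MS_REMOUNT.
const bindDefaults = syscall.MS_BIND | syscall.MS_REC | syscall.MS_REMOUNT

// needsRemount reports whether the requested flags contain anything
// beyond the assumed defaults, such as MS_RDONLY or MS_NOSUID.
func needsRemount(flags int) bool {
	return flags&^bindDefaults != 0
}

func main() {
	plain := syscall.MS_BIND | syscall.MS_REC
	readOnly := syscall.MS_BIND | syscall.MS_RDONLY
	fmt.Println(needsRemount(plain), needsRemount(readOnly)) // false true
}
```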
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Do not have methods and actions that require syscalls in the configs
package because it breaks cross compile.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This commit allows additional architectures to be added to Seccomp filters
created by containers. This allows containers to make syscalls using these
architectures. For example, in a container on an AMD64 system, only AMD64
syscalls would be usable unless x86 was added to the filter using this patch,
which would allow both 32-bit and 64-bit syscalls to be used.
Signed-off-by: Matthew Heon <mheon@redhat.com>
We need to update the mount's destination after we resolve symlinks so
that the correct location is created and mounted.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Whenever /dev/null is used as one of the main process's STDIO streams, do
not try to change its ownership via fchown: we should not do it in the
first place, and it will fail if the container is supposed to be
read-only.
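One way to recognize /dev/null is to compare device numbers, sketched below
under the assumption of a Linux system. The helper isDevNull is
hypothetical and takes a path so that it can be run anywhere; for an
already open STDIO descriptor, syscall.Fstat on the descriptor would take
the place of syscall.Stat.

```go
package main

import (
	"fmt"
	"syscall"
)

// isDevNull reports whether path refers to the same character device as
// /dev/null by comparing the device numbers (st_rdev) of the two files.
func isDevNull(path string) (bool, error) {
	var null, st syscall.Stat_t
	if err := syscall.Stat("/dev/null", &null); err != nil {
		return false, err
	}
	if err := syscall.Stat(path, &st); err != nil {
		return false, err
	}
	// Check the file type as well: a block device may share the numbers.
	isChar := st.Mode&syscall.S_IFMT == syscall.S_IFCHR
	return isChar && st.Rdev == null.Rdev, nil
}

func main() {
	for _, p := range []string{"/dev/null", "/"} {
		ok, err := isDevNull(p)
		fmt.Printf("%s: devnull=%v err=%v\n", p, ok, err)
	}
}
```

A caller that fixes up STDIO ownership would skip the fchown for any
descriptor where this check returns true.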
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When executing an additional process in a container, all namespaces are
entered but the user namespace. As a result, the process may be
executed as the host's root user. This has both functionality and
security implications.
Fix this by adding the missing user namespace to the array of
namespaces. Since joining a user namespace in which the caller is
already a member yields an error, skip namespaces we're already in.
Finally, remove a needless and buggy AT_SYMLINK_NOFOLLOW from the code.
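The "skip namespaces we're already in" check can be sketched as follows.
This is an illustration in Go, not the code from the patch; sameNamespace
is a hypothetical helper. It relies on the documented property that two
handles under /proc/<pid>/ns/ that refer to the same namespace have the
same device and inode numbers.

```go
package main

import (
	"fmt"
	"os"
)

// sameNamespace reports whether two namespace handles, for example
// /proc/self/ns/user and /proc/<pid>/ns/user, refer to the same namespace.
// os.Stat follows the /proc link, and os.SameFile compares the device and
// inode numbers of the two results on Unix systems.
func sameNamespace(pathA, pathB string) (bool, error) {
	a, err := os.Stat(pathA)
	if err != nil {
		return false, err
	}
	b, err := os.Stat(pathB)
	if err != nil {
		return false, err
	}
	return os.SameFile(a, b), nil
}

func main() {
	// Comparing our own handle with itself; on a system without user
	// namespace support this prints an error instead of failing.
	self := "/proc/self/ns/user"
	same, err := sameNamespace(self, self)
	fmt.Println(same, err)
}
```

When the check returns true for a given namespace, the join for that
namespace is skipped rather than attempted.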
Signed-off-by: Ido Yariv <ido@wizery.com>
* version in the config example is advanced to 0.1.0
* rootfsPropagation in config.json is removed
  (the same field is already in runtime.json)
* rlimit type is changed from a magic number to a name (string)
* add pids cgroup
* add cgroup path
After this change is applied, the example config in this README.md
is consistent with the result of `runc spec`.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Fix the permissions of the container's main process's STDIO when the
process is not run as the root user. This changes the permissions right
before switching to the specified user so that its STDIO matches its
UID and GID.
Add a test for checking that the STDIO of the process is owned by the
specified user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
When we are using user namespaces and do not have a TTY, we need to change
the ownership of the pipes used for the process to the root user within
the container, so that calling open() on any of the /proc/self/fd/*
entries does not return EPERM.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This patch mainly removes the incorrect "for writing".
The config files are only read during `runc start`.
Signed-off-by: Lai Jiangshan <jiangshanlai@gmail.com>
Right now, if one passes a mount propagation flag in the spec file, it
does not take effect. For example, try the following in the spec JSON file:
    {
        "type": "bind",
        "source": "/root/mnt-source",
        "destination": "/root/mnt-dest",
        "options": "rbind,shared"
    }
One would expect that /root/mnt-dest will be shared inside the container
but that's not the case.
    # findmnt -o TARGET,PROPAGATION
    `-/root/mnt-dest private
The reason is that propagation flags can't be passed along with other
regular flags. They need to be passed in a separate call to the mount
syscall, and only one propagation flag at a time (per the mount man page).
Hence, store the propagation flags separately in a slice and apply them
in order after the mount call wherever appropriate. This allows the user
to control the propagation property of a mount point inside the
container.
Storing them separately also solves another problem, where the recursive
flag (syscall.MS_REC) can get mixed up. For example, the options
"rbind,private" and "bind,rprivate" would be identical, with no way to
differentiate between them, if all the flags were stored in a single
integer.
This patch allows one to pass the propagation flags "[r]shared,[r]slave,
[r]private,[r]unbindable" in the spec file as mount options.
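The split described above can be sketched like this, assuming Linux. The
two tables and the name parseOptions are hypothetical and cover only a few
options; unknown options are ignored here, which real parsing code would
probably not do.

```go
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// regularFlags are OR-ed together and passed to the first mount call.
var regularFlags = map[string]int{
	"bind":  syscall.MS_BIND,
	"rbind": syscall.MS_BIND | syscall.MS_REC,
	"ro":    syscall.MS_RDONLY,
}

// propagationFlags are kept apart; each needs its own mount call.
var propagationFlags = map[string]int{
	"private":     syscall.MS_PRIVATE,
	"rprivate":    syscall.MS_PRIVATE | syscall.MS_REC,
	"shared":      syscall.MS_SHARED,
	"rshared":     syscall.MS_SHARED | syscall.MS_REC,
	"slave":       syscall.MS_SLAVE,
	"rslave":      syscall.MS_SLAVE | syscall.MS_REC,
	"unbindable":  syscall.MS_UNBINDABLE,
	"runbindable": syscall.MS_UNBINDABLE | syscall.MS_REC,
}

// parseOptions splits a comma-separated option string into the regular
// flags (one integer) and the propagation flags (a slice, in the order
// given, one entry per later mount call).
func parseOptions(options string) (int, []int) {
	flags := 0
	var propagation []int
	for _, opt := range strings.Split(options, ",") {
		if f, ok := regularFlags[opt]; ok {
			flags |= f
		} else if p, ok := propagationFlags[opt]; ok {
			propagation = append(propagation, p)
		}
	}
	return flags, propagation
}

func main() {
	for _, o := range []string{"rbind,private", "bind,rprivate"} {
		flags, prop := parseOptions(o)
		fmt.Printf("%-14s flags=%#x propagation=%#x\n", o, flags, prop)
	}
}
```

With a single integer both option strings would collapse to
MS_BIND | MS_REC | MS_PRIVATE; with the slice, the MS_REC bit stays
attached to the option it was written on.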
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>