Ensure that the pipe is always closed during the error processing of StartInitialization.
Also:
* Fix a comment typo.
* Use newContainerInit directly as there's no need for i to be an initer.
* Move the comment about the behaviour of Init() directly above it, clarifying what happens for all defers.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Add the import of nsenter to the example in libcontainer's README.md, as without it none of the example code works.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
POSIX mandates that `cmsg_len` in `struct cmsghdr` is a `socklen_t`,
which is an `unsigned int`. Musl libc as used in Alpine implements
this; Glibc ignores the spec and makes it a `size_t` ie `unsigned long`.
To avoid the `-Wformat=` warning from the `%lu` on Alpine, cast this
to an `unsigned long` always.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Add godoc links to README.md files for runc and libcontainer so its easy to access the golang documentation.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
signalAllProcesses was making the assumption that the requested signal was SIGKILL, possibly due to the signal parameter being added at a later date, and hence it was safe to wait for all processes which is not the case.
BaseContainer.Signal(s os.Signal, all bool) exposes this functionality to consumers, so an arbitrary signal could be used which is not guaranteed to make the processes exit.
Correct the documentation for signalAllProcesses around the signal delivered and update it so that the wait is only performed on SIGKILL hence making it safe to process other signals without risk of blocking forever, while still maintaining compatibility to SIGKILL callers.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
The parameters passed to `GetExecUser` is not correct.
Consider the following code:
```
package main
import (
"fmt"
"io"
"os"
)
func main() {
passwd, err := os.Open("/etc/passwd1")
if err != nil {
passwd = nil
} else {
defer passwd.Close()
}
err = GetUserPasswd(passwd)
if err != nil {
fmt.Printf("%#v\n", err)
}
}
func GetUserPasswd(r io.Reader) error {
if r == nil {
return fmt.Errorf("nil source for passwd-formatted
data")
} else {
fmt.Printf("r = %#v\n", r)
}
return nil
}
```
If the file `/etc/passwd1` is not exist, we expect to return
`nil source for passwd-formatted data` error, and in fact, the func
`GetUserPasswd` return nil.
The same logic exists in runc code. this patch fix it.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
The `bufio.Scanner.Scan` method returns false either by reaching the
end of the input or an error. After Scan returns false, the Err method
will return any error that occurred during scanning, except that if it
was io.EOF, Err will return nil.
We should check the error when Scan return false(out of the for loop).
Signed-off-by: Wang Long <long.wanglong@huawei.com>
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.
This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.
This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.
This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.
https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
On some systems, when we mount some cgroup subsystems into
a same mountpoint, the name sequence of mount options and
cgroup directory name can not be the same.
For example, the mount option is cpuacct,cpu, but
mountpoint name is /sys/fs/cgroup/cpu,cpuacct. In current
runc, we set mount destination name from combining
subsystems, which comes from mount option from
/proc/self/mountinfo, so in my case the name would be
/sys/fs/cgroup/cpuacct,cpu, which is differernt from
host, and will break some applications.
Fix it by using directory name from host mountpoint.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
`HookState` struct should follow definition of `State` in runtime-spec:
* modify json name of `version` to `ociVersion`.
* Remove redundant `Rootfs` field as rootfs can be retrived from
`bundlePath/config.json`.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
A remount of a mount point must include all the current flags or
these will be cleared:
```
The mountflags and data arguments should match the values used in the
original mount() call, except for those parameters that are being
deliberately changed.
```
The current code does not do this; the bug manifests in the specified
flags for `/dev` being lost on remount read only at present. As we
need to specify flags, split the code path for this from remounting
paths which are not mount points, as these can only inherit the
existing flags of the path, and these cannot be changed.
In the bind case, remove extra flags from the bind remount. A bind
mount can only be remounted read only, no other flags can be set,
all other flags are inherited from the parent. From the man page:
```
Since Linux 2.6.26, this flag can also be used to make an existing
bind mount read-only by specifying mountflags as:
MS_REMOUNT | MS_BIND | MS_RDONLY
Note that only the MS_RDONLY setting of the bind mount can be changed
in this manner.
```
MS_REC can only be set on the original bind, so move this. See note
in man page on bind mounts:
```
The remaining bits in the mountflags argument are also ignored, with
the exception of MS_REC.
```
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
When checking if the provided networking namespace is the host
one or not, we should first check if it's a symbolic link or not
as in some cases we can use persistent networking namespace under
e.g. /var/run/netns/.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Both suffered from different race conditions.
SelinuxEnabled assigned selinuxEnabledChecked before selinuxEnabled.
Thus racing callers could see the wrong selinuxEnabled.
getSelinuxMountPoint assigned selinuxfs to "" before it know the right
value. Thus racing could see "" improperly.
The gate selinuxfs, enabled, and mclist all on the same lock
This fixes all of the tests that were broken as part of the console
rewrite. This includes fixing the integration tests that used TTY
handling inside libcontainer, as well as the bats integration tests that
needed to be rewritten to use recvtty (as they rely on detached
containers that are running).
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Switch to the actual source of the official Docker library of images, so
that we have a proper source for the test filesystem. In addition,
update to the latest released version (1.25.0 [2016-06-23]) so that we
can use more up-to-date applets in our tests (such as stat(3)).
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This allows for higher-level orchestrators to be able to have access to
the master pty file descriptor without keeping the runC process running.
This is key to having (detach && createTTY) with a _real_ pty created
inside the container, which is then sent to a higher level orchestrator
over an AF_UNIX socket.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since the gid=X and mode=Y flags can be set inside config.json as mount
options, don't override them with our own defaults. This avoids
/dev/pts/* not being owned by tty in a regular container, as well as all
of the issues with us implementing grantpt(3) manually. This is the
least opinionated approach to take.
This patch is part of the console rewrite patchset.
Reported-by: Mrunal Patel <mrunalp@gmail.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.
In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.
We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
To make the code cleaner, and more clear, refactor the syncT handling
used when creating the `runc init` process. In addition, document the
state changes so that people actually understand what is going on.
Rather than only using syncT for the standard initProcess, use it for
both initProcess and setnsProcess. This removes some special cases, as
well as allowing for the use of syncT with setnsProcess.
Also remove a bunch of the boilerplate around syncT handling.
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This adds C wrappers for sendmsg and recvmsg, specifically used for
passing around file descriptors in Go. The wrappers (sendfd, recvfd)
expect to be called in a context where it makes sense (where the other
side is carrying out the corresponding action).
This patch is part of the console rewrite patchset.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
In the cases that we got failure on a subsystem's Apply,
we'll get some subsystems' cgroup directories leftover.
On Docker's point of view, start a container failed, use
`docker rm` to remove the container, but some cgroup files
are leftover.
Sometimes we don't want to clean everyting up when something
went wrong, because we need these inter situation
information to debug what's going on, but cgroup directories
are not useful information we want to keep.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This PR fix issue in this scenario:
```
in terminal 1:
~# cd /sys/fs/cgroup/cpuset
~# mkdir test
~# cd test
~# cat cpuset.cpus
0-3
~# echo 1 > cpuset.cpu_exclusive (make sure you don't have other cgroups under root)
in terminal 2:
~# echo $$ > /sys/fs/cgroup/cpuset/test/tasks
// set resources.cpu.cpus="0-2" in config.json
~# runc run test1
back to terminal 1:
~# cd test1
~# cat cpuset.cpus
0-2
~# echo 1 > cpuset.cpu_exclusive
in terminal 3:
~# echo $$ > /sys/fs/cgroup/test/tasks
// set resources.cpu.cpus="3" in config.json
~# runc run test2
container_linux.go:247: starting container process caused "process_linux.go:258:
applying cgroup configuration for process caused \"failed to write 0-3\\n to
cpuset.cpus: write /sys/fs/cgroup/cpuset/test2/cpuset.cpus: invalid argument\""
```
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
This allows a user to send a signal to all the processes in the
container within a single atomic action to avoid new processes being
forked off before the signal can be sent.
This is basically taking functionality that we already use being
`delete` and exposing it ok the `kill` command by adding a flag.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This moves the ambient capability support behind an `ambient` build tag
so that it is only compiled upon request.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The default terminal setting for a new pty on Linux (unix98) has +ONLCR,
resulting in '\n' writes by a container process to be converted to
'\r\n' reads by the managing process. This is quite unexpected, and
causes multiple issues with things like bats testing. To fix it, make
the terminal sane after opening it by setting -ONLCR.
This patch might need to be rewritten after the console rewrite patchset
is merged.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
I use the same tool (https://github.com/client9/misspell)
as Daniel used a few days ago, don't why he missed these
typos at that time.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
config.cloneflag is not mandatory, when using `runc exec`,
config.cloneflag can be empty, and even then it won't be
`-1` but `0`.
So this validation is totally wrong and unneeded.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In user namespaces devices are bind-mounted from the host, so
we need to add them as external mounts for CRIU.
Reported-by: Ross Boucher <boucher@gmail.com>
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When spec file contains duplicated namespaces, e.g.
specs: specs.Spec{
Linux: &specs.Linux{
Namespaces: []specs.Namespace{
{
Type: "pid",
},
{
Type: "pid",
Path: "/proc/1/ns/pid",
},
},
},
}
runc should report malformed spec instead of using latest one by
default, because this spec could be quite confusing.
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Previously we only tested failures, which causes us to miss issues where
setting sysctls would *always* fail.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When changing this validation, the code actually allowing the validation
to pass was removed. This meant that any net.* sysctl would always fail
to validate.
Fixes: bc84f83344 ("fix docker/docker#27484")
Reported-by: Justin Cormack <justin.cormack@docker.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This reverts part of the commit eb0a144b5e
That commit introduced two issues.
- We need to make parent mount of rootfs private before bind mounting
rootfs. Otherwise bind mounting root can propagate in other mount
namespaces. (If parent mount is shared).
- It broke test TestRootfsPropagationSharedMount() on Fedora.
On fedora /tmp is a mount point with "shared" propagation. I think
you should be able to reproduce it on other distributions as well
as long as you mount tmpfs on /tmp and make it "shared" propagation.
Reason for failure is that pivot_root() fails. And it fails because
kernel does following check.
IS_MNT_SHARED(new_mnt->mnt_parent)
Say /tmp/foo is new rootfs, we have bind mounted rootfs, so new_mnt
is /tmp/foo, and new_mnt->mnt_parent is /tmp which is "shared" on
fedora and above check fails.
So this change broke few things, it is a good idea to revert part of it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Namely, use an undocumented feature of pivot_root(2) where
pivot_root(".", ".") is actually a feature and allows you to make the
old_root be tied to your /proc/self/cwd in a way that makes unmounting
easy. Thanks a lot to the LXC developers which came up with this idea
first.
This is the first step of many to allowing runC to work with a
completely read-only rootfs.
Signed-off-by: Aleksa Sarai <asarai@suse.de>