Refactor DeviceFromPath in order to get rid of package syscall and
directly use the functions from x/sys/unix. This also allows to get rid
of the conversion from the OS-independent file mode values (from the os
package) to Linux specific values and instead let's us use the raw
file mode value directly.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.
This includes a manual edit of the docker term package to also correct the name there too.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Use ParseSocketControlMessage and ParseUnixRights from
golang.org/x/sys/unix instead of their syscall equivalent.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
fix#1476
If containerA shares namespace, say ipc namespace, with containerB, then
its ipc namespace path would be the same as containerB and be stored in
`state.json`. Exec into containerA will just read the namespace paths
stored in this file and join these namespaces. So, if containerB has
already been stopped, `docker exec containerA` will fail.
To address this issue, we should always save own namespace paths no
matter if we share namespaces with other containers.
Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
It appears as though these semantics were not fully thought out when
implementing them for rootless containers. It is not necessary (and
could be potentially dangerous) to set the owner of /run/ctr/$id to be
the root inside the container (if user namespaces are being used).
Instead, just use the e{g,u}id of runc to determine the owner.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Use IoctlGetInt and IoctlGetTermios/IoctlSetTermios instead of manually
reimplementing them.
Because of unlockpt, the ioctl wrapper is still needed as it needs to
pass a pointer to a value, which is not supported by any ioctl function
in x/sys/unix yet.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Use unix.Prctl() instead of manually reimplementing it using
unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally
defining it.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently has a live reference to a
mountpoint that existed on the host (which the host cannot kill
implicitly through an unmount and peer sharing).
In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because it is "in use" during the race. While the race is _very_ small
(and libdm actually retries to avoid these sorts of cases) this appears
to manifest in various cases.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
replace #1492#1494fix#1422
Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Use unix.Eventfd() instead of calling manually reimplementing it using
the raw syscall. Also use the correct corresponding unix.EFD_CLOEXEC
flag instead of unix.FD_CLOEXEC (which can have a different value on
some architectures and thus might lead to unexpected behavior).
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
And Stat_t.PID and Stat_t.Name while we're at it. Then use the new
.State property in runType to distinguish between running and
zombie/dead processes, since kill(2) does not [1]. With this change
we no longer claim Running status for zombie/dead processes.
I've also removed the kill(2) call from runType. It was originally
added in 13841ef3 (new-api: return the Running state only if the init
process is alive, 2014-12-23), but we've been accessing
/proc/[pid]/stat since 14e95b2a (Make state detection precise,
2016-07-05, #930), and with the /stat access the kill(2) check is
redundant.
I also don't see much point to the previously-separate
doesInitProcessExist, so I've inlined that logic in runType.
It would be nice to distinguish between "/proc/[pid]/stat doesn't
exist" and errors parsing its contents, but I've skipped that for the
moment.
The Running -> Stopped change in checkpoint_test.go is because the
post-checkpoint process is a zombie, and with this commit zombie
processes are Stopped (and no longer Running).
[1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789
Signed-off-by: W. Trevor King <wking@tremily.us>
And convert the various start-time properties from strings to uint64s.
This removes all internal consumers of the deprecated
GetProcessStartTime function.
Signed-off-by: W. Trevor King <wking@tremily.us>
Use KeyctlJoinSessionKeyring, KeyctlString and KeyctlSetperm from
golang.org/x/sys/unix instead of manually reimplementing them.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
And use it only in local tooling that is forwarding the pseudoterminal
master. That way runC no longer has an opinion on the onlcr setting
for folks who are creating a terminal and detaching. They'll use
--console-socket and can setup the pseudoterminal however they like
without runC having an opinion. With this commit, the only cases
where runC still has applies SaneTerminal is when *it* is the process
consuming the master descriptor.
Signed-off-by: W. Trevor King <wking@tremily.us>
Use the NLA_ALIGNTO and NLA_HDRLEN constants from x/sys/unix instead of
syscall, as the syscall package shouldn't be used anymore (except for a
few exceptions).
This also makes the syscall_NLA_HDRLEN workaround for gccgo unnecessary.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Use the symlink xattr syscall wrappers Lgetxattr, Llistxattr and
Lsetxattr from x/sys/unix (introduced in
golang/sys@b90f89a1e7) instead of
providing own wrappers. Leave the functionality of system.Lgetxattr
intact with respect to the retry with a larger buffer, but switch it to
use unix.Lgetxattr.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Follow commit 3d7cb4293c ("Move libcontainer to x/sys/unix") and also
move the examples in README.md from syscall to x/sys/unix.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.
There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:
Errno
Signal
SysProcAttr
Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
* User Case:
User could use prestart hook to add block devices to container. so the
hook should have a way to set the permissions of the devices.
Just move cgroup config operation before prestart hook will work.
Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported
in runc anyway right now. So clean up the file naming to use `_linux` where appropriate.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
A freezer cgroup allows to dump processes faster.
If a user wants to checkpoint a container and its storage,
he has to pause a container, but in this case we need to pass
a path to its freezer cgroup to "criu dump".
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When C/R was implemented, it was enough to call manager.Set to apply
limits and to move a task. Now .Set() and .Apply() have to be called
separately.
Fixes: 8a740d5391 ("libcontainer: cgroups: don't Set in Apply")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
Instead of relying on version numbers it is possible to check if CRIU
actually supports certain features. This introduces an initial
implementation to check if CRIU and the underlying kernel actually
support dirty memory tracking for memory pre-dumping.
Upstream CRIU also supports the lazy-page migration feature check and
additional feature checks can be included in CRIU to reduce the version
number parsing. There are also certain CRIU features which depend on one
side on the CRIU version but also require certain kernel versions to
actually work. CRIU knows if it can do certain things on the kernel it
is running on and using the feature check RPC interface makes it easier
for runc to decide if the criu+kernel combination will support that
feature.
Feature checking was introduced with CRIU 1.8. Running with older CRIU
versions will ignore the feature check functionality and behave just
like it used to.
v2:
- Do not use reflection to compare requested and responded
features. Checking which feature is available is now hardcoded
and needs to be adapted for every new feature check. The code
is now much more readable and simpler.
v3:
- Move the variable criuFeat out of the linuxContainer struct,
as it is not container specific. Now it is a global variable.
Signed-off-by: Adrian Reber <areber@redhat.com>
The original implementation is in C, which increases cognitive load and
possibly might cause us problems in the future. Since sys/unix is better
maintained than the syscall standard library switching makes more sense.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).
In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:
* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update
The following commands work:
* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state
In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.
!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.
This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).
In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
If we try to pause a container on the system without freezer cgroups,
we can found that runc tries to open ./freezer.state. It is obviously wrong.
$ ./runc pause test
no such directory for freezer.state
$ echo FROZEN > freezer.state
$ ./runc pause test
container not running or created: paused
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
When process config doesnt specify capabilities anywhere, we should not panic
because setting capabilities are optional.
Signed-off-by: Daniel Dao <dqminh89@gmail.com>
This reverts commit d4091ef151.
d4091ef151 ("fix minor issue") doesn't actually make any sense, and
actually makes the code more confusing.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This maybe a nice extra but it adds complication to the usecase. The
contract is listen on the socket and you get an fd to the pty master and
that is that.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Runc needs to copy certain files from the top of the cgroup cpuset hierarchy
into the container's cpuset cgroup directory. Currently, runc determines
which directory is the top of the hierarchy by using the parent dir of
the first entry in /proc/self/mountinfo of type cgroup.
This creates problems when cgroup subsystems are mounted arbitrarily in
different dirs on the host.
Now, we use the most deeply nested mountpoint that contains the
container's cpuset cgroup directory.
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
Signed-off-by: Will Martin <wmartin@pivotal.io>
In container process's Init function, we use
fd + execFifoFilename to open exec fifo, so this
field in init config is never used.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Fixes: #1347Fixes: #1083
The root cause of #1083 is because we're joining an
existed cgroup whose kmem accouting is not initialized,
and it has child cgroup or tasks in it.
Fix it by checking if the cgroup is first time created,
and we should enable kmem accouting if the cgroup is
craeted by libcontainer with or without kmem limit
configed. Otherwise we'll get issue like #1347
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
It should not be binded to container creation, for
example, runc restore needs to create a
libcontainer.Container, but it won't need exec fifo.
So create exec fifo when container is started or run,
where we really need it.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
The error message added here provides no value as the caller already
knows all the added details. However it is covering up the underyling
system error (typically `ENOTSUP`). There is no way to handle this error before
this change.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
getDevices() has been updated to skip `/dev/.lxc` and `/dev/.lxd-mounts`, which was breaking privileged Docker containers running on runC, inside of LXD managed Linux Containers
Signed-off-by: Carlton-Semple <carlton.semple@ibm.com>
CRIU gets pre-dump to complete iterative migration.
pre-dump saves process memory info only. And it need parent-path
to specify the former memory files.
This patch add pre-dump and parent-path arguments to runc checkpoint
Signed-off-by: Deng Guangxing <dengguangxing@huawei.com>
Signed-off-by: Adrian Reber <areber@redhat.com>
As the runtime-spec allows it, we want to be able to specify overlayfs
mounts with:
{
"destination": "/etc/pki",
"type": "overlay",
"source": "overlay",
"options": [
"lowerdir=/etc/pki:/home/amurdaca/go/src/github.com/opencontainers/runc/rootfs_fedora/etc/pki"
]
},
This patch takes care of allowing overlayfs mounts. Both RO and RW
should be supported.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
`label.InitLabels` takes options as a string slice in the form of:
user:system_u
role:system_r
type:container_t
level:s0:c4,c5
However, `DupSecOpt` and `DisableSecOpt` were still adding a docker
specifc `label=` in front of every option. That leads to `InitLabels`
not being able to correctly init selinux labels in this scenario for
instance:
label.InitLabels(DupSecOpt([%OPTIONS%]))
if `%OPTIONS` has options prefixed with `label=`, that's going to fail.
Fix this by removing that docker specific `label=` prefix.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Correct container.Destroy() docs to clarify that destroy can only operate on containers in specific states.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
If a relative pathed exe is used for InitArgs init will fail to run if Cwd is not set the original path.
Prevent failure of init to run by ensuring that exe in InitArgs is an absolute path.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.
To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).
There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.
Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
When signaling children and the signal is SIGKILL wait for children
otherwise conditionally wait for children which are ready to report.
This reaps all children which exited due to the signal sent without
blocking indefinitely.
Also:
* Ignore ignore ECHILD, which means the child has already gone.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Ensure that the pipe is always closed during the error processing of StartInitialization.
Also:
* Fix a comment typo.
* Use newContainerInit directly as there's no need for i to be an initer.
* Move the comment about the behaviour of Init() directly above it, clarifying what happens for all defers.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
Add the import of nsenter to the example in libcontainer's README.md, as without it none of the example code works.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
POSIX mandates that `cmsg_len` in `struct cmsghdr` is a `socklen_t`,
which is an `unsigned int`. Musl libc as used in Alpine implements
this; Glibc ignores the spec and makes it a `size_t` ie `unsigned long`.
To avoid the `-Wformat=` warning from the `%lu` on Alpine, cast this
to an `unsigned long` always.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Add godoc links to README.md files for runc and libcontainer so its easy to access the golang documentation.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
signalAllProcesses was making the assumption that the requested signal was SIGKILL, possibly due to the signal parameter being added at a later date, and hence it was safe to wait for all processes which is not the case.
BaseContainer.Signal(s os.Signal, all bool) exposes this functionality to consumers, so an arbitrary signal could be used which is not guaranteed to make the processes exit.
Correct the documentation for signalAllProcesses around the signal delivered and update it so that the wait is only performed on SIGKILL hence making it safe to process other signals without risk of blocking forever, while still maintaining compatibility to SIGKILL callers.
Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
The parameters passed to `GetExecUser` is not correct.
Consider the following code:
```
package main
import (
"fmt"
"io"
"os"
)
func main() {
passwd, err := os.Open("/etc/passwd1")
if err != nil {
passwd = nil
} else {
defer passwd.Close()
}
err = GetUserPasswd(passwd)
if err != nil {
fmt.Printf("%#v\n", err)
}
}
func GetUserPasswd(r io.Reader) error {
if r == nil {
return fmt.Errorf("nil source for passwd-formatted
data")
} else {
fmt.Printf("r = %#v\n", r)
}
return nil
}
```
If the file `/etc/passwd1` is not exist, we expect to return
`nil source for passwd-formatted data` error, and in fact, the func
`GetUserPasswd` return nil.
The same logic exists in runc code. this patch fix it.
Signed-off-by: Wang Long <long.wanglong@huawei.com>
The `bufio.Scanner.Scan` method returns false either by reaching the
end of the input or an error. After Scan returns false, the Err method
will return any error that occurred during scanning, except that if it
was io.EOF, Err will return nil.
We should check the error when Scan return false(out of the for loop).
Signed-off-by: Wang Long <long.wanglong@huawei.com>
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.
This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.
This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.
This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.
https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318
The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
On some systems, when we mount some cgroup subsystems into
a same mountpoint, the name sequence of mount options and
cgroup directory name can not be the same.
For example, the mount option is cpuacct,cpu, but
mountpoint name is /sys/fs/cgroup/cpu,cpuacct. In current
runc, we set mount destination name from combining
subsystems, which comes from mount option from
/proc/self/mountinfo, so in my case the name would be
/sys/fs/cgroup/cpuacct,cpu, which is differernt from
host, and will break some applications.
Fix it by using directory name from host mountpoint.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>