The path in the stacktrace might not be:
"github.com/opencontainers/runc/libcontainer/stacktrace"
For example, for me its:
"_/go/src/github.com/opencontainers/runc/libcontainer/stacktrace"
so I changed the check to make sure the tail end of the path matches instead
of the entire thing
Signed-off-by: Doug Davis <dug@us.ibm.com>
This simply move the call to the Prestart hooks to be made once we
receive the procReady message from the client.
This is necessary as we had to move the setns calls within nsexec in
order to be accomodate joining namespaces that only affect future
children (e.g. NEWPID).
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.
This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.
With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
This adds orderNamespacePaths to get correct order of namespaces for the
bootstrap program to join.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
This adds `configs.IsNamespaceSupported(nsType)` to check if the host supports
a namespace type.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
This allows the mount syscall to validate the addiontal types where we
do not have to perform extra validation and is up to the consumer to
verify the functionality of the type of device they are trying to
mount.
Fixes#572
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Exec erros from the exec() syscall in the container's init should be
treated as if the container ran but couldn't execute the process for the
user instead of returning a libcontainer error as if it was an issue in
the library.
Before specifying different commands like `/etc`, `asldfkjasdlfj`, or
`/alsdjfkasdlfj` would always return 1 on the command line with a
libcontainer specific error message. Now they return the correct
message and exit status defined for unix processes.
Example:
```bash
root@deathstar:/containers/redis# runc start test
exec: "/asdlfkjasldkfj": file does not exist
root@deathstar:/containers/redis# echo $?
127
root@deathstar:/containers/redis# runc start test
exec: "asdlfkjasldkfj": executable file not found in $PATH
root@deathstar:/containers/redis# echo $?
127
root@deathstar:/containers/redis# runc start test
exec: "/etc": permission denied
root@deathstar:/containers/redis# echo $?
126
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: rajasec <rajasec79@gmail.com>
Changing to name values for defer as per review comments
Signed-off-by: rajasec <rajasec79@gmail.com>
Fixed review comments
Signed-off-by: rajasec <rajasec79@gmail.com>
This prior fix to set "-1" explicitly was lost, and it is simpler to use
the same pointer type from the OCI spec to handle nil pointer == -1 ==
unset case.
Also, as a nearly humorous aside, there was a test for MemorySwappiness
that was actually setting Memory, and it was passing because of this
bug (as it was always setting everyone's MemorySwappiness to zero!)
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Create a unique session key name for every container. Use the pattern
_ses.<postfix> with postfix being the container's Id.
This patch does not prevent containers from joining each other's session
keyring.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
Because we more than likely control dev and populate devices and files
inside of it we need to make sure that we fulfil the user's request to
make it ro only after it has been populated. This removes the need to
expose something like ReadonlyPaths in the config but still have the
same outcome but more seemless for the user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Today mounts in pre-start hooks get overriden by the default mounts.
Moving the pre-start hooks to after the container mounts and before
the pivot/move root gives better flexiblity in the hooks.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
If you don't move the process out of the named cgroup for systemd then
systemd will try to delete all the cgroups that the process is currently
in.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Docker uses Prestart hooks to call a libnetwork hook to create
network devices and set addesses and routes.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
This options is set a namespace mask which will not be dumped and restored.
For example, we are going to use this option to restore network
for docker containers. CRIU will create a network namespace and
call a libnetwork hook to restore network devices, addresses and routes.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
When in a non-initial user namespace you cannot update the devices
cgroup whitelist (or blacklist). The kernel won't allow it. So
detect that case and don't try.
This is a step to being able to run docker/runc containers inside a user
namespaced container.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
It's handled in `destroy()`, no need to do this in
`Apply()`. I found this because systemd cgroup didn't
do this removal and it works well.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In order to avoid problems with security regressions going unnoticed,
add some unit tests that should make sure security regressions in cgroup
path safety cause tests to fail in runC.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Ensure that path safety is maintained, this essentially reapplies
c0cad6aa5e ("cgroups: fs: fix cgroup.Parent path sanitisation"), which
was accidentally removed in 256f3a8ebc ("Add support for CgroupsPath
field").
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Because we are implemented in Go, the number of pids present in a
container is not very well-defined (other than it not being /much/
bigger than the limit you'd want to set). As a result, we need to make
the tests a bit less flaky in this regard.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Create a new session key ring '_ses' for every container. This avoids sharing
the key structure with the process that created the container and the
container inherits from.
This patch fixes it init and exec.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
GetMounts is very cpu-expensive. I'll change other funcs in this package
to reuse code from GetCgroupMounts later.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
It's never used and not needed. Our pipe is created with
syscall.SOCK_CLOEXEC, so pipe will be closed once container
process executed successfully, parent process will read EOF
and continue. If container process got error before executed,
we'll write procError to sync with parent.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
By adding detach to runc the container process is the only thing running
on the system is the containers process.
This allows better usage of memeory and no runc process being long
lived. With this addition you also need a delete command because the
detached container will not be able to remove state and the left over
cgroups directories.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
We don't need a CreatedTime method on the container because it's not
part of the interface and can be received via the state. We also do not
need to call it CreateTime because the type of this field is time.Time
so we know its time.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add some further (not critical, since Docker does this already)
validation to systemd slice names, to make sure users don't get cryptic
errors.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Marshall the raw objects for the sync pipes so that no new line chars
are left behind in the pipe causing errors.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Rather than using '/' to denote hierarchy in slice names, systemd uses
'-' in an odd way. This results in runC incorrectly assuming that
certain kernel features are missing (and using inconsistent paths for
the cgroups not supported by systemd), because the "subsystem path" used
is not the one that systemd has created. Fix all of this by properly
expanding slice names.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
There were issues where a process could die before pausing completed
leaving the container in an inconsistent state and unable to be
destoryed. This makes sure that if the container is paused and the
process is dead it will unfreeze the cgroup before removing them.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Modify the memory cgroup code such that kmem is not managed by Set(), in
order to allow updating of memory constraints for containers by Docker.
This also removes the need to make memory a special case cgroup.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
For existing consumers of libconatiner to not require cwd inside the
libcontainer code. This can be done at the runc level and is already
evaluated there.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Set `-1` doesn't mean disable swap, disable swap means you
can't use swap memory, set `-1` really means you can use
unlimited swap memory.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.
Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).
Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.
One of the weird cases to deal with is systemd. Systemd sets some of the
cgroup configuration options, but not all of them. Because memory is a
special case, we need to explicitly set memory in the systemd Apply().
Otherwise, the rest can be safely re-applied in .Set() as usual.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.
Unfortunately, due to the init process being written in Go, it can spawn
an an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work properly.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Properly sanitise the --cgroup-parent path, to avoid potential issues
(as it starts creating directories and writing to files as root). In
addition, fix an infinite recursion due to incomplete base cases.
It might be a good idea to move pathClean to a separate library (which
deals with path safety concerns, so all of runC and Docker can take
advantage of it).
Signed-off-by: Aleksa Sarai <asarai@suse.com>
When we launch a container in a new user namespace, we cannot create
devices, so we bind mount the host's devices into place instead.
If we are running in a user namespace (i.e. nested in a container),
then we need to do the same thing. Add a function to detect that
and check for it before doing mknod.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
---
Changelog - add a comment clarifying what's going on with the
uidmap file.
It may be desirable to receive memory pressure levels notifications
before the container depletes all memory. This may be useful for
handling cases where the system thrashes when reaching the container's
memory limits.
Signed-off-by: Ido Yariv <ido@wizery.com>
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.
Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).
Signed-off-by: Aleksa Sarai <asarai@suse.com>
It is vital to loudly fail when a user attempts to set a cgroup limit
(rather than using the system default). Otherwise the user will assume
they have security they do not actually have. This mirrors the original
Apply() (that would set cgroup configs) semantics.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Apply and Set are two separate operations, and it doesn't make sense to
group the two together (especially considering that the bootstrap
process is added to the cgroup as well). The only exception to this is
the memory cgroup, which requires the configuration to be set before
processes can join.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Add support for the pids cgroup controller to libcontainer, a recent
feature that is available in Linux 4.3+.
Unfortunately, due to the init process being written in Go, it can spawn
an an unknown number of threads due to blocked syscalls. This results in
the init process being unable to run properly, and thus small pids.max
configs won't work properly.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
syscall.NLA_HDRLEN is not in gccgo (as of 5.3), so in the meantime
use the #defines taken from linux/netlink.h.
See https://github.com/golang/go/issues/13629
Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add state status() method
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Allow multiple checkpoint on restore
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Handle leave-running state
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Fix state transitions for inprocess
Because the tests use libcontainer in process between the various states
we need to ensure that that usecase works as well as the out of process
one.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Remove isDestroyed method
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Handling Pausing from freezer state
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
freezer status
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Fixing review comments
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Added comment when freezer not available
Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Conflicts:
libcontainer/container_linux.go
Change checkFreezer logic to isPaused()
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Remove state base and factor out destroy func
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add unit test for state transitions
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This allows us to distinguish cases where a container
needs to just join the paths or also additionally
set cgroups settings. This will help in implementing
cgroupsPath support in the spec.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
replace passing of pid and console path via environment variable with passing
them with netlink message via an established pipe.
this change requires us to set _LIBCONTAINER_INITTYPE and
_LIBCONTAINER_INITPIPE as the env environment of the bootstrap process as we
only send the bootstrap data for setns process right now. When init and setns
bootstrap process are unified (i.e., init use nsexec instead of Go to clone new
process), we can remove _LIBCONTAINER_INITTYPE.
Note:
- we read nlmsghdr first before reading the content so we can get the total
length of the payload and allocate buffer properly instead of allocating
one large buffer.
- check read bytes vs the wanted number. It's an error if we failed to read
the desired number of bytes from the pipe into the buffer.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
This patch fixes the following go vet warnings:
```
libcontainer/network_linux.go:96: github.com/vishvananda/netlink.Device
composite literal uses unkeyed fields
libcontainer/network_linux.go:114: github.com/vishvananda/netlink.Device
composite literal uses unkeyed fields
```
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
add bootstrap data to setns process. If we have any bootstrap data then copy it
to the bootstrap process (i.e. nsexec) using the sync pipe. This will allow us
to eventually replace environment variable usage with more structured data
to setup namespaces, write pid/gid map, setgroup etc.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
Enables launching userns containers by catching EPERM errors for writing
to devices cgroups, and for mknod invocations.
Signed-off-by: Abin Shahab <ashahab@altiscale.com>
When starting and quering for pids a container can start and exit before
this is set. So set the opts after the process is started and while
libcontainer still has the container's process blocking on the pipe.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The former cgroup entry is confusing, separate it to parent
and name.
Rename entry `c` to `config`.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
'parent' function is confusing with parent cgroup, it's actually
parent path, so rename it to parentPath.
The name 'data' is too common to be identified, rename it to cgroupData
which is exactly what it is.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
The spec uses symlinks to "/proc/1/..." but the implementation uses
"/proc/self/...": see setupDevSymlinks (libcontainer/rootfs_linux.go).
The implementation is more correct, so I'm changing the spec to match
the implementation.
Signed-off-by: Alban Crequy <alban.crequy@coreos.com>
Minor fix, the former setupDev=true means not setup dev,
which is contrary to intuition, just correct it.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
We have a rule that for optional cgroups, don't fail if some
of them are not mounted, but we want it fail hard when a
user specifies an option and we are unable to fulfill the
request.
Memory cgroup should also follow this rule.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Also add cpuset as the first in the list to address issues setting the
pid in any cgroup before the cpuset is populated.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
It can avoid unnecessary task migrataion, see this scenario:
- container init task is on cpu 1, and we assigned it to cpu 1,
but parent cgroup's cpuset.cpus=2
- we created the cgroup dir and inherited cpuset.cpus from parent as 2
- write container init task's pid to cgroup.procs
- [it's possibile the container init task migrated to cpu 2 here]
- set cpuset.cpus as assigned to cpu 1
- [the container init task has to be migrated back to cpu 1]
So we should set cpuset.cpus and cpuset.mems before writing pids
to cgroup.procs to aviod such problem.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
While testing different versions of criu it helps to know which
criu binary with which options is currently used. Therefore additional
debug output to display these information is added.
v2: increase readability of printed out criu options
Signed-off-by: Adrian Reber <adrian@lisas.de>
Only valid options to --security-opt for label should be
disable, user, role, type, level.
Return error on invalid entry
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
This rather naively fixes an error observed where a processes stdio
streams are not written to when there is an error upon starting up the
process, such as when the executable doesn't exist within the
container's rootfs.
Before the "fix", when an error occurred on start, `terminate` is called
immediately, which calls `cmd.Process.Kill()`, then calling `Wait()` on
the process. In some cases when this `Kill` is called the stdio stream
have not yet been written to, causing non-deterministic output. The
error itself is properly preserved but users attached to the process
will not see this error.
With the fix it is just calling `Wait()` when an error occurs rather
than trying to `Kill()` the process first. This seems to preserve stdio.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Docker pkgs were updated while golinting the whole docker code base.
Now when trying to bump libcontainer/runc in docker, it fails compiling
with the following error:
``
vendor/src/github.com/opencontainers/runc/libcontainer/rootfs_linux.go:424:
undefined: mount.MountInfo
``
This is because, for instance, the mount pkg was updated here
0f5c9d301b (diff-49294d05afa48e2f7c0d2f02c6f7614c)
and now that type is only `mount.Info`.
This patch bump docker pkgs commit and adapt code to it.
Signed-off-by: Antonio Murdaca <amurdaca@redhat.com>
This allows getting the path to the subsystem and so is subsequently
used in EnterPid by an exec process.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
This is meant to be used in retrieving the paths so an exec
process enters all the cgroup paths correctly.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
/etc/groups is not needed when specifying numeric group ids. This
change allows containers without /etc/groups to specify numeric
supplemental groups.
Signed-off-by: Sami Wagiaalla <swagiaal@redhat.com>
Godeps: Vendor opencontainers/specs 96bcd043aa
Fix a bug where it's impossible to pass multiple devices to blkio
cgroup controller files. See https://github.com/opencontainers/runc/issues/274
Signed-off-by: Antonio Murdaca <runcom@linux.com>
pivotDir is the one where pivot_root() call puts the old root. We will
unmount pivotDir() and delete it.
Previously we were making / always rslave or rprivate. That will mean
that pivotDir() could never have mounts which would be shared with
parent mount namespace. That also means that unmounting pivotDir() was
safe and none of the unmount will propagate to parent namespace and
unmount things which we did not want to.
But now user can specify that apply private, shared, slave on /. That
means some of the mounts we inherited from parent could be shared and that
also means if we umount pivotDir/, those mounts will get unmounted in
parent too. That's not what we want.
Instead make pivotDir rprivate so that unmounts don't propagate back to
parent.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
pivot_root() introduces bunch of restrictions otherwise it fails. parent
mount of container root can not be shared otherwise pivot_root() will
fail.
So far parent could not be shared as we marked everything either private
or slave. But now we have introduced new propagation modes where parent
mount of container rootfs could be shared and pivot_root() will fail.
So check if parent mount is shared and if yes, make it private. This will
make sure pivot_root() works.
Also it will make sure that when we bind mount container rootfs, it does
not propagate to parent mount namespace. Otherwise cleanup becomes a
problem.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Right now config.Privatefs is a boolean which determines if / is applied
with propagation flag syscall.MS_PRIVATE | syscall.MS_REC or not.
Soon we want to represent other propagation states like private, [r]slave,
and [r]shared. So either we can introduce more boolean variable or keep
track of propagation flags in an integer variable. Keeping an integer
variable is more versatile and can allow various kind of propagation flags
to be specified. So replace Privatefs with RootPropagation which is an
integer.
Note, this will require changes in docker. Instead of setting Privatefs
to true, they will need to set.
config.RootPropagation = syscall.MS_PRIVATE | syscall.MS_REC
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Do not remount a bind mount to enable flags unless non-default flags are
provided for the requested mount. This solves a problem with user
namespaces and remount of bind mount permissions.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)