This ensures that anything written to the logs are synced as they
happen.
This also changes the error message of the libcontainer error. The
original idea was to have this extra information in the message but it
makes it hard to parse and if the caller needed this information they
can just get it from the error type.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
In order to allow nice usage statistics (in terms of percentages and
other such data), add the value of pids.max to the PidsStats struct
returned from the pids cgroup controller.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Clears supplementary groups that have effect on the
mount permissions before joining the user specified
groups happens.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The rhel6 kernel returns EINVAL in this case
Known issue:
* CT with userns doesn't work
This is a copy of
d31e97fa28
to address https://github.com/opencontainers/runc/issues/613
Signed-off-by: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Fernandes <andrew@fernandes.org>
Now that all the user namespace code is moved into C, these routines are
no longer used.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
This commit adds support to libcontainer to allow caps, no new privs,
apparmor, and selinux process label to the process struct so that it can
be used together of override the base settings on the container config
per individual process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This is needed to make 'runc delete' correctly run the post-stop hooks.
Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
Signed-off-by: Ed King <eking@pivotal.io>
currentState() always adds all possible namespaces to the state,
regardless of whether they are supported.
If orderNamespacePaths detects an unsupported namespace, an error is
returned that results in initialization failure.
Fix this by only adding paths of supported namespaces to the state.
Signed-off-by: Ido Yariv <ido@wizery.com>
The path in the stacktrace might not be:
"github.com/opencontainers/runc/libcontainer/stacktrace"
For example, for me its:
"_/go/src/github.com/opencontainers/runc/libcontainer/stacktrace"
so I changed the check to make sure the tail end of the path matches instead
of the entire thing
Signed-off-by: Doug Davis <dug@us.ibm.com>
This simply move the call to the Prestart hooks to be made once we
receive the procReady message from the client.
This is necessary as we had to move the setns calls within nsexec in
order to be accomodate joining namespaces that only affect future
children (e.g. NEWPID).
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.
This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.
With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
This adds orderNamespacePaths to get correct order of namespaces for the
bootstrap program to join.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
This adds `configs.IsNamespaceSupported(nsType)` to check if the host supports
a namespace type.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
This allows the mount syscall to validate the addiontal types where we
do not have to perform extra validation and is up to the consumer to
verify the functionality of the type of device they are trying to
mount.
Fixes#572
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Exec erros from the exec() syscall in the container's init should be
treated as if the container ran but couldn't execute the process for the
user instead of returning a libcontainer error as if it was an issue in
the library.
Before specifying different commands like `/etc`, `asldfkjasdlfj`, or
`/alsdjfkasdlfj` would always return 1 on the command line with a
libcontainer specific error message. Now they return the correct
message and exit status defined for unix processes.
Example:
```bash
root@deathstar:/containers/redis# runc start test
exec: "/asdlfkjasldkfj": file does not exist
root@deathstar:/containers/redis# echo $?
127
root@deathstar:/containers/redis# runc start test
exec: "asdlfkjasldkfj": executable file not found in $PATH
root@deathstar:/containers/redis# echo $?
127
root@deathstar:/containers/redis# runc start test
exec: "/etc": permission denied
root@deathstar:/containers/redis# echo $?
126
```
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Signed-off-by: rajasec <rajasec79@gmail.com>
Changing to name values for defer as per review comments
Signed-off-by: rajasec <rajasec79@gmail.com>
Fixed review comments
Signed-off-by: rajasec <rajasec79@gmail.com>
This prior fix to set "-1" explicitly was lost, and it is simpler to use
the same pointer type from the OCI spec to handle nil pointer == -1 ==
unset case.
Also, as a nearly humorous aside, there was a test for MemorySwappiness
that was actually setting Memory, and it was passing because of this
bug (as it was always setting everyone's MemorySwappiness to zero!)
Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
Create a unique session key name for every container. Use the pattern
_ses.<postfix> with postfix being the container's Id.
This patch does not prevent containers from joining each other's session
keyring.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
Because we more than likely control dev and populate devices and files
inside of it we need to make sure that we fulfil the user's request to
make it ro only after it has been populated. This removes the need to
expose something like ReadonlyPaths in the config but still have the
same outcome but more seemless for the user.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Today mounts in pre-start hooks get overriden by the default mounts.
Moving the pre-start hooks to after the container mounts and before
the pivot/move root gives better flexiblity in the hooks.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
If you don't move the process out of the named cgroup for systemd then
systemd will try to delete all the cgroups that the process is currently
in.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Docker uses Prestart hooks to call a libnetwork hook to create
network devices and set addesses and routes.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
This options is set a namespace mask which will not be dumped and restored.
For example, we are going to use this option to restore network
for docker containers. CRIU will create a network namespace and
call a libnetwork hook to restore network devices, addresses and routes.
Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>
When in a non-initial user namespace you cannot update the devices
cgroup whitelist (or blacklist). The kernel won't allow it. So
detect that case and don't try.
This is a step to being able to run docker/runc containers inside a user
namespaced container.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
It's handled in `destroy()`, no need to do this in
`Apply()`. I found this because systemd cgroup didn't
do this removal and it works well.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In order to avoid problems with security regressions going unnoticed,
add some unit tests that should make sure security regressions in cgroup
path safety cause tests to fail in runC.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Ensure that path safety is maintained, this essentially reapplies
c0cad6aa5e ("cgroups: fs: fix cgroup.Parent path sanitisation"), which
was accidentally removed in 256f3a8ebc ("Add support for CgroupsPath
field").
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Because we are implemented in Go, the number of pids present in a
container is not very well-defined (other than it not being /much/
bigger than the limit you'd want to set). As a result, we need to make
the tests a bit less flaky in this regard.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Create a new session key ring '_ses' for every container. This avoids sharing
the key structure with the process that created the container and the
container inherits from.
This patch fixes it init and exec.
Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>
GetMounts is very cpu-expensive. I'll change other funcs in this package
to reuse code from GetCgroupMounts later.
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
It's never used and not needed. Our pipe is created with
syscall.SOCK_CLOEXEC, so pipe will be closed once container
process executed successfully, parent process will read EOF
and continue. If container process got error before executed,
we'll write procError to sync with parent.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
By adding detach to runc the container process is the only thing running
on the system is the containers process.
This allows better usage of memeory and no runc process being long
lived. With this addition you also need a delete command because the
detached container will not be able to remove state and the left over
cgroups directories.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
We don't need a CreatedTime method on the container because it's not
part of the interface and can be received via the state. We also do not
need to call it CreateTime because the type of this field is time.Time
so we know its time.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Add some further (not critical, since Docker does this already)
validation to systemd slice names, to make sure users don't get cryptic
errors.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
Marshall the raw objects for the sync pipes so that no new line chars
are left behind in the pipe causing errors.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Rather than using '/' to denote hierarchy in slice names, systemd uses
'-' in an odd way. This results in runC incorrectly assuming that
certain kernel features are missing (and using inconsistent paths for
the cgroups not supported by systemd), because the "subsystem path" used
is not the one that systemd has created. Fix all of this by properly
expanding slice names.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
There were issues where a process could die before pausing completed
leaving the container in an inconsistent state and unable to be
destoryed. This makes sure that if the container is paused and the
process is dead it will unfreeze the cgroup before removing them.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Modify the memory cgroup code such that kmem is not managed by Set(), in
order to allow updating of memory constraints for containers by Docker.
This also removes the need to make memory a special case cgroup.
Signed-off-by: Aleksa Sarai <asarai@suse.com>
For existing consumers of libconatiner to not require cwd inside the
libcontainer code. This can be done at the runc level and is already
evaluated there.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Set `-1` doesn't mean disable swap, disable swap means you
can't use swap memory, set `-1` really means you can use
unlimited swap memory.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>