Revert: #935Fixes: #946
I can reproduce #946 on some machines, the problem is on
some machines, it could be very fast that modify time
of `memory.kmem.limit_in_bytes` could be the same as
before it's modified.
And now we'll call `SetKernelMemory` twice on container
creation which cause the second time failure.
Revert this before we find a better solution.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Previously we used the same JSON tag name for the regular and realtime
versions of the CpuRt* fields, which causes issues when you want to use
two different values for the fields.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Setting classid of net_cls cgroup failed:
ERRO[0000] process_linux.go:291: setting cgroup config for ready process caused "failed to write 𐀁 to net_cls.classid: write /sys/fs/cgroup/net_cls,net_prio/user.slice/abc/net_cls.classid: invalid argument"
process_linux.go:291: setting cgroup config for ready process caused "failed to write 𐀁 to net_cls.classid: write /sys/fs/cgroup/net_cls,net_prio/user.slice/abc/net_cls.classid: invalid argument"
The spec has classid as a *uint32, the libcontainer configs should match the type.
Signed-off-by: Hushan Jia <hushan.jia@gmail.com>
1. According to docs of Cmd.Path and Cmd.Args from package "os/exec":
Path is the path of the command to run. Args holds command line
arguments, including the command as Args[0]. We have mixed usage
of args. In InitPath(), InitArgs only take arguments, in InitArgs(),
InitArgs including the command as Args[0]. This is confusing.
2. InitArgs() already have the ability to configure a LinuxFactory
with the provided absolute path to the init binary and arguements as
InitPath() does.
3. exec.Command() will take care of serching executable path.
4. The default "/proc/self/exe" instead of os.Args[0] is passed to
InitArgs in order to allow relative path for the runC binary.
Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>
Added a unit test to verify that 'cpu.rt_runtime_us' and 'cpu.rt_runtime_us'
cgroup values are set when the cgroup is applied to a process.
Signed-off-by: Ben Gray <ben.r.gray@gmail.com>
before trying to move the process into the cgroup.
This is required if runc itself is running in SCHED_RR mode, as it is not
possible to add a process in SCHED_RR mode to a cgroup which hasn't been
assigned any RT bandwidth. And RT bandwidth is not inherited, each new
cgroup starts with 0 b/w.
Signed-off-by: Ben Gray <ben.r.gray@gmail.com>
There's no point in changing directory here. Syscalls are resolved local
to the linkpath, not to the current directory that the process was in
when creating the symlink. Changing directories just confuses people who
are trying to debug things.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Comparisons with paths aren't really a good idea unless you're
guaranteed that the comparison will work will all paths that resolve to
the same lexical path as the compared path.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This removes the use of a signal handler and SIGCONT to signal the init
process to exec the users process.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Prior to this change a cgroup with a `:` character in it's path was not
parsed correctly (as occurs on some instances of systemd cgroups under
some versions of systemd, e.g. 225 with accounting).
This fixes that issue and adds a test.
Signed-off-by: Euan Kemp <euank@coreos.com>
This adds an `--no-new-keyring` flag to run and create so that a new
session keyring is not created for the container and the calling
processes keyring is inherited.
Fixes#818
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Delegate is only available in systemd >218, applying it for older systemd will
result in an error. Therefore we should check for it when testing systemd
properties.
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
Kernel memory cannot be set in these circumstances (before kernel 4.6):
1. kernel memory is not initialized, and there are tasks in cgroup
2. kernel memory is not initialized, and use_hierarchy is enabled,
and there are sub-cgroups
While we don't need to cover case 2 because when we set kernel
memory in runC, it's either:
- in Apply phase when we create the container, and in this case,
set kernel memory would definitely be valid;
- or in update operation, and in this case, there would be tasks
in cgroup, we only need to check if kernel memory is initialized
or not.
Even if we want to check use_hierarchy, we need to check sub-cgroups
as well, but for here, we can just leave it aside.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
In user namespaces, we need to make sure we don't chown() the console to
unmapped users. This means we need to get both the UID and GID of the
root user in the container when changing the owner.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Fix the following warnings when building runc with gcc 6+:
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:
In function ‘nsexec’:
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:322:6:
warning: ‘__s’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
pr_perror("Failed to open %s", ns);
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:273:30:
note: ‘__s’ was declared here
static struct nsenter_config process_nl_attributes(int pipenum, char
*data, int data_size)
^~~~~~~~~~~~~~~~~~~~~
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
`libcontainer/cgroups/utils.go` uses an incorrect path to the
documentation for cgroups. This updates the comment to use the correct
URL. Fixes#794.
Signed-off-by: Jim Berlage <james.berlage@gmail.com>
See https://github.com/docker/docker/issues/22252
Previously we would apply seccomp rules before applying
capabilities, because it requires CAP_SYS_ADMIN. This
however means that a seccomp profile needs to allow
operations such as setcap() and setuid() which you
might reasonably want to disallow.
If prctl(PR_SET_NO_NEW_PRIVS) has been applied however
setting a seccomp filter is an unprivileged operation.
Therefore if this has been set, apply the seccomp
filter as late as possible, after capabilities have
been dropped and the uid set.
Note a small number of syscalls will take place
after the filter is applied, such as `futex`,
`stat` and `execve`, so these still need to be allowed
in addition to any the program itself needs.
Signed-off-by: Justin Cormack <justin.cormack@docker.com>
Avoid parsing the whole lines of mountinfo after all mountpoints
of the target subsytems are found, or when the target subsystem
is not enabled.
Signed-off-by: Tatsushi Inagaki <e29253@jp.ibm.com>
Remove a wrongly added include which was added in commit 3c2e77ee (Add a
compatibility header for CentOS/RHEL 6, 2016-01-29) apparently to
fix this compile error on centos 6:
> In file included from
> Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:20:
> /usr/include/linux/netlink.h:35: error: expected specifier-qualifier-list before 'sa_family_t'
The glibc bits/sockaddr.h says that this header should never be included
directly[1]. Instead, sys/socket.h should be used.
The problem was correctly fixed later, in commit 394fb55 (Fix build
error on centos6, 2016-03-02) so the incorrect bits/sockaddr.h can
safely be removed.
This is needed to build musl libc.
Fixes#761
[1]: 20003c4988/bits/sockaddr.h (L20)
Signed-off-by: Natanael Copa <natanael.copa@docker.com>
This is the inital port of the libcontainer.Error to added a cause to
all the existing error messages. Going forward, when an error can be
wrapped because it is not being checked at the higher levels for
something like `os.IsNotExist` we can add more information to the error
message like cause and stack file/line information. This will help
higher level tools to know what cause a container start or operation to
fail.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
No substantial code change.
Note that some style errors reported by `golint` are not fixed due to possible compatibility issues.
Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
When swap memory is unsupported, Docker will set
cgroup.Resources.MemorySwap as -1.
Fixes: https://github.com/docker/docker/pull/21937
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
setupDev was introduced in #96, but broken since #536 because spec 0.3.0 introduced default devices.
Fix#80 again
Fixdocker/docker#21808
Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
One of our volume plugins needs to get the label of the target mount point
so that it can set the content inside of the volume to match.
We need label.GetFileLabel() to make this work.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Some of the code was quite confusing inside libcontainer/user, so
refactor and comment it so future maintainers can understand what's
going and what edge cases we have to deal with.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Most shadow-related tools don't treat numeric ids as potential
usernames, so change our behaviour to match that. Previously, using an
explicit specification like 111:222 could result in the UID and GID not
being 111 and 222 respectively (which is confusing).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This adds a `--no-pivot` cli flag to runc so that a container's rootfs
can be located ontop of ramdisk/tmpfs and not fail because you cannot
pivot root.
This should be a cli flag and not part of the spec because this is a
detail of the host/runtime environment and not an attribute of a
container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
Currently, if we start a container with:
`docker run -ti --name foo --memory 300M --memory-swap 500M busybox sh`
Then we want to update it with:
`docker update --memory 600M --memory-swap 800M foo`
It'll get error because we can't set memory to 600M with
the 500M limit of swap memory.
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
Based on Golang document, %s is for "the uninterpreted bytes of the
string or slice", so %v is more appropriate.
Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
Users of libcontainer other than runc may also require parsing and
converting specification configuration files.
Since runc cannot be imported, move the relevant functions and
definitions to a separate package, libcontainer/specconv.
Signed-off-by: Ido Yariv <ido@wizery.com>
Fixes#680
This changes setupRlimit to use the Prlimit syscall (rather than
Setrlimit) and moves the call to the parent process. This is necessary
because Setrlimit would affect the libcontainer consumer if called in
the parent, and would fail if called from the child if the
child process is in a user namespace and the requested rlimit is higher
than that in the parent.
Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com>
Make sure we don't error out collecting statistics for cases where
pids.max == "max". In that case, we can use a limit of 0 which means
"unlimited".
In addition, change the name of the stats attribute (Max) to mirror the
name of the resources attribute in the spec (Limit) so that it's
consistent internally.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is required because we manage some of the cgroups ourselves.
This recommendation came from talking with systemd devs about
some of the issues that we see when using the systemd cgroups driver.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
We need to make sure the container is destroyed before closing the stdio
for the container. This becomes a big issues when running in the host's
pid namespace because the other processes could have inherited the stdio
of the initial process. The call to close will just block as they still
have the io open.
Calling destroy before closing io, especially in the host pid namespace
will cause all additional processes to be killed in the container's
cgroup. This will allow the io to be closed successfuly.
This change makes sure the order for destroy and close is correct as
well as ensuring that if any errors encoutered during start or exec will
be handled by terminating the process and destroying the container. We
cannot use defers here because we need to enforce the correct ordering
on destroy.
This also sets the subreaper setting for runc so that when running in
pid host, runc can wait on the addiontal processes launched by the
container, useful on destroy, but also good for reaping the additional
processes that were launched.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
This ensures that anything written to the logs are synced as they
happen.
This also changes the error message of the libcontainer error. The
original idea was to have this extra information in the message but it
makes it hard to parse and if the caller needed this information they
can just get it from the error type.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
In order to allow nice usage statistics (in terms of percentages and
other such data), add the value of pids.max to the PidsStats struct
returned from the pids cgroup controller.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Clears supplementary groups that have effect on the
mount permissions before joining the user specified
groups happens.
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
This updates runc and libcontainer to handle rlimits per process and set
them correctly for the container.
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
The rhel6 kernel returns EINVAL in this case
Known issue:
* CT with userns doesn't work
This is a copy of
d31e97fa28
to address https://github.com/opencontainers/runc/issues/613
Signed-off-by: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Fernandes <andrew@fernandes.org>