Commit Graph

54 Commits

Author SHA1 Message Date
Giuseppe Scrivano d8b669400a
rootless: allow multiple user/group mappings
Take advantage of the newuidmap/newgidmap tools to allow multiple
users/groups to be mapped into the new user namespace in the rootless
case.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
[ rebased to handle intelrdt changes. ]
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-09-09 12:45:32 +10:00
Xiaochen Shen 88d22fde40 libcontainer: intelrdt: use init() to avoid race condition
This is the follow-up PR of #1279 to fix remaining issues:

Use init() to avoid race condition in IsIntelRdtEnabled().
Add also rename some variables and functions.

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-08 17:15:31 +08:00
Xiaochen Shen 692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Aleksa Sarai 7cfb107f2c
factory: use e{u,g}id as the owner of /run/runc/$id
It appears as though these semantics were not fully thought out when
implementing them for rootless containers. It is not necessary (and
could be potentially dangerous) to set the owner of /run/ctr/$id to be
the root inside the container (if user namespaces are being used).

Instead, just use the e{g,u}id of runc to determine the owner.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-07-12 06:30:46 +10:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Harshal Patil 700c74cb7e Issue #1429 : Removing check for id string length
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-05-04 09:21:29 +05:30
Aleksa Sarai f0876b0427
libcontainer: configs: add proper HostUID and HostGID
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Aleksa Sarai baeef29858
rootless: add rootless cgroup manager
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Michael Crosby 00a0ecf554 Add separate console socket
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Qiang Huang 805b8c73d3 Do not create exec fifo in factory.Create
It should not be binded to container creation, for
example, runc restore needs to create a
libcontainer.Container, but it won't need exec fifo.

So create exec fifo when container is started or run,
where we really need it.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-22 10:34:48 -08:00
Qiang Huang be33383e60 Merge pull request #1293 from stevenh/resolve-initarg
Resolve InitArgs to ensure init works
2017-02-03 19:25:52 +08:00
Steven Hartland b9dfa444c4 Resolve InitArgs to ensure init works
If a relative pathed exe is used for InitArgs init will fail to run if Cwd is not set the original path.

Prevent failure of init to run by ensuring that exe in InitArgs is an absolute path.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-02-01 13:42:09 +00:00
Aleksa Sarai e034cedce7
libcontainer: init: only pass stateDirFd when creating a container
If we pass a file descriptor to the host filesystem while joining a
container, there is a race condition where a process inside the
container can ptrace(2) the joining process and stop it from closing its
file descriptor to the stateDirFd. Then the process can access the
*host* filesystem from that file descriptor. This was fixed in part by
5d93fed3d2 ("Set init processes as non-dumpable"), but that fix is
more of a hail-mary than an actual fix for the underlying issue.

To fix this, don't open or pass the stateDirFd to the init process
unless we're creating a new container. A proper fix for this would be to
remove the need for even passing around directory file descriptors
(which are quite dangerous in the context of mount namespaces).

There is still an issue with containers that have CAP_SYS_PTRACE and are
using the setns(2)-style of joining a container namespace. Currently I'm
not really sure how to fix it without rampant layer violation.

Fixes: CVE-2016-9962
Fixes: 5d93fed3d2 ("Set init processes as non-dumpable")
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-02-02 00:41:11 +11:00
Steven Hartland 64aa78b762 Ensure pipe is always closed on error in StartInitialization
Ensure that the pipe is always closed during the error processing of  StartInitialization.

Also:
* Fix a comment typo.
* Use newContainerInit directly as there's no need for i to be an initer.
* Move the comment about the behaviour of Init() directly above it, clarifying what happens for all defers.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-01-25 12:36:40 +00:00
Aleksa Sarai 4776b4326a
libcontainer: refactor syncT handling
To make the code cleaner, and more clear, refactor the syncT handling
used when creating the `runc init` process. In addition, document the
state changes so that people actually understand what is going on.

Rather than only using syncT for the standard initProcess, use it for
both initProcess and setnsProcess. This removes some special cases, as
well as allowing for the use of syncT with setnsProcess.

Also remove a bunch of the boilerplate around syncT handling.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:46:04 +11:00
Michael Crosby fcc40b7a63 Remove panic from init
Print the error message to stderr if we are unable to return it back via
the pipe to the parent process.  Also, don't panic here as it is most
likely a system or user error and not a programmer error.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-10-17 15:54:51 -07:00
Qiang Huang 3597b7b743 Merge pull request #1087 from williammartin/master
Fix typo when container does not exist
2016-09-29 09:19:45 +08:00
William Martin 152169ed34 Fix typo when container does not exist
Signed-off-by: William Martin <wmartin@pivotal.io>
2016-09-28 11:00:50 +00:00
Peng Gao c5393da813 Refactor enum map range to slice range
grep -r "range map" showw 3 parts use map to
range enum types, use slice instead can get
better performance and less memory usage.

Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>
2016-09-28 15:36:29 +08:00
rajasec 714550f87c Error handling when container not exists
Signed-off-by: rajasec <rajasec79@gmail.com>

Error handling when container not exists

Signed-off-by: rajasec <rajasec79@gmail.com>

Error handling when container not exists

Signed-off-by: rajasec <rajasec79@gmail.com>

Error handling when container not exists

Signed-off-by: rajasec <rajasec79@gmail.com>
2016-08-26 00:00:54 +05:30
Michael Crosby 46d9535096 Merge pull request #934 from macrosheep/fix-initargs
Fix and refactor init args
2016-08-24 10:06:01 -07:00
Yang Hongyang a59d63c5d3 Fix and refactor init args
1. According to docs of Cmd.Path and Cmd.Args from package "os/exec":
   Path is the path of the command to run. Args holds command line
   arguments, including the command as Args[0]. We have mixed usage
   of args. In InitPath(), InitArgs only take arguments, in InitArgs(),
   InitArgs including the command as Args[0]. This is confusing.
2. InitArgs() already have the ability to configure a LinuxFactory
   with the provided absolute path to the init binary and arguements as
   InitPath() does.
3. exec.Command() will take care of serching executable path.
4. The default "/proc/self/exe" instead of os.Args[0] is passed to
   InitArgs in order to allow relative path for the runC binary.

Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>
2016-07-06 23:21:02 -04:00
Yang Hongyang 9ade2cc5ce libcontainer: Add a helper func to set CriuPath
Added a helper func to set CriuPath for LinuxFactory.

Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>
2016-07-06 22:58:55 -04:00
Qiang Huang 14e95b2aa9 Make state detection precise
Fixes: https://github.com/opencontainers/runc/issues/871

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-07-05 08:24:13 +08:00
Michael Crosby 5ce88a95f6 Fix fifo usage with userns
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-13 20:20:48 -07:00
Michael Crosby 3aacff695d Use fifo for create/start
This removes the use of a signal handler and SIGCONT to signal the init
process to exec the users process.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-06-13 11:26:53 -07:00
Michael Crosby efcd73fb5b Fix signal handling for unit tests
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:10:47 -07:00
Michael Crosby 3fc929f350 Only create a buffered channel of one
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00
Michael Crosby 30f1006b33 Fix libcontainer states
Move initialized to created and destoryed to stopped.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00
Michael Crosby 3fe7d7f31e Add create and start command for container lifecycle
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-05-31 11:06:41 -07:00
Alexander Morozov d57898610b Merge pull request #675 from pankit/master
Allow + in container ID
2016-05-25 10:35:08 -07:00
Tonis Tiigi 78ecdfe18e Show proper error from init process panic
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2016-03-22 15:57:15 -07:00
pankit thapar 4629512d89 Allow + in container ID
Signed-off-by: pankit thapar <pankit@umich.edu>
2016-03-22 11:40:55 -04:00
rajasec d4be3405c7 Fixing valid-id in regex
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-03-14 08:48:41 +05:30
Michael Crosby 213c1a1a4a Revert "Return proper exit code for exec errors"
This reverts commit 6bb653a6e8.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-03-10 11:00:48 -08:00
Michael Crosby 6bb653a6e8 Return proper exit code for exec errors
Exec erros from the exec() syscall in the container's init should be
treated as if the container ran but couldn't execute the process for the
user instead of returning a libcontainer error as if it was an issue in
the library.

Before specifying different commands like `/etc`, `asldfkjasdlfj`, or
`/alsdjfkasdlfj` would always return 1 on the command line with a
libcontainer specific error message.  Now they return the correct
message and exit status defined for unix processes.

Example:

```bash
root@deathstar:/containers/redis# runc start test
exec: "/asdlfkjasldkfj": file does not exist
root@deathstar:/containers/redis# echo $?
127
root@deathstar:/containers/redis# runc start test
exec: "asdlfkjasldkfj": executable file not found in $PATH
root@deathstar:/containers/redis# echo $?
127
root@deathstar:/containers/redis# runc start test
exec: "/etc": permission denied
root@deathstar:/containers/redis# echo $?
126
```

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-02-26 11:41:56 -08:00
Alexander Morozov c6d18308b8 Merge pull request #526 from hqhq/hq_remove_procStart
Remove procStart
2016-02-16 09:12:04 -08:00
Qiang Huang 13e8f6e589 Remove procStart
It's never used and not needed. Our pipe is created with
syscall.SOCK_CLOEXEC, so pipe will be closed once container
process executed successfully, parent process will read EOF
and continue. If container process got error before executed,
we'll write procError to sync with parent.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-01-30 13:41:21 +08:00
Michael Crosby 1172a1e1e5 Update list command and created methods
We don't need a CreatedTime method on the container because it's not
part of the interface and can be received via the state.  We also do not
need to call it CreateTime because the type of this field is time.Time
so we know its time.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-01-28 13:32:24 -08:00
Michael Crosby 480e5f4416 Merge pull request #507 from mikebrow/runc-ls-command
adds list command
2016-01-28 13:20:07 -08:00
Mike Brown 4c871267db adds list command, and a timestamp in the container state
Signed-off-by: Mike Brown <brownwm@us.ibm.com>
2016-01-28 14:21:06 -06:00
Michael Crosby 7cd384c0e5 Merge pull request #515 from crosbymichael/readall
Do not use stream encoders for pipe communication
2016-01-26 14:37:54 -08:00
Michael Crosby ddcee3cc2a Do not use stream encoders
Marshall the raw objects for the sync pipes so that no new line chars
are left behind in the pipe causing errors.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-01-26 11:22:05 -08:00
Doug Davis ff034a5119 Remove the nullState
Add a "createdState" in its place since I think that better describes
what its used for.

Signed-off-by: Doug Davis <dug@us.ibm.com>
2016-01-25 00:26:11 -08:00
Michael Crosby 9c3fa7928e Allow switch to anything from nullState
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2016-01-21 16:48:05 -08:00
Aleksa Sarai 103853ead7 libcontainer: set cgroup config late
Due to the fact that the init is implemented in Go (which seemingly
randomly spawns new processes and loves eating memory), most cgroup
configurations are required to have an arbitrary minimum dictated by the
init. This confuses users and makes configuration more annoying than it
should. An example of this is pids.max, where Go spawns multiple
processes that then cause init to violate the pids cgroup constraint
before the container can even start.

Solve this problem by setting the cgroup configurations as late as
possible, to avoid hitting as many of the resources hogged by the Go
init as possible. This has to be done before seccomp rules are applied,
as the parent and child must synchronise in order for the parent to
correctly set the configurations (and writes might be blocked by seccomp).

Signed-off-by: Aleksa Sarai <asarai@suse.com>
2016-01-12 10:06:35 +11:00
Michael Crosby 4415446c32 Add state pattern for container state transition
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Add state status() method

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Allow multiple checkpoint on restore

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Handle leave-running state

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Fix state transitions for inprocess

Because the tests use libcontainer in process between the various states
we need to ensure that that usecase works as well as the out of process
one.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Remove isDestroyed method

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Handling Pausing from freezer state

Signed-off-by: Rajasekaran <rajasec79@gmail.com>

freezer status

Signed-off-by: Rajasekaran <rajasec79@gmail.com>

Fixing review comments

Signed-off-by: Rajasekaran <rajasec79@gmail.com>

Added comment when freezer not available

Signed-off-by: Rajasekaran <rajasec79@gmail.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Conflicts:
	libcontainer/container_linux.go

Change checkFreezer logic to isPaused()

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Remove state base and factor out destroy func

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

Add unit test for state transitions

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-12-17 13:55:38 -08:00
Doug Davis e5dc12a0c9 Add more context around some error cases
Signed-off-by: Doug Davis <dug@us.ibm.com>
2015-10-30 10:55:48 -07:00
Alexander Morozov d9e513043c Use /proc/self/exe as default for InitPath
Signed-off-by: Alexander Morozov <lk4d4@docker.com>
2015-07-24 11:45:09 -07:00