jasder/runc - runc - Military Science Open Source Project Hosting

Commit Graph

Author	SHA1	Message	Date
Tobias Klauser	d713652bda	libcontainer: remove unnecessary type conversions Generated using github.com/mdempsky/unconvert Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-09-25 10:41:57 +02:00
Mrunal Patel	d5b43c3981	Merge pull request #1455 from dqminh/epoll-io tty: move IO of master pty to be done with epoll	2017-09-11 11:32:42 -07:00
Aleksa Sarai	6097ce74d8	nsenter: correctly handle newgidmap path for rootless containers After quite a bit of debugging, I found that previous versions of this patchset did not include newgidmap in a rootless setting. Fix this by passing it whenever group mappings are applied, and also providing some better checking for try_mapping_tool. This commit also includes some stylistic improvements. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	d8b669400a	rootless: allow multiple user/group mappings Take advantage of the newuidmap/newgidmap tools to allow multiple users/groups to be mapped into the new user namespace in the rootless case. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [ rebased to handle intelrdt changes. ] Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Mrunal Patel	7e036aa0b0	Merge pull request #1541 from adrianreber/lazy checkpoint: support lazy migration	2017-09-07 13:25:04 -07:00
Adrian Reber	60ae7091de	checkpoint: support lazy migration With the help of userfaultfd, CRIU supports lazy migration. Lazy migration means that memory pages are only transferred from the migration source to the migration destination on page fault. This reduces the downtime during process or container migration to a minimum, as the memory does not need to be transferred during migration. Lazy migration currently depends on userfaultfd being available in the running Linux kernel and on the CRIU version in use supporting lazy migration. Both dependencies can be checked by querying CRIU via RPC to see whether the lazy migration feature is available. Using feature checking instead of version comparison enables runC to use CRIU features from the criu-dev branch. This way the user can decide if lazy migration should be available by choosing the right kernel and CRIU branch. To use lazy migration, the CRIU process during dump needs to dump everything besides the memory pages, and then it opens a network port and waits for remote page fault requests: # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 --status-fd /tmp/postcopy-pipe In this example CRIU will wait once it has opened the network port until a network connection arrives. As runC waits for CRIU to finish, it will also hang until the lazy migration has finished. To know when the restore on the destination side can start, the '--status-fd' parameter is used: # runc checkpoint --help | grep status --status-fd value criu writes \0 to this FD once lazy-pages is ready The parameter '--status-fd' comes directly from CRIU, and this way the process outside of runC which controls the migration knows exactly when to transfer the checkpoint (without memory pages) to the destination and that the restore can be started. 
On the destination side it is necessary to start CRIU in 'lazy-pages' mode like this: # criu lazy-pages --page-server --address 192.168.122.3 --port 27 -D checkpoint and tell runC to do a lazy restore: # runc restore -d --image-path checkpoint --work-path checkpoint --lazy-pages httpd If both processes on the restore side have the same working directory, 'criu lazy-pages' creates a unix domain socket where it waits for requests from the actual restore. runC starts CRIU restore in lazy restore mode and tells 'criu lazy-pages' that it wants to restore memory pages on demand. CRIU continues to restore the process, and once the process is running and accesses the first non-existing memory page, the 'criu lazy-pages' server will request the page from the source system. Thus all pages from the source system will be transferred to the destination system. Once all pages have been transferred, runC on the source system will end and the container will have finished migration. This can also be combined with CRIU's pre-copy support. The combination of pre-copy and post-copy (lazy migration) provides the possibility to migrate containers with minimal downtimes. Some additional background about post-copy migration can be found in these articles: https://lisas.de/~adrian/?p=1253 https://lisas.de/~adrian/?p=1183 Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Adrian Reber	a3a632ad28	checkpoint: add support to query for lazy page support Before adding the actual lazy migration support, this adds the feature check for lazy-pages. Right now lazy migration, which is based on userfaultfd, is only available in the criu-dev branch and not yet in a release. As the check does not depend on a certain version but on a CRIU feature which can be queried, it can be part of runC without a new version check depending on a feature from criu-dev. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Xiaochen Shen	692f6e1e27	libcontainer: add support for Intel RDT/CAT in runc About Intel RDT/CAT feature: Intel platforms with new Xeon CPU support Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which currently supports L3 cache resource allocation. This feature provides a way for the software to restrict cache allocation to a defined 'subset' of L3 cache which may be overlapping with other 'subsets'. The different subsets are identified by class of service (CLOS) and each CLOS has a capacity bitmask (CBM). More information about Intel RDT/CAT can be found in section 17.17 of the Intel Software Developer Manual. About Intel RDT/CAT kernel interface: In Linux 4.10 kernel or newer, the interface is defined and exposed via the "resource control" filesystem, which is a "cgroup-like" interface. Compared with cgroups, it has a similar process management lifecycle and similar interfaces in a container. But unlike the cgroups hierarchy, it has a single-level filesystem layout. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ |-- info | |-- L3 | |-- cbm_mask | |-- min_cbm_bits | |-- num_closids |-- cpus |-- schemata |-- tasks |-- <container_id> |-- cpus |-- schemata |-- tasks For runc, we can make use of the `tasks` and `schemata` configuration for L3 cache resource constraints. The file `tasks` has a list of tasks that belong to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub group, it is in the root group. The file `schemata` has allocation bitmasks/values for L3 cache on each socket, which contain the L3 cache id and capacity bitmask (CBM). 
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..." For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`, which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0. A valid L3 cache CBM is a contiguous set of bits, and the number of bits that can be set is less than the max bit. The max bits in the CBM varies among supported Intel Xeon platforms. In the Intel RDT "resource control" filesystem layout, the CBM in a group should be a subset of the CBM in root. The kernel will check if it is valid when writing. E.g., 0xfffff in root indicates the max bits of CBM is 20 bits, which maps to the entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc. For more information about the Intel RDT/CAT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the default CBM is 0xfffff and the max CBM length is 20 bits. With this configuration, tasks inside the container only have access to the "upper" 80% of L3 cache id 0 and the "lower" 50% of L3 cache id 1: "linux": { "intelRdt": { "l3CacheSchema": "L3:0=ffff0;1=3ff" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Aleksa Sarai	7d66aab77a	init: switch away from stateDirFd entirely While we have significant protections in place against CVE-2016-9962, we were still holding onto a file descriptor that referenced the host filesystem. This meant that in certain scenarios it was still possible for a semi-privileged container to gain access to the host filesystem (if they had CAP_SYS_PTRACE). Instead, open the FIFO itself using O_PATH. This allows us to reference the FIFO directly without providing the ability for directory-level access. When opening the FIFO inside the init process, open it through procfs to re-open the actual FIFO (this is currently the only supported way to open such a file descriptor). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-25 13:19:03 +10:00
Nikolas Sepos	da4a5a9515	Add AutoDedup option to CriuOpts Memory image deduplication, very useful for incremental dumps. See: https://criu.org/Memory_images_deduplication Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>	2017-08-18 01:21:42 +02:00
Qiang Huang	e6e1c34a7d	Update state after update state.json should be a reflection of the container's realtime state, including resource configurations, so we should update state.json after updating container resources. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-15 14:38:44 +08:00
Adrian Reber	5d386f6e2b	checkpoint: use CRIU VERSION RPC if available With this runC also uses RPC to ask CRIU for its version. CRIU supports a VERSION RPC since CRIU 3.0 and using the RPC interface does not require parsing the console output of CRIU (which could change anytime). For older CRIU versions which do not yet have the VERSION RPC runC falls back to its old CRIU output parsing mode. Once CRIU 3.0 is the minimum version required for runC the old code can be removed. v2: * adapt to changes in the previous patches based on the review Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:08:07 +00:00
Adrian Reber	c71d9cd447	criuSwrk: prepare for CRIU VERSION RPC To use the CRIU VERSION RPC the criuSwrk function is adapted to work with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION RPC. Also do not print c.criuVersion if it is '0' as the first RPC call will always be the VERSION call and only after that the version will be known. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:28 +00:00
Adrian Reber	c5f0ce979b	checkCriuVersion: only ask criu once about its version If the version of criu has already been determined there is no need to ask criu for the version again. Use the value from c.criuVersion. v2: * reduce unnecessary code movement in the patch series * factor out the criu version parsing into a separate function Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:15 +00:00
Adrian Reber	b6c47281db	checkCriuVersion: switch to version using int The checkCriuVersion function used a string to specify the minimum version required. This is more comfortable for an external interface, but for an internal function it added unnecessary complexity. This changes the version string like '1.5.2' to an integer like 10502. This is already the format used internally in the function. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:05:27 +00:00
Michael Crosby	882d8eaba6	Merge pull request #1537 from tklauser/staticcheck Fix issues found by staticcheck	2017-08-02 09:52:11 -04:00
Tobias Klauser	e4e56cb6d8	libcontainer: remove ineffective break statements Go's switch statement doesn't need an explicit break. Remove it where that is the case and add a comment to indicate the purpose where the removal would lead to an empty case. Found with honnef.co/go/tools/cmd/staticcheck Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-28 15:13:39 +02:00
Tobias Klauser	24a4273cf9	libcontainer: handle error cases Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights. Found with honnef.co/go/tools/cmd/staticcheck Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-28 15:13:11 +02:00
Daniel Dao	91eafcbc65	tty: move IO of master pty to be done with epoll This moves all console code to use github.com/containerd/console library to handle console I/O. Also move to use EpollConsole by default when user requests a terminal so we can still cope when the other side temporarily goes away. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-07-28 12:35:02 +01:00
Steven Hartland	ee4f68e302	Updated logrus to v1 Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen. This includes a manual edit of the docker term package to also correct the name there too. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-07-19 15:20:56 +00:00
Tobias Klauser	54d27bed7f	libcontainer: use ParseSocketControlMessage/ParseUnixRights from x/sys/unix Use ParseSocketControlMessage and ParseUnixRights from golang.org/x/sys/unix instead of their syscall equivalent. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-13 15:02:17 +02:00
W. Trevor King	2bea4c897e	libcontainer/system/proc: Add Stat_t.State And Stat_t.PID and Stat_t.Name while we're at it. Then use the new .State property in runType to distinguish between running and zombie/dead processes, since kill(2) does not [1]. With this change we no longer claim Running status for zombie/dead processes. I've also removed the kill(2) call from runType. It was originally added in `13841ef3` (new-api: return the Running state only if the init process is alive, 2014-12-23), but we've been accessing /proc/[pid]/stat since `14e95b2a` (Make state detection precise, 2016-07-05, #930), and with the /stat access the kill(2) check is redundant. I also don't see much point to the previously-separate doesInitProcessExist, so I've inlined that logic in runType. It would be nice to distinguish between "/proc/[pid]/stat doesn't exist" and errors parsing its contents, but I've skipped that for the moment. The Running -> Stopped change in checkpoint_test.go is because the post-checkpoint process is a zombie, and with this commit zombie processes are Stopped (and no longer Running). [1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789 Signed-off-by: W. Trevor King <wking@tremily.us>	2017-06-20 16:26:55 -07:00
W. Trevor King	75d98b26b7	libcontainer: Replace GetProcessStartTime with Stat_t.StartTime And convert the various start-time properties from strings to uint64s. This removes all internal consumers of the deprecated GetProcessStartTime function. Signed-off-by: W. Trevor King <wking@tremily.us>	2017-06-20 16:26:55 -07:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the foreseeable future: Errno, Signal, SysProcAttr. Additionally, os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Mrunal Patel	639454475c	Merge pull request #1355 from avagin/cr-console Dump and restore containers with external terminals	2017-05-18 11:22:52 -07:00
Harshal Patil	22953c122f	Remove redundant declaration of namespace slice Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-05-02 10:04:57 +05:30
Andrei Vagin	73258813d3	cr: set a freezer cgroup for criu A freezer cgroup allows processes to be dumped faster. If a user wants to checkpoint a container and its storage, they have to pause the container, but in this case we need to pass a path to its freezer cgroup to "criu dump". Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-05-02 04:48:47 +03:00
Andrei Vagin	1c43d091a1	checkpoint: add support for containers with terminals CRIU was extended to report about orphaned master pty-s via RPC. Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-05-02 04:48:47 +03:00
Andrei Vagin	d307e85dbb	Print a criu version in an error message Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-05-01 21:45:23 +03:00
Harshal Patil	c44d4fa6ed	Optimizing looping over namespaces Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-04-26 11:54:43 +05:30
Qiang Huang	94cfb7955b	Merge pull request #1387 from avagin/freezer Don't try to read freezer.state from the current directory	2017-04-24 20:02:45 -05:00
Mrunal Patel	97db1eaad9	Merge pull request #1396 from harche/cstate Set container state only once during start	2017-04-17 11:32:42 -07:00
Mrunal Patel	7814a0d14b	Merge pull request #1399 from avagin/cr-cgroup restore: apply resource limits	2017-04-13 11:28:28 -07:00
Andrei Vagin	57ef30a2ae	restore: apply resource limits When C/R was implemented, it was enough to call manager.Set to apply limits and to move a task. Now .Set() and .Apply() have to be called separately. Fixes: `8a740d5391` ("libcontainer: cgroups: don't Set in Apply") Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-04-07 02:47:43 +03:00
Adrian Reber	273b7853c8	checkpoint: check if system supports pre-dumping Instead of relying on version numbers it is possible to check if CRIU actually supports certain features. This introduces an initial implementation to check if CRIU and the underlying kernel actually support dirty memory tracking for memory pre-dumping. Upstream CRIU also supports the lazy-page migration feature check and additional feature checks can be included in CRIU to reduce the version number parsing. There are also certain CRIU features which depend on one side on the CRIU version but also require certain kernel versions to actually work. CRIU knows if it can do certain things on the kernel it is running on and using the feature check RPC interface makes it easier for runc to decide if the criu+kernel combination will support that feature. Feature checking was introduced with CRIU 1.8. Running with older CRIU versions will ignore the feature check functionality and behave just like it used to. v2: - Do not use reflection to compare requested and responded features. Checking which feature is available is now hardcoded and needs to be adapted for every new feature check. The code is now much more readable and simpler. v3: - Move the variable criuFeat out of the linuxContainer struct, as it is not container specific. Now it is a global variable. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-04-06 11:17:52 +00:00
Harshal Patil	1be5d31da2	Set container state only once during start Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-04-04 15:08:04 +05:30
Aleksa Sarai	f0876b0427	libcontainer: configs: add proper HostUID and HostGID Previously Host{U,G}ID only gave you the root mapping, which isn't very useful if you are trying to do other things with the IDMaps. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:46:20 +11:00
Aleksa Sarai	baeef29858	rootless: add rootless cgroup manager The rootless cgroup manager acts as a noop for all set and apply operations. It is just used for rootless setups. Currently this is far too simple (we need to add opportunistic cgroup management), but is good enough as a first-pass at a noop cgroup manager. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:46:20 +11:00
Aleksa Sarai	d2f49696b0	runc: add support for rootless containers This enables the support for the rootless container mode. There are many restrictions on what rootless containers can do, so many different runC commands have been disabled: * runc checkpoint * runc events * runc pause * runc ps * runc restore * runc resume * runc update The following commands work: * runc create * runc delete * runc exec * runc kill * runc list * runc run * runc spec * runc state In addition, any specification options that imply joining cgroups have also been disabled. This is due to support for unprivileged subtree management not being available from Linux upstream. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:45:24 +11:00
Aleksa Sarai	6bd4bd9030	*: handle unprivileged operations and !dumpable Effectively, !dumpable makes implementing rootless containers quite hard, due to a bunch of different operations on /proc/self no longer being possible without reordering everything. !dumpable only really makes sense when you are switching between different security contexts, which is only the case when we are joining namespaces. Unfortunately this means that !dumpable will still have issues in this instance, and it should only be necessary to set !dumpable if we are not joining USER namespaces (new kernels have protections that make !dumpable no longer necessary). But that's a topic for another time. This also includes code to unset and then re-set dumpable when doing the USER namespace mappings. This should also be safe because in principle processes in a container can't see us until after we fork into the PID namespace (which happens after the user mapping). In rootless containers, it is not possible to set a non-dumpable process's /proc/self/oom_score_adj (it's owned by root and thus not writeable). Thus, it needs to be set inside nsexec before we set ourselves as non-dumpable. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-03-23 20:45:19 +11:00
Andrei Vagin	88256d646d	Don't try to read freezer.state from the current directory If we try to pause a container on a system without freezer cgroups, we find that runc tries to open ./freezer.state, which is obviously wrong. $ ./runc pause test no such directory for freezer.state $ echo FROZEN > freezer.state $ ./runc pause test container not running or created: paused Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-03-23 01:58:45 +03:00
Michael Crosby	00a0ecf554	Add separate console socket Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-03-16 10:23:59 -07:00
Mrunal Patel	4f9cb13b64	Update runtime spec to 1.0.0.rc5 Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2017-03-15 11:38:37 -07:00
Qiang Huang	b7932a2e07	Remove unused ExecFifoPath In container process's Init function, we use fd + execFifoFilename to open exec fifo, so this field in init config is never used. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-03-09 10:58:16 +08:00
Qiang Huang	707dd48b2f	Merge pull request #1001 from x1022as/predump add pre-dump and parent-path to checkpoint	2017-02-24 10:55:06 -08:00
Qiang Huang	733563552e	Fix state when _LIBCONTAINER in environment Fixes: #1311 Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-02-22 10:35:14 -08:00
Qiang Huang	805b8c73d3	Do not create exec fifo in factory.Create It should not be bound to container creation; for example, runc restore needs to create a libcontainer.Container, but it won't need an exec fifo. So create the exec fifo when the container is started or run, where we really need it. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-02-22 10:34:48 -08:00
Deng Guangxing	98f004182b	add pre-dump and parent-path to checkpoint CRIU uses pre-dump to perform iterative migration. pre-dump saves process memory info only, and it needs parent-path to specify the earlier memory files. This patch adds pre-dump and parent-path arguments to runc checkpoint. Signed-off-by: Deng Guangxing <dengguangxing@huawei.com> Signed-off-by: Adrian Reber <areber@redhat.com>	2017-02-14 19:45:07 +08:00
Aleksa Sarai	e034cedce7	libcontainer: init: only pass stateDirFd when creating a container If we pass a file descriptor to the host filesystem while joining a container, there is a race condition where a process inside the container can ptrace(2) the joining process and stop it from closing its file descriptor to the stateDirFd. Then the process can access the host filesystem from that file descriptor. This was fixed in part by `5d93fed3d2` ("Set init processes as non-dumpable"), but that fix is more of a hail-mary than an actual fix for the underlying issue. To fix this, don't open or pass the stateDirFd to the init process unless we're creating a new container. A proper fix for this would be to remove the need for even passing around directory file descriptors (which are quite dangerous in the context of mount namespaces). There is still an issue with containers that have CAP_SYS_PTRACE and are using the setns(2)-style of joining a container namespace. Currently I'm not really sure how to fix it without rampant layer violation. Fixes: CVE-2016-9962 Fixes: `5d93fed3d2` ("Set init processes as non-dumpable") Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-02-02 00:41:11 +11:00
Qiang Huang	db99936a0e	Merge pull request #1110 from avagin/cpt-in-userns checkpoint: handle config.Devices and config.MaskPaths	2017-01-10 00:34:40 -06:00
Zhang Wei	a344b2d6a8	sync up `HookState` with OCI spec `State` `HookState` struct should follow definition of `State` in runtime-spec: * modify json name of `version` to `ociVersion`. * Remove redundant `Rootfs` field as rootfs can be retrieved from `bundlePath/config.json`. Signed-off-by: Zhang Wei <zhangwei555@huawei.com>	2016-12-20 00:00:43 +08:00
Mrunal Patel	34f23cb99c	Merge pull request #1018 from cyphar/console-rewrite Consoles, consoles, consoles.	2016-12-07 14:37:19 -08:00
Xianlu Bird	e2e6f58e4e	Fix typo Fix typo	2016-12-01 15:23:58 +08:00
Aleksa Sarai	244c9fc426	*: console rewrite This implements {createTTY, detach} and all of the combinations and negations of the two that were previously implemented. There are some valid questions about out-of-OCI-scope topics like !createTTY and how things should be handled (why do we dup the current stdio to the process, and how is that not a security issue). However, these will be dealt with in a separate patchset. In order to allow for late console setup, split setupRootfs into the "preparation" section where all of the mounts are created and the "finalize" section where we pivot_root and set things as ro. In between the two we can set up all of the console mountpoints and symlinks we need. We use two-stage synchronisation to ensure that when the syscalls are reordered in a suboptimal way, an out-of-place read() on the parentPipe will not gobble the ancillary information. This patch is part of the console rewrite patchset. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-12-01 15:49:36 +11:00
Michael Crosby	e58671e530	Add --all flag to kill This allows a user to send a signal to all the processes in the container within a single atomic action to avoid new processes being forked off before the signal can be sent. This is basically taking functionality that we already use in `delete` and exposing it on the `kill` command by adding a flag. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-11-08 09:35:02 -08:00
Andrei Vagin	040fb7311c	checkpoint: handle config.Devices and config.MaskPaths In user namespaces devices are bind-mounted from the host, so we need to add them as external mounts for CRIU. Reported-by: Ross Boucher <boucher@gmail.com> Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2016-10-26 23:50:54 +03:00
Aleksa Sarai	2cd9c31b99	nsenter: guarantee correct user namespace ordering Depending on your SELinux setup, the order in which you join namespaces can be important. In general, user namespaces should always be joined and unshared first because then the other namespaces are correctly pinned and you have the right privileges within them. This also is very useful for rootless containers, as well as older kernels that had essentially broken unshare(2) and clone(2) implementations. This also includes huge refactorings in how we spawn processes for complicated reasons that I don't want to get into because it will make me spiral into a cloud of rage. The reasoning is in the giant comment in clone_parent. Have fun. In addition, because we now create multiple children with CLONE_PARENT, we cannot wait for them to SIGCHLD us in the case of a death. Thus, we have to resort to having a child kindly send us their exit code before they die. Hopefully this all works okay, but at this point there's not much more that we can do. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-10-04 16:17:55 +11:00
Aleksa Sarai	ed053a740c	nsenter: specify namespace type in setns() This prevents us from running into cases where libcontainer thinks that a particular namespace file is a different type, and makes it a fatal error rather than causing broken functionality. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2016-10-04 16:17:55 +11:00
Wang Long	59a241f647	update the comment for container.Pause() method On Linux, if a container's state is running or created, the container.Pause() method can set the state to pausing, and then paused. This patch updates the comment so that it is consistent with the code. Signed-off-by: Wang Long <long.wanglong@huawei.com>	2016-09-20 10:49:04 +08:00
Qiang Huang	1e319efa36	Merge pull request #815 from rajasec/basecont-comments Updated the libcontainer interface comments	2016-08-26 09:43:50 +08:00
Michael Crosby	46d9535096	Merge pull request #934 from macrosheep/fix-initargs Fix and refactor init args	2016-08-24 10:06:01 -07:00
rajasec	1ea17d73fe	Updated the libcontainer interface comments Signed-off-by: rajasec <rajasec79@gmail.com>	2016-08-23 19:14:27 +05:30
Phil Estes	85f4d20b44	Restored-from-checkpoint containers should have a start time Set the start time similar to a brand new container. Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)	2016-08-21 18:15:18 -04:00
Qiang Huang	41b12c095b	Merge pull request #913 from cloudfoundry-incubator/addgroupsnocompatible Let the user explicitly specify `additionalGids` on `runc exec`	2016-07-15 10:12:31 +08:00
Yang Hongyang	a59d63c5d3	Fix and refactor init args 1. According to the docs of Cmd.Path and Cmd.Args from package "os/exec": Path is the path of the command to run. Args holds command line arguments, including the command as Args[0]. We have mixed usage of args. In InitPath(), InitArgs only takes arguments; in InitArgs(), InitArgs includes the command as Args[0]. This is confusing. 2. InitArgs() already has the ability to configure a LinuxFactory with the provided absolute path to the init binary and arguments, as InitPath() does. 3. exec.Command() will take care of searching for the executable path. 4. The default "/proc/self/exe" instead of os.Args[0] is passed to InitArgs in order to allow a relative path for the runC binary. Signed-off-by: Yang Hongyang <imhy.yang@gmail.com>	2016-07-06 23:21:02 -04:00
Qiang Huang	14e95b2aa9	Make state detection precise Fixes: https://github.com/opencontainers/runc/issues/871 Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2016-07-05 08:24:13 +08:00
Petar Petrov	f9b72b1b46	Allow additional groups to be overridden in exec Signed-off-by: Julian Friedman <julz.friedman@uk.ibm.com> Signed-off-by: Petar Petrov <pppepito86@gmail.com> Signed-off-by: Georgi Sabev <georgethebeatle@gmail.com>	2016-06-21 10:35:11 +03:00
Mrunal Patel	f5b6ff23b8	Merge pull request #881 from rajasec/update-status Update for stopped container	2016-06-13 16:05:25 -07:00
Michael Crosby	3aacff695d	Use fifo for create/start This removes the use of a signal handler and SIGCONT to signal the init process to exec the users process. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-06-13 11:26:53 -07:00
rajasec	12869604ca	Update for stopped container Signed-off-by: rajasec <rajasec79@gmail.com>	2016-06-04 22:08:08 +05:30
Michael Crosby	1d61abea46	Allow delete of created container Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-06-02 12:26:12 -07:00
Michael Crosby	6eba9b8ffb	Fix SystemError and env lookup Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:10:47 -07:00
Michael Crosby	efcd73fb5b	Fix signal handling for unit tests Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:10:47 -07:00
Michael Crosby	30f1006b33	Fix libcontainer states Move initialized to created and destroyed to stopped. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:06:41 -07:00
Michael Crosby	3fe7d7f31e	Add create and start command for container lifecycle Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-05-31 11:06:41 -07:00
Andrew Vagin	c161e65ac6	cr: don't fill veth devices if netns is in EmptyNs Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>	2016-05-28 01:19:54 +03:00
Qiang Huang	b6e23f8166	Add comments for error cases in status functions Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2016-05-16 18:24:07 +08:00
Michael Crosby	7dd87976ed	Merge pull request #758 from rajasec/container-pause-comment Update the comment for container pause	2016-04-19 16:16:41 -07:00
Michael Crosby	6978875298	Add cause to error messages This is the initial port of the libcontainer.Error to add a cause to all the existing error messages. Going forward, when an error can be wrapped because it is not being checked at the higher levels for something like `os.IsNotExist` we can add more information to the error message like cause and stack file/line information. This will help higher level tools to know what caused a container start or operation to fail. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-04-18 11:37:26 -07:00
rajasec	ccbd0a176f	Update the comment for container pause Signed-off-by: rajasec <rajasec79@gmail.com>	2016-04-16 14:59:19 +05:30
Akihiro Suda	1829531241	Fix trivial style errors reported by `go vet` and `golint` No substantial code change. Note that some style errors reported by `golint` are not fixed due to possible compatibility issues. Signed-off-by: Akihiro Suda <suda.kyoto@gmail.com>	2016-04-12 08:13:16 +00:00
George Lestaris	f7ae27bfb7	HookState adheres to OCI Signed-off-by: George Lestaris <glestaris@pivotal.io> Signed-off-by: Ed King <eking@pivotal.io>	2016-04-06 16:57:59 +01:00
Peng Gao	3fa246609c	Fix typo Signed-off-by: Peng Gao <peng.gao.dut@gmail.com>	2016-03-27 12:44:16 +08:00
Jessica Frazelle	2c5b10189c	remove deadcode Signed-off-by: Jessica Frazelle <acidburn@docker.com>	2016-03-17 13:36:28 -07:00
Michael Crosby	732a0fb440	Merge pull request #638 from hqhq/hq_fix_bootstrapData Fix encoding gid mappings	2016-03-14 11:55:12 -07:00
Mrunal Patel	459efccb0a	Merge pull request #576 from avagin/cr Call Prestart hooks before restoring processes	2016-03-14 11:21:29 -07:00
Qiang Huang	2f2c83a2a0	Fix encoding gid mappings Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2016-03-12 13:18:42 +08:00
Michael Crosby	20422c9bd9	Update libcontainer to support rlimit per process This updates runc and libcontainer to handle rlimits per process and set them correctly for the container. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-10 14:35:16 -08:00
Michael Crosby	3cc90bd2d8	Add support for process overrides of settings This commit adds support to libcontainer to allow caps, no new privs, apparmor, and selinux process label to the process struct so that it can be used together with or override the base settings on the container config per individual process. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-03-03 11:41:33 -08:00
Ido Yariv	78f5148c67	Fix handling of unsupported namespaces currentState() always adds all possible namespaces to the state, regardless of whether they are supported. If orderNamespacePaths detects an unsupported namespace, an error is returned that results in initialization failure. Fix this by only adding paths of supported namespaces to the state. Signed-off-by: Ido Yariv <ido@wizery.com>	2016-03-02 10:16:51 -05:00
Daniel, Dao Quang Minh	42d5d04801	Sets custom namespaces for init processes An init process can join other namespaces (pidns, ipc etc.). This leverages C code defined in nsenter package to spawn a process with correct namespaces and clone if necessary. This moves all setns and cloneflags related code to nsenter layer, which means that we don't use Go os/exec to create process with cloneflags and set uid/gid_map or setgroups anymore. The necessary data is passed from Go to C using a netlink binary-encoding format. With this change, setns and init processes are almost the same, which brings some opportunity for refactoring. Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com> [mickael.laventure@docker.com: adapted to apply on master @ d97d5e] Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>	2016-02-28 12:26:53 -08:00
Daniel, Dao Quang Minh	d6bf4049f8	OrderNamespacePaths gets correct order of ns This adds orderNamespacePaths to get correct order of namespaces for the bootstrap program to join. Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>	2016-02-28 12:26:53 -08:00
Stefan Berger	5fbf791e31	Create unique session key name for every container Create a unique session key name for every container. Use the pattern _ses.<postfix> with postfix being the container's Id. This patch does not prevent containers from joining each other's session keyring. Signed-off-by: Stefan Berger <stefanb@linux.vnet.ibm.com>	2016-02-24 08:39:52 -05:00
Andrew Vagin	b8121e8998	checkpoint: call Prestart hooks on restore before restoring processes Docker uses Prestart hooks to call a libnetwork hook to create network devices and set addresses and routes. Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>	2016-02-19 02:40:26 +03:00
Andrew Vagin	46c25be297	checkpoint: add support of the EmptyNs criu option This option sets a namespace mask which will not be dumped and restored. For example, we are going to use this option to restore network for docker containers. CRIU will create a network namespace and call a libnetwork hook to restore network devices, addresses and routes. Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>	2016-02-19 02:40:26 +03:00
Andrew Vagin	a2a771b8e2	libcontainer: update criurpc.proto Signed-off-by: Andrew Vagin <avagin@virtuozzo.com>	2016-02-19 02:38:02 +03:00
Michael Crosby	1172a1e1e5	Update list command and created methods We don't need a CreatedTime method on the container because it's not part of the interface and can be received via the state. We also do not need to call it CreateTime because the type of this field is time.Time so we know it's a time. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-01-28 13:32:24 -08:00
Michael Crosby	480e5f4416	Merge pull request #507 from mikebrow/runc-ls-command adds list command	2016-01-28 13:20:07 -08:00
Mike Brown	4c871267db	adds list command, and a timestamp in the container state Signed-off-by: Mike Brown <brownwm@us.ibm.com>	2016-01-28 14:21:06 -06:00
Michael Crosby	ddcee3cc2a	Do not use stream encoders Marshal the raw objects for the sync pipes so that no newline chars are left behind in the pipe causing errors. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-01-26 11:22:05 -08:00
Michael Crosby	556f798a19	Fix various state bugs for pause and destroy There were issues where a process could die before pausing completed leaving the container in an inconsistent state and unable to be destroyed. This makes sure that if the container is paused and the process is dead it will unfreeze the cgroup before removing them. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2016-01-21 16:43:33 -08:00
Doug Davis	49dfa1b62d	Remove some hard coded strings Signed-off-by: Doug Davis <dug@us.ibm.com>	2016-01-19 19:02:31 -08:00
Mrunal Patel	6259f09e97	Merge pull request #426 from gitido/pressure_level libcontainer: Add support for memcg pressure notifications	2016-01-14 16:23:07 -08:00
Jimmi Dyson	91c7024e52	Revert to non-recursive GetPids, add recursive GetAllPids Signed-off-by: Jimmi Dyson <jimmidyson@gmail.com>	2016-01-08 19:42:25 +00:00
Ido Yariv	55a8d686a9	libcontainer: Add support for memcg pressure notifications It may be desirable to receive memory pressure level notifications before the container depletes all memory. This may be useful for handling cases where the system thrashes when reaching the container's memory limits. Signed-off-by: Ido Yariv <ido@wizery.com>	2015-12-28 13:36:55 -05:00
Michael Crosby	4415446c32	Add state pattern for container state transition Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Add state status() method Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Allow multiple checkpoint on restore Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Handle leave-running state Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Fix state transitions for inprocess Because the tests use libcontainer in process between the various states we need to ensure that that usecase works as well as the out of process one. Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Remove isDestroyed method Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Handling Pausing from freezer state Signed-off-by: Rajasekaran <rajasec79@gmail.com> freezer status Signed-off-by: Rajasekaran <rajasec79@gmail.com> Fixing review comments Signed-off-by: Rajasekaran <rajasec79@gmail.com> Added comment when freezer not available Signed-off-by: Rajasekaran <rajasec79@gmail.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Conflicts: libcontainer/container_linux.go Change checkFreezer logic to isPaused() Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Remove state base and factor out destroy func Signed-off-by: Michael Crosby <crosbymichael@gmail.com> Add unit test for state transitions Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-12-17 13:55:38 -08:00
Mrunal Patel	55a49f2110	Move the cgroups setting into a Resources struct This allows us to distinguish cases where a container needs to just join the paths or also additionally set cgroups settings. This will help in implementing cgroupsPath support in the spec. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2015-12-16 15:53:31 -05:00
Daniel, Dao Quang Minh	7d423cb7a1	setns: replace env with netlink for bootstrap data Replace passing of pid and console path via environment variable with passing them with netlink message via an established pipe. This change requires us to set _LIBCONTAINER_INITTYPE and _LIBCONTAINER_INITPIPE in the environment of the bootstrap process as we only send the bootstrap data for setns process right now. When init and setns bootstrap process are unified (i.e., init uses nsexec instead of Go to clone new process), we can remove _LIBCONTAINER_INITTYPE. Note: - we read nlmsghdr first before reading the content so we can get the total length of the payload and allocate buffer properly instead of allocating one large buffer. - check read bytes vs the wanted number. It's an error if we failed to read the desired number of bytes from the pipe into the buffer. Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>	2015-12-03 18:03:48 +00:00
Daniel, Dao Quang Minh	d914bf7347	setns: add bootstrap data add bootstrap data to setns process. If we have any bootstrap data then copy it to the bootstrap process (i.e. nsexec) using the sync pipe. This will allow us to eventually replace environment variable usage with more structured data to setup namespaces, write pid/gid map, setgroup etc. Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>	2015-11-22 11:36:58 +00:00
Michael Crosby	2be14dc963	Merge pull request #392 from mrunalp/poststart Add poststart hooks	2015-11-12 16:34:38 -08:00
Michael Crosby	879dfdd980	Fix race setting process opts When starting and querying for pids a container can start and exit before this is set. So set the opts after the process is started and while libcontainer still has the container's process blocking on the pipe. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-11-06 16:51:59 -08:00
Mrunal Patel	452e8a73c5	Integrate poststart hooks with spec * Call poststart hooks after the container is started * Tie in with spec configuration Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2015-11-06 18:03:32 -05:00
John Howard	a919bd3f67	Windows: Refactor Container interface Signed-off-by: John Howard <jhoward@microsoft.com>	2015-11-02 15:12:16 -08:00
Mrunal Patel	7caef5626b	Merge pull request #359 from jhowardmsft/jjh/state_struct Windows: Refactor state struct	2015-11-02 15:04:12 -08:00
John Howard	fe1cce69b3	Windows: Refactor state struct Signed-off-by: John Howard <jhoward@microsoft.com>	2015-10-26 14:45:20 -07:00
Adrian Reber	c42ef59bf9	Add criu related debug output While testing different versions of criu it helps to know which criu binary with which options is currently used. Therefore additional debug output to display this information is added. v2: increase readability of printed out criu options Signed-off-by: Adrian Reber <adrian@lisas.de>	2015-10-13 10:41:00 +02:00
Hui Kang	25da513c4b	Add option to support criu manage cgroups mode for dump and restore CRIU supports cgroup-manage mode from v1.7 Signed-off-by: Hui Kang <hkang.sunysb@gmail.com>	2015-10-11 04:42:54 +00:00
Mrunal Patel	dcafe48737	Add version to HookState to make it json-compatible with spec State Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2015-09-23 17:13:00 -07:00
Mrunal Patel	ef9471fd5b	Merge pull request #253 from avagin/cr-cgroups c/r: create cgroups to restore a container	2015-09-11 18:03:40 -07:00
David Calavera	0f28592b35	Turn hook pointers into values. Signed-off-by: David Calavera <david.calavera@gmail.com> Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-09-11 11:34:34 -07:00
Michael Crosby	dd969cbacd	Add test for function based hooks Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-09-10 18:15:00 -07:00
Michael Crosby	05567f2c94	Implement hooks in libcontainer Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-09-10 17:57:31 -07:00
Andrey Vagin	df39686c93	c/r: create cgroups to restore a container Here are two reasons: * If we use systemd, we need to ask it to create cgroups * If a container is restored with another ID, we need to change paths to cgroups. Signed-off-by: Andrey Vagin <avagin@openvz.org>	2015-09-10 21:00:27 +03:00
Mrunal Patel	2f4c229a8c	Merge pull request #215 from boucher/huikang-patch Add hooks for passing explicit veth pairs for forwarding to CRIU	2015-08-24 21:23:29 -07:00
Hui Kang	7f23085c82	Add hooks for passing explicit veth pairs for forwarding to CRIU. Signed-off-by: Hui Kang <hkang.sunysb@gmail.com>	2015-08-24 09:26:39 -07:00
boucher	8c812d0f50	Add the criu log file path to the failure message. Signed-off-by: Ross Boucher <rboucher@gmail.com>	2015-08-21 14:20:59 -07:00
Mrunal Patel	f3a3025933	Fix minor stylistic issues Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2015-08-04 17:44:45 -04:00
Michael Crosby	a5ef75b681	Add signal API to Container interface This adds a `Signal()` method to the container interface so that the initial process can be signaled after a Load or operation. It also implements signaling the init process from a nonChildProcess. Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-08-03 17:07:29 -07:00
Ido Yariv	86a85582d2	Don't set /proc/<PID>/setgroups to deny in Go1.5 A boolean field named GidMappingsEnableSetgroups was added to SysProcAttr in Go1.5. This field determines the value of the process's setgroups proc entry. Since the default is to set the entry to 'deny', calling setgroups will fail on systems running kernels 3.19+. Set GidMappingsEnableSetgroups to true so setgroups won't be set to 'deny'. Signed-off-by: Ido Yariv <ido@wizery.com>	2015-08-03 14:59:15 -04:00
Hui Kang	0f66ff921a	Add debug message when unable to execute criu Signed-off-by: Hui Kang <hkang.sunysb@gmail.com>	2015-08-03 17:09:45 +00:00
Andrey Vagin	af4a5e708a	ct: give criu information about cgroup mounts Actually cgroup mounts are bind-mounts, so they should be handled in the same way. Reported-by: Ross Boucher <rboucher@gmail.com> Signed-off-by: Andrey Vagin <avagin@openvz.org>	2015-07-20 22:56:07 +03:00
mapk0y	986dc0f730	checkpoint/restore commands support 'file-locks' option. Signed-off-by: mapk0y <mapk0y@gmail.com>	2015-06-27 18:56:24 +09:00
unclejack	9408c09d50	libcontainer: gofmt pass	2015-06-24 01:57:42 +03:00
Michael Crosby	080df7ab88	Update import paths for new repository Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-06-21 19:29:59 -07:00
Michael Crosby	8f97d39dd2	Move libcontainer into subdirectory Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2015-06-21 19:29:15 -07:00


235 Commits