jasder/runc - runc - Junke (军科) Open Source Project Hosting

Commit Graph

Author	SHA1	Message	Date
Mrunal Patel	bd3c4f844a	Fix race in runc exec There is a race in runc exec when the init process stops just before the check for the container status. It is then wrongly assumed that we are trying to start an init process instead of an exec process. This commit adds an Init field to libcontainer Process to distinguish between init and exec processes to prevent this race. Signed-off-by: Mrunal Patel <mrunalp@gmail.com>	2018-06-01 16:25:58 -07:00
Michael Crosby	0e561642f8	Merge pull request #1688 from AkihiroSuda/unshare-m-r main: support rootless mode in userns	2018-05-29 15:41:17 -04:00
Qiang Huang	dd67ab10d7	Merge pull request #1759 from cyphar/rootless-erofs-as-eperm rootless: cgroup: treat EROFS as a skippable error	2018-05-25 09:24:16 +08:00
Akihiro Suda	c93815738a	libcontainer: remove extra CAP_SETGID check for SetgroupAttr Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-05-24 14:59:30 +09:00
Michael Crosby	bdbb9fab07	Merge pull request #1693 from AkihiroSuda/leave-setgroups-allow libcontainer: allow setgroup in rootless mode	2018-04-24 11:24:04 -04:00
Sebastien Boeuf	985628dda0	libcontainer: Don't set container state to running when exec'ing There is no reason to set the container state to "running" as a temporary value when exec'ing a process on a container in "created" state. The problem doing this is that consumers of the libcontainer library might use it by keeping pointers in memory. In this case, the container state will indicate that the container is running, which is wrong, and this will end up with a failure on the next action because the check for the container state transition will complain. Fixes #1767 Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>	2018-03-30 09:29:18 -07:00
Akihiro Suda	73f3dc6389	libcontainer: allow setgroup in rootless mode Signed-off-by: Akihiro Suda <suda.akihiro@lab.ntt.co.jp>	2018-03-27 17:42:05 +09:00
Aleksa Sarai	fd3a6e6c83	libcontainer: handle unset oomScoreAdj correctly Previously if oomScoreAdj was not set in config.json we would implicitly set oom_score_adj to 0. This is not allowed according to the spec: > If oomScoreAdj is not set, the runtime MUST NOT change the value of > oom_score_adj. Change this so that we do not modify oom_score_adj if oomScoreAdj is not present in the configuration. While this modifies our internal configuration types, the on-disk format is still compatible. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2018-03-17 13:53:42 +11:00
W. Trevor King	50dc7ee96c	libcontainer/capabilities_linux: Drop os.Getpid() call gocapability has supported 0 as "the current PID" since syndtr/gocapability@5e7cce49 (Allow to use the zero value for pid to operate with the current task, 2015-01-15, syndtr/gocapability#2). libcontainer was ported to that approach in `444cc298` (namespaces: allow to use pid namespace without mount namespace, 2015-01-27, docker/libcontainer#358), but the change was clobbered by `22df5551` (Merge branch 'master' into api, 2015-02-19, docker/libcontainer#388) which landed via `5b73860e` (Merge pull request #388 from docker/api, 2015-02-19, docker/libcontainer#388). This commit restores the changes from `444cc298`. Signed-off-by: W. Trevor King <wking@tremily.us>	2018-02-19 15:47:42 -08:00
Ed King	5c0af14bf8	Return from goroutine when it should terminate Signed-off-by: Craig Furman <cfurman@pivotal.io>	2018-01-23 10:46:31 +00:00
Will Martin	8d3e6c9826	Avoid race when opening exec fifo When starting a container with `runc start` or `runc run`, the stub process (runc[2:INIT]) opens a fifo for writing. Its parent runc process will open the same fifo for reading. In this way, they synchronize. If the stub process exits at the wrong time, the parent runc process will block forever. This can happen when racing 2 runc operations against each other: `runc run/start`, and `runc delete`. It could also happen for other reasons, e.g. the kernel's OOM killer may select the stub process. This commit resolves this race by racing the opening of the exec fifo from the runc parent process against the stub process exiting. If the stub process exits before we open the fifo, we return an error. Another solution is to wait on the stub process. However, it seems it would require more refactoring to avoid calling wait multiple times on the same process, which is an error. Signed-off-by: Craig Furman <cfurman@pivotal.io>	2018-01-22 17:03:02 +00:00
Antonio Murdaca	cd1e7abee2	libcontainer: expose annotations in hooks Annotations weren't passed to hooks. This patch fixes that by passing annotations to stdin for hooks. Signed-off-by: Antonio Murdaca <runcom@redhat.com>	2018-01-11 16:54:01 +01:00
Qiang Huang	74a1729647	Merge pull request #1607 from crosbymichael/term-err libcontainer: handler errors from terminate	2017-10-20 15:15:38 +08:00
Petros Angelatos	8098828680	propagate argv0 when re-execing from /proc/self/exe This allows runc to be used as a target for docker's reexec module that depends on a correct argv0 to select which process entrypoint to invoke. Without this patch, when runc re-execs argv0 is set to "/proc/self/exe" and the reexec module doesn't know what to do with it. Signed-off-by: Petros Angelatos <petrosagg@gmail.com>	2017-10-16 14:00:26 +02:00
Michael Crosby	bfe3058fc9	Make process check more forgiving Signed-off-by: Michael Crosby <crosbymichael@gmail.com>	2017-10-10 15:36:19 -04:00
Steven Hartland	eb68b900bc	Prevent invalid errors from terminate Both Process.Kill() and Process.Wait() can return errors that don't impact the correct behaviour of terminate. Instead of letting these get returned and logged, which causes confusion, silently ignore them. Currently the test needs to be a string test as the errors are private to the runtime packages, so it's our only option. This can be seen if init fails during the setns. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-10-10 15:32:46 -04:00
Konstantinos Karampogias	605dc5c811	Set initial console size based on process spec Signed-off-by: Will Martin <wmartin@pivotal.io> Signed-off-by: Petar Petrov <pppepito86@gmail.com> Signed-off-by: Ed King <eking@pivotal.io> Signed-off-by: Roberto Jimenez Sanchez <jszroberto@gmail.com> Signed-off-by: Thomas Godkin <tgodkin@pivotal.io>	2017-10-04 12:32:16 +01:00
Tobias Klauser	d713652bda	libcontainer: remove unnecessary type conversions Generated using github.com/mdempsky/unconvert Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-09-25 10:41:57 +02:00
Mrunal Patel	d5b43c3981	Merge pull request #1455 from dqminh/epoll-io tty: move IO of master pty to be done with epoll	2017-09-11 11:32:42 -07:00
Aleksa Sarai	6097ce74d8	nsenter: correctly handle newgidmap path for rootless containers After quite a bit of debugging, I found that previous versions of this patchset did not include newgidmap in a rootless setting. Fix this by passing it whenever group mappings are applied, and also providing some better checking for try_mapping_tool. This commit also includes some stylistic improvements. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Giuseppe Scrivano	d8b669400a	rootless: allow multiple user/group mappings Take advantage of the newuidmap/newgidmap tools to allow multiple users/groups to be mapped into the new user namespace in the rootless case. Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com> [ rebased to handle intelrdt changes. ] Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-09-09 12:45:32 +10:00
Mrunal Patel	7e036aa0b0	Merge pull request #1541 from adrianreber/lazy checkpoint: support lazy migration	2017-09-07 13:25:04 -07:00
Adrian Reber	60ae7091de	checkpoint: support lazy migration With the help of userfaultfd CRIU supports lazy migration. Lazy migration means that memory pages are only transferred from the migration source to the migration destination on page fault. This makes it possible to reduce the downtime during process or container migration to a minimum as the memory does not need to be transferred during migration. Lazy migration currently depends on userfaultfd being available in the running Linux kernel and on the CRIU version in use supporting lazy migration. Both dependencies can be checked by querying CRIU via RPC if the lazy migration feature is available. Using feature checking instead of version comparison enables runC to use CRIU features from the criu-dev branch. This way the user can decide if lazy migration should be available by choosing the right kernel and CRIU branch. To use lazy migration the CRIU process during dump needs to dump everything besides the memory pages and then it opens a network port waiting for remote page fault requests: # runc checkpoint httpd --lazy-pages --page-server 0.0.0.0:27 \ --status-fd /tmp/postcopy-pipe In this example CRIU will hang/wait once it has opened the network port and wait for a network connection. As runC waits for CRIU to finish it will also hang until the lazy migration has finished. To know when the restore on the destination side can start the '--status-fd' parameter is used: # runc checkpoint --help | grep status --status-fd value criu writes \0 to this FD once lazy-pages is ready The parameter '--status-fd' is directly from CRIU and this way the process outside of runC which controls the migration knows exactly when to transfer the checkpoint (without memory pages) to the destination and that the restore can be started.
On the destination side it is necessary to start CRIU in 'lazy-pages' mode like this: # criu lazy-pages --page-server --address 192.168.122.3 --port 27 \ -D checkpoint and tell runC to do a lazy restore: # runc restore -d --image-path checkpoint --work-path checkpoint \ --lazy-pages httpd If both processes on the restore side have the same working directory 'criu lazy-pages' creates a unix domain socket where it waits for requests from the actual restore. runC starts CRIU restore in lazy restore mode and talks to 'criu lazy-pages' that it wants to restore memory pages on demand. CRIU continues to restore the process and once the process is running and accesses the first non-existing memory page the 'criu lazy-pages' server will request the page from the source system. Thus all pages from the source system will be transferred to the destination system. Once all pages have been transferred runC on the source system will end and the container will have finished migration. This can also be combined with CRIU's pre-copy support. The combination of pre-copy and post-copy (lazy migration) provides the possibility to migrate containers with minimal downtimes. Some additional background about post-copy migration can be found in these articles: https://lisas.de/~adrian/?p=1253 https://lisas.de/~adrian/?p=1183 Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Adrian Reber	a3a632ad28	checkpoint: add support to query for lazy page support Before adding the actual lazy migration support, this adds the feature check for lazy-pages. Right now lazy migration, which is based on userfaultfd, is only available in the criu-dev branch and not yet in a release. As the check does not depend on a certain version but on a CRIU feature which can be queried, it can be part of runC without a new version check depending on a feature from criu-dev. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-09-06 12:35:38 +00:00
Xiaochen Shen	692f6e1e27	libcontainer: add support for Intel RDT/CAT in runc About Intel RDT/CAT feature: Intel platforms with new Xeon CPU support Intel Resource Director Technology (RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which currently supports L3 cache resource allocation. This feature provides a way for the software to restrict cache allocation to a defined 'subset' of L3 cache which may be overlapping with other 'subsets'. The different subsets are identified by class of service (CLOS) and each CLOS has a capacity bitmask (CBM). More information about Intel RDT/CAT can be found in section 17.17 of the Intel Software Developer Manual. About Intel RDT/CAT kernel interface: In Linux 4.10 kernel or newer, the interface is defined and exposed via "resource control" filesystem, which is a "cgroup-like" interface. Compared with cgroups, it has a similar process management lifecycle and interfaces in a container. But unlike cgroups' hierarchy, it has a single-level filesystem layout. Intel RDT "resource control" filesystem hierarchy: mount -t resctrl resctrl /sys/fs/resctrl tree /sys/fs/resctrl /sys/fs/resctrl/ |-- info | |-- L3 | |-- cbm_mask | |-- min_cbm_bits | |-- num_closids |-- cpus |-- schemata |-- tasks |-- <container_id> |-- cpus |-- schemata |-- tasks For runc, we can make use of `tasks` and `schemata` configuration for L3 cache resource constraints. The file `tasks` has a list of tasks that belong to this group (e.g., the "<container_id>" group). Tasks can be added to a group by writing the task ID to the "tasks" file (which will automatically remove them from the previous group to which they belonged). New tasks created by fork(2) and clone(2) are added to the same group as their parent. If a pid is not in any sub group, it is in the root group. The file `schemata` has allocation bitmasks/values for L3 cache on each socket, which contains L3 cache id and capacity bitmask (CBM).
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..." For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0` which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0. A valid L3 cache CBM is a contiguous set of bits, and the number of bits that can be set cannot exceed the maximum CBM length. The maximum number of CBM bits varies among supported Intel Xeon platforms. In Intel RDT "resource control" filesystem layout, the CBM in a group should be a subset of the CBM in root. Kernel will check if it is valid when writing. e.g., 0xfffff in root indicates the max bits of CBM is 20 bits, which maps to the entire L3 cache capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00, etc. For more information about Intel RDT/CAT kernel interface: https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt An example for runc: Consider a two-socket machine with two L3 caches where the default CBM is 0xfffff and the max CBM length is 20 bits. With this configuration, tasks inside the container only have access to the "upper" 80% of L3 cache id 0 and the "lower" 50% of L3 cache id 1: "linux": { "intelRdt": { "l3CacheSchema": "L3:0=ffff0;1=3ff" } } Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>	2017-09-01 14:26:33 +08:00
Aleksa Sarai	7d66aab77a	init: switch away from stateDirFd entirely While we have significant protections in place against CVE-2016-9962, we still were holding onto a file descriptor that referenced the host filesystem. This meant that in certain scenarios it was still possible for a semi-privileged container to gain access to the host filesystem (if they had CAP_SYS_PTRACE). Instead, open the FIFO itself using a O_PATH. This allows us to reference the FIFO directly without providing the ability for directory-level access. When opening the FIFO inside the init process, open it through procfs to re-open the actual FIFO (this is currently the only supported way to open such a file descriptor). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2017-08-25 13:19:03 +10:00
Nikolas Sepos	da4a5a9515	Add AutoDedup option to CriuOpts Memory image deduplication, very useful for incremental dumps. See: https://criu.org/Memory_images_deduplication Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>	2017-08-18 01:21:42 +02:00
Qiang Huang	e6e1c34a7d	Update state after update state.json should be a reflection of the container's realtime state, including resource configurations, so we should update state.json after updating container resources. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>	2017-08-15 14:38:44 +08:00
Adrian Reber	5d386f6e2b	checkpoint: use CRIU VERSION RPC if available With this runC also uses RPC to ask CRIU for its version. CRIU supports a VERSION RPC since CRIU 3.0 and using the RPC interface does not require parsing the console output of CRIU (which could change anytime). For older CRIU versions which do not yet have the VERSION RPC runC falls back to its old CRIU output parsing mode. Once CRIU 3.0 is the minimum version required for runC the old code can be removed. v2: * adapt to changes in the previous patches based on the review Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:08:07 +00:00
Adrian Reber	c71d9cd447	criuSwrk: prepare for CRIU VERSION RPC To use the CRIU VERSION RPC the criuSwrk function is adapted to work with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION RPC. Also do not print c.criuVersion if it is '0' as the first RPC call will always be the VERSION call and only after that the version will be known. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:28 +00:00
Adrian Reber	c5f0ce979b	checkCriuVersion: only ask criu once about its version If the version of criu has already been determined there is no need to ask criu for the version again. Use the value from c.criuVersion. v2: * reduce unnecessary code movement in the patch series * factor out the criu version parsing into a separate function Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:07:15 +00:00
Adrian Reber	b6c47281db	checkCriuVersion: switch to version using int The checkCriuVersion function used a string to specify the minimum version required. This is more comfortable for an external interface but for an internal function this added unnecessary complexity. This changes the version from a string like '1.5.2' to an integer like 10502. This is already the format used internally in the function. Signed-off-by: Adrian Reber <areber@redhat.com>	2017-08-02 16:05:27 +00:00
Michael Crosby	882d8eaba6	Merge pull request #1537 from tklauser/staticcheck Fix issues found by staticcheck	2017-08-02 09:52:11 -04:00
Tobias Klauser	e4e56cb6d8	libcontainer: remove ineffective break statements Go's switch statement doesn't need an explicit break. Remove it where that is the case and add a comment to indicate the purpose where the removal would lead to an empty case. Found with honnef.co/go/tools/cmd/staticcheck Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-28 15:13:39 +02:00
Tobias Klauser	24a4273cf9	libcontainer: handle error cases Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights. Found with honnef.co/go/tools/cmd/staticcheck Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-28 15:13:11 +02:00
Daniel Dao	91eafcbc65	tty: move IO of master pty to be done with epoll This moves all console code to use github.com/containerd/console library to handle console I/O. Also move to use EpollConsole by default when user requests a terminal so we can still cope when the other side temporarily goes away. Signed-off-by: Daniel Dao <dqminh89@gmail.com>	2017-07-28 12:35:02 +01:00
Steven Hartland	ee4f68e302	Updated logrus to v1 Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen. This includes a manual edit of the docker term package to also correct the name there too. Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>	2017-07-19 15:20:56 +00:00
Tobias Klauser	54d27bed7f	libcontainer: use ParseSocketControlMessage/ParseUnixRights from x/sys/unix Use ParseSocketControlMessage and ParseUnixRights from golang.org/x/sys/unix instead of their syscall equivalent. Signed-off-by: Tobias Klauser <tklauser@distanz.ch>	2017-07-13 15:02:17 +02:00
W. Trevor King	2bea4c897e	libcontainer/system/proc: Add Stat_t.State And Stat_t.PID and Stat_t.Name while we're at it. Then use the new .State property in runType to distinguish between running and zombie/dead processes, since kill(2) does not [1]. With this change we no longer claim Running status for zombie/dead processes. I've also removed the kill(2) call from runType. It was originally added in `13841ef3` (new-api: return the Running state only if the init process is alive, 2014-12-23), but we've been accessing /proc/[pid]/stat since `14e95b2a` (Make state detection precise, 2016-07-05, #930), and with the /stat access the kill(2) check is redundant. I also don't see much point to the previously-separate doesInitProcessExist, so I've inlined that logic in runType. It would be nice to distinguish between "/proc/[pid]/stat doesn't exist" and errors parsing its contents, but I've skipped that for the moment. The Running -> Stopped change in checkpoint_test.go is because the post-checkpoint process is a zombie, and with this commit zombie processes are Stopped (and no longer Running). [1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789 Signed-off-by: W. Trevor King <wking@tremily.us>	2017-06-20 16:26:55 -07:00
W. Trevor King	75d98b26b7	libcontainer: Replace GetProcessStartTime with Stat_t.StartTime And convert the various start-time properties from strings to uint64s. This removes all internal consumers of the deprecated GetProcessStartTime function. Signed-off-by: W. Trevor King <wking@tremily.us>	2017-06-20 16:26:55 -07:00
Christy Perez	3d7cb4293c	Move libcontainer to x/sys/unix Since syscall is outdated and broken for some architectures, use x/sys/unix instead. There are still some dependencies on the syscall package that will remain in syscall for the foreseeable future: Errno, Signal, SysProcAttr. Additionally: - os still uses syscall, so it needs to be kept for anything returning *os.ProcessState, such as process.Wait. Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>	2017-05-22 17:35:20 -05:00
Mrunal Patel	639454475c	Merge pull request #1355 from avagin/cr-console Dump and restore containers with external terminals	2017-05-18 11:22:52 -07:00
Harshal Patil	22953c122f	Remove redundant declaration of namespace slice Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-05-02 10:04:57 +05:30
Andrei Vagin	73258813d3	cr: set a freezer cgroup for criu A freezer cgroup allows to dump processes faster. If a user wants to checkpoint a container and its storage, he has to pause a container, but in this case we need to pass a path to its freezer cgroup to "criu dump". Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-05-02 04:48:47 +03:00
Andrei Vagin	1c43d091a1	checkpoint: add support for containers with terminals CRIU was extended to report about orphaned master pty-s via RPC. Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-05-02 04:48:47 +03:00
Andrei Vagin	d307e85dbb	Print a criu version in an error message Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>	2017-05-01 21:45:23 +03:00
Harshal Patil	c44d4fa6ed	Optimizing looping over namespaces Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>	2017-04-26 11:54:43 +05:30
Qiang Huang	94cfb7955b	Merge pull request #1387 from avagin/freezer Don't try to read freezer.state from the current directory	2017-04-24 20:02:45 -05:00
Mrunal Patel	97db1eaad9	Merge pull request #1396 from harche/cstate Set container state only once during start	2017-04-17 11:32:42 -07:00
Mrunal Patel	7814a0d14b	Merge pull request #1399 from avagin/cr-cgroup restore: apply resource limits	2017-04-13 11:28:28 -07:00
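
The checkCriuVersion entry (b6c47281db) above describes replacing a version string like '1.5.2' with an integer like 10502. The following is a minimal sketch of that encoding, assuming the integer is major*10000 + minor*100 + patch, which is consistent with the example in the commit message. The helper name `versionToInt` is hypothetical and is not taken from runc.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// versionToInt is a hypothetical helper (not runc's own function).
// It encodes "major.minor[.patch]" as major*10000 + minor*100 + patch,
// an assumed scheme that matches the '1.5.2' -> 10502 example.
func versionToInt(v string) (int, error) {
	parts := strings.Split(v, ".")
	if len(parts) < 2 || len(parts) > 3 {
		return 0, fmt.Errorf("unexpected version format %q", v)
	}
	weights := []int{10000, 100, 1}
	total := 0
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return 0, fmt.Errorf("invalid component %q in %q: %v", p, v, err)
		}
		// Minor and patch must fit in two decimal digits for the
		// encoding to stay unambiguous.
		if n < 0 || (i > 0 && n > 99) {
			return 0, fmt.Errorf("component %d out of range in %q", n, v)
		}
		total += n * weights[i]
	}
	return total, nil
}

func main() {
	for _, v := range []string{"1.5.2", "3.0"} {
		n, err := versionToInt(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %d\n", v, n)
	}
}
```

With integers, a minimum-version check reduces to a single comparison such as `current >= 30000`, which is the simplification the commit message points at.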


152 Commits
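
The Intel RDT/CAT entry (692f6e1e27) gives the schema format `L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;...` and states that a group's CBM must be a contiguous run of bits and a subset of the root CBM. The sketch below illustrates those two rules and the 80%/50% figures from the commit's example. All function names are hypothetical; the kernel performs the real validation when the schemata file is written, and this code does not touch /sys/fs/resctrl.

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
	"strings"
)

// parseL3Schema (hypothetical helper) parses "L3:<id>=<cbm>;..." into a
// map from cache id to capacity bitmask. CBM values are hexadecimal.
func parseL3Schema(s string) (map[int]uint64, error) {
	if !strings.HasPrefix(s, "L3:") {
		return nil, fmt.Errorf("schema %q does not start with \"L3:\"", s)
	}
	out := make(map[int]uint64)
	for _, entry := range strings.Split(strings.TrimPrefix(s, "L3:"), ";") {
		kv := strings.SplitN(entry, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("malformed entry %q", entry)
		}
		id, err := strconv.Atoi(kv[0])
		if err != nil {
			return nil, fmt.Errorf("bad cache id %q: %v", kv[0], err)
		}
		cbm, err := strconv.ParseUint(kv[1], 16, 64)
		if err != nil {
			return nil, fmt.Errorf("bad CBM %q: %v", kv[1], err)
		}
		out[id] = cbm
	}
	return out, nil
}

// isContiguous reports whether the set bits of cbm form one unbroken run.
func isContiguous(cbm uint64) bool {
	if cbm == 0 {
		return false
	}
	// Drop trailing zeros; a contiguous mask then looks like 0b0..01..1,
	// so adding one yields a value sharing no bits with it.
	shifted := cbm >> uint(bits.TrailingZeros64(cbm))
	return shifted&(shifted+1) == 0
}

// isValidGroupCBM checks contiguity and that cbm is a subset of root.
func isValidGroupCBM(cbm, root uint64) bool {
	return isContiguous(cbm) && cbm&^root == 0
}

// sharePercent returns the share of root's bits that cbm covers.
func sharePercent(cbm, root uint64) int {
	if root == 0 {
		return 0
	}
	return 100 * bits.OnesCount64(cbm) / bits.OnesCount64(root)
}

func main() {
	const root = 0xfffff // 20-bit root CBM from the commit's example
	schema, err := parseL3Schema("L3:0=ffff0;1=3ff")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for id := 0; id < len(schema); id++ {
		cbm := schema[id]
		fmt.Printf("cache %d: cbm=%#x valid=%v share=%d%%\n",
			id, cbm, isValidGroupCBM(cbm, root), sharePercent(cbm, root))
	}
}
```

For the schema in the commit message, 0xffff0 sets 16 of 20 bits and 0x3ff sets 10 of 20, which is where the 80% and 50% figures come from.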