Commit Graph

1084 Commits

Author SHA1 Message Date
Adrian Reber a3a632ad28 checkpoint: add support to query for lazy page support
Before adding the actual lazy migration support, this adds the feature
check for lazy-pages. Right now lazy migration, which is based on
userfaultd is only available in the criu-dev branch and not yet in a
release. As the check does not dependent on a certain version but on
a CRIU feature which can be queried it can be part of runC without a new
version check depending on a feature from criu-dev.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-09-06 12:35:38 +00:00
Xiaochen Shen 4d2756c116 libcontainer: add test cases for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:35:40 +08:00
Xiaochen Shen 692f6e1e27 libcontainer: add support for Intel RDT/CAT in runc
About Intel RDT/CAT feature:
Intel platforms with new Xeon CPU support Intel Resource Director Technology
(RDT). Cache Allocation Technology (CAT) is a sub-feature of RDT, which
currently supports L3 cache resource allocation.

This feature provides a way for the software to restrict cache allocation to a
defined 'subset' of L3 cache which may be overlapping with other 'subsets'.
The different subsets are identified by class of service (CLOS) and each CLOS
has a capacity bitmask (CBM).

For more information about Intel RDT/CAT can be found in the section 17.17
of Intel Software Developer Manual.

About Intel RDT/CAT kernel interface:
In Linux 4.10 kernel or newer, the interface is defined and exposed via
"resource control" filesystem, which is a "cgroup-like" interface.

Comparing with cgroups, it has similar process management lifecycle and
interfaces in a container. But unlike cgroups' hierarchy, it has single level
filesystem layout.

Intel RDT "resource control" filesystem hierarchy:
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|       |-- cbm_mask
|       |-- min_cbm_bits
|       |-- num_closids
|-- cpus
|-- schemata
|-- tasks
|-- <container_id>
    |-- cpus
    |-- schemata
    |-- tasks

For runc, we can make use of `tasks` and `schemata` configuration for L3 cache
resource constraints.

The file `tasks` has a list of tasks that belongs to this group (e.g.,
<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file  (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent. If a pid is not in any sub group, it
Is in root group.

The file `schemata` has allocation bitmasks/values for L3 cache on each socket,
which contains L3 cache id and capacity bitmask (CBM).
	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
For example, on a two-socket machine, L3's schema line could be `L3:0=ff;1=c0`
which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.

The valid L3 cache CBM is a *contiguous bits set* and number of bits that can
be set is less than the max bit. The max bits in the CBM is varied among
supported Intel Xeon platforms. In Intel RDT "resource control" filesystem
layout, the CBM in a group should be a subset of the CBM in root. Kernel will
check if it is valid when writing. e.g., 0xfffff in root indicates the max bits
of CBM is 20 bits, which mapping to entire L3 cache capacity. Some valid CBM
values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00 and etc.

For more information about Intel RDT/CAT kernel interface:
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

An example for runc:
Consider a two-socket machine with two L3 caches where the default CBM is
0xfffff and the max CBM length is 20 bits. With this configuration, tasks
inside the container only have access to the "upper" 80% of L3 cache id 0 and
the "lower" 50% L3 cache id 1:

"linux": {
	"intelRdt": {
		"l3CacheSchema": "L3:0=ffff0;1=3ff"
	}
}

Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Xiaochen Shen af3b0d9dce libcontainer/SPEC.md: add documentation for Intel RDT/CAT
Signed-off-by: Xiaochen Shen <xiaochen.shen@intel.com>
2017-09-01 14:26:33 +08:00
Aleksa Sarai 1f32fff46d
setns init: delay seccomp as late as possible
This mirrors the standard_init_linux.go seccomp code, which only applies
seccomp early if NoNewPrivileges is enabled. Otherwise it's done
immediately before execve to reduce the amount of syscalls necessary for
users to enable in their seccomp profiles.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:30 +10:00
Aleksa Sarai 3ddde27d7d
init: move close(stateDirFd) before seccomp apply
This further reduces the number of syscalls that a user needs to enable
in their seccomp profile.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-26 13:42:26 +10:00
Qiang Huang 1c81e2a794 Merge pull request #1572 from tych0/fix-readonly-userns
fix --read-only containers under --userns-remap
2017-08-26 09:38:14 +08:00
Aleksa Sarai 4d6e6720a7
Merge branch 'pr-1573'
Fix systemd cgroup after memory type changed

LGTMs: @crosbymichael @cyphar
Closes #1573
2017-08-25 23:55:27 +10:00
Qiang Huang acaf6897f5 Fix systemd cgroup after memory type changed
Fixes: #1557

I'm not quite sure about the root cause, looks like
systemd still want them to be uint64.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-25 01:14:16 -04:00
Aleksa Sarai 7d66aab77a
init: switch away from stateDirFd entirely
While we have significant protections in place against CVE-2016-9962, we
still were holding onto a file descriptor that referenced the host
filesystem. This meant that in certain scenarios it was still possible
for a semi-privileged container to gain access to the host filesystem
(if they had CAP_SYS_PTRACE).

Instead, open the FIFO itself using a O_PATH. This allows us to
reference the FIFO directly without providing the ability for
directory-level access. When opening the FIFO inside the init process,
open it through procfs to re-open the actual FIFO (this is currently the
only supported way to open such a file descriptor).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-08-25 13:19:03 +10:00
Tycho Andersen 66eb2a3e8f fix --read-only containers under --userns-remap
The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
2017-08-24 16:43:21 -06:00
Nikolas Sepos da4a5a9515 Add AutoDedup option to CriuOpts
Memory image deduplication, very useful for incremental dumps.

See: https://criu.org/Memory_images_deduplication

Signed-off-by: Nikolas Sepos <nikolas.sepos@gmail.com>
2017-08-18 01:21:42 +02:00
Michael Crosby ccd2c20aa4 Merge pull request #1559 from Mashimiao/panic-fix-nil-linux
fix panic when Linux is nil for rootless case
2017-08-17 09:57:35 -04:00
Ma Shimiao 2333e7dc67 fix panic when Linux is nil for rootless case
congfig.Sysctl setting is duplicated.
when contianer is rootless and Linux is nil, runc will panic.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
2017-08-16 09:11:13 +08:00
Mrunal Patel b31bdfc38a Merge pull request #1558 from hqhq/update_state
Update state after update
2017-08-15 10:46:44 -07:00
Qiang Huang e6e1c34a7d Update state after update
state.json should be a reflection of the container's
realtime state, including resource configurations,
so we should update state.json after updating container
resources.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-15 14:38:44 +08:00
Michael Crosby 3096b3fc85 Merge pull request #1556 from hqhq/fix_flakytest_TestNotifyOnOOM
Fix flaky test TestNotifyOnOOM
2017-08-14 10:03:23 -04:00
Qiang Huang 7726bcf0e2 Some fixes for testMemoryNotification
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-14 15:28:03 +08:00
Qiang Huang 40a1fb0e2f Fix flaky test TestNotifyOnOOM
Fixes: #1228

It can be reproduced by applying this patch:
```diff
@@ -45,6 +46,7 @@ func registerMemoryEvent(cgDir string, evName string, arg string) (<-chan struct
        go func() {
                defer func() {
                        close(ch)
+                       <-time.After(1 * time.Second)
                        eventfd.Close()
                        evFile.Close()
                }()
```

We can close channel after fds were closed.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-08-14 15:18:59 +08:00
Ma Shimiao 527dc5acbb fix panic when Linux is nil
Linux is not always not nil.
If Linux is nil, panic will occur.

Signed-off-by: Ma Shimiao <mashimiao.fnst@cn.fujitsu.com>
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-08-10 15:57:49 -04:00
Kenfe-Mickael Laventure 3ed492ad33
Handle non-devices correctly in DeviceFromPath
Before this change, some file type would be treated as char devices
(e.g. symlinks).

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-09 08:52:20 -07:00
Alex Fang e92add2151 Pass back the pid of runc:[1:CHILD] so we can wait on it
This allows the libcontainer to automatically clean up runc:[1:CHILD]
processes created as part of nsenter.

Signed-off-by: Alex Fang <littlelightlittlefire@gmail.com>
2017-08-05 13:44:36 +10:00
Aleksa Sarai 45bde006ca
merge branch 'pr-1535'
LGTMs: @avagin @cyphar
Closes #1535
2017-08-05 13:33:07 +10:00
Aleksa Sarai 22bbec1b7f
merge branch 'pr-1548'
LGTMs: @crosbymichael @mrunalp @cyphar
Closes #1548
2017-08-05 13:02:46 +10:00
Mrunal Patel 135b9992b3 Merge pull request #1544 from mlaventure/fix-device-from-path
Fix condition to detect device type in DeviceFromPath
2017-08-04 17:36:57 -07:00
Kenfe-Mickael Laventure 6056912217
Revert "Merge pull request #1450 from vrothberg/sgid-non-numeric"
This reverts commit 5c73abbe75, reversing
changes made to 51b501dab1.

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-04 14:28:21 -07:00
Kenfe-Mickael Laventure 25f4c7e72b
Move user pkg unix specific calls to unix file
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-03 11:31:21 -07:00
Kenfe-Mickael Laventure 9ed15e94c8
Fix condition to detect device type in DeviceFromPath
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2017-08-03 11:06:54 -07:00
Adrian Reber 5d386f6e2b checkpoint: use CRIU VERSION RPC if available
With this runC also uses RPC to ask CRIU for its version. CRIU supports
a VERSION RPC since CRIU 3.0 and using the RPC interface does not
require parsing the console output of CRIU (which could change anytime).

For older CRIU versions which do not yet have the VERSION RPC runC falls
back to its old CRIU output parsing mode.

Once CRIU 3.0 is the minimum version required for runC the old code can
be removed.

v2:
 * adapt to changes in the previous patches based on the review

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:08:07 +00:00
Adrian Reber 2393692536 criurpc.proto: copy latest criurpc.proto from criu 3.3
Update criurpc.proto for the upcoming VERSION RPC.

This includes lazy_pages for the upcoming lazy migration support.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:32 +00:00
Adrian Reber c71d9cd447 criuSwrk: prepare for CRIU VERSION RPC
To use the CRIU VERSION RPC the criuSwrk function is adapted to work
with CriuOpts set to 'nil' as CriuOpts is not required for the VERSION
RPC.

Also do not print c.criuVersion if it is '0' as the first RPC call will
always be the VERSION call and only after that the version will be
known.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:28 +00:00
Adrian Reber c5f0ce979b checkCriuVersion: only ask criu once about its version
If the version of criu has already been determined there is no need to
ask criu for the version again. Use the value from c.criuVersion.

v2:
 * reduce unnecessary code movement in the patch series
 * factor out the criu version parsing into a separate function

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:07:15 +00:00
Adrian Reber b6c47281db checkCriuVersion: switch to version using int
The checkCriuVersion function used a string to specify the minimum
version required. This is more comfortable for an external interface
but for an internal function this added unnecessary complexity. This
changes to version string like '1.5.2' to an integer like 10502. This is
already the format used internally in the function.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-08-02 16:05:27 +00:00
Michael Crosby 882d8eaba6 Merge pull request #1537 from tklauser/staticcheck
Fix issues found by staticcheck
2017-08-02 09:52:11 -04:00
Daniel, Dao Quang Minh b313a75364 Merge pull request #1477 from yummypeng/save-own-ns-path
Always save own namespace paths
2017-08-02 11:24:30 +01:00
Tobias Klauser e4e56cb6d8 libcontainer: remove ineffective break statements
go's switch statement doesn't need an explicit break. Remove it where
that is the case and add a comment to indicate the purpose where the
removal would lead to an empty case.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:39 +02:00
Tobias Klauser 24a4273cf9 libcontainer: handle error cases
Handle err return value of fmt.Scanf, os.Pipe and unix.ParseUnixRights.

Found with honnef.co/go/tools/cmd/staticcheck

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-28 15:13:11 +02:00
Daniel Dao 91eafcbc65
tty: move IO of master pty to be done with epoll
This moves all console code to use github.com/containerd/console library to
handle console I/O. Also move to use EpollConsole by default when user requests
a terminal so we can still cope when the other side temporarily goes away.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2017-07-28 12:35:02 +01:00
Michael Crosby e775f0fba3 Merge pull request #1526 from stevenh/logrus-v1
Updated logrus to v1
2017-07-27 13:28:55 -04:00
yangshukui 5428532bdd remove the code that close negative descriptor
Signed-off-by: yangshukui <yangshukui@huawei.com>
2017-07-24 11:10:18 +08:00
Tobias Klauser b0d014d0e1 libcontainer: one more switch from syscall to x/sys/unix
Refactor DeviceFromPath in order to get rid of package syscall and
directly use the functions from x/sys/unix. This also allows to get rid
of the conversion from the OS-independent file mode values (from the os
package) to Linux specific values and instead let's us use the raw
file mode value directly.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-21 16:59:15 +02:00
Steven Hartland ee4f68e302 Updated logrus to v1
Updated logrus to use v1 which includes a breaking name change Sirupsen -> sirupsen.

This includes a manual edit of the docker term package to also correct the name there too.

Signed-off-by: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-07-19 15:20:56 +00:00
Daniel, Dao Quang Minh 7ab4f43a4b Merge pull request #1519 from tklauser/moar-unix
libcontainer: use additional functions and constants from x/sys/unix
2017-07-17 10:07:22 +01:00
Qiang Huang 825b5c020a Merge pull request #1516 from cyphar/list-casting-unicode
list: fix various problems with owner field
2017-07-16 14:57:20 +08:00
Tobias Klauser 4019833d46 libcontainer: use PR_SET_NO_NEW_PRIVS from x/sys/unix
Use PR_SET_NO_NEW_PRIVS defined in golang.org/x/sys/unix instead of
manually defining it.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-13 15:31:33 +02:00
Tobias Klauser 54d27bed7f libcontainer: use ParseSocketControlMessage/ParseUnixRights from x/sys/unix
Use ParseSocketControlMessage and ParseUnixRights from
golang.org/x/sys/unix instead of their syscall equivalent.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-13 15:02:17 +02:00
Yuanhong Peng e939079acf Always save own namespace paths
fix #1476

If containerA shares namespace, say ipc namespace, with containerB, then
its ipc namespace path would be the same as containerB and be stored in
`state.json`. Exec into containerA will just read the namespace paths
stored in this file and join these namespaces. So, if containerB has
already been stopped, `docker exec containerA` will fail.

To address this issue, we should always save own namespace paths no
matter if we share namespaces with other containers.

Signed-off-by: Yuanhong Peng <pengyuanhong@huawei.com>
2017-07-13 16:13:05 +08:00
Michael Crosby eb70c213ba Update runtime-spec to rc6
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-07-12 16:24:04 -07:00
Aleksa Sarai 7cfb107f2c
factory: use e{u,g}id as the owner of /run/runc/$id
It appears as though these semantics were not fully thought out when
implementing them for rootless containers. It is not necessary (and
could be potentially dangerous) to set the owner of /run/ctr/$id to be
the root inside the container (if user namespaces are being used).

Instead, just use the e{g,u}id of runc to determine the owner.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-07-12 06:30:46 +10:00
Tobias Klauser 078e903296 libcontainer: use ioctl wrappers from x/sys/unix
Use IoctlGetInt and IoctlGetTermios/IoctlSetTermios instead of manually
reimplementing them.

Because of unlockpt, the ioctl wrapper is still needed as it needs to
pass a pointer to a value, which is not supported by any ioctl function
in x/sys/unix yet.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-10 10:56:58 +02:00
Tobias Klauser a380fae959 libcontainer: use Prctl() from x/sys/unix
Use unix.Prctl() instead of manually reimplementing it using
unix.RawSyscall. Also use unix.SECCOMP_MODE_FILTER instead of locally
defining it.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-07-10 10:56:58 +02:00
Michael Crosby 5c73abbe75 Merge pull request #1450 from vrothberg/sgid-non-numeric
libcontainer/user: add supplementary groups only for non-numeric users
2017-07-07 09:43:30 -07:00
Daniel, Dao Quang Minh 7139b61f7f Merge pull request #1378 from derekwaynecarr/expose_use_hierarchy
Expose memory.use_hierarchy in MemoryStats
2017-06-30 16:08:21 +01:00
Michael Crosby fef3aced0e Merge pull request #1460 from wking/mount-option-lazytime
libcontainer/specconv/spec_linux: Add support for (no)lazytime
2017-06-29 10:06:23 -07:00
Aleksa Sarai 117c92745b
rootfs: switch ms_private remount of oldroot to ms_slave
Using MS_PRIVATE meant that there was a race between the mount(2) and
the umount2(2) calls where runc inadvertently has a live reference to a
mountpoint that existed on the host (which the host cannot kill
implicitly through an unmount and peer sharing).

In particular, this means that if we have a devicemapper mountpoint and
the host is trying to delete the underlying device, the delete will fail
because it is "in use" during the race. While the race is _very_ small
(and libdm actually retries to avoid these sorts of cases) this appears
to manifest in various cases.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-06-29 01:20:23 +10:00
Justin Cormack 3d9074ead3 Update memory specs to use int64 not uint64
replace #1492 #1494
fix #1422

Since https://github.com/opencontainers/runtime-spec/pull/876 the memory
specifications are now `int64`, as that better matches the visible interface where
`-1` is a valid value. Otherwise finding the correct value was difficult as it
was kernel dependent.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-06-27 12:16:07 +01:00
Justin Cormack e1146182a8 Remove Platform as no longer in OCI spec
This was never used, just validated, so was removed from spec.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-06-27 12:16:07 +01:00
Michael Crosby d337d807fc Merge pull request #1482 from tklauser/x-sys-unix-keyctl
Use keyctl wrappers from x/sys/unix
2017-06-23 11:07:55 -07:00
Mrunal Patel 8e1896b3bd Merge pull request #1491 from tklauser/unix-eventfd
Use Eventfd() from golang.org/x/sys/unix
2017-06-22 19:02:44 -07:00
Michael Crosby bd65ef625d Merge pull request #1489 from wking/process-status
libcontainer/container_linux: Consider process state (running, zombie, etc.) in runType
2017-06-21 10:24:04 -07:00
Tobias Klauser da4cebcfe2 libcontainer: use Eventfd() from x/sys/unix
Use unix.Eventfd() instead of calling manually reimplementing it using
the raw syscall. Also use the correct corresponding unix.EFD_CLOEXEC
flag instead of unix.FD_CLOEXEC (which can have a different value on
some architectures and thus might lead to unexpected behavior).

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-21 10:02:00 +02:00
W. Trevor King 2bea4c897e libcontainer/system/proc: Add Stat_t.State
And Stat_t.PID and Stat_t.Name while we're at it.  Then use the new
.State property in runType to distinguish between running and
zombie/dead processes, since kill(2) does not [1].  With this change
we no longer claim Running status for zombie/dead processes.

I've also removed the kill(2) call from runType.  It was originally
added in 13841ef3 (new-api: return the Running state only if the init
process is alive, 2014-12-23), but we've been accessing
/proc/[pid]/stat since 14e95b2a (Make state detection precise,
2016-07-05, #930), and with the /stat access the kill(2) check is
redundant.

I also don't see much point to the previously-separate
doesInitProcessExist, so I've inlined that logic in runType.

It would be nice to distinguish between "/proc/[pid]/stat doesn't
exist" and errors parsing its contents, but I've skipped that for the
moment.

The Running -> Stopped change in checkpoint_test.go is because the
post-checkpoint process is a zombie, and with this commit zombie
processes are Stopped (and no longer Running).

[1]: https://github.com/opencontainers/runc/pull/1483#issuecomment-307527789

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-20 16:26:55 -07:00
W. Trevor King 75d98b26b7 libcontainer: Replace GetProcessStartTime with Stat_t.StartTime
And convert the various start-time properties from strings to uint64s.
This removes all internal consumers of the deprecated
GetProcessStartTime function.

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-20 16:26:55 -07:00
Michael Crosby 6e57120d9f Merge pull request #1481 from elianka/dev
update READ.me for new struct configs.Config.Capabilities
2017-06-20 13:15:04 -07:00
W. Trevor King 439eaa3584 libcontainer/system/proc: Add Stat and Stat_t
So we can extract more than the start time with a single read.

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-14 15:28:03 -07:00
Tobias Klauser cfe87fe3e2 Use keyctl wrappers from x/sys/unix
Use KeyctlJoinSessionKeyring, KeyctlString and KeyctlSetperm from
golang.org/x/sys/unix instead of manually reimplementing them.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-09 15:55:18 +02:00
Kang Liang a341724c95 update READ.me for new struct configs.Config.Capabilities
Signed-off-by: Kang Liang <kangliang424@gmail.com>
2017-06-09 18:47:05 +08:00
W. Trevor King 830c0d70df libcontainer/console_linux.go: Make SaneTerminal public
And use it only in local tooling that is forwarding the pseudoterminal
master.  That way runC no longer has an opinion on the onlcr setting
for folks who are creating a terminal and detaching.  They'll use
--console-socket and can setup the pseudoterminal however they like
without runC having an opinion.  With this commit, the only cases
where runC still has applies SaneTerminal is when *it* is the process
consuming the master descriptor.

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-07 21:32:41 -07:00
Tobias Klauser 553016d7da Use Prctl() from x/sys/unix instead of own wrapper
Use unix.Prctl() instead of reimplemnting it as system.Prctl().

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-07 15:03:15 +02:00
Mrunal Patel 9d6821d1b5 Merge pull request #1473 from crosbymichael/update-spec
Update spec to 239c4e44f2
2017-06-06 10:26:07 -07:00
Vladimir Stefanovic d01050e6d4 Add support for mips/mips64
Signed-off-by: Vladimir Stefanovic <vladimir.stefanovic@imgtec.com>
2017-06-02 22:30:00 +02:00
Tobias Klauser 306b4980f7 Use NLA_* constants from x/sys/unix instead of syscall
Use the NLA_ALIGNTO and NLA_HDRLEN constants from x/sys/unix instead of
syscall, as the syscall package shouldn't be used anymore (except for a
few exceptions).

This also makes the syscall_NLA_HDRLEN workaround for gccgo unnecessary.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-06-02 10:42:11 +02:00
W. Trevor King 4f81337e95 libcontainer/specconv/spec_linux: Add support for (no)lazytime
And also silent, loud, (no)iversion, and (no)acl.  This is part of
catching runC up with the spec, which punts valid options to mount(8)
[1,2].

(no)acl is a filesystem-specific entry in mount(8), but it's
represented by a MS_* flag in mount(2) so we need an entry in the
translation table.

[1]: https://github.com/opencontainers/runtime-spec/blame/v1.0.0-rc5/config.md#L68
[2]: https://github.com/opencontainers/runtime-spec/pull/771

Signed-off-by: W. Trevor King <wking@tremily.us>
2017-06-01 20:43:35 -07:00
Michael Crosby 18f336d23b Merge pull request #1470 from tklauser/x-sys-unix-symlink-xattrs
Use symlink xattr functions from x/sys/unix
2017-06-01 18:14:19 -07:00
Michael Crosby 854b41d81e Update spec to 239c4e44f2
This provides updates to runc for the spec changes with *Process and
OOMScoreAdj

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-06-01 16:29:47 -07:00
Tobias Klauser d8b5c1c810 Use symlink xattr functions from x/sys/unix
Use the symlink xattr syscall wrappers Lgetxattr, Llistxattr and
Lsetxattr from x/sys/unix (introduced in
golang/sys@b90f89a1e7) instead of
providing own wrappers. Leave the functionality of system.Lgetxattr
intact with respect to the retry with a larger buffer, but switch it to
use unix.Lgetxattr.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-05-31 13:50:34 +02:00
Tobias Klauser b5768387c6 Switch examples in README.md from syscall to x/sys/unix
Follow commit 3d7cb4293c ("Move libcontainer to x/sys/unix") and also
move the examples in README.md from syscall to x/sys/unix.

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
2017-05-30 14:50:59 +02:00
Daniel, Dao Quang Minh 67bd2ab554 Merge pull request #1442 from clnperez/libcontainer-sys-unix
Move libcontainer to x/sys/unix
2017-05-26 12:18:33 +01:00
Qiang Huang d7c264aaf1 Merge pull request #1239 from moypray/cgroup
Fix setup cgroup before prestart hook
2017-05-26 09:22:49 +08:00
Michael Crosby 18cd7e06f7 Merge pull request #1372 from cloudfoundry-incubator/cpuset-mount-root
Handle container creation when cgroups have already been mounted in another location
2017-05-25 09:53:57 -07:00
Christy Perez 3d7cb4293c Move libcontainer to x/sys/unix
Since syscall is outdated and broken for some architectures,
use x/sys/unix instead.

There are still some dependencies on the syscall package that will
remain in syscall for the forseeable future:

Errno
Signal
SysProcAttr

Additionally:
- os still uses syscall, so it needs to be kept for anything
returning *os.ProcessState, such as process.Wait.

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-05-22 17:35:20 -05:00
Wentao Zhang 09c1f5c055 Fix setup cgroup before prestart hook
* User Case:
User could use prestart hook to add block devices to container. so the
hook should have a way to set the permissions of the devices.

Just move cgroup config operation before prestart hook will work.

Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
2017-05-19 17:53:43 +08:00
Mrunal Patel 639454475c Merge pull request #1355 from avagin/cr-console
Dump and restore containers with external terminals
2017-05-18 11:22:52 -07:00
Valentin Rothberg 77421139ab libcontainer/user: add supplementary groups only for non-numeric users
Signed-off-by: Valentin Rothberg <vrothberg@suse.com>
2017-05-16 13:54:27 +02:00
Justin Cormack 4c67360296 Clean up unix vs linux usage
FreeBSD does not support cgroups or namespaces, which the code suggested, and is not supported
in runc anyway right now. So clean up the file naming to use `_linux` where appropriate.

Signed-off-by: Justin Cormack <justin.cormack@docker.com>
2017-05-12 17:22:09 +01:00
Qiang Huang 21ef2e3d12 Merge pull request #1410 from chchliang/statustest
add createdState and runningState status testcase
2017-05-12 16:17:17 +08:00
Michael Crosby 2daa11574b Merge pull request #1438 from hqhq/fix_rootfs_comments
Fix comments about when to pivot_root
2017-05-05 20:15:49 -07:00
Qiang Huang 96e0df7633 Fix comments about when to pivot_root
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-05-06 07:59:03 +08:00
Harshal Patil 700c74cb7e Issue #1429 : Removing check for id string length
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-05-04 09:21:29 +05:30
Harshal Patil 22953c122f Remove redundant declaraion of namespace slice
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-05-02 10:04:57 +05:30
Andrei Vagin 73258813d3 cr: set a freezer cgroup for criu
A freezer cgroup allows to dump processes faster.

If a user wants to checkpoint a container and its storage,
he has to pause a container, but in this case we need to pass
a path to its freezer cgroup to "criu dump".

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-02 04:48:47 +03:00
Andrei Vagin 1c43d091a1 checkpoint: add support for containers with terminals
CRIU was extended to report about orphaned master pty-s via RPC.

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-02 04:48:47 +03:00
Andrei Vagin 1a8b0aced5 Update criurpc
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:55:57 +03:00
Andrei Vagin f8ca1926c4 libcontainer: check cpt/rst for containers with userns
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:45:23 +03:00
Andrei Vagin d307e85dbb Print a criu version in a error message
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-05-01 21:45:23 +03:00
Harshal Patil c44d4fa6ed Optimizing looping over namespaces
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-04-26 11:54:43 +05:30
Qiang Huang 94cfb7955b Merge pull request #1387 from avagin/freezer
Don't try to read freezer.state from the current directory
2017-04-24 20:02:45 -05:00
chchliang 4f0e6c4ef0 add createdState and runningState status testcase
Signed-off-by: chchliang <chen.chuanliang@zte.com.cn>
2017-04-19 16:28:03 +08:00
Daniel, Dao Quang Minh 9f1ef73ef9 Merge pull request #1402 from chchliang/generictest
add testcase in generic_error_test.go
2017-04-18 11:42:24 +01:00
chchliang a23d7c2eab add testcase in generic_error_test.go
Signed-off-by: chchliang <chen.chuanliang@zte.com.cn>
2017-04-18 08:56:02 +08:00
Mrunal Patel 97db1eaad9 Merge pull request #1396 from harche/cstate
Set container state only once during start
2017-04-17 11:32:42 -07:00
Daniel, Dao Quang Minh 13a8c5d140 Merge pull request #1365 from hqhq/use_go_selinux
Use opencontainers/selinux package
2017-04-15 14:22:32 +01:00
Mrunal Patel 7814a0d14b Merge pull request #1399 from avagin/cr-cgroup
restore: apply resource limits
2017-04-13 11:28:28 -07:00
Michael Crosby f8ce01dbdc Merge pull request #1371 from adrianreber/master
checkpoint: check if system supports pre-dumping
2017-04-12 10:08:02 -07:00
CuiHaozhi 248c586500 could load a stopped container.
Signed-off-by: CuiHaozhi <cuihz@wise2c.com>
2017-04-07 07:39:41 -04:00
Andrei Vagin 57ef30a2ae restore: apply resource limits
When C/R was implemented, it was enough to call manager.Set to apply
limits and to move a task. Now .Set() and .Apply() have to be called
separately.

Fixes: 8a740d5391 ("libcontainer: cgroups: don't Set in Apply")
Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-04-07 02:47:43 +03:00
Christy Perez fca53109c1 Fix console syscalls
Fixes opencontainers/runc/issues/1364

Signed-off-by: Christy Perez <christy@linux.vnet.ibm.com>
2017-04-06 16:51:54 -05:00
Adrian Reber 273b7853c8 checkpoint: check if system supports pre-dumping
Instead of relying on version numbers it is possible to check if CRIU
actually supports certain features. This introduces an initial
implementation to check if CRIU and the underlying kernel actually
support dirty memory tracking for memory pre-dumping.

Upstream CRIU also supports the lazy-page migration feature check and
additional feature checks can be included in CRIU to reduce the version
number parsing. There are also certain CRIU features which depend on one
side on the CRIU version but also require certain kernel versions to
actually work. CRIU knows if it can do certain things on the kernel it
is running on and using the feature check RPC interface makes it easier
for runc to decide if the criu+kernel combination will support that
feature.

Feature checking was introduced with CRIU 1.8. Running with older CRIU
versions will ignore the feature check functionality and behave just
like it used to.

v2:
 - Do not use reflection to compare requested and responded
   features. Checking which feature is available is now hardcoded
   and needs to be adapted for every new feature check. The code
   is now much more readable and simpler.

v3:
 - Move the variable criuFeat out of the linuxContainer struct,
   as it is not container specific. Now it is a global variable.

Signed-off-by: Adrian Reber <areber@redhat.com>
2017-04-06 11:17:52 +00:00
Harshal Patil 1be5d31da2 Set container state only once during start
Signed-off-by: Harshal Patil <harshal.patil@in.ibm.com>
2017-04-04 15:08:04 +05:30
Derek Carr 4d6225aec2 Expose memory.use_hierarchy in MemoryStats
Signed-off-by: Derek Carr <decarr@redhat.com>
2017-03-31 13:40:34 -04:00
Aleksa Sarai cbc4f9865a
libcontainer: rewrite cmsg to use sys/unix
The original implementation is in C, which increases cognitive load and
possibly might cause us problems in the future. Since sys/unix is better
maintained than the syscall standard library switching makes more sense.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-30 16:03:21 +11:00
Aleksa Sarai d04cbc49d2
rootless: add autogenerated rootless config from `runc spec`
Since this is a runC-specific feature, this belongs here over in
opencontainers/ocitools (which is for generic OCI runtimes).

In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:21 +11:00
Aleksa Sarai 76aeaf8181
libcontainer: init: fix unmapped console fchown
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:21 +11:00
Aleksa Sarai f0876b0427
libcontainer: configs: add proper HostUID and HostGID
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Aleksa Sarai baeef29858
rootless: add rootless cgroup manager
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:46:20 +11:00
Aleksa Sarai d2f49696b0
runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:24 +11:00
Aleksa Sarai 6bd4bd9030
*: handle unprivileged operations and !dumpable
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.

!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.

This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).

In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-23 20:45:19 +11:00
Qiang Huang 5e7b48f7c0 Use opencontainers/selinux package
It's splitted as a separate project.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-23 08:21:19 +08:00
Andrei Vagin 88256d646d Don't try to read freezer.state from the current directory
If we try to pause a container on the system without freezer cgroups,
we can found that runc tries to open ./freezer.state. It is obviously wrong.

$ ./runc pause test
no such directory for freezer.state

$ echo FROZEN > freezer.state
$ ./runc pause test
container not running or created: paused

Signed-off-by: Andrei Vagin <avagin@virtuozzo.com>
2017-03-23 01:58:45 +03:00
Daniel Dao 09c72cea69
fix panic regression when config doesnt have caps
When process config doesnt specify capabilities anywhere, we should not panic
because setting capabilities are optional.

Signed-off-by: Daniel Dao <dqminh89@gmail.com>
2017-03-21 00:45:26 +00:00
Michael Crosby 767783a631 Merge pull request #1375 from hqhq/use_uint64_for_resources
Use uint64 for resources to keep consistency with runtime-spec
2017-03-20 12:47:21 -07:00
Qiang Huang 8430cc4f48 Use uint64 for resources to keep consistency with runtime-spec
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-20 18:51:39 +08:00
Aleksa Sarai c651512ad8
Revert "fix minor issue"
This reverts commit d4091ef151.

d4091ef151 ("fix minor issue") doesn't actually make any sense, and
actually makes the code more confusing.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2017-03-20 12:28:43 +11:00
Qiang Huang d270940363 Merge pull request #1356 from crosbymichael/console-socket
Add separate console socket
2017-03-18 04:03:03 -05:00
Mrunal Patel c266f1470c Merge pull request #1373 from moypray/minor
fix minor issue
2017-03-16 12:15:46 -07:00
Wentao Zhang d4091ef151 fix minor issue
When failed to attach veth pair, should remove the veth device

Signed-off-by: Wentao Zhang <zhangwentao234@huawei.com>
2017-03-17 03:18:44 +08:00
Michael Crosby 957ef9cc73 Remove terminal info
This maybe a nice extra but it adds complication to the usecase.  The
contract is listen on the socket and you get an fd to the pty master and
that is that.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Michael Crosby 00a0ecf554 Add separate console socket
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-03-16 10:23:59 -07:00
Mrunal Patel 4f903a21c4 Remove ambient build tag
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2017-03-15 11:38:43 -07:00
Mrunal Patel 4f9cb13b64 Update runtime spec to 1.0.0.rc5
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2017-03-15 11:38:37 -07:00
Craig Furman f5c5aac958 Create containers when cgroups already mounted
Runc needs to copy certain files from the top of the cgroup cpuset hierarchy
into the container's cpuset cgroup directory. Currently, runc determines
which directory is the top of the hierarchy by using the parent dir of
the first entry in /proc/self/mountinfo of type cgroup.

This creates problems when cgroup subsystems are mounted arbitrarily in
different dirs on the host.

Now, we use the most deeply nested mountpoint that contains the
container's cpuset cgroup directory.

Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
Signed-off-by: Will Martin <wmartin@pivotal.io>
2017-03-15 10:10:30 +00:00
Qiang Huang b7932a2e07 Remove unused ExecFifoPath
In container process's Init function, we use
fd + execFifoFilename to open exec fifo, so this
field in init config is never used.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-09 10:58:16 +08:00
Qiang Huang df4d872dd9 Merge pull request #1327 from CarltonSemple/lxd-fix
Update devices_unix.go for LXD
2017-03-08 19:34:31 -06:00
Carlton-Semple 0590736890 Added comment linking to LXD issue 2825
Signed-off-by: Carlton-Semple <carlton.semple@ibm.com>
2017-03-08 10:25:37 -05:00
Qiang Huang 8773c5f9a6 Remove unused function in systemd cgroup
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-03-07 15:11:37 +08:00
Michael Crosby 49a33c41f8 Merge pull request #1344 from xuxinkun/fixCPUQuota20170224
fix cpu.cfs_quota_us changed when systemd daemon-reload using systemd.
2017-03-06 10:02:28 -08:00
xuxinkun c44aec9b23 fix cpu.cfs_quota_us changed when systemd daemon-reload using systemd.
Signed-off-by: xuxinkun <xuxinkun@gmail.com>
2017-03-06 20:08:30 +11:00
Michael Crosby c50d024500 Merge pull request #1280 from datawolf/user
user: fix the parameter error
2017-02-27 11:22:58 -08:00
Qiang Huang fe898e7862 Fix kmem accouting when use with cgroupsPath
Fixes: #1347
Fixes: #1083

The root cause of #1083 is because we're joining an
existed cgroup whose kmem accouting is not initialized,
and it has child cgroup or tasks in it.

Fix it by checking if the cgroup is first time created,
and we should enable kmem accouting if the cgroup is
craeted by libcontainer with or without kmem limit
configed. Otherwise we'll get issue like #1347

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-25 10:58:18 -08:00
Qiang Huang 707dd48b2f Merge pull request #1001 from x1022as/predump
add pre-dump and parent-path to checkpoint
2017-02-24 10:55:06 -08:00
Aleksa Sarai 02141ce862
merge branch 'pr-1317'
Closes #1317
LGTMs: @cyphar @crosbymichael
2017-02-24 08:21:58 +11:00
Qiang Huang 733563552e Fix state when _LIBCONTAINER in environment
Fixes: #1311

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-22 10:35:14 -08:00
Qiang Huang 805b8c73d3 Do not create exec fifo in factory.Create
It should not be binded to container creation, for
example, runc restore needs to create a
libcontainer.Container, but it won't need exec fifo.

So create exec fifo when container is started or run,
where we really need it.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-22 10:34:48 -08:00
Brian Goff d193f95d07 Don't override system error
The error message added here provides no value as the caller already
knows all the added details. However it is covering up the underyling
system error (typically `ENOTSUP`). There is no way to handle this error before
this change.

Signed-off-by: Brian Goff <cpuguy83@gmail.com>
2017-02-22 09:29:38 -05:00
Michael Crosby 8438b26e9f Merge pull request #1237 from hqhq/fix_sync_race
Fix race condition when sync with child and grandchild
2017-02-20 17:16:43 -08:00
Michael Crosby 4a164a826c Use %zu for printing of size_t values
This helps fix compile warnings on some arm systems.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-02-20 16:57:27 -08:00
Qiang Huang a54316bae1 Fix race condition when sync with child and grandchild
Fixes: #1236
Fixes: #1281

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2017-02-18 20:42:08 +08:00
Qiang Huang 6b1d0e76f2 Merge pull request #1127 from boynux/fix-set-mem-to-unlimited
Fixes set memory to unlimited
2017-02-16 09:51:23 +08:00
Mohammad Arab 18ebc51b3c Reset Swap when memory is set to unlimited (-1)
Kernel validation fails if memory set to -1 which is unlimited but
swap is not set so.

Signed-off-by: Mohammad Arab <boynux@gmail.com>
2017-02-15 08:11:57 +01:00
Carlton Semple 9a7e5a9434 Update devices_unix.go for LXD
getDevices() has been updated to skip `/dev/.lxc` and `/dev/.lxd-mounts`, which was breaking privileged Docker containers running on runC, inside of LXD managed Linux Containers

Signed-off-by: Carlton-Semple <carlton.semple@ibm.com>
2017-02-14 16:12:03 -05:00