Commit Graph

33 Commits

Author SHA1 Message Date
Michael Crosby 5d93fed3d2 Set init processes as non-dumpable
This sets the init processes that join and setup the container's
namespaces as non-dumpable before they setns to the container's pid (or
any other ) namespace.

This settings is automatically reset to the default after the Exec in
the container so that it does not change functionality for the
applications that are running inside, just our init processes.

This prevents parent processes, the pid 1 of the container, to ptrace
the init process before it drops caps and other sets LSMs.

This patch also ensures that the stateDirFD being used is still closed
prior to exec, even though it is set as O_CLOEXEC, because of the order
in the kernel.

https://github.com/torvalds/linux/blob/v4.9/fs/exec.c#L1290-L1318

The order during the exec syscall is that the process is set back to
dumpable before O_CLOEXEC are processed.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2017-01-11 09:56:56 -08:00
Aleksa Sarai 244c9fc426
*: console rewrite
This implements {createTTY, detach} and all of the combinations and
negations of the two that were previously implemented. There are some
valid questions about out-of-OCI-scope topics like !createTTY and how
things should be handled (why do we dup the current stdio to the
process, and how is that not a security issue). However, these will be
dealt with in a separate patchset.

In order to allow for late console setup, split setupRootfs into the
"preparation" section where all of the mounts are created and the
"finalize" section where we pivot_root and set things as ro. In between
the two we can set up all of the console mountpoints and symlinks we
need.

We use two-stage synchronisation to ensures that when the syscalls are
reordered in a suboptimal way, an out-of-place read() on the parentPipe
will not gobble the ancilliary information.

This patch is part of the console rewrite patchset.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-12-01 15:49:36 +11:00
Daniel, Dao Quang Minh f156f73c2a Merge pull request #1154 from hqhq/sync_child
Sync with grandchild
2016-11-23 09:10:00 -08:00
Qiang Huang 16a2e8ba6e Sync with grandchild
Without this, it's possible that father process exit with
0 before grandchild exit with error.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-11-17 08:59:37 +08:00
rajasec 43287af982 Fixing error message in nsexec
Signed-off-by: rajasec <rajasec79@gmail.com>
2016-11-10 17:06:50 +05:30
Qiang Huang 84a4218ece More fix to nsexec.c's comments
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-11-03 10:15:01 +08:00
Aleksa Sarai 9b15bf17a0
nsenter: fix up comments
Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-11-01 00:21:09 +11:00
Qiang Huang f520eab891 Remove unnecessary cloneflag validation
config.cloneflag is not mandatory, when using `runc exec`,
config.cloneflag can be empty, and even then it won't be
`-1` but `0`.

So this validation is totally wrong and unneeded.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-10-27 09:34:20 +08:00
Aleksa Sarai e3cd191acc
nsenter: un-split clone(cloneflags) for RHEL
Without this patch applied, RHEL's SELinux policies cause container
creation to not really work. Unfortunately this might be an issue for
rootless containers (opencontainers/runc#774) but we'll cross that
bridge when we come to it.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-18 18:26:27 +11:00
Aleksa Sarai 2cd9c31b99
nsenter: guarantee correct user namespace ordering
Depending on your SELinux setup, the order in which you join namespaces
can be important. In general, user namespaces should *always* be joined
and unshared first because then the other namespaces are correctly
pinned and you have the right priviliges within them. This also is very
useful for rootless containers, as well as older kernels that had
essentially broken unshare(2) and clone(2) implementations.

This also includes huge refactorings in how we spawn processes for
complicated reasons that I don't want to get into because it will make
me spiral into a cloud of rage. The reasoning is in the giant comment in
clone_parent. Have fun.

In addition, because we now create multiple children with CLONE_PARENT,
we cannot wait for them to SIGCHLD us in the case of a death. Thus, we
have to resort to having a child kindly send us their exit code before
they die. Hopefully this all works okay, but at this point there's not
much more than we can do.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-04 16:17:55 +11:00
Aleksa Sarai ed053a740c
nsenter: specify namespace type in setns()
This avoids us from running into cases where libcontainer thinks that a
particular namespace file is a different type, and makes it a fatal
error rather than causing broken functionality.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-10-04 16:17:55 +11:00
Shukui Yang e15af9ffbb remove redundant by in annotation(nsexec.c)
Signed-off-by: Shukui Yang <yangshukui@huawei.com>
2016-09-05 10:53:19 +08:00
Mrunal Patel 0bd675a56c Fix format specifier for size_t
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
2016-08-17 11:40:08 -07:00
Aleksa Sarai 4e72ffc237
nsenter: simplify netlink parsing
This just moves everything to one function so we don't have to pass a
bunch of things to functions when there's no real benefit. It also makes
the API nicer.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-08-17 08:21:48 +10:00
Aleksa Sarai faa3281ce8
nsenter: major cleanup
Removed a lot of clutter, improved the style of the code, removed
unnecessary complexity. In addition, made errors unique by making bail()
exit with a unique error code. Most of this code comes from the current
state of the rootless containers branch.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
2016-08-13 03:18:04 +10:00
Antonio Murdaca 9d14efec4c libcontainer: nsenter: nsexec.c: fix warnings
Fix the following warnings when building runc with gcc 6+:

Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:
In function ‘nsexec’:
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:322:6:
warning: ‘__s’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
      pr_perror("Failed to open %s", ns);
Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:273:30:
note: ‘__s’ was declared here
 static struct nsenter_config process_nl_attributes(int pipenum, char
*data, int data_size)
                              ^~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Antonio Murdaca <runcom@redhat.com>
2016-05-14 11:19:44 +02:00
Natanael Copa ac6bd95319 nsexec: fix build against musl libc
Remove a wrongly added include which was added in commit 3c2e77ee (Add a
compatibility header for CentOS/RHEL 6, 2016-01-29) apparently to
fix this compile error on centos 6:

> In file included from
> Godeps/_workspace/src/github.com/opencontainers/runc/libcontainer/nsenter/nsexec.c:20:
> /usr/include/linux/netlink.h:35: error: expected specifier-qualifier-list before 'sa_family_t'

The glibc bits/sockaddr.h says that this header should never be included
directly[1]. Instead, sys/socket.h should be used.

The problem was correctly fixed later, in commit 394fb55 (Fix build
error on centos6, 2016-03-02) so the incorrect bits/sockaddr.h can
safely be removed.

This is needed to build musl libc.

Fixes #761

[1]: 20003c4988/bits/sockaddr.h (L20)

Signed-off-by: Natanael Copa <natanael.copa@docker.com>
2016-04-19 10:58:17 +02:00
Qiang Huang d9520aeba4 Close opened files before exit
Not to say it'll cause memory leak, it'll still be a
good practice.

Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-03-28 11:16:34 +08:00
Qiang Huang 3b7e10652b Refactor nsexec.c and add some comments
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-03-28 11:16:12 +08:00
Tonis Tiigi 04da969aa8 Clear groups after entering userns
Clears supplementary groups that have effect on the
mount permissions before joining the user specified
groups happens.

Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
2016-03-10 22:23:38 -08:00
Andrey Vagin 080eac3d2a nsexec: don't use CLONE_PARENT and CLONE_NEWPID together
The rhel6 kernel returns EINVAL in this case

Known issue:
* CT with userns doesn't work

This is a copy of
d31e97fa28
to address https://github.com/opencontainers/runc/issues/613

Signed-off-by: Andrey Vagin <avagin@virtuozzo.com>
Signed-off-by: Andrew Fernandes <andrew@fernandes.org>
2016-03-10 14:28:10 -05:00
Michael Crosby 3af08519d0 Merge pull request #616 from hqhq/hq_remove_dup_headfile
Remove duplicated included head file
2016-03-08 10:54:31 -08:00
Phil Estes 178bad5e71 Properly setuid/setgid after entering userns
The re-work of namespace entering lost the setuid/setgid that was part
of the Go-routine based process exec in the prior code. A side issue was
found with setting oom_score_adj before execve() in a userns that is
also solved here.

Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
2016-03-04 11:12:26 -05:00
Qiang Huang 87e05b84e2 Remove duplicated included head file
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com>
2016-03-03 11:08:18 +08:00
Ye Yin 394fb55d85 Fix build error on centos6
Signed-off-by: Ye Yin <eyniy@qq.com>
2016-03-02 18:32:19 +08:00
Kenfe-Mickael Laventure 08c3c6ebe2 Refactor nsexec
Cut nsexec in smaller chunk routines to make it more readable.

Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@gmail.com>
2016-02-28 12:26:53 -08:00
Daniel, Dao Quang Minh 002b6c2fe8 Reorder and remove unused imports in nsexec.c
Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
2016-02-28 12:26:53 -08:00
Daniel, Dao Quang Minh 42d5d04801 Sets custom namespaces for init processes
An init process can join other namespaces (pidns, ipc etc.). This leverages
C code defined in nsenter package to spawn a process with correct namespaces
and clone if necessary.

This moves all setns and cloneflags related code to nsenter layer, which mean
that we dont use Go os/exec to create process with cloneflags and set
uid/gid_map or setgroups anymore. The necessary data is passed from Go to C
using a netlink binary-encoding format.

With this change, setns and init processes are almost the same, which brings
some opportunity for refactoring.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
[mickael.laventure@docker.com: adapted to apply on master @ d97d5e]
Signed-off-by: Kenfe-Mickael Laventure <mickael.laventure@docker.com>
2016-02-28 12:26:53 -08:00
Andrew Fernandes 3c2e77eed5 Add a compatibility header for CentOS/RHEL 6
Signed-off-by: Andrew Fernandes <andrew@fernandes.org>
2016-01-29 20:46:50 +00:00
Daniel, Dao Quang Minh 7d423cb7a1 setns: replace env with netlink for bootstrap data
replace passing of pid and console path via environment variable with passing
them with netlink message via an established pipe.

this change requires us to set _LIBCONTAINER_INITTYPE and
_LIBCONTAINER_INITPIPE as the env environment of the bootstrap process as we
only send the bootstrap data for setns process right now. When init and setns
bootstrap process are unified (i.e., init use nsexec instead of Go to clone new
process), we can remove _LIBCONTAINER_INITTYPE.

Note:
- we read nlmsghdr first before reading the content so we can get the total
  length of the payload and allocate buffer properly instead of allocating
  one large buffer.

- check read bytes vs the wanted number. It's an error if we failed to read
  the desired number of bytes from the pipe into the buffer.

Signed-off-by: Daniel, Dao Quang Minh <dqminh89@gmail.com>
2015-12-03 18:03:48 +00:00
Bogdan Purcareata 4c5eb45862 nsexec: Align clone child stack ptr to 16
This is required on ARM64 builds that use the clone syscall. Check [1].

[1] http://lxr.free-electrons.com/source/arch/arm64/kernel/process.c#L264

Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com>
2015-10-06 10:41:18 +00:00
Ido Yariv 08366a8597 Enter existing user namespace if present
When executing an additional process in a container, all namespaces are
entered but the user namespace. As a result, the process may be
executed as the host's root user. This has both functionality and
security implications.

Fix this by adding the missing user namespace to the array of
namespaces. Since joining a user namespace in which the caller is
already a member yields an error, skip namespaces we're already in.

Last, remove a needless and buggy AT_SYMLINK_NOFOLLOW in the code.

Signed-off-by: Ido Yariv <ido@wizery.com>
2015-09-21 21:49:52 -04:00
Michael Crosby 8f97d39dd2 Move libcontainer into subdirectory
Signed-off-by: Michael Crosby <crosbymichael@gmail.com>
2015-06-21 19:29:15 -07:00