jasder/runc - runc - Junke Open Source Project Hosting

Commit Graph

Author	SHA1	Message	Date
Justin Cormack	6f714aa928	Use getenv not secure_getenv secure_getenv is a Glibc extension and so this code does not compile on Musl libc any more after this patch. secure_getenv is only intended to be used in setuid binaries, in order that they should not trust their environment. It simply returns NULL if the binary is running setuid. If runc was installed setuid, the user can already do anything as root, so it is game over, so this check is not needed. Signed-off-by: Justin Cormack <justin.cormack@docker.com>	2019-03-14 10:58:10 +00:00
Aleksa Sarai	2d4a37b427	nsenter: cloned_binary: userspace copy fallback if sendfile fails There are some circumstances where sendfile(2) can fail (one example is that AppArmor appears to block writing to deleted files with sendfile(2) under some circumstances) and so we need to have a userspace fallback. It's fairly trivial (and handles short-writes). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:10 +11:00
Aleksa Sarai	16612d74de	nsenter: cloned_binary: try to ro-bind /proc/self/exe before copying The usage of memfd_create(2) and other copying techniques is quite wasteful, despite attempts to minimise it with _LIBCONTAINER_STATEDIR. memfd_create(2) added ~10M of memory usage to the cgroup associated with the container, which can result in some setups getting OOM'd (or just hogging the host's memory when you have lots of created-but-not-started containers sticking around). The easiest way of solving this is by creating a read-only bind-mount of the binary, opening that read-only bind-mount, and then umounting it to ensure that the host won't accidentally be re-mounted read-write. This avoids all copying and cleans up naturally like the other techniques used. Unfortunately, like the O_TMPFILE fallback, this requires being able to create a file inside _LIBCONTAINER_STATEDIR (since bind-mounting over the most obvious path -- /proc/self/exe -- is a very bad idea). Unfortunately detecting this isn't fool-proof -- on a system with a read-only root filesystem (that might become read-write during "runc init" execution), we cannot tell whether we have already done an ro remount. As a partial mitigation, we store a _LIBCONTAINER_CLONED_BINARY environment variable which is checked alongside the protection being present. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:29:08 +11:00
Aleksa Sarai	af9da0a450	nsenter: cloned_binary: use the runc statedir for O_TMPFILE Writing a file to tmpfs actually incurs a memcg penalty, and thus the benefit of being able to disable memfd_create(2) with _LIBCONTAINER_DISABLE_MEMFD_CLONE is fairly minimal -- though it should be noted that quite a few distributions don't use tmpfs for /tmp (and instead have it as a regular directory or subvolume of the host filesystem). Since runc must have write access to the state directory anyway (and the state directory is usually not on a tmpfs) we can use that instead of /tmp -- avoiding potential memcg costs with no real downside. Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:51 +11:00
Aleksa Sarai	2429d59352	nsenter: cloned_binary: expand and add pre-3.11 fallbacks In order to get around the memfd_create(2) requirement, `0a8e4117e7` ("nsenter: clone /proc/self/exe to avoid exposing host binary to container") added an O_TMPFILE fallback. However, this fallback was flawed in two ways: * It required O_TMPFILE which is relatively new (having been added to Linux 3.11). * The fallback choice was made at compile-time, not runtime. This results in several complications when it comes to running binaries on different machines to the ones they were built on. The easiest way to resolve these things is to have fallbacks work in a more procedural way (though it does make the code unfortunately more complicated) and to add a new fallback that uses mkostemp(3). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-03-01 23:28:50 +11:00
Aleksa Sarai	5b775bf297	nsenter: cloned_binary: detect and handle short copies For a variety of reasons, sendfile(2) can end up doing a short-copy so we need to just loop until we hit the binary size. Since /proc/self/exe is tautologically our own binary, there's no chance someone is going to modify it underneath us (or change its size). Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-26 19:51:01 +11:00
Christian Brauner	bb7d8b1f41	nsexec (CVE-2019-5736): avoid parsing environ My first attempt to simplify this and make it less costly focussed on the way constructors are called. I was under the impression that the ELF specification mandated that argc, argv, and actually even envp need to be passed to functions located in the .init_array section (aka "constructors"). Actually, the specification says (cf. [2]): SHT_INIT_ARRAY This section contains an array of pointers to initialization functions, as described in ``Initialization and Termination Functions'' in Chapter 5. Each pointer in the array is taken as a parameterless procedure with a void return. which means that this becomes a libc specific decision. Glibc passes down those args, musl doesn't. So this approach can't work. However, we can at least remove the environment parsing part based on POSIX since [1] mandates that there should be an environ variable defined in unistd.h which provides access to the environment. See also the relevant Open Group specification [1]. [1]: http://pubs.opengroup.org/onlinepubs/9699919799/ [2]: http://www.sco.com/developers/gabi/latest/ch4.sheader.html#init_array Fixes: CVE-2019-5736 Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>	2019-02-14 16:06:21 +01:00
Aleksa Sarai	0a8e4117e7	nsenter: clone /proc/self/exe to avoid exposing host binary to container There are quite a few circumstances where /proc/self/exe pointing to a pretty important container binary is a _bad_ thing, so to avoid this we have to make a copy (preferably doing self-clean-up and not being writeable). We require memfd_create(2) -- though there is an O_TMPFILE fallback -- but we can always extend this to use a scratch MNT_DETACH overlayfs or tmpfs. The main downside to this approach is no page-cache sharing for the runc binary (which overlayfs would give us) but this is far less complicated. This is only done during nsenter so that it happens transparently to the Go code, and any libcontainer users benefit from it. This also makes ExtraFiles and --preserve-fds handling trivial (because we don't need to worry about it). Fixes: CVE-2019-5736 Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Aleksa Sarai <asarai@suse.de>	2019-02-08 18:57:59 +11:00
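
Commits 5b775bf297 and 2d4a37b427 describe looping over sendfile(2) until the full binary size is copied, with a userspace copy as a fallback when sendfile(2) fails. The following is a minimal sketch of that idea, not the code from runc itself; the function names copy_userspace and copy_all and the 4096-byte buffer size are assumptions made for illustration.

```c
#include <errno.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to `size` bytes from src to dst using
 * read(2)/write(2), retrying on EINTR and on short writes. */
static ssize_t copy_userspace(int dst, int src, size_t size)
{
	char buf[4096];
	size_t total = 0;

	while (total < size) {
		size_t want = size - total;
		if (want > sizeof(buf))
			want = sizeof(buf);

		ssize_t nread = read(src, buf, want);
		if (nread < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (nread == 0)
			break; /* unexpected end of file */

		/* write(2) may accept fewer bytes than requested, so loop. */
		ssize_t off = 0;
		while (off < nread) {
			ssize_t nwritten = write(dst, buf + off, (size_t)(nread - off));
			if (nwritten < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += nwritten;
		}
		total += (size_t)nread;
	}
	return (ssize_t)total;
}

/* Hypothetical helper: try sendfile(2) first and loop on short copies;
 * if sendfile(2) fails, copy the remaining bytes in userspace. */
static ssize_t copy_all(int dst, int src, size_t size)
{
	size_t sent = 0;

	while (sent < size) {
		/* A NULL offset makes sendfile(2) advance src's file offset,
		 * so the fallback continues from the right position. */
		ssize_t n = sendfile(dst, src, NULL, size - sent);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			ssize_t rest = copy_userspace(dst, src, size - sent);
			if (rest < 0)
				return -1;
			sent += (size_t)rest;
			break;
		}
		if (n == 0)
			break; /* source ended early */
		sent += (size_t)n;
	}
	return (ssize_t)sent;
}
```

A caller would compare the return value against the expected size (for example, st_size from fstat(2) on the source descriptor) and treat any mismatch as a failed copy.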

8 Commits