There are quite a few circumstances where /proc/self/exe pointing to a
pretty important container binary is a _bad_ thing, so to avoid this we
have to make a copy (preferably doing self-clean-up and not being
writeable).
We require memfd_create(2) -- though there is an O_TMPFILE fallback --
but we can always extend this to use a scratch MNT_DETACH overlayfs or
tmpfs. The main downside to this approach is no page-cache sharing for
the runc binary (which overlayfs would give us) but this is far less
complicated.
This is only done during nsenter so that it happens transparently to the
Go code, and any libcontainer users benefit from it. This also makes
ExtraFiles and --preserve-fds handling trivial (because we don't need to
worry about it).
Fixes: CVE-2019-5736
Co-developed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Aleksa Sarai <asarai@suse.de>