docs: add information about terminals

Users can get very confused by how terminals work with runc, and the quite
confusing "terminal: ..." option. Add a document which goes through all of the
important parts of terminal handling in runc, in the hopes that we can just
point people to this as an explanation.

Signed-off-by: Avi Deitcher <avi@deitcher.net>
[cyphar: quite a large rewrite to fix factual errors and structure]
Co-authored-by: Avi Deitcher <avi@deitcher.net>
Signed-off-by: Aleksa Sarai <asarai@suse.de>

parent 0154d05778
commit 472fcb30d9

@ -0,0 +1,314 @

# Terminals and Standard IO #

*Note that the default configuration of `runc` (foreground, new terminal) is
generally the best option for most users. This document exists to help explain
what the purpose of the different modes is, and to try to steer users away from
common mistakes and misunderstandings.*

In general, most processes on Unix (and Unix-like) operating systems have 3
standard file descriptors provided at the start, collectively referred to as
"standard IO" (`stdio`):

* `0`: standard-in (`stdin`), the input stream into the process
* `1`: standard-out (`stdout`), the output stream from the process
* `2`: standard-error (`stderr`), the error stream from the process

When creating and running a container via `runc`, it is important to take care
to structure the `stdio` the new container's process receives. In some ways
containers are just regular processes, while in other ways they're an isolated
sub-partition of your machine (in a similar sense to a VM). This means that the
structure of IO is not as simple as with ordinary programs (which generally
just use the file descriptors you give them).

## Other File Descriptors ##

Before we continue, it is important to note that processes can have more file
descriptors than just `stdio`. By default in `runc` no other file descriptors
will be passed to the spawned container process. If you wish to explicitly pass
file descriptors to the container you have to use the `--preserve-fds` option.
These ancillary file descriptors don't have any of the strange semantics
discussed further in this document (those only apply to `stdio`) -- they are
passed untouched by `runc`.

Note that `--preserve-fds` does not take a list of individual file descriptors
to preserve. Instead, it takes the number of file descriptors (not including
`stdio` or `LISTEN_FDS`) that should be passed to the container. In the
following example:

```
% runc run --preserve-fds 5 <container>
```

`runc` will pass the first `5` file descriptors after `stdio` (`3`, `4`, `5`,
`6`, and `7` -- assuming that `LISTEN_FDS` has not been configured) to the
container.

In addition to `--preserve-fds`, `LISTEN_FDS` file descriptors are passed
automatically to allow for `systemd`-style socket activation. To extend the
above example:

```
% LISTEN_PID=$pid_of_runc LISTEN_FDS=3 runc run --preserve-fds 5 <container>
```

`runc` will now pass the first `8` file descriptors after `stdio` (and it will
also pass `LISTEN_FDS=3` and `LISTEN_PID=1` to the container). The first `3`
(`3`, `4`, and `5`) were passed due to `LISTEN_FDS` and the other `5` (`6`,
`7`, `8`, `9`, and `10`) were passed due to `--preserve-fds`. You should keep
this in mind if you use `runc` directly in something like a `systemd` unit
file. To disable this `LISTEN_FDS`-style passing just unset `LISTEN_FDS`.
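
The numbering described above can be reproduced with a few lines of Python.
This is only a sketch of the arithmetic in this section, not code taken from
`runc`; the function name `passed_fds` is hypothetical.

```python
def passed_fds(preserve_fds, listen_fds=0):
    """Return the host fd numbers passed to the container, grouped by reason."""
    first_extra = 3  # fds 0, 1 and 2 are stdio and are handled separately
    # LISTEN_FDS descriptors come first, directly after stdio
    listen = list(range(first_extra, first_extra + listen_fds))
    # --preserve-fds counts descriptors starting after the LISTEN_FDS ones
    preserved_start = first_extra + listen_fds
    preserved = list(range(preserved_start, preserved_start + preserve_fds))
    return {"listen_fds": listen, "preserve_fds": preserved}


# runc run --preserve-fds 5 <container>
print(passed_fds(5))
# LISTEN_FDS=3 runc run --preserve-fds 5 <container>
print(passed_fds(5, listen_fds=3))
```

The second call shows why setting `LISTEN_FDS` shifts the range that
`--preserve-fds` covers: the count stays the same, but it starts `3` further
along.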

**Be very careful when passing file descriptors to a container process.** Due
to some Linux kernel (mis)features, a container with access to certain types of
file descriptors (such as `O_PATH` descriptors) outside of the container's root
file system can use these to break out of the container's pivoted mount
namespace. [This has resulted in CVEs in the past.][CVE-2016-9962]

[CVE-2016-9962]: https://nvd.nist.gov/vuln/detail/CVE-2016-9962

## <a name="terminal-modes" /> Terminal Modes ##

`runc` supports two distinct methods for passing `stdio` to the container's
primary process:

* [new terminal](#new-terminal) (`terminal: true`)
* [pass-through](#pass-through) (`terminal: false`)

When first using `runc` these two modes will look incredibly similar, but the
similarity is deceptive: the two modes have quite different characteristics.

By default, `runc spec` will generate a configuration that requests a new
terminal (`terminal: true`). However, if the `terminal: ...` line is not
present in `config.json` then pass-through is the default.

*In general we recommend using new terminal, because it means that tools like
`sudo` will work inside your container. But pass-through can be useful if you
know what you're doing, or if you're using `runc` as part of a non-interactive
pipeline.*

### <a name="new-terminal" /> New Terminal ###

In new terminal mode, `runc` will create a brand-new "console" (or more
precisely, a new pseudo-terminal using the container's namespaced
`/dev/pts/ptmx`) for your contained process to use as its `stdio`.

When you start a process in new terminal mode, `runc` will do the following:

1. Create a new pseudo-terminal.
2. Pass the slave end to the container's primary process as its `stdio`.
3. Send the master end to a process to interact with the `stdio` for the
   container's primary process ([details below](#runc-modes)).

Note that since a new pseudo-terminal is being used for communication with the
container, some strange properties of pseudo-terminals might surprise you. For
instance, by default, all new pseudo-terminals translate the byte `'\n'` to the
sequence `'\r\n'` on both `stdout` and `stderr`. In addition there are [a whole
range of `ioctl(2)`s that can only interact with pseudo-terminal
`stdio`][tty_ioctl(4)].
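
The newline translation is easy to observe without `runc` at all. The
following is a minimal sketch that allocates an ordinary host pseudo-terminal
with Python's `pty` module (a stand-in for the one `runc` would create inside
the container) and assumes the terminal is left in its default settings.

```python
import os
import pty

# Allocate a host pseudo-terminal pair. This only stands in for the one
# runc would allocate from the container's /dev/pts/ptmx.
master_fd, slave_fd = pty.openpty()

# The "container" side writes a plain newline to its terminal...
os.write(slave_fd, b"hello\n")

# ...and whoever holds the master end reads what the terminal layer produced.
data = os.read(master_fd, 1024)
print(repr(data))

os.close(slave_fd)
os.close(master_fd)
```

Comparing the bytes written to the slave end with the bytes read from the
master end shows the extra `'\r'` that a log collector or test harness has to
account for in new terminal mode.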

> **NOTE**: In new terminal mode, all three `stdio` file descriptors are the
> same underlying file. The reason for this is to match how a shell's `stdio`
> looks to a process (as well as to avoid the race conditions that come from
> having to deal with multiple master pseudo-terminal file descriptors).
> However this means that it is not really possible to uniquely distinguish
> between `stdout` and `stderr` from the caller's perspective.

[tty_ioctl(4)]: https://linux.die.net/man/4/tty_ioctl
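
To illustrate why `stdout` and `stderr` cannot be told apart, the sketch below
gives a child process the slave end of a host pseudo-terminal for all three
`stdio` descriptors, which is roughly the arrangement described in the note.
The child is a hypothetical stand-in for a container process, not something
`runc` ships.

```python
import os
import pty
import subprocess
import sys

master_fd, slave_fd = pty.openpty()

# Hypothetical stand-in for a container process: one line per output stream.
child_code = (
    "import sys; "
    "print('to stdout', flush=True); "
    "print('to stderr', file=sys.stderr, flush=True)"
)

# All three stdio descriptors of the child are the same pseudo-terminal.
proc = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdin=slave_fd, stdout=slave_fd, stderr=slave_fd,
)
proc.wait()

# Both lines arrive on the single master descriptor, with nothing to mark
# which stream each one was written to.
output = b""
while output.count(b"\n") < 2:
    output += os.read(master_fd, 1024)
print(output)

os.close(slave_fd)
os.close(master_fd)
```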

### <a name="pass-through" /> Pass-Through ###

If you have already set up some file handles that you wish your contained
process to use as its `stdio`, then you can ask `runc` to pass them through to
the contained process (this is not necessarily the same as `--preserve-fds`'s
passing of file descriptors -- [details below](#runc-modes)). As an example
(assuming that `terminal: false` is set in `config.json`):

```
% echo input | runc run some_container > /tmp/log.out 2> /tmp/log.err
```

Here the container's various `stdio` file descriptors will be substituted with
the following:

* `stdin` will be sourced from the `echo input` pipeline.
* `stdout` will be output into `/tmp/log.out` on the host.
* `stderr` will be output into `/tmp/log.err` on the host.

Note that the actual file handles seen inside the container may be different
[based on the mode `runc` is being used in](#runc-modes) (for instance, the
file referenced by `1` could be `/tmp/log.out` directly or a pipe which `runc`
is using to buffer output, based on the mode). However the net result will be
the same in either case. In principle you could use the [new terminal
mode](#new-terminal) in a pipeline, but the difference will become more clear
when you are introduced to [`runc`'s detached mode](#runc-modes).
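
The difference between "the file directly" and "a pipe that is copied into the
file" can be sketched with ordinary processes. The example below does not
invoke `runc`; the child is a hypothetical stand-in for a container process
that simply reports what kind of file its own `stdout` is, and the helper
names `run_direct` and `run_relayed` are made up for this illustration.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for a container process: reports what kind of file
# its own stdout (fd 1) is.
CHILD = (
    "import os, stat; m = os.fstat(1).st_mode; "
    "print('pipe' if stat.S_ISFIFO(m) else "
    "'regular file' if stat.S_ISREG(m) else 'other')"
)


def run_direct(log_path):
    """Hand the log file straight to the child as its stdout."""
    with open(log_path, "wb") as log:
        subprocess.run([sys.executable, "-c", CHILD], stdout=log, check=True)


def run_relayed(log_path):
    """Give the child a pipe and copy its output into the log file ourselves."""
    with open(log_path, "wb") as log:
        proc = subprocess.Popen([sys.executable, "-c", CHILD],
                                stdout=subprocess.PIPE)
        for chunk in iter(lambda: proc.stdout.read(4096), b""):
            log.write(chunk)
        proc.wait()


with tempfile.TemporaryDirectory() as tmp:
    direct_log = os.path.join(tmp, "direct.out")
    relayed_log = os.path.join(tmp, "relayed.out")
    run_direct(direct_log)
    run_relayed(relayed_log)
    with open(direct_log) as f:
        direct_seen = f.read().strip()
    with open(relayed_log) as f:
        relayed_seen = f.read().strip()

print(direct_seen, "/", relayed_seen)
```

In both cases the child's output ends up in the log file on the host; only the
type of descriptor the child sees differs, which is the distinction the `runc`
modes below build on.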

## <a name="runc-modes" /> `runc` Modes ##

`runc` itself runs in two modes:

* [foreground](#foreground)
* [detached](#detached)

You can use either [terminal mode](#terminal-modes) with either `runc` mode.
However, there are considerations that may indicate preference for one mode
over another. Note that while the two types of modes (terminal and `runc`) are
conceptually independent from each other, you should be aware of the
intricacies of which combination you are using.

*In general we recommend using foreground because it's the most
straight-forward to use, with the only downside being that you will have a
long-running `runc` process. Detached mode is difficult to get right and
generally requires having your own `stdio` management.*

### Foreground ###

Foreground is the default (and most straight-forward) mode of `runc`. In this
mode, your `runc` command remains in the foreground with the container process
as a child. All `stdio` is buffered through the foreground `runc` process
(irrespective of which terminal mode you are using). This is conceptually quite
similar to running a normal process interactively in a shell (and if you are
using `runc` in a shell interactively, this is what you should use).

Because the `stdio` will be buffered in this mode, some very important
peculiarities of this mode should be kept in mind:

* With [new terminal mode](#new-terminal), the container will see a
  pseudo-terminal as its `stdio` (as you might expect). However, the `stdio` of
  the foreground `runc` process will remain the `stdio` that the process was
  started with -- and `runc` will copy all data between its `stdio` and the
  container's `stdio`. This means that while a new pseudo-terminal has been
  created, the foreground `runc` process manages it over the lifetime of the
  container.

* With [pass-through mode](#pass-through), the foreground `runc`'s `stdio` is
  **not** passed to the container. Instead, the container's `stdio` is a set of
  pipes which are used to copy data between `runc`'s `stdio` and the
  container's `stdio`. This means that the container never has direct access to
  host file descriptors (aside from the pipes created by the container runtime,
  but that shouldn't be an issue).

The main drawback of the foreground mode of operation is that it requires a
long-running foreground `runc` process. If you kill the foreground `runc`
process then you will no longer have access to the `stdio` of the container
(and in most cases this will result in the container dying abnormally due to
`SIGPIPE` or some other error). By extension this means that any bug in the
long-running foreground `runc` process (such as a memory leak) or a stray
OOM-kill sweep could result in your container being killed **through no fault
of the user**. In addition, there is no way in foreground mode of passing a
file descriptor directly to the container process as its `stdio` (like
`--preserve-fds` does).

These shortcomings are obviously sub-optimal and are the reason that `runc` has
an additional mode called "detached mode".

### Detached ###

In contrast to foreground mode, in detached mode there is no long-running
foreground `runc` process once the container has started. In fact, there is no
long-running `runc` process at all. However, this means that it is up to the
caller to handle the `stdio` after `runc` has set it up for you. In a shell
this means that the `runc` command will exit and control will return to the
shell, after the container has been set up.

You can run `runc` in detached mode in one of the following ways:

* `runc run -d ...` which operates similarly to `runc run` but is detached.
* `runc create` followed by `runc start` which is the standard container
  lifecycle defined by the OCI runtime specification (`runc create` sets up the
  container completely, waiting for `runc start` to begin execution of user
  code).

The main use-case of detached mode is for higher-level tools that want to be
wrappers around `runc`. By running `runc` in detached mode, those tools have
far more control over the container's `stdio` without `runc` getting in the
way (most wrappers around `runc` like `cri-o` or `containerd` use detached mode
for this reason).

Unfortunately using detached mode is a bit more complicated and requires more
care than the foreground mode -- mainly because it is now up to the caller to
handle the `stdio` of the container.

#### Detached Pass-Through ####

In detached mode, pass-through actually does what it says on the tin -- the
`stdio` file descriptors of the `runc` process are passed through (untouched)
to the container's `stdio`. The purpose of this option is to allow a user to
set up `stdio` for a container themselves and then force `runc` to just use
their pre-prepared `stdio` (without any pseudo-terminal funny business). *If
you don't see why this would be useful, don't use this option.*

**You must be incredibly careful when using detached pass-through (especially
in a shell).** The reason for this is that by using detached pass-through you
are passing host file descriptors to the container. In the case of a shell,
usually your `stdio` is going to be a pseudo-terminal (on your host). A
malicious container could take advantage of TTY-specific `ioctl`s like
`TIOCSTI` to fake input into the **host** shell (remember that in detached
mode, control is returned to your shell and so the terminal you've given the
container is being read by a shell prompt).

There are also several other issues with running non-malicious containers in a
shell with detached pass-through (where you pass your shell's `stdio` to the
container):

* Output from the container will be interleaved with output from your shell (in
  a non-deterministic way), without any real way of distinguishing where a
  particular piece of output came from.

* Any input to `stdin` will be non-deterministically split and given to either
  the container or the shell (because both are blocked on a `read(2)` of the
  same FIFO-style file descriptor).

Both issues stem from the fact that there is going to be a race whenever your
host and the container try to read from (or write to) the same `stdio`. This
problem is especially obvious when in a shell, where usually the terminal has
been put into raw mode (where each individual key-press should cause `read(2)`
to return).

> **NOTE**: There is also currently a [known problem][issue-1721] where using
> detached pass-through will result in the container hanging if the `stdout` or
> `stderr` is a pipe (though this should be a temporary issue).

[issue-1721]: https://github.com/opencontainers/runc/issues/1721

#### Detached New Terminal ####

When creating a new pseudo-terminal in detached mode, a fairly obvious
problem appears -- how do we use the new terminal that `runc` created? Unlike
in pass-through, `runc` has created a new set of file descriptors that need to
be used by *something* in order for container communication to work.

The way this problem is resolved is through the use of Unix domain sockets.
There is a feature of Unix sockets called `SCM_RIGHTS` which allows a file
descriptor to be sent through a Unix socket to a completely separate process
(which can then use that file descriptor as though it had opened it itself).
When using `runc` in detached new terminal mode, this is how a user gets access
to the pseudo-terminal's master file descriptor.

To this end, there is a new option (which is required if you want to use `runc`
in detached new terminal mode): `--console-socket`. This option takes the path
to a Unix domain socket which `runc` will connect to and send the
pseudo-terminal master file descriptor down. The general process for getting
the pseudo-terminal master is as follows:

1. Create a Unix domain socket listening at some path, `$socket_path`.
2. Call `runc run` or `runc create` with the argument `--console-socket
   $socket_path`.
3. Using `recvmsg(2)` retrieve the file descriptor sent using `SCM_RIGHTS` by
   `runc`.
4. Now the manager can interact with the `stdio` of the container, using the
   retrieved pseudo-terminal master.

After `runc` exits, the only process with a copy of the pseudo-terminal master
file descriptor is whoever read the file descriptor from the socket.
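
The receiving side of these steps can be sketched in Python. This is a minimal
illustration under stated assumptions rather than a replacement for the
clients mentioned below: `fake_runc` is a hypothetical stand-in that plays the
role of `runc` (so the example runs without a container), the payload byte it
sends alongside the descriptor is arbitrary, and `recv_fd` is a made-up helper
name.

```python
import array
import os
import pty
import socket
import tempfile
import threading


def recv_fd(conn):
    """Receive one file descriptor sent over `conn` with SCM_RIGHTS."""
    fds = array.array("i")
    _, ancdata, _, _ = conn.recvmsg(1024, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing bytes that do not form a whole int
            fds.frombytes(cdata[:len(cdata) - (len(cdata) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


def fake_runc(socket_path, master_fd):
    """Hypothetical stand-in for runc: connect and send the pty master."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.connect(socket_path)
        s.sendmsg([b"x"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                            array.array("i", [master_fd]))])
    os.close(master_fd)  # like runc, keep no copy after handing it over


# Stand-in for the container's terminal, with some output already waiting.
master_fd, slave_fd = pty.openpty()
os.write(slave_fd, b"hello from the container\n")

with tempfile.TemporaryDirectory() as tmp:
    socket_path = os.path.join(tmp, "console.sock")
    # Step 1: create a Unix domain socket listening at $socket_path.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as listener:
        listener.bind(socket_path)
        listener.listen(1)
        # Step 2 (simulated): "runc --console-socket $socket_path" connects.
        sender = threading.Thread(target=fake_runc,
                                  args=(socket_path, master_fd))
        sender.start()
        conn, _ = listener.accept()
        with conn:
            # Step 3: retrieve the descriptor with recvmsg(2).
            received_fd = recv_fd(conn)
        sender.join()

# Step 4: interact with the container's stdio through the received master.
first_line = os.read(received_fd, 1024)
print(first_line)

os.close(received_fd)
os.close(slave_fd)
```

The received descriptor is a new reference to the same pseudo-terminal master,
so it stays usable after the sender has closed its own copy and exited.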

> **NOTE**: Currently `runc` doesn't support abstract socket addresses (due to
> it not being possible to pass an `argv` with a null-byte as the first
> character). In the future this may change, but currently you must use a valid
> path name.

In order to help users make use of detached new terminal mode, we have provided
a [Go implementation in the `go-runc` bindings][containerd/go-runc.Socket], as
well as [a simple client][recvtty].

[containerd/go-runc.Socket]: https://godoc.org/github.com/containerd/go-runc#Socket
[recvtty]: /contrib/cmd/recvtty