Datagram BIO API revisions for sendmmsg/recvmmsg
================================================

We need to evolve the API surface of BIO which is relevant to BIO_dgram (and
the eventual BIO_dgram_mem) to support APIs which allow multiple datagrams to
be sent or received simultaneously, such as sendmmsg(2)/recvmmsg(2).

The adopted design
------------------

### Design decisions

The adopted design makes the following design decisions:

- We use a sendmmsg/recvmmsg-like API. The alternative API was not considered
  for adoption because it is an explicit goal that the adopted API be suitable
  for concurrent use on the same BIO.

- We define our own structures rather than using the OS's `struct mmsghdr`.
  The motivations for this are:

  - It ensures portability between OSes and allows the API to be used on OSes
    which do not support `sendmmsg` or `sendmsg`.

  - It allows us to use structures in keeping with OpenSSL's existing
    abstraction layers (e.g. `BIO_ADDR` rather than `struct sockaddr`).

  - We do not have to expose functionality which we cannot guarantee we can
    support on all platforms (for example, arbitrary control messages).

  - It avoids the need to include OS headers in our own public headers, which
    would pollute the environment of applications which include our headers,
    potentially undesirably.

- For OSes which do not support `sendmmsg`, we emulate it using repeated calls
  to `sendmsg`. For OSes which do not support `sendmsg`, we emulate it using
  `sendto` to the extent feasible. This avoids the need for code consuming
  these new APIs to define a fallback code path.
- We do not define any flags at this time, as the flags previously considered
  for adoption cannot be supported on all platforms (Win32 does not have
  `MSG_DONTWAIT`).

- We ensure the extensibility of our `BIO_MSG` structure in a way that
  preserves ABI compatibility using a `stride` argument which callers must set
  to `sizeof(BIO_MSG)`. Implementations can examine the stride field to
  determine whether a given field is part of a `BIO_MSG`. This allows us to
  add optional fields to `BIO_MSG` at a later time without breaking ABI. All
  new fields must be added to the end of the structure.

- The BIO methods are designed to support stateless operation in which they
  are simply calls to the equivalent system calls, where supported, without
  changing BIO state. In particular, this means that things like retry flags
  are not set or cleared by `BIO_sendmmsg` or `BIO_recvmmsg`.

  The motivation for this is that these functions are intended to support
  concurrent use on the same BIO. If they read or modified BIO state, they
  would need to be synchronised with a lock, undermining performance on what
  (for `BIO_dgram`) would otherwise be a straight system call.

- We do not support iovecs. The motivations for this are:

  - Not all platforms can support iovecs (e.g. Windows).

  - The only way we could emulate iovecs on platforms which don't support them
    is by copying the data to be sent into a staging buffer. This would defeat
    all of the advantages of iovecs and prevent us from meeting our
    zero/single-copy requirements. Moreover, it would lead to extremely
    surprising performance variations for consumers of the API.

  - We do not believe iovecs are needed to meet our performance requirements
    for QUIC.
    The reason for this is that aside from a minimal packet header, all data
    in QUIC is encrypted, so all data sent via QUIC must pass through an
    encrypt step anyway. This means that all data sent will already be copied,
    and there is no issue depositing the ciphertext in a staging buffer
    together with the frame header.

  - Even if we did support iovecs, we would have to impose a limit on the
    number of iovecs supported, because we translate from our own structures
    (as discussed above) and also intend these functions to be stateless and
    not require locking. Therefore the OS-native iovec structures would need
    to be allocated on the stack.

- Sometimes, an application may wish to learn the local interface address
  associated with a receive operation or specify the local interface address
  to be used for a send operation. We support this, but require this
  functionality to be explicitly enabled before use.

  The reason for this is that enabling this functionality generally requires
  that the socket be reconfigured using `setsockopt` on most platforms. Doing
  this on demand would require state in the BIO to determine whether this
  functionality is currently switched on, which would require otherwise
  unnecessary locking, undermining performance in concurrent usage of this
  API on a given BIO. Requiring this functionality to be enabled explicitly
  before use allows this initialisation to be done up front without
  performance cost. It also helps users of the API understand that this
  functionality is not always available, and to detect in advance whether it
  is available.
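As a minimal sketch of the stride-based extensibility scheme described above,
an implementation can treat a field as present only if the caller's stride
covers it. The names `BIO_MSG_EXT` and `MSG_HAVE_FIELD` below are invented for
illustration and are not part of the proposed API; the struct mirrors the
`BIO_MSG` layout with one imagined future field appended.

```c
#include <stddef.h>
#include <stdint.h>

/* Invented stand-in for a future, extended BIO_MSG layout: the original
 * fields, plus one imagined field appended at the end. */
typedef struct bio_msg_ext_st {
    void *data;
    size_t data_len;
    void *peer, *local;   /* stand-ins for BIO_ADDR * */
    uint64_t flags;
    uint64_t new_field;   /* imagined later addition */
} BIO_MSG_EXT;

/* A field is "present" only if the caller's stride covers all of it.
 * An old caller compiled against the original layout passes the old
 * sizeof as the stride, so the implementation sees new_field as absent. */
#define MSG_HAVE_FIELD(type, field, stride) \
    ((offsetof(type, field) + sizeof(((type *)0)->field)) <= (stride))
```

Because new fields may only be appended, a smaller stride from an older caller
never falsely reports a new field as present.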
### Design

The currently proposed design is as follows:

```c
typedef struct bio_msg_st {
    void *data;
    size_t data_len;
    BIO_ADDR *peer, *local;
    uint64_t flags;
} BIO_MSG;

#define BIO_UNPACK_ERRNO(e) /*...*/
#define BIO_IS_ERRNO(e) /*...*/

ossl_ssize_t BIO_sendmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
ossl_ssize_t BIO_recvmmsg(BIO *b, BIO_MSG *msg, size_t stride,
                          size_t num_msg, uint64_t flags);
```

The API is used as follows:

- `msg` points to an array of `num_msg` `BIO_MSG` structures.

- Both functions have identical prototypes and return the number of messages
  processed in the array. If no messages were sent due to an error, `-1` is
  returned. If an OS-level socket error occurs, a negative value `v` is
  returned. The caller should determine whether `v` is an OS-level socket
  error by calling `BIO_IS_ERRNO(v)` and may obtain the OS-level socket error
  code by calling `BIO_UNPACK_ERRNO(v)`.

- `stride` must be set to `sizeof(BIO_MSG)`.

- `data` points to the buffer of data to be sent or to be filled with received
  data. `data_len` is the size of the buffer in bytes on call. If the given
  message in the array is processed (i.e., if the return value exceeds the
  index of that message in the array), `data_len` is updated to the actual
  amount of data sent or received at return time.

- `flags` in the `BIO_MSG` structure provides per-message flags to the
  `BIO_sendmmsg` or `BIO_recvmmsg` call. If the given message in the array is
  processed, `flags` is written with zero or more result flags at return time.
  The `flags` argument to the call itself provides for global flags affecting
  all messages in the array. Currently, no per-message or global flags are
  defined and all of these fields are set to zero on call and on return.

- `peer` and `local` are optional pointers to `BIO_ADDR` structures into which
  the remote and local addresses are to be filled.
  If either of these is NULL, the given addressing information is not
  requested. Local address support may not be available in all circumstances,
  in which case processing of the message fails. (This means that the function
  returns the number of messages processed, or -1 if the message in question
  is the first message.)

  Support for `local` must be explicitly enabled before use, otherwise
  attempts to use it fail.

Local address support is enabled as follows:

```c
int BIO_dgram_set_local_addr_enable(BIO *b, int enable);
int BIO_dgram_get_local_addr_enable(BIO *b);
int BIO_dgram_get_local_addr_cap(BIO *b);
```

`BIO_dgram_get_local_addr_cap()` returns 1 if local address support is
available. It is then enabled using `BIO_dgram_set_local_addr_enable()`, which
fails if support is not available.

Options which were considered
-----------------------------

Options for the API surface which were considered included:

### sendmmsg/recvmmsg-like API

This design was chosen to form the basis of the adopted design, which is
described above.

```c
int BIO_readm(BIO *b, BIO_mmsghdr *msgvec,
              unsigned len, int flags, struct timespec *timeout);
int BIO_writem(BIO *b, BIO_mmsghdr *msgvec,
               unsigned len, int flags, struct timespec *timeout);
```

We can either define `BIO_mmsghdr` as a typedef of `struct mmsghdr` or
redefine an equivalent structure. The former has the advantage that we can
just pass the structures through to the syscall without copying them.

Note that in `BIO_mem_dgram` we will have to process and therefore understand
the contents of `struct mmsghdr` ourselves. Therefore, initially we define a
subset of `struct mmsghdr` as being supported: specifically, no control
messages; `msg_name` and `msg_iov` only.

The flags argument is defined by us. Initially we can support something like
`MSG_DONTWAIT` (say, `BIO_DONTWAIT`).
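Whichever structure definitions are chosen, the fallback for platforms lacking
`sendmmsg` (repeated `sendmsg` calls, as in the adopted design above) might
look roughly like the following sketch. The names `my_msg` and
`emulated_sendmmsg` are invented for illustration and are not part of any
proposed API.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Invented message record standing in for BIO_MSG (data and length only). */
struct my_msg {
    void  *data;
    size_t data_len;
};

/*
 * Sketch of the sendmmsg fallback: send each message with sendmsg() and
 * stop at the first failure. Returns the number of messages sent, or -1
 * if the very first message fails, mirroring the return-value semantics
 * described above.
 */
ssize_t emulated_sendmmsg(int fd, struct my_msg *msgs, size_t num_msg)
{
    size_t i;

    for (i = 0; i < num_msg; ++i) {
        struct iovec  iov;
        struct msghdr mh;
        ssize_t       n;

        iov.iov_base = msgs[i].data;
        iov.iov_len  = msgs[i].data_len;
        memset(&mh, 0, sizeof(mh));
        mh.msg_iov    = &iov;
        mh.msg_iovlen = 1;

        n = sendmsg(fd, &mh, 0);
        if (n < 0)
            return i > 0 ? (ssize_t)i : -1;
        msgs[i].data_len = (size_t)n; /* bytes actually sent */
    }
    return (ssize_t)num_msg;
}
```

Because each `sendmsg` call is independent and no BIO state is touched, this
loop keeps the stateless, lock-free property the design aims for.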
#### Implementation Questions

If we go with this, there are some issues that arise:

- Are `BIO_mmsghdr`, `BIO_msghdr` and `BIO_iovec` simple typedefs for
  OS-provided structures, or our own independent structure definitions?

  - If we use OS-provided structures:

    - We would need to include the OS headers which provide these structures
      in our public API headers.

    - If we choose to support these functions when OS support is not available
      (see discussion below), we would need to define our own structures in
      that case (a “polyfill” approach).

  - If we use our own structures:

    - We would need to translate these structures during every call.

      We would also need storage inside the BIO_dgram for *m* `struct
      msghdr`, *m\*v* iovecs, etc. Since we want to support multithreaded use,
      these allocations probably need to be on the stack, and therefore must
      be limited.

      Limiting *m* isn't a problem, because `sendmmsg` returns the number of
      messages sent, so the existing semantics we are trying to match let us
      just send or receive fewer messages than we were asked to.

      However, it does seem like we will need to limit *v*, the number of
      iovecs per message. So what limit should we give to *v*? We will need a
      fixed stack allocation of OS iovec structures, and we can allocate from
      this stack allocation as we iterate through the `BIO_msghdr` we have
      been given. So in practice we could just send messages until we reach
      our iovec limit, and then return.

      For example, suppose we allocate 64 iovecs internally:

      ```c
      struct iovec vecs[64];
      ```

      If the first message passed to a call to `BIO_writem` has 64 iovecs
      attached to it, no further messages can be sent and `BIO_writem`
      returns 1.

      If three messages are sent, with 32, 32, and 1 iovecs respectively, the
      first two messages are sent and `BIO_writem` returns 2.
      So the only important thing we would need to document in this API is the
      limit of iovecs on a single message; in other words, the number of
      iovecs which must not be exceeded if a forward progress guarantee is to
      be made. E.g. if we allocate 64 iovecs internally, `BIO_writem` with a
      single message with 65 iovecs will never work, and this becomes part of
      the API contract.

      Obviously these quantities of iovecs are unrealistically large. iovecs
      are small, so we can afford to set the limit high enough that it
      shouldn't cause any problems in practice. We can increase the limit
      later without a breaking API change, but we cannot decrease it later. So
      we might want to start with something small, like 8.

- We also need to decide what to do for OSes which don't support at least
  `sendmsg`/`recvmsg`:

  - Don't provide these functions and require all users of these functions to
    have an alternate code path which doesn't rely on them?

    - Not providing these functions on OSes that don't support at least
      `sendmsg`/`recvmsg` is a simple solution but adds complexity to code
      using BIO_dgram. (Though it does communicate more realistic performance
      expectations to that code, since it knows when these functions are
      actually available.)

  - Provide these functions and emulate the functionality:

    - However, there is a question here as to how we implement the iovec
      arguments on platforms without `sendmsg`/`recvmsg`. (We cannot use
      `writev`/`readv` because we need peer address information.) Logically,
      implementing these would then have to be done by copying buffers around
      internally before calling `sendto`/`recvfrom`, defeating the point of
      iovecs and providing a performance profile which is surprising to code
      using BIO_dgram.

    - Another option could be a variable limit on the number of iovecs, which
      can be queried from BIO_dgram. This would be a constant set when
      libcrypto is compiled.
      It would be 1 for platforms not supporting `sendmsg`/`recvmsg`. This
      again adds burdens on the code using BIO_dgram, but it seems the only
      way to avoid the surprising performance pitfall of buffer copying to
      emulate iovec support. There is a fair risk of code being written which
      accidentally works on one platform but not another, because the author
      didn't realise the iovec limit is 1 on some platforms. Possibly we could
      have an “iovec limit” variable in the BIO_dgram which is 1 by default
      and which can be increased by a call to a function
      `BIO_set_iovec_limit`, but not beyond the fixed size discussed above. It
      would return failure if this is not possible, which would give client
      code a clear way to determine whether its expectations are met.

### Alternate API

Could we use a simplified API? For example, could we have an API that returns
one datagram, where BIO_dgram uses `readmmsg` internally and queues the
returned datagrams, thereby still avoiding extra syscalls but offering a
simple API?

The problem here is that we want to support “single-copy” (where the data is
only copied as it is decrypted). Thus BIO_dgram needs to know the final
resting place of encrypted data at the time it makes the `readmmsg` call.

One option would be to allow the user to set a callback on BIO_dgram which it
can use to request a new buffer, then have an API which returns the buffer:

```c
int BIO_dgram_set_read_callback(BIO *b,
                                void *(*cb)(size_t len, void *arg),
                                void *arg);
int BIO_dgram_set_read_free_callback(BIO *b,
                                     void (*cb)(void *buf,
                                                size_t buf_len,
                                                void *arg),
                                     void *arg);
int BIO_read_dequeue(BIO *b, void **buf, size_t *buf_len);
```

The BIO_dgram calls the specified callback when it needs to generate internal
iovecs for its `readmmsg` call, and the received datagrams can then be popped
by the application and freed as it likes.
(The read free callback above is only used in rare circumstances, such as when
calls to `BIO_read` and `BIO_read_dequeue` are alternated, or when the
BIO_dgram is destroyed prior to all read buffers being dequeued; see below.)
For convenience we could have an extra call to allow a buffer to be pushed
back into the BIO_dgram's internal queue of unused read buffers, which avoids
the need for the application to do its own management of such recycled
buffers:

```c
int BIO_dgram_push_read_buffer(BIO *b, void *buf, size_t buf_len);
```

On the write side, the application provides buffers and can get a callback
when they are freed. BIO_write_queue just queues for transmission, and the
`sendmmsg` call is made when calling `BIO_flush`. (TBD: whether it is
reasonable to overload the semantics of BIO_flush in this way.)

```c
int BIO_dgram_set_write_done_callback(BIO *b,
                                      void (*cb)(const void *buf,
                                                 size_t buf_len,
                                                 int status,
                                                 void *arg),
                                      void *arg);
int BIO_write_queue(BIO *b, const void *buf, size_t buf_len);
int BIO_flush(BIO *b);
```

The status argument to the write done callback will be 1 on success, some
negative value on failure, and some special negative value if the BIO_dgram is
being freed before the write could be completed.

For send/receive addresses, we import the `BIO_(set|get)_dgram_(origin|dest)`
APIs proposed in the sendmsg/recvmsg PR (#5257). `BIO_get_dgram_(origin|dest)`
should be called immediately after `BIO_read_dequeue`, and
`BIO_set_dgram_(origin|dest)` should be called immediately before
`BIO_write_queue`.

This approach allows `BIO_dgram` to support myriad options via composition of
successive function calls in a “builder” style, rather than via a single
function call with an excessive number of arguments or pointers to unwieldy,
ever-growing argument structures requiring constant revision of the central
read/write functions of the BIO API.
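The callback-allocated read-buffer pattern described above can be modelled in
miniature as follows. All names here are invented for illustration; this is a
toy model of the proposed flow, not the proposed BIO API itself.

```c
#include <stdlib.h>
#include <string.h>

/* Application-supplied allocator, as in BIO_dgram_set_read_callback above. */
typedef void *(*alloc_cb)(size_t len, void *arg);

#define MODEL_QUEUE_MAX 8

struct model_dgram {
    void    *bufs[MODEL_QUEUE_MAX];
    size_t   lens[MODEL_QUEUE_MAX];
    size_t   head, count;
    alloc_cb alloc;
    void    *alloc_arg;
};

/* Trivial allocator an application might register. */
static void *model_malloc(size_t len, void *arg)
{
    (void)arg;
    return malloc(len);
}

/* Simulate the receive side: obtain a buffer from the application's callback
 * and deposit the payload directly into it (the memcpy stands in for the
 * kernel filling the buffer via recvmmsg; the BIO never owns an intermediate
 * staging buffer, which is the single-copy property). */
static int model_deliver(struct model_dgram *d, const void *pkt, size_t len)
{
    void *buf;
    size_t slot;

    if (d->count == MODEL_QUEUE_MAX || (buf = d->alloc(len, d->alloc_arg)) == NULL)
        return 0;
    memcpy(buf, pkt, len);
    slot = (d->head + d->count) % MODEL_QUEUE_MAX;
    d->bufs[slot] = buf;
    d->lens[slot] = len;
    d->count++;
    return 1;
}

/* Pop one datagram, as BIO_read_dequeue would; ownership of the buffer
 * passes to the caller, which frees it as it likes. */
static int model_dequeue(struct model_dgram *d, void **buf, size_t *len)
{
    if (d->count == 0)
        return 0;
    *buf = d->bufs[d->head];
    *len = d->lens[d->head];
    d->head = (d->head + 1) % MODEL_QUEUE_MAX;
    d->count--;
    return 1;
}
```

The key point the model illustrates is that the buffer's final resting place
is chosen by the application before the receive happens, so no extra copy is
needed when the datagram is later dequeued.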
Note that since `BIO_set_dgram_(origin|dest)` sets data on outgoing packets
and `BIO_get_dgram_(origin|dest)` gets data on incoming packets, it doesn't
follow that these access the same data (they are not setters and getters of
variables called "dgram origin" and "dgram destination", even though the names
make them look like setters and getters of the same variables). We probably
want to separate these, as there is no need for a getter for outgoing packet
destination, for example, and by separating them we allow the possibility of
multithreaded use (one thread reads, one thread writes) in the future.
Possibly we should choose less confusing names for these functions, such as
`BIO_set_outgoing_dgram_(origin|dest)` and
`BIO_get_incoming_dgram_(origin|dest)`.

Pros of this approach:

- The application can generate one datagram at a time and still get the
  advantages of sendmmsg/recvmmsg (fewer syscalls, etc.).

  We probably want this for our own QUIC implementation built on top of this
  anyway. Otherwise we will need another piece to do basically the same thing
  and agglomerate multiple datagrams into a single BIO call, unless we only
  want to use `sendmmsg` constructively in trivial cases (e.g. where we send
  two datagrams from the same function immediately after one another, which
  doesn't seem like a common use case).

- Flexible support for single-copy (zero-copy).

Cons of this approach:

- A very different way of doing reads/writes might be strange to existing
  applications. *But* the primary consumer of this new API will be our own
  QUIC implementation, so this is probably not a big deal. We can always
  support `BIO_read`/`BIO_write` as a less efficient fallback for existing
  third-party users of BIO_dgram.

#### Compatibility interop

Suppose the following sequence happens:

1. BIO_read (legacy call path)
2. BIO_read_dequeue (`recvmmsg`-based call path with callback-allocated
   buffer)
3. BIO_read (legacy call path)
For (1) we have two options:

a. Use `recvmmsg` and add the received datagrams to an RX queue, just as for
   the `BIO_read_dequeue` path. We use an OpenSSL-provided default allocator
   (`OPENSSL_malloc`) and flag these datagrams as needing to be freed by
   OpenSSL, not the application.

   When the application calls `BIO_read`, a copy is performed and the internal
   buffer is freed.

b. Use `recvfrom` directly. This means we have a `recvmmsg` path and a
   `recvfrom` path, depending on which API is being used.

   The disadvantage of (a) is that it incurs an extra copy relative to what we
   have now, whereas with (b) the buffer passed to `BIO_read` gets passed
   through to the syscall and we do not have to copy anything.

   Since we will probably need to support platforms without
   `sendmmsg`/`recvmmsg` support anyway, (b) seems like the better option.

For (2) the new API is used. Since the previous call to BIO_read is
essentially “stateless” (it's just a simple call to `recvfrom` and doesn't
require mutation of any internal BIO state, other than maybe the last datagram
source/destination address fields), BIO_dgram can go ahead and start using the
`recvmmsg` code path. Since the RX queue will obviously be empty at this
point, it is initialised and filled using `recvmmsg`, then one datagram is
popped from it.

For (3) we have a legacy `BIO_read` but several datagrams still in the RX
queue. In this case we do have to copy; we have no choice. However, this only
happens in circumstances where a user of BIO_dgram alternates between old and
new APIs, which should be very unusual.

Subsequently, for (3) we have to free the buffer using the free callback. This
is an unusual case where BIO_dgram is responsible for freeing read buffers and
not the application (the only other case being premature destruction, see
below). But since this seems a very strange API usage pattern, we may just
want to fail in this case.
Probably not worth supporting this. So we can have the following rule:

- After the first call to `BIO_read_dequeue` is made on a BIO_dgram, all
  subsequent calls to ordinary `BIO_read` will fail.

Of course, all of the above applies analogously to the TX side.

#### BIO_dgram_pair

We will also implement a BIO_dgram_pair from scratch. This will be provided as
a BIO pair which provides identical semantics to the BIO_dgram above, both for
the legacy and the zero-copy code paths.

#### Thread safety

It is a functional assumption of the above design that we would never want to
have more than one thread doing TX on the same BIO, and never more than one
thread doing RX on the same BIO.

If we did ever want to do this, one possibility (for the BIO_dgram case at
least) is multiple BIOs on the same FD. But I don't believe there is any
general intention to support multithreaded use of a single BIO at this time
(unless I am mistaken), so this seems like it isn't an issue.

If we wanted to support multithreaded use of the same FD using the same BIO,
we would need to revisit the set-call-then-execute-call API approach above
(`BIO_(set|get)_dgram_(origin|dest)`), as this would pose a problem. But I
mention this mainly for completeness. Our recently learnt lessons on cache
contention suggest that this probably wouldn't be a good idea anyway.

#### Other questions

BIO_dgram will call the allocation function to get buffers for `recvmmsg` to
fill. We might want to have a way to specify how many buffers it should offer
to `recvmmsg`, and thus how many buffers it allocates in advance.

#### Premature destruction

If BIO_dgram is freed before all datagrams are read, the read buffer free
callback is used to free any unreturned read buffers.
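The premature-destruction rule above amounts to the destructor walking the RX
queue and handing any unreturned buffers back through the registered free
callback. A toy sketch, with invented names (`toy_rxq`, `toy_rxq_destroy`)
rather than the proposed BIO API:

```c
#include <stddef.h>

/* Application-registered free callback, as in
 * BIO_dgram_set_read_free_callback above. */
typedef void (*free_cb)(void *buf, size_t buf_len, void *arg);

struct toy_rxq {
    void   *bufs[8];
    size_t  lens[8];
    size_t  count;
    free_cb on_free;
    void   *free_arg;
};

/* A counting callback an application (or a test) might register. */
static size_t freed_count;
static void count_free(void *buf, size_t buf_len, void *arg)
{
    (void)buf; (void)buf_len; (void)arg;
    ++freed_count;
}

/* Destroy the queue: every read buffer still queued is handed back to the
 * application via the free callback, so nothing leaks if the "BIO" is freed
 * before all datagrams were dequeued. */
static void toy_rxq_destroy(struct toy_rxq *q)
{
    size_t i;

    for (i = 0; i < q->count; ++i)
        q->on_free(q->bufs[i], q->lens[i], q->free_arg);
    q->count = 0;
}
```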