authorHugo Landau <>2022-08-22 15:32:16 +0100
committerHugo Landau <>2022-09-26 08:01:55 +0100
commit508e087c4c9e0f6548816e0044022b257f179585 (patch)
treec0e0bd453c13213c4793330d6a4080b159619733 /doc
parent28a5aa0cbdddfdf4d82a437d72407d4f52d4e54a (diff)
QUIC Flow Control
Reviewed-by: Paul Dale <> Reviewed-by: Matt Caswell <> Reviewed-by: Tomas Mraz <> (Merged from
Diffstat (limited to 'doc')
1 files changed, 272 insertions, 0 deletions
diff --git a/doc/designs/quic-design/ b/doc/designs/quic-design/
new file mode 100644
index 0000000000..54ba209bad
--- /dev/null
+++ b/doc/designs/quic-design/
@@ -0,0 +1,272 @@
+Flow Control
+Introduction to QUIC Flow Control
+QUIC flow control acts at both connection and stream levels. At any time,
+transmission of stream data could be prevented by connection-level flow control,
+by stream-level flow control, or both. Flow control uses a credit-based model in
+which the relevant flow control limit is expressed as the maximum number of
+bytes allowed to be sent on a stream, or across all streams, since the beginning
+of the stream or connection. This limit may be periodically bumped.
+It is important to note that both connection and stream-level flow control
+relate only to the transmission of QUIC stream data. QUIC flow control at stream
+level counts the total number of logical bytes sent on a given stream. Note that
+this does not count retransmissions; thus, if a byte is sent, lost, and sent
+again, this still only counts as one byte for the purposes of flow control. Note
+that the total number of logical bytes sent on a given stream is equivalent to
+the current “length” of the stream. In essence, the relevant quantity is
+`max(offset + len)` for all STREAM frames `(offset, len)` we have ever sent for
+the stream.
+(It is essential that this be determined correctly, as deadlock may occur if we
+believe we have exhausted our flow control credit whereas the peer believes we
+have not, as the peer may wait indefinitely for us to send more data before
+advancing us more flow control credit.)
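
As a minimal sketch of the `max(offset + len)` rule above (the function name `stream_length` and the frame list are illustrative, not part of any real API), the flow-controlled length of a stream can be derived from the STREAM frames sent so far, with retransmissions contributing nothing extra:

```python
def stream_length(frames):
    """Return the number of controlled bytes sent on a stream.

    `frames` is an iterable of (offset, length) pairs, one per STREAM frame
    ever sent on the stream, including retransmissions.
    """
    return max((offset + length for offset, length in frames), default=0)


# Bytes 0..99 are sent, bytes 50..99 are retransmitted, then bytes 100..149
# are sent. 200 bytes of payload went on the wire, but the retransmitted
# range does not extend the stream, so it is not counted again.
frames = [(0, 100), (50, 50), (100, 50)]
length = stream_length(frames)
```

Here `length` is the value that must stay within the stream-level limit, regardless of how many times any range was resent.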
+QUIC flow control at connection level is based on the sum of all the logical
+bytes transmitted across all streams since the start of the connection.
+Connection-level flow control is controlled by the `MAX_DATA` frame;
+stream-level flow control is controlled by the `MAX_STREAM_DATA` frame.
+The `DATA_BLOCKED` and `STREAM_DATA_BLOCKED` frames defined by RFC 9000 are less
+important than they first appear, as peers are not allowed to rely on them. (For
+example, a peer is not allowed to wait until we send `DATA_BLOCKED` to increase
+our connection-level credit, and a conformant QUIC implementation can choose to
+never generate either of these frame types.) Rather, these frames serve two
+purposes: to enhance flow control performance, and to act as a debugging aid.
+Their implementation is not critical.
+Note that it follows from the above that the CRYPTO-frame stream is not subject
+to flow control.
+Note that flow control and congestion control are completely separate
+mechanisms. In a given circumstance, either or both mechanisms may restrict our
+ability to transmit application data.
+Consider the following diagram:
+ | | | | |
+ | |<-- credit| -->| |
+ | <-|- threshold -|----->| |
+ ----------------->
+ window size
+We introduce the following terminology:
+- **Controlled bytes** refers to any byte which counts for purposes of flow
+ control. A controlled byte is any byte of application data in a STREAM frame
+ payload, the first time it is sent (retransmissions do not count).
+- (RX side only) **Retirement**, which refers to where we dequeue one or more
+ controlled bytes from a QUIC stream and hand them to the application, meaning
+ we are no longer responsible for them.
+ Retirement is an important factor in our RX flow control design, as we want
+ peers to transmit not just at the rate that our QUIC implementation can
+ process incoming data, but also at a rate the application can handle.
+- (RX side only) The **Retired Watermark** (RWM), the total number of retired
+ controlled bytes since the beginning of the connection or stream.
+- The **Spent Watermark** (SWM), which is the number of controlled bytes we have
+ sent (for the TX side) or received (for the RX side). This represents the
+ amount of flow control budget which has been spent. It is a monotonic value
+ and never decreases. On the RX side, such bytes have not necessarily been
+ retired yet.
+- The **Credit Watermark** (CWM), which is the number of bytes which have
+ been authorized for transmission so far. This count is a cumulative count
+ since the start of the connection or stream and thus is also monotonic.
+- The available **credit**, which is always simply the CWM minus the SWM.
+- (RX side only) The **threshold**, which is how close we let the RWM
+ get to the CWM before we choose to extend the peer more credit by bumping the
+ CWM. The threshold is relative to (i.e., subtracted from) the CWM.
+- (RX side only) The **window size**, which is the amount by which we or a peer
+ choose to bump the CWM each time, as we reach or exceed the threshold. The new
+ CWM is calculated as the SWM plus the window size (note that it is added to the
+ SWM, not the old CWM).
+Note that:
+- If the available credit is zero, the TX side is blocked due to a lack of
+ credit.
+- If any circumstance occurs which would cause the SWM to exceed the CWM,
+ a flow control protocol violation has occurred and the connection
+ should be terminated.
+Connection-Level Flow Control - TX Side
+TX side flow control is exceptionally simple. It can be modelled as the
+following state machine:
+ ---> event: On TX (numBytes)
+ ---> event: On TX Window Updated (numBytes)
+ <--- event: On TX Blocked
+ Get TX Window() -> numBytes
+The On TX event is passed to the state machine whenever we send a packet.
+`numBytes` is the total number of controlled bytes we sent in the packet (i.e.,
+the number of bytes of STREAM frame payload which are not retransmissions). This
+value is added to the TX-side SWM value. Note that this may be zero, though
+there is no need to pass the event in this case.
+The On TX Window Updated event is passed to the state machine whenever we have
+our CWM increased. In other words, it is passed whenever we receive a `MAX_DATA`
+frame, with the integer value contained in that frame (or when we receive the
+`initial_max_data` transport parameter).
+The On TX Window Updated event expresses the CWM (that is, the cumulative
+number of controlled bytes we are allowed to send since the start of the
+connection), thus it is monotonic and may never regress. If an On TX Window
+Updated event is passed to the state machine with a value lower than that passed
+in any previous such event, it indicates a peer protocol error or a local
+programming error.
+The Get TX Window function returns our credit value (that is, it returns the
+number of controlled bytes we are allowed to send). This value is reduced by the
+On TX event and increased by the On TX Window Updated event. In fact, it is
+simply the difference between the last On TX Window Updated value and the sum of
+the `numBytes` arguments of all On TX events so far; it is that simple.
+The On TX Blocked event is emitted at the time of any edge transition where the
+value which would be returned by the Get TX Window function changes from
+non-zero to zero. This always occurs during processing of an On TX event. (This
+event is intended to assist in deciding when to generate `DATA_BLOCKED` frames.)
+We must not exceed the flow control limits, else the peer may terminate the
+connection with an error.
+An initial connection-level credit is communicated by the peer in the
+`initial_max_data` transport parameter. All other credits occur as a result of a
+`MAX_DATA` frame.
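
The TX-side state machine described above can be sketched roughly as follows. This is an illustration only: the class and method names are hypothetical, and reporting a regressing window update via a `False` return value (rather than raising an error) is an assumption of this sketch.

```python
class TxFlowController:
    """Hypothetical sketch of a TX-side flow controller."""

    def __init__(self, on_tx_blocked=None):
        self.swm = 0  # Spent Watermark: controlled bytes sent so far
        self.cwm = 0  # Credit Watermark: cumulative bytes authorized by peer
        self.on_tx_blocked = on_tx_blocked

    def get_tx_window(self):
        """Return the available credit (CWM minus SWM)."""
        return self.cwm - self.swm

    def on_tx(self, num_bytes):
        """Record that num_bytes controlled bytes were sent in a packet."""
        if num_bytes > self.get_tx_window():
            # Sending this much would violate the peer's limit.
            raise ValueError("flow control credit exceeded")
        had_credit = self.get_tx_window() > 0
        self.swm += num_bytes
        # Edge transition: credit went from non-zero to zero.
        if had_credit and self.get_tx_window() == 0 and self.on_tx_blocked:
            self.on_tx_blocked()

    def on_tx_window_updated(self, new_cwm):
        """Apply a new CWM from initial_max_data or a MAX_DATA frame.

        Returns False, leaving the CWM unchanged, if the value regresses.
        """
        if new_cwm < self.cwm:
            return False
        self.cwm = new_cwm
        return True


blocked_events = []
fc = TxFlowController(on_tx_blocked=lambda: blocked_events.append("blocked"))
fc.on_tx_window_updated(1000)  # e.g. from the initial_max_data parameter
fc.on_tx(400)
fc.on_tx(600)                  # credit reaches zero, On TX Blocked fires
fc.on_tx_window_updated(1500)  # e.g. from a MAX_DATA frame
```

Because both watermarks are cumulative counts, the controller needs no per-packet bookkeeping beyond two integers.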
+Stream-Level Flow Control - TX Side
+Stream-level flow control works exactly the same as connection-level flow
+control for the TX side.
+The On TX Window Updated event occurs in response to the `MAX_STREAM_DATA`
+frame, or based on the relevant transport parameter
+(`initial_max_stream_data_bidi_local`, `initial_max_stream_data_bidi_remote`,
+The On TX Blocked event can be used to decide when to generate
+Note that the number of controlled bytes we can send in a stream is limited by
+both connection and stream-level flow control; thus the number of controlled
+bytes we can send is the lesser of the values returned by the Get TX Window
+function on the connection-level and stream-level state machines.
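
To illustrate how the two limits combine, here is a small sketch. The helper names and the dictionary representation of each controller's state are assumptions made for the example only.

```python
def tx_window(cwm, swm):
    """Available credit for one flow controller: CWM minus SWM."""
    return cwm - swm


def sendable_on_stream(conn, stream):
    """Controlled bytes that may be sent on a stream right now.

    The stream is limited by whichever of the connection-level and
    stream-level credits is smaller.
    """
    return min(tx_window(**conn), tx_window(**stream))


# The stream itself has 400 bytes of credit left, but the connection as a
# whole has only 100, so the connection-level limit is the binding one.
conn = {"cwm": 1000, "swm": 900}
stream = {"cwm": 500, "swm": 100}
allowed = sendable_on_stream(conn, stream)
```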
+Connection-Level Flow Control - RX Side
+ ---> event: On RX Controlled Bytes (numBytes) [internal event]
+ ---> event: On Retire Controlled Bytes (numBytes)
+ <--- event: Increase Window (numBytes)
+ <--- event: Flow Control Error
+RX side connection-level flow control provides an indication of when to generate
+`MAX_DATA` frames to bump the peer's connection-level transmission credit. It is
+somewhat more involved than the TX side.
+The state machine receives On RX Controlled Bytes events from stream-level flow
+controllers. Callers do not pass the event themselves. The event is generated by
+a stream-level flow controller whenever we receive any controlled bytes.
+`numBytes` is the number of controlled bytes we received. (This event is
+generated by stream-level flow control as retransmitted stream data must be
+counted only once, and the stream-level flow control is therefore in the best
+position to determine how many controlled bytes (i.e., new, non-retransmitted
+stream payload bytes) have been received).
+If we receive more controlled bytes than we authorized, the state machine emits
+the Flow Control Error event. The connection should be terminated with a
+protocol error in this case.
+The state machine emits the Increase Window event when it thinks that the peer
+should be advanced more flow control credit (i.e., when the CWM should be
+bumped). `numBytes` is the new CWM value, and is monotonic with regard to all
+previous Increase Window events emitted by the state machine.
+The state machine is passed the On Retire Controlled Bytes event when one or
+more controlled bytes are dequeued from any stream and passed to the
+application.
+The state machine uses the cadence of the On Retire Controlled Bytes events it
+receives to determine when to increase the flow control window. Thus, the On
+Retire Controlled Bytes event should be sent to the state machine when
+processing of the received controlled bytes has been *completed* (i.e., passed
+to the application).
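
The RX-side connection-level behaviour can be sketched as below. This is a simplified illustration under stated assumptions: the class name, the callback parameters, the fixed window size and threshold, and the choice to use the window size as the initial CWM are all hypothetical. The new CWM is computed as the SWM plus the window size, following the terminology given earlier.

```python
class RxConnFlowController:
    """Hypothetical sketch of an RX-side connection-level flow controller."""

    def __init__(self, window_size, threshold, on_increase_window, on_error):
        self.window_size = window_size
        self.threshold = threshold
        self.cwm = window_size  # assumed initial credit advertised to the peer
        self.swm = 0            # controlled bytes received so far
        self.rwm = 0            # controlled bytes retired so far
        self.on_increase_window = on_increase_window
        self.on_error = on_error

    def on_rx_controlled_bytes(self, num_bytes):
        """Internal event, generated by stream-level flow controllers."""
        self.swm += num_bytes
        if self.swm > self.cwm:
            # The peer sent more than we authorized.
            self.on_error()

    def on_retire_controlled_bytes(self, num_bytes):
        """Called once bytes have been handed to the application."""
        if self.rwm + num_bytes > self.swm:
            raise ValueError("cannot retire bytes that were never received")
        self.rwm += num_bytes
        # Extend credit once the RWM gets within `threshold` of the CWM.
        if self.rwm >= self.cwm - self.threshold:
            new_cwm = self.swm + self.window_size
            if new_cwm > self.cwm:  # keep the CWM monotonic
                self.cwm = new_cwm
                self.on_increase_window(new_cwm)


increases, errors = [], []
rx = RxConnFlowController(
    window_size=1000,
    threshold=500,
    on_increase_window=increases.append,
    on_error=lambda: errors.append("flow control error"),
)
rx.on_rx_controlled_bytes(600)
rx.on_retire_controlled_bytes(400)  # RWM is 400, below CWM - threshold
rx.on_retire_controlled_bytes(200)  # RWM is 600, so the CWM is bumped
```

In this sketch the value passed to `on_increase_window` is what would be carried in the next `MAX_DATA` frame.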
+Stream-Level Flow Control - RX Side
+RX-side stream-level flow control works similarly to RX-side connection-level
+flow control. There are a few differences:
+- There is no On RX Controlled Bytes event.
+- The On Retire Controlled Bytes event may optionally pass the same event
+ to a connection-level flow controller (an implementation decision), as these
+ events should always occur at the same time.
+- An additional event is added, which replaces the On RX Controlled Bytes event:
+ ---> event: On RX Stream Frame (offsetPlusLength, isFin)
+ This event should be passed to the state machine when a STREAM frame is
+ received. The `offsetPlusLength` argument is the sum of the offset field of
+ the STREAM frame and the length of the frame's payload in bytes. The isFin
+ argument should specify whether the STREAM frame had the FIN flag set.
+ This event is used to generate the internal On RX Controlled Bytes event to
+ the connection-level flow controller. It is also used by stream-level flow
+ control to determine if flow control limits are violated by the peer.
+ The state machine handles `offsetPlusLength` monotonically and ignores the
+ event if a previous such event already had an equal or greater value. The
+ reason this event is used instead of a `On RX (numBytes)` style event is that
+ this API can be monotonic and thus easier to use (the caller does not need to
+ remember if they have already counted a specific controlled byte in a STREAM
+ frame, which may after all duplicate some of the controlled bytes in a
+ previous STREAM frame).
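
A rough sketch of the monotonic handling of `offsetPlusLength` follows. The class name and callback are hypothetical; `is_fin` is accepted only to mirror the event signature and is not otherwise used here, and bumping the stream-level CWM is omitted for brevity.

```python
class RxStreamFlowController:
    """Hypothetical sketch of the RX-side stream-level frame handling."""

    def __init__(self, initial_cwm, on_conn_rx_controlled_bytes):
        self.cwm = initial_cwm  # credit we have extended to the peer
        self.swm = 0            # highest offset + length seen so far
        self.flow_control_error = False
        self.on_conn_rx_controlled_bytes = on_conn_rx_controlled_bytes

    def on_rx_stream_frame(self, offset_plus_length, is_fin=False):
        """Handle a received STREAM frame; return the new controlled bytes."""
        if offset_plus_length <= self.swm:
            # Duplicate or older data: nothing new to count.
            return 0
        delta = offset_plus_length - self.swm
        self.swm = offset_plus_length
        if self.swm > self.cwm:
            # The peer exceeded the stream-level limit.
            self.flow_control_error = True
        # Forward only the newly counted bytes to the connection level.
        self.on_conn_rx_controlled_bytes(delta)
        return delta


conn_events = []
sfc = RxStreamFlowController(1000, conn_events.append)
sfc.on_rx_stream_frame(100)  # bytes 0..99 arrive
sfc.on_rx_stream_frame(50)   # retransmission of earlier data, ignored
sfc.on_rx_stream_frame(300)  # 200 new controlled bytes
```

The caller never has to track which bytes were already counted; it simply reports each frame's end offset.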
+RX Window Sizing
+For RX flow control we must determine our window size. This is the value we add
+to the peer's current SWM to determine the new CWM each time as RWM reaches the
+threshold. The window size should be adapted dynamically according to network
+conditions.
+Many implementations choose to have a mechanism for increasing the window size
+but not decreasing it, a simple approach which we adopt here.
+The common algorithm is a so-called auto-tuning approach in which the rate of
+window consumption (i.e., the rate at which RWM approaches CWM after CWM is
+bumped) is measured and compared to the measured connection RTT. If the time it
+takes to consume one window size is less than a fixed multiple of the RTT (that
+is, the window is being used up quickly enough that it may be limiting
+throughput), the window size is doubled, up to an implementation-chosen maximum
+window size.
+Auto-tuning occurs in 'epochs'. At the end of each auto-tuning epoch, a decision
+is made on whether to double the window size, and a new auto-tuning epoch is
+started.
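
The epoch-based decision can be sketched as follows. This sketch assumes the window is doubled when a full window is consumed in less than a fixed multiple of the RTT, on the reasoning that fast consumption suggests the window is limiting throughput. The class name, the `rtt_multiple` default of 4, and the use of plain floating-point seconds are all assumptions for illustration.

```python
class WindowAutoTuner:
    """Hypothetical sketch of grow-only RX window auto-tuning."""

    def __init__(self, initial_window, max_window, now, rtt_multiple=4):
        self.window_size = initial_window
        self.max_window = max_window
        self.rtt_multiple = rtt_multiple
        self.epoch_start = now  # time the current epoch began, in seconds

    def end_epoch(self, now, rtt):
        """End the current epoch after one window has been consumed.

        Doubles the window (up to max_window) if it was consumed faster
        than rtt_multiple * rtt, then starts a new epoch.
        """
        elapsed = now - self.epoch_start
        if elapsed < self.rtt_multiple * rtt:
            self.window_size = min(self.window_size * 2, self.max_window)
        self.epoch_start = now
        return self.window_size


tuner = WindowAutoTuner(initial_window=1000, max_window=3000, now=0.0)
w1 = tuner.end_epoch(now=0.1, rtt=0.05)   # consumed in 2 RTTs: doubled
w2 = tuner.end_epoch(now=1.0, rtt=0.05)   # consumed in 18 RTTs: unchanged
w3 = tuner.end_epoch(now=1.05, rtt=0.05)  # consumed in 1 RTT: capped at max
```

The window never shrinks in this sketch, matching the grow-only approach adopted in the text.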
+For more information on auto-tuning, see [Flow control in
+and [QUIC Flow