 proposals/324-rtt-congestion-control.txt | 1097 ++++++++++++++++++++------
 1 file changed, 858 insertions(+), 239 deletions(-)
diff --git a/proposals/324-rtt-congestion-control.txt b/proposals/324-rtt-congestion-control.txt
index f4397b6..78b6789 100644
--- a/proposals/324-rtt-congestion-control.txt
+++ b/proposals/324-rtt-congestion-control.txt
@@ -51,6 +51,9 @@ be found at:
An exhaustive list of citations for further reading is in Section
[CITATIONS].
+A glossary of common congestion control acronyms and terminology is in
+Section [GLOSSARY].
+
1. Overview [OVERVIEW]
@@ -115,15 +118,14 @@ RELAY_COMMAND_SENDME relay cells every CIRCWINDOW_INCREMENT (100) cells
of received RELAY_COMMAND_DATA.
This allows those endpoints to measure the current circuit RTT, by
-measuring the amount of time between sending of every 100th data cell
-and the arrival of the SENDME command that the other endpoint
-immediately sends to ack that 100th cell.
-
-Circuits will record the current RTT measurement as a field in their
-circuit_t data structure. The current RTT is N-EWMA smoothed[27] over a
-'cc_ewma_cwnd_cnt' multiple of congestion window worth of sendme acks.
+measuring the amount of time between sending a RELAY_COMMAND_DATA cell
+that would trigger a SENDME from the other endpoint, and the arrival of
+that SENDME cell. This means that RTT is measured every 'cc_sendme_inc'
+data cells.
-Circuits will also record the minimum and maximum RTT seen so far.
+Circuits will record the minimum and maximum RTT measurement, as well as
+a smoothed value representing the current RTT. The smoothing for the
+current RTT is performed as specified in [N_EWMA_SMOOTHING].
Algorithms that make use of this RTT measurement for congestion
window update are specified in [CONTROL_ALGORITHMS].
@@ -146,11 +148,39 @@ If the time delta is 0, that is always treated as a clock stall.
If we have measured at least 'cc_bwe_min' RTT values or we have successfully
exited slow start, then every sendme ACK, the new candidate RTT is compared to
-the stored EWMA RTT. If the new RTT is either 100 times larger than the EWMA
-RTT, or 100 times smaller than the stored EWMA RTT, then we do not record that
+the stored EWMA RTT. If the new RTT is either 5000 times larger than the EWMA
+RTT, or 5000 times smaller than the stored EWMA RTT, then we do not record that
estimate, and do not update BDP or the congestion control algorithms for that
SENDME ack.
+Moreover, if a clock stall is detected by *any* circuit, this fact is
+cached, and this cached value is used on circuits for which we do not
+have enough data to compute the above heuristics. This cached value is
+also exported for use by the edge connection rate calculations done by
+[XON_ADVISORY].
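+
+The following is a minimal, non-normative sketch of these checks in Python.
+The function name and arguments are illustrative only; 5000 is the outlier
+factor described above, and a zero time delta is treated as a clock stall:
+
+  def rtt_sample_usable(new_rtt, ewma_rtt, have_enough_measurements):
+      if new_rtt <= 0:
+          return False  # Zero delta: treat as a clock stall.
+      if not have_enough_measurements:
+          return True   # Too little data yet to judge outliers.
+      if new_rtt > ewma_rtt * 5000 or new_rtt * 5000 < ewma_rtt:
+          return False  # Outlier: skip BDP and cc updates for this ack.
+      return True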
+
+2.1.2. N_EWMA Smoothing [N_EWMA_SMOOTHING]
+
+Both RTT estimation and SENDME BDP estimation require smoothing, to
+reduce the effects of packet jitter.
+
+This smoothing is performed using N_EWMA[27], which is an Exponential
+Moving Average with alpha = 2/(N+1):
+
+ N_EWMA = BDP*2/(N+1) + N_EWMA_prev*(N-1)/(N+1).
+
+Flow control rate limiting also uses this smoothing function, to smooth the
+edge connection drain rate advertised in XON cells (see [XON_ADVISORY]).
+
+For both RTT and SENDME BDP estimation, N is the number of SENDME acks
+between congestion window updates, scaled by the percentage given by
+consensus parameter 'cc_ewma_cwnd_pct', and then capped at a max of
+'cc_ewma_max', but always at least 2:
+
+ N = MAX(MIN(CWND_UPDATE_RATE(cc)*cc_ewma_cwnd_pct/100, cc_ewma_max), 2);
+
+CWND_UPDATE_RATE is normally just round(CWND/cc_sendme_inc), but after
+slow start, it is round(CWND/(cc_cwnd_inc_rate*cc_sendme_inc)).
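+
+Below is a minimal, non-normative sketch of this smoothing in Python. The
+helper names and the parameter dictionary are illustrative and do not come
+from the C-Tor implementation:
+
+  def cwnd_update_rate(cwnd, in_slow_start, p):
+      # Number of SENDME acks between congestion window updates.
+      if in_slow_start:
+          return round(cwnd / p["cc_sendme_inc"])
+      return round(cwnd / (p["cc_cwnd_inc_rate"] * p["cc_sendme_inc"]))
+
+  def n_ewma_n(cwnd, in_slow_start, p):
+      # N = MAX(MIN(CWND_UPDATE_RATE*cc_ewma_cwnd_pct/100, cc_ewma_max), 2)
+      rate = cwnd_update_rate(cwnd, in_slow_start, p)
+      n = rate * p["cc_ewma_cwnd_pct"] // 100
+      return max(min(n, p["cc_ewma_max"]), 2)
+
+  def n_ewma_update(prev_ewma, sample, n):
+      # Exponential moving average with alpha = 2/(N+1).
+      return (sample * 2 + prev_ewma * (n - 1)) / (n + 1)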
+
2.2. SENDME behavior changes
We will make four major changes to SENDME behavior to aid in computing
@@ -168,19 +198,9 @@ congestion, since the RTT will be measured more often. If
experimentation in Shadow shows that more frequent SENDMEs reduce
congestion and improve performance but add significant overhead, we can
reduce SENDME overhead by allowing SENDME cells to carry stream data, as
-well, using Proposal 325.
-
- TODO: Refer to or include final version of the negotiation proposal for
- this: https://gitlab.torproject.org/tpo/core/tor/-/issues/40377
-
-Second, all end-to-end relay cells except RELAY_COMMAND_DROP and SENDME
-itself will count towards SENDME cell counts. The details behind how these
-cells are handled is addressed in section [SYSTEM_INTERACTIONS].
-
- XXX: Hrmm, this is not strictly necessary for this proposal to function.
- It may make conflux easier though, but we can defer it to then. The
- current implementation only counts RELAY_COMMAND_DATA towards sendme
- acks, which is the same behavior as the old fixed sendme implementation.
+well, using Proposal 325. The method for negotiating a common value of
+cc_sendme_inc on a circuit is covered in [ONION_NEGOTIATION] and
+[EXIT_NEGOTIATION].
Third, authenticated SENDMEs can remain as-is in terms of protocol
behavior, but will require some implementation updates to account for
@@ -297,23 +317,11 @@ BDP.
We specify all three because they are all useful in various cases. These cases
are broken up and combined to form the Piecewise BDP estimator.
-The SENDME estimator is smoothed through N-EWMA averaging on these
-estimators[27], to smooth out temporary effects. This is accomplished by using
-an EWMA function with alpha = 2/(N+1):
-
- N_EWMA = BDP*2/(N+1) + N_EWMA_prev*(N-1)/(N+1).
-
-N is specified in terms of the number of current congestion windows worth of
-SENDMEs to average, via consensus parameter 'cc_ewma_cwnd_cnt'.
-
-The cwnd and inflight estimators also use N-EWMA smoothing in the same way,
-but only smooth the current RTT estimate, not the congestion window value.
-
3.1.1. SENDME arrival BDP estimation
-The most accurate way to estimate BDP is to measure the amount of time between
-SENDME acks. In this period of time, we know that the endpoint successfully
-received 'cc_sendme_inc' cells.
+It is possible to directly measure BDP via the amount of time between SENDME
+acks. In this period of time, we know that the endpoint successfully received
+'cc_sendme_inc' cells.
This means that the bandwidth of the circuit is then calculated as:
@@ -337,7 +345,7 @@ With this, the calculation becomes:
Note that because we are counting the number of cells *between* the first
and last sendme of the congestion window, we must subtract 1 from the number
-of sendmes actually recieved. Over the time period between the first and last
+of sendmes actually received. Over the time period between the first and last
sendme of the congestion window, the other endpoint successfully read
(num_sendmes-1) * cc_sendme_inc cells.
@@ -346,17 +354,22 @@ truncation, we compute the BDP using multiplication first:
BDP = (num_sendmes-1) * cc_sendme_inc * RTT_min / num_sendme_timestamp_delta
-Note that the SENDME BDP estimation will only work after two (2) SENDME acks
-have been received. Additionally, it tends not to be stable unless at least
-five (5) num_sendme's are used, due to ack compression. This is controlled by
-the 'cc_bwe_min' consensus parameter. Finally, if [CLOCK_HEURISTICS] have
-detected a clock jump or stall, this estimator is not updated.
+After all of this, the BDP is smoothed using [N_EWMA_SMOOTHING].
+
+This smoothing means that the SENDME BDP estimation will only work after two
+(2) SENDME acks have been received. Additionally, it tends not to be stable
+unless at least 'cc_bwe_min' SENDME acks are used, as controlled by the
+'cc_bwe_min' consensus parameter. Finally, if [CLOCK_HEURISTICS] have detected
+a clock jump or stall, this estimator is not updated.
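+
+A minimal, non-normative sketch of this computation follows, in Python. The
+argument names are illustrative; the timestamps and RTT_min must be in the
+same time units, and the result is in cells, before the [N_EWMA_SMOOTHING]
+step described above:
+
+  def sendme_bdp(num_sendmes, cc_sendme_inc, rtt_min,
+                 first_sendme_ts, last_sendme_ts):
+      delta = last_sendme_ts - first_sendme_ts
+      if num_sendmes < 2 or delta <= 0:
+          return None  # Not enough SENDME acks yet to estimate BDP.
+      # Multiply before dividing to avoid integer truncation.
+      return (num_sendmes - 1) * cc_sendme_inc * rtt_min // delta
+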
If all edge connections no longer have data available to send on a circuit
and all circuit queues have drained without blocking the local orconn, we stop
updating this BDP estimate and discard old timestamps. However, we retain the
actual estimator value.
+Unfortunately, even after all of this, SENDME BDP estimation proved unreliable
+in Shadow simulation, due to ack compression.
+
3.1.2. Congestion Window BDP Estimation
Assuming that the current congestion window is at or above the current BDP,
@@ -425,8 +438,9 @@ and RTT_current. These are measured using the time delta between every
measured arbitrarily, so long as it is larger than what we would get from
SENDME.
-RTT_current is N-EWMA smoothed over 'cc_ewma_cwnd_cnt' congestion windows
-worth of SENDME acks.
+RTT_current is N-EWMA smoothed over 'cc_ewma_cwnd_pct' percent of
+congestion windows worth of SENDME acks, up to a max of 'cc_ewma_max' acks, as
+described in [N_EWMA_SMOOTHING].
Recall that BOOTLEG_RTT_TOR emits a congestion signal when the current
RTT falls below some fractional threshold ('cc_westwood_rtt_thresh') fraction
@@ -447,9 +461,10 @@ each ack. Congestion signals are only evaluated when it reaches 0.
Note that because the congestion signal threshold of TOR_WESTWOOD is a
function of RTT_max, and excessive queuing can cause an increase in RTT_max,
-TOR_WESTWOOD may have a runaway condition. For this reason, we specify a
-cc_westwodd_rtt_m backoff parameter, which can reduce RTT_max by a percentage
-of the difference between RTT_min and RTT_max, in the event of congestion.
+TOR_WESTWOOD may have runaway conditions. Additionally, if stream activity is
+constant, but of a lower bandwidth than the circuit, this will not drive the
+RTT upwards, and this can result in a congestion window that continues to
+increase in the absence of any other concurrent activity.
Here is the complete congestion window algorithm for Tor Westwood. This will run
each time we get a SENDME (aka sendme_process_circuit_level()):
@@ -504,27 +519,29 @@ Thus, Vegas estimates estimates the queue use caused by congestion as:
queue_use = cwnd - BDP
-Original TCP Vegas used a cwnd BDP estimator only. However, because the SENDME
-BDP estimate is more accurate, and because Piecewise BDP allows us to back off
-more when the local OR connection is blocked, we use Piecewise BDP here.
-
-For testing, we also parameterize this queue_use calculation as a tunable
+Original TCP Vegas used a cwnd BDP estimator only. We added the ability to
+switch this BDP estimator in the implementation, and experimented with various
+options. We also parameterized this queue_use calculation as a tunable
weighted average between the cwnd-based BDP estimate and the piecewise
-estimate (consensus parameter 'cc_vegas_bdp_mix').
+estimate (consensus parameter 'cc_vegas_bdp_mix'). After much testing of
+various ways to compute BDP, we were still unable to do much better than the
+original cwnd estimator. So while this capability to change the BDP estimator
+remains in the C implementation, we do not expect it to be used.
-Additionally, if the local OR connection is blocked at the time of SENDME ack
-arrival, this is treated as an immediate congestion signal.
+However, it was useful to treat a blocked local OR connection at the time of
+SENDME ack arrival as an immediate congestion signal.
-(We can also optionally use the ECN signal described in
-ideas/xxx-backward-ecn.txt, to exit Slow Start.)
+(As an additional optimization, we could also use the ECN signal described in
+ideas/xxx-backward-ecn.txt, but this is not implemented. It is likely only of
+any benefit during Slow Start, and even that benefit is likely small.)
Congestion signals from RTT, blocked OR connections, or ECN are processed only
once per congestion window. This is achieved through the next_cc_event flag,
which is initialized to a cwnd worth of SENDME acks, and is decremented
each ack. Congestion signals are only evaluated when it reaches 0.
-Here is the complete pseudocode for TOR_VEGAS, which is run once per SENDME
-ack:
+Here is the complete pseudocode for TOR_VEGAS, which is run every time
+an endpoint receives a SENDME ack:
# Update acked cells
inflight -= cc_sendme_inc
@@ -545,18 +562,30 @@ ack:
if in_slow_start:
if queue_use < cc_vegas_gamma and not orconn_blocked:
- cwnd = MAX(cwnd * cc_cwnd_inc_pct_ss, BDP) # Exponential or BDP
+ # Increment by slow start %, or at least 2 sendme_inc's worth
+      cwnd = cwnd + MAX(cwnd * cc_cwnd_inc_pct_ss / 100, 2*cc_sendme_inc)
+ # If our BDP estimator thinks the BDP is still larger, use that
+ cwnd = MAX(cwnd, BDP)
else:
- cwnd = BDP
+ cwnd = BDP + cc_vegas_gamma
in_slow_start = 0
else:
- if queue_use > cc_vegas_beta or orconn_blocked:
- cwnd += cc_cwnd_inc
- elif queue_use < cc_vegas_alpha:
+ if queue_use > cc_vegas_delta:
+ cwnd = BDP + cc_vegas_delta - cc_cwnd_inc
+ elif queue_use > cc_vegas_beta or orconn_blocked:
cwnd -= cc_cwnd_inc
+ elif queue_use < cc_vegas_alpha:
+ cwnd += cc_cwnd_inc
cwnd = MAX(cwnd, cc_circwindow_min)
- next_cc_event = cwnd / (cc_cwnd_inc_rate * cc_sendme_inc)
+
+ # Count the number of sendme acks until next update of cwnd,
+ # rounded to nearest integer
+ if in_slow_start:
+ next_cc_event = round(cwnd / cc_sendme_inc)
+  else:
+    # cc_cwnd_inc_rate applies only in steady state, so the cwnd is never
+    # updated more often than once per cwnd during slow start.
+ next_cc_event = round(cwnd / (cc_cwnd_inc_rate * cc_sendme_inc))
3.4. Tor NOLA: Direct BDP tracker [TOR_NOLA]
@@ -604,6 +633,12 @@ following functions:
- connection_edge_package_raw_inbuf()
- circuit_resume_edge_reading()
+The decision on when a stream is blocked is performed in:
+ - sendme_note_stream_data_packaged()
+ - sendme_stream_data_received()
+ - sendme_connection_edge_consider_sending()
+ - sendme_process_stream_level()
+
Tor currently maintains separate windows for each stream on a circuit,
to provide individual stream flow control. Circuit windows are SENDME
acked as soon as a relay data cell is decrypted and recognized. Stream
@@ -637,50 +672,161 @@ either have to commit to allocating a full congestion window worth
memory for each stream, or impose a speed limit on our streams.
Hence, we will discard stream windows entirely, and instead use a
-simpler buffer-based design that uses XON/XOFF as a backstop. This will
-allow us to make full use of the circuit congestion window for every
-stream in combination, while still avoiding buffer buildup inside the
-network.
+simpler buffer-based design that uses XON/XOFF to signal when this
+buffer is too large. Additionally, the XON cell will contain advisory
+rate information based on the rate at which that edge connection can
+write data while it has data to write. The other endpoint can rate limit
+sending data for that stream to the rate advertised in the XON, to avoid
+excessive XON/XOFF chatter and sub-optimal behavior.
+
+This will allow us to make full use of the circuit congestion window for
+every stream in combination, while still avoiding buffer buildup inside
+the network.
4.1. Stream Flow Control Without Windows [WINDOWLESS_FLOW]
-Each endpoint (client, Exit, or onion service) should send circuit-level
+Each endpoint (client, Exit, or onion service) sends circuit-level
SENDME acks for all circuit cells as soon as they are decrypted and
-recognized, but *before* delivery to their edge connections. If the edge
-connection is blocked because an application is not reading, data will
-build up in a queue at that endpoint.
-
-Three consensus parameters will govern the max length of this queue:
-xoff_client, xoff_exit, and xoff_mobile. These will be used for Tor
-clients, exits, and mobile devices, respectively. These cutoffs will be
-a percentage of current 'cwnd' rather than number of cells. Something
-like 5% of 'cwnd' should be plenty, since these edge connections should
-normally drain *much* faster than Tor itself.
-
-If the length of an application stream queue exceeds the size provided
-in the appropriate consensus parameter, a RELAY_COMMAND_STREAM_XOFF will
-be sent, which instructs the other endpoint to stop sending from that
-edge connection. This XOFF cell can optionally contain any available
-stream data, as well.
-
-As soon as the queue begins to drain, a RELAY_COMMAND_STREAM_XON will
-sent, which allows the other end to resume reading on that edge
-connection. Because application streams should drain quickly once they
-are active, we will send the XON command as soon as they start draining.
-If the queues fill again, another XOFF will be sent. If this results in
-excessive XOFF/XON flapping and chatter, we will also use consensus
-parameters xon_client, xon_exit, and xon_mobile to optionally specify
-when to send an XON. These parameters will be defined in terms of cells
-below the xoff_* levels, rather than percentage. The XON cells can also
-contain stream data, if any is available.
-
-Tor's OOM killer will be invoked to close any streams whose application
-buffer grows too large, due to memory shortage, or malicious endpoints.
-
-Note that no stream buffer should ever grow larger than the xoff level
-plus 'cwnd', unless an endpoint is ignoring XOFF. So,
-'xoff_{client,exit,mobile} + cwnd' should be the hard-close stream
-cutoff, regardless of OOM killer status.
+recognized, but *before* delivery to their edge connections.
+
+This means that if the edge connection is blocked because an
+application's SOCKS connection or a destination site's TCP connection is
+not reading, data will build up in a queue at that endpoint,
+specifically in the edge connection's outbuf.
+
+Consensus parameters will govern the queue length at which XON and XOFF
+cells are sent, as well as when advisory XON cells containing rate
+information can be sent. These parameters are separate for exit queue
+lengths and for client/service queue lengths.
+
+(Because clients and services will typically have localhost connections
+for their edges, they will need similar buffering limits. Exits may have
+different properties, since their edges will be remote.)
+
+The trunnel relay cell payload definitions for XON and XOFF are:
+
+struct xoff_cell {
+ u8 version IN [0x00];
+}
+
+struct xon_cell {
+ u8 version IN [0x00];
+
+ u32 kbps_ewma;
+}
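+
+As a non-normative illustration, the xon_cell payload can be encoded and
+decoded as follows (this assumes trunnel's usual network byte order for
+integers; the helper names are not part of any implementation):
+
+  import struct
+
+  def pack_xon_cell(kbps_ewma, version=0x00):
+      # 1-byte version followed by a big-endian u32 rate field.
+      return struct.pack("!BI", version, kbps_ewma)
+
+  def unpack_xon_cell(payload):
+      version, kbps_ewma = struct.unpack("!BI", payload[:5])
+      return version, kbps_ewma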
+
+4.1.1. XON/XOFF behavior
+
+If the length of an edge outbuf queue exceeds the size provided in the
+appropriate client or exit XOFF consensus parameter, a
+RELAY_COMMAND_STREAM_XOFF will be sent, which instructs the other endpoint to
+stop sending from that edge connection.
+
+Once the queue is expected to empty, a RELAY_COMMAND_STREAM_XON will be sent,
+which allows the other end to resume reading on that edge connection. This XON
+also indicates the average rate of queue drain since the XOFF.
+
+Advisory XON cells are also sent whenever the edge connection's drain
+rate changes by more than 'cc_xon_change_pct' percent compared to
+the previously sent XON cell's value.
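+
+A minimal, non-normative sketch of the advisory XON check in Python (names
+are illustrative; rates are in the units carried by xon_cell):
+
+  def advisory_xon_needed(drain_rate, last_advertised_rate,
+                          cc_xon_change_pct):
+      if last_advertised_rate == 0:
+          return False  # Nothing advertised yet to compare against.
+      change = abs(drain_rate - last_advertised_rate) * 100
+      return change > cc_xon_change_pct * last_advertised_rate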
+
+4.1.2. Edge bandwidth rate advertisement [XON_ADVISORY]
+
+As noted above, the XON cell provides a field to indicate the N_EWMA rate at
+which edge connections drain their outgoing buffers.
+
+To compute the drain rate, we maintain a timestamp and a byte count of how many
+bytes were written onto the socket from the connection outbuf.
+
+In order to measure the drain rate of a connection, we need to measure the time
+it took between flushing N bytes on the socket and when the socket is available
+for writing again. In other words, we are measuring the time it took for the
+kernel to send N bytes between the first flush on the socket and the next
+poll() write event.
+
+For example, let's say we just wrote 100 bytes to the socket at time t = 0sec,
+and at time t = 2sec the socket becomes writable again. We then estimate that
+the rate of the socket is 100 bytes / 2sec, thus 50B/sec.
+
+To make this measurement, we start the timer by recording a timestamp as soon
+as data begins to accumulate in an edge connection's outbuf, currently 16KB (32
+cells). We use this value for now because Tor writes up to 32 cells at once to
+a connection outbuf, and so we treat this burst of data as an indicator that
+bytes are starting to accumulate.
+
+After 'cc_xon_rate' cells worth of stream data, we use N_EWMA to average this
+rate into a running EWMA average, with N specified by consensus parameter
+'cc_xon_ewma_cnt'. Every EWMA update, the byte count is set to 0 and a new
+timestamp is recorded. In this way, the EWMA counter is averaging N counts of
+'cc_xon_rate' cells worth of bytes each.
+
+If the buffers are non-zero, and we have sent an XON before, and the N_EWMA
+rate has changed more than 'cc_xon_change_pct' since the last XON, we send an
+updated rate. Because the EWMA rate is only updated every 'cc_xon_rate' cells
+worth of bytes, such advisory XON updates cannot be sent more frequently than
+this, and should be sent much less often in practice.
+
+When the outbuf completely drains to 0, and has been 0 for 'cc_xon_rate' cells
+worth of data, we double the EWMA rate. We continue to double it while the
+outbuf is 0, every 'cc_xon_rate' cells. The measurement timestamp is also set
+back to 0.
+
+When an XOFF is sent, the EWMA rate is reset to 0, to allow fresh calculation
+upon drain.
+
+If a clock stall or jump is detected by [CLOCK_HEURISTICS], we also
+clear these fields, but do not record them in the EWMA.
+
+NOTE: Because our timestamps are microseconds, we chose to compute and
+transmit both of these rates as 1000 byte/sec units, as this reduces the
+number of multiplications and divisions and avoids precision loss.
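+
+The following Python sketch summarizes this measurement, under the
+assumptions stated above (timestamps in microseconds, rates in 1000 byte/sec
+units). The class and method names are illustrative and do not come from the
+C-Tor implementation:
+
+  class DrainRateEstimator:
+      def __init__(self, cc_xon_ewma_cnt):
+          self.n = cc_xon_ewma_cnt   # N for the N_EWMA of rates
+          self.start_usec = 0        # set when bytes begin to accumulate
+          self.bytes_flushed = 0     # bytes written to the socket since then
+          self.ewma_rate = 0         # smoothed rate, 1000 byte/sec units
+
+      def note_flush(self, now_usec, nbytes):
+          if self.start_usec == 0:
+              self.start_usec = now_usec
+          self.bytes_flushed += nbytes
+
+      def update_ewma(self, now_usec):
+          delta = now_usec - self.start_usec
+          if delta <= 0:
+              return self.ewma_rate
+          # (bytes * 10^6 / usec) / 1000 == bytes * 1000 / usec
+          rate = self.bytes_flushed * 1000 // delta
+          if self.ewma_rate == 0:
+              self.ewma_rate = rate
+          else:
+              self.ewma_rate = (rate * 2 +
+                                self.ewma_rate * (self.n - 1)) // (self.n + 1)
+          # Reset the counters for the next cc_xon_rate cells worth of bytes.
+          self.start_usec = 0
+          self.bytes_flushed = 0
+          return self.ewma_rate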
+
+4.1.3. Oomkiller behavior
+
+A malicious client can attempt to exhaust memory in an Exit's outbufs, by
+ignoring XOFF and advisory XONs. Implementations MAY choose to close specific
+streams with outbufs that grow too large, but since the exit does not know
+with certainty the client's congestion window, it is non-trivial to determine
+the exact upper limit a well-behaved client might send on a blocked stream.
+
+Implementations MUST close the streams with the oldest chunks present in their
+outbufs, while under global memory pressure, until memory pressure is
+relieved.
+
+4.1.4. Sidechannel mitigation
+
+In order to mitigate DropMark attacks[28], both XOFF and advisory XON
+transmission must be restricted. Because DropMark attacks are most severe
+before data is sent, clients MUST ensure that an XOFF does not arrive before
+they have sent the appropriate XOFF limit of bytes on a stream ('cc_xoff_exit'
+for exits, 'cc_xoff_client' for onions).
+
+Clients also SHOULD ensure that advisory XONs do not arrive before the
+minimum of the XOFF limit and 'cc_xon_rate' full cells worth of bytes have
+been transmitted.
+
+Clients SHOULD ensure that advisory XONs do not arrive more frequently than
+every 'cc_xon_rate' cells worth of sent data. Clients also SHOULD ensure that
+XOFFs do not arrive more frequently than every XOFF limit worth of sent data.
+
+Implementations SHOULD close the circuit if these limits are violated on the
+client-side, to detect and resist dropmark attacks[28].
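+
+A minimal, non-normative sketch of the client-side XOFF accounting in Python
+(a RELAY_PAYLOAD_SIZE of 498 bytes is assumed for converting the cell-based
+XOFF limit to bytes; the names are illustrative):
+
+  RELAY_PAYLOAD_SIZE = 498  # stream data bytes per relay cell
+
+  def xoff_is_valid(stream_bytes_sent, xoff_limit_cells, xoffs_received):
+      # Each XOFF must be preceded by at least one additional XOFF limit
+      # worth of sent stream data.
+      needed = (xoffs_received + 1) * xoff_limit_cells * RELAY_PAYLOAD_SIZE
+      return stream_bytes_sent >= needed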
+
+Additionally, because edges no longer use stream SENDME windows, we alter the
+half-closed connection handling to be time based instead of data quantity
+based. Half-closed connections are allowed to receive data up to the larger
+value of the congestion control max_rtt field or the circuit build timeout
+(for onion service circuits, we use twice the circuit build timeout). Any data
+or relay cells after this point are considered invalid data on the circuit.
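+
+A small, non-normative helper illustrating this time limit (the exact
+combination of max_rtt and the circuit build timeout for onion service
+circuits is an interpretation of the text above; times are in msec):
+
+  def half_closed_window_ms(cc_max_rtt_ms, cbt_ms, is_onion_service):
+      cbt = 2 * cbt_ms if is_onion_service else cbt_ms
+      return max(cc_max_rtt_ms, cbt)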
+
+Recall that all of the dropped cell enforcement in C-Tor is performed by
+accounting data provided through the control port CIRC_BW fields, currently
+enforced only by using the vanguards addon[29].
+
+The C-Tor implementation exposes all of these properties to CIRC_BW for
+vanguards to enforce, but does not enforce them itself. So violations of any
+of these limits do not cause circuit closure unless that addon is used (as
+with the rest of the dropped cell side channel handling in C-Tor).
5. System Interactions [SYSTEM_INTERACTIONS]
@@ -788,15 +934,6 @@ delivery of SENDMEs will still allow a full congestion window full of
data to arrive. This will also require tuning and experimentation, and
optimum results will vary between simulator and live network.
- TODO: Can we use explicit SENDME sequence number acking to make a
- connection-resumption conflux, to recover from circuit collapse
- or client migration? I am having trouble coming up with a design
- that does not require Exits to maintain a full congestion
- window full of data as a retransmit buffer in the event of
- circuit close. Such reconnect activity might require assistance
- from Guard relays so that they can help clients discover which
- cells were sent vs lost.
-
6. Performance Evaluation [EVALUATION]
@@ -810,13 +947,23 @@ determine how stable the RTT measurements of circuits are across the
live Tor network, to determine if we need more frequent SENDMEs, and/or
need to use any RTT smoothing or averaging.
-
These experiments were performed using onion service clients and services on
the live Tor network. From these experiments, we tuned the RTT and BDP
estimators, and arrived at reasonable values for EWMA smoothing and the
minimum number of SENDME acks required to estimate BDP.
-We should check small variations in the EWMA smoothing and minimum BDP ack
+Additionally, we specified that the algorithms maintain previous congestion
+window estimates in the event that a circuit goes idle, rather than revert to
+slow start. We experimented with intermittent idle/active live onion clients
+to make sure that this behavior is acceptable, and it appeared to be.
+
+In Shadow experimentation, the primary thing to test will be if the OR conn on
+Exit relays blocks too frequently when under load, thus causing excessive
+congestion signals, and overuse of the Inflight BDP estimator as opposed
+to SENDME or CWND BDP. It may also be the case that this behavior is optimal,
+even if it does happen.
+
+Finally, we should check small variations in the EWMA smoothing and minimum BDP ack
counts in Shadow experimentation, to check for high variability in these
estimates, and other surprises.
@@ -847,44 +994,93 @@ parameter for them. This will allow us to set more reasonable parameter
values, without waiting for all clients to upgrade.
Because custom congestion control can be deployed by any Exit or onion
-service that desires better service, we will need to be particularly
-careful about how congestion control algorithms interact with rogue
-implementations that more aggressively increase their window sizes.
-During these adversarial-style experiments, we must verify that cheaters
-do not get better service, and that Tor's circuit OOM killer properly
-closes circuits that seriously abuse the congestion control algorithm,
-as per [SECURITY_ANALYSIS]. This may requiring tuning CircuitEWMAHalfLife,
-as well as the oomkiller cutoffs.
-
-Additionally, we specified that the algorithms maintain previous congestion
-window estimates in the event that a circuit goes idle, rather than revert
-to slow start. We should experiment with intermittent idle/active clients
-to make sure that this behavior is acceptable.
+service that desires better service, we will need to be particularly careful
+about how congestion control algorithms interact with rogue implementations
+that more aggressively increase their window sizes. During these
+adversarial-style experiments, we must verify that cheaters do not get
+better service, and that Tor's circuit OOM killer properly closes circuits
+that seriously abuse the congestion control algorithm, as per
+[SECURITY_ANALYSIS]. This may require tuning 'circ_max_cell_queue_size'
+and 'CircuitPriorityHalflifeMsec'.
+
+Additionally, we will need to experiment with reducing the cell queue limits
+on OR conns before they are blocked (OR_CONN_HIGHWATER), and study the
+interaction of that with treating the or conn block as a congestion signal.
+
+Finally, we will need to monitor our Shadow experiments for evidence of ack
+compression, which can cause the BDP estimator to over-estimate the congestion
+window. We will instrument our Shadow simulations to alert if they discover
+excessive congestion window values, and tweak 'cc_bwe_min' and
+'cc_sendme_inc' appropriately. We can set the 'cc_cwnd_max' parameter to low
+values (e.g., ~2000 or so) to watch for evidence of this in Shadow, and log
+when that cap is hit. Similarly, we should watch for evidence that the
+'cc_cwnd_min' parameter value is rarely hit in Shadow, as this indicates that
+the cwnd may be too small to measure BDP (for cwnd less than
+'cc_sendme_inc'*'cc_bwe_min').
6.3. Flow Control Algorithm Experiments
-We will need to tune the xoff_* consensus parameters to minimize the
-amount of edge connection buffering as well as XON/XOFF chatter for
-Exits. This can be done in simulation, but will require fine-tuning on
-the live network.
+Flow control only applies when the edges outside of Tor (SOCKS application,
+onion service application, or TCP destination site) are *slower* than Tor's
+congestion window. This typically means that the application is either
+suspended or reading too slow off its SOCKS connection, or the TCP destination
+site itself is bandwidth throttled on its downstream.
+
+To examine these properties, we will perform live onion service testing, where
+curl is used to download a large file. We will test no rate limit, and
+verify that XON/XOFF was never sent. We then suspend this download, verify
+that an XOFF is sent, and transmission stops. Upon resuming this download, the
+download rate should return to normal. We will also use curl's --limit-rate
+option, to exercise that the flow control properly measures the drain rate and
+limits the buffering in the outbuf, modulo kernel socket and localhost TCP
+buffering.
+
+However, flow control can also get triggered at Exits in situations where
+TCP fairness issues or Tor's mainloop scheduling prevents edge uploads from
+getting enough capacity, causing them to be rate limited below the
+circuit's congestion window, even though the TCP destination actually has
+sufficient downstream capacity.
+
+Exits are also most vulnerable to the buffer bloat caused by such uploads,
+since there may be many uploads active at once.
+
+To study this, we will run shadow simulations. Because Shadow does *not*
+rate limit its tgen TCP endpoints, and only rate limits the relays
+themselves, if *any* XON/XOFF activity happens in Shadow *at all*, it is
+evidence that such fairness issues can occur.
+
+Just in case Shadow does not have sufficient edge activity to trigger such
+emergent behavior, when congestion control is enabled on the live network, we
+will also need to instrument a live exit, to verify that XON/XOFF is not
+happening frequently on it. Relays may also report these statistics in
+extra-info descriptors, to help with monitoring live network conditions, but
+this might also require aggregation or minimization.
+
+If excessive XOFF/XON activity happens at Exits, we will need to investigate
+tuning the libevent mainloop to prioritize edge writes over orconn writes.
+Additionally, we can lower 'cc_xoff_exit'. Linux Exits can also lower the
+'net.ipv[46].tcp_wmem' sysctl value, to reduce the amount of kernel socket
+buffering they do on such streams, which will improve XON/XOFF responsiveness
+and reduce memory usage.
-Additionally, we will need to experiment with reducing the cell queue
-limits on OR conns before they are blocked, and study the interaction
-of that with treating the or conn block as a congestion signal.
+6.4. Performance Metrics [EVALUATION_METRICS]
- TODO: We should make the cell queue highwater value into a consensus
- parameter in the flow control implementation.
+The primary metrics that we will be using to measure the effectiveness
+of congestion control in simulation are TTFB/RTT, throughput, and utilization.
-Relays may report these statistics in extra-info descriptor, to help with
-monitoring the live network conditions, but this might also require
-aggregation or minimization.
+We will calibrate the Shadow simulator so that it has similar CDFs for all of
+these metrics as the live network, without using congestion control.
-6.4. Performance Metrics [EVALUATION_METRICS]
+Then, we will want to inspect CDFs of these three metrics for various
+congestion control algorithms and parameters.
+
+The live network testing will also spot-check performance characteristics of
+a couple algorithm and parameter sets, to ensure we see similar results as
+Shadow.
-Because congestion control will affect so many aspects of performance,
-from throughput to RTT, to load balancing, queue length, overload, and
-other failure conditions, the full set of performance metrics will be
-required:
+On the live network, because congestion control will affect so many aspects of
+performance, from throughput to RTT, to load balancing, queue length,
+overload, and other failure conditions, the full set of performance metrics
+will be required, to check for any emergent behaviors:
https://gitlab.torproject.org/legacy/trac/-/wikis/org/roadmaps/CoreTor/PerformanceMetrics
We will also need to monitor network health for relay queue lengths,
@@ -909,13 +1105,18 @@ These are sorted in order of importance to tune, most important first.
use, as an integer.
- Range: [0,3] (0=fixed, 1=Westwood, 2=Vegas, 3=NOLA)
- Default: 2
- - Tuning Values: 0-3
+ - Tuning Values: [2,3]
- Tuning Notes:
These algorithms need to be tested against percentages of current
fixed alg client competition, in Shadow. Their optimal parameter
values, and even the optimal algorithm itself, will likely depend
upon how much fixed sendme traffic is in competition. See the
algorithm-specific parameters for additional tuning notes.
+ - Shadow Tuning Results:
+ Westwood exhibited responsiveness problems, drift, and overshoot.
+ NOLA exhibited ack compression resulting in over-estimating the
+ BDP. Vegas, when tuned properly, kept queues low and throughput
+    high.
cc_bwe_min:
- Description:
@@ -923,7 +1124,7 @@ These are sorted in order of importance to tune, most important first.
estimate bandwidth (and thus BDP).
- Range: [2, 20]
- Default: 5
- - Tuning Values: 3-5
+ - Tuning Values: 4-10
- Tuning Notes:
The lower this value is, the sooner we can get an estimate of
the true BDP of a circuit. Low values may lead to massive
@@ -935,11 +1136,15 @@ These are sorted in order of importance to tune, most important first.
when the congestion window is small. If we need small congestion
windows, we should also lower cc_sendme_inc, which will get us more
frequent acks with less data.
+ - Shadow Tuning Results:
+      Regardless of how high this was set, there were still cases where
+ queues built up, causing BDP over-estimation. As a result, we disable
+ use of the BDP estimator, and only use the Vegas CWND estimator.
cc_sendme_inc:
- Description: Specifies how many cells a SENDME acks
- - Range: [1, 5000]
- - Default: 50
+ - Range: [1, 255]
+ - Default: 31
- Tuning Values: 25,33,50
- Tuning Notes:
Increasing this increases overhead, but also increases BDP
@@ -947,14 +1152,22 @@ These are sorted in order of importance to tune, most important first.
and the old code sent sendmes at both every 50 cells, and every
100, we can set this as low as 33 to have the same amount of
overhead.
+ - Shadow Tuning Results:
+ This was optimal at 31-32 cells, which is also the number of
+ cells that fit in a TLS frame. Much of the rest of Tor has
+ processing values at 32 cells, as well.
+ - Consensus Update Notes:
+ This value MUST only be changed by a factor of 2, every 4 hours.
+ If greater changes are needed, they MUST be spread out over
+ multiple consensus updates.
cc_cwnd_init:
- Description: Initial congestion window for new congestion
control Tor clients. This can be set much higher
than TCP, since actual TCP to the guard will prevent
buffer bloat issues at local routers.
- - Range: [100, 10000]
- - Default: 500
+ - Range: [31, 10000]
+ - Default: 4*31
- Tuning Values: 150,200,250,500
- Tuning Notes:
Higher initial congestion windows allow the algorithms to
@@ -962,17 +1175,42 @@ These are sorted in order of importance to tune, most important first.
and latency. Ultimately, the ICW should be set to approximately
'cc_bwe_min'*'cc_sendme_inc', but the presence of competing
fixed window clients may require higher values.
+ - Shadow Tuning Results:
+ Setting this too high caused excessive cell queues at relays.
+ 4*31 ended up being a sweet spot.
+ - Consensus Update Notes:
+ This value must never be set below cc_sendme_inc.
cc_cwnd_min:
- Description: The minimum allowed cwnd.
- - Range: [cc_sendme_inc, 1000]
- - Tuning Values: [50, 100, 150]
+ - Range: [31, 1000]
+ - Default: 31
+ - Tuning Values: [100, 150, 200]
- Tuning Notes:
If the cwnd falls below cc_sendme_inc, connections can't send
enough data to get any acks, and will stall. If it falls below
cc_bwe_min*cc_sendme_inc, connections can't use SENDME BDP
estimates. Likely we want to set this around
cc_bwe_min*cc_sendme_inc, but no lower than cc_sendme_inc.
+ - Shadow Tuning Results:
+ We set this at 31 cells, the cc_sendme_inc.
+ - Consensus Update Notes:
+ This value must never be set below cc_sendme_inc.
+
+ cc_cwnd_max:
+ - Description: The maximum allowed cwnd.
+ - Range: [500, INT32_MAX]
+ - Default: INT32_MAX
+ - Tuning Values: [5000, 10000, 20000]
+ - Tuning Notes:
+ If cc_bwe_min is set too low, the BDP estimator may over-estimate the
+ congestion window in the presence of large queues, due to SENDME ack
+ compression. Once all clients have upgraded to congestion control,
+ queues large enough to cause ack compression should become rare. This
+ parameter exists primarily to verify this in Shadow, but we preserve it
+ as a consensus parameter for emergency use in the live network, as well.
+ - Shadow Tuning Results:
+ We kept this at INT32_MAX.
circwindow:
- Description: Initial congestion window for legacy Tor clients
@@ -997,38 +1235,69 @@ These are sorted in order of importance to tune, most important first.
only be updated once every cwnd worth of packets. We may find it
better to update more frequently, but this is probably unlikely
to help a great deal.
+ - Shadow Tuning Results:
+ Increasing this during slow start caused overshoot and excessive
+ queues. Increasing this after slow start was suboptimal for
+ performance. We keep this at 1.
- cc_ewma_cwnd_cnt:
+ cc_ewma_cwnd_pct:
- Description: This specifies the N in N-EWMA smoothing of RTT and BDP
- estimation. It allows us to average these values over 1
- or more congestion windows.
- - Range: [1, 100]
- - Default: 2
- - Tuning Values: [1,2,3]
+ estimation, as a percent of the number of SENDME acks
+ in a congestion window. It allows us to average these RTT
+ values over a percentage of the congestion window,
+ capped by 'cc_ewma_max' below, and specified in
+ [N_EWMA_SMOOTHING].
+ - Range: [1, 255]
+ - Default: 50,100
+ - Tuning Values: [25,50,100]
- Tuning Notes:
Smoothing our estimates reduces the effects of ack compression and
other ephemeral network hiccups; changing this much is unlikely
to have a huge impact on performance.
+ - Shadow Tuning Results:
+ Setting this to 50 seemed to reduce cell queues, but this may also
+ have impacted performance.
+
+ cc_ewma_max:
+ - Description: This specifies the max N in N_EWMA smoothing of RTT and BDP
+ estimation. It allows us to place a cap on the N of EWMA
+ smoothing, as specified in [N_EWMA_SMOOTHING].
+ - Range: [2, INT32_MAX]
+ - Default: 10
+ - Tuning Values: [10,20]
+ - Shadow Tuning Results:
+ We ended up needing this to make Vegas more responsive to
+ congestion, to avoid overloading slow relays. Values of 10 or 20
+ were best.
cc_cwnd_inc:
- Description: How much to increment the congestion window by during
steady state, every cwnd.
- Range: [1, 1000]
- - Default: 50
+ - Default: 31
- Tuning Values: 25,50,100
- Tuning Notes:
We are unlikely to need to tune this much, but it might be worth
trying a couple values.
+ - Shadow Tuning Results:
+ Increasing this negatively impacted performance. Keeping it at
+ cc_sendme_inc is best.
cc_cwnd_inc_pct_ss:
- Description: Percentage of the current congestion window to increment
by during slow start, every cwnd.
- - Range: [10, 300]
+ - Range: [1, 500]
+ - Default: 50
- Tuning Values: 50,100,200
- Tuning Notes:
On the current live network, the algorithms tended to exit slow
start early, so we did not exercise this much. This may not be the
case in Shadow, or once clients upgrade to the new algorithms.
+ - Shadow Tuning Results:
+ Setting this above 50 caused excessive queues to build up in
+ Shadow. This may have been due to imbalances in Shadow client
+ allocation, though. Values of 50-100 will be explored after
+ examining Shadow Guard Relay Utilization.
cc_bdp_alg:
- Description: The BDP estimation algorithm to use.
@@ -1036,9 +1305,21 @@ These are sorted in order of importance to tune, most important first.
- Default: 7 (Piecewise EWMA)
- Tuning Notes:
We don't expect to need to tune this.
+ - Shadow Tuning Results:
+ We leave this as-is, but disable it in Vegas instead, below.
6.5.2. Westwood parameters
+ Westwood has runaway conditions. Because the congestion signal threshold of
+ TOR_WESTWOOD is a function of RTT_max, excessive queuing can cause an
+ increase in RTT_max. Additionally, if stream activity is constant, but of
+ a lower bandwidth than the circuit, this will not drive the RTT upwards,
+ and this can result in a congestion window that continues to increase in the
+ absence of any other concurrent activity.
+
+ For these reasons, we are unlikely to spend much time deeply investigating
+  Westwood in Shadow, beyond a simulation or two to check these behaviors.
+
cc_westwood_rtt_thresh:
- Description:
Specifies the cutoff for BOOTLEG_RTT_TOR to deliver
@@ -1084,14 +1365,18 @@ These are sorted in order of importance to tune, most important first.
6.5.3. Vegas Parameters
-
- cc_vegas_alpha:
- cc_vegas_beta:
- cc_vegas_gamma:
+ cc_vegas_alpha_{exit,onion,sos,vg,sbws}:
+ cc_vegas_beta_{exit,onion,sos,vg,sbws}:
+ cc_vegas_gamma_{exit,onion,sos,vg,sbws}:
+ cc_vegas_delta_{exit,onion,sos,vg,sbws}:
- Description: These parameters govern the number of cells
that [TOR_VEGAS] can detect in queue before reacting.
- - Range: [1, 1000]
- - Defaults: 6*cc_sendme_inc for gamma and beta; 3*cc_sendme_inc for alpha
+ - Range: [0, 1000] (except delta, which has max of INT32_MAX)
+ - Defaults:
+ alpha: path_len*2*31-31
+ beta: path_len*2*31
+ gamma: path_len*2*31
+ delta: path_len*2*31 + 2*31
- Tuning Notes:
The amount of queued cells that Vegas should tolerate is heavily
dependent upon competing congestion control algorithms. The specified
@@ -1100,18 +1385,30 @@ These are sorted in order of importance to tune, most important first.
clients using fixed windows is reduced (or circwindow is lowered), we
can reduce these parameters, which will result in less overall queuing
and congestion on the network.
+ - Shadow Tuning Results:
+      We found that the best values for 3-hop Exit circuits were to set
+ beta and gamma to the size of the outbufs times the number of hops.
+ This has the effect that Vegas detects congestion and backs off
+ when this much queue delay is detected. Alpha is set to one TLS
+ record/sendme_inc below this value. If the queue length is detected
+ to be below that, we increase the congestion window. We still
+ need to verify that the path length multiplier still holds for
+ other types of circuits, specifically onion services.
cc_vegas_bdp_mix:
- Description: This parameter allows us to specify a weighted percent
average between the cwnd BDP estimator and the piecewise
estimator.
- Range: [0, 100]
- - Default: 0 (use 100% Piecewise EWMA)
+ - Default: 100 (use 100% CWND estimator)
- Tuning Notes:
The original Vegas literature specified the CWND estimator, but
the Piecewise estimator is very likely more suited for our use
case. This parameter allows us to average them. We're unlikely to
need it.
+ - Shadow Tuning Results:
+ Because the BDP estimator had ack compression and over-estimation,
+      we use only the CWND estimator.
6.5.4. NOLA Parameters
@@ -1129,26 +1426,182 @@ These are sorted in order of importance to tune, most important first.
absent any more agressive competition, we do not need to overshoot
the BDP estimate.
-
6.5.5. Flow Control Parameters
- TODO: These may expand, particularly to include cell_queue_highwater
+ As with previous sections, the parameters in this section are sorted with
+  the parameters that are most important to tune, first.
- xoff_client
- xoff_mobile
- xoff_exit
- - Description: Specifies the stream queue size as a percentage of
- 'cwnd' at an endpoint before an XOFF is sent.
- - Range: [1, 100]
- - Default: 5
+ These parameters have been tuned using onion services. The defaults are
+ believed to be good.
- xon_client
- xon_mobile
- xon_exit
- - Description: Specifies the how many cells below xoff_* before
- an XON is sent from an endpoint.
- - Range: [1, 10000000]
- - Default: 10000
+ cc_xoff_client
+ cc_xoff_exit
+ - Description: Specifies the outbuf length, in relay cell multiples,
+ before we send an XOFF.
+ - Range: [1, 10000]
+ - Default: 500
+ - Tuning Values: [500, 1000]
+ - Tuning Notes:
+ This threshold plus the sender's cwnd must be greater than the
+ cc_xon_rate value, or a rate cannot be computed. Unfortunately,
+ unless it is sent, the receiver does not know the cwnd. Therefore,
+ this value should always be higher than cc_xon_rate minus
+      'cc_cwnd_min' (100) minus the xon threshold value (0).
+
+ cc_xon_rate
+ - Description: Specifies how many full packed cells of bytes must arrive
+ before we can compute a rate, as well as how often we can
+ send XONs.
+ - Range: [1, 5000]
+ - Default: 500
+ - Tuning Values: [500, 1000]
+ - Tuning Notes:
+ Setting this high will prevent excessive XONs, as well as reduce
+      side channel potential, but it will delay response to queuing,
+ and will hinder our ability to detect rate changes. However, low
+ values will also reduce our ability to accurately measure drain
+ rate. This value should always be lower than 'cc_xoff_*' +
+ 'cc_cwnd_min', so that a rate can be computed solely from the outbuf
+ plus inflight data.
+
+ cc_xon_change_pct
+ - Description: Specifies how much the edge drain rate can change before
+ we send another advisory cell.
+ - Range: [1, 99]
+ - Default: 25
+ - Tuning values: [25, 50, 75]
+ - Tuning Notes:
+ Sending advisory updates due to a rate change may help us avoid
+ hitting the XOFF limit, but it may also not help much unless we
+ are already above the advise limit.
+
+ cc_xon_ewma_cnt
+ - Description: Specifies the N in the N_EWMA of rates.
+ - Range: [2, 100]
+ - Default: 2
+ - Tuning values: [2, 3, 5]
+ - Tuning Notes:
+ Setting this higher will smooth over changes in the rate field,
+ and thus avoid XONs, but will reduce our reactivity to rate changes.
+
+
+6.5.6. External Performance Parameters to Tune
+
+ The following parameters are from other areas of Tor, but tuning them
+ will improve congestion control performance. They are again sorted
+ by most important to tune, first.
+
+ cbtquantile
+ - Description: Specifies the percentage cutoff for the circuit build
+ timeout mechanism.
+ - Range: [60, 80]
+ - Default: 80
+ - Tuning Values: [70, 75, 80]
+ - Tuning Notes:
+ The circuit build timeout code causes Tor to use only the fastest
+ 'cbtquantile' percentage of paths to build through the network.
+ Lowering this value will help avoid congested relays, and improve
+ latency.
+
+ CircuitPriorityHalflifeMsec
+ - Description: The CircEWMA half-life specifies the time period after
+ which the cell count on a circuit is halved. This allows
+ circuits to regain their priority if they stop being bursty.
+ - Range: [1, INT32_MAX]
+ - Default: 30000
+ - Tuning Values: [5000, 15000, 30000, 60000]
+ - Tuning Notes:
+ When we last tuned this, it was before KIST[31], so previous values may
+ have little relevance to today. According to the CircEWMA paper[30], values
+ that are too small will fail to differentiate bulk circuits from interactive
+ ones, and values that are too large will allow new bulk circuits to keep
+ priority over interactive circuits for too long. The paper does say
+ that the system was not overly sensitive to specific values, though.
+
+ CircuitPriorityTickSecs
+ - Description: This specifies how often in seconds we readjust circuit
+ priority based on their EWMA.
+ - Range: [1, 600]
+ - Default: 10
+ - Tuning Values: [1, 5, 10]
+ - Tuning Notes:
+ Even less is known about the optimal value for this parameter. At a
+ guess, it should be more often than the half-life. Changing it also
+ influences the half-life decay, though, at least according to the
+ CircEWMA paper[30].
+
+ KISTSchedRunInterval
+ - If 0, KIST is disabled. (We should also test KIST disabled)
+
+
+6.5.7. External Memory Reduction Parameters to Tune
+
+ The following parameters are from other areas of Tor, but tuning them
+ will reduce memory utilization in relays. They are again sorted by most
+ important to tune, first.
+
+ circ_max_cell_queue_size
+ - Description: Specifies the minimum number of cells that are allowed
+ to accumulate in a relay queue before closing the circuit.
+ - Range: [1000, INT32_MAX]
+ - Default: 50000
+ - Tuning Values: [1000, 2500, 5000]
+ - Tuning Notes:
+ Once all clients have upgraded to congestion control, relay circuit
+ queues should be minimized. We should minimize this value, as any
+      high amount of queueing likely indicates a violation of the algorithm.
+
+ cellq_low
+ cellq_high
+ - Description: Specifies the number of cells that can build up in
+ a circuit's queue for delivery onto a channel (from edges)
+ before we either block or unblock reading from streams
+ attached to that circuit.
+ - Range: [1, 1000]
+ - Default: low=10, high=256
+ - Tuning Values: low=[0, 2, 4, 8]; high=[16, 32, 64]
+ - Tuning Notes:
+ When data arrives from edges into Tor, it gets packaged up into cells
+ and then delivered to the cell queue, and from there is dequeued and
+ sent on a channel. If the channel has blocked (see below params), then
+ this queue grows until the high watermark, at which point Tor stops
+ reading on all edges associated with a circuit, and a congestion
+ signal is delivered to that circuit. At 256 cells, this is ~130k of
+ data for *every* circuit, which is far more than Tor can write in a
+ channel outbuf. Lowering this will reduce latency, reduce memory
+ usage, and improve responsiveness to congestion. However, if it is
+ too low, we may incur additional mainloop invocations, which are
+ expensive. We will need to trace or monitor epoll() invocations in
+ Shadow or on a Tor exit to verify that low values do not lead to
+ more mainloop invocations.
+ - Shadow Tuning Results:
+ After extensive tuning, it turned out that the defaults were optimal
+ in terms of throughput.
+
+ orconn_high
+ orconn_low
+ - Description: Specifies the number of bytes that can be held in an
+ orconn's outbuf before we block or unblock the orconn.
+ - Range: [509, INT32_MAX]
+ - Default: low=16k, high=32k
+ - Tuning Notes:
+ When the orconn's outbuf is above the high watermark, cells begin
+ to accumulate in the cell queue as opposed to being added to the
+ outbuf. It may make sense to lower this to be more in-line with the
+ cellq values above. Also note that the low watermark is only used by
+ the vanilla scheduler, so tuning it may be relevant when we test with
+ KIST disabled. Just like the cell queue, if this is set lower, congestion
+ signals will arrive sooner to congestion control when orconns become
+ blocked, and less memory will occupy queues. It will also reduce latency.
+ Note that if this is too low, we may not fill TLS records, and we may
+ incur excessive epoll()/mainloop invocations. Tuning this is likely
+ less beneficial than tuning the above cell_queue, unless KIST is
+ disabled.
+
+ MaxMemInQueues
+ - Should be possible to set much lower, similarly to help with
+ OOM conditions due to protocol violation. Unfortunately, this
+      is just a torrc option, and a bad one at that.
7. Protocol format specifications [PROTOCOL_SPEC]
@@ -1317,85 +1770,96 @@ also be used as a side channel. So we must limit its use to a couple of
cells per circuit, at most.
https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack
-9. Onion Services
+
+9. Onion Service Negotiation [ONION_NEGOTIATION]
Onion service requires us to advertise the protocol version and congestion
control parameters in a different way since the end points do not know each
-other like a client knows all the relays and what they support.
+other like a client knows all the relays and what they support. Additionally,
+we cannot use ntorv3 for onion service negotiation, because it is not
+supported at all rendezvous and introduction points.
To address this, this is done in two parts. First, the service needs to
-advertise to the world that it supports congestion control. This is done
-through a new line in the descriptor, see section 9.1 below.
+advertise to the world that it supports congestion control, and its view of
+the current cc_sendme_inc consensus parameter. This is done through a new
+line in the onion service descriptor, see section 9.1 below.
Second, the client needs to inform the service that it wants to use congestion
-control on the rendezvous circuit and with wich parameters. This is done
-through the INTRODUCE cell as an extension, see section 9.2 below.
+control on the rendezvous circuit. This is done through the INTRODUCE cell as
+an extension, see section 9.2 below.
-9.1. Descriptor
+9.1. Onion Service Descriptor
-We propose to add a new line to advertise the flow control protocol version:
+We propose to add a new line to advertise the flow control protocol version,
+in the encrypted section of the onion service descriptor:
- "flow-control" SP version-number NL
+ "flow-control" SP version-range SP sendme-inc NL
- The "version-number" value is the same as the protocol version FlowCtrl
- that relay advertises which is defined earlier in this proposal. Current
- value can only be "2".
+ The "version-range" value is the same as the protocol version FlowCtrl
+   that relays advertise, which is defined earlier in this proposal. The
+ current value is "1-2".
-Clients that are able to parse this line and know the protocol version can then
-send in the INTRODUCE1 cell congestion control parameters that they would like
-to use which is detailed in the next section.
+ The "sendme-inc" value comes from the service's current cc_sendme_inc
+ consensus parameter.
-This line MUST be removed from the descriptor if the consensus disables
-congestion control by setting "cc_alg=0". In other words, a service should only
-advertise its flow control version if congestion control is enabled network
-wide.
+Clients MUST ignore additional unknown versions in "version-range", and MUST
+ignore any additional values on this line.
-9.2 INTRODUCE cell extension
+Clients SHOULD use the highest value in "version-range" to govern their
+protocol choice for "FlowCtrl" and INTRODUCE cell format, as per Section 9.2
+below.
-We propose a new extension to the INTRODUCE cell which can be used to send
-congestion control parameters down to the service. It is important to mention
-that this is an extension to be used in the encrypted setion of the cell and
-not its readable section by the introduction point.
+If clients do not support any of the versions in "version-range", they SHOULD
+reject the descriptor. (They MAY choose to ignore this line instead, but doing
+so means using the old fixed-window SENDME flow control, which will likely be
+bad for the network).
-If used, it needs to be encoded within the N_EXTENSIONS field of the ENCRYPTED
-section of the INTRODUCE1 cell defined in rend-spec-v3.txt section 3.3. The
-content is defined as follow:
+Clients that are able to parse this line and know the protocol version
+MUST validate that the "sendme-inc" value is within a multiple of 2 of the
+"cc_sendme_inc" in the consensus that they see. If "sendme-inc" is not within
+range, they MUST reject the descriptor.
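+
+For illustration, one possible reading of this check follows, assuming
+"within a multiple of 2" means between half and twice the client's consensus
+value (an implementation may use slightly different bounds):
+
+  def sendme_inc_acceptable(desc_sendme_inc, consensus_sendme_inc):
+      lower = consensus_sendme_inc // 2
+      upper = consensus_sendme_inc * 2
+      return lower <= desc_sendme_inc <= upper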
- EXT_FIELD_TYPE:
+If their consensus also lists a non-zero "cc_alg", they MAY then include the
+congestion control request extension field in the INTRODUCE1 cell, which is
+detailed in the next section.
- [01] -- Congestion Control Parameters.
+A service should only advertise its flow control version if congestion control is
+enabled. It MUST remove this line if congestion control is disabled.
- If this flag is set, the extension should be used by the service to learn
- what are the congestion control parameters to use on the rendezvous
- circuit.
+If the service observes a change in the 'cc_sendme_inc' consensus parameter
+since it last published its descriptor, it MUST immediately close its
+introduction points, and publish a new descriptor with the new "sendme-inc"
+value. The additional step of closing the introduction points ensures that no
+clients arrive using a cached descriptor with the old "sendme-inc" value.
- EXT_FIELD content format is:
+9.2 INTRODUCE cell extension
- N_PARAMS [1 byte]
- N_PARAMS times:
- PARAM_TYPE [1 byte]
- PARAM_VALUE [8 byte]
+We propose a new extension to the INTRODUCE cell which can be used to send
+congestion control parameters down to the service. It is important to mention
+that this extension is used in the encrypted section of the cell, and not in
+the section readable by the introduction point.
- The PARAM_TYPE possible values are:
+If used, it needs to be encoded within the ENCRYPTED section of the INTRODUCE1
+cell defined in rend-spec-v3.txt section 3.3. The content is defined as follows:
- [01] -- Newest protocol version supported
- The newest protocol version that the client supports.
+ EXT_FIELD_TYPE:
- [02] -- SENDME increment
- The initial SENDME increment that should be used by both end points
- when using congestion control.
+ [01] -- Congestion Control Request.
- The PARAM_VALUE size is 8 bytes in order to accomodate 64bit values. It
- MUST match the specified limits for the following PARAM_TYPE:
+This field has zero payload length. Its presence signifies that the client
+wants to use congestion control. The client MUST NOT set this field, or use
+ntorv3, if the service did not list "2" in the "FlowCtrl" line in the
+descriptor. The client SHOULD NOT provide this field if the consensus
+parameter 'cc_alg' is 0.
- [01] -- Min: 2, Max: 255
- [02] -- Min: 1, Max: 5000 (same as "cc_sendme_inc")
+The service MUST ignore any unknown fields.
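+
+For concreteness, a non-normative sketch of appending this request to an
+extension list, assuming the EXT_FIELD_TYPE / EXT_FIELD_LEN / EXT_FIELD
+layout from rend-spec-v3.txt (the N_EXTENSIONS count byte is assumed to be
+maintained by the caller; names are ours):
+
+  #include <stddef.h>
+  #include <stdint.h>
+
+  #define EXT_TYPE_CC_REQUEST 0x01
+
+  /* Append a zero-length "Congestion Control Request" extension to 'buf',
+   * which already holds 'len' bytes of encoded extensions. Returns the
+   * new length, or 0 if there is not enough room. */
+  static size_t
+  append_cc_request_ext(uint8_t *buf, size_t len, size_t capacity)
+  {
+    if (len + 2 > capacity)
+      return 0;
+    buf[len++] = EXT_TYPE_CC_REQUEST; /* EXT_FIELD_TYPE */
+    buf[len++] = 0;                   /* EXT_FIELD_LEN: no payload */
+    return len;
+  }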
9.3 Protocol Flow
First, the client reads the "flow-control" line in the descriptor and gets the
-maximum value from what it supports and the service supports. As an example, if
-the client supports 2-3-4 and the service supports 2-3, then 3 is chosen.
+maximum version supported by both itself and that line's "version-range". As
+an example, if the client supports 2-3-4 and the service supports 2-3, then 3
+is chosen.
It then sends that value along with its desired cc_sendme_inc value in the
INTRODUCE1 cell in the extension.
@@ -1406,7 +1870,7 @@ then applied to the rendezvous circuit.
9.4 Circuit Behavior
-If the extension is not found in the cell, the service MUST not use congestion
+If the extension is not found in the cell, the service MUST NOT use congestion
control on the rendezvous circuit.
Any invalid values received in the extension should result in closing the
@@ -1424,15 +1888,159 @@ everyone looks the same again.
The new extension is located in the encrypted part of the INTRODUCE1 cell and
thus the introduction point can't learn its content.
-10. Acknowledgements
+
+10. Exit negotiation [EXIT_NEGOTIATION]
+
+Similar to onion services, clients and exits will need to negotiate the
+decision to use congestion control, as well as a common value for
+'cc_sendme_inc', for a given circuit.
+
+10.1. When to negotiate
+
+Clients decide to initiate a negotiation attempt for a circuit if the
+consensus lists a non-zero 'cc_alg' parameter value, and the protover line
+for their chosen exit includes a value of 2 in the "FlowCtrl" field.
+
+If the FlowCtrl=2 subprotocol is absent, a client MUST NOT attempt negotiation.
+
+If 'cc_alg' is absent or zero, a client SHOULD NOT attempt
+negotiation, or use ntorv3.
+
+If the protover and consensus conditions are met, clients SHOULD negotiate
+with the Exit if the circuit is to be used for exit stream activity. Clients
+SHOULD NOT negotiate congestion control for one-hop circuits, or internal
+circuits.
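+
+Putting these conditions together, a non-normative client-side sketch (all
+names are ours) might read:
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  /* Should this client attempt congestion control negotiation on a circuit?
+   * The inputs summarize the consensus, the chosen exit's protovers, and
+   * the circuit's intended use. */
+  static bool
+  should_negotiate_cc(uint32_t consensus_cc_alg,
+                      bool exit_advertises_flowctrl_2,
+                      bool used_for_exit_streams,
+                      bool onehop_or_internal)
+  {
+    if (consensus_cc_alg == 0)           /* 'cc_alg' absent or zero */
+      return false;
+    if (!exit_advertises_flowctrl_2)     /* no FlowCtrl=2 in protover */
+      return false;
+    if (!used_for_exit_streams || onehop_or_internal)
+      return false;
+    return true;
+  }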
+
+10.2. What to negotiate
+
+Clients and exits need not agree on a specific congestion control algorithm,
+or any aspects of its behavior. Each endpoint's management of its congestion
+window is independent. However, because the new algorithms no longer use
+stream SENDMEs or fixed window sizes, they cannot be used with an endpoint
+expecting the old behavior.
+
+Additionally, each endpoint must agree on the SENDME increment rate, in
+order to synchronize SENDME authentication and pacing.
+
+For this reason, negotiation needs to establish a boolean: "use congestion
+control", and an integer value for SENDME increment.
+
+No other parameters need to be negotiated.
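+
+In other words, the negotiated state amounts to something like the following
+(illustrative only; not an actual tor structure):
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  /* The only two pieces of state both endpoints must agree on. */
+  typedef struct cc_negotiation_t {
+    bool use_congestion_control;  /* true iff both sides opted in */
+    uint8_t sendme_inc;           /* agreed SENDME increment, in cells */
+  } cc_negotiation_t;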
+
+10.3. How to negotiate
+
+Negotiation is performed by sending an ntorv3 onionskin, as specified in
+Proposal 332, to the Exit node. The client's encrypted payload contents are
+encoded as an extension field, as in the onion service INTRO1 cell:
+
+ EXT_FIELD_TYPE:
+
+ [01] -- Congestion Control Request.
+
+As in the INTRO1 extension field, this field has zero payload length.
+
+Its presence signifies that the client wants to use congestion control.
+Again, the client MUST NOT set this field, or use ntorv3, if this exit did not
+list "2" in the "FlowCtrl" version line. The client SHOULD NOT set this to 1
+if the consensus parameter 'cc_alg' is 0.
+
+The Exit MUST ignore any additional unknown extension fields.
+
+The server's encrypted ntorv3 reply payload is encoded as:
+
+ EXT_FIELD_TYPE:
+
+ [02] -- Congestion Control Response.
+
+ If this flag is set, the extension should be used by the client to learn
+ what congestion control parameters to use on this circuit.
+
+ EXT_FIELD content payload is a single byte:
+
+ sendme_inc [1 byte]
+
+The Exit MUST provide its current view of 'cc_sendme_inc' in this payload if it
+observes a non-zero 'cc_alg' consensus parameter. Exits SHOULD only include
+this field once.
+
+The client MUST use the FIRST such field value, and ignore any duplicate field
+specifiers. The client MUST ignore any unknown additional fields.
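+
+A non-normative sketch of the Exit side of this encoding, using the same
+extension layout and hypothetical helper style as the request above:
+
+  #include <stddef.h>
+  #include <stdint.h>
+
+  #define EXT_TYPE_CC_RESPONSE 0x02
+
+  /* Append the "Congestion Control Response" extension, carrying the
+   * Exit's current view of 'cc_sendme_inc', to an extension buffer. */
+  static size_t
+  append_cc_response_ext(uint8_t *buf, size_t len, size_t capacity,
+                         uint8_t sendme_inc)
+  {
+    if (len + 3 > capacity)
+      return 0;
+    buf[len++] = EXT_TYPE_CC_RESPONSE; /* EXT_FIELD_TYPE */
+    buf[len++] = 1;                    /* EXT_FIELD_LEN */
+    buf[len++] = sendme_inc;           /* sendme_inc payload byte */
+    return len;
+  }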
+
+10.5. Client checks
+
+The client MUST reject any ntorv3 replies for non-ntorv3 onionskins.
+
+The client MUST reject an ntorv3 reply with field EXT_FIELD_TYPE=02, if the
+client did not include EXT_FIELD_TYPE=01 in its handshake.
+
+The client SHOULD reject a sendme_inc field value that differs from the
+current 'cc_sendme_inc' consensus parameter by more than a factor of 2, in
+either direction.
+
+If a client rejects a handshake, it MUST close the circuit.
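+
+A condensed, non-normative sketch of these client checks (names are ours; the
+factor-of-two test mirrors the descriptor check in Section 9.1):
+
+  #include <stdbool.h>
+  #include <stdint.h>
+
+  /* Validate the negotiation outcome on the client side. A false result
+   * means the handshake is rejected and the circuit must be closed. */
+  static bool
+  client_cc_reply_ok(bool sent_ntorv3, bool sent_cc_request,
+                     bool got_cc_response, uint8_t reply_sendme_inc,
+                     uint32_t consensus_sendme_inc)
+  {
+    if (got_cc_response && !sent_ntorv3)
+      return false;  /* ntorv3 reply field for a non-ntorv3 onionskin */
+    if (got_cc_response && !sent_cc_request)
+      return false;  /* response without a matching request */
+    if (got_cc_response &&
+        (reply_sendme_inc < consensus_sendme_inc / 2 ||
+         reply_sendme_inc > consensus_sendme_inc * 2))
+      return false;  /* differs by more than a factor of 2 */
+    return true;
+  }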
+
+10.6. Managing consensus updates
+
+The pedantic reader will note that a rogue consensus can cause all clients
+to decide to close circuits by changing 'cc_sendme_inc' by a large margin.
+
+As a matter of policy, the directory authorities MUST NOT change
+'cc_sendme_inc' by more than a factor of two (2), within a four (4) hour
+window, for this reason.
+
+In Shadow simulation, the optimal 'cc_sendme_inc' value was found to be ~31
+cells, or one (1) TLS record worth of cells. We do not expect to change this
+value significantly.
+
+
+11. Acknowledgements
Immense thanks to Toke Høiland-Jørgensen for considerable input into all
aspects of the TCP congestion control background material for this proposal,
as well as review of our versions of the algorithms.
+12. Glossary [GLOSSARY]
+
+ACK - Acknowledgment. In congestion control, this is a type of packet that
+signals that the endpoint received a packet or packet set. In Tor, ACKs are
+called SENDMEs.
+
+BDP - Bandwidth Delay Product. This is the quantity of bytes that are actively
+in transit on a path at any given time. Typically, this does not count packets
+waiting in queues. It is essentially BWE*(RTT - queue_delay).
+
+BWE - BandWidth Estimate. This is the estimated throughput on a path.
+
+CWND - Congestion WiNDow. This is the total number of packets that are allowed
+to be "outstanding" (aka not ACKed) on a path at any given time. An ideal
+congestion control algorithm sets CWND=BDP.
-11. [CITATIONS]
+EWMA - Exponential Weighted Moving Average. This is a mechanism for smoothing
+out high-frequency changes in a value, due to temporary effects.
+
+ICW - Initial Congestion Window. This is the initial value of the congestion
+window at the start of a connection.
+
+RTT - Round Trip Time. This is the time it takes for one endpoint to send a
+packet to the other endpoint, and get a response.
+
+SS - Slow Start. This is the initial phase of most congestion control
+algorithms. Despite the name, it is an exponential growth phase, to quickly
+increase the congestion window from the ICW value up to the path BDP. After
+Slow Start, changes to the congestion window are linear.
+
+XOFF - Transmitter Off. In flow control, XOFF means that the receiver is
+receiving data too fast and is beginning to queue. It is sent to tell the
+sender to stop sending.
+
+XON - Transmitter On. In flow control, XON means that the receiver is ready to
+receive more data. It is sent to tell the sender to resume sending.
+
+
+13. [CITATIONS]
1. Options for Congestion Control in Tor-Like Networks.
https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html
@@ -1514,3 +2122,14 @@ as well as review of our versions of the algorithms.
27. Exponentially Weighted Moving Average
https://corporatefinanceinstitute.com/resources/knowledge/trading-investing/exponentially-weighted-moving-average-ewma/
+
+28. Dropping on the Edge
+ https://www.petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf
+
+29. The Bandguards Subsystem (vanguards addon, README_TECHNICAL.md).
+ https://github.com/mikeperry-tor/vanguards/blob/master/README_TECHNICAL.md#the-bandguards-subsystem
+
+30. An Improved Algorithm for Tor Circuit Scheduling.
+ https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf
+
+31. KIST: Kernel-Informed Socket Transport for Tor
+ https://matt.traudt.xyz/static/papers/kist-tops2018.pdf