Diffstat (limited to 'proposals/324-rtt-congestion-control.txt')
-rw-r--r--  proposals/324-rtt-congestion-control.txt | 326
1 file changed, 249 insertions, 77 deletions
diff --git a/proposals/324-rtt-congestion-control.txt b/proposals/324-rtt-congestion-control.txt
index c617983..952d140 100644
--- a/proposals/324-rtt-congestion-control.txt
+++ b/proposals/324-rtt-congestion-control.txt
@@ -144,20 +144,22 @@
 measured by the RTT estimator, and if these heuristics detect a stall or a
 jump, we do not use that value to update RTT or BDP, nor do we update any
 congestion control algorithm information that round.
 
-If the time delta is 0, that is always treated as a clock stall.
+If the time delta is 0, that is always treated as a clock stall, the RTT is
+not used, congestion control is not updated, and this fact is cached globally.
 
-If we have measured at least 'cc_bwe_min' RTT values or we have successfully
-exited slow start, then every sendme ACK, the new candidate RTT is compared to
-the stored EWMA RTT. If the new RTT is either 5000 times larger than the EWMA
-RTT, or 5000 times smaller than the stored EWMA RTT, then we do not record that
-estimate, and do not update BDP or the congestion control algorithms for that
-SENDME ack.
+If the circuit does not yet have an EWMA RTT or it is still in Slow Start, then
+no further checks are performed, and the RTT is used.
 
-Moreover, if a clock stall is detected by *any* circuit, this fact is
-cached, and this cached value is used on circuits for which we do not
-have enough data to compute the above heuristics. This cached value is
-also exported for use by the edge connection rate calculations done by
-[XON_ADVISORY].
+If the circuit has stored an EWMA RTT and has exited Slow Start, then every
+sendme ACK, the new candidate RTT is compared to the stored EWMA RTT. If the
+new RTT is 5000 times larger than the EWMA RTT, then the circuit does not
+record that estimate, and does not update BDP or the congestion control
+algorithms for that SENDME ack. If the new RTT is 5000 times smaller than the
+EWMA RTT, then the circuit uses the globally cached value from above (ie: it
+assumes the clock is stalled *only* if there was previously *also* a 0-delta
+RTT).
+
+If both ratio checks pass, the globally cached clock stall state is set to
+false (no stall), and the RTT value is used.
 
 2.1.2. N_EWMA Smoothing [N_EWMA_SMOOTHING]
 
@@ -173,10 +175,12 @@
 Moving Average with alpha = 2/(N+1):
 
 Note that the second rearranged form MUST be used in order to ensure that
 rounding errors are handled in the same manner as other implementations.
 
-Flow control rate limiting uses this function
+Flow control rate limiting uses this function.
+
+During Slow Start, N is set to `cc_ewma_ss`, for RTT estimation.
 
-For both RTT and SENDME BDP estimation, N is the number of SENDME acks
-between congestion window updates, divided by the value of consensus
+After Slow Start, for both RTT and SENDME BDP estimation, N is the number
+of SENDME acks between congestion window updates, divided by the value of consensus
 parameter 'cc_ewma_cwnd_pct', and then capped at a max of 'cc_ewma_max',
 but always at least 2:
 
@@ -391,6 +395,13 @@ Simplifying:
 
   BDP = cwnd * RTT_min / RTT_current_ewma
 
+The RTT_min for this calculation comes from the minimum RTT_current_ewma seen
+in the lifetime of this circuit. If the congestion window falls to
+`cc_cwnd_min` after slow start, implementations MAY choose to reset RTT_min
+for use in this calculation to either the RTT_current_ewma, or a
+percentile-weighted average between RTT_min and RTT_current_ewma, specified by
+`cc_rtt_reset_pct`. This helps with escaping starvation conditions.
+
 The net effect of this estimation is to correct for any overshoot of the
 cwnd over the actual BDP. It will obviously underestimate BDP if cwnd is
 below BDP.
 
@@ -533,63 +544,151 @@
 original cwnd estimator. So while this capability to change the BDP
 estimator remains in the C implementation, we do not expect it to be used.
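As a concreteness check, the rearranged N_EWMA form and the cwnd-based BDP
estimate above can be transcribed into Python. The function names are
illustrative, and integer floor division stands in for whatever rounding a
real implementation uses:

```python
def n_ewma_update(prev_ewma, value, N):
    # Rearranged form from the proposal: alpha = 2/(N+1), so
    # EWMA = (value*2 + (N-1)*prev_ewma) / (N+1). Computing it in this
    # order keeps rounding consistent across implementations.
    return (value * 2 + (N - 1) * prev_ewma) // (N + 1)

def bdp_estimate(cwnd, rtt_min, rtt_current_ewma):
    # BDP = cwnd * RTT_min / RTT_current_ewma, from the text above.
    # Corrects for cwnd overshoot; underestimates BDP when cwnd < BDP.
    return cwnd * rtt_min // rtt_current_ewma
```

For example, with cwnd=1000 cells, RTT_min=100ms, and RTT_current_ewma=200ms,
the estimator reports a BDP of 500 cells, pulling cwnd back toward the
actual pipe size.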
 However, it was useful to use a local OR connection block at the time of
-SENDME ack arrival, as an immediate congestion signal.
+SENDME ack arrival, as an immediate congestion signal. Note that in C-Tor,
+this orconn_block state is not derived from any socket info, but instead is a
+heuristic that declares an orconn as blocked if any circuit cell queue
+exceeds the 'cellq_high' consensus parameter.
 
 (As an additional optimization, we could also use the ECN signal described
 in ideas/xxx-backward-ecn.txt, but this is not implemented. It is likely
 only of any benefit during Slow Start, and even that benefit is likely
 small.)
 
-Congestion signals from RTT, blocked OR connections, or ECN are processed only
-once per congestion window. This is achieved through the next_cc_event flag,
-which is initialized to a cwnd worth of SENDME acks, and is decremented
-each ack. Congestion signals are only evaluated when it reaches 0.
-
-Here is the complete pseudocode for TOR_VEGAS, which is run every time
-an endpoint receives a SENDME ack:
-
-  # Update acked cells
-  inflight -= cc_sendme_inc
+During Slow Start, we use RFC3742 Limited Slow Start[32], which checks the
+congestion signals from RTT, blocked OR connections, or ECN every single
+SENDME ack. It also provides a `cc_sscap_*` parameter for each path length,
+which reduces the congestion window increment rate after it is crossed, as
+per the rules in RFC3742:
+
+  rfc3742_ss_inc(cwnd):
+    if cwnd <= cc_ss_cap_pathtype:
+      # Below the cap, we increment as per cc_cwnd_inc_pct_ss percent:
+      return round(cc_cwnd_inc_pct_ss*cc_sendme_inc/100)
+    else:
+      # This returns an increment equivalent to RFC3742, rounded,
+      # with a minimum of inc=1.
+      # From RFC3742:
+      #   K = int(cwnd/(0.5 max_ssthresh));
+      #   inc = int(MSS/K);
+      return MAX(round((cc_sendme_inc*cc_ss_cap_pathtype)/(2*cwnd)), 1);
+
+During both Slow Start, and Steady State, if the congestion window is not full,
+we never increase the congestion window.
+We can still decrease it, or exit slow start, in this case. This is done to
+avoid causing overshoot. The original TCP Vegas addressed this problem by
+computing BDP and queue_use from inflight, instead of cwnd, but we found that
+approach to have significantly worse performance.
+
+Because C-Tor is single-threaded, multiple SENDME acks may arrive during one
+processing loop, before edge connections resume reading. For this reason,
+we provide two heuristics to provide some slack in determining the full
+condition. The first is to allow a gap between inflight and cwnd,
+parameterized as 'cc_cwnd_full_gap' multiples of 'cc_sendme_inc':
+
+  cwnd_is_full(cwnd, inflight):
+    if inflight + 'cc_cwnd_full_gap'*'cc_sendme_inc' >= cwnd:
+      return true
+    else:
+      return false
+
+The second heuristic immediately resets the full state if it falls below
+'cc_cwnd_full_minpct' full:
+
+  cwnd_is_nonfull(cwnd, inflight):
+    if 100*inflight < 'cc_cwnd_full_minpct'*cwnd:
+      return true
+    else:
+      return false
+
+This full status is cached once per cwnd if 'cc_cwnd_full_per_cwnd=1';
+otherwise it is cached once per cwnd update. These two helper functions
+determine the number of acks in each case:
+
+  SENDME_PER_CWND(cwnd):
+    return ((cwnd + 'cc_sendme_inc'/2)/'cc_sendme_inc')
+
+  CWND_UPDATE_RATE(cwnd, in_slow_start):
+    # In Slow Start, update every SENDME
+    if in_slow_start:
+      return 1
+    else: # Otherwise, update as per the 'cc_inc_rate' (31)
+      return ((cwnd + 'cc_cwnd_inc_rate'*'cc_sendme_inc'/2)
+              / ('cc_cwnd_inc_rate'*'cc_sendme_inc'));
+
+Shadow experimentation indicates that 'cc_cwnd_full_gap=2' and
+'cc_cwnd_full_per_cwnd=0' minimizes queue overshoot, whereas
+'cc_cwnd_full_per_cwnd=1' and 'cc_cwnd_full_gap=1' is slightly better
+for performance. Since there may be a difference between Shadow and live,
+we leave this parameterization in place.
+
+Here is the complete pseudocode for TOR_VEGAS with RFC3742, which is run every
+time an endpoint receives a SENDME ack.
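Before the full pseudocode, the helper routines defined above can be
collected into an executable Python sketch. The consensus-parameter defaults
passed here ('cc_sendme_inc'=31, 'cc_cwnd_full_minpct'=75, etc.) are
illustrative stand-ins, not normative values:

```python
def rfc3742_ss_inc(cwnd, cc_sscap, cc_cwnd_inc_pct_ss=100, cc_sendme_inc=31):
    # RFC3742 limited slow start: constant percentage increment below the
    # cap, then an increment that decays as cap/(2*cwnd), floored at 1.
    if cwnd <= cc_sscap:
        return round(cc_cwnd_inc_pct_ss * cc_sendme_inc / 100)
    return max(round((cc_sendme_inc * cc_sscap) / (2 * cwnd)), 1)

def cwnd_is_full(cwnd, inflight, cc_cwnd_full_gap=1, cc_sendme_inc=31):
    # Full if inflight is within cc_cwnd_full_gap SENDMEs' worth of cwnd.
    return inflight + cc_cwnd_full_gap * cc_sendme_inc >= cwnd

def cwnd_is_nonfull(cwnd, inflight, cc_cwnd_full_minpct=75):
    # Integer form of inflight/cwnd < cc_cwnd_full_minpct/100.
    return 100 * inflight < cc_cwnd_full_minpct * cwnd

def sendme_per_cwnd(cwnd, cc_sendme_inc=31):
    # SENDME acks per cwnd, rounded to the nearest integer.
    return (cwnd + cc_sendme_inc // 2) // cc_sendme_inc

def cwnd_update_rate(cwnd, in_slow_start, cc_cwnd_inc_rate=1,
                     cc_sendme_inc=31):
    # Every SENDME in Slow Start; otherwise once per cwnd worth of acks,
    # divided by cc_cwnd_inc_rate, rounded to the nearest integer.
    if in_slow_start:
        return 1
    return ((cwnd + cc_cwnd_inc_rate * cc_sendme_inc // 2)
            // (cc_cwnd_inc_rate * cc_sendme_inc))
```

Note how the two full-window heuristics interlock: with cwnd=124 and the
defaults above, inflight=93 still counts as full (93 + 31 >= 124), while
inflight below 75% of cwnd immediately clears the full state.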
+All variables are scoped to the circuit, unless prefixed by an underscore
+(local), or in single quotes (consensus parameters):
+
+  # Decrement counters that signal either an update or cwnd event
   if next_cc_event:
     next_cc_event--
+  if next_cwnd_event:
+    next_cwnd_event--
 
   # Do not update anything if we detected a clock stall or jump,
   # as per [CLOCK_HEURISTICS]
   if clock_stalled_or_jumped:
+    inflight -= 'cc_sendme_inc'
     return
 
+  if BDP > cwnd:
+    _queue_use = 0
+  else:
+    _queue_use = cwnd - BDP
+
+  if cwnd_is_full(cwnd, inflight):
+    cwnd_full = 1
+  else if cwnd_is_nonfull(cwnd, inflight):
+    cwnd_full = 0
+
+  if in_slow_start:
+    if _queue_use < 'cc_vegas_gamma' and not orconn_blocked:
+      # Only increase cwnd if the cwnd is full
+      if cwnd_full:
+        _inc = rfc3742_ss_inc(cwnd);
+        cwnd += _inc
+
+        # If the RFC3742 increment drops below steady-state increment
+        # over a full cwnd worth of acks, exit slow start.
+        if _inc*SENDME_PER_CWND(cwnd) <= 'cc_cwnd_inc'*'cc_cwnd_inc_rate':
+          in_slow_start = 0
+    else: # Limit hit. Exit Slow start (even if cwnd not full)
+      in_slow_start = 0
+      cwnd = BDP + 'cc_vegas_gamma'
+
+    # Provide an emergency hard-max on slow start:
+    if cwnd >= 'cc_ss_max':
+      cwnd = 'cc_ss_max'
+      in_slow_start = 0
+  else if next_cc_event == 0:
+    if _queue_use > 'cc_vegas_delta':
+      cwnd = BDP + 'cc_vegas_delta' - 'cc_cwnd_inc'
+    elif _queue_use > 'cc_vegas_beta' or orconn_blocked:
+      cwnd -= 'cc_cwnd_inc'
+    elif cwnd_full and _queue_use < 'cc_vegas_alpha':
+      # Only increment if queue is low, *and* the cwnd is full
+      cwnd += 'cc_cwnd_inc'
+
+    cwnd = MAX(cwnd, 'cc_circwindow_min')
+
+  # Specify next cwnd and cc update
   if next_cc_event == 0:
-    if BDP > cwnd:
-      queue_use = 0
-    else:
-      queue_use = cwnd - BDP
-
-    if in_slow_start:
-      if queue_use < cc_vegas_gamma and not orconn_blocked:
-        # Increment by slow start %, or at least 2 sendme_inc's worth
-        cwnd = cwnd + MAX(cwnd * cc_cwnd_inc_pct_ss, 2*cc_sendme_inc)
-        # If our BDP estimator thinks the BDP is still larger, use that
-        cwnd = MAX(cwnd, BDP)
-      else:
-        cwnd = BDP + cc_vegas_gamma
-        in_slow_start = 0
-    else:
-      if queue_use > cc_vegas_delta:
-        cwnd = BDP + cc_vegas_delta - cc_cwnd_inc
-      elif queue_use > cc_vegas_beta or orconn_blocked:
-        cwnd -= cc_cwnd_inc
-      elif queue_use < cc_vegas_alpha:
-        cwnd += cc_cwnd_inc
-
-      cwnd = MAX(cwnd, cc_circwindow_min)
+    next_cc_event = CWND_UPDATE_RATE(cwnd)
+  if next_cwnd_event == 0:
+    next_cwnd_event = SENDME_PER_CWND(cwnd)
+
+  # Determine if we need to reset the cwnd_full state
+  # (Parameterized)
+  if 'cc_cwnd_full_per_cwnd' == 1:
+    if next_cwnd_event == SENDME_PER_CWND(cwnd):
+      cwnd_full = 0
+  else:
+    if next_cc_event == CWND_UPDATE_RATE(cwnd):
+      cwnd_full = 0
 
-    # Count the number of sendme acks until next update of cwnd,
-    # rounded to nearest integer
-    if in_slow_start:
-      next_cc_event = round(cwnd / cc_sendme_inc)
-    else
-      # Never increment faster in slow start, only steady state.
-      next_cc_event = round(cwnd / (cc_cwnd_inc_rate * cc_sendme_inc))
+
+  # Update acked cells
+  inflight -= 'cc_sendme_inc'
 
 3.4. Tor NOLA: Direct BDP tracker [TOR_NOLA]
 
@@ -719,6 +818,12 @@ struct xon_cell {
     u32 kbps_ewma;
   }
 
+Parties SHOULD treat XON or XOFF cells with unrecognized versions as a
+protocol violation.
+
+In `xon_cell`, a zero value for `kbps_ewma` means that the stream's rate is
+unlimited. Parties should therefore not send "0" to mean "do not send data".
+
 4.1.1. XON/XOFF behavior
 
 If the length of an edge outbuf queue exceeds the size provided in the
 
@@ -1274,6 +1379,25 @@ These are sorted in order of importance to tune, most important first.
     congestion, to avoid overloading slow relays. Values of 10 or 20
     were best.
 
+  cc_ewma_ss:
+    - Description: This specifies the N in N_EWMA smoothing of RTT during
+                   Slow Start.
+    - Range: [2, INT32_MAX]
+    - Default: 2
+    - Tuning Values: [2,4]
+    - Shadow Tuning Results:
+        Setting this to 2 helped reduce overshoot during Slow Start.
+
+  cc_rtt_reset_pct:
+    - Description: Describes a percentile average between RTT_min and
+                   RTT_current_ewma, for use to reset RTT_min, when the
+                   congestion window hits cwnd_min.
+    - Range: [0, 100]
+    - Default: 100
+    - Shadow Tuning Results:
+        cwnd_min is not hit in Shadow simulations, but it can be hit
+        on the live network while under DoS conditions, and with cheaters.
+
   cc_cwnd_inc:
     - Description: How much to increment the congestion window by during
                    steady state, every cwnd.
 
@@ -1369,18 +1493,23 @@ These are sorted in order of importance to tune, most important first.
 6.5.3. Vegas Parameters
 
-  cc_vegas_alpha_{exit,onion,sos,vg,sbws}:
-  cc_vegas_beta_{exit,onion,sos,vg,sbws}:
-  cc_vegas_gamma_{exit,onion,sos,vg,sbws}:
-  cc_vegas_delta_{exit,onion,sos,vg,sbws}:
+  cc_vegas_alpha_{exit,onion,sbws}:
+  cc_vegas_beta_{exit,onion,sbws}:
+  cc_vegas_gamma_{exit,onion,sbws}:
+  cc_vegas_delta_{exit,onion,sbws}:
     - Description: These parameters govern the number of cells that
                    [TOR_VEGAS] can detect in queue before reacting.
     - Range: [0, 1000] (except delta, which has max of INT32_MAX)
     - Defaults:
-        alpha: path_len*2*31-31
-        beta: path_len*2*31
-        gamma: path_len*2*31
-        delta: path_len*2*31 + 2*31
+        # OUTBUF_CELLS=62
+        cc_vegas_alpha_exit (2*OUTBUF_CELLS)
+        cc_vegas_beta_exit (4*OUTBUF_CELLS)
+        cc_vegas_gamma_exit (3*OUTBUF_CELLS)
+        cc_vegas_delta_exit (6*OUTBUF_CELLS)
+        cc_vegas_alpha_onion (3*OUTBUF_CELLS)
+        cc_vegas_beta_onion (7*OUTBUF_CELLS)
+        cc_vegas_gamma_onion (5*OUTBUF_CELLS)
+        cc_vegas_delta_onion (9*OUTBUF_CELLS)
     - Tuning Notes:
       The amount of queued cells that Vegas should tolerate is heavily
       dependent upon competing congestion control algorithms. The specified
 
@@ -1399,20 +1528,60 @@ These are sorted in order of importance to tune, most important first.
       need to verify that the path length multiplier still holds for other
       types of circuits, specifically onion services.
 
-  cc_vegas_bdp_mix:
-    - Description: This parameter allows us to specify a weighted percent
-                   average between the cwnd BDP estimator and the piecewise
-                   estimator.
-    - Range: [0, 100]
-    - Default: 100 (use 100% CWND estimator)
-    - Tuning Notes:
-      The original Vegas literature specified the CWND estimator, but
-      the Piecewise estimator is very likely more suited for our use
-      case. This parameter allows us to average them. We're unlikely to
-      need it.
+  cc_sscap_{exit,onion,sbws}:
+    - Description: These parameters describe the RFC3742 'cap', after which
+                   congestion window increments are reduced. INT32_MAX
+                   disables RFC3742.
+    - Range: [100, INT32_MAX]
+    - Defaults:
+        sbws: 400
+        exit: 500
+        onion: 600
     - Shadow Tuning Results:
-      Because the BDP estimator had ack compression and over-estimation,
-      we used the CWND estimator.
+        We picked these defaults based on the average congestion window
+        seen in Shadow sims for exits and onion service circuits.
+
+  cc_ss_max:
+    - Description: This parameter provides a hard-max on the congestion
+                   window in slow start.
+    - Range: [500, INT32_MAX]
+    - Default: 5000
+    - Shadow Tuning Results:
+        The largest congestion window seen in Shadow is ~3000, so this was
+        set as a safety valve above that.
+
+  cc_cwnd_full_gap:
+    - Description: This parameter defines the integer number of
+                   'cc_sendme_inc' multiples of gap allowed between inflight
+                   and cwnd, to still declare the cwnd full.
+    - Range: [0, INT32_MAX]
+    - Default: 1-2
+    - Shadow Tuning Results:
+        A value of 0 resulted in a slight loss of performance, and increased
+        variance in throughput. The optimal number here likely depends on
+        edgeconn inbuf size, edgeconn kernel buffer size, and eventloop
+        behavior.
+
+  cc_cwnd_full_minpct:
+    - Description: This parameter defines a low watermark in percent. If
+                   inflight falls below this percent of cwnd, the congestion
+                   window is immediately declared non-full.
+    - Range: [0, 100]
+    - Default: 75
+
+  cc_cwnd_full_per_cwnd:
+    - Description: This parameter governs how often a cwnd must be
+                   full, in order to allow congestion window increase. If it
+                   is 1, then the cwnd only needs to be full once per cwnd
+                   worth of acks. If it is 0, then it must be full once every
+                   cwnd update (ie: every SENDME).
+    - Range: [0, 1]
+    - Default: 1
+    - Shadow Tuning Results:
+        A value of 0 resulted in a slight loss of performance, and increased
+        variance in throughput. The optimal number here likely depends on
+        edgeconn inbuf size, edgeconn kernel buffer size, and eventloop
+        behavior.
 
 6.5.4. NOLA Parameters
 
@@ -2137,3 +2306,6 @@
 receive more data.
 It is sent to tell the sender to resume sending.
 
 31. KIST: Kernel-Informed Socket Transport for Tor
     https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
+
+32. RFC3742 Limited Slow Start
+    https://datatracker.ietf.org/doc/html/rfc3742#section-2