Diffstat (limited to 'proposals/324-rtt-congestion-control.txt')
-rw-r--r--  proposals/324-rtt-congestion-control.txt | 326
1 file changed, 249 insertions(+), 77 deletions(-)
diff --git a/proposals/324-rtt-congestion-control.txt b/proposals/324-rtt-congestion-control.txt
index c617983..952d140 100644
--- a/proposals/324-rtt-congestion-control.txt
+++ b/proposals/324-rtt-congestion-control.txt
@@ -144,20 +144,22 @@ measured by the RTT estimator, and if these heuristics detect a stall or a jump,
we do not use that value to update RTT or BDP, nor do we update any congestion
control algorithm information that round.
-If the time delta is 0, that is always treated as a clock stall.
+If the time delta is 0, that is always treated as a clock stall, the RTT is
+not used, congestion control is not updated, and this fact is cached globally.
-If we have measured at least 'cc_bwe_min' RTT values or we have successfully
-exited slow start, then every sendme ACK, the new candidate RTT is compared to
-the stored EWMA RTT. If the new RTT is either 5000 times larger than the EWMA
-RTT, or 5000 times smaller than the stored EWMA RTT, then we do not record that
-estimate, and do not update BDP or the congestion control algorithms for that
-SENDME ack.
+If the circuit does not yet have an EWMA RTT or it is still in Slow Start, then
+no further checks are performed, and the RTT is used.
-Moreover, if a clock stall is detected by *any* circuit, this fact is
-cached, and this cached value is used on circuits for which we do not
-have enough data to compute the above heueristics. This cached value is
-also exported for use by the edge connection rate calculations done by
-[XON_ADVISORY].
+If the circuit has stored an EWMA RTT and has exited Slow Start, then on every
+SENDME ack, the new candidate RTT is compared to the stored EWMA RTT. If the
+new RTT is 5000 times larger than the EWMA RTT, then the circuit does not
+record that estimate, and does not update BDP or the congestion control
+algorithms for that SENDME ack. If the new RTT is 5000 times smaller than the
+EWMA RTT, then the circuit uses the globally cached value from above (ie: it
+assumes the clock is stalled *only* if there was previously *also* a 0-delta RTT).
+
+If both ratio checks pass, the globally cached clock stall state is set to
+false (no stall), and the RTT value is used.
2.1.2. N_EWMA Smoothing [N_EWMA_SMOOTHING]
@@ -173,10 +175,12 @@ Moving Average with alpha = 2/(N+1):
Note that the second rearranged form MUST be used in order to ensure that
rounding errors are handled in the same manner as other implementations.
-Flow control rate limiting uses this function
+Flow control rate limiting uses this function.
+
+During Slow Start, N is set to `cc_ewma_ss` for RTT estimation.
-For both RTT and SENDME BDP estimation, N is the number of SENDME acks
-between congestion window updates, divided by the value of consensus
+After Slow Start, for both RTT and SENDME BDP estimation, N is the number
+of SENDME acks between congestion window updates, divided by the value of consensus
parameter 'cc_ewma_cwnd_pct', and then capped at a max of 'cc_ewma_max',
but always at least 2:
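As a concrete sketch, the smoothing update with alpha = 2/(N+1) rearranges to the integer-friendly form below (integer division here is an assumption; the point is that all implementations must round identically):

```python
def n_ewma(prev_ewma, sample, N):
    # alpha = 2/(N+1) rearranges to:
    #   ewma = (2*sample + (N-1)*prev_ewma) / (N+1)
    # Integer division is used so independent implementations
    # agree on rounding.
    return (2 * sample + (N - 1) * prev_ewma) // (N + 1)
```

With N=2 (the `cc_ewma_ss` Slow Start default), two thirds of the weight goes to the newest sample.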
@@ -391,6 +395,13 @@ Simplifying:
BDP = cwnd * RTT_min / RTT_current_ewma
+The RTT_min for this calculation comes from the minimum RTT_current_ewma seen
+in the lifetime of this circuit. If the congestion window falls to
+`cc_cwnd_min` after slow start, implementations MAY choose to reset RTT_min
+for use in this calculation to either the RTT_current_ewma, or a
+percentile-weighted average between RTT_min and RTT_current_ewma, specified by
+`cc_rtt_reset_pct`. This helps with escaping starvation conditions.
+
The net effect of this estimation is to correct for any overshoot of
the cwnd over the actual BDP. It will obviously underestimate BDP if cwnd
is below BDP.
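A sketch of this estimator and the optional RTT_min reset; the direction of the `cc_rtt_reset_pct` weighting is an assumption, chosen so that the default of 100 reduces to RTT_current_ewma as the text allows:

```python
def bdp_estimate(cwnd, rtt_min, rtt_current_ewma):
    # BDP = cwnd * RTT_min / RTT_current_ewma
    return cwnd * rtt_min // rtt_current_ewma

def reset_rtt_min(rtt_min, rtt_current_ewma, cc_rtt_reset_pct=100):
    # Optional reset when cwnd falls to 'cc_cwnd_min' after slow start:
    # a percentile-weighted average between RTT_min and RTT_current_ewma.
    # pct=100 yields RTT_current_ewma; pct=0 keeps RTT_min unchanged.
    return (rtt_min * (100 - cc_rtt_reset_pct)
            + rtt_current_ewma * cc_rtt_reset_pct) // 100
```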
@@ -533,63 +544,151 @@ original cwnd estimator. So while this capability to change the BDP estimator
remains in the C implementation, we do not expect it to be used.
However, it was useful to use a local OR connection block at the time of
-SENDME ack arrival, as an immediate congestion signal.
+SENDME ack arrival, as an immediate congestion signal. Note that in C-Tor,
+this orconn_block state is not derived from any socket info, but instead is a
+heuristic that declares an orconn as blocked if any circuit cell queue
+exceeds the 'cellq_high' consensus parameter.
(As an additional optimization, we could also use the ECN signal described in
ideas/xxx-backward-ecn.txt, but this is not implemented. It is likely only of
any benefit during Slow Start, and even that benefit is likely small.)
-Congestion signals from RTT, blocked OR connections, or ECN are processed only
-once per congestion window. This is achieved through the next_cc_event flag,
-which is initialized to a cwnd worth of SENDME acks, and is decremented
-each ack. Congestion signals are only evaluated when it reaches 0.
-
-Here is the complete pseudocode for TOR_VEGAS, which is run every time
-an endpoint receives a SENDME ack:
-
- # Update acked cells
- inflight -= cc_sendme_inc
+During Slow Start, we use RFC3742 Limited Slow Start[32], which checks the
+congestion signals from RTT, blocked OR connections, or ECN every single
+SENDME ack. It also provides a `cc_sscap_*` parameter for each path length,
+which reduces the congestion window increment rate after it is crossed, as
+per the rules in RFC3742:
+ rfc3742_ss_inc(cwnd):
+ if cwnd <= cc_ss_cap_pathtype:
+ # Below the cap, we increment as per cc_cwnd_inc_pct_ss percent:
+ return round(cc_cwnd_inc_pct_ss*cc_sendme_inc/100)
+ else:
+ # This returns an increment equivalent to RFC3742, rounded,
+ # with a minimum of inc=1.
+ # From RFC3742:
+ # K = int(cwnd/(0.5 max_ssthresh));
+ # inc = int(MSS/K);
+ return MAX(round((cc_sendme_inc*cc_ss_cap_pathtype)/(2*cwnd)), 1);
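The increment rule above can be rendered as runnable Python; the cap here uses the exit default of 500 listed later, and `cc_cwnd_inc_pct_ss=100` is an assumed value:

```python
def rfc3742_ss_inc(cwnd, cc_sscap=500, cc_cwnd_inc_pct_ss=100, cc_sendme_inc=31):
    # RFC3742 Limited Slow Start increment, per the pseudocode above.
    if cwnd <= cc_sscap:
        # Below the cap: cc_cwnd_inc_pct_ss percent of a sendme_inc.
        return round(cc_cwnd_inc_pct_ss * cc_sendme_inc / 100)
    # Above the cap: RFC3742's 1/K growth (K = cwnd/(0.5*cap)), minimum 1.
    return max(round((cc_sendme_inc * cc_sscap) / (2 * cwnd)), 1)
```

Below the cap the window grows by a full `cc_sendme_inc` per ack; at cwnd=1000 the increment falls to 8 cells, and far above the cap it bottoms out at 1.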
+
+During both Slow Start and Steady State, if the congestion window is not full,
+we never increase the congestion window. We can still decrease it, or exit slow
+start, in this case. This is done to avoid causing overshoot. The original TCP
+Vegas addressed this problem by computing BDP and queue_use from inflight,
+instead of cwnd, but we found that approach to have significantly worse
+performance.
+
+Because C-Tor is single-threaded, multiple SENDME acks may arrive during one
+processing loop, before edge connections resume reading. For this reason,
+we provide two heuristics to provide some slack in determining the full
+condition. The first is to allow a gap between inflight and cwnd,
+parameterized as 'cc_cwnd_full_gap' multiples of 'cc_sendme_inc':
+ cwnd_is_full(cwnd, inflight):
+ if inflight + 'cc_cwnd_full_gap'*'cc_sendme_inc' >= cwnd:
+ return true
+ else
+ return false
+The second heuristic immediately resets the full state if it falls below
+'cc_cwnd_full_minpct' full:
+ cwnd_is_nonfull(cwnd, inflight):
+ if 100*inflight < 'cc_cwnd_full_minpct'*cwnd:
+ return true
+ else
+ return false
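These two heuristics are directly runnable; the defaults below follow the consensus-parameter listing later in this proposal (a gap of 2, from the 1-2 default range, and a 75% low watermark):

```python
def cwnd_is_full(cwnd, inflight, cc_cwnd_full_gap=2, cc_sendme_inc=31):
    # Full if inflight is within cc_cwnd_full_gap sendme_incs of cwnd.
    return inflight + cc_cwnd_full_gap * cc_sendme_inc >= cwnd

def cwnd_is_nonfull(cwnd, inflight, cc_cwnd_full_minpct=75):
    # Immediately non-full below the low-watermark percentage.
    return 100 * inflight < cc_cwnd_full_minpct * cwnd
```

Between the two thresholds (for example inflight=800 with cwnd=1000), neither condition fires and the cached full state is left unchanged.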
+
+This full status is cached once per cwnd if 'cc_cwnd_full_per_cwnd=1';
+otherwise it is cached once per cwnd update. These two helper functions
+determine the number of acks in each case:
+ SENDME_PER_CWND(cwnd):
+ return ((cwnd + 'cc_sendme_inc'/2)/'cc_sendme_inc')
+ CWND_UPDATE_RATE(cwnd, in_slow_start):
+ # In Slow Start, update every SENDME
+ if in_slow_start:
+ return 1
+ else: # Otherwise, update as per the 'cc_cwnd_inc_rate' (31)
+ return ((cwnd + 'cc_cwnd_inc_rate'*'cc_sendme_inc'/2)
+ / ('cc_cwnd_inc_rate'*'cc_sendme_inc'));
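In runnable form, with integer arithmetic assumed for the rounding terms (`cc_sendme_inc=31`; the `cc_cwnd_inc_rate=31` default is taken from the comment above and is an assumption):

```python
def sendme_per_cwnd(cwnd, cc_sendme_inc=31):
    # Number of SENDME acks per cwnd, rounded to nearest.
    return (cwnd + cc_sendme_inc // 2) // cc_sendme_inc

def cwnd_update_rate(cwnd, in_slow_start, cc_cwnd_inc_rate=31, cc_sendme_inc=31):
    # In Slow Start, update every SENDME; otherwise once per
    # cc_cwnd_inc_rate'th of the acks in a cwnd, rounded to nearest.
    if in_slow_start:
        return 1
    step = cc_cwnd_inc_rate * cc_sendme_inc
    return (cwnd + step // 2) // step
```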
+
+Shadow experimentation indicates that 'cc_cwnd_full_gap=2' and
+'cc_cwnd_full_per_cwnd=0' minimizes queue overshoot, whereas
+'cc_cwnd_full_per_cwnd=1' and 'cc_cwnd_full_gap=1' is slightly better
+for performance. Since there may be a difference between Shadow and live,
+we leave this parameterization in place.
+
+Here is the complete pseudocode for TOR_VEGAS with RFC3742, which is run every
+time an endpoint receives a SENDME ack. All variables are scoped to the
+circuit, unless prefixed by an underscore (local), or in single quotes
+(consensus parameters):
+
+ # Decrement counters that signal either an update or cwnd event
if next_cc_event:
next_cc_event--
+ if next_cwnd_event:
+ next_cwnd_event--
# Do not update anything if we detected a clock stall or jump,
# as per [CLOCK_HEURISTICS]
if clock_stalled_or_jumped:
+ inflight -= 'cc_sendme_inc'
return
+ if BDP > cwnd:
+ _queue_use = 0
+ else:
+ _queue_use = cwnd - BDP
+
+ if cwnd_is_full(cwnd, inflight):
+ cwnd_full = 1
+ else if cwnd_is_nonfull(cwnd, inflight):
+ cwnd_full = 0
+
+ if in_slow_start:
+ if _queue_use < 'cc_vegas_gamma' and not orconn_blocked:
+ # Only increase cwnd if the cwnd is full
+ if cwnd_full:
+ _inc = rfc3742_ss_inc(cwnd);
+ cwnd += _inc
+
+ # If the RFC3742 increment drops below steady-state increment
+ # over a full cwnd worth of acks, exit slow start.
+ if _inc*SENDME_PER_CWND(cwnd) <= 'cc_cwnd_inc'*'cc_cwnd_inc_rate':
+ in_slow_start = 0
+ else: # Limit hit. Exit Slow start (even if cwnd not full)
+ in_slow_start = 0
+ cwnd = BDP + 'cc_vegas_gamma'
+
+ # Provide an emergency hard-max on slow start:
+ if cwnd >= 'cc_ss_max':
+ cwnd = 'cc_ss_max'
+ in_slow_start = 0
+ else if next_cc_event == 0:
+ if _queue_use > 'cc_vegas_delta':
+ cwnd = BDP + 'cc_vegas_delta' - 'cc_cwnd_inc'
+ elif _queue_use > 'cc_vegas_beta' or orconn_blocked:
+ cwnd -= 'cc_cwnd_inc'
+ elif cwnd_full and _queue_use < 'cc_vegas_alpha':
+ # Only increment if queue is low, *and* the cwnd is full
+ cwnd += 'cc_cwnd_inc'
+
+ cwnd = MAX(cwnd, 'cc_circwindow_min')
+
+ # Specify next cwnd and cc update
if next_cc_event == 0:
- if BDP > cwnd:
- queue_use = 0
- else:
- queue_use = cwnd - BDP
-
- if in_slow_start:
- if queue_use < cc_vegas_gamma and not orconn_blocked:
- # Increment by slow start %, or at least 2 sendme_inc's worth
- cwnd = cwnd + MAX(cwnd * cc_cwnd_inc_pct_ss, 2*cc_sendme_inc)
- # If our BDP estimator thinks the BDP is still larger, use that
- cwnd = MAX(cwnd, BDP)
- else:
- cwnd = BDP + cc_vegas_gamma
- in_slow_start = 0
- else:
- if queue_use > cc_vegas_delta:
- cwnd = BDP + cc_vegas_delta - cc_cwnd_inc
- elif queue_use > cc_vegas_beta or orconn_blocked:
- cwnd -= cc_cwnd_inc
- elif queue_use < cc_vegas_alpha:
- cwnd += cc_cwnd_inc
-
- cwnd = MAX(cwnd, cc_circwindow_min)
+ next_cc_event = CWND_UPDATE_RATE(cwnd, in_slow_start)
+ if next_cwnd_event == 0:
+ next_cwnd_event = SENDME_PER_CWND(cwnd)
+
+ # Determine if we need to reset the cwnd_full state
+ # (Parameterized)
+ if 'cc_cwnd_full_per_cwnd' == 1:
+ if next_cwnd_event == SENDME_PER_CWND(cwnd):
+ cwnd_full = 0
+ else:
+ if next_cc_event == CWND_UPDATE_RATE(cwnd, in_slow_start):
+ cwnd_full = 0
- # Count the number of sendme acks until next update of cwnd,
- # rounded to nearest integer
- if in_slow_start:
- next_cc_event = round(cwnd / cc_sendme_inc)
- else
- # Never increment faster in slow start, only steady state.
- next_cc_event = round(cwnd / (cc_cwnd_inc_rate * cc_sendme_inc))
+ # Update acked cells
+ inflight -= 'cc_sendme_inc'
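For concreteness, the steady-state branch of the pseudocode can be isolated as a small function. The threshold defaults below are the exit-circuit values (OUTBUF_CELLS=62) from the parameter listing later on; the `cc_cwnd_inc` and `cc_circwindow_min` values are assumptions:

```python
def vegas_steady_state_cwnd(cwnd, bdp, queue_use, cwnd_full, orconn_blocked,
                            cc_vegas_alpha=2*62, cc_vegas_beta=4*62,
                            cc_vegas_delta=6*62, cc_cwnd_inc=31,
                            cc_circwindow_min=124):
    # delta: hard cap on queue use -- clamp cwnd back toward BDP
    if queue_use > cc_vegas_delta:
        cwnd = bdp + cc_vegas_delta - cc_cwnd_inc
    # beta, or a blocked orconn: back off
    elif queue_use > cc_vegas_beta or orconn_blocked:
        cwnd -= cc_cwnd_inc
    # alpha: grow only when the queue is low *and* the cwnd was full
    elif cwnd_full and queue_use < cc_vegas_alpha:
        cwnd += cc_cwnd_inc
    return max(cwnd, cc_circwindow_min)
```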
3.4. Tor NOLA: Direct BDP tracker [TOR_NOLA]
@@ -719,6 +818,12 @@ struct xon_cell {
u32 kbps_ewma;
}
+Parties SHOULD treat XON or XOFF cells with unrecognized versions as a
+protocol violation.
+
+In `xon_cell`, a zero value for `kbps_ewma` means that the stream's rate is
+unlimited. Parties therefore SHOULD NOT send a value of 0 to mean "do not send data".
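A parsing sketch for `xon_cell` (the wire format is u8 version then u32 kbps_ewma; big-endian byte order and version 0 as the recognized version are assumptions here):

```python
import struct

def parse_xon_cell(payload: bytes):
    # u8 version, u32 kbps_ewma (network byte order assumed).
    version, kbps_ewma = struct.unpack("!BI", payload[:5])
    if version != 0:  # assumed current version
        # Unrecognized version: protocol violation per the text above.
        raise ValueError("unrecognized xon_cell version")
    # kbps_ewma == 0 means "rate unlimited", not "stop sending".
    return {"rate_kbps": kbps_ewma, "unlimited": kbps_ewma == 0}
```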
+
4.1.1. XON/XOFF behavior
If the length of an edge outbuf queue exceeds the size provided in the
@@ -1274,6 +1379,25 @@ These are sorted in order of importance to tune, most important first.
congestion, to avoid overloading slow relays. Values of 10 or 20
were best.
+ cc_ewma_ss:
+ - Description: This specifies the N in N_EWMA smoothing of RTT during
+ Slow Start.
+ - Range: [2, INT32_MAX]
+ - Default: 2
+ - Tuning Values: [2,4]
+ - Shadow Tuning Results:
+ Setting this to 2 helped reduce overshoot during Slow Start.
+
+ cc_rtt_reset_pct:
+ - Description: Specifies a percentile-weighted average between RTT_min
+ and RTT_current_ewma, used to reset RTT_min when the
+ congestion window hits cwnd_min.
+ - Range: [0, 100]
+ - Default: 100
+ - Shadow Tuning Results:
+ cwnd_min is not hit in Shadow simulations, but it can be hit
+ on the live network while under DoS conditions, and with cheaters.
+
cc_cwnd_inc:
- Description: How much to increment the congestion window by during
steady state, every cwnd.
@@ -1369,18 +1493,23 @@ These are sorted in order of importance to tune, most important first.
6.5.3. Vegas Parameters
- cc_vegas_alpha_{exit,onion,sos,vg,sbws}:
- cc_vegas_beta_{exit,onion,sos,vg,sbws}:
- cc_vegas_gamma_{exit,onion,sos,vg,sbws}:
- cc_vegas_delta_{exit,onion,sos,vg,sbws}:
+ cc_vegas_alpha_{exit,onion,sbws}:
+ cc_vegas_beta_{exit,onion,sbws}:
+ cc_vegas_gamma_{exit,onion,sbws}:
+ cc_vegas_delta_{exit,onion,sbws}:
- Description: These parameters govern the number of cells
that [TOR_VEGAS] can detect in queue before reacting.
- Range: [0, 1000] (except delta, which has max of INT32_MAX)
- Defaults:
- alpha: path_len*2*31-31
- beta: path_len*2*31
- gamma: path_len*2*31
- delta: path_len*2*31 + 2*31
+ # OUTBUF_CELLS=62
+ cc_vegas_alpha_exit (2*OUTBUF_CELLS)
+ cc_vegas_beta_exit (4*OUTBUF_CELLS)
+ cc_vegas_gamma_exit (3*OUTBUF_CELLS)
+ cc_vegas_delta_exit (6*OUTBUF_CELLS)
+ cc_vegas_alpha_onion (3*OUTBUF_CELLS)
+ cc_vegas_beta_onion (7*OUTBUF_CELLS)
+ cc_vegas_gamma_onion (5*OUTBUF_CELLS)
+ cc_vegas_delta_onion (9*OUTBUF_CELLS)
- Tuning Notes:
The amount of queued cells that Vegas should tolerate is heavily
dependent upon competing congestion control algorithms. The specified
@@ -1399,20 +1528,60 @@ These are sorted in order of importance to tune, most important first.
need to verify that the path length multiplier still holds for
other types of circuits, specifically onion services.
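Multiplied out, the Vegas defaults above (OUTBUF_CELLS=62) give these cell counts:

```python
OUTBUF_CELLS = 62  # per the comment in the defaults above

vegas_defaults = {
    # path_type: (alpha, beta, gamma, delta) in cells
    "exit":  (2*OUTBUF_CELLS, 4*OUTBUF_CELLS, 3*OUTBUF_CELLS, 6*OUTBUF_CELLS),
    "onion": (3*OUTBUF_CELLS, 7*OUTBUF_CELLS, 5*OUTBUF_CELLS, 9*OUTBUF_CELLS),
}
# exit  -> alpha=124, beta=248, gamma=186, delta=372
# onion -> alpha=186, beta=434, gamma=310, delta=558
```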
- cc_vegas_bdp_mix:
- - Description: This parameter allows us to specify a weighted percent
- average between the cwnd BDP estimator and the piecewise
- estimator.
- - Range: [0, 100]
- - Default: 100 (use 100% CWND estimator)
- - Tuning Notes:
- The original Vegas literature specified the CWND estimator, but
- the Piecewise estimator is very likely more suited for our use
- case. This parameter allows us to average them. We're unlikely to
- need it.
+ cc_sscap_{exit,onion,sbws}:
+ - Description: These parameters describe the RFC3742 'cap', after which
+ congestion window increments are reduced. INT32_MAX disables
+ RFC3742.
+ - Range: [100, INT32_MAX]
+ - Defaults:
+ sbws: 400
+ exit: 500
+ onion: 600
- Shadow Tuning Results:
- Because the BDP estimator had ack compression and over-estimation,
- we the CWND estimator.
+ We picked these defaults based on the average congestion window
+ seen in Shadow sims for exits and onion service circuits.
+
+ cc_ss_max:
+ - Description: This parameter provides a hard-max on the congestion
+ window in slow start.
+ - Range: [500, INT32_MAX]
+ - Default: 5000
+ - Shadow Tuning Results:
+ The largest congestion window seen in Shadow is ~3000, so this was
+ set as a safety valve above that.
+
+ cc_cwnd_full_gap:
+ - Description: This parameter defines the integer number of
+ 'cc_sendme_inc' multiples of gap allowed between inflight and
+ cwnd, to still declare the cwnd full.
+ - Range: [0, INT32_MAX]
+ - Default: 1-2
+ - Shadow Tuning Results:
+ A value of 0 resulted in a slight loss of performance, and increased
+ variance in throughput. The optimal number here likely depends on
+ edgeconn inbuf size, edgeconn kernel buffer size, and eventloop
+ behavior.
+
+ cc_cwnd_full_minpct:
+ - Description: This parameter defines a low watermark in percent. If
+ inflight falls below this percent of cwnd, the congestion window
+ is immediately declared non-full.
+ - Range: [0, 100]
+ - Default: 75
+
+ cc_cwnd_full_per_cwnd:
+ - Description: This parameter governs how often a cwnd must be
+ full, in order to allow congestion window increase. If it is 1,
+ then the cwnd only needs to be full once per cwnd worth of acks.
+ If it is 0, then it must be full once every cwnd update (ie:
+ every SENDME).
+ - Range: [0, 1]
+ - Default: 1
+ - Shadow Tuning Results:
+ A value of 0 resulted in a slight loss of performance, and increased
+ variance in throughput. The optimal number here likely depends on
+ edgeconn inbuf size, edgeconn kernel buffer size, and eventloop
+ behavior.
6.5.4. NOLA Parameters
@@ -2137,3 +2306,6 @@ receive more data. It is sent to tell the sender to resume sending.
31. KIST: Kernel-Informed Socket Transport for Tor
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
+
+32. RFC3742 Limited Slow Start
+ https://datatracker.ietf.org/doc/html/rfc3742#section-2