Diffstat (limited to 'proposals/324-rtt-congestion-control.txt')
-rw-r--r--  proposals/324-rtt-congestion-control.txt | 326
1 file changed, 249 insertions, 77 deletions
diff --git a/proposals/324-rtt-congestion-control.txt b/proposals/324-rtt-congestion-control.txt
index c617983..952d140 100644
--- a/proposals/324-rtt-congestion-control.txt
+++ b/proposals/324-rtt-congestion-control.txt
@@ -144,20 +144,22 @@
 measured by the RTT estimator, and if these heuristics detect a stall or a
 jump, we do not use that value to update RTT or BDP, nor do we update any
 congestion control algorithm information that round.
 
-If the time delta is 0, that is always treated as a clock stall.
+If the time delta is 0, that is always treated as a clock stall, the RTT is
+not used, congestion control is not updated, and this fact is cached globally.
 
-If we have measured at least 'cc_bwe_min' RTT values or we have successfully
-exited slow start, then every sendme ACK, the new candidate RTT is compared to
-the stored EWMA RTT. If the new RTT is either 5000 times larger than the EWMA
-RTT, or 5000 times smaller than the stored EWMA RTT, then we do not record that
-estimate, and do not update BDP or the congestion control algorithms for that
-SENDME ack.
+If the circuit does not yet have an EWMA RTT or it is still in Slow Start, then
+no further checks are performed, and the RTT is used.
 
-Moreover, if a clock stall is detected by *any* circuit, this fact is
-cached, and this cached value is used on circuits for which we do not
-have enough data to compute the above heuristics. This cached value is
-also exported for use by the edge connection rate calculations done by
-[XON_ADVISORY].
+If the circuit has stored an EWMA RTT and has exited Slow Start, then every
+sendme ACK, the new candidate RTT is compared to the stored EWMA RTT. If the
+new RTT is 5000 times larger than the EWMA RTT, then the circuit does not
+record that estimate, and does not update BDP or the congestion control
+algorithms for that SENDME ack. If the new RTT is 5000 times smaller than the
+EWMA RTT, then the circuit uses the globally cached value from above (ie: it
+assumes the clock is stalled *only* if there was previously *also* a 0-delta
+RTT).
+
+If both ratio checks pass, the globally cached clock stall state is set to
+false (no stall), and the RTT value is used.
 
 2.1.2. N_EWMA Smoothing [N_EWMA_SMOOTHING]
 
@@ -173,10 +175,12 @@
 Moving Average with alpha = 2/(N+1):
 
 Note that the second rearranged form MUST be used in order to ensure that
 rounding errors are handled in the same manner as other implementations.
 
-Flow control rate limiting uses this function
+Flow control rate limiting uses this function.
+
+During Slow Start, N is set to `cc_ewma_ss`, for RTT estimation.
 
-For both RTT and SENDME BDP estimation, N is the number of SENDME acks
-between congestion window updates, divided by the value of consensus
+After Slow Start, for both RTT and SENDME BDP estimation, N is the number
+of SENDME acks between congestion window updates, divided by the value of consensus
 parameter 'cc_ewma_cwnd_pct', and then capped at a max of 'cc_ewma_max',
 but always at least 2:
 
@@ -391,6 +395,13 @@ Simplifying:
 
   BDP = cwnd * RTT_min / RTT_current_ewma
 
+The RTT_min for this calculation comes from the minimum RTT_current_ewma seen
+in the lifetime of this circuit. If the congestion window falls to
+`cc_cwnd_min` after slow start, implementations MAY choose to reset RTT_min
+for use in this calculation to either the RTT_current_ewma, or a
+percentile-weighted average between RTT_min and RTT_current_ewma, specified by
+`cc_rtt_reset_pct`. This helps with escaping starvation conditions.
+
 The net effect of this estimation is to correct for any overshoot of the
 cwnd over the actual BDP. It will obviously underestimate BDP if cwnd is
 below BDP.
 
@@ -533,63 +544,151 @@
 original cwnd estimator. So while this capability to change the BDP
 estimator remains in the C implementation, we do not expect it to be used.
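As a concreteness check, the rearranged N_EWMA form and the cwnd-based BDP
estimate above can be transcribed into Python. The function names are
illustrative, and integer floor division stands in for whatever rounding a
real implementation uses:

```python
def n_ewma_update(prev_ewma, value, N):
    # Rearranged form from the proposal: alpha = 2/(N+1), so
    # EWMA = (value*2 + (N-1)*prev_ewma) / (N+1). Computing it in this
    # order keeps rounding consistent across implementations.
    return (value * 2 + (N - 1) * prev_ewma) // (N + 1)

def bdp_estimate(cwnd, rtt_min, rtt_current_ewma):
    # BDP = cwnd * RTT_min / RTT_current_ewma, from the text above.
    # Corrects for cwnd overshoot; underestimates BDP when cwnd < BDP.
    return cwnd * rtt_min // rtt_current_ewma
```

For example, with cwnd=1000 cells, RTT_min=100ms, and RTT_current_ewma=200ms,
the estimator reports a BDP of 500 cells, pulling cwnd back toward the
actual pipe size.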
 However, it was useful to use a local OR connection block at the time of
-SENDME ack arrival, as an immediate congestion signal.
+SENDME ack arrival, as an immediate congestion signal. Note that in C-Tor,
+this orconn_block state is not derived from any socket info, but instead is a
+heuristic that declares an orconn as blocked if any circuit cell queue
+exceeds the 'cellq_high' consensus parameter.
 
 (As an additional optimization, we could also use the ECN signal described
 in ideas/xxx-backward-ecn.txt, but this is not implemented. It is likely
 only of any benefit during Slow Start, and even that benefit is likely
 small.)
 
-Congestion signals from RTT, blocked OR connections, or ECN are processed only
-once per congestion window. This is achieved through the next_cc_event flag,
-which is initialized to a cwnd worth of SENDME acks, and is decremented
-each ack. Congestion signals are only evaluated when it reaches 0.
-
-Here is the complete pseudocode for TOR_VEGAS, which is run every time
-an endpoint receives a SENDME ack:
-
-  # Update acked cells
-  inflight -= cc_sendme_inc
+During Slow Start, we use RFC3742 Limited Slow Start[32], which checks the
+congestion signals from RTT, blocked OR connections, or ECN every single
+SENDME ack. It also provides a `cc_sscap_*` parameter for each path length,
+which reduces the congestion window increment rate after it is crossed, as
+per the rules in RFC3742:
+
+  rfc3742_ss_inc(cwnd):
+    if cwnd <= cc_ss_cap_pathtype:
+      # Below the cap, we increment as per cc_cwnd_inc_pct_ss percent:
+      return round(cc_cwnd_inc_pct_ss*cc_sendme_inc/100)
+    else:
+      # This returns an increment equivalent to RFC3742, rounded,
+      # with a minimum of inc=1.
+      # From RFC3742:
+      #   K = int(cwnd/(0.5 max_ssthresh));
+      #   inc = int(MSS/K);
+      return MAX(round((cc_sendme_inc*cc_ss_cap_pathtype)/(2*cwnd)), 1);
+
+During both Slow Start, and Steady State, if the congestion window is not full,
+we never increase the congestion window.
+We can still decrease it, or exit slow start, in this case. This is done to
+avoid causing overshoot. The original TCP Vegas addressed this problem by
+computing BDP and queue_use from inflight, instead of cwnd, but we found that
+approach to have significantly worse performance.
+
+Because C-Tor is single-threaded, multiple SENDME acks may arrive during one
+processing loop, before edge connections resume reading. For this reason,
+we provide two heuristics to provide some slack in determining the full
+condition. The first is to allow a gap between inflight and cwnd,
+parameterized as 'cc_cwnd_full_gap' multiples of 'cc_sendme_inc':
+
+  cwnd_is_full(cwnd, inflight):
+    if inflight + 'cc_cwnd_full_gap'*'cc_sendme_inc' >= cwnd:
+      return true
+    else:
+      return false
+
+The second heuristic immediately resets the full state if it falls below
+'cc_cwnd_full_minpct' full:
+
+  cwnd_is_nonfull(cwnd, inflight):
+    if 100*inflight < 'cc_cwnd_full_minpct'*cwnd:
+      return true
+    else:
+      return false
+
+This full status is cached once per cwnd if 'cc_cwnd_full_per_cwnd=1';
+otherwise it is cached once per cwnd update. These two helper functions
+determine the number of acks in each case:
+
+  SENDME_PER_CWND(cwnd):
+    return ((cwnd + 'cc_sendme_inc'/2)/'cc_sendme_inc')
+
+  CWND_UPDATE_RATE(cwnd, in_slow_start):
+    # In Slow Start, update every SENDME
+    if in_slow_start:
+      return 1
+    else: # Otherwise, update as per the 'cc_inc_rate' (31)
+      return ((cwnd + 'cc_cwnd_inc_rate'*'cc_sendme_inc'/2)
+              / ('cc_cwnd_inc_rate'*'cc_sendme_inc'));
+
+Shadow experimentation indicates that 'cc_cwnd_full_gap=2' and
+'cc_cwnd_full_per_cwnd=0' minimizes queue overshoot, whereas
+'cc_cwnd_full_per_cwnd=1' and 'cc_cwnd_full_gap=1' is slightly better
+for performance. Since there may be a difference between Shadow and live,
+we leave this parameterization in place.
+
+Here is the complete pseudocode for TOR_VEGAS with RFC3742, which is run every
+time an endpoint receives a SENDME ack.
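Before the full pseudocode, the helper routines defined above can be
collected into an executable Python sketch. The consensus-parameter defaults
passed here ('cc_sendme_inc'=31, 'cc_cwnd_full_minpct'=75, etc.) are
illustrative stand-ins, not normative values:

```python
def rfc3742_ss_inc(cwnd, cc_sscap, cc_cwnd_inc_pct_ss=100, cc_sendme_inc=31):
    # RFC3742 limited slow start: constant percentage increment below the
    # cap, then an increment that decays as cap/(2*cwnd), floored at 1.
    if cwnd <= cc_sscap:
        return round(cc_cwnd_inc_pct_ss * cc_sendme_inc / 100)
    return max(round((cc_sendme_inc * cc_sscap) / (2 * cwnd)), 1)

def cwnd_is_full(cwnd, inflight, cc_cwnd_full_gap=1, cc_sendme_inc=31):
    # Full if inflight is within cc_cwnd_full_gap SENDMEs' worth of cwnd.
    return inflight + cc_cwnd_full_gap * cc_sendme_inc >= cwnd

def cwnd_is_nonfull(cwnd, inflight, cc_cwnd_full_minpct=75):
    # Integer form of inflight/cwnd < cc_cwnd_full_minpct/100.
    return 100 * inflight < cc_cwnd_full_minpct * cwnd

def sendme_per_cwnd(cwnd, cc_sendme_inc=31):
    # SENDME acks per cwnd, rounded to the nearest integer.
    return (cwnd + cc_sendme_inc // 2) // cc_sendme_inc

def cwnd_update_rate(cwnd, in_slow_start, cc_cwnd_inc_rate=1,
                     cc_sendme_inc=31):
    # Every SENDME in Slow Start; otherwise once per cwnd worth of acks,
    # divided by cc_cwnd_inc_rate, rounded to the nearest integer.
    if in_slow_start:
        return 1
    return ((cwnd + cc_cwnd_inc_rate * cc_sendme_inc // 2)
            // (cc_cwnd_inc_rate * cc_sendme_inc))
```

Note how the two full-window heuristics interlock: with cwnd=124 and the
defaults above, inflight=93 still counts as full (93 + 31 >= 124), while
inflight below 75% of cwnd immediately clears the full state.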
+All variables are scoped to the circuit, unless prefixed by an underscore
+(local), or in single quotes (consensus parameters):
+
+  # Decrement counters that signal either an update or cwnd event
   if next_cc_event:
     next_cc_event--
+  if next_cwnd_event:
+    next_cwnd_event--
 
   # Do not update anything if we detected a clock stall or jump,
   # as per [CLOCK_HEURISTICS]
   if clock_stalled_or_jumped:
+    inflight -= 'cc_sendme_inc'
     return
 
+  if BDP > cwnd:
+    _queue_use = 0
+  else:
+    _queue_use = cwnd - BDP
+
+  if cwnd_is_full(cwnd, inflight):
+    cwnd_full = 1
+  else if cwnd_is_nonfull(cwnd, inflight):
+    cwnd_full = 0
+
+  if in_slow_start:
+    if _queue_use < 'cc_vegas_gamma' and not orconn_blocked:
+      # Only increase cwnd if the cwnd is full
+      if cwnd_full:
+        _inc = rfc3742_ss_inc(cwnd);
+        cwnd += _inc
+
+        # If the RFC3742 increment drops below steady-state increment
+        # over a full cwnd worth of acks, exit slow start.
+        if _inc*SENDME_PER_CWND(cwnd) <= 'cc_cwnd_inc'*'cc_cwnd_inc_rate':
+          in_slow_start = 0
+    else: # Limit hit. Exit Slow start (even if cwnd not full)
+      in_slow_start = 0
+      cwnd = BDP + 'cc_vegas_gamma'
+
+    # Provide an emergency hard-max on slow start:
+    if cwnd >= 'cc_ss_max':
+      cwnd = 'cc_ss_max'
+      in_slow_start = 0
+  else if next_cc_event == 0:
+    if _queue_use > 'cc_vegas_delta':
+      cwnd = BDP + 'cc_vegas_delta' - 'cc_cwnd_inc'
+    elif _queue_use > 'cc_vegas_beta' or orconn_blocked:
+      cwnd -= 'cc_cwnd_inc'
+    elif cwnd_full and _queue_use < 'cc_vegas_alpha':
+      # Only increment if queue is low, *and* the cwnd is full
+      cwnd += 'cc_cwnd_inc'
+
+    cwnd = MAX(cwnd, 'cc_circwindow_min')
+
+  # Specify next cwnd and cc update
   if next_cc_event == 0:
-    if BDP > cwnd:
-      queue_use = 0
-    else:
-      queue_use = cwnd - BDP
-
-    if in_slow_start:
-      if queue_use < cc_vegas_gamma and not orconn_blocked:
-        # Increment by slow start %, or at least 2 sendme_inc's worth
-        cwnd = cwnd + MAX(cwnd * cc_cwnd_inc_pct_ss, 2*cc_sendme_inc)
-        # If our BDP estimator thinks the BDP is still larger, use that
-        cwnd = MAX(cwnd, BDP)
-      else:
-        cwnd = BDP + cc_vegas_gamma
-        in_slow_start = 0
-    else:
-      if queue_use > cc_vegas_delta:
-        cwnd = BDP + cc_vegas_delta - cc_cwnd_inc
-      elif queue_use > cc_vegas_beta or orconn_blocked:
-        cwnd -= cc_cwnd_inc
-      elif queue_use < cc_vegas_alpha:
-        cwnd += cc_cwnd_inc
-
-      cwnd = MAX(cwnd, cc_circwindow_min)
+    next_cc_event = CWND_UPDATE_RATE(cwnd)
+  if next_cwnd_event == 0:
+    next_cwnd_event = SENDME_PER_CWND(cwnd)
+
+  # Determine if we need to reset the cwnd_full state
+  # (Parameterized)
+  if 'cc_cwnd_full_per_cwnd' == 1:
+    if next_cwnd_event == SENDME_PER_CWND(cwnd):
+      cwnd_full = 0
+  else:
+    if next_cc_event == CWND_UPDATE_RATE(cwnd):
+      cwnd_full = 0
 
-    # Count the number of sendme acks until next update of cwnd,
-    # rounded to nearest integer
-    if in_slow_start:
-      next_cc_event = round(cwnd / cc_sendme_inc)
-    else
-      # Never increment faster in slow start, only steady state.
-      next_cc_event = round(cwnd / (cc_cwnd_inc_rate * cc_sendme_inc))
+
+  # Update acked cells
+  inflight -= 'cc_sendme_inc'
 
 3.4. Tor NOLA: Direct BDP tracker [TOR_NOLA]
 
@@ -719,6 +818,12 @@ struct xon_cell {
     u32 kbps_ewma;
   }
 
+Parties SHOULD treat XON or XOFF cells with unrecognized versions as a
+protocol violation.
+
+In `xon_cell`, a zero value for `kbps_ewma` means that the stream's rate is
+unlimited. Parties should therefore not send "0" to mean "do not send data".
+
 4.1.1. XON/XOFF behavior
 
 If the length of an edge outbuf queue exceeds the size provided in the
 
@@ -1274,6 +1379,25 @@ These are sorted in order of importance to tune, most important first.
     congestion, to avoid overloading slow relays. Values of 10 or 20
     were best.
 
+  cc_ewma_ss:
+    - Description: This specifies the N in N_EWMA smoothing of RTT during
+                   Slow Start.
+    - Range: [2, INT32_MAX]
+    - Default: 2
+    - Tuning Values: [2,4]
+    - Shadow Tuning Results:
+        Setting this to 2 helped reduce overshoot during Slow Start.
+
+  cc_rtt_reset_pct:
+    - Description: Describes a percentile average between RTT_min and
+                   RTT_current_ewma, for use to reset RTT_min, when the
+                   congestion window hits cwnd_min.
+    - Range: [0, 100]
+    - Default: 100
+    - Shadow Tuning Results:
+        cwnd_min is not hit in Shadow simulations, but it can be hit
+        on the live network while under DoS conditions, and with cheaters.
+
   cc_cwnd_inc:
     - Description: How much to increment the congestion window by during
                    steady state, every cwnd.
 
@@ -1369,18 +1493,23 @@ These are sorted in order of importance to tune, most important first.
 6.5.3. Vegas Parameters
 
-  cc_vegas_alpha_{exit,onion,sos,vg,sbws}:
-  cc_vegas_beta_{exit,onion,sos,vg,sbws}:
-  cc_vegas_gamma_{exit,onion,sos,vg,sbws}:
-  cc_vegas_delta_{exit,onion,sos,vg,sbws}:
+  cc_vegas_alpha_{exit,onion,sbws}:
+  cc_vegas_beta_{exit,onion,sbws}:
+  cc_vegas_gamma_{exit,onion,sbws}:
+  cc_vegas_delta_{exit,onion,sbws}:
     - Description: These parameters govern the number of cells that
                    [TOR_VEGAS] can detect in queue before reacting.
     - Range: [0, 1000] (except delta, which has max of INT32_MAX)
     - Defaults:
-        alpha: path_len*2*31-31
-        beta: path_len*2*31
-        gamma: path_len*2*31
-        delta: path_len*2*31 + 2*31
+        # OUTBUF_CELLS=62
+        cc_vegas_alpha_exit (2*OUTBUF_CELLS)
+        cc_vegas_beta_exit (4*OUTBUF_CELLS)
+        cc_vegas_gamma_exit (3*OUTBUF_CELLS)
+        cc_vegas_delta_exit (6*OUTBUF_CELLS)
+        cc_vegas_alpha_onion (3*OUTBUF_CELLS)
+        cc_vegas_beta_onion (7*OUTBUF_CELLS)
+        cc_vegas_gamma_onion (5*OUTBUF_CELLS)
+        cc_vegas_delta_onion (9*OUTBUF_CELLS)
     - Tuning Notes:
       The amount of queued cells that Vegas should tolerate is heavily
       dependent upon competing congestion control algorithms. The specified
 
@@ -1399,20 +1528,60 @@ These are sorted in order of importance to tune, most important first.
       need to verify that the path length multiplier still holds for other
       types of circuits, specifically onion services.
 
-  cc_vegas_bdp_mix:
-    - Description: This parameter allows us to specify a weighted percent
-                   average between the cwnd BDP estimator and the piecewise
-                   estimator.
-    - Range: [0, 100]
-    - Default: 100 (use 100% CWND estimator)
-    - Tuning Notes:
-      The original Vegas literature specified the CWND estimator, but
-      the Piecewise estimator is very likely more suited for our use
-      case. This parameter allows us to average them. We're unlikely to
-      need it.
+  cc_sscap_{exit,onion,sbws}:
+    - Description: These parameters describe the RFC3742 'cap', after which
+                   congestion window increments are reduced. INT32_MAX
+                   disables RFC3742.
+    - Range: [100, INT32_MAX]
+    - Defaults:
+        sbws: 400
+        exit: 500
+        onion: 600
     - Shadow Tuning Results:
-      Because the BDP estimator had ack compression and over-estimation,
-      we used the CWND estimator.
+        We picked these defaults based on the average congestion window
+        seen in Shadow sims for exits and onion service circuits.
+
+  cc_ss_max:
+    - Description: This parameter provides a hard-max on the congestion
+                   window in slow start.
+    - Range: [500, INT32_MAX]
+    - Default: 5000
+    - Shadow Tuning Results:
+        The largest congestion window seen in Shadow is ~3000, so this was
+        set as a safety valve above that.
+
+  cc_cwnd_full_gap:
+    - Description: This parameter defines the integer number of
+                   'cc_sendme_inc' multiples of gap allowed between inflight
+                   and cwnd, to still declare the cwnd full.
+    - Range: [0, INT32_MAX]
+    - Default: 1-2
+    - Shadow Tuning Results:
+        A value of 0 resulted in a slight loss of performance, and increased
+        variance in throughput. The optimal number here likely depends on
+        edgeconn inbuf size, edgeconn kernel buffer size, and eventloop
+        behavior.
+
+  cc_cwnd_full_minpct:
+    - Description: This parameter defines a low watermark in percent. If
+                   inflight falls below this percent of cwnd, the congestion
+                   window is immediately declared non-full.
+    - Range: [0, 100]
+    - Default: 75
+
+  cc_cwnd_full_per_cwnd:
+    - Description: This parameter governs how often a cwnd must be
+                   full, in order to allow congestion window increase. If it
+                   is 1, then the cwnd only needs to be full once per cwnd
+                   worth of acks. If it is 0, then it must be full once every
+                   cwnd update (ie: every SENDME).
+    - Range: [0, 1]
+    - Default: 1
+    - Shadow Tuning Results:
+        A value of 0 resulted in a slight loss of performance, and increased
+        variance in throughput. The optimal number here likely depends on
+        edgeconn inbuf size, edgeconn kernel buffer size, and eventloop
+        behavior.
 
 6.5.4. NOLA Parameters
 
@@ -2137,3 +2306,6 @@
 receive more data.
 It is sent to tell the sender to resume sending.
 
 31. KIST: Kernel-Informed Socket Transport for Tor
     https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
+
+32. RFC3742 Limited Slow Start
+    https://datatracker.ietf.org/doc/html/rfc3742#section-2