Filename: 324-rtt-congestion-control.txt
Title: RTT-based Congestion Control for Tor
Author: Mike Perry
Created: 02 July 2020
Status: Open

0. Motivation [MOTIVATION]

This proposal specifies how to incrementally deploy RTT-based congestion control and improved queue management in Tor. It is written to allow us to first deploy the system only at Exit relays, and then incrementally improve the system by upgrading intermediate relays.

Lack of congestion control is the reason why Tor has an inherent speed limit of about 500KB/sec for downloads and uploads via Exits, and even slower for onion services. Because our stream SENDME windows are fixed at 500 cells per stream, and only ~500 bytes can be sent in one cell, the max speed of a single Tor stream is 500*500/circuit_latency. This works out to about 500KB/sec max sustained throughput for a single download, even if circuit latency is as low as 500ms.

Because onion service paths are more than twice the length of Exit paths (and thus have more than twice the circuit latency), onion service throughput will always be less than half of Exit throughput, until we deploy proper congestion control with dynamic windows.

Proper congestion control will remove this speed limit for both Exits and onion services, as well as reduce memory requirements for fast Tor relays, by reducing queue lengths.

The high-level plan is to use Round Trip Time (RTT) as a primary congestion signal, and compare the performance of two different congestion window update algorithms that both use RTT as a congestion signal.

The combination of RTT-based congestion signaling, a congestion window update algorithm, and Circuit-EWMA will get us most if not all of the benefits we seek, and only requires clients and Exits to upgrade to use it. Once this is deployed, circuit bandwidth will no longer be capped at ~500KB/sec by the fixed SENDME window sizes; queue latency will fall significantly; memory requirements at relays should plummet; and transient bottlenecks in the network should dissipate.

Extended background information on the choices made in this proposal can be found at:

  https://lists.torproject.org/pipermail/tor-dev/2020-June/014343.html
  https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html

An exhaustive list of citations for further reading is in Section [CITATIONS].

1. Overview [OVERVIEW]

This proposal has seven main sections, after this overview. These sections are referenced [IN_ALL_CAPS] rather than by number, for easy searching.

Section [CONGESTION_SIGNALS] specifies how to use Tor's SENDME flow control cells to measure circuit RTT, for use as an implicit congestion signal. It also mentions an explicit congestion signal, which can be used as a future optimization once all relays upgrade.

Section [CONTROL_ALGORITHMS] specifies candidate congestion window update algorithms, which will be compared for performance in simulation in Shadow, as well as evaluated on the live network, and tuned via consensus parameters listed in [CONSENSUS_PARAMETERS].

Section [FLOW_CONTROL] specifies how to handle back-pressure when one of the endpoints stops reading data, but data is still arriving. In particular, it specifies what to do with streams that are not being read by an application, but still have data arriving on them.

Section [SYSTEM_INTERACTIONS] describes how congestion control will interact with onion services, circuit padding, and conflux-style traffic splitting.
Section [EVALUATION] describes how we will evaluate and tune our options for control algorithms and their parameters.

Section [PROTOCOL_SPEC] describes the specific cell formats and descriptor changes needed by this proposal.

Section [SECURITY_ANALYSIS] provides information about the DoS and traffic analysis properties of congestion control.

2. Congestion Signals [CONGESTION_SIGNALS]

In order to detect congestion at relays on a circuit, Tor will use circuit Round Trip Time (RTT) measurement. This signal will be used in slightly different ways in our various [CONTROL_ALGORITHMS], which will be compared against each other for optimum performance in Shadow and on the live network.

To facilitate this, we will also change SENDME accounting logic slightly. These changes only require clients, exits, and dirauths to update.

As a future optimization, it is possible to send a direct ECN congestion signal. This signal *will* require all relays on a circuit to upgrade to support it, but it will reduce congestion by making the first congestion event on a circuit much faster to detect. To reduce confusion and complexity of this proposal, this signal has been moved to the ideas repository, under xxx-backward-ecn.txt [BACKWARD_ECN].

2.1. RTT measurement

Recall that Tor clients, exits, and onion services send RELAY_COMMAND_SENDME relay cells every CIRCWINDOW_INCREMENT (100) cells of received RELAY_COMMAND_DATA. This allows those endpoints to measure the current circuit RTT, by measuring the amount of time between the sending of every 100th data cell and the arrival of the SENDME command that the other endpoint immediately sends to ack that 100th cell.

Circuits will record the current RTT measurement as a field in their circuit_t data structure. The current RTT is N-EWMA smoothed[27] over a 'cc_ewma_cwnd_cnt' multiple of congestion windows worth of SENDME acks. Circuits will also record the minimum and maximum RTT seen so far.

Algorithms that make use of this RTT measurement for congestion window update are specified in [CONTROL_ALGORITHMS].

2.1.1. Clock Jump Heuristics [CLOCK_HEURISTICS]

The timestamps for RTT (and BDP) are measured using Tor's monotime_absolute_usec() API. This API is designed to provide a monotonic clock that only moves forward. However, depending on the underlying system clock, this may result in the same timestamp value being returned for long periods of time, which would result in RTT 0-values. Alternatively, the clock may jump forward, resulting in abnormally large RTT values.

To guard against this, we perform a series of heuristic checks on the time delta measured by the RTT estimator, and if these heuristics detect a stall or a jump, we do not use that value to update RTT or BDP, nor do we update any congestion control algorithm information that round.

If the time delta is 0, that is always treated as a clock stall.

If we have measured at least 'cc_bwe_min' RTT values, or we have successfully exited slow start, then every SENDME ack, the new candidate RTT is compared to the stored EWMA RTT. If the new RTT is either 100 times larger than the EWMA RTT, or 100 times smaller than the stored EWMA RTT, then we do not record that estimate, and do not update BDP or the congestion control algorithms for that SENDME ack.
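The sketch below restates these heuristic checks in the same pseudocode style used for the congestion control algorithms later in this proposal. The function and variable names are illustrative only (not the C Tor implementation), and the cc_bwe_min value of 5 is simply the default listed in [CONSENSUS_PARAMETERS]:

    CC_BWE_MIN = 5   # consensus parameter 'cc_bwe_min' (default from below)

    def clock_stalled_or_jumped(delta_usec, ewma_rtt_usec,
                                num_rtt_measurements, in_slow_start):
      # A time delta of 0 is always treated as a clock stall.
      if delta_usec == 0:
        return True

      # Only apply the ratio checks once we have at least 'cc_bwe_min'
      # RTT measurements, or have successfully exited slow start.
      if num_rtt_measurements < CC_BWE_MIN and in_slow_start:
        return False

      # Deltas 100x larger or 100x smaller than the stored EWMA RTT are
      # treated as clock jumps or stalls, and are not used to update RTT,
      # BDP, or the congestion control algorithms for this SENDME ack.
      if ewma_rtt_usec and (delta_usec > 100 * ewma_rtt_usec or
                            100 * delta_usec < ewma_rtt_usec):
        return True

      return False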
2.2. SENDME behavior changes

We will make five major changes to SENDME behavior to aid in computing and using RTT as a congestion signal.

First, we will need to establish a ProtoVer of "FlowCtrl=2" to signal support by Exits for the new SENDME format and congestion control algorithm mechanisms. We will need a similar announcement in the onion service descriptors of services that support congestion control.

Second, we will turn CIRCWINDOW_INCREMENT into a consensus parameter cc_sendme_inc, instead of using a hardcoded value of 100 cells. It is likely that more frequent SENDME cells will provide quicker reaction to congestion, since the RTT will be measured more often. If experimentation in Shadow shows that more frequent SENDMEs reduce congestion and improve performance but add significant overhead, we can reduce SENDME overhead by allowing SENDME cells to carry stream data as well, using Proposal 325.

TODO: Refer to or include final version of the negotiation proposal for this: https://gitlab.torproject.org/tpo/core/tor/-/issues/40377

Third, all end-to-end relay cells except RELAY_COMMAND_DROP and SENDME itself will count towards SENDME cell counts. The details behind how these cells are handled are addressed in section [SYSTEM_INTERACTIONS].

XXX: Hrmm, this is not strictly necessary for this proposal to function. It may make conflux easier though, but we can defer it to then. The current implementation only counts RELAY_COMMAND_DATA towards sendme acks, which is the same behavior as the old fixed sendme implementation.

Fourth, authenticated SENDMEs can remain as-is in terms of protocol behavior, but will require some implementation updates to account for variable window sizes and variable SENDME pacing. In particular, the sendme_last_digests list for auth sendmes needs updated checks for larger windows and CIRCWINDOW_INCREMENT changes. Other functions to examine include:
  - circuit_sendme_cell_is_next()
  - sendme_record_cell_digest_on_circ()
  - sendme_record_received_cell_digest()
  - sendme_record_sending_cell_digest()
  - send_randomness_after_n_cells

Fifth, stream-level SENDMEs will be eliminated. Details on handling streams and backpressure are covered in [FLOW_CONTROL].

3. Congestion Window Update Algorithms [CONTROL_ALGORITHMS]

In general, the goal of congestion control is to ensure full and fair utilization of the capacity of a network path -- in the case of Tor, the spare capacity of the circuit. This is accomplished by setting the congestion window to target the Bandwidth-Delay Product[26] (BDP) of the circuit in one way or another, so that the total data outstanding is roughly equal to the actual transit capacity of the circuit.

There are several ways to update a congestion window to target the BDP. Some use direct BDP estimation, whereas others use backoff properties to achieve this. We specify three BDP estimation algorithms in the [BDP_ESTIMATION] sub-section, and three congestion window update algorithms in [TOR_WESTWOOD], [TOR_VEGAS], and [TOR_NOLA].

Note that the congestion window update algorithms differ slightly from the background tor-dev mails[1,2], due to corrections and improvements. Hence they have been given different names than in those two mails. The third algorithm, [TOR_NOLA], simply uses the latest BDP estimate directly as its congestion window.

These algorithms will be evaluated by running Shadow simulations, to help determine parameter ranges, but experimentation on the live network will be required to determine which of these algorithms performs best when in competition with our current SENDME behavior, as used by real network traffic.
This experimentation and tuning is detailed in section [EVALUATION].

All of these algorithms have rules to update 'cwnd' - the current congestion window, which starts out at a value controlled by the consensus parameter 'cc_cwnd_init'. The algorithms also keep track of 'inflight', which is a count of the number of cells currently not yet acked by a SENDME. The algorithm MUST ensure that cells cease being sent if 'cwnd - inflight <= 0'. Note that this value CAN become negative in the case where the cwnd is reduced while packets are inflight.

While these algorithms are in use, updates and checks of the current 'package_window' field are disabled. Where a 'package_window' value is still needed, for example by cell packaging schedulers, 'cwnd - inflight' is used (with checks to return 0 in the event of negative values).

The 'deliver_window' field is still used to decide when to send a SENDME. In C Tor, the deliver window is initially set at 1000, but it never gets below 900, because authenticated sendmes (Proposal 289) require that we send only one SENDME at a time, and send it immediately after 100 cells are received. This property turns out to be very useful for [BDP_ESTIMATION].

Implementation of the different algorithms should be very simple - each algorithm has its own update function, selected via the consensus parameter 'cc_alg'. For C Tor's current flow control, these functions are defined in sendme.c, and are called by relay.c:
  - sendme_note_circuit_data_packaged()
  - sendme_circuit_data_received()
  - sendme_circuit_consider_sending()
  - sendme_process_circuit_level()

Despite the complexity of the following algorithms in their TCP implementations, their Tor equivalents are extremely simple, each being just a handful of lines of C. This simplicity is possible because Tor does not have to deal with out-of-order delivery, packet drops, duplicate packets, and other network issues at the circuit layer, due to the fact that Tor circuits already have reliability and in-order delivery at that layer.

We are also removing the aspects of TCP that cause the congestion algorithm to reset into slow start after being idle for too long, or after too many congestion signals. These are deliberate choices that simplify the algorithms and also should provide better performance for Tor workloads.

In all cases, variables in these sections are either consensus parameters specified in [CONSENSUS_PARAMETERS], or scoped to the circuit. Consensus parameters for congestion control are all prefixed by cc_. Everything else is circuit-scoped.

3.1. Estimating Bandwidth-Delay Product [BDP_ESTIMATION]

At a high level, there are three main ways to estimate the Bandwidth-Delay Product: by using the current congestion window and RTT, by using the inflight cells and RTT, and by measuring SENDME arrival rate. All three estimators are updated on every SENDME ack arrival.

The SENDME arrival rate is the most accurate way to estimate BDP, but it requires averaging over multiple SENDME acks to do so. The congestion window and inflight estimates rely on the congestion algorithm more or less correctly tracking an approximation of the BDP, and then use current and minimum RTT to compensate for overshoot.

The SENDME estimator tends to be accurate after ~3-5 SENDME acks. The cwnd and inflight estimators tend to be accurate once the congestion window exceeds BDP.

We specify all three because they are all useful in various cases.
These cases are broken up and combined to form the Piecewise BDP estimator.

The SENDME estimator is smoothed through N-EWMA averaging[27], to smooth out temporary effects. This is accomplished by using an EWMA function with alpha = 2/(N+1):

  N_EWMA = BDP*2/(N+1) + N_EWMA_prev*(N-1)/(N+1)

N is specified in terms of the number of current congestion windows worth of SENDMEs to average, via the consensus parameter 'cc_ewma_cwnd_cnt'.

The cwnd and inflight estimators also use N-EWMA smoothing in the same way, but only smooth the current RTT estimate, not the congestion window value.

3.1.1. SENDME arrival BDP estimation

The most accurate way to estimate BDP is to measure the amount of time between SENDME acks. In this period of time, we know that the endpoint successfully received 'cc_sendme_inc' cells. This means that the bandwidth of the circuit is then calculated as:

  BWE = cc_sendme_inc/sendme_ack_timestamp_delta

The bandwidth-delay product of the circuit is calculated by multiplying this bandwidth estimate by the *minimum* RTT of the circuit (to avoid counting queue time):

  BDP = BWE * RTT_min

In order to minimize the effects of ack compression (aka SENDME responses becoming close to one another due to queue delay on the return), we maintain a history of a full congestion window worth of previous SENDME timestamps. With this, the calculation becomes:

  BWE = (num_sendmes-1) * cc_sendme_inc / num_sendme_timestamp_delta
  BDP = BWE * RTT_min

Note that because we are counting the number of cells *between* the first and last sendme of the congestion window, we must subtract 1 from the number of sendmes actually received. Over the time period between the first and last sendme of the congestion window, the other endpoint successfully read (num_sendmes-1) * cc_sendme_inc cells.

Furthermore, because the timestamps are microseconds, to avoid integer truncation, we compute the BDP using multiplication first:

  BDP = (num_sendmes-1) * cc_sendme_inc * RTT_min / num_sendme_timestamp_delta

Note that the SENDME BDP estimation will only work after two (2) SENDME acks have been received. Additionally, it tends not to be stable unless at least five (5) SENDME acks are used, due to ack compression. This is controlled by the 'cc_bwe_min' consensus parameter. Finally, if [CLOCK_HEURISTICS] have detected a clock jump or stall, this estimator is not updated.

If all edge connections no longer have data available to send on a circuit, and all circuit queues have drained without blocking the local orconn, we stop updating this BDP estimate and discard old timestamps. However, we retain the actual estimator value.

3.1.2. Congestion Window BDP Estimation

Assuming that the current congestion window is at or above the current BDP, the bandwidth estimate is the current congestion window size divided by the RTT estimate:

  BWE = cwnd / RTT_current_ewma

The BDP estimate is computed by multiplying the bandwidth estimate by the *minimum* circuit latency:

  BDP = BWE * RTT_min

Simplifying:

  BDP = cwnd * RTT_min / RTT_current_ewma

The net effect of this estimation is to correct for any overshoot of the cwnd over the actual BDP. It will obviously underestimate BDP if cwnd is below BDP.
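For concreteness, here is a minimal sketch of the two estimators described so far, in the same pseudocode style as the rest of this proposal. All names are illustrative and not from C Tor; timestamps are assumed to be in microseconds, and the multiplication-before-division ordering mirrors the integer truncation note above:

    def bdp_from_sendmes(sendme_timestamps, cc_sendme_inc, rtt_min):
      # Needs at least two SENDME ack timestamps to measure an arrival
      # rate (and at least 'cc_bwe_min' acks to be considered stable).
      if len(sendme_timestamps) < 2:
        return None
      delta = sendme_timestamps[-1] - sendme_timestamps[0]
      if delta == 0:
        return None
      # (num_sendmes-1)*cc_sendme_inc cells were read between the first
      # and last timestamp; multiply before dividing to avoid truncation.
      return (len(sendme_timestamps) - 1) * cc_sendme_inc * rtt_min // delta

    def bdp_from_cwnd(cwnd, rtt_min, rtt_current_ewma):
      # Corrects for cwnd overshoot; underestimates BDP if cwnd < BDP.
      return cwnd * rtt_min // rtt_current_ewma

    def n_ewma(new_value, prev_ewma, n):
      # N-EWMA with alpha = 2/(N+1), equivalent to the formula above
      # (computed with a single division to limit integer truncation).
      return (new_value * 2 + prev_ewma * (n - 1)) // (n + 1)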
3.1.3. Inflight BDP Estimation

Similar to the congestion window based estimation, the inflight estimation uses the current inflight packet count to derive BDP. It also subtracts local circuit queue use from the inflight packet count. This means it will be strictly less than or equal to the cwnd version:

  BDP = (inflight - circ.chan_cells.n) * RTT_min / RTT_current_ewma

If all edge connections no longer have data available to send on a circuit, and all circuit queues have drained without blocking the local orconn, we stop updating this BDP estimate, because there are not sufficient inflight cells to properly estimate BDP.

3.1.4. Piecewise BDP estimation

The piecewise BDP estimation is used to help respond more quickly in the event the local OR connection is blocked, which indicates congestion somewhere along the path from the client to the guard (or between Exit and Middle). In this case, it takes the minimum of the inflight and SENDME estimators.

When the local OR connection is not blocked, this estimator uses the max of the SENDME and cwnd estimator values.

When the SENDME estimator has not gathered enough data, or has cleared its estimates based on lack of edge connection use, this estimator uses the Congestion Window BDP estimator value.

3.2. Tor Westwood: TCP Westwood using RTT signaling [TOR_WESTWOOD]

  http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood
  http://nrlweb.cs.ucla.edu/nrlweb/publication/download/99/2001-mobicom-0.pdf
  http://cpham.perso.univ-pau.fr/TCP/ccr_v31.pdf
  https://c3lab.poliba.it/images/d/d7/Westwood_linux.pdf

Recall that TCP Westwood is basically TCP Reno, but it uses BDP estimates for "Fast recovery" after a congestion signal arrives. We will also be using the RTT congestion signal as per BOOTLEG_RTT_TOR here, from the Options mail[1] and Defenestrator paper[3].

This system must keep track of RTT measurements per circuit: RTT_min, RTT_max, and RTT_current. These are measured using the time delta between every 'cc_sendme_inc' relay cells and the SENDME response. The first RTT_min can be measured arbitrarily, so long as it is larger than what we would get from SENDME. RTT_current is N-EWMA smoothed over 'cc_ewma_cwnd_cnt' congestion windows worth of SENDME acks.

Recall that BOOTLEG_RTT_TOR emits a congestion signal when the current RTT rises above a threshold that is a fraction ('cc_westwood_rtt_thresh') of the way between RTT_min and RTT_max. Equivalently, the circuit is considered uncongested only while:

  RTT_current < (1-cc_westwood_rtt_thresh)*RTT_min + cc_westwood_rtt_thresh*RTT_max

Additionally, if the local OR connection is blocked at the time of SENDME ack arrival, this is treated as an immediate congestion signal. (We can also optionally use the ECN signal described in ideas/xxx-backward-ecn.txt, to exit Slow Start.)

Congestion signals from RTT, blocked OR connections, or ECN are processed only once per congestion window. This is achieved through the next_cc_event counter, which is initialized to a cwnd worth of SENDME acks, and is decremented each ack. Congestion signals are only evaluated when it reaches 0.

Note that because the congestion signal threshold of TOR_WESTWOOD is a function of RTT_max, and excessive queuing can cause an increase in RTT_max, TOR_WESTWOOD may have a runaway condition. For this reason, we specify a cc_westwood_rtt_m backoff parameter, which can reduce RTT_max by a percentage of the difference between RTT_min and RTT_max, in the event of congestion.

Here is the complete congestion window algorithm for Tor Westwood.
This will run each time we get a SENDME (aka sendme_process_circuit_level()):

    # Update acked cells
    inflight -= cc_sendme_inc

    if next_cc_event:
      next_cc_event--

    # Do not update anything if we detected a clock stall or jump,
    # as per [CLOCK_HEURISTICS]
    if clock_stalled_or_jumped:
      return

    if next_cc_event == 0:
      # BOOTLEG_RTT_TOR threshold; can also be BACKWARD_ECN check:
      if (RTT_current <
          (100-cc_westwood_rtt_thresh)*RTT_min/100 +
          cc_westwood_rtt_thresh*RTT_max/100) and not orconn_blocked:
        if in_slow_start:
          cwnd += cwnd * cc_cwnd_inc_pct_ss             # Exponential growth
        else:
          cwnd = cwnd + cc_cwnd_inc                     # Linear growth
      else:
        if cc_westwood_min_backoff:
          cwnd = min(cwnd * cc_westwood_cwnd_m, BDP)    # Window shrink
        else:
          cwnd = max(cwnd * cc_westwood_cwnd_m, BDP)    # Window shrink
        in_slow_start = 0

        # Back off RTT_max (in case of runaway RTT_max)
        RTT_max = RTT_min + cc_westwood_rtt_m * (RTT_max - RTT_min)

      cwnd = MAX(cwnd, cc_cwnd_min)
      next_cc_event = cwnd / (cc_cwnd_inc_rate * cc_sendme_inc)

3.3. Tor Vegas: TCP Vegas with Aggressive Slow Start [TOR_VEGAS]

  http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-vegas
  http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
  http://www.mathcs.richmond.edu/~lbarnett/cs332/assignments/brakmo_peterson_vegas.pdf
  ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf

The TCP Vegas control algorithm estimates the queue lengths at relays by subtracting the current BDP estimate from the current congestion window. Assuming the BDP estimate is accurate, any amount by which the congestion window exceeds the BDP will cause data to queue. Thus, Vegas estimates the queue use caused by congestion as:

  queue_use = cwnd - BDP

Original TCP Vegas used a cwnd BDP estimator only. However, because the SENDME BDP estimate is more accurate, and because Piecewise BDP allows us to back off more when the local OR connection is blocked, we use Piecewise BDP here. For testing, we also parameterize this queue_use calculation as a tunable weighted average between the cwnd-based BDP estimate and the piecewise estimate (consensus parameter 'cc_vegas_bdp_mix').

Additionally, if the local OR connection is blocked at the time of SENDME ack arrival, this is treated as an immediate congestion signal. (We can also optionally use the ECN signal described in ideas/xxx-backward-ecn.txt, to exit Slow Start.)

Congestion signals from RTT, blocked OR connections, or ECN are processed only once per congestion window. This is achieved through the next_cc_event counter, which is initialized to a cwnd worth of SENDME acks, and is decremented each ack. Congestion signals are only evaluated when it reaches 0.

Here is the complete pseudocode for TOR_VEGAS, which is run once per SENDME ack:

    # Update acked cells
    inflight -= cc_sendme_inc

    if next_cc_event:
      next_cc_event--

    # Do not update anything if we detected a clock stall or jump,
    # as per [CLOCK_HEURISTICS]
    if clock_stalled_or_jumped:
      return

    if next_cc_event == 0:
      if BDP > cwnd:
        queue_use = 0
      else:
        queue_use = cwnd - BDP

      if in_slow_start:
        if queue_use < cc_vegas_gamma and not orconn_blocked:
          cwnd = MAX(cwnd * cc_cwnd_inc_pct_ss, BDP)    # Exponential or BDP
        else:
          cwnd = BDP
          in_slow_start = 0
      else:
        if queue_use > cc_vegas_beta or orconn_blocked:
          cwnd -= cc_cwnd_inc
        elif queue_use < cc_vegas_alpha:
          cwnd += cc_cwnd_inc

      cwnd = MAX(cwnd, cc_cwnd_min)
      next_cc_event = cwnd / (cc_cwnd_inc_rate * cc_sendme_inc)
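As a worked example of the steady-state branch, using purely illustrative circuit values together with the defaults from [CONSENSUS_PARAMETERS] (cc_sendme_inc=50, so cc_vegas_alpha=150 and cc_vegas_beta=300):

    # Illustrative steady-state Vegas updates (orconn not blocked):
    #
    #   cwnd = 600, BDP estimate = 500  ->  queue_use = 100
    #     100 < cc_vegas_alpha (150)    ->  cwnd += cc_cwnd_inc
    #
    #   cwnd = 600, BDP estimate = 250  ->  queue_use = 350
    #     350 > cc_vegas_beta (300)     ->  cwnd -= cc_cwnd_inc
    #
    #   cwnd = 600, BDP estimate = 400  ->  queue_use = 200
    #     alpha <= 200 <= beta          ->  cwnd unchanged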
3.4. Tor NOLA: Direct BDP tracker [TOR_NOLA]

Based on the theory that congestion control should track the BDP, the simplest possible congestion control algorithm could just set the congestion window directly to its current BDP estimate, every SENDME ack.

Such an algorithm would need to overshoot the BDP slightly, especially in the presence of competing algorithms. But other than that, it can be exceedingly simple. Like Vegas, but without putting on airs. Just enough strung together.

After meditating on this for a while, it also occurred to me that no one has named a congestion control algorithm after New Orleans. We have Reno, Vegas, and scores of others. What's up with that?

Here's the pseudocode for TOR_NOLA that runs on every SENDME ack:

    # Do not update anything if we detected a clock stall or jump,
    # as per [CLOCK_HEURISTICS]
    if clock_stalled_or_jumped:
      return

    # If the orconn is blocked, do not overshoot BDP
    if orconn_blocked:
      cwnd = BDP
    else:
      cwnd = BDP + cc_nola_overshoot

    cwnd = MAX(cwnd, cc_cwnd_min)

4. Flow Control [FLOW_CONTROL]

Flow control provides what is known as "pushback" -- the property that if one endpoint stops reading data, the other endpoint stops sending data. This prevents data from accumulating at points in the network, if it is not being read fast enough by an application.

Because Tor must multiplex many streams onto one circuit, and each stream is mapped to another TCP socket, Tor's current pushback is rather complicated and under-specified. In C Tor, it is implemented in the following functions:
  - circuit_consider_stop_edge_reading()
  - connection_edge_package_raw_inbuf()
  - circuit_resume_edge_reading()

Tor currently maintains separate windows for each stream on a circuit, to provide individual stream flow control. Circuit windows are SENDME acked as soon as a relay data cell is decrypted and recognized. Stream windows are only SENDME acked if the data can be delivered to an active edge connection. This allows the circuit to continue to operate if an endpoint refuses to read data off of one of the streams on the circuit.

Because Tor streams can connect to many different applications and endpoints per circuit, it is important to preserve the property that if only one endpoint edge connection is inactive, it does not stall the whole circuit, in case one of those endpoints is malfunctioning or malicious.

However, window-based stream flow control also imposes a speed limit on individual streams. If the stream window size is below the circuit congestion window size, then it becomes the speed limit of a download, as we saw in the [MOTIVATION] section of this proposal.

So for performance, it is optimal that each stream window is the same size as the circuit's congestion window. However, large stream windows are a vector for OOM attacks, because malicious clients can force Exits to buffer a full stream window for each stream while connecting to a malicious site and uploading data that the site does not read from its socket. This attack is significantly easier to perform at the stream level than at the circuit level, because of the multiplier effects of only needing to establish a single fast circuit to perform the attack on a very large number of streams.

This catch-22 means that if we use windows for stream flow control, we either have to commit to allocating a full congestion window worth of memory for each stream, or impose a speed limit on our streams.
Hence, we will discard stream windows entirely, and instead use a simpler buffer-based design that uses XON/XOFF as a backstop. This will allow us to make full use of the circuit congestion window for every stream in combination, while still avoiding buffer buildup inside the network.

4.1. Stream Flow Control Without Windows [WINDOWLESS_FLOW]

Each endpoint (client, Exit, or onion service) should send circuit-level SENDME acks for all circuit cells as soon as they are decrypted and recognized, but *before* delivery to their edge connections.

If the edge connection is blocked because an application is not reading, data will build up in a queue at that endpoint. Three consensus parameters will govern the max length of this queue: xoff_client, xoff_exit, and xoff_mobile. These will be used for Tor clients, exits, and mobile devices, respectively. These cutoffs will be a percentage of the current 'cwnd' rather than a number of cells. Something like 5% of 'cwnd' should be plenty, since these edge connections should normally drain *much* faster than Tor itself.

If the length of an application stream queue exceeds the size provided in the appropriate consensus parameter, a RELAY_COMMAND_STREAM_XOFF will be sent, which instructs the other endpoint to stop sending from that edge connection. This XOFF cell can optionally contain any available stream data, as well.

As soon as the queue begins to drain, a RELAY_COMMAND_STREAM_XON will be sent, which allows the other end to resume reading on that edge connection. Because application streams should drain quickly once they are active, we will send the XON command as soon as they start draining. If the queues fill again, another XOFF will be sent.

If this results in excessive XOFF/XON flapping and chatter, we will also use consensus parameters xon_client, xon_exit, and xon_mobile to optionally specify when to send an XON. These parameters will be defined in terms of cells below the xoff_* levels, rather than as a percentage. The XON cells can also contain stream data, if any is available.

Tor's OOM killer will be invoked to close any streams whose application buffer grows too large, due to memory shortage or malicious endpoints. Note that no stream buffer should ever grow larger than the xoff level plus 'cwnd', unless an endpoint is ignoring XOFF. So, 'xoff_{client,exit,mobile} + cwnd' should be the hard-close stream cutoff, regardless of OOM killer status.
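A minimal sketch of this per-stream check follows, in the same pseudocode style as the algorithms above. The function name, the exact XON trigger, and the way the xoff_*/xon_* parameters are applied are illustrative assumptions, not a specification:

    def stream_flow_action(queue_len, xoff_sent, cwnd, xoff_pct, xon_cells):
      # xoff_{client,exit,mobile} is a percentage of the current cwnd.
      xoff_level = cwnd * xoff_pct // 100

      # Hard cutoff: no stream queue should ever exceed xoff_level + cwnd
      # unless the other endpoint is ignoring XOFF, so close the stream.
      if queue_len > xoff_level + cwnd:
        return "close stream"

      if not xoff_sent and queue_len > xoff_level:
        # Queue too long: tell the other endpoint to stop this edge.
        return "send RELAY_COMMAND_STREAM_XOFF"

      if xoff_sent and queue_len <= max(0, xoff_level - xon_cells):
        # Queue has drained: allow the other endpoint to resume. The
        # xon_* parameters are the optional anti-flapping backstop; the
        # simplest behavior is to send XON as soon as draining begins.
        return "send RELAY_COMMAND_STREAM_XON"

      return "no action"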
5. System Interactions [SYSTEM_INTERACTIONS]

Tor's circuit-level SENDME system currently has special cases in the following situations: Intropoints, HSDirs, onion services, and circuit padding. Additionally, proper congestion control will allow us to very easily implement conflux (circuit traffic splitting).

This section details those special cases and interactions of congestion control with other components of Tor.

5.1. HSDirs

Because HSDirs use the tunneled dirconn mechanism and thus also use RELAY_COMMAND_DATA, they are already subject to Tor's flow control. We may want to make sure our initial circuit window for HSDir circuits is customized for those circuit types, so that a SENDME is not required to fetch long descriptors. This will ensure HSDir descriptors can be fetched in one RTT.

5.2. Introduction Points

Introduction Points are not currently subject to any flow control.

Because Intropoints accept INTRODUCE1 cells from many client circuits and then relay them down a single circuit to the service as INTRODUCE2 cells, we cannot provide end-to-end congestion control all the way from client to service for these cells.

We can run congestion control from the service to the Intropoint, and probably should, since this is already subject to congestion control. As an optimization, if that congestion window reaches zero (because the service is overwhelmed), then we start sending NACKs back to the clients (or begin requiring proof-of-work), rather than just letting clients wait for a timeout.

5.3. Rendezvous Points

Rendezvous points are already subject to end-to-end SENDME control, because all relay cells are sent end-to-end via the rendezvous circuit splice in circuit_receive_relay_cell(). This means that rendezvous circuits will use end-to-end congestion control, as soon as individual onion clients and onion services upgrade to support it. There is no need for intermediate relays to upgrade at all.

5.4. Circuit Padding

Recall that circuit padding is negotiated between a client and a middle relay, with one or more state machines running on circuits at the middle relay that decide when to add padding.

  https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md

This means that the middle relay can send padding traffic towards the client that contributes to congestion, and the client may also send padding towards the middle relay, which also creates congestion.

For low-traffic padding machines, such as the currently deployed circuit setup obfuscation, this padding is inconsequential. However, higher-traffic circuit padding machines that are designed to defend against website traffic fingerprinting will need additional care to avoid inducing extra congestion, especially after the client or the exit experiences a congestion signal.

The current overhead percentage rate limiting features of the circuit padding system should handle this in some cases, but in other cases, an XON/XOFF circuit padding flow control command may be required, so that clients may signal to the machine that congestion is occurring.

5.5. Conflux

Conflux (aka multi-circuit traffic splitting) becomes significantly easier to implement once we have congestion control. However, much like congestion control, it will require experimentation to tune properly.

Recall that Conflux uses a 256-bit UUID to bind two circuits together at the Exit or onion service. The original Conflux paper specified an equation based on RTT to choose which circuit to send cells on.

  https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf

However, with congestion control, we will already know which circuit has the larger congestion window, and thus has the most available cells in its current congestion window. This will also be the faster circuit. Thus, the decision of which circuit to send a cell on only requires comparing congestion windows (and choosing the circuit with more packets remaining in its window).

Conflux will require sequence numbers on data cells, to ensure that the two circuits' data is properly re-assembled. The resulting out-of-order buffer can potentially be as large as an entire congestion window, if the circuits are very desynced (or one of them closes). It will be very expensive for Exits to maintain this much memory, and it exposes them to OOM attacks.
This is not as much of a concern in the client download direction, since clients will typically only have a small number of these out-of-order buffers to keep around. But for the upload direction, Exits will need to send some form of early XOFF on the faster circuit if this out-of-order buffer begins to grow too large, since simply halting the delivery of SENDMEs will still allow a full congestion window worth of data to arrive. This will also require tuning and experimentation, and optimum results will vary between the simulator and the live network.

TODO: Can we use explicit SENDME sequence number acking to make a connection-resumption conflux, to recover from circuit collapse or client migration? I am having trouble coming up with a design that does not require Exits to maintain a full congestion window full of data as a retransmit buffer in the event of circuit close. Such reconnect activity might require assistance from Guard relays so that they can help clients discover which cells were sent vs lost.

6. Performance Evaluation [EVALUATION]

Congestion control for Tor will be easy to implement, but difficult to tune to ensure optimal behavior.

6.1. Congestion Signal Experiments

Our first experiments were conducted client-side, to determine how stable the RTT measurements of circuits are across the live Tor network, whether we need more frequent SENDMEs, and/or whether we need to use any RTT smoothing or averaging. These experiments were performed using onion service clients and services on the live Tor network.

From these experiments, we tuned the RTT and BDP estimators, and arrived at reasonable values for EWMA smoothing and the minimum number of SENDME acks required to estimate BDP. We should check small variations in the EWMA smoothing and minimum BDP ack counts in Shadow experimentation, to check for high variability in these estimates, and other surprises.

6.2. Congestion Algorithm Experiments

In order to evaluate performance of congestion control algorithms, we will need to implement [TOR_WESTWOOD], [TOR_VEGAS], and [TOR_NOLA], and simulate their use in the Shadow Tor network simulator.

Simulation runs will need to evaluate performance on networks that use only one algorithm, as well as on networks that run a combination of algorithms - particularly each type of congestion control in combination with Tor's current flow control. Depending upon the number of current flow control clients, more aggressive parameters of these algorithms may need to be set, but this will result in additional queueing as well as sub-optimal behavior once all clients upgrade.

In particular, during live onion service testing, we noticed that these algorithms required particularly aggressive default values to compete against a network full of current clients. As more clients upgrade, we may be able to lower these defaults. We should get a good idea of what values we can choose at what upgrade point, from mixed Shadow simulation.

If Tor's current flow control is so aggressive that it causes problems with any amount of remaining old clients, we can experiment with kneecapping these legacy flow control Tor clients by setting a low 'circwindow' consensus parameter for them. This will allow us to set more reasonable parameter values, without waiting for all clients to upgrade.
Because custom congestion control can be deployed by any Exit or onion service that desires better service, we will need to be particularly careful about how congestion control algorithms interact with rogue implementations that more aggressively increase their window sizes. During these adversarial-style experiments, we must verify that cheaters do not get better service, and that Tor's circuit OOM killer properly closes circuits that seriously abuse the congestion control algorithm, as per [SECURITY_ANALYSIS]. This may require tuning CircuitEWMAHalfLife, as well as the OOM killer cutoffs.

Additionally, we specified that the algorithms maintain previous congestion window estimates in the event that a circuit goes idle, rather than revert to slow start. We should experiment with intermittent idle/active clients to make sure that this behavior is acceptable.

6.3. Flow Control Algorithm Experiments

We will need to tune the xoff_* consensus parameters to minimize the amount of edge connection buffering as well as XON/XOFF chatter for Exits. This can be done in simulation, but will require fine-tuning on the live network.

Additionally, we will need to experiment with reducing the cell queue limits on OR conns before they are blocked, and study the interaction of that with treating the OR conn block as a congestion signal.

TODO: We should make the cell queue highwater value into a consensus parameter in the flow control implementation.

Relays may report these statistics in extra-info descriptors, to help with monitoring the live network conditions, but this might also require aggregation or minimization.

6.4. Performance Metrics [EVALUATION_METRICS]

Because congestion control will affect so many aspects of performance, from throughput to RTT, to load balancing, queue length, overload, and other failure conditions, the full set of performance metrics will be required:

  https://gitlab.torproject.org/legacy/trac/-/wikis/org/roadmaps/CoreTor/PerformanceMetrics

We will also need to monitor network health for relay queue lengths, relay overload, and other signs of network stress (and particularly the alleviation of network stress).

6.5. Consensus Parameter Tuning [CONSENSUS_PARAMETERS]

During Shadow simulation, we will determine reasonable default parameters for our consensus parameters for each algorithm. We will then re-run these tuning experiments on the live Tor network, as described in:

  https://gitlab.torproject.org/tpo/core/team/-/wikis/NetworkTeam/Sponsor61/PerformanceExperiments

6.5.1. Parameters common to all algorithms

These are sorted in order of importance to tune, most important first.

  cc_alg:
    - Description: Specifies which congestion control algorithm clients
      should use, as an integer.
    - Range: [0,3] (0=fixed, 1=Westwood, 2=Vegas, 3=NOLA)
    - Default: 2
    - Tuning Values: 0-3
    - Tuning Notes:
      These algorithms need to be tested against percentages of current
      fixed alg client competition, in Shadow. Their optimal parameter
      values, and even the optimal algorithm itself, will likely depend
      upon how much fixed sendme traffic is in competition. See the
      algorithm-specific parameters for additional tuning notes.

  cc_bwe_min:
    - Description: The minimum number of SENDME acks to average over in
      order to estimate bandwidth (and thus BDP).
    - Range: [2, 20]
    - Default: 5
    - Tuning Values: 3-5
    - Tuning Notes:
      The lower this value is, the sooner we can get an estimate of the
      true BDP of a circuit. Low values may lead to massive
      over-estimation, due to ack compression.
      However, if this value is above the number of acks that fit in
      cc_cwnd_init, then we won't get a BDP estimate within the first use
      of the circuit. Additionally, if this value is above the number of
      acks that fit in cc_cwnd_min, we may not be able to estimate BDP
      when the congestion window is small. If we need small congestion
      windows, we should also lower cc_sendme_inc, which will get us more
      frequent acks with less data.

  cc_sendme_inc:
    - Description: Specifies how many cells a SENDME acks.
    - Range: [1, 5000]
    - Default: 50
    - Tuning Values: 25,33,50
    - Tuning Notes:
      Decreasing this increases overhead, but also increases BDP
      estimation accuracy. Since we only use circuit-level sendmes, and
      the old code sent sendmes at both every 50 cells and every 100, we
      can set this as low as 33 to have the same amount of overhead.

  cc_cwnd_init:
    - Description: Initial congestion window for new congestion control
      Tor clients. This can be set much higher than TCP, since actual TCP
      to the guard will prevent buffer bloat issues at local routers.
    - Range: [100, 10000]
    - Default: 500
    - Tuning Values: 150,200,250,500
    - Tuning Notes:
      Higher initial congestion windows allow the algorithms to measure
      initial BDP more accurately, but will lead to queue bursts and
      latency. Ultimately, the ICW should be set to approximately
      'cc_bwe_min'*'cc_sendme_inc', but the presence of competing fixed
      window clients may require higher values.

  cc_cwnd_min:
    - Description: The minimum allowed cwnd.
    - Range: [cc_sendme_inc, 1000]
    - Tuning Values: [50, 100, 150]
    - Tuning Notes:
      If the cwnd falls below cc_sendme_inc, connections can't send
      enough data to get any acks, and will stall. If it falls below
      cc_bwe_min*cc_sendme_inc, connections can't use SENDME BDP
      estimates. Likely we want to set this around
      cc_bwe_min*cc_sendme_inc, but no lower than cc_sendme_inc.

  circwindow:
    - Description: Initial congestion window for legacy Tor clients.
    - Range: [100, 1000]
    - Default: 1000
    - Tuning Values: 100,200,500,1000
    - Tuning Notes:
      If the above congestion algorithms are not optimal until an
      unreasonably high percentage of clients upgrade, we can reduce the
      performance of ossified legacy clients by reducing their circuit
      windows. This will allow old clients to continue to operate without
      impacting optimal network behavior.

  cc_cwnd_inc_rate:
    - Description: How often we update our congestion window, per cwnd
      worth of packets.
    - Range: [1, 250]
    - Default: 1
    - Tuning Values: [1,2,5,10]
    - Tuning Notes:
      Congestion control theory says that the congestion window should
      only be updated once every cwnd worth of packets. We may find it
      better to update more frequently, but this is probably unlikely to
      help a great deal.

  cc_ewma_cwnd_cnt:
    - Description: This specifies the N in N-EWMA smoothing of RTT and
      BDP estimation. It allows us to average these values over 1 or more
      congestion windows.
    - Range: [1, 100]
    - Default: 2
    - Tuning Values: [1,2,3]
    - Tuning Notes:
      Smoothing our estimates reduces the effects of ack compression and
      other ephemeral network hiccups; changing this much is unlikely to
      have a huge impact on performance.

  cc_cwnd_inc:
    - Description: How much to increment the congestion window by during
      steady state, every cwnd.
    - Range: [1, 1000]
    - Default: 50
    - Tuning Values: 25,50,100
    - Tuning Notes:
      We are unlikely to need to tune this much, but it might be worth
      trying a couple of values.

  cc_cwnd_inc_pct_ss:
    - Description: Percentage of the current congestion window to
      increment by during slow start, every cwnd.
    - Range: [10, 300]
    - Tuning Values: 50,100,200
    - Tuning Notes:
      On the current live network, the algorithms tended to exit slow
      start early, so we did not exercise this much. This may not be the
      case in Shadow, or once clients upgrade to the new algorithms.

  cc_bdp_alg:
    - Description: The BDP estimation algorithm to use.
    - Range: [0,7]
    - Default: 7 (Piecewise EWMA)
    - Tuning Notes:
      We don't expect to need to tune this.

6.5.2. Westwood parameters

  cc_westwood_rtt_thresh:
    - Description: Specifies the cutoff for BOOTLEG_RTT_TOR to deliver a
      congestion signal, as a fixed-point representation divided by 1000.
    - Range: [1, 1000]
    - Default: 33
    - Tuning Values: [20, 33, 40, 50]
    - Tuning Notes:
      The Defenestrator paper set this at 23, but did not justify it. We
      may need to raise it to compete with current fixed window SENDME.

  cc_westwood_cwnd_m:
    - Description: Specifies how much to reduce the congestion window
      after a congestion signal, as a fraction of 100.
    - Range: [0, 100]
    - Default: 75
    - Tuning Values: [50, 66, 75]
    - Tuning Notes:
      Congestion control theory started out using 50 here, and then
      decided 70-75 was better.

  cc_westwood_min_backoff:
    - Description: If 1, take the min of the BDP estimate and the
      westwood backoff. If 0, take the max of the BDP estimate and the
      westwood backoff.
    - Range: [0, 1]
    - Default: 0
    - Tuning Notes:
      This parameter can make the westwood backoff less aggressive, if
      need be. We're unlikely to need it, though.

  cc_westwood_rtt_m:
    - Description: Specifies a backoff percent of RTT_max, upon receipt
      of a congestion signal.
    - Range: [50, 100]
    - Default: 100
    - Tuning Notes:
      Westwood technically has a runaway condition where congestion can
      cause RTT_max to grow, which increases the congestion threshold.
      This has not yet been observed, but because it is possible, we
      include this parameter.

6.5.3. Vegas Parameters

  cc_vegas_alpha:
  cc_vegas_beta:
  cc_vegas_gamma:
    - Description: These parameters govern the number of cells that
      [TOR_VEGAS] can detect in queue before reacting.
    - Range: [1, 1000]
    - Defaults: 6*cc_sendme_inc for gamma and beta; 3*cc_sendme_inc for
      alpha
    - Tuning Notes:
      The amount of queued cells that Vegas should tolerate is heavily
      dependent upon competing congestion control algorithms. The
      specified defaults are necessary to compete against current fixed
      SENDME traffic, but are much larger than necessary otherwise. As
      the number of clients using fixed windows is reduced (or circwindow
      is lowered), we can reduce these parameters, which will result in
      less overall queuing and congestion on the network.

  cc_vegas_bdp_mix:
    - Description: This parameter allows us to specify a weighted percent
      average between the cwnd BDP estimator and the piecewise estimator.
    - Range: [0, 100]
    - Default: 0 (use 100% Piecewise EWMA)
    - Tuning Notes:
      The original Vegas literature specified the cwnd estimator, but the
      Piecewise estimator is very likely more suited for our use case.
      This parameter allows us to average them. We're unlikely to need
      it.

6.5.4. NOLA Parameters

  cc_nola_overshoot:
    - Description: The number of cells to add to the BDP estimator to
      obtain the NOLA cwnd.
    - Range: [0, 1000]
    - Default: 100
    - Tuning Values: 0, 50, 100, 150, 200
    - Tuning Notes:
      In order to compete against current fixed sendme, and to ensure
      that the congestion window has an opportunity to grow, we must set
      the cwnd above the current BDP estimate. How much above will be a
      function of competing traffic. It may also turn out that absent any
      more aggressive competition, we do not need to overshoot the BDP
      estimate.
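To make the relationships between these parameters concrete, here is some illustrative arithmetic using only the default values listed above; it is not a tuning recommendation:

    # Derived values under the listed defaults:
    #   cc_sendme_inc = 50, cc_cwnd_init = 500, cc_cwnd_inc_rate = 1,
    #   cc_bwe_min = 5, cc_nola_overshoot = 100
    #
    #   Vegas thresholds:  cc_vegas_alpha = 3*50 = 150 cells
    #                      cc_vegas_beta = cc_vegas_gamma = 6*50 = 300 cells
    #   NOLA window:       cwnd = BDP + 100 cells
    #   cc_cwnd_min guide: ~ cc_bwe_min * cc_sendme_inc = 250 cells
    #   Update interval:   next_cc_event = cwnd/(cc_cwnd_inc_rate*cc_sendme_inc)
    #                      = 500/50 = 10 SENDME acks at the initial window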
6.5.5. Flow Control Parameters

TODO: These may expand, particularly to include cell_queue_highwater

  xoff_client
  xoff_mobile
  xoff_exit
    - Description: Specifies the stream queue size as a percentage of
      'cwnd' at an endpoint before an XOFF is sent.
    - Range: [1, 100]
    - Default: 5

  xon_client
  xon_mobile
  xon_exit
    - Description: Specifies how many cells below the xoff_* level before
      an XON is sent from an endpoint.
    - Range: [1, 10000000]
    - Default: 10000

7. Protocol format specifications [PROTOCOL_SPEC]

TODO: This section needs details once we close out other TODOs above.

7.1. Circuit window handshake format

TODO: We need to specify a way to communicate the currently seen cc_sendme_inc consensus parameter to the other endpoint, due to consensus sync delay. Probably during the CREATE onionskin (and RELAY_COMMAND_EXTEND).

TODO: We probably want stricter rules on the range of values for the per-circuit negotiation - something like it has to be between [cc_sendme_inc/2, 2*cc_sendme_inc]. That way, we can limit weird per-circuit values, but still allow us to change the consensus value in increments.

7.2. XON/XOFF relay cell formats

TODO: We need to specify XON/XOFF for flow control. This should be simple.

TODO: We should also allow it to carry stream data, as in Prop 325.

7.3. Onion Service formats

TODO: We need to specify how to signal support for congestion control in an onion service, to both the intropoint and to clients.

7.4. Protocol Version format

TODO: We need to pick a protover to signal Exit and Intropoint congestion control support.

7.5. SENDME relay cell format

TODO: We need to specify how to add stream data to a SENDME as an optimization.

7.6. Extrainfo descriptor formats

TODO: We will want to gather information on circuitmux and other relay queues, as well as XON/XOFF rates, and edge connection queue lengths at exits.

8. Security Analysis [SECURITY_ANALYSIS]

The security risks of congestion control come in three forms: DoS attacks, fairness abuse, and side channel risk.

8.1. DoS Attacks (aka Adversarial Buffer Bloat)

The most serious risk of eliminating our current window cap is that endpoints can abuse this situation to create huge queues and thus DoS Tor relays. This form of attack was already studied against the Tor network in the Sniper attack:

  https://www.freehaven.net/anonbib/cache/sniper14.pdf

We had two fixes for this. First, we implemented a circuit-level OOM killer that closed circuits whose queues became too big, before the relay OOMed and crashed. Second, we implemented authenticated SENDMEs, so clients could not artificially increase their window sizes with honest exits:

  https://gitweb.torproject.org/torspec.git/tree/proposals/289-authenticated-sendmes.txt

We can continue this kind of enforcement by having Exit relays ensure that clients are not transmitting SENDMEs too often, and do not appear to be inflating their send windows beyond what the Exit expects, by calculating a similar estimated receive window. Note that such an estimate may have error and may become negative if the estimate is jittery.

Unfortunately, authenticated SENDMEs do *not* prevent the same attack from being done by rogue exits, or rogue onion services. For that, we rely solely on the circuit OOM killer. During our experimentation, we must ensure that the circuit OOM killer works properly to close circuits in these scenarios.

But in any case, it is important to note that we are not any worse off with congestion control than we were before, with respect to these kinds of DoS attacks.
In fact, the deployment of congestion control by honest clients should reduce queue use and overall memory use in relays, allowing them to be more resilient to OOM attacks than before.

8.2. Congestion Control Fairness Abuse (aka Cheating)

On the Internet, significant research and engineering effort has been devoted to ensuring that congestion control algorithms are "fair", in that each connection receives equal throughput. This fairness is provided both via the congestion control algorithm and via queue management algorithms at Internet routers. One of the most unfortunate early results was that TCP Vegas, despite being near-optimal at minimizing queue lengths at routers, was easily outperformed by more aggressive algorithms that tolerated larger queue delay (such as TCP Reno).

Note that because the most common direction of traffic for Tor is from Exit to client, unless Exits are malicious, we do not need to worry about rogue algorithms as much, but we should still examine them in our experiments because of the possibility of malicious Exits, as well as malicious onion services.

Queue management can help further mitigate this risk, too. When RTT is used as a congestion signal, our current Circuit-EWMA queue management algorithm is likely sufficient for this. Because Circuit-EWMA will add additional delay to loud circuits, "cheaters" who use alternate congestion control algorithms to inflate their congestion windows should end up with more RTT congestion signals than those who do not, and the Circuit-EWMA scheduler will also relay fewer of their cells per time interval.

In this sense, we do not need to worry about fairness and cheating as a security property, but a lack of fairness in the congestion control algorithm *will* increase memory use in relays to queue these unfair/loud circuits, perhaps enough to trigger the OOM killer. So we should still be mindful of these properties in selecting our congestion control algorithm, to minimize relay memory use, if nothing else.

These two properties (honest Exits and Circuit-EWMA) may even be enough to make it possible to use [TOR_VEGAS] even in the presence of other algorithms, which would be a huge win in terms of memory savings as well as vastly reduced queue delay. We must verify this experimentally, though.

8.3. Side Channel Risks

Vastly reduced queue delay and predictable amounts of congestion on the Tor network may make certain forms of traffic analysis easier. Additionally, the ability to measure RTT, and have it be stable due to minimal network congestion, may make geographical inference attacks easier:

  https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
  https://www.robgjansen.com/publications/howlow-pets2013.pdf

It is an open question whether these risks are serious enough to warrant eliminating the ability to measure RTT at the protocol level and abandoning it as a congestion signal, in favor of other approaches (which have their own side channel risks). It will be difficult to comprehensively eliminate RTT measurements, too.

On the plus side, Conflux traffic splitting (which is made easy once congestion control is implemented) does show promise as providing a defense against traffic analysis:

  https://www.comsys.rwth-aachen.de/fileadmin/papers/2019/2019-delacadena-splitting-defense.pdf

There is also literature on shaping circuit bandwidth to create a side channel. This can be done regardless of the use of congestion control, and is not an argument against using congestion control.
In fact, the Backlit defense may be an argument in favor of endpoints monitoring circuit bandwidth and latency more closely, as a defense:

  https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf
  https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf
  https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

Finally, recall that we are considering ideas/xxx-backward-ecn.txt [BACKWARD_ECN] to use a circuit-level cell_t.command to signal congestion. This allows all relays in the path to signal congestion in under RTT/2 in either direction, and it can be flipped on existing relay cells already in transit, without introducing any overhead. However, because cell_t.command is visible and malleable to all relays, it can also be used as a side channel. So we must limit its use to a couple of cells per circuit, at most.

  https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack

9. Onion Services

Onion services require us to advertise the protocol version and congestion control parameters in a different way, since the endpoints do not know each other the way a client knows all the relays and what they support.

This is addressed in two parts. First, the service needs to advertise to the world that it supports congestion control. This is done through a new line in the descriptor; see section 9.1 below. Second, the client needs to inform the service that it wants to use congestion control on the rendezvous circuit, and with which parameters. This is done through the INTRODUCE cell as an extension; see section 9.2 below.

9.1. Descriptor

We propose to add a new line to advertise the flow control protocol version:

  "flow-control" SP version-number NL

The "version-number" value is the same as the FlowCtrl protocol version that relays advertise, which is defined earlier in this proposal. The current value can only be "2".

Clients that are able to parse this line and know the protocol version can then send, in the INTRODUCE1 cell, the congestion control parameters that they would like to use, as detailed in the next section.

This line MUST be removed from the descriptor if the consensus disables congestion control by setting "cc_alg=0". In other words, a service should only advertise its flow control version if congestion control is enabled network wide.

9.2. INTRODUCE cell extension

We propose a new extension to the INTRODUCE cell which can be used to send congestion control parameters down to the service. It is important to mention that this extension is to be used in the encrypted section of the cell, not in the section readable by the introduction point.

If used, it needs to be encoded within the N_EXTENSIONS field of the ENCRYPTED section of the INTRODUCE1 cell defined in rend-spec-v3.txt section 3.3. The content is defined as follows:

  EXT_FIELD_TYPE:

    [01] -- Congestion Control Parameters.
      If this flag is set, the extension should be used by the service to
      learn what congestion control parameters to use on the rendezvous
      circuit.

  EXT_FIELD content format is:

    N_PARAMS      [1 byte]
    N_PARAMS times:
      PARAM_TYPE  [1 byte]
      PARAM_VALUE [8 bytes]

  The PARAM_TYPE possible values are:

    [01] -- Newest protocol version supported
      The newest protocol version that the client supports.

    [02] -- SENDME increment
      The initial SENDME increment that should be used by both endpoints
      when using congestion control.

The PARAM_VALUE size is 8 bytes in order to accommodate 64-bit values. It MUST match the specified limits for the following PARAM_TYPE values:

  [01] -- Min: 2, Max: 255
  [02] -- Min: 1, Max: 5000 (same as "cc_sendme_inc")
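As a non-normative illustration, an extension carrying protocol version 2 and a SENDME increment of 50 cells could be encoded as follows; big-endian byte order for PARAM_VALUE is an assumption that the final [PROTOCOL_SPEC] section would need to confirm:

    N_PARAMS:    02
    PARAM_TYPE:  01                        -- newest supported version
    PARAM_VALUE: 00 00 00 00 00 00 00 02   -- version 2
    PARAM_TYPE:  02                        -- SENDME increment
    PARAM_VALUE: 00 00 00 00 00 00 00 32   -- 50 cells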
9.3. Protocol Flow

First, the client reads the "flow-control" line in the descriptor and takes the maximum value supported by both itself and the service. As an example, if the client supports 2-3-4 and the service supports 2-3, then 3 is chosen.

It then sends that value, along with its desired cc_sendme_inc value, in the extension in the INTRODUCE1 cell.

The service will then validate that it does support version 3 and that the parameter cc_sendme_inc is within the range of the protocol. Congestion control is then applied to the rendezvous circuit.

9.4. Circuit Behavior

If the extension is not found in the cell, the service MUST NOT use congestion control on the rendezvous circuit.

Any invalid values received in the extension should result in closing the introduction circuit, and thus not continuing the rendezvous process. A value is invalid if it is not supported or is outside the defined range.

9.5. Security Considerations

Advertising a new line in a descriptor does leak that a service is running at least a certain tor version. We believe that this is an acceptable risk in order for services to be able to take advantage of congestion control. Once a new tor stable is released, we hope that most services upgrade, and thus everyone looks the same again.

The new extension is located in the encrypted part of the INTRODUCE1 cell, and thus the introduction point can't learn its content.

10. Acknowledgements

Immense thanks to Toke Høiland-Jørgensen for considerable input into all aspects of the TCP congestion control background material for this proposal, as well as review of our versions of the algorithms.

11. [CITATIONS]

1. Options for Congestion Control in Tor-Like Networks.
   https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html

2. Towards Congestion Control Deployment in Tor-like Networks.
   https://lists.torproject.org/pipermail/tor-dev/2020-June/014343.html

3. DefenestraTor: Throwing out Windows in Tor.
   https://www.cypherpunks.ca/~iang/pubs/defenestrator.pdf

4. TCP Westwood: Bandwidth Estimation for Enhanced Transport over Wireless Links
   http://nrlweb.cs.ucla.edu/nrlweb/publication/download/99/2001-mobicom-0.pdf

5. Performance Evaluation and Comparison of Westwood+, New Reno, and Vegas TCP Congestion Control
   http://cpham.perso.univ-pau.fr/TCP/ccr_v31.pdf

6. Linux 2.4 Implementation of Westwood+ TCP with rate-halving
   https://c3lab.poliba.it/images/d/d7/Westwood_linux.pdf

7. TCP Westwood
   http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood

8. TCP Vegas: New Techniques for Congestion Detection and Avoidance
   http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf

9. Understanding TCP Vegas: A Duality Model
   ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf

10. TCP Vegas
    http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-vegas

11. Controlling Queue Delay
    https://queue.acm.org/detail.cfm?id=2209336

12. Controlled Delay Active Queue Management
    https://tools.ietf.org/html/rfc8289

13. How Much Anonymity does Network Latency Leak?
    https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf

14. How Low Can You Go: Balancing Performance with Anonymity in Tor
    https://www.robgjansen.com/publications/howlow-pets2013.pdf

15. POSTER: Traffic Splitting to Counter Website Fingerprinting
    https://www.comsys.rwth-aachen.de/fileadmin/papers/2019/2019-delacadena-splitting-defense.pdf
16. RAINBOW: A Robust And Invisible Non-Blind Watermark for Network Flows
    https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf

17. SWIRL: A Scalable Watermark to Detect Correlated Network Flows
    https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf

18. Exposing Invisible Timing-based Traffic Watermarks with BACKLIT
    https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

19. The Sniper Attack: Anonymously Deanonymizing and Disabling the Tor Network
    https://www.freehaven.net/anonbib/cache/sniper14.pdf

20. Authenticating sendme cells to mitigate bandwidth attacks
    https://gitweb.torproject.org/torspec.git/tree/proposals/289-authenticated-sendmes.txt

21. Tor security advisory: "relay early" traffic confirmation attack
    https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack

22. The Path Less Travelled: Overcoming Tor’s Bottlenecks with Traffic Splitting
    https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf

23. Circuit Padding Developer Documentation
    https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md

24. Plans for Tor Live Network Performance Experiments
    https://gitlab.torproject.org/tpo/core/team/-/wikis/NetworkTeam/Sponsor61/PerformanceExperiments

25. Tor Performance Metrics for Live Network Tuning
    https://gitlab.torproject.org/legacy/trac/-/wikis/org/roadmaps/CoreTor/PerformanceMetrics

26. Bandwidth-Delay Product
    https://en.wikipedia.org/wiki/Bandwidth-delay_product

27. Exponentially Weighted Moving Average
    https://corporatefinanceinstitute.com/resources/knowledge/trading-investing/exponentially-weighted-moving-average-ewma/