1 files changed, 673 insertions, 479 deletions
diff --git a/proposals/329-traffic-splitting.txt b/proposals/329-traffic-splitting.txt
index 85d704b..e03d3aa 100644
--- a/proposals/329-traffic-splitting.txt
+++ b/proposals/329-traffic-splitting.txt
@@ -1,15 +1,18 @@
+```
 Filename: 329-traffic-splitting.txt
 Title: Overcoming Tor's Bottlenecks with Traffic Splitting
 Author: David Goulet, Mike Perry
 Created: 2020-11-25
-Status: Draft
+Status: Finished
 
 0. Status
 
   This proposal describes the Conflux [CONFLUX] system developed by
   Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. It aims at
   improving Tor client network performance by dynamically splitting
-  traffic between two circuits.
+  traffic between two circuits. We have made several additional improvements
+  to the original Conflux design, by making use of congestion control
+  information, as well as updates from Multipath TCP literature.
 
 
 1. Overview
@@ -19,11 +22,11 @@ Status: Draft
   In order to understand our improvements to Conflux, it is important to
   properly conceptualize what is involved in the design of multipath
   algorithms in general.
-  
+
   The design space is broken into two orthogonal parts: congestion control
   algorithms that apply to each path, and traffic scheduling algorithms
   that decide which packets to send on each path.
-  
+
   MPTCP specifies 'coupled' congestion control (see [COUPLED]). Coupled
   congestion control updates single-path congestion control algorithms to
   account for shared bottlenecks between the paths, so that the combined
@@ -31,45 +34,48 @@ Status: Draft
   happen to be shared between the multiple paths. Various ways of
   accomplishing this have been proposed and implemented in the Linux
   kernel.
-  
+
   Because Tor's congestion control only concerns itself with bottlenecks in
   Tor relay queues, and not with any other bottlenecks (such as
   intermediate Internet routers), we can avoid this complexity merely by
   specifying that any paths that are constructed SHOULD NOT share any
-  relays. In this way, we can proceed to use the exact same congestion
-  control as specified in Proposal 324, for each path.
-  
-  For this reason, this proposal will focus on the traffic scheduling
-  algorithms, rather than coupling. We propose three candidate algorithms
-  that have been studied in the literature, and will compare their
-  performance using simulation and consensus parameters.
+  relays (except for the exit). This assumption is valid, because non-relay
+  bottlenecks are managed by TCP of client-to-relay and relay-to-relay OR
+  connections, and not Tor's circuit-level congestion control. In this way,
+  we can proceed to use the exact same congestion control as specified in
+  [PROP324], for each path.
+
+  For this reason, this proposal will focus on protocol specification, and
+  the traffic scheduling algorithms, rather than coupling. Note that the
+  scheduling algorithms are currently in flux, and will be subject to
+  change as we tune them in Shadow, on the live network, and for future
+  UDP implementation (see [PROP339]). This proposal will be kept up to
+  date with the current implementation.
 
 1.2. Divergence from the initial Conflux design
 
   The initial [CONFLUX] paper doesn't provide any indications on how to
   handle the size of out-of-order cell queue, which we consider a
   potential dangerous memory DoS vector (see [MEMORY_DOS]). It also used
-  RTT as the sole heuristic for selecting which circuit to send on, which
+  RTT as the sole heuristic for selecting which circuit to send on (which
   may vary depending on the geographical locations of the participant
-  relays, without considering their actual available circuit capacity
+  relays), without considering their actual available circuit capacity
   (which will be available to us via Proposal 324). Additionally, since
   the publication of [CONFLUX], more modern packet scheduling algorithms
   have been developed, which aim to reduce out-of-order queue size.
-  
+
   We propose mitigations for these issues using modern scheduling
   algorithms, as well as implementations options for avoiding the
   out-of-order queue at Exit relays. Additionally, we consider resumption,
   side channel, and traffic analysis risks and benefits in [RESUMPTION],
   [SIDE_CHANNELS] and [TRAFFIC_ANALYSIS].
 
+1.3. Design Overview
+
+  The following section describes the Conflux design.
 
-2. Design
+  The circuit construction is as follows:
 
-  The following section describes the Conflux design. Each sub-section is
-  a building block to the multipath design that Conflux proposes.
-  
-  The circuit construction is as follow:
-  
          Primary Circuit (lower RTT)
             +-------+      +--------+
             |Guard 1|----->|Middle 1|----------+
@@ -81,7 +87,7 @@ Status: Draft
             |Guard 2|----->|Middle 2|----------+
             +-------+      +--------+
          Secondary Circuit (higher RTT)
-  
+
   Both circuits are built using current Tor path selection, however they
   SHOULD NOT share the same Guard relay, or middle relay. By avoiding
   using the same relays in these positions in the path, we ensure
@@ -89,380 +95,573 @@ Status: Draft
   'coupled' congestion control algorithms from the MPTCP
   literature[COUPLED].  This both simplifies design, and improves
   performance.
-  
+
   Then, the OP needs to link the two circuits together, as described in
-  [LINKING_CIRCUITS], [LINKING_EXIT], and [LINKING_SERVICE].
-  
-  For ease of explanation, the primary circuit is the circuit with lower
-  RTT, and the secondary circuit is the circuit with higher RTT. Initial
-  RTT is measured during circuit linking, as described in
-  [LINKING_CIRCUITS].  RTT is continually measured using SENDME timing, as
-  in Proposal 324.  This means that during use, the primary circuit and
-  secondary circuit may switch roles, depending on unrelated network
-  congestion caused by other Tor clients.
-  
+  [CONFLUX_HANDSHAKE].
+
+  For ease of explanation, the primary circuit is the circuit that is
+  more desirable to use, as per the scheduling algorithm, and the secondary
+  circuit is used after the primary is blocked by congestion control. Note
+  that for some algorithms, this selection becomes fuzzy, but all of them
+  favor the circuit with lower RTT, at the beginning of transmission.
+
+  Note also that this notion of primary vs secondary is a local property
+  of the current sender: each endpoint may have different notions of
+  primary, secondary, and current sending circuit. They also may use
+  different scheduling algorithms to determine this.
+
+  Initial RTT is measured during circuit linking, as described in
+  [CONFLUX_HANDSHAKE]. After the initial link, RTT is continually measured
+  using SENDME timing, as in Proposal 324. This means that during use,
+  the primary circuit and secondary circuit may switch roles, depending on
+  unrelated network congestion caused by other Tor clients.
+
   We also support linking onion service circuits together. In this case,
   only two rendezvous circuits are linked. Each of these RP circuits will
   be constructed separately, and then linked. However, the same path
   constraints apply to each half of the circuits (no shared relays between
   the legs). If, by chance, the service and the client sides end up
   sharing some relays, this is not catastrophic. Multipath TCP researchers
-  we have consulted (see [ACKNOWLEDGEMENTS]), believe Tor's congestion
+  we have consulted (see [ACKNOWLEDGMENTS]), believe Tor's congestion
   control from Proposal 324 to be sufficient in this rare case.
-  
-  Only two circuits SHOULD be linked together. However, implementations
-  SHOULD make it easy for researchers to *test* more than two paths, as
-  this has been shown to assist in traffic analysis resistance[WTF_SPLIT].
-  At minimum, this means not hardcoding only two circuits in the
-  implementation.
-  
+
+  In the algorithms we recommend here, only two circuits will be linked
+  together at a time.  However, implementations SHOULD support more than
+  two paths, as this has been shown to assist in traffic analysis
+  resistance[WTF_SPLIT], and will also be useful for maintaining a desired
+  target RTT, for UDP VoIP applications.
+
   If the number of circuits exceeds the current number of guard relays,
   guard relays MAY be re-used, but implementations SHOULD use the same
   number of Guards as paths.
-  
+
   Linked circuits MUST NOT be extended further once linked (ie:
   'cannibalization' is not supported).
 
+
+2. Protocol Mechanics
+
 2.1. Advertising support for conflux
 
+2.1.1 Relay
+
   We propose a new protocol version in order to advertise support for
   circuit linking on the relay side:
-  
-     "Relay=4" -- Relay supports an 2 byte sequence number in a RELAY cell
-                  header used for multipath circuit which are linked with the
-                  new RELAY_CIRCUIT_LINK relay cell command.
-  
-  XXX: Advertise this in onion service descriptor.
-  XXX: Onion service descriptor can advertise more than two circuits?
+
+     "Conflux=1" -- Relay supports Conflux as in linking circuits together using
+                    the new LINK, LINKED and SWITCH relay command.
+
+2.1.2 Onion Service
+
+  We propose to add a new line in order to advertise conflux support in the
+  encrypted section of the onion service descriptor:
+
+    "conflux" SP max-num-circ SP desired-ux NL
+
+      The "max-num-circ" value indicate the maximum number of rendezvous
+      circuits that are allowed to be linked together.
+
+  We let the service specify the conflux algorithm to use, when sending data
+  to the service. Some services may prefer latency, where as some may prefer
+  throughput. However, clients also have the ability to request their own UX
+  for data that the service sends, in the LINK handshake below, in part
+  because the high-throughput algorithms will require more out-of-order queue
+  memory, which may be infeasible on mobile.
 
   The next section describes how the circuits are linked together.
 
-2.2. Linking circuits [LINKING_CIRCUITS]
+2.2. Conflux Handshake [CONFLUX_HANDSHAKE]
 
   To link circuits, we propose new relay commands that are sent on both
   circuits, as well as a response to confirm the join, and an ack of this
   response. These commands create a 3way handshake, which allows each
   endpoint to measure the initial RTT of each leg upon link, without
   needing to wait for any data.
-  
+
   All three stages of this handshake are sent on *each* circuit leg to be
   linked.
-  
-  To save round trips, these cells SHOULD be combined with the initial
-  RELAY_BEGIN cell on the faster circuit leg, using Proposal 325. See
-  [LINKING_EXIT] and [LINKING_SERVICE] for more details on setup in each
-  case.
-  
-  There are other ways to do this linking that we have considered, but
-  they seem not to be significantly better than this method, especially
-  since we can use Proposal 325 to eliminate the RTT cost of this setup
-  before sending data. For those other ideas, see [ALTERNATIVE_LINKING]
-  and [ALTERNATIVE_RTT], in the appendix.
-  
+
+  When packed cells are a reality (proposal 340), these cells SHOULD be
+  combined with the initial RELAY_BEGIN cell on the faster circuit leg.
+  This combination also allows better enforcement against side channels.
+  (See [SIDE_CHANNELS]).
+
+  There are other ways to do this linking that we have considered, but they
+  seem not to be significantly better than this method, especially since we can
+  use Proposal 340 to eliminate the RTT cost of this setup before sending data.
+  For those other ideas, see [ALTERNATIVE_LINKING] and [ALTERNATIVE_RTT], in
+  the appendix.
+
   The first two parts of the handshake establish the link, and enable
   resumption:
-  
-     16 -- RELAY_CIRCUIT_LINK
-  
-           Sent from the OP to the exit/service in order to link
-           circuits together at the end point.
-  
-     17 -- RELAY_CIRCUIT_LINKED
-  
-           Sent from the exit/service to the OP, to confirm the circuits
-           were linked.
-  
-  These cells have the following contents:
-  
+
+    19 -- RELAY_CONFLUX_LINK
+
+          Sent from the OP to the exit/service in order to link circuits
+          together at the end point.
+
+    20 -- RELAY_CONFLUX_LINKED
+
+          Sent from the exit/service to the OP, to confirm the circuits were
+          linked.
+
+  The contents of these two cells is exactly the same. They have the following
+  contents:
+
     VERSION   [1 byte]
     PAYLOAD   [variable, up to end of relay payload]
-  
+
   The VERSION tells us which circuit linking mechanism to use. At this
   point in time, only 0x01 is recognized and is the one described by the
   Conflux design.
-  
+
   For version 0x01, the PAYLOAD contains:
-  
+
      NONCE              [32 bytes]
      LAST_SEQNO_SENT    [8 bytes]
      LAST_SEQNO_RECV    [8 bytes]
-  
-  XXX: Should we let endpoints specify their preferred [SCHEDULING] alg
-  here, to override consensus params? This has benefits: eg low-memory
-  mobile clients can ask for an alg that is better for their reorder
-  queues. But it also has complexity risk, if the other endpoint does not
-  want to support it, because of its own memory issues.
-  
+     DESIRED_UX         [1 byte]
+
   The NONCE contains a random 256-bit secret, used to associate the two
   circuits together. The nonce MUST NOT be shared outside of the circuit
   transmission, or data may be injected into TCP streams. This means it
   MUST NOT be logged to disk.
-  
+
   The two sequence number fields are 0 upon initial link, but non-zero in
-  the case of a resumption attempt (See [RESUMPTION]).
-  
-  If either circuit does not receive a RELAY_CIRCUIT_LINKED response, both
+  the case of a reattach or resumption attempt (See [CONFLUX_SET_MANAGEMENT]
+  and [RESUMPTION]).
+
+  The DESIRED_UX field allows the endpoint to request the UX properties
+  it wants. The other endpoint SHOULD select the best known scheduling
+  algorithm, for these properties. The endpoints do not need to agree
+  on which UX style they prefer.
+
+  The UX properties are:
+
+    0 - NO_OPINION
+    1 - MIN_LATENCY
+    2 - LOW_MEM_LATENCY
+    3 - HIGH_THROUGHPUT
+    4 - LOW_MEM_THROUGHPUT
+
+  The algorithm choice is performed by to the *sender* of data, (ie: the
+  receiver of the PAYLOAD). The receiver of data (sender of the PAYLOAD)
+  does not need to be aware of the exact algorithm in use, but MAY enforce
+  expected properties (particularly low queue usage, in the case of requesting
+  either LOW_MEM_LATENCY or LOW_MEM_THROUGHPUT). The receiver MAY close the
+  entire conflux set if these properties are violated.
+
+  If either circuit does not receive a RELAY_CONFLUX_LINKED response, both
   circuits MUST be closed.
-  
+
   The third stage of the handshake exists to help the exit/service measure
   initial RTT, for use in [SCHEDULING]:
-  
-     18 -- RELAY_CIRCUIT_LINKED_RTT_ACK
-  
-           Sent from the OP to the exit/service, to provide initial RTT
-           measurement for the exit/service.
-  
-  For timeout of the handshake, clients SHOULD use the normal SOCKS/stream
-  timeout already in use for RELAY_BEGIN.
-  
-  These three relay commands (RELAY_CIRCUIT_LINK, RELAY_CIRCUIT_LINKED,
-  and RELAY_CIRCUIT_LINKED_RTT_ACK) are send on *each* leg, to allow each
-  endpoint to measure the initial RTT of each leg.
+
+    21 -- RELAY_CONFLUX_LINKED_ACK
+
+          Sent from the OP to the exit/service, to provide initial RTT
+          measurement for the exit/service.
+
+  These three relay commands are sent on *each* leg, to allow each endpoint to
+  measure the initial RTT of each leg.
+
+  The client SHOULD abandon and close circuit if the LINKED message takes too
+  long to arrive. This timeout MUST be no larger than the normal SOCKS/stream
+  timeout in use for RELAY_BEGIN, but MAY be the Circuit Build Timeout value,
+  instead. (The C-Tor implementation currently uses Circuit Build Timeout).
+
+  See [SIDE_CHANNELS] for rules for when to reject unexpected handshake cells.
 
 2.2. Linking Circuits from OP to Exit [LINKING_EXIT]
 
-  To link exit circuits, two circuits to the same exit are built. The
-  client records the circuit build time of each.
-  
-  If the circuits are being built on-demand, for immediate use, the
-  circuit with the lower build time SHOULD use Proposal 325 to append its
-  first RELAY cell to the RELAY_COMMAND_LINK, on the circuit with the
-  lower circuit build time. The exit MUST respond on this same leg.  After
-  that, actual RTT measurements MUST be used to determine future
-  transmissions, as specified in [SCHEDULING].
-  
-  The RTT times between RELAY_COMMAND_LINK and RELAY_COMMAND_LINKED are
-  measured by the client, to determine each circuit RTT to determine
-  primary vs secondary circuit use, and for packet scheduling.  Similarly,
-  the exit measures the RTT times between RELAY_COMMAND_LINKED and
-  RELAY_COMMAND_LINKED_RTT_ACK, for the same purpose.
-  
+  To link exit circuits, two circuits to the same exit are built, with
+  additional restrictions such that they do not share Guard or Middle
+  relays. This restriction applies via specific relay identity keys,
+  and not IP addresses, families, or networks. (This is because the purpose
+  of it is to avoid sharing a bottleneck *inside* relay circuit queues;
+  bottlenecks caused by a shared network are handled by TCP's congestion
+  control on the OR conns).
+
+  Bridges also are subject to the same constraint as Guard relays;
+  the C-Tor codebase emits a warn if only one bridge is configured, unless
+  that bridge has transport "snowflake". Snowflake is exempt from this
+  Guard restriction because it is actually backed by many bridges. In the
+  bridge case, we only warn, and do not refuse to build conflux circuits,
+  because it is not catastrophic that Bridges are shared, it is just
+  sub-optimal for performance and congestion.
+
+  When each circuit is opened, we ensure that congestion control
+  has been negotiated. If congestion control negotiation has failed, the
+  circuit MUST be closed. After this, the linking handshake begins.
+
+  The RTT times between RELAY_CONFLUX_LINK and RELAY_CONFLUX_LINKED are
+  measured by the client, to determine primary vs secondary circuit use,
+  and for packet scheduling. Similarly, the exit measures the RTT times
+  between RELAY_CONFLUX_LINKED and RELAY_CONFLUX_LINKED_ACK, for the same
+  purpose.
+
+  Because of the race between initial data and the RELAY_CONFLUX_LINKED_ACK
+  cell, conditions can arise where an Exit needs to send data before the
+  slowest circuit delivers this ACK. In these cases, it should prefer sending
+  data on the circuit that has delivered the ACK (which will arrive immediately
+  prior to any data from the client). This circuit will be the lower RTT
+  circuit anyway, but the code needs to handle the fact that in this case,
+  there won't yet be an RTT for the second circuit.
+
 2.3. Linking circuits to an onion service [LINKING_SERVICE]
-  
+
   For onion services, we will only concern ourselves with linking
   rendezvous circuits.
-  
+
   To join rendezvous circuits, clients make two introduce requests to a
   service's intropoint, causing it to create two rendezvous circuits, to
   meet the client at two separate rendezvous points. These introduce
   requests MUST be sent to the same intropoint (due to potential use of
   onionbalance), and SHOULD be sent back-to-back on the same intro
-  circuit. They MAY be combined with Proposal 325.
-  
-  The first rendezvous circuit to get joined SHOULD use Proposal 325 to
+  circuit. They MAY be combined with Proposal 340. (Note that if we do not
+  use Prop340, we will have to raise the limit on number of intros per
+  client circuit to 2, here, at intropoints).
+
+  When rendezvous circuits are built, they should use the same Guard,
+  Bridge, and Middle restrictions as specified in 2.2, for Exits. These
+  restrictions SHOULD also take into consideration all Middles in the path,
+  including the rendezvous point. All relay identities should be unique
+  (again, except for when the Snowflake transport is in use). The one
+  special case here is if the chosen rendezvous points by a client
+  are the same as the service's guards. In this case, the service SHOULD
+  NOT use different guards, but instead stick with the guards it has.
+  The reason for this is that we do not want to create the ability
+  for a client to force a service to use different guards.
+
+  The first rendezvous circuit to get joined SHOULD use Proposal 340 to
   append the RELAY_BEGIN command, and the service MUST answer on this
   circuit, until RTT can be measured.
-  
+
   Once both circuits are linked and RTT is measured, packet scheduling
   MUST be used, as per [SCHEDULING].
-  
-2.4. Congestion Control Application [CONGESTION_CONTROL]
-  
-  The SENDMEs for congestion control are performed per-leg. As data
-  arrives, regardless of its ordering, it is counted towards SENDME
-  delivery. In this way, 'cwnd - package_window' of each leg always
-  reflects the available data to send on each leg. This is important for
+
+2.4. Conflux Set Management [CONFLUX_SET_MANAGEMENT]
+
+  When managing legs, it is useful to separate sets that have completed the
+  link handshake from legs that are still performing the handshake. Linked
+  sets MAY have additional unlinked legs on the way, but these should not
+  be used for sending data until the handshake is complete.
+
+  It is also useful to enforce various additional conditions on the handshake,
+  depending on if [RESUMPTION] is supported, and if a leg has been launched
+  because of an early failure, or due to a desire for replacement.
+
+2.4.1. Pre-Building Sets
+
+  In C-Tor, conflux is only used via circuit prebuilding. Pre-built conflux
+  sets are preferred over other pre-built circuits, but if the pre-built pool
+  ends up empty, normal pre-built circuits are used. If those run out, regular
+  non-conflux circuits are built. In other words, in C-Tor, conflux sets are
+  never built on-demand, but this is strictly an implementation decision, to
+  simplify dealing with the C-Tor codebase
+
+  The consensus parameter 'cfx_max_prebuilt_set' specifies the number of
+  sets to pre-build.
+
+  During upgrade, the consensus parameter 'cfx_low_exit_threshold' will be
+  used, so that if there is a low amount of conflux-supporting exits, only
+  one conflux set will be built.
+
+2.4.2. Set construction
+
+  When a set is launched, legs begin the handshake in the unlinked state.
+  As handshakes complete, finalization is attempted, to create a linked set.
+  On the client, this finalization happens upon receipt of the LINKED cell.
+  On the exit/service, this finalization happens upon *sending* the LINKED_ACK.
+
+  The initiator of this handshake considers the set fully linked once the
+  RELAY_CONFLUX_LINKED_ACK is sent (roughly upon receipt of the LINKED cell).
+  Because of the potential race between LINKED_ACK, and initial data sent by
+  the client, the receiver of the handshake must consider a leg linked at
+  the time of *sending* a LINKED_ACK cell.
+
+  This means that exit legs may not have an RTT measurement, if data on the
+  faster leg beats the LINKED_ACK on the slower leg. The implementation MUST
+  account for this, by treating unmeasured legs as having infinite RTT.
+
+  When attempting to finalize a set, this finalization should not complete
+  if any unlinked legs are still pending.
+
+2.4.3. Closing circuits
+
+  For circuits that are unlinked, the origin SHOULD immediately relaunch a new
+  leg when it is closed, subject to the limits in [SIDE_CHANNELS].
+
+  In C-Tor, we do not support arbitrary resumption. Therefore, we perform
+  some additional checks upon closing circuits, to decide if we should
+  immediately tear down the entire set:
+     - If the closed leg was the current sending leg, close the set
+     - If the closed leg had the highest non-zero last_seq_recv/sent, close the set
+     - If data was in progress on a closed leg (inflight > cc_sendme_inc), then
+       all legs must be closed
+
+2.4.4. Reattaching Legs
+
+  While C-Tor does not support arbitrary resumption, new legs *can* be
+  attached, so long as there is no risk of data loss from a closed leg.
+  This enables latency probing, which will be important for UDP VoIP.
+
+  Currently, the C-Tor codebase checks for data loss by verifying that
+  the LINK/LINKED cell has a lower last_seq_sent than all current
+  legs' maximum last_seq_recv, and a lower last_seq_recv than all
+  current legs last_seq_sent.
+
+  This check is performed on finalization, not the receipt of first
+  handshake cell. This gives the data additional time to arrive.
+
+2.5. Congestion Control Application [CONGESTION_CONTROL]
+
+  The SENDMEs for congestion control are performed per-leg. As soon as
+  data arrives, regardless of its ordering, it is counted towards SENDME
+  delivery. In this way, 'cwnd - inflight' of each leg always reflects
+  the available data to send on each leg. This is important for
   [SCHEDULING].
-  
+
   The Congestion control Stream XON/XOFF can be sent on either leg, and
   applies to the stream's transmission on both legs.
-  
-2.5. Sequencing [SEQUENCING]
-  
+
+  In C-Tor, streams used to become blocked as soon as the OR conn
+  of their circuit was blocked. Because conflux can send on the other
+  circuit, which uses a different OR conn, this form of stream blocking
+  has been decoupled from the OR conn status, and only happens when
+  congestion control has decided that all circuits are blocked (congestion
+  control becomes blocked when either 'cwnd - inflight <= 0', *or* when
+  the local OR conn is blocked, so if all local OR conns of a set are
+  blocked, then the stream will block that way).
+
+  Note also that because congestion control only covers RELAY_COMMAND_DATA
+  cells, for all algorithms, a special case must be made such that if no
+  circuit is available to send on due to congestion control blocking,
+  commands other than RELAY_COMMAN_DATA MUST be sent on the current
+  circuit, even if the cell scheduler believes that no circuit is available.
+  Depending on the code structure of Arti, this special case may or may
+  not be necessary. It arises in C-Tor because nothing can block the
+  sending of arbitrary non-DATA relay command cells.
+
+2.6. Sequencing [SEQUENCING]
+
   With multiple paths for data, the problem of data re-ordering appears.
   In other words, cells can arrive out of order from the two circuits
   where cell N + 1 arrives before the cell N.
-  
+
   Handling this reordering operates after congestion control for each
   circuit leg, but before relay cell command processing or stream data
   delivery.
-  
+
   For the receiver to be able to reorder the receiving cells, a sequencing
   scheme needs to be implemented. However, because Tor does not drop or
   reorder packets inside of a circuit, this sequence number can be very
   small. It only has to signal that a cell comes after those arriving on
   another circuit.
-  
-  To achieve this, we add a small sequence number to the common relay
-  header for all relay cells on linked circuits. This sequence number is
-  meant to signal the number of cells sent on the *other* leg, so that
-  each endpoint knows how many cells are still in-flight on another leg.
-  It is different from the absolute sequence number used in
-  [LINKING_CIRCUITS] and [RESUMPTION], but can be derived from that
-  number, using relative arithmetic.
-  
-    Relay command   [1 byte]
-    Recognized      [2 bytes]
-    StreamID        [2 bytes]
-    Digest          [4 bytes]
-    Length          [2 bytes]
-  > LongSeq         [1 bit]  # If this bit is set, use 31 bits for Seq
-  > Sequencing      [7 or 31 bits]
-    Data            [Remainder]
-  
-  The sequence number is only set for the first cell after the endpoint
-  switches legs. In this case, LongSeq is set to 1, and the Sequencing
-  field is 31 more bits. Otherwise it is a 1 byte 0 value.
-  
-  These fields MUST be present on ALL end-to-end relay cells on each leg
-  that come from the endpoint, following a RELAY_CIRCUIT_LINK command.
-  
-  They are absent on 'leaky pipe' RELAY_COMMAND_DROP and
-  RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, as
-  opposed to the endpoint, to support padding.
-  
-  When an endpoint switches legs, on the first cell in a new leg, LongSeq
-  is set to 1, and the following 31 bits represent the *total* number of
-  cells sent on the *other* leg, before the switch. The receiver MUST wait
-  for that number of cells to arrive from the previous leg before
-  delivering that cell.
-  
-  XXX: In the rare event that we send more than 2^31 cells (~1TB) on a
-  single leg, do we force a switch of legs, or expand the field further?
-  
-  An alternative method of sequencing, that assumes that the endpoint
-  knows when it is going to switch, the cell before it switches, is
-  specified in [ALTERNATIVE_SEQUENCING]. Note that that method requires
-  only 1 byte for sequence number and switch signaling, but requires that
-  the sender know that it is planning to switch, the cell before it
-  switches.  (This is possible with [BLEST_TOR], but [LOWRTT_TOR] can
-  switch based on RTT change, so it may be one cell late in that case).
-
-2.6. Resumption [RESUMPTION]
+
+  To achieve this, we propose a new relay command used to indicate a switch to
+  another leg:
+
+    22 -- RELAY_CONFLUX_SWITCH
+
+          Sent from a sending endpoint when switching leg in an
+          already linked circuit construction. This message is sent on the leg
+          that will be used for new traffic, and tells the receiver the size of
+          the gap since the last data (if any) sent on that leg.
+
+  The cell payload format is:
+
+    SeqNum  [4 bytes]
+
+  The "SeqNum" value is a relative sequence number, which is the difference
+  between the last absolute sequence number sent on the new leg and the last
+  absolute sequence number sent on all other legs prior to the switch. In this
+  way, the endpoint knows what to increment its local absolute sequence number
+  by, before cells start to arrive.
+
+  To achieve this, the sender must maintain the last absolute sequence sent for
+  each leg, and the receiver must maintain the last absolute sequence number
+  received for each leg.
+
+  As an example, let's say we send 10 cells on the first leg, so our absolute
+  sequence number is 10. If we then switch to the second leg, it is trivial to
+  see that we should send a SWITCH with 10 as the relative sequence number, to
+  indicate that regardless of the order in which the first cells are received,
+  subsequent cells on the second leg should start counting at 10.
+
+  However, if we then send 21 cells on this leg, our local absolute sequence
+  number as the sender is 31. So when we switch back to the first leg, where
+  the last absolute sequence sent was 10, we must send a SWITCH cell with 21,
+  so that when the first leg receives subsequent cells, it assigns those cells
+  an absolute sequence number starting at 31.
+
+  In the rare event that we send more than 2^31 cells (~1TB) on a single leg,
+  the leg should be switched in order to reset that relative sequence number to
+  fit within 4 bytes.
+
+  For a discussion of rules to rate limit the usage of SWITCH as a side
+  channel, see [SIDE_CHANNELS].
+
+2.7. Resumption [RESUMPTION]
 
   In the event that a circuit leg is destroyed, they MAY be resumed.
-  
+  Full resumption is not supported in C-Tor, but is possible to implement,
+  at the expense of always storing roughly a congestion window of
+  already-transmitted data on each endpoint, in the worst case. Simpler
+  forms of resumption, where there is no data loss, are supported. This
+  is important to support latency probing, for ensuring UDP VoIP minimum
+  RTT requirements are met (roughly 300-500ms, depending on VoIP
+  implementation).
+
   Resumption is achieved by re-using the NONCE to the same endpoint
   (either [LINKING_EXIT] or [LINKING_SERVICE]). The resumed path need
   not use the same middle and guard relays as the destroyed leg(s), but
   SHOULD NOT share any relays with any existing legs(s).
-  
-  To provide resumption, endpoints store an absolute 64bit cell counter of
-  the last cell they have sent on a conflux pair (their LAST_SEQNO_SENT),
-  as well the last sequence number they have delivered in-order to edge
-  connections corresponding to a conflux pair (their LAST_SEQNO_RECV).
-  Additionally, endpoints MAY store the entire contents of unacked
-  inflight cells (ie the 'package_window' from proposal 324), for each
-  leg, along with information corresponding to those cells' absolute
-  sequence numbers.
-  
-  These 64 bit absolute counters can wrap without issue, as congestion
-  windows will never grow to 2^64 cells until well past the Singularity.
-  However, it is possible that extremely long, bulk circuits could exceed
-  2^64 total sent or received cells, so endpoints SHOULD handle wrapped
-  sequence numbers for purposes of computing retransmit information. (But
-  even this case is unlikely to happen within the next decade or so).
-  
-  Upon resumption, the LAST_SEQNO_SENT and LAST_SEQNO_RECV fields are used
-  to convey the sequence numbers of the last cell the relay sent and
-  received on that leg. The other endpoint can use these sequence numbers
-  to determine if it received the in-flight data or not, or sent more data
-  since that point, up to and including this absolute sequence number. If
-  LAST_SEQNO_SENT has not been received, the endpoint MAY transmit the
-  missing data, if it still has it buffered.
-  
-  Because both endpoints get information about the other side's absolute
-  SENT sequence number, they will know exactly how many re-transmitted
-  packets to expect, if the circuit is successfully resumed.
+
+  If data loss has been detected upon a link handshake, resumption can be
+  achieved by sending a switch cell, which is immediately followed by the
+  missing data. Roughly, each endpoint must check:
+    - if cell.last_seq_recv <
+         min(max(legs.last_seq_sent),max(closed_legs.last_seq_sent)):
+      - send a switch cell immediately with missing data:
+        (last_seq_sent - cell.last_seq_recv)
+
+  If an endpoint does not have this missing data due to memory pressure,
+  that endpoint MUST destroy *both* legs, as this represents unrecoverable
+  data loss.
 
   Re-transmitters MUST NOT re-increment their absolute sent fields
   while re-transmitting.
-  
-  If it does not have this missing data due to memory pressure, that
-  endpoint MUST destroy *both* legs, as this represents unrecoverable
-  data loss.
-  
-  Otherwise, the new circuit can be re-joined, and its RTT can be compared
-  to the remaining circuit to determine if the new leg is primary or
-  secondary.
-  
+
   It is even possible to resume conflux circuits where both legs have been
   collapsed using this scheme, if endpoints continue to buffer their
   unacked package_window data for some time after this close. However, see
   [TRAFFIC_ANALYSIS] for more details on the full scope of this issue.
-  
+
   If endpoints are buffering package_window data, such data should be
   given priority to be freed in any oomkiller invocation. See [MEMORY_DOS]
   for more oomkiller information.
 
+2.8. Data transmission
+
+  Most cells in Tor are circuit-specific, and should only be sent on a
+  circuit, even if that circuit is part of a conflux set. Cells that
+  are not multiplexed do not count towards the conflux sequence numbers.
+
+  However, in addition to the obvious RELAY_COMMAND_DATA, a subset of cells
+  MUST ALSO be multiplexed, so that their ordering is preserved when they
+  arrive at the other end. These cells do count towards conflux sequence
+  numbers, and are handled in the out-of-order queue, to preserve ordered
+  delivery:
+    RELAY_COMMAND_BEGIN
+    RELAY_COMMAND_DATA
+    RELAY_COMMAND_END
+    RELAY_COMMAND_CONNECTED
+    RELAY_COMMAND_RESOLVE
+    RELAY_COMMAND_RESOLVED
+    RELAY_COMMAND_XOFF
+    RELAY_COMMAND_XON
+
+  Currently, this set is the same as the set of cells that have stream ID,
+  but the property that leads to this requirement is not stream usage by
+  itself, it is that these cells must be ordered with respect to all data
+  on the circuit. It is not impossible that future relay commands could be
+  invented that don't have stream IDs, but yet must still arrive in order
+  with respect to circuit data cells. Prop#253 is one possible example of
+  such a thing (though we won't be implementing that proposal).
+
 
 3. Traffic Scheduling [SCHEDULING]
 
   In order to load balance the traffic between the two circuits, the
   original conflux paper used only RTT. However, with Proposal 324, we
   will have accurate information on the instantaneous available bandwidth
-  of each circuit leg, as 'cwnd - package_window' (see Section 3 of
-  Proposal 324).
-  
-  Some additional RTT optimizations are also useful, to improve
-  responsiveness and minimize out-of-order queue sizes.
-  
+  of each circuit leg, as 'cwnd - inflight' (see Section 3 of
+  Proposal 324). We also have the TCP block state of the local OR
+  connection.
+
   We specify two traffic schedulers from the multipath literature and
-  adapt them to Tor: [LOWRTT_TOR] and [BLEST_TOR]. [LOWRTT_TOR] also has
-  three variants, with different trade offs.
-  
-  However, see the [TRAFFIC_ANALYSIS] sections of this proposal for
+  adapt them to Tor: [MINRTT_TOR], and [LOWRTT_TOR]. Additionally,
+  we create low-memory variants of these that aim to minimize the
+  out-of-order queue size at the receiving endpoint.
+
+  Additionally, see the [TRAFFIC_ANALYSIS] sections of this proposal for
   important details on how this selection can be changed, to reduce
   website traffic fingerprinting.
 
-3.1. LowRTT Scheduling [LOWRTT_TOR]
-
-  This scheduling algorithm is based on the original [CONFLUX] paper, with
-  ideas from [MPTCP]'s minRTT/LowRTT scheduler.
-  
-  In this algorithm, endpoints send cells on the circuit with lower RTT
-  (primary circuit). This continues while the congestion window on the
-  circuit has available room: ie whenever cwnd - package_window > 0.
-  
-  Whenever the primary circuit's congestion window becomes full, the
-  secondary circuit is used. We stop reading on the send window source
-  (edge connection) when both congestion windows become full.
-  
+3.1. MinRTT scheduling [MINRTT_TOR]
+
+  This schedulng algorithm is used for the MIN_LATENCY user experience.
+
+  It works by always and only sending on the circuit with the current minimum
+  RTT. With this algorithm, conflux should effectively stay on the circuit with
+  the lowest initial RTT, unless that circuit's RTT raises above the RTT of the
+  other circuit (due to relay load or congestion). When the circuit's congestion
+  window is full (ie: cwnd - inflight <= 0), or if the local OR conn blocks,
+  the conflux set stops transmitting and stops reading on edge connections,
+  rather than switch.
+
+  This should result in low out-of-order queues in most situations, unless
+  the initial RTTs of the two circuits are very close (basically within the
+  Vegas RTT bounds of queue variance, 'alpha' and 'beta').
+
+3.2. LowRTT Scheduling [LOWRTT_TOR]
+
+  This scheduling algorithm is based on [MPTCP]'s LowRTT scheduler. This
+  algorithm is used for the UX choice of HIGH_THROUGHPUT.
+
+  In this algorithm, endpoints send cells on the circuit with lowest RTT that
+  has an unblocked local OR connection, and room in its congestion window (ie:
+  cwnd - inflight > 0). We stop reading on edge connections only when both
+  congestion windows become full, or when both local OR connections are blocked.
+
   In this way, unlike original conflux, we switch to the secondary circuit
-  without causing congestion on the primary circuit. This improves both
-  load times, and overall throughput.
-  
-  This behavior matches minRTT from [MPTCP], sometimes called LowRTT.
-  
-  It may be better to stop reading on the edge connection when the primary
-  congestion window becomes full, rather than switch to the secondary
-  circuit as soon as the primary congestion window becomes full. (Ie: only
-  switch if the RTTs themselves change which circuit is primary). This is
-  what was done in the original Conflux paper. This behavior effectively
-  causes us to optimize for responsiveness and congestion avoidance,
-  rather than throughput. For evaluation, we will control this switching
-  behavior with a consensus parameter (see [CONSENSUS_PARAMETERS]).
-  
-  Because of potential side channel risk (see [SIDE_CHANNELS]), a third
-  variant of this algorithm, where the primary circuit is chosen during
-  the [LINKING_CIRCUITS] handshake and never changed, is also possible
-  to control via consensus parameter.
-
-3.2. BLEST Scheduling [BLEST_TOR]
+  without causing congestion either locally, or on either circuit. This
+  improves both load times, and overall throughput. Given a large enough
+  transmission, both circuits are used to their full capacity,
+  simultaneously.
+
+3.3. MinRTT Low-Memory Scheduling [MINRTT_LOWMEM_TOR]
+
+  The low memory version of the MinRTT scheduler ensures that we do not
+  perform a switch more often than once per congestion window worth of data.
+
+  XXX: Other rate limiting, such as not switching unless the RTT changes by
+  more than X%, may be useful here.
+
+3.4. BLEST Scheduling [BLEST_TOR]
+
+  XXX: Something like this might be useful for minimizing OOQ for the UX
+  choice of LOW_MEM_THROUGHPUT, but we might just be able to reduce switching
+  frequency instead.
+
+  XXX: We want an algorithm that only uses cwnd instead. This algorithm
+  has issues if the primary cwnd grows while the secondary does not.
+  Expect this section to change.
 
   [BLEST] attempts to predict the availability of the primary circuit, and
   use this information to reorder transmitted data, to minimize
   head-of-line blocking in the recipient (and thus minimize out-of-order
   queues there).
-  
+
   BLEST_TOR uses the primary circuit until the congestion window is full.
   Then, it uses the relative RTT times of the two circuits to calculate
   how much data can be sent on the secondary circuit faster than if we
-  just waited for the primary circuit to become available. 
-  
-  This is achieved by computing two variables at the sender:   
-  
+  just waited for the primary circuit to become available.
+
+  This is achieved by computing two variables at the sender:
+
     rtts = secondary.currRTT / primary.currRTT
     primary_limit = primary.cwnd + (rtts-1)/2)*rtts
-  
+
   Note: This (rtts-1)/2 factor represents anticipated congestion window
   growth over this period.. it may be different for Tor, depending on CC
   alg.
-  
+
   If primary_limit < secondary.cwnd - (secondary.package_window + 1), then
   there is enough space on the secondary circuit to send data faster than
   we could than waiting for the primary circuit.
-  
+
   XXX: Note that BLEST uses total_send_window where we use secondary.cwnd
   in this check. total_send_window is min(recv_win, CWND). But since Tor
   does not use receive windows and instead uses stream XON/XOFF, we only
@@ -472,69 +671,28 @@ Status: Draft
   hopefully this is fine. If we need to, we could turn [REORDER_SIGNALING]
   into a receive window indication of some kind, to indicate remaining
   buffer size.
-  
+
   Otherwise, if the primary_limit condition is not hit, cease reading on
   source edge connections until SENDME acks come back.
-  
+
   Here is the pseudocode for this:
-  
+
     while source.has_data_to_send():
       if primary.cwnd > primary.package_window:
         primary.send(source.get_packet())
         continue
-    
+
       rtts = secondary.currRTT / primary.currRTT
       primary_limit = (primary.cwnd + (rtts-1)/2)*rtts
-    
+
       if primary_limit < secondary.cwnd - (secondary.package_window+1):
         secondary.send(source.get_packet())
       else:
         break # done for now, wait for SENDME to free up CWND and restart
-  
+
   Note that BLEST also has a parameter lambda that is updated whenever HoL
   blocking occurs. Because it is expensive and takes significant time to
   signal this over Tor, we omit this.
-  
-  XXX: See [REORDER_SIGNALING] section if we want this lambda feedback.
-
-3.3. Reorder queue signaling [REORDER_SIGNALING]
-
-  Reordering is fairly simple task. By following using the sequence
-  number field in [SEQUENCING], endpoints can know how many cells are
-  still in flight on the other leg.
-  
-  To reorder them properly, a buffer of out of order cells needs to be
-  kept. On the Exit side, this can quickly become overwhelming
-  considering ten of thousands of possible circuits can be held open
-  leading to gigabytes of memory being used. There is a clear potential
-  memory DoS vector in this case, covered in more detail in
-  [MEMORY_DOS].
-
-  Luckily, [BLEST_TOR] and the form of [LOWRTT_TOR] that only uses the
-  primary circuit will minimize or eliminate this out-of-order buffer.
-  
-  XXX: The remainder of this section may be over-complicating things... We
-  only need these concepts if we want to use BLEST's lambda feedback. Though
-  turning this into some kind of receive window that indicates remaining
-  reorder buffer size may also help with the total_send_window also noted
-  in BLEST_TOR.
-  
-  The default for this queue size is governed by the 'cflx_reorder_client'
-  and 'cflx_reorder_srv' consensus parameters (see [CONSENSUS_PARAMS]).
-  'cflx_reorder_srv' applies to Exits and onion services. Both parameters
-  can be overridden by Torrc, to larger or smaller than the consensus
-  parameter. (Low memory clients may want to lower it; SecureDrop onion
-  services or other high-upload services may want to raise it).
-  
-  When the reorder queue hits this size, a RELAY_CONFLUX_XOFF is sent down
-  the circuit leg that has data waiting in the queue and use of that leg
-  SHOULD cease, until it drains to half of this value, at which point an
-  RELAY_CONFLUX_XON is sent. Note that this is different than the stream
-  XON/XOFF from Proposal 324.
-  
-  XXX: [BLEST] actually does not cease use of a path in this case, but
-  instead uses this signal to adjust the lambda parameter, which biases
-  traffic away from that leg.
 
 
 4. Security Considerations
@@ -543,109 +701,152 @@ Status: Draft
 
   Both reorder queues and retransmit buffers inherently represent a memory
   denial of service condition.
-  
+
   For [RESUMPTION] retransmit buffers, endpoints that support this feature
   SHOULD free retransmit information as soon as they get close to memory
   pressure. This prevents resumption while data is in flight, but will not
   otherwise harm operation.
-  
-  For reorder buffers, adversaries can potentially impact this at any
-  point, but most obviously and most severely from the client position.
-  
-  In particular, clients can lie about sequence numbers, sending cells
-  with sequence numbers such that the next expected sequence number is
-  never sent. They can do this repeatedly on many circuits, to exhaust
-  memory at exits.
-  
-  One option is to only allow actual traffic splitting in the downstream
-  direction, towards clients, and always use the primary circuit for
-  everything in the upstream direction. However, the ability to support
-  conflux from the client to the exit shows promise against traffic
-  analysis (see [WTF_SPLIT]).
-  
-  The other option is to use [BLEST_TOR] from clients to exits, as it has
-  predictable interleaved cell scheduling, and minimizes reorder queues at
-  exits. If the ratios prescribed by that algorithm are not followed
-  within some bounds, the other endpoint can close both circuits, and free
-  the queue memory.
-  
-  This still leaves the possibility that intermediate relays may block a
-  leg, allowing cells to traverse only one leg, thus still accumulating at
-  the reorder queue. Clients can also spoof sequence numbers similarly, to
-  make it appear that they are following [BLEST_TOR], without actually
-  sending any data on one of the legs.
-  
-  To handle either of these cases, when a relay is under memory pressure,
-  the circuit OOM killer SHOULD free and close circuits with the oldest
-  reorder queue data, first. This heuristic was shown to be best during
-  the [SNIPER] attack OOM killer iteration cycle.
-
-4.2. Side Channels [SIDE_CHANNELS]
-
-  Two potential side channels may be introduced by the use of Conflux:
-     1. RTT leg-use bias by altering SENDME latency
+
+  In terms of adversarial issues, clients can lie about sequence numbers,
+  sending cells with sequence numbers such that the next expected sequence
+  number is never sent.  They can do this repeatedly on many circuits, to
+  exhaust memory at exits.  Intermediate relays may also block a leg, allowing
+  cells to traverse only one leg, thus still accumulating at the reorder queue.
+
+  In C-Tor we will mitigate this in three ways: via the OOM killer, by the
+  ability for exits to request that clients use the LOW_MEM_LATENCY UX
+  behavior, and by rate limiting the frequency of switching under the
+  LOW_MEM_LATENCY UX style.
+
+  When a relay is under memory pressure, the circuit OOM killer SHOULD free
+  and close circuits with the oldest reorder queue data, first. This heuristic
+  was shown to be best during the [SNIPER] attack OOM killer iteration cycle.
+
+  The rate limiting under LOW_MEM_LATENCY will be heuristic driven, based
+  on data from Shadow simulations, and live network testing. It is possible that
+  other algorithms may be able to be similarly rate limited.
+
+4.2. Protocol Side Channels [SIDE_CHANNELS]
+
+  To understand the decisions we make below with respect to handling
+  potential side channels, it is important to understand a bit of the history
+  of the Tor threat model.
+
+  Tor's original threat model completely disregarded all traffic analysis,
+  including protocol side channels, assuming that they were all equally
+  effective, and that diversity of relays was what provided protection.
+  Numerous attack papers have proven this to be an over-generalization.
+
+  Protocol side channels are most severe when a circuit is known to be silent,
+  because stateful protocol behavior prevents other normal cells from ever being
+  sent. In these cases, it is trivial to inject a packet count pattern that has
+  zero false positives. These kinds of side channels are made use of in the
+  Guard discovery literature, such as [ONION_FOUND], and [DROPMARK]. It is even
+  more trivial to manipulate the AES-CTR cipherstream, as per [RACOON23], until
+  we implement [PROP308].
+
+  However, because we do not want to make this problem worse, it is extremely
+  important to be mindful of ways that an adversary can inject new cell
+  commands, as well as ways that the adversary can spawn new circuits
+  arbitrarily.
+
+  It is also important, though slightly less so, to be mindful of the uniqueness
+  of new handshakes, as handshakes can be used to classify usage (such as via
+  Onion Service Circuit Fingerprinting). Handshake side channels are only
+  weakly defended, via padding machines for onion services. These padding
+  machines will need to be improved, and this is also scheduled for arti.
+
+  Finally, usage-based traffic analysis need to be considered. This includes
+  things like website traffic fingerprinting, and is covered in
+  [TRAFFIC_ANALYSIS].
+
+4.2.1. Cell Injection Side Channel Mitigations
+
+  To avoid [DROPMARK] attacks, several checks must be performed, depending
+  on the cell type. The circuit MUST be closed if any of these checks fail.
+
+  RELAY_CONFLUX_LINK:
+    - Ensure conflux is enabled
+    - Ensure the circuit is an Exit (or Service Rend) circuit
+    - Ensure that no previous LINK cell has arrived on this circuit
+
+  RELAY_CONFLUX_LINKED:
+    - Ensure conflux is enabled
+    - Ensure the circuit is client-side
+    - Ensure this is an unlinked circuit that sent a LINK command
+    - Ensure that the nonce matches the nonce used in the LINK command
+    - Ensure that the cell came from the expected hop
+
+  RELAY_CONFLUX_LINKED_ACK:
+    - Ensure conflux is enabled
+    - Ensure that this circuit is not client-side
+    - Ensure that the circuit has successfully received its LINK cell
+    - Ensure that this circuit has not received a LINKED_ACK yet
+
+  RELAY_CONFLUX_SWITCH
+    - If Prop#340 is in use, this cell MUST be packed with a valid
+      multiplexed RELAY_COMMAND cell.
+    - XXX: Additional rate limiting per algorithm, after tuning.
+
+4.2.2. Guard Discovery Side Channel Mitigations
+
+  In order to mitigate potential guard discovery by malicious exits,
+  clients MUST NOT retry failed unlinked circuit legs for a set more than
+  'cfx_max_unlinked_leg_retry' times.
+
+4.2.3. Usage-Based Side Channel Discussion
+
+  After we have solved all of the zero false positive protocol side
+  channels in Tor, our attention can turn to more subtle, usage-based
+  side channels.
+
+  Two potential usage side channels may be introduced by the use of Conflux:
+     1. Delay-based side channels, by manipulating switching
      2. Location info leaks through the use of both leg's latencies
-  
-  For RTT and leg-use bias, Guard relays could delay legs to introduce a
-  pattern into the delivery of cells at the exit relay, by varying the
-  latency of SENDME cells (every 100th cell) to change the distribution of
-  traffic to send information. This attack could be performed in either
-  direction of traffic, to bias traffic load off of a particular Guard.
-  If an adversary controls both Guards, it could in theory send a binary
-  signal more easily, by alternating delays on each.
-  
-  However, this risk weighs against the potential benefits against traffic
-  fingerprinting, as per [WTF_SPLIT]. Additionally, even ignoring
-  cryptographic tagging attacks, this side channel provides significantly
-  lower information over time than inter-packet-delay based side channels
-  that are already available to Guards and routers along the path to the
-  Guard.
-  
-  Tor currently provides no defenses against already existing
-  single-circuit delay-based side channels, though both circuit padding
-  and [BACKLIT] are potential options it could conceivably deploy. The
-  [BACKLIT] paper also has an excellent review of the various methods that
-  have been studied for such single circuit side channels, and the
-  [BACKLIT] style RTT monitoring could be used to protect against these
-  conflux side channels as well. Circuit padding can also help to obscure
-  which cells are SENDMEs, since circuit padding is not counted towards
-  SENDME totals.
-  
+
+  To perform delay-based side channels, Exits can simply disregard the RTT
+  or cwnd when deciding to switch legs, thus introducing a pattern of gaps that
+  the Guard node can detect. Guard relays can also delay legs to introduce a
+  pattern into the delivery of cells at the exit relay, by varying the latency
+  of SENDME cells (every 31st cell) to change the distribution of traffic to
+  send information. This attack could be performed in either direction of
+  traffic, to bias traffic load off of a particular Guard. If an adversary
+  controls both Guards, it could in theory send a binary signal, by
+  alternating delays on each.
+
+  However, Tor currently provides no defenses against already existing
+  single-circuit delay-based (or stop-and-start) side channels. It is already
+  the case that on a single circuit, either the Guard or the Exit can simply
+  withhold sending traffic, as per a recognizable pattern. This class of
+  attacks, and a possible defense for them, is discussed in [BACKLIT].
+
+  However, circuit padding can also help to obscure these side channels,
+  even if tuned for website fingerprinting. See [TRAFFIC_ANALYSIS] for more
+  details there.
+
   The second class of side channel is where the Exit relay may be able to
   use the two legs to further infer more information about client
   location. See [LATENCY_LEAK] for more details. It is unclear at this
   time how much more severe this is for two paths than just one.
-  
+
   We preserve the ability to disable conflux to and from Exit relays
   using consensus parameters, if these side channels prove more severe,
   or if it proves possible possible to mitigate single-circuit side
   channels, but not conflux side channels.
-  
-  In all cases, all of these side channels appear less severe for onion
-  service traffic, due to the higher path variability due to relay
-  selection, as well as the end-to-end nature of conflux in that case.
-  Thus, we separate our ability to enable/disable conflux for onion
-  services from Exits.
 
 4.3. Traffic analysis [TRAFFIC_ANALYSIS]
 
   Even though conflux shows benefits against traffic analysis in
   [WTF_SPLIT], these gains may be moot if the adversary is able to perform
   packet counting and timing analysis at guards to guess which specific
-  circuits are linked.  In particular, the 3 way handshake in
+  circuits are linked. In particular, the 3 way handshake in
   [LINKING_CIRCUITS] may be quite noticeable.
-  
-  As one countermeasure, it may be possible to eliminate the third leg
-  (RELAY_CIRCUIT_LINKED_RTT_ACK) by computing the exit/service RTT via
-  measuring the time between CREATED/REND_JOINED and RELAY_CIRCUIT_LINK,
-  but this will introduce cross-component complexity into Tor's protocol
-  that could quickly become unwieldy and fragile.
-  
+
   Additionally, the conflux handshake may make onion services stand out
   more, regardless of the number of stages in the handshake. For this
-  reason, it may be more wise to simply address these issues with circuit
+  reason, it may be wise to simply address these issues with circuit
   padding machines during circuit setup (see padding-spec.txt).
-  
+
   Additional traffic analysis considerations arise when combining conflux
   with padding, for purposes of mitigating traffic fingerprinting. For
   this, it seems wise to treat the packet schedulers as another piece of a
@@ -653,7 +854,7 @@ Status: Draft
   machines, perhaps introducing randomness or fudge factors their
   scheduling, as a parameterized distribution. For details, see
   https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md
-  
+
   Finally, conflux may exacerbate forms of confirmation-based traffic
   analysis that close circuits to determine concretely if they were in
   use, since closing either leg might cause resumption to fail. TCP RST
@@ -661,44 +862,58 @@ Status: Draft
   capability. [RESUMPTION] with buffering of the inflight unacked
   package_window data, for retransmit, is a partial mitigation, if
   endpoints buffer this data for retransmission for a brief time even if
-  both legs close. This seems more feasible for onion services, which are
-  more vulnerable to this attack. However, if the adversary controls the
-  client, they will notice the resumption re-link, and still obtain
+  both legs close. This buffering seems more feasible for onion services,
+  which are more vulnerable to this attack. However, if the adversary
+  controls the client and is attacking the service in this way, they
+  will notice the resumption re-link at their client, and still obtain
   confirmation that way.
-  
+
   It seems the only way to fully mitigate these kinds of attacks is with
   the Snowflake pluggable transport, which provides its own resumption and
   retransmit behavior. Additionally, Snowflake's use of UDP DTLS also
   protects against TCP RST injection, which we suspect to be the main
   vector for such attacks.
-  
+
   In the future, a DTLS or QUIC transport for Tor such as masque could
   provide similar RST injection resistance, and resumption at Guard/Bridge
   nodes, as well.
 
+5. Consensus Parameters [CONSENSUS]
 
-5. System Interactions
+  - cfx_enabled
+    - Values: 0=off, 1=on
+    - Description: Emergency off switch, in case major issues are discovered.
 
-  - congestion control
-  - EWMA and KIST
-  - CBT and number of guards
-  - Onion service circ obfuscation
-  - Future UDP (may increase need for UDP to buffer before dropping)
-  - Padding (no sequence numbers on padding cells, as per [SEQUENCING])
-    - Also, any padding machines may need re-tuning
-  - No 'cannibalization' of linked circuits
+  - cfx_low_exit_threshold
+    - Range: 0-10000
+    - Description: Fraction out of 10000 that represents the fractional rate of
+      exits that must support protover 5. If the fraction is below this
+      amount, the number of pre-built sets is restricted to 1.
 
+  - cfx_max_linked_set
+    - Range: 0-255
+    - Description: The total number of linked sets that can be created. 255
+      means "unlimited".
 
-6. Consensus and Torrc Parameters [CONSENSUS]
+  - cfx_max_prebuilt_set
+    - Range: 0-255
+    - Description: The maximum number of pre-built conflux sets to make.
+      This value is overridden by the 'cfx_low_exit_threshold' criteria.
 
-  - conflux_circs
-    - Number of conflux circuits
+  - cfx_max_unlinked_leg_retry
+    - Range: 0-255
+    - Description: The maximum number of times to retry an unlinked leg that
+      fails during build or link, to mitigate guard discovery attacks.
 
-  - conflux_sched_exits, conflux_sched_clients, conflux_sched_service
-    - Three forms of LOWRTT_TOR, and BLEST_TOR
+  - cfx_num_legs_set
+    - Range: 0-255
+    - Description: The number of legs to link in a set.
 
-  - ConfluxOnionService
-  - ConfluxOnionCircs
+  - cfx_send_pct
+    - XXX: Experimental tuning parameter. Subject to change/removal.
+
+  - cfx_drain_pct
+    - XXX: Experimental tuning parameter. Subject to change/removal.
 
 
 7. Tuning Experiments [EXPERIMENTS]
@@ -713,51 +928,11 @@ Status: Draft
 
 Appended A [ALTERNATIVES]
 
-A.1 BEGIN/END sequencing [ALTERNATIVE_SEQUENCING]
-
-  In this method of signaling, we increment the sequence number by 1 only
-  when we switch legs, and use BEGIN/END "bookends" to know that all data
-  on a leg has been received.
-  
-  To achieve this, we add a small sequence number to the common relay
-  header for all relay cells on linked circuits, as well as a field to
-  signal the beginning of a sequence, intermediate data, and the end of a
-  sequence.
-  
-    Relay command   [1 byte]
-    Recognized      [2 bytes]
-    StreamID        [2 bytes]
-    Digest          [4 bytes]
-    Length          [2 bytes]
-  > Switching       [2 bits]    # 01 = BEGIN, 00 = CONTINUE, 10 = END
-  > Sequencing      [6 bits]
-    Data            [PAYLOAD_LEN - 12 - Length bytes]
-  
-  These fields MUST be present on ALL end-to-end relay cells on each leg
-  that come from the endpoint, following a RELAY_CIRCUIT_LINK command.
-  
-  They are absent on 'leaky pipe' RELAY_COMMAND_DROP and
-  RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, as
-  opposed to the endpoint, to support padding.
-  
-  Sequence numbers are incremented by one when an endpoint switches legs
-  to transmit a cell. This number will wrap; implementations MUST treat
-  0 as the next sequence after 2^6-1. Because we do not expect to support
-  significantly more than 2 legs, and much fewer than 63, this is not an
-  issue.
-  
-  The first cell on a new circuit MUST use the BEGIN code for switching.
-  Cells are delivered from that circuit until an END switching signal is
-  received, even if cells arrive first on another circuit with the next
-  sequence number before and END switching field. Recipients MUST only
-  deliver cells with a BEGIN, if their Sequencing number is one more than
-  the last END.
-
-A.2 Alternative Link Handshake [ALTERNATIVE_LINKING]
+A.1 Alternative Link Handshake [ALTERNATIVE_LINKING]
 
   The circuit linking in [LINKING_CIRCUITS] could be done as encrypted
   ntor onionskin extension fields, similar to those used by v3 onions.
-  
+
   This approach has at least four problems:
     i). For onion services, since onionskins traverse the intro circuit
         and return on the rend circuit, this handshake cannot measure
@@ -771,58 +946,58 @@ A.2 Alternative Link Handshake [ALTERNATIVE_LINKING]
    iv). The overhead in processing this onionskin in onionskin queues
         adds additional time for linking, even in the Exit case, making
         that RTT potentially noisy.
-  
+
   Additionally, it is not clear that this approach actually saves us
   anything in terms of setup time, because we can optimize away the
-  linking phase using Proposal 325, to combine initial RELAY_BEGIN cells
+  linking phase using Proposal 340, to combine initial RELAY_BEGIN cells
   with RELAY_CIRCUIT_LINK.
 
-A.3. Alternative RTT measurement [ALTERNATIVE_RTT]
+A.2. Alternative RTT measurement [ALTERNATIVE_RTT]
 
   Instead of measuring RTTs during [LINKING_CIRCUITS], we could create
   PING/PONG cells, whose sole purpose is to allow endpoints to measure
   RTT.
-  
+
   This was rejected for several reasons. First, during circuit use, we
   already have SENDMEs to measure RTT. Every 100 cells (or
   'circwindow_inc' from Proposal 324), we are able to re-measure RTT based
   on the time between that Nth cell and the SENDME ack. So we only need
   PING/PONG to measure initial circuit RTT.
-  
+
   If we were able to use onionskins, as per [ALTERNATIVE_LINKING] above,
   we might be able to specify a PING/PONG/PING handshake solely for
   measuring initial RTT, especially for onion service circuits.
-  
+
   The reason for not making a dedicated PING/PONG for this purpose is that
   it is context-free. Even if we were able to use onionskins for linking
   and resumption, to avoid additional data in handshake that just measures
   RTT, we would have to enforce that this PING/PONG/PING only follows the
   exact form needed by this proposal, at the expected time, and at no
   other points.
-  
+
   If we do not enforce this specific use of PING/PONG/PING, it becomes
   another potential side channel, for use in attacks such as [DROPMARK].
-  
+
   In general, Tor is planning to remove current forms of context-free and
   semantic-free cells from its protocol:
   https://gitlab.torproject.org/tpo/core/torspec/-/issues/39
-  
+
   We should not add more.
 
 
-Appendix B: Acknowledgments [ACKNOWLEDGEMENTS]
+Appendix B: Acknowledgments [ACKNOWLEDGMENTS]
 
   Thanks to Per Hurtig for helping us with the framing of the MPTCP
   problem space.
-  
+
   Thanks to Simone Ferlin for clarifications on the [BLEST] paper, and for
   pointing us at the Linux kernel implementation.
-  
+
   Extreme thanks goes again to Toke Høiland-Jørgensen, who helped
   immensely towards our understanding of how the BLEST condition relates
   to edge connection pushback, and for clearing up many other
   misconceptions we had.
-  
+
   Finally, thanks to Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian
   Goldberg, for the original [CONFLUX] paper!
 
@@ -859,3 +1034,22 @@ References:
 
 [DROPMARK]
    https://www.petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf
+
+[RACCOON23]
+   https://archives.seul.org/or/dev/Mar-2012/msg00019.html
+
+[ONION_FOUND]
+   https://www.researchgate.net/publication/356421302_From_Onion_Not_Found_to_Guard_Discovery/fulltext/619be24907be5f31b7ac194a/From-Onion-Not-Found-to-Guard-Discovery.pdf
+
+[VANGUARDS_ADDON]
+  https://github.com/mikeperry-tor/vanguards/blob/master/README_TECHNICAL.md
+
+[PROP324]
+  https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/324-rtt-congestion-control.txt
+
+[PROP339]
+  https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/339-udp-over-tor.md
+
+[PROP308]
+  https://gitlab.torproject.org/tpo/core/torspec/-/blob/main/proposals/308-counter-galois-onion.txt
+```