From 9f39565e61b65bb5f65d37ed779d17fe1c596179 Mon Sep 17 00:00:00 2001
From: David Goulet
Date: Wed, 6 Feb 2019 17:24:01 -0500
Subject: Proposal 329: Conflux traffic splitting

---
 proposals/329-traffic-splitting.txt | 843 ++++++++++++++++++++++++++++++++++++
 1 file changed, 843 insertions(+)
 create mode 100644 proposals/329-traffic-splitting.txt

Filename: 329-traffic-splitting.txt
Title: Overcoming Tor's Bottlenecks with Traffic Splitting
Author: David Goulet, Mike Perry
Created: 2020-11-25
Status: Draft

0. Status

   This proposal describes the Conflux [CONFLUX] system developed by Mashael
   AlSabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. It aims to improve
   Tor client network performance by dynamically splitting traffic between
   two circuits.


1. Overview

1.1. Multipath TCP Design Space

   In order to understand our improvements to Conflux, it is important to
   properly conceptualize what is involved in the design of multipath
   algorithms in general.

   The design space is broken into two orthogonal parts: congestion control
   algorithms that apply to each path, and traffic scheduling algorithms
   that decide which path each packet should be sent on.

   MPTCP specifies 'coupled' congestion control (see [COUPLED]). Coupled
   congestion control updates single-path congestion control algorithms to
   account for shared bottlenecks between the paths, so that the combined
   congestion control algorithms do not overwhelm any bottlenecks that
   happen to be shared between the multiple paths. Various ways of
   accomplishing this have been proposed and implemented in the Linux
   kernel.

   Because Tor's congestion control only concerns itself with bottlenecks
   in Tor relay queues, and not with any other bottlenecks (such as
   intermediate Internet routers), we can avoid this complexity merely by
   specifying that any paths that are constructed should not share any
   relays. In this way, we can proceed to use the exact same congestion
   control as specified in Proposal 324 for each path.

   For this reason, this proposal will focus on the traffic scheduling
   algorithms, rather than coupling. We propose three candidate algorithms
   that have been studied in the literature, and will compare their
   performance using simulation and consensus parameters.

1.2. Divergence from the initial Conflux design

   The initial [CONFLUX] paper does not provide any indication of how to
   bound the size of the out-of-order cell queue, which we consider a
   potentially dangerous memory DoS vector (see [MEMORY_DOS]). It also used
   RTT as the sole heuristic for selecting which circuit to send on. RTT
   varies with the geographical locations of the participating relays, and
   says nothing about the actual available circuit capacity (which will be
   available to us via Proposal 324). Additionally, since the publication
   of [CONFLUX], more modern packet scheduling algorithms have been
   developed, which aim to reduce out-of-order queue size.

   We propose mitigations for these issues using modern scheduling
   algorithms, as well as implementation options for avoiding the
   out-of-order queue at Exit relays. Additionally, we consider resumption,
   side channel, and traffic analysis risks and benefits in [RESUMPTION],
   [SIDE_CHANNELS] and [TRAFFIC_ANALYSIS].


2. Design

   The following section describes the Conflux design. Each sub-section is
   a building block of the multipath design that Conflux proposes.

   The circuit construction is as follows:

                       Primary Circuit (lower RTT)
              +-------+      +--------+
              |Guard 1|----->|Middle 1|----------+
              +---^---+      +--------+          |
   +-----+        |                           +--v---+
   | OP  +--------+                           | Exit |--> ...
   +-----+        |                           +--^---+
              +---v---+      +--------+          |
              |Guard 2|----->|Middle 2|----------+
              +-------+      +--------+
                      Secondary Circuit (higher RTT)

   Both circuits are built using current Tor path selection; however, they
   SHOULD NOT share the same Guard or middle relay. By avoiding using the
   same relays in these positions in the path, we ensure additional path
   capacity, and eliminate the need to use the more complicated 'coupled'
   congestion control algorithms from the MPTCP literature [COUPLED]. This
   both simplifies the design, and improves performance.

   Then, the OP needs to link the two circuits together, as described in
   [LINKING_CIRCUITS], [LINKING_EXIT], and [LINKING_SERVICE].

   For ease of explanation, the primary circuit is the circuit with lower
   RTT, and the secondary circuit is the circuit with higher RTT. Initial
   RTT is measured during circuit linking, as described in
   [LINKING_CIRCUITS]. RTT is continually measured using SENDME timing, as
   in Proposal 324. This means that during use, the primary circuit and
   secondary circuit may switch roles, depending on unrelated network
   congestion caused by other Tor clients.

   We also support linking onion service circuits together. In this case,
   only two rendezvous circuits are linked. Each of these RP circuits will
   be constructed separately, and then linked. However, the same path
   constraints apply to each half of the circuits (no shared relays between
   the legs). Should, by chance, the service and the client sides end up
   sharing some relays, this is not catastrophic. Multipath TCP researchers
   we have consulted believe Tor's congestion control from Proposal 324 to
   be sufficient in this rare case.

   Only two circuits SHOULD be linked together. However, implementations
   SHOULD make it easy for researchers to *test* more than two paths, as
   this has been shown to assist in traffic analysis resistance
   [WTF_SPLIT]. At minimum, this means not hardcoding only two circuits in
   the implementation.

   If the number of circuits exceeds the current number of guard relays,
   guard relays MAY be re-used, but implementations SHOULD use the same
   number of Guards as paths.

   Linked circuits MUST NOT be extended further once linked (ie:
   'cannibalization' is not supported).

2.1. Advertising support for conflux

   We propose a new protocol version in order to advertise support for
   circuit linking on the relay side:

     "Relay=4" -- Relay supports a 2-byte sequence number in the RELAY cell
                  header, used for multipath circuits which are linked with
                  the new RELAY_CIRCUIT_LINK relay cell command.

   XXX: Advertise this in the onion service descriptor.
   XXX: Can the onion service descriptor advertise more than two circuits?

   The next section describes how the circuits are linked together.

2.2. Linking circuits [LINKING_CIRCUITS]

   To link circuits, we propose new relay commands that are sent on both
   circuits, as well as a response to confirm the join, and an ack of this
   response. These commands create a 3-way handshake, which allows each
   endpoint to measure the initial RTT of each leg upon link, without
   needing to wait for any data.

   All three stages of this handshake are sent on *each* circuit leg to be
   linked.

   To save round trips, these cells SHOULD be combined with the initial
   RELAY_BEGIN cell on the faster circuit leg, using Proposal 325. See
   [LINKING_EXIT] and [LINKING_SERVICE] for more details on setup in each
   case.

   There are other ways to do this linking that we have considered, but
   they seem not to be significantly better than this method, especially
   since we can use Proposal 325 to eliminate the RTT cost of this setup
   before sending data. For those other ideas, see [ALTERNATIVE_LINKING]
   and [ALTERNATIVE_RTT] in the appendix.

   The first two parts of the handshake establish the link, and enable
   resumption:

     16 -- RELAY_CIRCUIT_LINK

           Sent from the OP to the exit/service, in order to link circuits
           together at the end point.

     17 -- RELAY_CIRCUIT_LINKED

           Sent from the exit/service to the OP, to confirm the circuits
           were linked.

   These cells have the following contents:

     VERSION   [1 byte]
     PAYLOAD   [variable, up to end of relay payload]

   The VERSION tells us which circuit linking mechanism to use. At this
   point in time, only 0x01 is recognized, and it is the one described by
   the Conflux design.

   For version 0x01, the PAYLOAD contains:

     NONCE             [32 bytes]
     LAST_SEQNO_SENT   [8 bytes]
     LAST_SEQNO_RECV   [8 bytes]

   XXX: Should we let endpoints specify their preferred [SCHEDULING] alg
   here, to override consensus params? This has benefits: eg low-memory
   mobile clients can ask for an alg that is better for their reorder
   queues. But it also has complexity risk, if the other endpoint does not
   want to support it, because of its own memory issues.

   The NONCE contains a random 256-bit secret, used to associate the two
   circuits together. The nonce must not be shared outside of the circuit
   transmission, or data may be injected into TCP streams. This means it
   MUST NOT be logged to disk.

   The two sequence number fields are 0 upon initial link, but are non-zero
   in the case of a resumption attempt (see [RESUMPTION]).
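
   For concreteness, here is a minimal sketch of building and parsing the
   version 0x01 payload (Python; the helper names and the use of network
   byte order for the sequence number fields are our assumptions, not part
   of this proposal):

     import os
     import struct

     CONFLUX_LINK_VERSION = 0x01

     def pack_link_payload(nonce, last_seqno_sent, last_seqno_recv):
         # VERSION byte, then the v0x01 payload: 32-byte NONCE and two
         # 8-byte absolute sequence numbers (assumed big-endian).
         assert len(nonce) == 32
         return struct.pack("!B32sQQ", CONFLUX_LINK_VERSION, nonce,
                            last_seqno_sent, last_seqno_recv)

     def unpack_link_payload(payload):
         version, nonce, sent, recv = struct.unpack_from("!B32sQQ", payload)
         if version != CONFLUX_LINK_VERSION:
             raise ValueError("unrecognized conflux link version")
         return nonce, sent, recv

     # Initial link: fresh nonce, both counters zero. A resumption attempt
     # re-sends the original nonce with the current counters.
     link_payload = pack_link_payload(os.urandom(32), 0, 0)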

   If either circuit does not receive a RELAY_CIRCUIT_LINKED response, both
   circuits MUST be closed.

   The third stage of the handshake exists to help the exit/service measure
   initial RTT, for use in [SCHEDULING]:

     18 -- RELAY_CIRCUIT_LINKED_RTT_ACK

           Sent from the OP to the exit/service, to provide an initial RTT
           measurement for the exit/service.

   For the timeout of the handshake, clients should use the normal
   SOCKS/stream timeout already in use for RELAY_BEGIN.

   These three relay commands (RELAY_CIRCUIT_LINK, RELAY_CIRCUIT_LINKED,
   and RELAY_CIRCUIT_LINKED_RTT_ACK) are sent on *each* leg, to allow each
   endpoint to measure the initial RTT of each leg.

2.3. Linking Circuits from OP to Exit [LINKING_EXIT]

   To link exit circuits, two circuits to the same exit are built. The
   client records the circuit build time of each.

   If the circuits are being built on-demand, for immediate use, the
   circuit with the lower build time SHOULD use Proposal 325 to append its
   first RELAY cell to the RELAY_CIRCUIT_LINK cell. The exit MUST respond
   on this same leg. After that, actual RTT measurements MUST be used to
   determine future transmissions, as specified in [SCHEDULING].

   The RTT times between RELAY_CIRCUIT_LINK and RELAY_CIRCUIT_LINKED are
   measured by the client, to determine each circuit's RTT, both to select
   the primary vs secondary circuit, and for packet scheduling. Similarly,
   the exit measures the RTT times between RELAY_CIRCUIT_LINKED and
   RELAY_CIRCUIT_LINKED_RTT_ACK, for the same purpose.
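
   To illustrate this bookkeeping, a hedged sketch follows (Python; the
   LegRTT class and the send_cell callback are hypothetical stand-ins for
   the implementation's actual cell handling):

     import time

     class LegRTT:
         # One instance per circuit leg.
         def __init__(self, send_cell):
             self.send_cell = send_cell
             self.link_sent_at = None
             self.rtt = None  # seconds; later refined via SENDME timing

         def send_link(self, payload):
             self.link_sent_at = time.monotonic()
             self.send_cell("RELAY_CIRCUIT_LINK", payload)

         def on_linked(self):
             # LINKED observed: one full round trip since LINK was sent.
             self.rtt = time.monotonic() - self.link_sent_at
             self.send_cell("RELAY_CIRCUIT_LINKED_RTT_ACK", b"")

     def choose_primary(leg_a, leg_b):
         # Lower initial RTT starts as primary; roles may swap later as
         # SENDME-based RTT estimates change.
         return (leg_a, leg_b) if leg_a.rtt <= leg_b.rtt else (leg_b, leg_a)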

2.4. Linking circuits to an onion service [LINKING_SERVICE]

   For onion services, we will only concern ourselves with linking
   rendezvous circuits.

   To join rendezvous circuits, clients make two introduce requests to a
   service's intropoint, causing it to create two rendezvous circuits, to
   meet the client at two separate rendezvous points. These introduce
   requests MUST be sent to the same intropoint (due to potential use of
   onionbalance), and SHOULD be sent back-to-back on the same intro
   circuit. They MAY be combined with Proposal 325.

   The first rendezvous circuit to get joined SHOULD use Proposal 325 to
   append the RELAY_BEGIN command, and the service MUST answer on this
   circuit, until RTT can be measured.

   Once both circuits are linked and RTT is measured, packet scheduling
   should be used, as per [SCHEDULING].

2.5. Congestion Control Application [CONGESTION_CONTROL]

   The SENDMEs for congestion control are performed per-leg. As data
   arrives, regardless of its ordering, it is counted towards SENDME
   delivery. In this way, 'cwnd - package_window' of each leg always
   reflects the available data to send on each leg. This is important for
   [SCHEDULING].

   The congestion control stream XON/XOFF can be sent on either leg, and
   applies to the stream's transmission on both legs.

2.6. Sequencing [SEQUENCING]

   With multiple paths for data, the problem of data re-ordering appears.
   In other words, cells can arrive out of order from the two circuits,
   such that cell N+1 arrives before cell N.

   Handling this reordering operates after congestion control for each
   circuit leg, but before relay cell command processing or stream data
   delivery.

   For the receiver to be able to reorder the received cells, a sequencing
   scheme needs to be implemented. However, because Tor does not drop or
   reorder packets inside of a circuit, this sequence number can be very
   small. It only has to signal that a cell comes after those arriving on
   another circuit.

   To achieve this, we add a small sequence number to the common relay
   header for all relay cells on linked circuits. This sequence number is
   meant to signal the number of cells sent on the *other* leg, so that
   each endpoint knows how many cells are still in-flight on another leg.
   It is different from the absolute sequence number used in
   [LINKING_CIRCUITS] and [RESUMPTION], but can be derived from that
   number, using relative arithmetic.

     Relay command   [1 byte]
     Recognized      [2 bytes]
     StreamID        [2 bytes]
     Digest          [4 bytes]
     Length          [2 bytes]
     > LongSeq       [1 bit]   # If this bit is set, use 31 bits for Seq
     > Sequencing    [7 or 31 bits]
     Data            [Remainder]

   The sequence number is only set for the first cell after the endpoint
   switches legs. In this case, LongSeq is set to 1, and the Sequencing
   field occupies the following 31 bits. Otherwise, these fields are a
   single zero byte.

   These fields MUST be present on ALL end-to-end relay cells on each leg
   that come from the endpoint, following a RELAY_CIRCUIT_LINK command.

   They are absent on 'leaky pipe' RELAY_COMMAND_DROP and
   RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, as
   opposed to the endpoint, to support padding.

   When an endpoint switches legs, on the first cell in a new leg, LongSeq
   is set to 1, and the following 31 bits represent the *total* number of
   cells sent on the *other* leg, before the switch. The receiver must
   wait for that number of cells to arrive from the previous leg before
   delivering that cell.

   XXX: In the rare event that we send more than 2^31 cells (~1TB) on a
   single leg, do we force a switch of legs, or expand the field further?
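
   A sketch of encoding and decoding this field follows (Python; placing
   LongSeq in the high bit of the first byte is our assumption about the
   exact wire layout):

     import struct

     def pack_seq_field(cells_sent_on_other_leg=None):
         # First cell after a leg switch: LongSeq=1 plus a 31-bit count
         # of cells sent on the other leg. All other cells: one zero byte.
         if cells_sent_on_other_leg is None:
             return b"\x00"
         assert cells_sent_on_other_leg < 2**31
         return struct.pack("!I", 0x80000000 | cells_sent_on_other_leg)

     def unpack_seq_field(data):
         # Returns (cells_to_wait_for, field_length). None means the
         # sender did not switch legs before this cell.
         if not (data[0] & 0x80):  # LongSeq bit clear: 1-byte field
             return None, 1
         (value,) = struct.unpack_from("!I", data)
         return value & 0x7FFFFFFF, 4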

   An alternative method of sequencing, which assumes that the endpoint
   knows it is going to switch one cell before it switches, is specified
   in [ALTERNATIVE_SEQUENCING]. Note that that method requires only 1 byte
   for sequence number and switch signaling, but requires that the sender
   know that it is planning to switch, one cell before it switches. (This
   is possible with [BLEST_TOR], but [LOWRTT_TOR] can switch based on RTT
   change, so it may be one cell late in that case.)

2.7. Resumption [RESUMPTION]

   In the event that a circuit leg is destroyed, it MAY be resumed.

   Resumption is achieved by re-using the NONCE and linking method to the
   same endpoint (either [LINKING_EXIT] or [LINKING_SERVICE]). The resumed
   path need not use the same middle and guard relays, but should not
   share any relays with any existing leg(s).

   To provide resumption, endpoints store an absolute 64-bit cell counter
   of the last cell they have sent on a conflux pair (their
   LAST_SEQNO_SENT), as well as the last sequence number they have
   delivered in-order to edge connections corresponding to a conflux pair
   (their LAST_SEQNO_RECV). Additionally, endpoints MAY store the entire
   contents of unacked inflight cells (ie: the 'package_window' from
   Proposal 324), for each leg, along with information corresponding to
   those cells' absolute sequence numbers.

   These 64-bit absolute counters can wrap without issue, as congestion
   windows will never grow to 2^64 cells until well past the Singularity.
   However, it is possible that extremely long, bulk circuits could exceed
   2^64 total sent or received cells, so endpoints SHOULD handle wrapped
   sequence numbers for purposes of computing retransmit information. (But
   even this case is unlikely to happen within the next decade or so.)

   Upon resumption, the LAST_SEQNO_SENT and LAST_SEQNO_RECV fields are
   used to convey the sequence numbers of the last cell the relay sent and
   received on that leg. The other endpoint can use these sequence numbers
   to determine whether it received the in-flight data or not, or sent
   more data since that point, up to and including this absolute sequence
   number. If the other endpoint has not received everything up to
   LAST_SEQNO_SENT, the endpoint MAY transmit the missing data, if it
   still has it buffered.

   Because both endpoints get information about the other side's absolute
   SENT sequence number, they will know exactly how many re-transmitted
   packets to expect, should the circuit stay open. Re-transmitters should
   not re-increment their absolute sent fields while re-transmitting.

   If an endpoint does not have this missing data due to memory pressure,
   that endpoint should destroy *both* legs, as this represents
   unrecoverable data loss.

   Otherwise, the new circuit can be re-joined, and its RTT can be
   compared to the remaining circuit to determine if the new leg is
   primary or secondary.

   It is even possible to resume conflux circuits where both legs have
   been collapsed using this scheme, if endpoints continue to buffer their
   unacked package_window data for some time after this close. However,
   see [TRAFFIC_ANALYSIS] for more details on the full scope of this
   issue.

   If endpoints are buffering package_window data, such data should be
   given priority to be freed in any oomkiller invocation. See
   [MEMORY_DOS] for more oomkiller information.
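
   To illustrate the retransmit computation above, a minimal sketch,
   assuming a map from absolute sequence numbers to buffered cells (all
   names are hypothetical):

     def cells_to_retransmit(my_last_sent, peer_last_recv, buffer):
         # On resumption, retransmit everything the peer has not yet
         # received, in absolute-sequence order. Counters are 64-bit and
         # may wrap, so modular distance is used.
         MOD = 2**64
         missing = (my_last_sent - peer_last_recv) % MOD
         seqs = [(peer_last_recv + i) % MOD for i in range(1, missing + 1)]
         if not all(s in buffer for s in seqs):
             return None  # unrecoverable loss: destroy *both* legs
         return [buffer[s] for s in seqs]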


3. Traffic Scheduling [SCHEDULING]

   In order to load balance the traffic between the two circuits, the
   original conflux paper used only RTT. However, with Proposal 324, we
   will have accurate information on the instantaneous available bandwidth
   of each circuit leg, as 'cwnd - package_window' (see Section 3 of
   Proposal 324).

   Some additional RTT optimizations are also useful, to improve
   responsiveness and minimize out-of-order queue sizes.

   We specify two traffic schedulers from the multipath literature and
   adapt them to Tor: [LOWRTT_TOR] and [BLEST_TOR]. [LOWRTT_TOR] also has
   three variants, with different trade-offs.

   However, see the [TRAFFIC_ANALYSIS] sections of this proposal for
   important details on how this selection can be changed, to reduce
   website traffic fingerprinting.

3.1. LowRTT Scheduling [LOWRTT_TOR]

   This scheduling algorithm is based on the original [CONFLUX] paper,
   with ideas from [MPTCP]'s minRTT/LowRTT scheduler.

   In this algorithm, endpoints send cells on the circuit with lower RTT
   (the primary circuit). This continues while the congestion window on
   that circuit has available room: ie whenever
   cwnd - package_window > 0.

   Whenever the primary circuit's congestion window becomes full, the
   secondary circuit is used. We stop reading on the send window source
   (edge connection) when both congestion windows become full.

   In this way, unlike original conflux, we switch to the secondary
   circuit without causing congestion on the primary circuit. This
   improves both load times, and overall throughput. This behavior matches
   minRTT from [MPTCP], sometimes called LowRTT.

   It may be better to stop reading on the edge connection when the
   primary congestion window becomes full, rather than switch to the
   secondary circuit as soon as the primary congestion window becomes full
   (ie: only switch if the RTTs themselves change which circuit is
   primary). This is what was done in the original Conflux paper. This
   behavior effectively causes us to optimize for responsiveness and
   congestion avoidance, rather than throughput. For evaluation, we should
   control this switching behavior with a consensus parameter (see
   [CONSENSUS]).

   Because of potential side channel risk (see [SIDE_CHANNELS]), a third
   variant of this algorithm, where the primary circuit is chosen during
   the [LINKING_CIRCUITS] handshake and never changed, should also be
   possible to control via consensus parameter.
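
   To make the base algorithm concrete, here is a sketch in the same
   pseudocode style as the BLEST listing below (the leg objects and their
   cwnd/package_window/currRTT fields are assumed, not defined by this
   proposal):

     # Hypothetical helper: order the two legs by current RTT estimate.
     def sort_by_rtt(leg_a, leg_b):
         return (leg_a, leg_b) if leg_a.currRTT <= leg_b.currRTT \
                               else (leg_b, leg_a)

     while source.has_data_to_send():
         primary, secondary = sort_by_rtt(leg_a, leg_b)

         if primary.cwnd - primary.package_window > 0:
             primary.send(source.get_packet())
         elif secondary.cwnd - secondary.package_window > 0:
             secondary.send(source.get_packet())
         else:
             break  # both windows full: stop reading the edge connection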

3.2. BLEST Scheduling [BLEST_TOR]

   [BLEST] attempts to predict the availability of the primary circuit,
   and uses this information to reorder transmitted data, to minimize
   head-of-line blocking in the recipient (and thus minimize out-of-order
   queues there).

   BLEST_TOR uses the primary circuit until the congestion window is full.
   Then, it uses the relative RTT times of the two circuits to calculate
   how much data can be sent on the secondary circuit faster than if we
   just waited for the primary circuit to become available.

   This is achieved by computing two variables at the sender:

     rtts = secondary.currRTT / primary.currRTT
     primary_limit = (primary.cwnd + (rtts-1)/2)*rtts

   Note: This (rtts-1)/2 factor represents anticipated congestion window
   growth over this period. It may be different for Tor, depending on the
   congestion control algorithm.

   If primary_limit < secondary.cwnd - (secondary.package_window + 1),
   then there is enough space on the secondary circuit to send data faster
   than we could by waiting for the primary circuit.

   XXX: Note that BLEST uses total_send_window where we use secondary.cwnd
   in this check. total_send_window is min(recv_win, CWND). But since Tor
   does not use receive windows and instead uses stream XON/XOFF, we only
   use CWND. There is some concern this may alter BLEST's buffer
   minimization properties, but since the receive window should only
   matter if the application is slower than Tor, and XON/XOFF should cover
   that case, hopefully this is fine.

   Otherwise, if the primary_limit condition is not hit, cease reading on
   source edge connections until SENDME acks come back.

   Here is the pseudocode for this:

     while source.has_data_to_send():
         if primary.cwnd > primary.package_window:
             primary.send(source.get_packet())
             continue

         rtts = secondary.currRTT / primary.currRTT
         primary_limit = (primary.cwnd + (rtts-1)/2)*rtts

         if primary_limit < secondary.cwnd - (secondary.package_window+1):
             secondary.send(source.get_packet())
         else:
             break  # done for now; wait for an ACK to free up CWND space

   Note that BLEST also has a parameter lambda that is updated whenever
   HoL blocking occurs. Because it is expensive and takes significant time
   to signal this over Tor, we omit this.

   XXX: See the [REORDER_SIGNALING] section if we want this lambda
   feedback.

3.3. Reorder queue signaling [REORDER_SIGNALING]

   Reordering is a fairly simple task. By using the sequence number field
   in [SEQUENCING], endpoints can know how many cells are still in flight
   on the other leg.

   To reorder cells properly, a buffer of out-of-order cells needs to be
   kept. On the Exit side, this can quickly become overwhelming,
   considering that tens of thousands of circuits can be held open,
   leading to gigabytes of memory being used. This is a clear potential
   memory DoS vector, which means that a Tor implementation should be able
   to limit the size of these queues (a sketch of such a bounded queue
   follows this section).

   Luckily, [BLEST_TOR] and the form of [LOWRTT_TOR] that only uses the
   primary circuit will minimize or eliminate this out-of-order buffer.

   XXX: The remainder of this section may be over-complicating things...
   We only need these concepts if we want to use BLEST's lambda feedback.

   The default for this queue size is governed by the
   'cflx_reorder_client' and 'cflx_reorder_srv' consensus parameters (see
   [CONSENSUS]). 'cflx_reorder_srv' applies to Exits and onion services.
   Both parameters can be overridden by Torrc, to larger or smaller than
   the consensus parameter. (Low memory clients may want to lower it;
   SecureDrop onion services or other high-upload services may want to
   raise it.)

   When the reorder queue hits this size, a RELAY_CONFLUX_XOFF is sent
   down the circuit leg that has data waiting in the queue, and use of
   that leg must cease, until it drains to half of this value, at which
   point a RELAY_CONFLUX_XON is sent. Note that this is different than the
   stream XON/XOFF from Proposal 324.

   XXX: [BLEST] actually does not cease use of a path in this case, but
   instead uses this signal to adjust the lambda parameter, which biases
   traffic away from that leg.


4. Security Considerations

4.1. Memory Denial of Service [MEMORY_DOS]

   Both reorder queues and retransmit buffers inherently represent a
   memory denial of service condition.

   For [RESUMPTION] retransmit buffers, endpoints that support this
   feature SHOULD free retransmit information as soon as they get close to
   memory pressure. This prevents resumption while data is in flight, but
   will not otherwise harm operation.

   For reorder buffers, adversaries can potentially attack this at any
   point, but most obviously and most severely from the client position.

   In particular, clients can lie about sequence numbers, sending cells
   with sequence numbers such that the next expected sequence number is
   never sent. They can do this repeatedly on many circuits, to exhaust
   memory at exits.

   One option is to only allow actual traffic splitting in the downstream
   direction, towards clients, and always use the primary circuit for
   everything in the upstream direction. However, the ability to support
   conflux from the client to the exit shows promise against traffic
   analysis (see [WTF_SPLIT]).

   The other option is to use [BLEST_TOR] from clients to exits, as it has
   predictable interleaved cell scheduling, and minimizes reorder queues
   at exits. If the ratios prescribed by that algorithm are not followed
   within some bounds, the other endpoint can close both circuits, and
   free the queue memory.

   This still leaves the possibility that intermediate relays may block a
   leg, allowing cells to traverse only one leg, thus still accumulating
   at the reorder queue. Clients can also spoof sequence numbers
   similarly, to make it appear that they are following [BLEST_TOR],
   without actually sending any data on one of the legs.

   To handle either of these cases, when a relay is under memory pressure,
   the circuit OOM killer SHOULD free and close circuits with the oldest
   reorder queue data first. This heuristic was shown to be best during
   the [SNIPER] attack OOM killer iteration cycle.
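
   A minimal sketch of this heuristic (Python; the per-circuit fields and
   method names are hypothetical):

     def oom_free_reorder_queues(circuits, bytes_to_free):
         # Close linked circuits whose reorder queues hold the *oldest*
         # data first, freeing their queue memory, until enough is freed.
         victims = sorted(
             (c for c in circuits if c.reorder_queue_bytes > 0),
             key=lambda c: c.oldest_queued_timestamp)  # oldest first
         freed = 0
         for circ in victims:
             if freed >= bytes_to_free:
                 break
             freed += circ.reorder_queue_bytes
             circ.close_both_legs()
         return freed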

4.2. Side Channels [SIDE_CHANNELS]

   Two potential side channels may be introduced by the use of Conflux:
     1. RTT leg-use bias by altering SENDME latency
     2. Location info leaks through the use of both legs' latencies

   For RTT and leg-use bias, Guard relays could delay legs to introduce a
   pattern into the delivery of cells at the exit relay, by varying the
   latency of SENDME cells (every 100th cell) to change the distribution
   of traffic, to send information. This attack could be performed in
   either direction of traffic, to bias traffic load off of a particular
   Guard. If an adversary controls both Guards, it could in theory send a
   binary signal more easily, by alternating delays on each.

   However, this risk must be weighed against the potential benefits
   against traffic fingerprinting, as per [WTF_SPLIT]. Additionally, even
   ignoring cryptographic tagging attacks, this side channel provides
   significantly less information over time than the inter-packet-delay
   based side channels that are already available to Guards and routers
   along the path to the Guard.

   Tor currently provides no defenses against already existing
   single-circuit delay-based side channels, though both circuit padding
   and [BACKLIT] are potential options it could conceivably deploy. The
   [BACKLIT] paper also has an excellent review of the various methods
   that have been studied for such single circuit side channels, and the
   [BACKLIT] style RTT monitoring could be used to protect against these
   conflux side channels as well. Circuit padding can also help to obscure
   which cells are SENDMEs, since circuit padding is not counted towards
   SENDME totals.

   The second class of side channel is where the Exit relay may be able to
   use the two legs to further infer more information about client
   location. See [LATENCY_LEAK] for more details. It is unclear at this
   time how much more severe this is for two paths than for just one.

   We should preserve the ability to disable conflux to and from Exit
   relays, should these side channels prove more severe, or should it
   prove possible to mitigate single-circuit side channels, but not
   conflux side channels.

   In all cases, all of these side channels appear less severe for onion
   service traffic, due to the higher path variability from relay
   selection, as well as the end-to-end nature of conflux in that case.
   This indicates that our ability to enable/disable conflux for services
   should be separate from that for Exits.

4.3. Traffic analysis [TRAFFIC_ANALYSIS]

   Even though conflux shows benefits against traffic analysis in
   [WTF_SPLIT], these gains may be moot if the adversary is able to
   perform packet counting and timing analysis at guards to guess which
   specific circuits are linked. In particular, the 3-way handshake in
   [LINKING_CIRCUITS] may be quite noticeable.

   As one countermeasure, it may be possible to eliminate the third stage
   of the handshake (RELAY_CIRCUIT_LINKED_RTT_ACK) by computing the
   exit/service RTT via measuring the time between CREATED/REND_JOINED and
   RELAY_CIRCUIT_LINK, but this will introduce cross-component complexity
   into Tor's protocol that could quickly become unwieldy and fragile.

   Additionally, the conflux handshake may make onion services stand out
   more, regardless of the number of stages in the handshake. For this
   reason, it may be wiser to simply address these issues with circuit
   padding machines during circuit setup (see padding-spec.txt).

   Additional traffic analysis considerations arise when combining conflux
   with padding, for purposes of mitigating traffic fingerprinting. For
   this, it seems wise to treat the packet schedulers as another piece of
   a combined optimization problem in tandem with optimizing padding
   machines, perhaps introducing randomness or fudge factors into their
   scheduling, as a parameterized distribution. For details, see
   https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md

   Finally, conflux may exacerbate forms of confirmation-based traffic
   analysis that close circuits to determine concretely whether they were
   in use, since closing either leg might cause resumption to fail. TCP
   RST injection can perform this attack from the side, without full
   surveillance capability. [RESUMPTION] with buffering of the inflight
   unacked package_window data, for retransmit, is a partial mitigation,
   if endpoints buffer this data for retransmission for a brief time even
   if both legs close. This seems more feasible for onion services, which
   are more vulnerable to this attack. However, if the adversary controls
   the client, they will notice the resumption re-link, and still obtain
   confirmation that way.

   It seems the only way to fully mitigate these kinds of attacks is with
   the Snowflake pluggable transport, which provides its own resumption
   and retransmit behavior. Additionally, Snowflake's use of UDP DTLS also
   protects against TCP RST injection, which we suspect to be the main
   vector for such attacks.

   In the future, a DTLS or QUIC transport for Tor, such as MASQUE, could
   provide similar RST injection resistance, and resumption at
   Guard/Bridge nodes, as well.


5. System Interactions

   - Congestion control
   - EWMA and KIST
   - CBT and number of guards
   - Onion service circuit obfuscation
   - Future UDP (it may increase the need for UDP to buffer before
     dropping)
   - Padding (no sequence numbers on padding cells, as per [SEQUENCING])
     - Also, any padding machines may need re-tuning
   - No 'cannibalization' of linked circuits


6. Consensus and Torrc Parameters [CONSENSUS]

   - conflux_circs
     - Number of conflux circuits

   - conflux_sched_exits, conflux_sched_clients, conflux_sched_service
     - Three forms of LOWRTT_TOR, and BLEST_TOR

   - ConfluxOnionService
   - ConfluxOnionCircs


7. Tuning Experiments [EXPERIMENTS]

   - conflux_sched & conflux_exits
   - Exit reorder queue size
   - Responsiveness vs throughput tradeoff?
   - Congestion control
   - EWMA and KIST
   - Number of guards & conflux_circs


Appendix A [ALTERNATIVES]

A.1. BEGIN/END sequencing [ALTERNATIVE_SEQUENCING]

   In this method of signaling, we increment the sequence number by 1 only
   when we switch legs, and use BEGIN/END "bookends" to know when all data
   on a leg has been received.

   To achieve this, we add a small sequence number to the common relay
   header for all relay cells on linked circuits, as well as a field to
   signal the beginning of a sequence, intermediate data, and the end of a
   sequence.

     Relay command   [1 byte]
     Recognized      [2 bytes]
     StreamID        [2 bytes]
     Digest          [4 bytes]
     Length          [2 bytes]
     > Switching     [2 bits]   # 01 = BEGIN, 00 = CONTINUE, 10 = END
     > Sequencing    [6 bits]
     Data            [PAYLOAD_LEN - 12 - Length bytes]

   These fields MUST be present on ALL end-to-end relay cells on each leg
   that come from the endpoint, following a RELAY_CIRCUIT_LINK command.

   They are absent on 'leaky pipe' RELAY_COMMAND_DROP and
   RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, as
   opposed to the endpoint, to support padding.

   Sequence numbers are incremented by one when an endpoint switches legs
   to transmit a cell. This number will wrap; implementations should treat
   0 as the next sequence after 2^6-1. Because we do not expect to support
   significantly more than 2 legs, and far fewer than 63, this is not an
   issue.

   The first cell on a new circuit MUST use the BEGIN code for Switching.
   Cells are delivered from that circuit until an END Switching signal is
   received, even if cells with the next sequence number arrive first on
   another circuit, before the END Switching field. Recipients MUST only
   deliver cells with a BEGIN if their Sequencing number is one more than
   the last END.
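
   A sketch of packing this one-byte field (Python; placing the 2-bit
   Switching code in the top bits is our assumption about the exact
   layout):

     BEGIN, CONTINUE, END = 0b01, 0b00, 0b10

     def pack_switching(code, seq):
         # 2-bit Switching code in the high bits, 6-bit Sequencing below.
         assert code in (BEGIN, CONTINUE, END) and 0 <= seq < 64
         return bytes([(code << 6) | seq])

     def unpack_switching(byte):
         return byte >> 6, byte & 0x3F

     def next_seq(seq):
         return (seq + 1) % 64  # wraps: 0 follows 2^6 - 1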

A.2. Alternative Link Handshake [ALTERNATIVE_LINKING]

   The circuit linking in [LINKING_CIRCUITS] could be done as encrypted
   ntor onionskin extension fields, similar to those used by v3 onions.

   This approach has at least four problems:
     i).   For onion services, since the onionskins traverse the intro
           circuit and then return on the rend circuit, this handshake
           cannot measure RTT there.
     ii).  Since these onionskins are larger, and have no PFS, an
           adversary at the middle relay knows that the onionskin is for
           linking, and can potentially try to obtain the onionskin key
           for attacks on the link.
     iii). It makes linking circuits more fragile, since they could time
           out due to CBT, or other issues during construction.
     iv).  The overhead of processing this onionskin through onionskin
           queues adds additional time for linking, even in the Exit case,
           making that RTT potentially noisy.

   Additionally, it is not clear that this approach actually saves us
   anything in terms of setup time, because we can optimize away the
   linking phase using Proposal 325, to combine initial RELAY_BEGIN cells
   with RELAY_CIRCUIT_LINK.

A.3. Alternative RTT measurement [ALTERNATIVE_RTT]

   Instead of measuring RTTs during [LINKING_CIRCUITS], we could create
   PING/PONG cells, whose sole purpose is to allow endpoints to measure
   RTT.

   This was rejected for several reasons. First, during circuit use, we
   already have SENDMEs to measure RTT. Every 100 cells (or
   'circwindow_inc' from Proposal 324), we are able to re-measure RTT,
   based on the time between that Nth cell and the SENDME ack. So we only
   need PING/PONG to measure initial circuit RTT.

   If we were able to use onionskins, as per [ALTERNATIVE_LINKING] above,
   we might be able to specify a PING/PONG/PING handshake solely for
   measuring initial RTT, especially for onion service circuits.

   The reason for not making a dedicated PING/PONG for this purpose is
   that it is context-free. Even if we were able to use onionskins for
   linking and resumption, to avoid additional data in the handshake that
   just measures RTT, we would have to enforce that this PING/PONG/PING
   only follows the exact form needed by this proposal, at the expected
   time, and at no other points.

   If we do not enforce this specific use of PING/PONG/PING, it becomes
   another potential side channel, for use in attacks such as [DROPMARK].

   In general, Tor is planning to remove current forms of context-free and
   semantic-free cells from its protocol:
     https://gitlab.torproject.org/tpo/core/torspec/-/issues/39

   We should not add more.


Appendix B: Acknowledgments

   Thanks to Per Hurtig for helping us with the framing of the MPTCP
   problem space.

   Thanks to Simone Ferlin for clarifications on the [BLEST] paper, and
   for pointing us at the Linux kernel implementation.

   Extreme thanks goes again to Toke Høiland-Jørgensen, who helped
   immensely towards our understanding of how the BLEST condition relates
   to edge connection pushback, and for clearing up many other
   misconceptions we had.

   Finally, thanks to Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian
   Goldberg, for the original [CONFLUX] paper!


References:

[CONFLUX]
   https://freehaven.net/anonbib/papers/pets2013/paper_65.pdf

[MPTCP]
   https://datatracker.ietf.org/doc/html/rfc8684

[BLEST]
   https://olivier.mehani.name/publications/2016ferlin_blest_blocking_estimation_mptcp_scheduler.pdf
   https://opus.lib.uts.edu.au/bitstream/10453/140571/2/08636963.pdf
   https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.95/net/mptcp/mptcp_blest.c

[WTF_SPLIT]
   https://www.comsys.rwth-aachen.de/fileadmin/papers/2020/2020-delacadena-trafficsliver.pdf

[COUPLED]
   https://datatracker.ietf.org/doc/html/rfc6356
   https://www.researchgate.net/profile/Xiaoming_Fu2/publication/230888515_Delay-based_Congestion_Control_for_Multipath_TCP/links/54abb13f0cf2ce2df668ee4e.pdf?disableCoverPage=true
   http://staff.ustc.edu.cn/~kpxue/paper/ToN-wwj-2020.04.pdf
   https://www.thinkmind.org/articles/icn_2019_2_10_30024.pdf
   https://arxiv.org/pdf/1308.3119.pdf

[BACKLIT]
   https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

[LATENCY_LEAK]
   https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
   https://www.robgjansen.com/publications/howlow-pets2013.pdf

[SNIPER]
   https://www.freehaven.net/anonbib/cache/sniper14.pdf

[DROPMARK]
   https://www.petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf