Filename: 329-traffic-splitting.txt
Title: Overcoming Tor's Bottlenecks with Traffic Splitting
Author: David Goulet, Mike Perry
Created: 2020-11-25
Status: Draft

0. Status

  This proposal describes the Conflux [CONFLUX] system developed by
  Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. It aims at
  improving Tor client network performance by dynamically splitting
  traffic between two circuits.


1. Overview

1.1. Multipath TCP Design Space

  In order to understand our improvements to Conflux, it is important to
  properly conceptualize what is involved in the design of multipath
  algorithms in general.

  The design space is broken into two orthogonal parts: congestion control
  algorithms that apply to each path, and traffic scheduling algorithms
  that decide which packets to send on each path.

  MPTCP specifies 'coupled' congestion control (see [COUPLED]). Coupled
  congestion control updates single-path congestion control algorithms to
  account for shared bottlenecks between the paths, so that the combined
  congestion control algorithms do not overwhelm any bottlenecks that
  happen to be shared between the multiple paths. Various ways of
  accomplishing this have been proposed and implemented in the Linux
  kernel.

  Because Tor's congestion control only concerns itself with bottlenecks in
  Tor relay queues, and not with any other bottlenecks (such as
  intermediate Internet routers), we can avoid this complexity merely by
  specifying that any paths that are constructed SHOULD NOT share any
  relays. In this way, we can proceed to use the exact same congestion
  control as specified in Proposal 324, for each path.

  For this reason, this proposal will focus on the traffic scheduling
  algorithms, rather than coupling. We propose three candidate algorithms
  that have been studied in the literature, and will compare their
  performance using simulation and consensus parameters.

1.2. Divergence from the initial Conflux design

  The initial [CONFLUX] paper does not provide any guidance on how to
  handle the size of the out-of-order cell queue, which we consider a
  potentially dangerous memory DoS vector (see [MEMORY_DOS]). It also
  used RTT as the sole heuristic for selecting which circuit to send on.
  RTT varies with the geographical locations of the participating relays,
  and does not reflect their actual available circuit capacity (which
  will be available to us via Proposal 324). Additionally, since the
  publication of [CONFLUX], more modern packet scheduling algorithms have
  been developed, which aim to reduce out-of-order queue size.

  We propose mitigations for these issues using modern scheduling
  algorithms, as well as implementation options for avoiding the
  out-of-order queue at Exit relays. Additionally, we consider resumption,
  side channel, and traffic analysis risks and benefits in [RESUMPTION],
  [SIDE_CHANNELS] and [TRAFFIC_ANALYSIS].


2. Design

  The following section describes the Conflux design. Each sub-section is
  a building block of the multipath design that Conflux proposes.

  The circuit construction is as follows:

         Primary Circuit (lower RTT)
            +-------+      +--------+
            |Guard 1|----->|Middle 1|----------+
            +---^---+      +--------+          |
   +-----+      |                           +--v---+
   | OP  +------+                           | Exit |--> ...
   +-----+      |                           +--^---+
            +---v---+      +--------+          |
            |Guard 2|----->|Middle 2|----------+
            +-------+      +--------+
         Secondary Circuit (higher RTT)

  Both circuits are built using current Tor path selection, however they
  SHOULD NOT share the same Guard relay, or middle relay. By avoiding
  using the same relays in these positions in the path, we ensure
  additional path capacity, and eliminate the need to use more complicated
  'coupled' congestion control algorithms from the MPTCP
  literature[COUPLED].  This both simplifies design, and improves
  performance.

  Then, the OP needs to link the two circuits together, as described in
  [LINKING_CIRCUITS], [LINKING_EXIT], and [LINKING_SERVICE].

  For ease of explanation, the primary circuit is the circuit with lower
  RTT, and the secondary circuit is the circuit with higher RTT. Initial
  RTT is measured during circuit linking, as described in
  [LINKING_CIRCUITS].  RTT is continually measured using SENDME timing, as
  in Proposal 324.  This means that during use, the primary circuit and
  secondary circuit may switch roles, depending on unrelated network
  congestion caused by other Tor clients.

  We also support linking onion service circuits together. In this case,
  only two rendezvous circuits are linked. Each of these RP circuits will
  be constructed separately, and then linked. However, the same path
  constraints apply to each half of the circuits (no shared relays between
  the legs). If, by chance, the service and the client sides end up
  sharing some relays, this is not catastrophic. Multipath TCP researchers
  we have consulted (see [ACKNOWLEDGEMENTS]), believe Tor's congestion
  control from Proposal 324 to be sufficient in this rare case.

  Only two circuits SHOULD be linked together. However, implementations
  SHOULD make it easy for researchers to *test* more than two paths, as
  this has been shown to assist in traffic analysis resistance[WTF_SPLIT].
  At minimum, this means not hardcoding only two circuits in the
  implementation.

  If the number of circuits exceeds the current number of guard relays,
  guard relays MAY be re-used, but implementations SHOULD use the same
  number of Guards as paths.

  Linked circuits MUST NOT be extended further once linked (ie:
  'cannibalization' is not supported).

2.1. Advertising support for conflux

2.1.1 Relay

  We propose a new protocol version in order to advertise support for
  circuit linking on the relay side:

     "Relay=5" -- Relay supports Conflux as in linking circuits together using
                  the new LINK, LINKED and SWITCH relay command.

2.1.2 Onion Service

  We propose to add a new line in order to advertise conflux support in the
  onion service descriptor:

    "conflux" SP max-num-circ NL

      The "max-num-circ" value indicate the maximum number of rendezvous
      circuits that are allowed to be linked together.
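
  For example, a service allowing at most two linked rendezvous circuits
  would include the line:

     conflux 2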

  XXX: We should let the service specify the conflux algorithm to use.
  Some services may prefer latency (LowRTT), whereas some may prefer
  throughput (BLEST).

  The next section describes how the circuits are linked together.

2.2. Linking circuits [LINKING_CIRCUITS]

  To link circuits, we propose new relay commands that are sent on both
  circuits, as well as a response to confirm the join, and an ack of this
  response. These commands create a 3way handshake, which allows each
  endpoint to measure the initial RTT of each leg upon link, without
  needing to wait for any data.

  All three stages of this handshake are sent on *each* circuit leg to be
  linked.

  When packed cells are a reality (proposal 340), these cells SHOULD be
  combined with the initial RELAY_BEGIN cell on the faster circuit leg. See
  [LINKING_EXIT] and [LINKING_SERVICE] for more details on setup in each case.

  There are other ways to do this linking that we have considered, but they
  seem not to be significantly better than this method, especially since we can
  use Proposal 340 to eliminate the RTT cost of this setup before sending data.
  For those other ideas, see [ALTERNATIVE_LINKING] and [ALTERNATIVE_RTT], in
  the appendix.

  The first two parts of the handshake establish the link, and enable
  resumption:

    19 -- RELAY_CONFLUX_LINK

          Sent from the OP to the exit/service in order to link circuits
          together at the end point.

    20 -- RELAY_CONFLUX_LINKED

          Sent from the exit/service to the OP, to confirm the circuits were
          linked.

  These cells have the following contents:

    VERSION   [1 byte]
    PAYLOAD   [variable, up to end of relay payload]

  The VERSION tells us which circuit linking mechanism to use. At this
  point in time, only 0x01 is recognized and is the one described by the
  Conflux design.

  For version 0x01, the PAYLOAD contains:

     NONCE              [32 bytes]
     LAST_SEQNO_SENT    [8 bytes]
     LAST_SEQNO_RECV    [8 bytes]
     ALGORITHM          [1 byte]

  XXX: Should we let endpoints specify their preferred [SCHEDULING] alg
  here, to override consensus params? This has benefits: eg low-memory
  mobile clients can ask for an alg that is better for their reorder
  queues. But it also has complexity risk, if the other endpoint does not
  want to support it, because of its own memory issues.
    - YES. At least for Exit circuits, we *will* want to let clients
      request LowRTT or BLEST/CWND scheduling. So we need an algorithm
      field here.
    - XXX: We need to define rules for negotiation then, for onions and
      exits vs consensus.

  The NONCE contains a random 256-bit secret, used to associate the two
  circuits together. The nonce MUST NOT be shared outside of the circuit
  transmission, or data may be injected into TCP streams. This means it
  MUST NOT be logged to disk.

  The two sequence number fields are 0 upon initial link, but non-zero in
  the case of a resumption attempt (See [RESUMPTION]).
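
  As a non-normative illustration, the version 0x01 payload could be
  packed as follows. This sketch assumes network (big-endian) byte order
  for the multi-byte fields, as is conventional for Tor relay cells:

    import os
    import struct

    def pack_conflux_link_v1(last_seqno_sent=0, last_seqno_recv=0,
                             algorithm=0, nonce=None):
        # VERSION (1) || NONCE (32) || LAST_SEQNO_SENT (8) ||
        # LAST_SEQNO_RECV (8) || ALGORITHM (1) == 50 bytes total
        if nonce is None:
            nonce = os.urandom(32)  # random 256-bit link secret
        return struct.pack("!B32sQQB", 0x01, nonce,
                           last_seqno_sent, last_seqno_recv, algorithm)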

  If either circuit does not receive a RELAY_CONFLUX_LINKED response, both
  circuits MUST be closed.

  The third stage of the handshake exists to help the exit/service measure
  initial RTT, for use in [SCHEDULING]:

    21 -- RELAY_CONFLUX_LINKED_ACK

          Sent from the OP to the exit/service, to provide initial RTT
          measurement for the exit/service.

  For timeout of the handshake, clients SHOULD use the normal SOCKS/stream
  timeout already in use for RELAY_BEGIN.

  These three relay commands are sent on *each* leg, to allow each endpoint to
  measure the initial RTT of each leg.

  The circuit SHOULD be closed if at least one of these conditions is met:

    - Once a LINK is received, if the next cell relay command is not a
      LINKED_ACK, unless the command is in a packed cell.
    - Once a LINKED_ACK is received, receiving any other command than these:
        * BEGIN, DATA, END, CONNECTED, RESOLVE, RESOLVED, XON, XOFF, SWITCH
    - Receiving a LINKED without a LINK.
    - Receiving a LINKED_ACK without having sent a LINKED.

    XXX Must define our LINK rate limiting parameters.

2.3. Linking Circuits from OP to Exit [LINKING_EXIT]

  To link exit circuits, two circuits to the same exit are built. The
  client records the circuit build time of each.

  If the circuits are being built on-demand, for immediate use, the client
  SHOULD use Proposal 340 to append its first RELAY cell to the
  RELAY_CONFLUX_LINK cell on the circuit with the lower circuit build time.
  The exit MUST respond on this same leg. After that, actual RTT
  measurements MUST be used to determine future transmissions, as specified
  in [SCHEDULING].

  The client measures the RTT between RELAY_CONFLUX_LINK and
  RELAY_CONFLUX_LINKED on each leg, to determine each circuit's RTT, which
  is used to choose the primary vs secondary circuit, and for packet
  scheduling. Similarly, the exit measures the RTT between
  RELAY_CONFLUX_LINKED and RELAY_CONFLUX_LINKED_ACK, for the same purpose.
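
  For illustration only, a client-side sketch of this measurement (the
  helper names here are hypothetical, and a monotonic clock is assumed):

    import time

    def send_link(leg):
        # Record when the LINK cell leaves on this leg.
        leg.link_sent_at = time.monotonic()
        leg.send_cell("RELAY_CONFLUX_LINK")

    def on_linked_received(leg):
        # RTT for this leg is LINK -> LINKED; the LINKED_ACK we send back
        # lets the exit/service measure its own RTT for the same leg.
        leg.init_rtt = time.monotonic() - leg.link_sent_at
        leg.send_cell("RELAY_CONFLUX_LINKED_ACK")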

2.4. Linking circuits to an onion service [LINKING_SERVICE]

  For onion services, we will only concern ourselves with linking
  rendezvous circuits.

  To join rendezvous circuits, clients make two introduce requests to a
  service's intropoint, causing it to create two rendezvous circuits, to
  meet the client at two separate rendezvous points. These introduce
  requests MUST be sent to the same intropoint (due to potential use of
  onionbalance), and SHOULD be sent back-to-back on the same intro
  circuit. They MAY be combined with Proposal 340.

  The first rendezvous circuit to get joined SHOULD use Proposal 340 to
  append the RELAY_BEGIN command, and the service MUST answer on this
  circuit, until RTT can be measured.

  Once both circuits are linked and RTT is measured, packet scheduling
  MUST be used, as per [SCHEDULING].

2.5. Congestion Control Application [CONGESTION_CONTROL]

  The SENDMEs for congestion control are performed per-leg. As data
  arrives, regardless of its ordering, it is counted towards SENDME
  delivery. In this way, 'cwnd - package_window' of each leg always
  reflects the room available to send data on that leg. This is important
  for [SCHEDULING].

  The congestion control stream-level XON/XOFF can be sent on either leg,
  and applies to the stream's transmission on both legs.
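
  For example, the per-leg quantity that the schedulers in [SCHEDULING]
  consult is simply (field names as in Proposal 324):

    def available_send_space(leg):
        # Cells this leg can still send before its congestion window is
        # full; SENDMEs are counted per leg, so package_window tracks
        # only that leg's in-flight cells.
        return leg.cwnd - leg.package_window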

2.6. Sequencing [SEQUENCING]

  With multiple paths for data, the problem of data re-ordering appears.
  In other words, cells can arrive out of order from the two circuits,
  where cell N+1 arrives before cell N.

  Handling this reordering operates after congestion control for each
  circuit leg, but before relay cell command processing or stream data
  delivery.

  For the receiver to be able to reorder the received cells, a sequencing
  scheme needs to be implemented. However, because Tor does not drop or
  reorder packets inside of a circuit, this sequence number can be very
  small. It only has to signal that a cell comes after those arriving on
  another circuit.
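
  A minimal receiver-side sketch of in-order delivery (hypothetical names;
  it assumes the absolute sequence number of each cell has already been
  reconstructed using the switch mechanism described below):

    import heapq

    class ReorderBuffer:
        def __init__(self):
            self.next_seq = 1   # next absolute sequence number to deliver
            self.pending = []   # out-of-order cells, keyed by sequence number

        def receive(self, seq, cell, deliver):
            heapq.heappush(self.pending, (seq, cell))
            # Deliver any contiguous run starting at next_seq, in order.
            while self.pending and self.pending[0][0] == self.next_seq:
                _, ready = heapq.heappop(self.pending)
                deliver(ready)
                self.next_seq += 1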

  To achieve this, we propose a new relay command used to indicate a switch to
  another leg:

    22 -- RELAY_CONFLUX_SWITCH

          Sent from the client to the exit/service when switching legs in an
          already-linked circuit construction.

  The cell payload format is:

    SeqNum  [4 bytes]

  The "SeqNum" value is a relative sequence number, which is the difference
  between the last absolute sequence number sent on the new leg and the last
  absolute sequence number sent on all other legs prior to the switch. In this
  way, the endpoint knows what to increment its local absolute sequence number
  by, before cells start to arrive.

  To achieve this, the sender must maintain the last absolute sequence sent for
  each leg, and the receiver must maintain the last absolute sequence number
  received for each leg.

  As an example, let's say we send 10 cells on the first leg, so our absolute
  sequence number is 10. If we then switch to the second leg, it is trivial to
  see that we should send a SWITCH with 10 as the relative sequence number, to
  indicate that regardless of the order in which the first cells are received,
  subsequent cells on the second leg should start counting at 10.

  However, if we then send 21 cells on this leg, our local absolute sequence
  number as the sender is 31. So when we switch back to the first leg, where
  the last absolute sequence sent was 10, we must send a SWITCH cell with 21,
  so that when the first leg receives subsequent cells, it assigns those cells
  an absolute sequence number starting at 31.

  In the rare event that we send more than 2^31 cells (~1TB) on a single leg,
  the leg should be switched in order to reset that relative sequence number to
  fit within 4 bytes.
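
  A sender-side sketch of this bookkeeping (hypothetical names), matching
  the example above:

    def send_on(leg, cell, state):
        state.absolute_sent += 1            # cells sent across all legs
        leg.last_abs_sent = state.absolute_sent
        leg.send(cell)

    def switch_to(new_leg, state):
        # Relative SeqNum = cells sent on other legs since new_leg was
        # last used: 10 for the first switch above, then 21 when switching
        # back after 21 further cells.
        relative = state.absolute_sent - new_leg.last_abs_sent
        new_leg.send_switch(seqnum=relative)
        new_leg.last_abs_sent = state.absolute_sent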

  In order to rate limit the use of SWITCH to prevent its use as a DropMark
  side channel, the circuit SHOULD be closed if at least one of these
  conditions is met:

    - The SeqNum value is below the "cc_sendme_inc" which is currently set
      at 31.
    - If immediately after receiving a SWITCH, another one is received.

      XXX: We should define our rate limiting.

    - If we are NOT an exit circuit.
    - If the SeqNum makes our absolute sequence number overflow.

2.7. Resumption [RESUMPTION]

  In the event that a circuit leg is destroyed, it MAY be resumed.

  Resumption is achieved by re-using the NONCE to the same endpoint
  (either [LINKING_EXIT] or [LINKING_SERVICE]). The resumed path need
  not use the same middle and guard relays as the destroyed leg(s), but
  SHOULD NOT share any relays with any existing legs(s).

  To provide resumption, endpoints store an absolute 64bit cell counter of
  the last cell they have sent on a conflux pair (their LAST_SEQNO_SENT),
  as well as the last sequence number they have delivered in-order to edge
  connections corresponding to a conflux pair (their LAST_SEQNO_RECV).
  Additionally, endpoints MAY store the entire contents of unacked
  inflight cells (ie the 'package_window' from proposal 324), for each
  leg, along with information corresponding to those cells' absolute
  sequence numbers.

  These 64 bit absolute counters can wrap without issue, as congestion
  windows will never grow to 2^64 cells until well past the Singularity.
  However, it is possible that extremely long, bulk circuits could exceed
  2^64 total sent or received cells, so endpoints SHOULD handle wrapped
  sequence numbers for purposes of computing retransmit information. (But
  even this case is unlikely to happen within the next decade or so).

  Upon resumption, the LAST_SEQNO_SENT and LAST_SEQNO_RECV fields are used
  to convey the sequence numbers of the last cell the relay sent and
  received on that leg. The other endpoint can use these sequence numbers
  to determine whether it received all in-flight data up to and including
  that absolute sequence number, or whether it has sent more data since
  that point. If LAST_SEQNO_SENT has not been received, the endpoint MAY
  transmit the missing data, if it still has it buffered.

  Because both endpoints get information about the other side's absolute
  SENT sequence number, they will know exactly how many re-transmitted
  packets to expect, if the circuit is successfully resumed.
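
  A sketch of how an endpoint might act on these fields when handling a
  resumption attempt (hypothetical names):

    def cells_to_retransmit(my_last_seqno_sent, peer_last_seqno_recv,
                            retransmit_buffer):
        # Cells the peer never delivered: everything we sent after the
        # peer's LAST_SEQNO_RECV, up to our own LAST_SEQNO_SENT.
        missing = my_last_seqno_sent - peer_last_seqno_recv
        if missing > len(retransmit_buffer):
            return None  # data already freed; MUST destroy both legs
        return retransmit_buffer[len(retransmit_buffer) - missing:]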

  Re-transmitters MUST NOT re-increment their absolute sent fields
  while re-transmitting.

  If the endpoint no longer has this missing data, due to memory
  pressure, it MUST destroy *both* legs, as this represents unrecoverable
  data loss.

  Otherwise, the new circuit can be re-joined, and its RTT can be compared
  to the remaining circuit to determine if the new leg is primary or
  secondary.

  It is even possible to resume conflux circuits where both legs have been
  collapsed using this scheme, if endpoints continue to buffer their
  unacked package_window data for some time after this close. However, see
  [TRAFFIC_ANALYSIS] for more details on the full scope of this issue.

  If endpoints are buffering package_window data, such data should be
  given priority to be freed in any oomkiller invocation. See [MEMORY_DOS]
  for more oomkiller information.


3. Traffic Scheduling [SCHEDULING]

  In order to load balance the traffic between the two circuits, the
  original conflux paper used only RTT. However, with Proposal 324, we
  will have accurate information on the instantaneous available bandwidth
  of each circuit leg, as 'cwnd - package_window' (see Section 3 of
  Proposal 324).

  Some additional RTT optimizations are also useful, to improve
  responsiveness and minimize out-of-order queue sizes.

  We specify two traffic schedulers from the multipath literature and
  adapt them to Tor: [LOWRTT_TOR] and [BLEST_TOR]. [LOWRTT_TOR] also has
  three variants, with different trade offs.

  However, see the [TRAFFIC_ANALYSIS] sections of this proposal for
  important details on how this selection can be changed, to reduce
  website traffic fingerprinting.

3.1. LowRTT Scheduling [LOWRTT_TOR]

  This scheduling algorithm is based on the original [CONFLUX] paper, with
  ideas from [MPTCP]'s minRTT/LowRTT scheduler.

  In this algorithm, endpoints send cells on the circuit with lower RTT
  (primary circuit). This continues while the congestion window on the
  circuit has available room: ie whenever cwnd - package_window > 0.

  Whenever the primary circuit's congestion window becomes full, the
  secondary circuit is used. We stop reading on the send window source
  (edge connection) when both congestion windows become full.

  In this way, unlike original conflux, we switch to the secondary circuit
  without causing congestion on the primary circuit. This improves both
  load times, and overall throughput.

  This behavior matches minRTT from [MPTCP], sometimes called LowRTT.
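
  In the style of the pseudocode in [BLEST_TOR] below, this variant can be
  sketched as (sort_by_rtt is a hypothetical helper):

    while source.has_data_to_send():
      # Always prefer the leg with the lower current RTT, if it has room.
      primary, secondary = sort_by_rtt(leg1, leg2)

      if primary.cwnd - primary.package_window > 0:
        primary.send(source.get_packet())
      elif secondary.cwnd - secondary.package_window > 0:
        secondary.send(source.get_packet())
      else:
        break # both congestion windows full; stop reading the edge
              # connection until SENDMEs free up window space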

  It may be better to stop reading on the edge connection when the primary
  congestion window becomes full, rather than switch to the secondary
  circuit as soon as the primary congestion window becomes full. (Ie: only
  switch if the RTTs themselves change which circuit is primary). This is
  what was done in the original Conflux paper. This behavior effectively
  causes us to optimize for responsiveness and congestion avoidance,
  rather than throughput. For evaluation, we will control this switching
  behavior with a consensus parameter (see [CONSENSUS]).

  Because of potential side channel risk (see [SIDE_CHANNELS]), a third
  variant of this algorithm, where the primary circuit is chosen during
  the [LINKING_CIRCUITS] handshake and never changed, is also possible
  to control via consensus parameter.

3.2. BLEST Scheduling [BLEST_TOR]

  [BLEST] attempts to predict the availability of the primary circuit, and
  use this information to reorder transmitted data, to minimize
  head-of-line blocking in the recipient (and thus minimize out-of-order
  queues there).

  BLEST_TOR uses the primary circuit until the congestion window is full.
  Then, it uses the relative RTT times of the two circuits to calculate
  how much data can be sent on the secondary circuit faster than if we
  just waited for the primary circuit to become available.

  This is achieved by computing two variables at the sender:

    rtts = secondary.currRTT / primary.currRTT
    primary_limit = (primary.cwnd + (rtts-1)/2)*rtts

  Note: This (rtts-1)/2 factor represents anticipated congestion window
  growth over this period. It may be different for Tor, depending on the
  congestion control algorithm.

  If primary_limit < secondary.cwnd - (secondary.package_window + 1), then
  there is enough space on the secondary circuit to send data faster than
  we could by waiting for the primary circuit.

  XXX: Note that BLEST uses total_send_window where we use secondary.cwnd
  in this check. total_send_window is min(recv_win, CWND). But since Tor
  does not use receive windows and instead uses stream XON/XOFF, we only
  use CWND. There is some concern this may alter BLEST's buffer
  minimization properties, but since receive windows only matter if
  the application is slower than Tor, and XON/XOFF will cover that case,
  hopefully this is fine. If we need to, we could turn [REORDER_SIGNALING]
  into a receive window indication of some kind, to indicate remaining
  buffer size.

  Otherwise, if the primary_limit condition is not hit, cease reading on
  source edge connections until SENDME acks come back.

  Here is the pseudocode for this:

    while source.has_data_to_send():
      if primary.cwnd > primary.package_window:
        primary.send(source.get_packet())
        continue

      rtts = secondary.currRTT / primary.currRTT
      primary_limit = (primary.cwnd + (rtts-1)/2)*rtts

      if primary_limit < secondary.cwnd - (secondary.package_window+1):
        secondary.send(source.get_packet())
      else:
        break # done for now, wait for SENDME to free up CWND and restart

  Note that BLEST also has a parameter lambda that is updated whenever HoL
  blocking occurs. Because it is expensive and takes significant time to
  signal this over Tor, we omit this.

  XXX: We may want a third algorithm that only uses cwnd, for comparison.
  The above algorithm may have issues if the primary cwnd grows while the
  secondary does not. Expect this section to change.

  XXX: See [REORDER_SIGNALING] section if we want this lambda feedback.

3.3. Reorder queue signaling [REORDER_SIGNALING]

  Reordering is a fairly simple task. By using the sequence number field
  in [SEQUENCING], endpoints can know how many cells are still in flight
  on the other leg.

  To reorder them properly, a buffer of out-of-order cells needs to be
  kept. On the Exit side, this can quickly become overwhelming,
  considering that tens of thousands of circuits can be held open,
  leading to gigabytes of memory being used. There is a clear potential
  memory DoS vector in this case, covered in more detail in
  [MEMORY_DOS].

  Luckily, [BLEST_TOR] and the form of [LOWRTT_TOR] that only uses the
  primary circuit will minimize or eliminate this out-of-order buffer.

  XXX: The remainder of this section may be over-complicating things... We
  only need these concepts if we want to use BLEST's lambda feedback. Though
  turning this into some kind of receive window that indicates remaining
  reorder buffer size may also help with the total_send_window also noted
  in BLEST_TOR.

  The default for this queue size is governed by the 'cflx_reorder_client'
  and 'cflx_reorder_srv' consensus parameters (see [CONSENSUS]).
  'cflx_reorder_srv' applies to Exits and onion services. Both parameters
  can be overridden by Torrc, to larger or smaller than the consensus
  parameter. (Low memory clients may want to lower it; SecureDrop onion
  services or other high-upload services may want to raise it).

  When the reorder queue hits this size, a RELAY_CONFLUX_XOFF is sent down
  the circuit leg that has data waiting in the queue, and use of that leg
  SHOULD cease until the queue drains to half of this value, at which
  point a RELAY_CONFLUX_XON is sent. Note that this is different from the
  stream XON/XOFF of Proposal 324.
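
  A sketch of that hysteresis (hypothetical names; the limit comes from
  the consensus parameters above):

    def on_reorder_queue_change(leg, queue_len, limit):
        if not leg.xoff_sent and queue_len >= limit:
            leg.send_cell("RELAY_CONFLUX_XOFF")  # stop using this leg
            leg.xoff_sent = True
        elif leg.xoff_sent and queue_len <= limit // 2:
            leg.send_cell("RELAY_CONFLUX_XON")   # drained to half; resume
            leg.xoff_sent = False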

  XXX: [BLEST] actually does not cease use of a path in this case, but
  instead uses this signal to adjust the lambda parameter, which biases
  traffic away from that leg.


4. Security Considerations

4.1. Memory Denial of Service [MEMORY_DOS]

  Both reorder queues and retransmit buffers inherently represent a memory
  denial of service condition.

  For [RESUMPTION] retransmit buffers, endpoints that support this feature
  SHOULD free retransmit information as soon as they get close to memory
  pressure. This prevents resumption while data is in flight, but will not
  otherwise harm operation.

  For reorder buffers, adversaries can potentially impact this at any
  point, but most obviously and most severely from the client position.

  In particular, clients can lie about sequence numbers, sending cells
  with sequence numbers such that the next expected sequence number is
  never sent. They can do this repeatedly on many circuits, to exhaust
  memory at exits.

  One option is to only allow actual traffic splitting in the downstream
  direction, towards clients, and always use the primary circuit for
  everything in the upstream direction. However, the ability to support
  conflux from the client to the exit shows promise against traffic
  analysis (see [WTF_SPLIT]).

  The other option is to use [BLEST_TOR] from clients to exits, as it has
  predictable interleaved cell scheduling, and minimizes reorder queues at
  exits. If the ratios prescribed by that algorithm are not followed
  within some bounds, the other endpoint can close both circuits, and free
  the queue memory.

  This still leaves the possibility that intermediate relays may block a
  leg, allowing cells to traverse only one leg, thus still accumulating at
  the reorder queue. Clients can also spoof sequence numbers similarly, to
  make it appear that they are following [BLEST_TOR], without actually
  sending any data on one of the legs.

  To handle either of these cases, when a relay is under memory pressure,
  the circuit OOM killer SHOULD free and close circuits with the oldest
  reorder queue data, first. This heuristic was shown to be best during
  the [SNIPER] attack OOM killer iteration cycle.
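
  A sketch of that heuristic (hypothetical names):

    def oom_free_reorder_queues(circuits, bytes_to_free):
        # Close circuits whose oldest queued out-of-order cell has been
        # waiting the longest, until enough memory has been reclaimed.
        freed = 0
        for circ in sorted(circuits,
                           key=lambda c: c.oldest_reorder_cell_time()):
            freed += circ.reorder_queue_bytes()
            circ.close_and_free()
            if freed >= bytes_to_free:
                break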

4.2. Side Channels [SIDE_CHANNELS]

  Two potential side channels may be introduced by the use of Conflux:
     1. RTT leg-use bias by altering SENDME latency
     2. Location info leaks through the use of both legs' latencies

  For RTT and leg-use bias, Guard relays could delay legs to introduce a
  pattern into the delivery of cells at the exit relay, by varying the
  latency of SENDME cells (every 100th cell), changing the distribution of
  traffic in order to send information. This attack could be performed in
  either direction of traffic, to bias traffic load off of a particular
  Guard. If an adversary controls both Guards, it could in theory send a
  binary signal more easily, by alternating delays on each.

  However, this risk must be weighed against the potential benefit against
  traffic fingerprinting, as per [WTF_SPLIT]. Additionally, even ignoring
  cryptographic tagging attacks, this side channel provides significantly
  lower information over time than inter-packet-delay based side channels
  that are already available to Guards and routers along the path to the
  Guard.

  Tor currently provides no defenses against already existing
  single-circuit delay-based side channels, though both circuit padding
  and [BACKLIT] are potential options it could conceivably deploy. The
  [BACKLIT] paper also has an excellent review of the various methods that
  have been studied for such single circuit side channels, and the
  [BACKLIT] style RTT monitoring could be used to protect against these
  conflux side channels as well. Circuit padding can also help to obscure
  which cells are SENDMEs, since circuit padding is not counted towards
  SENDME totals.

  The second class of side channel is where the Exit relay may be able to
  use the two legs to further infer more information about client
  location. See [LATENCY_LEAK] for more details. It is unclear at this
  time how much more severe this is for two paths than just one.

  We preserve the ability to disable conflux to and from Exit relays
  using consensus parameters, if these side channels prove more severe,
  or if it proves possible to mitigate single-circuit side channels, but
  not conflux side channels.

  In all cases, all of these side channels appear less severe for onion
  service traffic, due to the higher path variability from relay
  selection, as well as the end-to-end nature of conflux in that case.
  Thus, we separate our ability to enable/disable conflux for onion
  services from Exits.

4.3. Traffic analysis [TRAFFIC_ANALYSIS]

  Even though conflux shows benefits against traffic analysis in
  [WTF_SPLIT], these gains may be moot if the adversary is able to perform
  packet counting and timing analysis at guards to guess which specific
  circuits are linked.  In particular, the 3 way handshake in
  [LINKING_CIRCUITS] may be quite noticeable.

  As one countermeasure, it may be possible to eliminate the third stage
  of the handshake (RELAY_CONFLUX_LINKED_ACK) by computing the
  exit/service RTT via measuring the time between CREATED/REND_JOINED and
  RELAY_CONFLUX_LINK, but this will introduce cross-component complexity
  into Tor's protocol that could quickly become unwieldy and fragile.

  Additionally, the conflux handshake may make onion services stand out
  more, regardless of the number of stages in the handshake. For this
  reason, it may be wiser to simply address these issues with circuit
  padding machines during circuit setup (see padding-spec.txt).

  Additional traffic analysis considerations arise when combining conflux
  with padding, for purposes of mitigating traffic fingerprinting. For
  this, it seems wise to treat the packet schedulers as another piece of a
  combined optimization problem in tandem with optimizing padding
  machines, perhaps introducing randomness or fudge factors into their
  scheduling, as a parameterized distribution. For details, see
  https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md

  Finally, conflux may exacerbate forms of confirmation-based traffic
  analysis that close circuits to determine concretely if they were in
  use, since closing either leg might cause resumption to fail. TCP RST
  injection can perform this attack on the side, without surveillance
  capability. [RESUMPTION] with buffering of the inflight unacked
  package_window data, for retransmit, is a partial mitigation, if
  endpoints buffer this data for retransmission for a brief time even if
  both legs close. This seems more feasible for onion services, which are
  more vulnerable to this attack. However, if the adversary controls the
  client, they will notice the resumption re-link, and still obtain
  confirmation that way.

  It seems the only way to fully mitigate these kinds of attacks is with
  the Snowflake pluggable transport, which provides its own resumption and
  retransmit behavior. Additionally, Snowflake's use of UDP DTLS also
  protects against TCP RST injection, which we suspect to be the main
  vector for such attacks.

  In the future, a DTLS or QUIC transport for Tor such as masque could
  provide similar RST injection resistance, and resumption at Guard/Bridge
  nodes, as well.


5. System Interactions

  - congestion control
  - EWMA and KIST
  - CBT and number of guards
  - Onion service circ obfuscation
  - Future UDP (may increase need for UDP to buffer before dropping)
  - Padding (no sequence numbers on padding cells, as per [SEQUENCING])
    - Also, any padding machines may need re-tuning
  - No 'cannibalization' of linked circuits


6. Consensus and Torrc Parameters [CONSENSUS]

  - conflux_circs
    - Number of conflux circuits

  - conflux_sched_exits, conflux_sched_clients, conflux_sched_service
    - Three forms of LOWRTT_TOR, and BLEST_TOR

  - ConfluxOnionService
  - ConfluxOnionCircs


7. Tuning Experiments [EXPERIMENTS]

  - conflux_sched & conflux_exits
    - Exit reorder queue size
    - Responsiveness vs throughput tradeoff?
  - Congestion control
  - EWMA and KIST
  - num guards & conflux_circs


Appendix A [ALTERNATIVES]

A.1 Alternative Link Handshake [ALTERNATIVE_LINKING]

  The circuit linking in [LINKING_CIRCUITS] could be done as encrypted
  ntor onionskin extension fields, similar to those used by v3 onions.

  This approach has at least four problems:
    i). For onion services, since onionskins traverse the intro circuit
        and return on the rend circuit, this handshake cannot measure
        RTT there.
   ii). Since these onionskins are larger, and have no PFS, an adversary
        at the middle relay knows that the onionskin is for linking, and
        can potentially try to obtain the onionskin key for attacks on
        the link.
  iii). It makes linking circuits more fragile, since they could timeout
        due to CBT, or other issues during construction.
   iv). The overhead in processing this onionskin in onionskin queues
        adds additional time for linking, even in the Exit case, making
        that RTT potentially noisy.

  Additionally, it is not clear that this approach actually saves us
  anything in terms of setup time, because we can optimize away the
  linking phase using Proposal 340, to combine initial RELAY_BEGIN cells
  with RELAY_CONFLUX_LINK.

A.2. Alternative RTT measurement [ALTERNATIVE_RTT]

  Instead of measuring RTTs during [LINKING_CIRCUITS], we could create
  PING/PONG cells, whose sole purpose is to allow endpoints to measure
  RTT.

  This was rejected for several reasons. First, during circuit use, we
  already have SENDMEs to measure RTT. Every 100 cells (or
  'circwindow_inc' from Proposal 324), we are able to re-measure RTT based
  on the time between that Nth cell and the SENDME ack. So we only need
  PING/PONG to measure initial circuit RTT.

  If we were able to use onionskins, as per [ALTERNATIVE_LINKING] above,
  we might be able to specify a PING/PONG/PING handshake solely for
  measuring initial RTT, especially for onion service circuits.

  The reason for not making a dedicated PING/PONG for this purpose is that
  it is context-free. Even if we were able to use onionskins for linking
  and resumption, to avoid additional data in handshake that just measures
  RTT, we would have to enforce that this PING/PONG/PING only follows the
  exact form needed by this proposal, at the expected time, and at no
  other points.

  If we do not enforce this specific use of PING/PONG/PING, it becomes
  another potential side channel, for use in attacks such as [DROPMARK].

  In general, Tor is planning to remove current forms of context-free and
  semantic-free cells from its protocol:
  https://gitlab.torproject.org/tpo/core/torspec/-/issues/39

  We should not add more.


Appendix B: Acknowledgments [ACKNOWLEDGEMENTS]

  Thanks to Per Hurtig for helping us with the framing of the MPTCP
  problem space.

  Thanks to Simone Ferlin for clarifications on the [BLEST] paper, and for
  pointing us at the Linux kernel implementation.

  Extreme thanks goes again to Toke Høiland-Jørgensen, who helped
  immensely towards our understanding of how the BLEST condition relates
  to edge connection pushback, and for clearing up many other
  misconceptions we had.

  Finally, thanks to Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian
  Goldberg, for the original [CONFLUX] paper!


References:

[CONFLUX]
   https://freehaven.net/anonbib/papers/pets2013/paper_65.pdf

[BLEST]
  https://olivier.mehani.name/publications/2016ferlin_blest_blocking_estimation_mptcp_scheduler.pdf
  https://opus.lib.uts.edu.au/bitstream/10453/140571/2/08636963.pdf
  https://github.com/multipath-tcp/mptcp/blob/mptcp_v0.95/net/mptcp/mptcp_blest.c

[WTF_SPLIT]
   https://www.comsys.rwth-aachen.de/fileadmin/papers/2020/2020-delacadena-trafficsliver.pdf

[COUPLED]
   https://datatracker.ietf.org/doc/html/rfc6356
   https://www.researchgate.net/profile/Xiaoming_Fu2/publication/230888515_Delay-based_Congestion_Control_for_Multipath_TCP/links/54abb13f0cf2ce2df668ee4e.pdf?disableCoverPage=true
   http://staff.ustc.edu.cn/~kpxue/paper/ToN-wwj-2020.04.pdf
   https://www.thinkmind.org/articles/icn_2019_2_10_30024.pdf
   https://arxiv.org/pdf/1308.3119.pdf

[BACKLIT]
   https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

[LATENCY_LEAK]
   https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
   https://www.robgjansen.com/publications/howlow-pets2013.pdf

[SNIPER]
   https://www.freehaven.net/anonbib/cache/sniper14.pdf

[DROPMARK]
   https://www.petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf