From d73bdd1e0b44b0f9c6ad8da216cdba0a9be2f456 Mon Sep 17 00:00:00 2001 From: Mike Perry Date: Fri, 26 Mar 2021 14:49:55 +0000 Subject: Reformat Prop329. --- proposals/329-traffic-splitting.txt | 1407 ++++++++++++++++++----------------- 1 file changed, 709 insertions(+), 698 deletions(-) (limited to 'proposals') diff --git a/proposals/329-traffic-splitting.txt b/proposals/329-traffic-splitting.txt index 746c6c4..8d73a8c 100644 --- a/proposals/329-traffic-splitting.txt +++ b/proposals/329-traffic-splitting.txt @@ -6,657 +6,668 @@ Status: Draft 0. Status - This proposal describes the Conflux [CONFLUX] system developed by Mashael - AlSabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. It aims at improving - Tor client network performance by dynamically splitting traffic between two - circuits. + This proposal describes the Conflux [CONFLUX] system developed by + Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian Goldberg. It aims at + improving Tor client network performance by dynamically splitting + traffic between two circuits. 1. Overview 1.1. Multipath TCP Design Space - In order to understand our improvements to Conflux, it is important to - properly conceptualize what is involved in the design of multipath - algorithms in general. - - The design space is broken into two orthogonal parts: congestion - control algorithms that apply to each path, and traffic scheduling - algorithms that decide when to send packets to send on each path. - - MPTCP specifies 'coupled' congestion control (see [COUPLED]). Coupled - congestion control updates single-path congestion control algorithms to - account for shared bottlenecks between the paths, so that the combined - congestion control algorithms do not overwhelm any bottlenecks that - happen to be shared between the multiple paths. Various ways of - accomplishing this have been proposed and implemented in the Linux kernel. - - Because Tor's congestion control only concerns itself with bottnecks in Tor - relay queues, and not with any other bottlenecks (such as intermediate - Internet routers), we can avoid this complexity merely by specifying that - any paths that are constructed should not share any relays. In this way, we - can proceed to use the exact same congestion control as specified in Proposal - 324, for each path. - - For this reason, this proposal will focus on the traffic scheduling - algorithms, rather than coupling. We propose three candidate algorithms - that have been studied in the literature, and will compare their - performance using simulation and consensus parameters. + In order to understand our improvements to Conflux, it is important to + properly conceptualize what is involved in the design of multipath + algorithms in general. + + The design space is broken into two orthogonal parts: congestion control + algorithms that apply to each path, and traffic scheduling algorithms + that decide when to send packets to send on each path. + + MPTCP specifies 'coupled' congestion control (see [COUPLED]). Coupled + congestion control updates single-path congestion control algorithms to + account for shared bottlenecks between the paths, so that the combined + congestion control algorithms do not overwhelm any bottlenecks that + happen to be shared between the multiple paths. Various ways of + accomplishing this have been proposed and implemented in the Linux + kernel. 
+
+ Because Tor's congestion control only concerns itself with bottlenecks in
+ Tor relay queues, and not with any other bottlenecks (such as
+ intermediate Internet routers), we can avoid this complexity merely by
+ specifying that any paths that are constructed should not share any
+ relays. In this way, we can proceed to use the exact same congestion
+ control as specified in Proposal 324, for each path.
+
+ For this reason, this proposal will focus on the traffic scheduling
+ algorithms, rather than coupling. We propose three candidate algorithms
+ that have been studied in the literature, and will compare their
+ performance using simulation and consensus parameters.

1.2. Divergence from the initial Conflux design

- The initial [CONFLUX] paper doesn't provide any indications on how to handle
- the size of out-of-order cell queue, which we consider a potential dangerous
- memory DoS vector (see [MEMORY_DOS]). It also used RTT as the sole heuristic
- for selecting which circuit to send on, which may vary depending on the
- geographical locations of the participant relays, without considering their
- actual available circuit capacity (which will be available to us via Proposal
- 324). Additionally, since the publication of [CONFLUX], more modern
- packet scheduling algorithms have been developed, which aim to reduce
- out-of-order queue size.
-
- We propose mitigations for these issues using modern scheduling algorithms,
- as well as implementations options for avoiding the out-of-order queue at
- Exit relays. Additionally, we consider resumption, side channel, and traffic
- analysis risks and benefits in [RESUMPTION], [SIDE_CHANNELS] and
- [TRAFFIC_ANALYSIS].
+ The initial [CONFLUX] paper doesn't provide any indications on how to
+ handle the size of the out-of-order cell queue, which we consider a
+ potentially dangerous memory DoS vector (see [MEMORY_DOS]). It also used
+ RTT as the sole heuristic for selecting which circuit to send on, which
+ may vary depending on the geographical locations of the participant
+ relays, without considering their actual available circuit capacity
+ (which will be available to us via Proposal 324). Additionally, since
+ the publication of [CONFLUX], more modern packet scheduling algorithms
+ have been developed, which aim to reduce out-of-order queue size.
+
+ We propose mitigations for these issues using modern scheduling
+ algorithms, as well as implementation options for avoiding the
+ out-of-order queue at Exit relays. Additionally, we consider resumption,
+ side channel, and traffic analysis risks and benefits in [RESUMPTION],
+ [SIDE_CHANNELS] and [TRAFFIC_ANALYSIS].

2. Design

- The following section describes the Conflux design. Each sub-section is a
- building block to the multipath design that Conflux proposes.
-
- The circuit construction is as follow:
-
- Primary Circuit (lower RTT)
- +-------+ +--------+
- |Guard 1|----->|Middle 1|----------+
- +---^---+ +--------+ |
- +-----+ | +--v---+
- | OP +------+ | Exit |--> ...
- +-----+ | +--^---+
- +---v---+ +--------+ |
- |Guard 2|----->|Middle 2|----------+
- +-------+ +--------+
- Secondary Circuit (higher RTT)
-
- Both circuits are built using current Tor path selection, however they
- SHOULD NOT share the same Guard relay, or middle relay. By avoiding
- using the same relays in these positions in the path, we ensure
- additional path capacity, and eliminate the need to use more complicated
- 'coupled' congestion control algorithms from the MPTCP literature[COUPLED].
- This both simplifies design, and improves performance. - - Then, the OP needs to link the two circuits together, as described in - [LINKING_CIRCUITS], [LINKING_EXIT], and [LINKING_SERVICE]. - - For ease of explanation, the primary circuit is the circuit with lower RTT, - and the secondary circuit is the circuit with higher RTT. Initial RTT - is measured during circuit linking, as described in [LINKING_CIRCUITS]. - RTT is continually measured using SENDME timing, as in Proposal 324. - This means that during use, the primary circuit and secondary circuit may - switch roles, depending on unrelated network congestion caused by other - Tor clients. - - We also support linking onion service circuits together. In this case, - only two rendezvous circuits are linked. Each of these RP circuits will be - constructed separately, and then linked. However, the same path constraints - apply to each half of the circuits (no shared relays between the legs). - Should, by chance, the service and the client sides end up sharing some - relays, this is not catastrophic. Multipath TCP researchers we have - consulted believe Tor's congestion control from Proposal 324 to be - sufficient in this rare case. - - Only two circuits SHOULD be linked together. However, implementations - SHOULD make it easy for researchers to *test* more than two paths, as this - has been shown to assist in traffic analysis resistance[WTF_SPLIT]. At - minimum, this means not hardcoding only two circuits in the implementation. - - If the number of circuits exceeds the current number of guard relays, - guard relays MAY be re-used, but implementations SHOULD use the same - number of Guards as paths. - - Linked circuits MUST NOT be extended further once linked (ie: - 'cannibalization' is not supported). + The following section describes the Conflux design. Each sub-section is + a building block to the multipath design that Conflux proposes. + + The circuit construction is as follow: + + Primary Circuit (lower RTT) + +-------+ +--------+ + |Guard 1|----->|Middle 1|----------+ + +---^---+ +--------+ | + +-----+ | +--v---+ + | OP +------+ | Exit |--> ... + +-----+ | +--^---+ + +---v---+ +--------+ | + |Guard 2|----->|Middle 2|----------+ + +-------+ +--------+ + Secondary Circuit (higher RTT) + + Both circuits are built using current Tor path selection, however they + SHOULD NOT share the same Guard relay, or middle relay. By avoiding + using the same relays in these positions in the path, we ensure + additional path capacity, and eliminate the need to use more complicated + 'coupled' congestion control algorithms from the MPTCP + literature[COUPLED]. This both simplifies design, and improves + performance. + + Then, the OP needs to link the two circuits together, as described in + [LINKING_CIRCUITS], [LINKING_EXIT], and [LINKING_SERVICE]. + + For ease of explanation, the primary circuit is the circuit with lower + RTT, and the secondary circuit is the circuit with higher RTT. Initial + RTT is measured during circuit linking, as described in + [LINKING_CIRCUITS]. RTT is continually measured using SENDME timing, as + in Proposal 324. This means that during use, the primary circuit and + secondary circuit may switch roles, depending on unrelated network + congestion caused by other Tor clients. + + We also support linking onion service circuits together. In this case, + only two rendezvous circuits are linked. Each of these RP circuits will + be constructed separately, and then linked. 
However, the same path + constraints apply to each half of the circuits (no shared relays between + the legs). Should, by chance, the service and the client sides end up + sharing some relays, this is not catastrophic. Multipath TCP researchers + we have consulted believe Tor's congestion control from Proposal 324 to + be sufficient in this rare case. + + Only two circuits SHOULD be linked together. However, implementations + SHOULD make it easy for researchers to *test* more than two paths, as + this has been shown to assist in traffic analysis resistance[WTF_SPLIT]. + At minimum, this means not hardcoding only two circuits in the + implementation. + + If the number of circuits exceeds the current number of guard relays, + guard relays MAY be re-used, but implementations SHOULD use the same + number of Guards as paths. + + Linked circuits MUST NOT be extended further once linked (ie: + 'cannibalization' is not supported). 2.1. Advertising support for conflux - We propose a new protocol version in order to advertise support for - circuit linking on the relay side: - - "Relay=4" -- Relay supports an 2 byte sequence number in a RELAY cell - header used for multipath circuit which are linked with the - new RELAY_CIRCUIT_LINK relay cell command. + We propose a new protocol version in order to advertise support for + circuit linking on the relay side: + + "Relay=4" -- Relay supports an 2 byte sequence number in a RELAY cell + header used for multipath circuit which are linked with the + new RELAY_CIRCUIT_LINK relay cell command. + + XXX: Advertise this in onion service descriptor. + XXX: Onion service descriptor can advertise more than two circuits? - XXX: Advertise this in onion service descriptor. - XXX: Onion service descriptor can advertise more than two circuits? - - The next section describes how the circuits are linked together. + The next section describes how the circuits are linked together. 2.2. Linking circuits [LINKING_CIRCUITS] - To link circuits, we propose new relay commands that are sent on both - circuits, as well as a response to confirm the join, and an ack of this - response. These commands create a 3way handshake, which allows each - endpoint to measure the initial RTT of each leg upon link, without - needing to wait for any data. - - All three stages of this handshake are sent on *each* circuit leg to be - linked. - - To save round trips, these cells SHOULD be combined with the initial - RELAY_BEGIN cell on the faster circuit leg, using Proposal 325. See - [LINKING_EXIT] and [LINKING_SERVICE] for more details on setup in each case. - - There are other ways to do this linking that we have considered, but they - seem not to be significantly better than this method, especially since we - can use Proposal 325 to eliminate the RTT cost of this setup before sending - data. For those other ideas, see [ALTERNATIVE_LINKING] and - [ALTERNATIVE_RTT], in the appendix. - - The first two parts of the handshake establish the link, and enable - resumption: - - 16 -- RELAY_CIRCUIT_LINK - - Sent from the OP to the exit/service in order to link - circuits together at the end point. - - 17 -- RELAY_CIRCUIT_LINKED - - Sent from the exit/service to the OP, to confirm the circuits - were linked. - - These cells have the following contents: - - VERSION [1 byte] - PAYLOAD [variable, up to end of relay payload] - - The VERSION tells us which circuit linking mechanism to use. At this point - in time, only 0x01 is recognized and is the one described by the Conflux - design. 
- - For version 0x01, the PAYLOAD contains: - - NONCE [32 bytes] - LAST_SEQNO_SENT [8 bytes] - LAST_SEQNO_RECV [8 bytes] - - XXX: Should we let endpoints specify their preferred [SCHEDULING] alg - here, to override consensus params? This has benefits: eg low-memory - mobile clients can ask for an alg that is better for their reorder - queues. But it also has complexity risk, if the other endpoint does - not want to support it, because of its own memory issues. - - The NONCE contains a random 256-bit secret, used to associate the two - circuits together. The nonce must not be shared outside of the circuit - transmission, or data may be injected into TCP streams. This means it - MUST NOT be logged to disk. - - The two sequence number fields are 0 upon initial link, but non-zero in the - case of a resumption attempt (See [RESUMPTION]). - - If either circuit does not receive a RELAY_CIRCUIT_LINKED response, both - circuits MUST be closed. - - The third stage of the handshake exists to help the exit/service measure - initial RTT, for use in [SCHEDULING]: - - 18 -- RELAY_CIRCUIT_LINKED_RTT_ACK - - Sent from the OP to the exit/service, to provide initial RTT - measurement for the exit/service. - - For timeout of the handshake, clients should use the normal SOCKS/stream - timeout already in use for RELAY_BEGIN. - - These three relay commands (RELAY_CIRCUIT_LINK, RELAY_CIRCUIT_LINKED, - and RELAY_CIRCUIT_LINKED_ACK) are send on *each* leg, to allow each - endpoint to measure the initial RTT of each leg. + To link circuits, we propose new relay commands that are sent on both + circuits, as well as a response to confirm the join, and an ack of this + response. These commands create a 3way handshake, which allows each + endpoint to measure the initial RTT of each leg upon link, without + needing to wait for any data. + + All three stages of this handshake are sent on *each* circuit leg to be + linked. + + To save round trips, these cells SHOULD be combined with the initial + RELAY_BEGIN cell on the faster circuit leg, using Proposal 325. See + [LINKING_EXIT] and [LINKING_SERVICE] for more details on setup in each + case. + + There are other ways to do this linking that we have considered, but + they seem not to be significantly better than this method, especially + since we can use Proposal 325 to eliminate the RTT cost of this setup + before sending data. For those other ideas, see [ALTERNATIVE_LINKING] + and [ALTERNATIVE_RTT], in the appendix. + + The first two parts of the handshake establish the link, and enable + resumption: + + 16 -- RELAY_CIRCUIT_LINK + + Sent from the OP to the exit/service in order to link + circuits together at the end point. + + 17 -- RELAY_CIRCUIT_LINKED + + Sent from the exit/service to the OP, to confirm the circuits + were linked. + + These cells have the following contents: + + VERSION [1 byte] + PAYLOAD [variable, up to end of relay payload] + + The VERSION tells us which circuit linking mechanism to use. At this + point in time, only 0x01 is recognized and is the one described by the + Conflux design. + + For version 0x01, the PAYLOAD contains: + + NONCE [32 bytes] + LAST_SEQNO_SENT [8 bytes] + LAST_SEQNO_RECV [8 bytes] + + XXX: Should we let endpoints specify their preferred [SCHEDULING] alg + here, to override consensus params? This has benefits: eg low-memory + mobile clients can ask for an alg that is better for their reorder + queues. But it also has complexity risk, if the other endpoint does not + want to support it, because of its own memory issues. 
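+
+ For illustration, here is one way an implementation might pack and
+ parse the relay cell body described above (VERSION plus the version
+ 0x01 PAYLOAD). It assumes Tor's usual big-endian encoding for the
+ integer fields; the helper names are only illustrative and are not
+ part of this proposal:
+
+   import os, struct
+
+   def encode_link_payload_v1(nonce=None, seqno_sent=0, seqno_recv=0):
+     # VERSION (1 byte) || NONCE (32 bytes) || LAST_SEQNO_SENT (8 bytes)
+     # || LAST_SEQNO_RECV (8 bytes). Both counters are 0 on initial link.
+     if nonce is None:
+       nonce = os.urandom(32)  # fresh 256-bit link secret; never log it
+     return struct.pack("!B32sQQ", 0x01, nonce, seqno_sent, seqno_recv)
+
+   def parse_link_payload_v1(payload):
+     version, nonce, sent, recv = struct.unpack_from("!B32sQQ", payload)
+     if version != 0x01:
+       raise ValueError("unknown conflux link version")
+     return nonce, sent, recv
+
+ A resumption attempt would reuse the stored nonce and fill in the two
+ counters as described in [RESUMPTION].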
+
+ The NONCE contains a random 256-bit secret, used to associate the two
+ circuits together. The nonce must not be shared outside of the circuit
+ transmission, or data may be injected into TCP streams. This means it
+ MUST NOT be logged to disk.
+
+ The two sequence number fields are 0 upon initial link, but non-zero in
+ the case of a resumption attempt (see [RESUMPTION]).
+
+ If either circuit does not receive a RELAY_CIRCUIT_LINKED response, both
+ circuits MUST be closed.
+
+ The third stage of the handshake exists to help the exit/service measure
+ initial RTT, for use in [SCHEDULING]:
+
+ 18 -- RELAY_CIRCUIT_LINKED_RTT_ACK
+
+ Sent from the OP to the exit/service, to provide initial RTT
+ measurement for the exit/service.
+
+ For timeout of the handshake, clients should use the normal SOCKS/stream
+ timeout already in use for RELAY_BEGIN.
+
+ These three relay commands (RELAY_CIRCUIT_LINK, RELAY_CIRCUIT_LINKED,
+ and RELAY_CIRCUIT_LINKED_RTT_ACK) are sent on *each* leg, to allow each
+ endpoint to measure the initial RTT of each leg.

2.2. Linking Circuits from OP to Exit [LINKING_EXIT]

- To link exit circuits, two circuits to the same exit are built. The
- client records the circuit build time of each.
-
- If the circuits are being built on-demand, for immediate use, the
- circuit with the lower build time SHOULD use Proposal 325 to append its
- first RELAY cell to the RELAY_COMMAND_LINK, on the circuit with the
- lower circuit build time. The exit MUST respond on this same leg.
- After that, actual RTT measurements MUST be used to determine
- future transmissions, as specified in [SCHEDULING].
-
- The RTT times between RELAY_COMMAND_LINK and RELAY_COMMAND_LINKED are
- measured by the client, to determine each circuit RTT to determine
- primary vs secondary circuit use, and for packet scheduling. Similarly,
- the exit measures the RTT times between RELAY_COMMAND_LINKED and
- RELAY_COMMAND_LINKED_ACK, for the same purpose.
-
+ To link exit circuits, two circuits to the same exit are built. The
+ client records the circuit build time of each.
+
+ If the circuits are being built on-demand, for immediate use, the
+ circuit with the lower build time SHOULD use Proposal 325 to append its
+ first RELAY cell to the RELAY_COMMAND_LINK, on the circuit with the
+ lower circuit build time. The exit MUST respond on this same leg. After
+ that, actual RTT measurements MUST be used to determine future
+ transmissions, as specified in [SCHEDULING].
+
+ The RTT times between RELAY_COMMAND_LINK and RELAY_COMMAND_LINKED are
+ measured by the client, to determine each circuit's RTT, for selecting
+ the primary vs secondary circuit, and for packet scheduling. Similarly,
+ the exit measures the RTT times between RELAY_COMMAND_LINKED and
+ RELAY_COMMAND_LINKED_ACK, for the same purpose.
+
2.3. Linking circuits to an onion service [LINKING_SERVICE]
-
- For onion services, we will only concern ourselves with linking
- rendezvous circuits.
-
- To join rendezvous circuits, clients make two introduce requests to a
- service's intropoint, causing it to create two rendezvous circuits, to
- meet the client at two separate rendezvous points. These introduce
- requests MUST be sent to the same intropoint (due to potential use of
- onionbalance), and SHOULD be sent back-to-back on the same intro
- circuit. They MAY be combined with Proposal 325.
- - The first rendezvous circuit to get joined SHOULD use Proposal 325 - to append the RELAY_BEGIN command, and the service MUST answer - on this circuit, until RTT can be measured. - - Once both circuits are linked and RTT is measured, packet scheduling - should be used, as per [SCHEDULING]. - + + For onion services, we will only concern ourselves with linking + rendezvous circuits. + + To join rendezvous circuits, clients make two introduce requests to a + service's intropoint, causing it to create two rendezvous circuits, to + meet the client at two separate rendezvous points. These introduce + requests MUST be sent to the same intropoint (due to potential use of + onionbalance), and SHOULD be sent back-to-back on the same intro + circuit. They MAY be combined with Proposal 325. + + The first rendezvous circuit to get joined SHOULD use Proposal 325 to + append the RELAY_BEGIN command, and the service MUST answer on this + circuit, until RTT can be measured. + + Once both circuits are linked and RTT is measured, packet scheduling + should be used, as per [SCHEDULING]. + 2.4. Congestion Control Application [CONGESTION_CONTROL] - - The SENDMEs for congestion control are performed per-leg. As data arrives, - regardless of its ordering, it is counted towards SENDME delivery. In this - way, 'cwnd - package_window' of each leg always reflects the available - data to send on each leg. This is important for [SCHEDULING]. - - The Congestion control Stream XON/XOFF can be sent on either leg, and - applies to the stream's transmission on both legs. - + + The SENDMEs for congestion control are performed per-leg. As data + arrives, regardless of its ordering, it is counted towards SENDME + delivery. In this way, 'cwnd - package_window' of each leg always + reflects the available data to send on each leg. This is important for + [SCHEDULING]. + + The Congestion control Stream XON/XOFF can be sent on either leg, and + applies to the stream's transmission on both legs. + 2.5. Sequencing [SEQUENCING] - - With multiple paths for data, the problem of data re-ordering appears. In - other words, cells can arrive out of order from the two circuits where cell - N + 1 arrives before the cell N. - - Handling this reordering operates after congestion control for each circuit - leg, but before relay cell command processing or stream data delivery. - - For the receiver to be able to reorder the receiving cells, a sequencing - scheme needs to be implemented. However, because Tor does not drop or - reorder packets inside of a circuit, this sequence number can be very - small. It only has to signal that a cell comes after those arriving - on another circuit. - - To achieve this, we add a small sequence number to the common relay - header for all relay cells on linked circuits. This sequence number - is meant to signal the number of cells sent on the *other* leg, so - that each endpoint knows how many cells are still in-flight on - another leg. It is different from the absolute sequence number used - in [LINKING_CIRCUITS] and [RESUMPTION], but can be derived from that - number, using relative arithmetic. - - Relay command [1 byte] - Recognized [2 bytes] - StreamID [2 bytes] - Digest [4 bytes] - Length [2 bytes] - > LongSeq [1 bit] # If this bit is set, use 31 bits for Seq - > Sequencing [7 or 31 bits] - Data [Remainder] - - The sequence number is only set for the first cell after the endpoint - switches legs. In this case, LongSeq is set to 1, and the Sequencing - field is 31 more bits. 
Otherwise it is a 1 byte 0 value. - - These fields MUST be present on ALL end-to-end relay cells on each leg - that come from the endpoint, following a RELAY_CIRCUIT_LINK command. - - They are absent on 'leaky pipe' RELAY_COMMAND_DROP and - RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, - as opposed to the endpoint, to support padding. - - When an endpoint switches legs, on the first cell in a new leg, - LongSeq is set to 1, and the following 31 bits represent the *total* - number of cells sent on the *other* leg, before the switch. The receiver - must wait for that number of cells to arrive from the previous leg - before delivering that cell. - - XXX: In the rare event that we send more than 2^31 cells (~1TB) on a - single leg, do we force a switch of legs, or expand the field further? - - An alternative method of sequencing, that assumes that the endpoint - knows when it is going to switch, the cell before it switches, is - specified in [ALTERNATIVE_SEQUENCING]. Note that that method requires - only 1 byte for sequence number and switch signaling, but requires that - the sender know that it is planning to switch, the cell before it switches. - (This is possible with [BLEST_TOR], but [LOWRTT_TOR] can switch based on - RTT change, so it may be one cell late in that case). + + With multiple paths for data, the problem of data re-ordering appears. + In other words, cells can arrive out of order from the two circuits + where cell N + 1 arrives before the cell N. + + Handling this reordering operates after congestion control for each + circuit leg, but before relay cell command processing or stream data + delivery. + + For the receiver to be able to reorder the receiving cells, a sequencing + scheme needs to be implemented. However, because Tor does not drop or + reorder packets inside of a circuit, this sequence number can be very + small. It only has to signal that a cell comes after those arriving on + another circuit. + + To achieve this, we add a small sequence number to the common relay + header for all relay cells on linked circuits. This sequence number is + meant to signal the number of cells sent on the *other* leg, so that + each endpoint knows how many cells are still in-flight on another leg. + It is different from the absolute sequence number used in + [LINKING_CIRCUITS] and [RESUMPTION], but can be derived from that + number, using relative arithmetic. + + Relay command [1 byte] + Recognized [2 bytes] + StreamID [2 bytes] + Digest [4 bytes] + Length [2 bytes] + > LongSeq [1 bit] # If this bit is set, use 31 bits for Seq + > Sequencing [7 or 31 bits] + Data [Remainder] + + The sequence number is only set for the first cell after the endpoint + switches legs. In this case, LongSeq is set to 1, and the Sequencing + field is 31 more bits. Otherwise it is a 1 byte 0 value. + + These fields MUST be present on ALL end-to-end relay cells on each leg + that come from the endpoint, following a RELAY_CIRCUIT_LINK command. + + They are absent on 'leaky pipe' RELAY_COMMAND_DROP and + RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, as + opposed to the endpoint, to support padding. + + When an endpoint switches legs, on the first cell in a new leg, LongSeq + is set to 1, and the following 31 bits represent the *total* number of + cells sent on the *other* leg, before the switch. The receiver must wait + for that number of cells to arrive from the previous leg before + delivering that cell. 
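+
+ For illustration, a sketch of one way this split field might be parsed,
+ assuming LongSeq occupies the high bit of the first byte (the precise
+ bit and byte layout here is an assumption, not something fixed by the
+ header diagram above):
+
+   def parse_sequencing(field):
+     # field: the bytes following the Length field of the relay header.
+     first = field[0]
+     if first & 0x80:
+       # LongSeq is set: the next 31 bits give the total number of cells
+       # sent on the *other* leg before the sender switched to this one.
+       seq = ((first & 0x7f) << 24) | (field[1] << 16) | \
+             (field[2] << 8) | field[3]
+       return (True, seq, 4)   # (switch cell, cell count, bytes consumed)
+     return (False, 0, 1)      # ordinary cell: a single zero byte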
+ + XXX: In the rare event that we send more than 2^31 cells (~1TB) on a + single leg, do we force a switch of legs, or expand the field further? + + An alternative method of sequencing, that assumes that the endpoint + knows when it is going to switch, the cell before it switches, is + specified in [ALTERNATIVE_SEQUENCING]. Note that that method requires + only 1 byte for sequence number and switch signaling, but requires that + the sender know that it is planning to switch, the cell before it + switches. (This is possible with [BLEST_TOR], but [LOWRTT_TOR] can + switch based on RTT change, so it may be one cell late in that case). 2.6. Resumption [RESUMPTION] - In the event that a circuit leg is destroyed, they MAY be resumed. - - Resumption is achieved by re-using the NONCE and method to the same endpoint - (either [LINKING_EXIT] or [LINKING_SERVICE]). The resumed path need not - use the same middle and guard relays, but should not share any relays - with any existing legs(s). - - To provide resumption, endpoints store an absolute 64bit cell counter of - the last cell they have sent on a conflux pair (their LAST_SEQNO_SENT), - as well the last sequence number they have delivered in-order to edge - connections corresponding to a conflux pair (their LAST_SEQNO_RECV). - Additionally, endpoints MAY store the entire contents of unacked - inflight cells (ie the 'package_window' from proposal 324), for each - leg, along with information corresponding to those cells' absolute - sequence numbers. - - These 64 bit absolute counters can wrap without issue, as congestion - windows will never grow to 2^64 cells until well past the Singularity. - However, it is possible that extremely long, bulk circuits could - exceed 2^64 total sent or received cells, so endpoints SHOULD handle - wrapped sequence numbers for purposes of computing retransmit - information. (But even this case is unlikely to happen within the next - decade or so). - - Upon resumption, the LAST_SEQNO_SENT and LAST_SEQNO_RECV fields are used to - convey the sequence numbers of the last cell the relay sent and received on - that leg. The other endpoint can use these sequence numbers to determine if - it received the in-flight data or not, or sent more data since that point, - up to and including this absolute sequence number. If LAST_SEQNO_SENT - has not been received, the endpoint MAY transmit the missing data, if it - still has it buffered. - - Because both endpoints get information about the other side's absolute - SENT sequence number, they will know exactly how many re-transmitted - packets to expect, should the circuit stay open. Re-transmitters - should not re-increment their absolute sent fields while re-transmitting. - - If it does not have this missing data due to memory pressure, that endpoint - should destroy *both* legs, as this represents unrecoverable data loss. - - Otherwise, the new circuit can be re-joined, and its RTT can be compared - to the remaining circuit to determine if the new leg is primary or - secondary. - - It is even possible to resume conflux circuits where both legs have - been collapsed using this scheme, if endpoints continue to buffer their - unacked package_window data for some time after this close. However, - see [TRAFFIC_ANALYSIS] for more details on the full scope of this - issue. - - If endpoints are buffering package_window data, such data should be - given priority to be freed in any oomkiller invocation. See - [MEMORY_DOS] for more oomkiller information. 
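+
+ To make the retransmit decision in this section concrete, here is a
+ minimal sketch of what an endpoint might do with the peer's
+ LAST_SEQNO_RECV during a resumption attempt. The names and buffer
+ layout are illustrative only; the subtraction is done modulo 2^64 so
+ that wrapped counters are handled:
+
+   def cells_to_retransmit(last_seqno_sent, peer_last_seqno_recv, buffer):
+     # Number of our sent cells the peer has not yet delivered in-order.
+     missing = (last_seqno_sent - peer_last_seqno_recv) % (1 << 64)
+     if missing == 0:
+       return []               # nothing was lost in flight
+     if missing > len(buffer):
+       return None             # data already freed: close *both* legs
+     # Resend the buffered tail; do not re-increment the sent counter.
+     return buffer[-missing:]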
+ In the event that a circuit leg is destroyed, it MAY be resumed.
+
+ Resumption is achieved by re-using the NONCE and method to the same
+ endpoint (either [LINKING_EXIT] or [LINKING_SERVICE]). The resumed path
+ need not use the same middle and guard relays, but should not share any
+ relays with any existing leg(s).
+
+ To provide resumption, endpoints store an absolute 64-bit cell counter of
+ the last cell they have sent on a conflux pair (their LAST_SEQNO_SENT),
+ as well as the last sequence number they have delivered in-order to edge
+ connections corresponding to a conflux pair (their LAST_SEQNO_RECV).
+ Additionally, endpoints MAY store the entire contents of unacked
+ inflight cells (ie the 'package_window' from Proposal 324), for each
+ leg, along with information corresponding to those cells' absolute
+ sequence numbers.
+
+ These 64-bit absolute counters can wrap without issue, as congestion
+ windows will never grow to 2^64 cells until well past the Singularity.
+ However, it is possible that extremely long, bulk circuits could exceed
+ 2^64 total sent or received cells, so endpoints SHOULD handle wrapped
+ sequence numbers for purposes of computing retransmit information. (But
+ even this case is unlikely to happen within the next decade or so).
+
+ Upon resumption, the LAST_SEQNO_SENT and LAST_SEQNO_RECV fields are used
+ to convey the sequence numbers of the last cell the relay sent and
+ received on that leg. The other endpoint can use these sequence numbers
+ to determine if it received the in-flight data or not, or sent more data
+ since that point, up to and including this absolute sequence number. If
+ LAST_SEQNO_SENT has not been received, the endpoint MAY transmit the
+ missing data, if it still has it buffered.
+
+ Because both endpoints get information about the other side's absolute
+ SENT sequence number, they will know exactly how many re-transmitted
+ packets to expect, should the circuit stay open. Re-transmitters should
+ not re-increment their absolute sent fields while re-transmitting.
+
+ If an endpoint does not have this missing data due to memory pressure,
+ it should destroy *both* legs, as this represents unrecoverable
+ data loss.
+
+ Otherwise, the new circuit can be re-joined, and its RTT can be compared
+ to the remaining circuit to determine if the new leg is primary or
+ secondary.
+
+ It is even possible to resume conflux circuits where both legs have been
+ collapsed using this scheme, if endpoints continue to buffer their
+ unacked package_window data for some time after this close. However, see
+ [TRAFFIC_ANALYSIS] for more details on the full scope of this issue.
+
+ If endpoints are buffering package_window data, such data should be
+ given priority to be freed in any oomkiller invocation. See [MEMORY_DOS]
+ for more oomkiller information.

3. Traffic Scheduling [SCHEDULING]

- In order to load balance the traffic between the two circuits, the original
- conflux paper used only RTT. However, with Proposal 324, we will have
- accurate information on the instantaneous available bandwidth of each
- circuit leg, as 'cwnd - package_window' (see Section 3 of Proposal 324).
-
- Some additional RTT optimizations are also useful, to improve
- responsiveness and minimize out-of-order queue sizes.
-
- We specify two traffic schedulers from the multipath literature and adapt
- them to Tor: [LOWRTT_TOR] and [BLEST_TOR]. [LOWRTT_TOR] also has three
- variants, with different trade offs.
- - However, see the [TRAFFIC_ANALYSIS] sections of this proposal for important - details on how this selection can be changed, to reduce website traffic - fingerprinting. + In order to load balance the traffic between the two circuits, the + original conflux paper used only RTT. However, with Proposal 324, we + will have accurate information on the instantaneous available bandwidth + of each circuit leg, as 'cwnd - package_window' (see Section 3 of + Proposal 324). + + Some additional RTT optimizations are also useful, to improve + responsiveness and minimize out-of-order queue sizes. + + We specify two traffic schedulers from the multipath literature and + adapt them to Tor: [LOWRTT_TOR] and [BLEST_TOR]. [LOWRTT_TOR] also has + three variants, with different trade offs. + + However, see the [TRAFFIC_ANALYSIS] sections of this proposal for + important details on how this selection can be changed, to reduce + website traffic fingerprinting. 3.1. LowRTT Scheduling [LOWRTT_TOR] - This scheduling algorithm is based on the original [CONFLUX] paper, with - ideas from [MPTCP]'s minRTT/LowRTT scheduler. - - In this algorithm, endpoints send cells on the circuit with lower RTT - (primary circuit). This continues while the congestion window on the - circuit has available room: ie whenever cwnd - package_window > 0. - - Whenever the primary circuit's congestion window becomes full, the - secondary circuit is used. We stop reading on the send window source - (edge connection) when both congestion windows become full. - - In this way, unlike original conflux, we switch to the secondary circuit - without causing congestion on the primary circuit. This improves both - load times, and overall throughput. - - This behavior matches minRTT from [MPTCP], sometimes called LowRTT. - - It may be better to stop reading on the edge connection when the primary - congestion window becomes full, rather than switch to the secondary - circuit as soon as the primary congestion window becomes full. (Ie: only - switch if the RTTs themselves change which circuit is primary). This is - what was done in the original Conflux paper. This behavior effectively - causes us to optimize for responsiveness and congestion avoidance, rather - than throughput. For evaluation, we should control this switching behavior - with a consensus parameter (see [CONSENSUS_PARAMETERS]). - - Because of potential side channel risk (see [SIDE_CHANNELS]), a third - variant of this algorithm, where the primary circuit is chosen during the - [LINKING_CIRCUITS] handshake and never changed, should also be possible - to control via consensus parameter. + This scheduling algorithm is based on the original [CONFLUX] paper, with + ideas from [MPTCP]'s minRTT/LowRTT scheduler. + + In this algorithm, endpoints send cells on the circuit with lower RTT + (primary circuit). This continues while the congestion window on the + circuit has available room: ie whenever cwnd - package_window > 0. + + Whenever the primary circuit's congestion window becomes full, the + secondary circuit is used. We stop reading on the send window source + (edge connection) when both congestion windows become full. + + In this way, unlike original conflux, we switch to the secondary circuit + without causing congestion on the primary circuit. This improves both + load times, and overall throughput. + + This behavior matches minRTT from [MPTCP], sometimes called LowRTT. 
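+
+ For comparison with the [BLEST_TOR] pseudocode below, here is a sketch
+ of this scheduler in the same style. The leg objects and the
+ sort_legs_by_rtt() helper are illustrative; the test mirrors the
+ 'cwnd - package_window > 0' condition above:
+
+   while source.has_data_to_send():
+     primary, secondary = sort_legs_by_rtt(legs)  # lower RTT first
+     if primary.cwnd - primary.package_window > 0:
+       primary.send(source.get_packet())
+     elif secondary.cwnd - secondary.package_window > 0:
+       secondary.send(source.get_packet())
+     else:
+       break  # both congestion windows full: stop reading the edge
+              # connection until a SENDME frees window space
+
+ The variants discussed below adjust this loop: the second stops reading
+ instead of moving to the secondary leg, and the third fixes the primary
+ leg at link time rather than re-sorting by RTT.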
+ + It may be better to stop reading on the edge connection when the primary + congestion window becomes full, rather than switch to the secondary + circuit as soon as the primary congestion window becomes full. (Ie: only + switch if the RTTs themselves change which circuit is primary). This is + what was done in the original Conflux paper. This behavior effectively + causes us to optimize for responsiveness and congestion avoidance, + rather than throughput. For evaluation, we should control this switching + behavior with a consensus parameter (see [CONSENSUS_PARAMETERS]). + + Because of potential side channel risk (see [SIDE_CHANNELS]), a third + variant of this algorithm, where the primary circuit is chosen during + the [LINKING_CIRCUITS] handshake and never changed, should also be + possible to control via consensus parameter. 3.2. BLEST Scheduling [BLEST_TOR] - [BLEST] attempts to predict the availability of the primary circuit, and - use this information to reorder transmitted data, to minimize head-of-line - blocking in the recipient (and thus minimize out-of-order queues there). - - BLEST_TOR uses the primary circuit until the congestion window is full. - Then, it uses the relative RTT times of the two circuits to calculate - how much data can be sent on the secondary circuit faster than if we - just waited for the primary circuit to become available. - - This is achieved by computing two variables at the sender: - - rtts = secondary.currRTT / primary.currRTT - primary_limit = primary.cwnd + (rtts-1)/2)*rtts - - Note: This (rtts-1)/2 factor represents anticipated congestion window - growth over this period.. it may be different for Tor, depending on CC alg. - - If primary_limit < secondary.cwnd - (secondary.package_window + 1), then - there is enough space on the secondary circuit to send data faster than we - could than waiting for the primary circuit. - - XXX: Note that BLEST uses total_send_window where we use secondary.cwnd in - this check. total_send_window is min(recv_win, CWND). But since Tor does - not use receive windows and intead uses stream XON/XOFF, we only use CWND. There - is some concern this may alter BLEST's buffer minimization properties, - but since receive window should only matter if the application is slower - than Tor, and XON/XOFF should cover that case, hopefully this is fine. - - Otherwise, if the primary_limit condition is not hit, cease reading - on source edge connections until SENDME acks come back. - - Here is the pseudocode for this: - - while source.has_data_to_send(): - if primary.cwnd > primary.package_window: - primary.send(source.get_packet()) - continue - - rtts = secondary.currRTT / primary.currRTT - primary_limit = (primary.cwnd + (rtts-1)/2)*rtts - - if primary_limit < secondary.cwnd - (secondary.package_window+1): - secondary.send(source.get_packet()) - else: - break # done for now, wait for an ACK to free up CWND space and restart - - Note that BLEST also has a parameter lambda that is updated whenever HoL - blocking occurs. Because it is expensive and takes significant time to - signal this over Tor, we omit this. - XXX: See [REORDER_SIGNALING] section if we want this lambda feedback. + [BLEST] attempts to predict the availability of the primary circuit, and + use this information to reorder transmitted data, to minimize + head-of-line blocking in the recipient (and thus minimize out-of-order + queues there). + + BLEST_TOR uses the primary circuit until the congestion window is full. 
+ Then, it uses the relative RTT times of the two circuits to calculate
+ how much data can be sent on the secondary circuit faster than if we
+ just waited for the primary circuit to become available.
+
+ This is achieved by computing two variables at the sender:
+
+   rtts = secondary.currRTT / primary.currRTT
+   primary_limit = (primary.cwnd + (rtts-1)/2)*rtts
+
+ Note: This (rtts-1)/2 factor represents anticipated congestion window
+ growth over this period. It may be different for Tor, depending on the
+ CC algorithm.
+
+ If primary_limit < secondary.cwnd - (secondary.package_window + 1), then
+ there is enough space on the secondary circuit to send data faster than
+ we could by waiting for the primary circuit.
+
+ XXX: Note that BLEST uses total_send_window where we use secondary.cwnd
+ in this check. total_send_window is min(recv_win, CWND). But since Tor
+ does not use receive windows and instead uses stream XON/XOFF, we only
+ use CWND. There is some concern this may alter BLEST's buffer
+ minimization properties, but since receive window should only matter if
+ the application is slower than Tor, and XON/XOFF should cover that case,
+ hopefully this is fine.
+
+ Otherwise, if the primary_limit condition is not hit, cease reading on
+ source edge connections until SENDME acks come back.
+
+ Here is the pseudocode for this:
+
+   while source.has_data_to_send():
+     if primary.cwnd > primary.package_window:
+       primary.send(source.get_packet())
+       continue
+
+     rtts = secondary.currRTT / primary.currRTT
+     primary_limit = (primary.cwnd + (rtts-1)/2)*rtts
+
+     if primary_limit < secondary.cwnd - (secondary.package_window+1):
+       secondary.send(source.get_packet())
+     else:
+       break # done for now, wait for SENDME to free up CWND and restart
+
+ Note that BLEST also has a parameter lambda that is updated whenever HoL
+ blocking occurs. Because it is expensive and takes significant time to
+ signal this over Tor, we omit this.
+
+ XXX: See [REORDER_SIGNALING] section if we want this lambda feedback.

3.3. Reorder queue signaling [REORDER_SIGNALING]

- Reordering should be fairly simple task. By following using the sequence
- number field in [SEQUENCING], endpoints can know how many cells are still
- in flight on the other leg.
-
- To reorder them properly, a buffer of out of order cells needs to be kept.
- On the Exit side, this can quickly become overwhelming considering ten of
- thousands of possible circuits can be held open leading to gigabytes of
- memory being used. There is a clear potential memory DoS vector which means
- that a tor implementation should be able to limit the size of those queues.
-
- Luckily, [BLEST_TOR] and the form of [LOWRTT_TOR] that only uses the
- primary circuit will minimize or eliminate this out-of-order buffer.
-
- XXX: The remainder of this section may be over-complicating things... We
- only need these concepts if we want to use BLEST's lambda feedback.
-
- The default for this queue size is governed by the 'cflx_reorder_client'
- and 'cflx_reorder_srv' consensus parameters (see [CONSENSUS_PARAMS]).
- 'cflx_reorder_srv' applies to Exits and onion services. Both parameters
- can be overridden by Torrc, to larger or smaller than the consensus
- parameter. (Low memory clients may want to lower it; SecureDrop onion
- services or other high-upload services may want to raise it).
-
- When the reorder queue hits this size, a RELAY_CONFLUX_XOFF is sent down
- the circuit leg that has data waiting in the queue and use of that leg must
- cease, until it drains to half of this value, at which point an
- RELAY_CONFLUX_XON is sent. Note that this is different than the stream
- XON/XOFF from Proposal 324.
-
- XXX: [BLEST] actually does not cease use of a path in this case, but
- instead uses this signal to adjust the lambda parameter, which biases
- traffic away from that leg.
+ Reordering should be a fairly simple task. By using the sequence
+ number field in [SEQUENCING], endpoints can know how many cells are
+ still in flight on the other leg.
+
+ To reorder them properly, a buffer of out-of-order cells needs to be
+ kept. On the Exit side, this can quickly become overwhelming
+ considering tens of thousands of possible circuits can be held open,
+ leading to gigabytes of memory being used. There is a clear potential
+ memory DoS vector, which means that a tor implementation should be able
+ to limit the size of those queues.
+
+ Luckily, [BLEST_TOR] and the form of [LOWRTT_TOR] that only uses the
+ primary circuit will minimize or eliminate this out-of-order buffer.
+
+ XXX: The remainder of this section may be over-complicating things... We
+ only need these concepts if we want to use BLEST's lambda feedback.
+
+ The default for this queue size is governed by the 'cflx_reorder_client'
+ and 'cflx_reorder_srv' consensus parameters (see [CONSENSUS_PARAMS]).
+ 'cflx_reorder_srv' applies to Exits and onion services. Both parameters
+ can be overridden by Torrc, to larger or smaller than the consensus
+ parameter. (Low memory clients may want to lower it; SecureDrop onion
+ services or other high-upload services may want to raise it).
+
+ When the reorder queue hits this size, a RELAY_CONFLUX_XOFF is sent down
+ the circuit leg that has data waiting in the queue, and use of that leg
+ must cease, until it drains to half of this value, at which point a
+ RELAY_CONFLUX_XON is sent. Note that this is different from the stream
+ XON/XOFF from Proposal 324.
+
+ XXX: [BLEST] actually does not cease use of a path in this case, but
+ instead uses this signal to adjust the lambda parameter, which biases
+ traffic away from that leg.

4. Security Considerations

4.1. Memory Denial of Service [MEMORY_DOS]

- Both reorder queues and retransmit buffers inherently represent a memory
- denial of service condition.
-
- For [RESUMPTION] retransmit buffers, endpoints that support this
- feature SHOULD free retransmit information as soon as they get close
- to memory pressure. This prevents resumption while data is in flight,
- but will not otherwise harm operation.
-
- For reorder buffers, adversaries can potentially impact this at any
- point, but most obviously and most severely from the client position.
-
- In particular, clients can lie about sequence numbers, sending cells
- with sequence numbers such that the next expected sequence number
- is never sent. They can do this repeatedly on many circuits, to exhaust
- memory at exits.
-
- One option is to only allow actual traffic splitting in the downstream
- direction, towards clients, and always use the primary circuit for
- everything in the upstream direction. However, the ability to support
- conflux from the client to the exit shows promise against traffic
- analysis (see [WTF_SPLIT]).
- - The other option is to use [BLEST_TOR] from clients to exits, as it has - predictable interleaved cell scheduling, and minimizes reorder queues - at exits. If the ratios prescribed by that algorithm are not followed - within some bounds, the other endpoint can close both circuits, and - free the queue memory. - - This still leaves the possibility that intermediate relays may block - a leg, allowing cells to traverse only one leg, thus still accumulating - at the reorder queue. Clients can also spoof sequence numbers similarly, - to make it appear that they are following [BLEST_TOR], without - actually sending any data on one of the legs. - - To handle either of these cases, when a relay is under memory pressure, - the circuit OOM killer SHOULD free and close circuits with the oldest - reorder queue data, first. This heuristic was shown to be best during - the [SNIPER] attack OOM killer iteration cycle. + Both reorder queues and retransmit buffers inherently represent a memory + denial of service condition. + + For [RESUMPTION] retransmit buffers, endpoints that support this feature + SHOULD free retransmit information as soon as they get close to memory + pressure. This prevents resumption while data is in flight, but will not + otherwise harm operation. + + For reorder buffers, adversaries can potentially impact this at any + point, but most obviously and most severely from the client position. + + In particular, clients can lie about sequence numbers, sending cells + with sequence numbers such that the next expected sequence number is + never sent. They can do this repeatedly on many circuits, to exhaust + memory at exits. + + One option is to only allow actual traffic splitting in the downstream + direction, towards clients, and always use the primary circuit for + everything in the upstream direction. However, the ability to support + conflux from the client to the exit shows promise against traffic + analysis (see [WTF_SPLIT]). + + The other option is to use [BLEST_TOR] from clients to exits, as it has + predictable interleaved cell scheduling, and minimizes reorder queues at + exits. If the ratios prescribed by that algorithm are not followed + within some bounds, the other endpoint can close both circuits, and free + the queue memory. + + This still leaves the possibility that intermediate relays may block a + leg, allowing cells to traverse only one leg, thus still accumulating at + the reorder queue. Clients can also spoof sequence numbers similarly, to + make it appear that they are following [BLEST_TOR], without actually + sending any data on one of the legs. + + To handle either of these cases, when a relay is under memory pressure, + the circuit OOM killer SHOULD free and close circuits with the oldest + reorder queue data, first. This heuristic was shown to be best during + the [SNIPER] attack OOM killer iteration cycle. 4.2. Side Channels [SIDE_CHANNELS] - Two potential side channels may be introduced by the use of Conflux: - 1. RTT leg-use bias by altering SENDME latency - 2. Location info leaks through the use of both leg's latencies - - For RTT and leg-use bias, Guard relays could delay legs to introduce - a pattern into the delivery of cells at the exit relay, by varying the - latency of SENDME cells (every 100th cell) to change the distribution - of traffic to send information. This attack could be performed in either - direction of traffic, to bias traffic load off of a particular Guard. 
- If an adversary controls both Guards, it could in theory send a binary - signal more easily, by alternating delays on each. - - However, this risk weighs against the potential benefits against traffic - fingerprinting, as per [WTF_SPLIT]. Additionally, even ignoring - cryptographic tagging attacks, this side channel provides significantly - lower information over time than inter-packet-delay based side channels - that are already available to Guards and routers along the path to the - Guard. - - Tor currently provides no defenses against already existing - single-circuit delay-based side channels, though both circuit padding - and [BACKLIT] are potential options it could conceivably deploy. The - [BACKLIT] paper also has an excellent review of the various methods - that have been studied for such single circuit side channels, and - the [BACKLIT] style RTT monitoring could be used to protect against - these conflux side channels as well. Circuit padding can also help - to obscure which cells are SENDMEs, since circuit padding is not - counted towards SENDME totals. - - The second class of side channel is where the Exit relay may be able - to use the two legs to further infer more information about client - location. See [LATENCY_LEAK] for more details. It is unclear at - this time how much more severe this is for two paths than just one. - - We should preserve the ability to disable conflux to and from Exit - relays, should these side channels prove more severe, or should - it prove possible to mitigate single-circuit side channels, but - not conflux side channels. - - In all cases, all of these side channels appear less severe for onion - service traffic, due to the higher path variability due to relay - selection, as well as the end-to-end nature of conflux in that case. - This indicates that our ability to enable/disable conflux for services - should be separate from Exits. + Two potential side channels may be introduced by the use of Conflux: + 1. RTT leg-use bias by altering SENDME latency + 2. Location info leaks through the use of both leg's latencies + + For RTT and leg-use bias, Guard relays could delay legs to introduce a + pattern into the delivery of cells at the exit relay, by varying the + latency of SENDME cells (every 100th cell) to change the distribution of + traffic to send information. This attack could be performed in either + direction of traffic, to bias traffic load off of a particular Guard. + If an adversary controls both Guards, it could in theory send a binary + signal more easily, by alternating delays on each. + + However, this risk weighs against the potential benefits against traffic + fingerprinting, as per [WTF_SPLIT]. Additionally, even ignoring + cryptographic tagging attacks, this side channel provides significantly + lower information over time than inter-packet-delay based side channels + that are already available to Guards and routers along the path to the + Guard. + + Tor currently provides no defenses against already existing + single-circuit delay-based side channels, though both circuit padding + and [BACKLIT] are potential options it could conceivably deploy. The + [BACKLIT] paper also has an excellent review of the various methods that + have been studied for such single circuit side channels, and the + [BACKLIT] style RTT monitoring could be used to protect against these + conflux side channels as well. Circuit padding can also help to obscure + which cells are SENDMEs, since circuit padding is not counted towards + SENDME totals. 
+ + The second class of side channel is where the Exit relay may be able to + use the two legs to further infer more information about client + location. See [LATENCY_LEAK] for more details. It is unclear at this + time how much more severe this is for two paths than just one. + + We should preserve the ability to disable conflux to and from Exit + relays, should these side channels prove more severe, or should it prove + possible to mitigate single-circuit side channels, but not conflux side + channels. + + In all cases, all of these side channels appear less severe for onion + service traffic, due to the higher path variability due to relay + selection, as well as the end-to-end nature of conflux in that case. + This indicates that our ability to enable/disable conflux for services + should be separate from Exits. 4.3. Traffic analysis [TRAFFIC_ANALYSIS] - Even though conflux shows benefits against traffic analysis in [WTF_SPLIT], - these gains may be moot if the adversary is able to perform packet counting - and timing analysis at guards to guess which specific circuits are linked. - In particular, the 3 way handshake in [LINKING_CIRCUITS] may be quite - noticeable. - - As one countermeasure, it may be possible to eliminate the third leg - (RELAY_CIRCUIT_LINKED_ACK) by computing the exit/service RTT via measuring - the time between CREATED/REND_JOINED and RELAY_CIRCUIT_LINK, but this - will introduce cross-component complexity into Tor's protocol that - could quickly become unwieldy and fragile. - - Additionally, the conflux handshake may make onion services stand out - more, regardless of the number of stages in the handshake. For this - reason, it may be more wise to simply address these issues with circuit - padding machines during circuit setup (see padding-spec.txt). - - Additional traffic analysis considerations arise when combining conflux - with padding, for purposes of mitigating traffic fingerprinting. For this, - it seems wise to treat the packet schedulers as another piece of a combined - optimization problem in tandem with optimizing padding machines, perhaps - introducing randomness or fudge factors their scheduling, as a parameterized - distribution. For details, see - https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md - - Finally, conflux may exacerbate forms of confirmation-based traffic - analysis that close circuits to determine concretely if they were in use, - since closing either leg might cause resumption to fail. TCP RST - injection can perform this attack on the side, without surveillance - capability. [RESUMPTION] with buffering of the inflight unacked - package_window data, for retransmit, is a partial mitigation, if - endpoints buffer this data for retransmission for a brief time even - if both legs close. This seems more feasible for onion services, - which are more vulnerable to this attack. However, if the adversary - controls the client, they will notice the resumption re-link, and - still obtain confirmation that way. - - It seems the only way to fully mitigate these kinds of attacks is with - the Snowflake pluggable transport, which provides its own resumption - and retransmit behavior. Additionally, Snowflake's use of UDP DTLS also - protects against TCP RST injection, which we suspect to be the main - vector for such attacks. - - In the future, a DTLS or QUIC transport for Tor such as masque could - provide similar RST injection resistance, and resumption at - Guard/Bridge nodes, as well. 
+ Even though conflux shows benefits against traffic analysis in
+ [WTF_SPLIT], these gains may be moot if the adversary is able to perform
+ packet counting and timing analysis at guards to guess which specific
+ circuits are linked. In particular, the 3-way handshake in
+ [LINKING_CIRCUITS] may be quite noticeable.
+
+ As one countermeasure, it may be possible to eliminate the third leg
+ (RELAY_CIRCUIT_LINKED_ACK) by computing the exit/service RTT via
+ measuring the time between CREATED/REND_JOINED and RELAY_CIRCUIT_LINK,
+ but this will introduce cross-component complexity into Tor's protocol
+ that could quickly become unwieldy and fragile.
+
+ Additionally, the conflux handshake may make onion services stand out
+ more, regardless of the number of stages in the handshake. For this
+ reason, it may be wiser to simply address these issues with circuit
+ padding machines during circuit setup (see padding-spec.txt).
+
+ Additional traffic analysis considerations arise when combining conflux
+ with padding, for purposes of mitigating traffic fingerprinting. For
+ this, it seems wise to treat the packet schedulers as another piece of a
+ combined optimization problem in tandem with optimizing padding
+ machines, perhaps introducing randomness or fudge factors into their
+ scheduling, as a parameterized distribution. For details, see
+ https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md
+
+ Finally, conflux may exacerbate forms of confirmation-based traffic
+ analysis that close circuits to determine concretely if they were in
+ use, since closing either leg might cause resumption to fail. TCP RST
+ injection can perform this attack on the side, without surveillance
+ capability. [RESUMPTION] with buffering of the inflight unacked
+ package_window data, for retransmit, is a partial mitigation, if
+ endpoints buffer this data for retransmission for a brief time even if
+ both legs close. This seems more feasible for onion services, which are
+ more vulnerable to this attack. However, if the adversary controls the
+ client, they will notice the resumption re-link, and still obtain
+ confirmation that way.
+
+ It seems the only way to fully mitigate these kinds of attacks is with
+ the Snowflake pluggable transport, which provides its own resumption and
+ retransmit behavior. Additionally, Snowflake's use of UDP DTLS also
+ protects against TCP RST injection, which we suspect to be the main
+ vector for such attacks.
+
+ In the future, a DTLS or QUIC transport for Tor such as masque could
+ provide similar RST injection resistance, and resumption at Guard/Bridge
+ nodes, as well.

5. System Interactions

@@ -665,7 +676,7 @@ Status: Draft
   - EWMA and KIST
   - CBT and number of guards
   - Onion service circ obfuscation
-  - Future UDP (it may increase the need for UDP to buffer before dropping)
+  - Future UDP (may increase need for UDP to buffer before dropping)
   - Padding (no sequence numbers on padding cells, as per [SEQUENCING])
     - Also, any padding machines may need re-tuning
   - No 'cannibalization' of linked circuits
@@ -697,116 +708,116 @@ Appended A [ALTERNATIVES]

A.1 BEGIN/END sequencing [ALTERNATIVE_SEQUENCING]

- In this method of signaling, we increment the sequence number by 1
- only when we switch legs, and use BEGIN/END "bookends" to know that
- all data on a leg has been received.
-
- To achieve this, we add a small sequence number to the common relay
- header for all relay cells on linked circuits, as well as a field to
- signal the beginning of a sequence, intermediate data, and the end
- of a sequence.
-
-    Relay command   [1 byte]
-    Recognized      [2 bytes]
-    StreamID        [2 bytes]
-    Digest          [4 bytes]
-    Length          [2 bytes]
-     > Switching    [2 bits]   # 01 = BEGIN, 00 = CONTINUE, 10 = END
-     > Sequencing   [6 bits]
-    Data            [PAYLOAD_LEN - 12 - Length bytes]
-
- These fields MUST be present on ALL end-to-end relay cells on each leg
- that come from the endpoint, following a RELAY_CIRCUIT_LINK command.
-
- They are absent on 'leaky pipe' RELAY_COMMAND_DROP and
- RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays,
- as opposed to the endpoint, to support padding.
-
- Sequence numbers are incremented by one when an endpoint switches legs
- to transmit a cell. This number will wrap; implementations should treat
- 0 as the next sequence after 2^6-1. Because we do not expect to support
- significantly more than 2 legs, and much fewer than 63, this is not an
- issue.
-
- The first cell on a new circuit MUST use the BEGIN code for switching.
- Cells are delivered from that circuit until an END switching signal is
- received, even if cells arrive first on another circuit with the next
- sequence number before and END switching field. Recipients MUST only
- deliver cells with a BEGIN, if their Sequencing number is one more than
- the last END.
+ In this method of signaling, we increment the sequence number by 1 only
+ when we switch legs, and use BEGIN/END "bookends" to know that all data
+ on a leg has been received.
+
+ To achieve this, we add a small sequence number to the common relay
+ header for all relay cells on linked circuits, as well as a field to
+ signal the beginning of a sequence, intermediate data, and the end of a
+ sequence.
+
+    Relay command   [1 byte]
+    Recognized      [2 bytes]
+    StreamID        [2 bytes]
+    Digest          [4 bytes]
+    Length          [2 bytes]
+     > Switching    [2 bits]   # 01 = BEGIN, 00 = CONTINUE, 10 = END
+     > Sequencing   [6 bits]
+    Data            [PAYLOAD_LEN - 12 - Length bytes]
+
+ These fields MUST be present on ALL end-to-end relay cells on each leg
+ that come from the endpoint, following a RELAY_CIRCUIT_LINK command.
+
+ They are absent on 'leaky pipe' RELAY_COMMAND_DROP and
+ RELAY_COMMAND_PADDING_NEGOTIATED cells that come from middle relays, as
+ opposed to the endpoint, to support padding.
+
+ Sequence numbers are incremented by one when an endpoint switches legs
+ to transmit a cell. This number will wrap; implementations should treat
+ 0 as the next sequence after 2^6-1. Because we do not expect to support
+ significantly more than 2 legs, and far fewer than 63, this is not an
+ issue.
+
+ The first cell on a new circuit MUST use the BEGIN code for switching.
+ Cells are delivered from that circuit until an END switching signal is
+ received, even if cells arrive first on another circuit with the next
+ sequence number before an END switching field. Recipients MUST only
+ deliver cells with a BEGIN if their Sequencing number is one more than
+ the last END.
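+
+ As a non-normative sketch (the byte packing and names below are
+ assumptions for illustration, not normative), the Switching/Sequencing
+ field and the delivery rule above could look roughly as follows:
+
+   # Illustrative sketch only: 2-bit Switching code plus 6-bit
+   # Sequencing counter, and in-order delivery across legs.
+   BEGIN, CONTINUE, END = 0b01, 0b00, 0b10
+
+   def pack_switching(code, seq):
+       return (code << 6) | (seq & 0x3f)
+
+   def unpack_switching(byte):
+       return byte >> 6, byte & 0x3f
+
+   class Reassembler:
+       def __init__(self):
+           self.current_leg = None   # leg currently being delivered from
+           self.last_end_seq = None  # Sequencing value of the last END
+           self.held = {}            # leg -> queued (code, seq, data)
+
+       def on_cell(self, leg, code, seq, data):
+           self.held.setdefault(leg, []).append((code, seq, data))
+           return self._drain()
+
+       def _drain(self):
+           delivered, progress = [], True
+           while progress:
+               progress = False
+               if self.current_leg is None:
+                   # Accept only a BEGIN whose sequence follows the last
+                   # END (any BEGIN is accepted before the first END,
+                   # since the initial sequence value is assumed here).
+                   for leg, cells in self.held.items():
+                       if not cells or cells[0][0] != BEGIN:
+                           continue
+                       if (self.last_end_seq is None or
+                               cells[0][1] == (self.last_end_seq + 1) & 0x3f):
+                           self.current_leg = leg
+                           progress = True
+                           break
+               elif self.held.get(self.current_leg):
+                   code, seq, data = self.held[self.current_leg].pop(0)
+                   delivered.append(data)
+                   if code == END:
+                       self.last_end_seq = seq
+                       self.current_leg = None
+                   progress = True
+           return delivered
+
+ In such a sketch, the per-leg held queues are the out-of-order buffers
+ that a real implementation would need to bound.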

A.2 Alternative Link Handshake [ALTERNATIVE_LINKING]

- The circuit linking in [LINKING_CIRCUITS] could be done as encrypted
- ntor onionskin extension fields, similar to those used by v3 onions.
-
- This approach has at least four problems:
-   i). For onion services, since the onionskins traverse the intro circuit
-        and then return on the rend circuit, this handshake cannot measure
-        RTT there.
-   ii). Since these onionskins are larger, and have no PFS, an adversary
-        at the middle relay knows that the onionskin is for linking, and
-        can potentially try to obtain the onionskin key for attacks on
-        the link.
-   iii). It makes linking circuits more fragile, since they could timeout
-        due to CBT, or other issues during construction.
-   iv). The overhead in processing this onionskin through onionskin queues
-        adds additional time for linking, even in the Exit case, making
-        that RTT potentially noisy.
-
- Additionally, it is not clear that this approach actually saves us
- anything in terms of setup time, because we can optimize away the
- linking phase using Proposal 325, to combine initial RELAY_BEGIN cells
- with RELAY_CIRCUIT_LINK.
+ The circuit linking in [LINKING_CIRCUITS] could be done as encrypted
+ ntor onionskin extension fields, similar to those used by v3 onions.
+
+ This approach has at least four problems:
+   i). For onion services, since onionskins traverse the intro circuit
+        and return on the rend circuit, this handshake cannot measure
+        RTT there.
+   ii). Since these onionskins are larger, and have no PFS, an adversary
+        at the middle relay knows that the onionskin is for linking, and
+        can potentially try to obtain the onionskin key for attacks on
+        the link.
+   iii). It makes linking circuits more fragile, since they could time out
+        due to CBT, or other issues during construction.
+   iv). The overhead in processing this onionskin in onionskin queues
+        adds additional time for linking, even in the Exit case, making
+        that RTT potentially noisy.
+
+ Additionally, it is not clear that this approach actually saves us
+ anything in terms of setup time, because we can optimize away the
+ linking phase using Proposal 325, to combine initial RELAY_BEGIN cells
+ with RELAY_CIRCUIT_LINK.

A.3. Alternative RTT measurement [ALTERNATIVE_RTT]

- Instead of measuring RTTs during [LINKING_CIRCUITS], we could create
- PING/PONG cells, whose sole purpose is to allow endpoints to measure
- RTT.
-
- This was rejected for several reasons. First, during circuit use, we
- already have SENDMEs to measure RTT. Every 100 cells (or
- 'circwindow_inc' from Proposal 324), we are able to re-measure RTT based
- on the time between that Nth cell and the SENDME ack. So we only need
- PING/PONG to measure initial circuit RTT.
-
- If we were able to use onionskins, as per [ALTERNATIVE_LINKING] above,
- we might be able to specify a PING/PONG/PING handshake solely for
- measuring initial RTT, especially for onion service circuits.
-
- The reason for not making a dedicated PING/PONG for this purpose is that
- it is context-free. Even if we were able to use onionskins for linking
- and resumption, to avoid additional data in handshake that just measures
- RTT, we would have to enforce that this PING/PONG/PING only follows the
- exact form needed by this proposal, at the expected time, and at no
- other points.
-
- If we do not enforce this specific use of PING/PONG/PING, it becomes
- another potential side channel, for use in attacks such as [DROPMARK].
-
- In general, Tor is planning to remove current forms of context-free and
- semantic-free cells from its protocol:
- https://gitlab.torproject.org/tpo/core/torspec/-/issues/39
-
- We should not add more.
+ Instead of measuring RTTs during [LINKING_CIRCUITS], we could create
+ PING/PONG cells, whose sole purpose is to allow endpoints to measure
+ RTT.
+
+ This was rejected for several reasons. First, during circuit use, we
+ already have SENDMEs to measure RTT. Every 100 cells (or
+ 'circwindow_inc' from Proposal 324), we are able to re-measure RTT based
+ on the time between that Nth cell and the SENDME ack. So we only need
+ PING/PONG to measure initial circuit RTT.
+
+ If we were able to use onionskins, as per [ALTERNATIVE_LINKING] above,
+ we might be able to specify a PING/PONG/PING handshake solely for
+ measuring initial RTT, especially for onion service circuits.
+
+ The reason for not making a dedicated PING/PONG for this purpose is that
+ it is context-free. Even if we were able to use onionskins for linking
+ and resumption, to avoid additional data in the handshake that just
+ measures RTT, we would have to enforce that this PING/PONG/PING only
+ follows the exact form needed by this proposal, at the expected time,
+ and at no other points.
+
+ If we do not enforce this specific use of PING/PONG/PING, it becomes
+ another potential side channel, for use in attacks such as [DROPMARK].
+
+ In general, Tor is planning to remove current forms of context-free and
+ semantic-free cells from its protocol:
+ https://gitlab.torproject.org/tpo/core/torspec/-/issues/39
+
+ We should not add more.

Appendix B: Acknowledgments

- Thanks to Per Hurtig for helping us with the framing of the MPTCP
- problem space.
-
- Thanks to Simone Ferlin for clarifications on the [BLEST]
- paper, and for pointing us at the Linux kernel implementation.
-
- Extreme thanks goes again to Toke Høiland-Jørgensen, who helped
- immensely towards our understanding of how the BLEST condition relates
- to edge connection pushback, and for clearing up many other
- misconceptions we had.
-
- Finally, thanks to Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian
- Goldberg, for the original [CONFLUX] paper!
+ Thanks to Per Hurtig for helping us with the framing of the MPTCP
+ problem space.
+
+ Thanks to Simone Ferlin for clarifications on the [BLEST] paper, and for
+ pointing us at the Linux kernel implementation.
+
+ Extreme thanks goes again to Toke Høiland-Jørgensen, who helped
+ immensely towards our understanding of how the BLEST condition relates
+ to edge connection pushback, and for clearing up many other
+ misconceptions we had.
+
+ Finally, thanks to Mashael AlSabah, Kevin Bauer, Tariq Elahi, and Ian
+ Goldberg, for the original [CONFLUX] paper!

References:

-- cgit v1.2.3-54-g00ecf