Diffstat (limited to 'proposals/210-faster-headless-consensus-bootstrap.txt')
-rw-r--r-- | proposals/210-faster-headless-consensus-bootstrap.txt | 82 |
1 files changed, 54 insertions, 28 deletions
diff --git a/proposals/210-faster-headless-consensus-bootstrap.txt b/proposals/210-faster-headless-consensus-bootstrap.txt
index 42726e5..79770d8 100644
--- a/proposals/210-faster-headless-consensus-bootstrap.txt
+++ b/proposals/210-faster-headless-consensus-bootstrap.txt
@@ -21,30 +21,53 @@ Design: Bootstrap Process Changes
   the first connection that completes. Connection attempts will be
   performed on an exponential backoff basis.

-  Initially, connections will be performed to randomly chosen hard
-  coded directory mirrors. If none of these connections complete within
-  5 seconds, connections will also be performed to randomly chosen
-  canonical directory authorities.
+  Initially, connections will be performed to a randomly chosen hard
+  coded directory mirror and a randomly chosen canonical directory
+  authority. If neither of these connections complete, additional mirror
+  and authority connections are tried. Mirror connections are tried at
+  a faster rate than authority connections.

   We specify that mirror connections retry after half a second, and then
   double the retry time with every connection:
-    0, 0.5, 1, 2, 4, 8, 16, ...
+    0, 1, 2, 4, 8, 16, 32, ...

-  We specify that directory authority connections start after a 5 second
-  delay, and retry after 5 seconds, doubling the retry time with every
-  connection:
-    5, 10, 20, ...
+  We specify that directory authority connections retry after 5 seconds,
+  and then double the retry time with every connection:
+    0, 10, 20, ...
+
+  If the client has both an IPv4 and IPv6 address, we try IPv4 and IPv6
+  mirrors and authorities on the following schedule:
+    IPv4, IPv6, IPv4, IPv6, ...
+
+  We try IPv4 first to avoid overloading IPv6-enabled authorities and
+  mirrors. Mirrors and auths get a separate IPv4/IPv6 schedule. This
+  ensures that we try an IPv6 authority within the first 10 seconds.
+  This helps implement #8374 and related tickets.
+
+  The maximum retry time for both timers is 3 days + 1 hour. This places a
+  small load on the mirrors and authorities, while allowing a client that
+  regains a network connection to eventually download a consensus.
+
+  The retry timers must reset on HUP and any network reachability events,
+  [ TODO: do we have network reachability events? ]
+  so that clients that have unreliable networks can recover from network
+  failures.

   The first connection to complete will be used to download the consensus
   document and the others will be closed, after which bootstrapping will
   proceed as normal.

+  A benefit of connecting to directory authorities is that clients are
+  warned if their clock is wrong. Therefore, when closing a directory
+  authority connection, we check to see if we have successfully connected
+  to an authority during this run of the Tor client. If not, we allow the
+  authority TLS connection to complete, then close the connection.
+
   We expect the vast majority of clients to succeed within 4 seconds,
-  after making up to 5 connection attempts to mirrors. Clients which can't
-  connect in the first 5 seconds, will then try to contact a directory
-  authority. We expect almost all clients to succeed within 10 seconds,
-  after up to 6 connection attempts to mirrors and up to 2 connection
-  attempts to authorities. This is a much better success rate than the
+  after making up to 4 connection attempts to mirrors. Clients which can't
+  connect in the first 10 seconds, will try 1 more mirror, then try to
+  contact another directory authority. We expect almost all clients to
+  succeed within 10 seconds. This is a much better success rate than the
   current Tor implementation, which fails k/n of clients if k of the n
   directory authorities are down. (Or, if the connection fails in certain
   ways, (k/n)^2.)
@@ -60,7 +83,11 @@ Design: Fallback Dir Mirror Selection
   the 100 Guard nodes with the longest uptime.

   The fallback weights will be set using each mirror's fraction of
-  consensus bandwidth out of the total of all 100 mirrors.
+  consensus bandwidth out of the total of all 100 mirrors, adjusted to
+  ensure no fallback directory sees more than 10% of clients. We will
+  also exclude fallback directories that are less than 1/1000 of the
+  consensus weight, as they are not large enough to make it worthwhile
+  including them.

   This list of fallback dir mirrors should be updated with every
   major Tor release. In future releases, the number of dir mirrors
@@ -84,7 +111,7 @@ Performance: Additional Load with Current Parameter Choices
   The dangerous case is in the event of a prolonged consensus failure
   that induces all clients to enter into the bootstrap process. In this
   case, the number of TLS connections to the fallback dir mirrors within
-  the first second would be 3*C/100, or 60,000 for C=2,000,000 users. If
+  the first second would be 2*C/100, or 40,000 for C=2,000,000 users. If
   no connections complete before the 10 retries, 7 of which go to
   mirrors, this could reach as high as 140,000 connection attempts, but
   this is extremely unlikely to happen in full aggregate.
@@ -111,7 +138,7 @@ Implementation Notes: Code Modifications

   There appear to be a few options for altering this code to retry multiple
   simultaneous connections. Without refactoring, one approach would be to
-  set mirror and authority retry helper function timers in
+  set a connection retry helper function timer in
   directory_initiate_command_routerstatus() from
   directory_get_from_dirserver() if the purpose is
   DIR_PURPOSE_FETCH_CONSENSUS and the only directory servers available
@@ -130,7 +157,7 @@ Implementation Notes: Code Modifications
   altered to examine the list of pending dircons, determine if this one is
   the first to complete, and if so, then call directory_send_command() to
   download the consensus and close the other pending dircons.
-  connection_dir_finished_connecting() would also cancel both timers.
+  connection_dir_finished_connecting() would also cancel the timer.

 Reliability Analysis

@@ -140,22 +167,21 @@ Reliability Analysis
   uptime.)

   We expect the first 10 connection retry times to be:
-    Mirror:  0s      0.5s    1s      2s      4s              8s              16s
-    Auth:                                            5s              10s             20s
-    Success: 50%     75%     87%     94%     97%     99.4%   99.7%   99.94%  99.97%  99.99%
-
-  97% of clients succeed while only using directory mirrors.
-  2.4% of clients succeed on their first auth connection.
-  0.24% of clients succeed after one more mirror and auth connection.
-  0.05% of clients succeed after two more mirror and auth connections.
-  0.01% of clients remain, but in this scenario, 3 authorities are down,
+    Mirror:  0s      1s      2s      4s      8s              16s             32s
+    Auth:    0s                                      10s             20s
+    Success: 90%     95%     97%     98.7%   99.4%   99.89%  99.94%  99.988% 99.994%
+
+  97% of clients succeed in the first 2 seconds.
+  99.4% of clients succeed without trying a second authority.
+  99.89% of clients succeed in the first 10 seconds.
+  0.11% of clients remain, but in this scenario, 2 authorities are down,
   so the client is most likely blocked from the Tor network.

   The current implementation makes 1 or 2 authority connections within the
   first second, depending on exactly how the first connection fails. Under
   the 20% authority failure assumption, these clients would have a success
   rate of either 80% or 96% within a few seconds. The scheme above has a
-  similar success rate in the first few seconds, while spreading the load
+  greater success rate in the first few seconds, while spreading the load
   among a larger number of directory mirrors. In addition, if all the
   authorities are blocked, current clients will inevitably fail, as they do
   not have a list of directory mirrors.
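
The success figures in the revised Reliability Analysis table can be
approximated with a short calculation. The Python sketch below is
illustrative only and is not part of the proposal. It takes the 20%
authority failure rate from the text; the 50% mirror failure rate and the
independence of connection attempts are assumptions inferred from the
table's first column. The function names attempt_times() and
cumulative_success() are hypothetical, and the 3 days + 1 hour retry cap
is not modelled.

```python
def attempt_times(first_retry, count):
    """Return `count` attempt times: 0, first_retry, then doubling each time."""
    return [0] + [first_retry * 2 ** i for i in range(count - 1)]


def cumulative_success(mirror_fail=0.5, auth_fail=0.2):
    """Cumulative success probability after each of the first 10 attempts.

    mirror_fail is an assumed value (not stated in the diff); auth_fail is
    the 20% authority failure assumption. Attempts are treated as independent.
    """
    events = sorted(
        [(t, "mirror") for t in attempt_times(1, 7)]    # 0, 1, 2, 4, 8, 16, 32
        + [(t, "auth") for t in attempt_times(10, 3)]   # 0, 10, 20
    )
    p_all_failed = 1.0
    rows = []
    for t, kind in events:
        p_all_failed *= mirror_fail if kind == "mirror" else auth_fail
        rows.append((t, kind, 1.0 - p_all_failed))
    return rows


if __name__ == "__main__":
    for t, kind, p in cumulative_success():
        print(f"{t:>3}s  {kind:<6}  {100 * p:.4f}%")
```

Under these assumptions, the mirror and authority attempts at 0s together
give the table's first column, and each later row corresponds to one
further column. Small differences from the table's rounded percentages are
to be expected.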