summaryrefslogtreecommitdiff
path: root/doc/spec/proposals/151-path-selection-improvements.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/spec/proposals/151-path-selection-improvements.txt')
-rw-r--r--doc/spec/proposals/151-path-selection-improvements.txt148
1 files changed, 0 insertions, 148 deletions
diff --git a/doc/spec/proposals/151-path-selection-improvements.txt b/doc/spec/proposals/151-path-selection-improvements.txt
deleted file mode 100644
index af89f21193..0000000000
--- a/doc/spec/proposals/151-path-selection-improvements.txt
+++ /dev/null
@@ -1,148 +0,0 @@
-Filename: 151-path-selection-improvements.txt
-Title: Improving Tor Path Selection
-Author: Fallon Chen, Mike Perry
-Created: 5-Jul-2008
-Status: Finished
-In-Spec: path-spec.txt
-
-Overview
-
- The performance of paths selected can be improved by adjusting the
- CircuitBuildTimeout and avoiding failing guard nodes. This proposal
- describes a method of tracking buildtime statistics at the client, and
- using those statistics to adjust the CircuitBuildTimeout.
-
-Motivation
-
- Tor's performance can be improved by excluding those circuits that
- have long buildtimes (and by extension, high latency). For those Tor
- users who require better performance and have lower requirements for
- anonymity, this would be a very useful option to have.
-
-Implementation
-
- Gathering Build Times
-
- Circuit build times are stored in the circular array
- 'circuit_build_times' consisting of uint32_t elements as milliseconds.
- The total size of this array is based on the number of circuits
- it takes to converge on a good fit of the long term distribution of
- the circuit builds for a fixed link. We do not want this value to be
- too large, because it will make it difficult for clients to adapt to
- moving between different links.
-
- From our observations, the minimum value for a reasonable fit appears
- to be on the order of 500 (MIN_CIRCUITS_TO_OBSERVE). However, to keep
- a good fit over the long term, we store 5000 most recent circuits in
- the array (NCIRCUITS_TO_OBSERVE).
-
- The Tor client will build test circuits at a rate of one per
- minute (BUILD_TIMES_TEST_FREQUENCY) up to the point of
- MIN_CIRCUITS_TO_OBSERVE. This allows a fresh Tor to have
- a CircuitBuildTimeout estimated within 8 hours after install,
- upgrade, or network change (see below).
-
- Long Term Storage
-
- The long-term storage representation is implemented by storing a
- histogram with BUILDTIME_BIN_WIDTH millisecond buckets (default 50) when
- writing out the statistics to disk. The format this takes in the
- state file is 'CircuitBuildTime <bin-ms> <count>', with the total
- specified as 'TotalBuildTimes <total>'
- Example:
-
- TotalBuildTimes 100
- CircuitBuildTimeBin 25 50
- CircuitBuildTimeBin 75 25
- CircuitBuildTimeBin 125 13
- ...
-
- Reading the histogram in will entail inserting <count> values
- into the circuit_build_times array each with the value of
- <bin-ms> milliseconds. In order to evenly distribute the values
- in the circular array, the Fisher-Yates shuffle will be performed
- after reading values from the bins.
-
- Learning the CircuitBuildTimeout
-
- Based on studies of build times, we found that the distribution of
- circuit buildtimes appears to be a Frechet distribution. However,
- estimators and quantile functions of the Frechet distribution are
- difficult to work with and slow to converge. So instead, since we
- are only interested in the accuracy of the tail, we approximate
- the tail of the distribution with a Pareto curve starting at
- the mode of the circuit build time sample set.
-
- We will calculate the parameters for a Pareto distribution
- fitting the data using the estimators at
- http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation.
-
- The timeout itself is calculated by using the Quartile function (the
- inverted CDF) to give us the value on the CDF such that
- BUILDTIME_PERCENT_CUTOFF (80%) of the mass of the distribution is
- below the timeout value.
-
- Thus, we expect that the Tor client will accept the fastest 80% of
- the total number of paths on the network.
-
- Detecting Changing Network Conditions
-
- We attempt to detect both network connectivity loss and drastic
- changes in the timeout characteristics.
-
- We assume that we've had network connectivity loss if 3 circuits
- timeout and we've received no cells or TLS handshakes since those
- circuits began. We then set the timeout to 60 seconds and stop
- counting timeouts.
-
- If 3 more circuits timeout and the network still has not been
- live within this new 60 second timeout window, we then discard
- the previous timeouts during this period from our history.
-
- To detect changing network conditions, we keep a history of
- the timeout or non-timeout status of the past RECENT_CIRCUITS (20)
- that successfully completed at least one hop. If more than 75%
- of these circuits timeout, we discard all buildtimes history,
- reset the timeout to 60, and then begin recomputing the timeout.
-
- Testing
-
- After circuit build times, storage, and learning are implemented,
- the resulting histogram should be checked for consistency by
- verifying it persists across successive Tor invocations where
- no circuits are built. In addition, we can also use the existing
- buildtime scripts to record build times, and verify that the histogram
- the python produces matches that which is output to the state file in Tor,
- and verify that the Pareto parameters and cutoff points also match.
-
- We will also verify that there are no unexpected large deviations from
- node selection, such as nodes from distant geographical locations being
- completely excluded.
-
- Dealing with Timeouts
-
- Timeouts should be counted as the expectation of the region of
- of the Pareto distribution beyond the cutoff. This is done by
- generating a random sample for each timeout at points on the
- curve beyond the current timeout cutoff.
-
- Future Work
-
- At some point, it may be desirable to change the cutoff from a
- single hard cutoff that destroys the circuit to a soft cutoff and
- a hard cutoff, where the soft cutoff merely triggers the building
- of a new circuit, and the hard cutoff triggers destruction of the
- circuit.
-
- It may also be beneficial to learn separate timeouts for each
- guard node, as they will have slightly different distributions.
- This will take longer to generate initial values though.
-
-Issues
-
- Impact on anonymity
-
- Since this follows a Pareto distribution, large reductions on the
- timeout can be achieved without cutting off a great number of the
- total paths. This will eliminate a great deal of the performance
- variation of Tor usage.