spec/path-spec/learning-timeouts.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305

<a id="path-spec.txt-2.4"></a>

# Learning when to give up ("timeout") on circuit construction {#timing-out}

Since version 0.2.2.8-alpha, Tor clients attempt to learn when to give
up on circuits based on network conditions.

<a id="path-spec.txt-2.4.1"></a>

## Distribution choice

Based on studies of build times, we found that the distribution of
circuit build times appears to be a Frechet distribution (and a multi-modal
Frechet distribution, if more than one guard or bridge is used). However,
estimators and quantile functions of the Frechet distribution are difficult
to work with and slow to converge. So instead, since we are only interested
in the accuracy of the tail, clients approximate the tail of the multi-modal
distribution with a single Pareto curve.

<a id="path-spec.txt-2.4.2"></a>

## How much data to record {#observations}

From our observations, the minimum number of circuit build times for a
reasonable fit appears to be on the order of 100. However, to keep a
good fit over the long term, clients store 1000 most recent circuit build
times in a circular array.

These build times only include the times required to build three-hop
circuits, and the times required to build the first three hops of circuits
with more than three hops.  Circuits of fewer than three hops are not
recorded, and hops past the third are not recorded.

The Tor client should build test circuits at a rate of one every 'cbttestfreq'
(10 seconds) until 'cbtmincircs' (100 circuits) are built, with a maximum of
'cbtmaxopencircs' (default: 10) circuits open at once. This allows a fresh
Tor to have a CircuitBuildTimeout estimated within 30 minutes after install
or network change
(see [Detecting Changing Network Conditions](#changes-in-network) below.)

Timeouts are stored on disk in a histogram of 10ms bin width, the same
width used to calculate the Xm value above. The timeouts recorded in the
histogram must be shuffled after being read from disk, to preserve a
proper expiration of old values after restart.

Thus, some build time resolution is lost during restart. Implementations may
choose a different persistence mechanism than this histogram, but be aware
that build time binning is still needed for parameter estimation.

<a id="path-spec.txt-2.4.3"></a>

## Parameter estimation

Once 'cbtmincircs' build times are recorded, Tor clients update the
distribution parameters and recompute the timeout every circuit completion
(though
[see below](#changes-in-network)
for when to pause and reset timeout due to
too many circuits timing out).

Tor clients calculate the parameters for a Pareto distribution fitting the
data using the maximum likelihood estimator. For derivation, see:
<https://en.wikipedia.org/wiki/Pareto_distribution#Estimation_of_parameters>

Because build times are not a true Pareto distribution, we alter how Xm is
computed. In a max likelihood estimator, the mode of the distribution is
used directly as Xm.

Instead of using the mode of discrete build times directly, Tor clients
compute the Xm parameter using the weighted average of the midpoints
of the 'cbtnummodes' (10) most frequently occurring 10ms histogram bins.
Ties are broken in favor of earlier bins (that is, in favor of bins
corresponding to shorter build times).

(The use of 10 modes was found to minimize error from the selected
cbtquantile, with 10ms bins for quantiles 60-80, compared to many other
heuristics).

To avoid ln(1.0+epsilon) precision issues, use log laws to rewrite the
estimator for 'alpha' as the sum of logs followed by subtraction, rather
than multiplication and division:

`alpha = n/(Sum_n{ln(MAX(Xm, x_i))} - n\*ln(Xm))`

In this, n is the total number of build times that have completed, x_i is
the ith recorded build time, and Xm is the modes of x_i as above.

All times below Xm are counted as having the Xm value via the MAX(),
because in Pareto estimators, Xm is supposed to be the lowest value.
However, since clients use mode averaging to estimate Xm, there can be
values below our Xm. Effectively, the Pareto estimator then treats that
everything smaller than Xm happened at Xm. One can also see that if
clients did not do this, alpha could underflow to become negative, which
results in an exponential curve, not a Pareto probability distribution.

The timeout itself is calculated by using the Pareto Quantile function (the
inverted CDF) to give us the value on the CDF such that 80% of the mass
of the distribution is below the timeout value (parameter 'cbtquantile').

The Pareto Quantile Function (inverse CDF) is:

`F(q) = Xm/((1.0-q)^(1.0/alpha))`

Thus, clients obtain the circuit build timeout for 3-hop circuits by
computing:

`timeout_ms = F(0.8)    # 'cbtquantile' == 0.8`

With this, we expect that the Tor client will accept the fastest 80% of the
total number of paths on the network.

Clients obtain the circuit close time to completely abandon circuits as:

`close_ms = F(0.99)     # 'cbtclosequantile' == 0.99`

To avoid waiting an unreasonably long period of time for circuits that
simply have relays that are down, Tor clients cap timeout_ms at the max
build time actually observed so far, and cap close_ms at twice this max,
but at least 60 seconds:

```text
     timeout_ms = MIN(timeout_ms, max_observed_timeout)
     close_ms = MAX(MIN(close_ms, 2*max_observed_timeout), 'cbtinitialtimeout')
```

<a id="path-spec.txt-2.4.3"></a>

## Calculating timeouts thresholds for circuits of different lengths {#different-lengths}

The timeout_ms and close_ms estimates above are good only for 3-hop
circuits, since only 3-hop circuits are recorded in the list of build
times.

To calculate the appropriate timeouts and close timeouts for circuits of
other lengths, the client multiples the timeout_ms and close_ms values
by a scaling factor determined by the number of communication hops
needed to build their circuits:

```text
timeout_ms\[hops=n\] = timeout_ms * Actions(N) / Actions(3)

close_ms\[hops=n\] = close_ms * Actions(N) / Actions(3)
```

where `Actions(N) = N * (N + 1) / 2.`

To calculate timeouts for operations other than circuit building,
the client should add X to Actions(N) for every round-trip communication
required with the Xth hop.

<a id="path-spec.txt-2.4.4"></a>

## How to record timeouts {#recording-timeouts}

Pareto estimators begin to lose their accuracy if the tail is omitted.
Hence, Tor clients actually calculate two timeouts: a usage timeout, and a
close timeout.

Circuits that pass the usage timeout are marked as measurement circuits,
and are allowed to continue to build until the close timeout corresponding
to the point 'cbtclosequantile' (default 99) on the Pareto curve, or 60
seconds, whichever is greater.

The actual completion times for these measurement circuits should be
recorded.

Implementations should completely abandon a circuit and ignore the circuit
if the total build time exceeds the close threshold. Such closed circuits
should be ignored, as this typically means one of the relays in the path is
offline.

<a id="path-spec.txt-2.4.5"></a>

## Detecting Changing Network Conditions {#changes-in-network}

Tor clients attempt to detect both network connectivity loss and drastic
changes in the timeout characteristics.

To detect changing network conditions, clients keep a history of
the timeout or non-timeout status of the past 'cbtrecentcount' circuits
(20 circuits) that successfully completed at least one hop. If more than
90% of these circuits timeout, the client discards all buildtimes history,
resets the timeout to 'cbtinitialtimeout' (60 seconds), and then begins
recomputing the timeout.

If the timeout was already at least `cbtinitialtimeout`,
the client doubles the timeout.

The records here (of how many circuits succeeded or failed among the most
recent 'cbrrecentcount') are not stored as persistent state.  On reload,
we start with a new, empty state.

<a id="path-spec.txt-2.4.6"></a>

## Consensus parameters governing behavior {#parameters}

Clients that implement circuit build timeout learning should obey the
following consensus parameters that govern behavior, in order to allow
us to handle bugs or other emergent behaviors due to client circuit
construction. If these parameters are not present in the consensus,
the listed default values should be used instead.

```text
      cbtdisabled
        Default: 0
        Min: 0
        Max: 1
        Effect: If 1, all CircuitBuildTime learning code should be
                disabled and history should be discarded. For use in
                emergency situations only.

      cbtnummodes
        Default: 10
        Min: 1
        Max: 20
        Effect: This value governs how many modes to use in the weighted
        average calculation of Pareto parameter Xm. Selecting Xm as the
        average of multiple modes improves accuracy of the Pareto tail
        for quantile cutoffs from 60-80% (see cbtquantile).

      cbtrecentcount
        Default: 20
        Min: 3
        Max: 1000
        Effect: This is the number of circuit build outcomes (success vs
                timeout) to keep track of for the following option.

      cbtmaxtimeouts
        Default: 18
        Min: 3
        Max: 10000
        Effect: When this many timeouts happen in the last 'cbtrecentcount'
                circuit attempts, the client should discard all of its
                history and begin learning a fresh timeout value.

                Note that if this parameter's value is greater than the value
                of 'cbtrecentcount', then the history will never be
                discarded because of this feature.

      cbtmincircs
        Default: 100
        Min: 1
        Max: 10000
        Effect: This is the minimum number of circuits to build before
                computing a timeout.

                Note that if this parameter's value is higher than 1000 (the
                number of time observations that a client keeps in its
                circular buffer), circuit build timeout calculation is
                effectively disabled, and the default timeouts are used
                indefinitely.

      cbtquantile
        Default: 80
        Min: 10
        Max: 99
        Effect: This is the position on the quantile curve to use to set the
                timeout value. It is a percent (10-99).

      cbtclosequantile
        Default: 99
        Min: Value of cbtquantile parameter
        Max: 99
        Effect: This is the position on the quantile curve to use to set the
                timeout value to use to actually close circuits. It is a
                percent (0-99).

      cbttestfreq
        Default: 10
        Min: 1
        Max: 2147483647 (INT32_MAX)
        Effect: Describes how often in seconds to build a test circuit to
                gather timeout values. Only applies if less than 'cbtmincircs'
                have been recorded.

      cbtmintimeout
        Default: 10
        Min: 10
        Max: 2147483647 (INT32_MAX)
        Effect: This is the minimum allowed timeout value in milliseconds.

      cbtinitialtimeout
        Default: 60000
        Min: Value of cbtmintimeout
        Max: 2147483647 (INT32_MAX)
        Effect: This is the timeout value to use before we have enough data
                to compute a timeout, in milliseconds.  If we do not have
                enough data to compute a timeout estimate (see cbtmincircs),
                then we use this interval both for the close timeout and the
                abandon timeout.

      cbtlearntimeout
        Default: 180
        Min: 10
        Max: 60000
        Effect: This is how long idle circuits will be kept open while cbt is
                learning a new timeout value.

      cbtmaxopencircs
        Default: 10
        Min: 0
        Max: 14
        Effect: This is the maximum number of circuits that can be open at
                at the same time during the circuit build time learning phase.
```