Filename: 238-hs-relay-stats

Title: Better hidden service stats from Tor relays
Author: George Kadianakis, David Goulet, Karsten Loesing
Created: 2014-11-17
Status: Incomplete

0. Motivation

   Hidden services are one of the least understood parts of the Tor
   network. We don't really know how many hidden services there are
   or how much they are used.

   This proposal suggests that Tor relays include some hidden-service
   related stats in their extra-info descriptors. No stats are
   collected from Tor hidden services or clients.

   While uncertainty might be a good thing in a hidden network,
   learning more information about the usage of hidden services can be
   helpful.

   For example, learning how many cells are sent for hidden service
   purposes tells us whether hidden service traffic is 2% of the Tor
   network traffic or 90% of the Tor network traffic. This info can
   also help us during load balancing, for example if we change the
   path building of hidden services to mitigate guard discovery
   attacks [XXX].
   # XXX Is "HS purposes" only RP traffic? Or also IP traffic?

   Also, learning the number of hidden services can help us
   understand how widespread hidden services are. It will also help us
   understand approximately how much load is put on the network by
   hidden service logistics, like introduction point circuits.

1. Design

   Tor relays will add some fields related to hidden service
   statistics in their extra-info descriptors.

   Tor relays collect these statistics by keeping track of their
   hidden service directory or rendezvous point activities, slightly
   obfuscating the numbers and posting them to the directory
   authorities. Extra-info descriptors are posted to directory
   authorities every 24 hours.

2. Implementation

2.0. Hidden service statistics interval

   We want relays to report hidden-service statistics over a long-enough
   time period to not put users at risk.  Similar to other statistics, we
   suggest a 24-hour statistics interval.  All related statistics are
   collected at the end of that interval and included in the next
   extra-info descriptors published by the relay.

   Tor relays will add the following line to their extra-info descriptor:

    "hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
        [At most once.]

        YYYY-MM-DD HH:MM:SS defines the end of the included measurement
        interval of length NSEC seconds (86400 seconds by default).

        A "hidserv-stats-end" line, as well as any other "hidserv-*" line,
        is first added after the relay has been running for at least 24
        hours.
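
   As an illustration of how a consumer of extra-info descriptors might
   read this line, here is a minimal Python sketch.  The helper name and
   the regular expression are hypothetical and only mirror the format
   shown above; timestamps are assumed to be UTC.

```python
import re
from datetime import datetime, timedelta

# Hypothetical pattern mirroring: "hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s)
STATS_END_RE = re.compile(
    r"^hidserv-stats-end (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+) s\)$"
)


def parse_hidserv_stats_end(line):
    """Return (interval_start, interval_end) as naive datetimes (assumed UTC)."""
    match = STATS_END_RE.match(line.strip())
    if match is None:
        raise ValueError("not a hidserv-stats-end line: %r" % (line,))
    end = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    length = timedelta(seconds=int(match.group(2)))
    # The line reports the END of the interval; the start is derived.
    return end - length, end
```

   With the default NSEC of 86400, the derived start is exactly one day
   before the reported end.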

2.1. Hidden service traffic statistics

   We want to learn how much of the total Tor network traffic is caused by
   hidden service usage.  There are three phases in the rendezvous
   protocol where traffic is generated: (1) when hidden services make
   themselves available in the network, (2) when clients open connections
   to hidden services, and (3) when clients exchange application data with
   hidden services.  We expect (3) to consume most bytes here, so we're
   focusing on this only.  More precisely, we measure hidden service
   traffic by counting RELAY cells seen on a rendezvous point after
   receiving a RENDEZVOUS1 cell.  These RELAY cells include commands to
   open or close application streams, and they include application data.

   Tor relays will add the following line to their extra-info descriptor:

    "hidserv-rend-relayed-cells" SP num NL
        [At most once.]

        Approximate number of RELAY cells seen in either direction on a
        circuit after receiving and successfully processing a RENDEZVOUS1
        cell.  The actual number observed by the rendezvous point is
        multiplied by a random number in [0.9, 1.1] before being reported.

   The keyword indicates that this line is part of hidden-service
   statistics ("hidserv") and contains aggregate data from the relay
   acting as rendezvous point ("rend").
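
   The obfuscation step could look roughly like the following Python
   sketch.  This proposal does not specify the distribution of the random
   factor or how the result is rounded; the uniform distribution, the
   rounding to the nearest integer, and the function name are assumptions
   made for illustration only.

```python
import random


def obfuscate_count(actual, rng=None):
    """Multiply a raw counter by a random factor in [0.9, 1.1].

    Assumptions (not from the proposal): the factor is drawn uniformly
    and the product is rounded to the nearest integer.
    """
    if rng is None:
        rng = random.SystemRandom()  # OS-provided randomness
    factor = rng.uniform(0.9, 1.1)
    return int(round(actual * factor))
```

   A raw count of 1000 would then be reported as some value between 900
   and 1100.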

   We plan to extrapolate reported values to network totals by dividing
   values by the probability of clients picking relays as rendezvous
   point.  This estimate should become more precise for faster relays,
   which handle more rendezvous circuits, and as more relays report
   these statistics.
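
   A minimal sketch of this extrapolation in Python follows.  The
   single-relay estimate is the division described above.  Combining
   several reports by dividing the summed values by the summed
   probabilities is one possible choice and an assumption of this
   sketch, as are the function names.

```python
def extrapolate_single(reported_cells, rend_probability):
    """Network total estimated from one relay's report."""
    if rend_probability <= 0:
        raise ValueError("rendezvous probability must be positive")
    return reported_cells / rend_probability


def extrapolate_combined(reports):
    """Network total estimated from (reported_cells, rend_probability) pairs.

    Assumption: reports are pooled by summing cells and probabilities,
    so relays with a larger share of rendezvous traffic weigh more.
    """
    total_cells = 0
    total_probability = 0.0
    for cells, probability in reports:
        if probability <= 0:
            continue  # a relay that is never picked carries no information
        total_cells += cells
        total_probability += probability
    if total_probability == 0:
        raise ValueError("no usable reports")
    return total_cells / total_probability
```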

   We also plan to compare reported values with "cell-*" statistics to
   learn what fraction of traffic can be attributed to hidden services.

   Ideally, we'd be able to compare values to "write-history" and
   "read-history" lines to compute similar fractions of traffic used for
   hidden services.  The goal would be to avoid enabling "cell-*"
   statistics by default.  In order for this to work we'll have to
   multiply reported cell numbers with the default cell size of 512 bytes.
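
   The conversion from cells to bytes can be sketched as follows.  Which
   byte total to use as the denominator (read bytes, written bytes, or
   some combination) is not decided in this proposal, so the sketch
   leaves that choice to the caller; the function name is hypothetical.

```python
CELL_SIZE_BYTES = 512  # default cell size mentioned above


def hidserv_traffic_fraction(relayed_cells, history_bytes):
    """Fraction of history_bytes attributable to rendezvous RELAY cells.

    history_bytes is whatever byte total the caller derives from the
    "write-history" / "read-history" lines for the same interval.
    """
    if history_bytes <= 0:
        raise ValueError("history_bytes must be positive")
    return relayed_cells * CELL_SIZE_BYTES / history_bytes
```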

2.2. HSDir hidden service counting

   We also want to learn how many hidden services exist in the network.
   The best place to learn this is at hidden service directories where
   hidden services publish their descriptors.

   Tor relays will add the following line to their extra-info descriptor:

    "hidserv-dir-published-ids" SP num NL
        [At most once.]

        Approximate number of unique hidden-service identities seen in
        descriptors published to and accepted by this hidden-service
        directory.  The actual number observed by the directory is
        multiplied by a random number in [0.9, 1.1] before being
        reported.

   This statistic requires keeping a separate data structure with unique
   identities seen during the current statistics interval.  We could, in
   theory, have relays iterate over their descriptor caches when producing
   the daily hidden-service statistics blurb.  But it's unclear how
   caching would affect results from such an approach, because descriptors
   published at the start of the current statistics interval could already
   have been removed, and descriptors published in the last statistics
   interval could still be present.  Keeping a separate data structure,
   possibly even a probabilistic one, seems like the more accurate
   approach.
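
   A separate data structure of this kind could be as simple as a set
   that is reset at the end of each statistics interval.  The Python
   sketch below uses an exact set rather than a probabilistic structure;
   the class and method names are hypothetical.

```python
class PublishedIdCounter:
    """Tracks unique hidden-service identities seen in one interval."""

    def __init__(self):
        self._seen = set()

    def note_published(self, service_id):
        # Called once per accepted descriptor; duplicates are absorbed.
        self._seen.add(service_id)

    def finish_interval(self):
        """Return the unique count and start a fresh interval."""
        count = len(self._seen)
        self._seen = set()
        return count
```

   Because the set is independent of the descriptor cache, cache
   eviction during the interval does not affect the count.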

   We plan to extrapolate this value to network totals by calculating what
   fraction of hidden-service identities this relay was supposed to see.
   This extrapolation will be very rough, because each hidden-service
   directory is only responsible for a tiny share of hidden-service
   descriptors, and there is no way to increase that share significantly.

   Here are some numbers: there are about 3000 directories, and each
   descriptor is stored on three directories.  So, each directory is
   responsible for roughly 1/1000 of descriptor identifiers.  There are
   two replicas for each descriptor, and descriptor identifiers change
   once per day.  Hence, each descriptor is stored in four places in
   identifier space throughout a 24-hour period.  The probability that
   any given directory sees a given hidden-service identity is
   1-(1-1/1000)^4 = 0.00399, or about 1/250.  This approximation is an
   upper bound, because it assumes that services are running all day.
   Dividing by a probability that is too high yields an estimate that is
   too low, so an extrapolation based on this formula will lead to
   undercounting the total number of hidden services.
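
   The arithmetic above can be reproduced with a short Python sketch.
   The default parameter values are the rough numbers from this section;
   the function names are hypothetical.

```python
def prob_directory_sees_service(num_dirs=3000, dirs_per_descriptor=3,
                                replicas=2, ids_per_interval=2):
    """Probability that one directory sees a given service in 24 hours."""
    share = dirs_per_descriptor / num_dirs      # roughly 1/1000
    placements = replicas * ids_per_interval    # four places per interval
    return 1 - (1 - share) ** placements


def extrapolate_service_count(reported_ids, probability=None):
    """Network-wide estimate from one directory's reported count."""
    if probability is None:
        probability = prob_directory_sees_service()
    return reported_ids / probability
```

   With the defaults, the probability comes out at roughly 0.004, so a
   directory reporting 40 identities would suggest on the order of
   10,000 hidden services.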

   A possible inaccuracy in the estimation algorithm comes from the fact
   that a relay may not be acting as hidden-service directory during the
   full statistics interval.  We suggest adding the following line to
   handle this case better.

   Tor relays also add the following line to their extra-info descriptor,
   preceding any "hidserv-dir-*" lines:

    "hidserv-dir-start" YYYY-MM-DD HH:00:00 NL
        [At most once.]

        YYYY-MM-DD HH:00:00 defines the first hour when this
        hidden-service directory accepted either a publish or fetch
        request for a hidden-service descriptor.
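
   Producing this line amounts to truncating the time of the first
   publish or fetch request to the full hour.  A minimal Python sketch,
   with a hypothetical function name and the trailing NL omitted:

```python
from datetime import datetime


def format_hidserv_dir_start(first_request):
    """Build the "hidserv-dir-start" line from the first request time."""
    # Drop minutes and seconds so that only the hour is revealed.
    hour = first_request.replace(minute=0, second=0, microsecond=0)
    return "hidserv-dir-start " + hour.strftime("%Y-%m-%d %H:%M:%S")
```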

   Finally, the intentionally added randomness leads to either under- or
   overcounting hidden services by up to 10%.

3. Discussion

3.1. Count only RP cells? Or also IP cells?
   As discussed on IRC, counting only RP cells should be fine for now.
   Everything else is protocol overhead, which includes HSDir traffic,
   IPo traffic, RPo traffic before the first RELAY cell, etc.  We can
   always be smarter later. -KL

3.2. Why obfuscate HSDir stats? And how much?
   As discussed on IRC, maybe we should obfuscate small numbers more than
   large numbers by adding a random number in [-20, 20].  Or we could
   require a reporting threshold, if we can figure out how to keep an
   adversary from gaming it by making the required number of requests
   themselves.  Let's ask Aaron Johnson. -KL
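
   To make the additive idea concrete, here is a Python sketch.  Drawing
   a uniform integer and clamping the result at zero are assumptions of
   this sketch and were not part of the discussion; the function name is
   hypothetical.

```python
import random


def obfuscate_additive(actual, spread=20, rng=None):
    """Add a random integer in [-spread, spread] to a raw counter.

    Assumptions: uniform integer noise; negative results clamp to zero.
    """
    if rng is None:
        rng = random.SystemRandom()
    return max(0, actual + rng.randint(-spread, spread))
```

   With this scheme a true count of 10 can be reported as anything from
   0 to 30, while a true count of 10000 moves by at most 0.2%, so small
   numbers are hidden more thoroughly than large ones.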


[XXX]: guard discovery: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html