author Karsten Loesing <karsten.loesing@gmx.net> 2014-11-19 10:32:56 +0100
committer Karsten Loesing <karsten.loesing@gmx.net> 2014-11-19 10:32:56 +0100
commit 89282aba291d12a1a539606ebf02af3047bd61fa
tree 97feda6f2c1dcb930441cf1b00e116a901d52cdd
parent 909b63b6f7512582d05d1a089fb62a426845818c
Revise George's hidden-service statistics proposal.
Diffstat (limited to 'proposals/238-hs-relay-stats.txt')
-rw-r--r-- proposals/238-hs-relay-stats.txt 151
1 files changed, 124 insertions, 27 deletions
diff --git a/proposals/238-hs-relay-stats.txt b/proposals/238-hs-relay-stats.txt
index a081989..135048b 100644
--- a/proposals/238-hs-relay-stats.txt
+++ b/proposals/238-hs-relay-stats.txt
@@ -45,48 +45,145 @@ Status: Incomplete
2. Implementation
-2.1. Hidden service traffic statistics
-
- Tor HSDirs will add the following field to their extra-info
- descriptor:
-
- "hs-traffic" ... XXX
-
-2.2. HSDir hidden service counting
+2.0. Hidden service statistics interval
- Tor HSDirs will add the following field to their extra-info
- descriptor:
+ We want relays to report hidden-service statistics over a long-enough
+ time period to not put users at risk. Similar to other statistics, we
+ suggest a 24-hour statistics interval. All related statistics are
+ collected at the end of that interval and included in the next
+ extra-info descriptors published by the relay.
- "dirreq-v3-hsdir" key=val,... NL
- [At most once.]
+ Tor relays will add the following line to their extra-info descriptor:
- Statistics about HS directory activities.
- The current list of statistics is as follows:
+ "hidserv-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
+ [At most once.]
- "hs-num": The approximate number of HSes that the HSDir is
- hosting descriptors for at the time the extra-info
- descriptor was created.
+ YYYY-MM-DD HH:MM:SS defines the end of the included measurement
+ interval of length NSEC seconds (86400 seconds by default).
+ A "hidserv-stats-end" line, as well as any other "hidserv-*" line,
+ is first added after the relay has been running for at least 24
+ hours.
- To derive this, HSDirs are expected to walk over their descriptor
- caches and count the number of HSes contained. The number is then
- obfuscated slightly by a small noise factor that introduces 10%
- inaccuracy.
-
- More specifically:
-
- hs-num = <number of HSes> * <random real \in [0.9, 1.1]>
+2.1. Hidden service traffic statistics
+ We want to learn how much of the total Tor network traffic is caused by
+ hidden service usage. There are three phases in the rendezvous
+ protocol where traffic is generated: (1) when hidden services make
+ themselves available in the network, (2) when clients open connections
+ to hidden services, and (3) when clients exchange application data with
+ hidden services. We expect (3) to consume most bytes here, so we're
+ focusing on this only. More precisely, we measure hidden service
+ traffic by counting RELAY cells seen on a rendezvous point after
+ receiving a RENDEZVOUS1 cell. These RELAY cells include commands to
+ open or close application streams, and they include application data.
+
+ Tor relays will add the following line to their extra-info descriptor:
+
+ "hidserv-rend-relayed-cells" SP num NL
+ [At most once.]
+
+ Approximate number of RELAY cells seen in either direction on a
+ circuit after receiving and successfully processing a RENDEZVOUS1
+ cell. The actual number observed by the relay is multiplied
+ by a random number in [0.9, 1.1] before being reported.
+
+ The keyword indicates that this line is part of hidden-service
+ statistics ("hidserv") and contains aggregate data from the relay
+ acting as rendezvous point ("rend").
+
+ We plan to extrapolate reported values to network totals by dividing
+ each value by the probability of clients picking the reporting relay
+ as rendezvous point. This approach should become more precise for
+ faster relays and as more relays report these statistics.
+
+ We also plan to compare reported values with "cell-*" statistics to
+ learn what fraction of traffic can be attributed to hidden services.
+
+ Ideally, we'd be able to compare values to "write-history" and
+ "read-history" lines to compute similar fractions of traffic used for
+ hidden services. The goal would be to avoid enabling "cell-*"
+ statistics by default. In order for this to work we'll have to
+ multiply reported cell numbers by the default cell size of 512 bytes.
+2.2. HSDir hidden service counting
- time_t cutoff = now - REND_CACHE_MAX_AGE - REND_CACHE_MAX_SKEW;
+ We also want to learn how many hidden services exist in the network.
+ The best place to learn this is at hidden service directories where
+ hidden services publish their descriptors.
+
+ Tor relays will add the following line to their extra-info descriptor:
+
+ "hidserv-dir-published-ids" SP num NL
+ [At most once.]
+
+ Approximate number of unique hidden-service identities seen in
+ descriptors published to and accepted by this hidden-service
+ directory. The actual number observed by the directory is
+ multiplied by a random number in [0.9, 1.1] before being
+ reported.
+
+ This statistic requires keeping a separate data structure with unique
+ identities seen during the current statistics interval. We could, in
+ theory, have relays iterate over their descriptor caches when producing
+ the daily hidden-service statistics blurb. But it's unclear how
+ caching would affect results from such an approach, because descriptors
+ published at the start of the current statistics interval could already
+ have been removed, and descriptors published in the last statistics
+ interval could still be present. Keeping a separate data structure,
+ possibly even a probabilistic one, seems like the more accurate
+ approach.
+
+ We plan to extrapolate this value to network totals by calculating what
+ fraction of hidden-service identities this relay was supposed to see.
+ This extrapolation will be very rough, because each hidden-service
+ directory is only responsible for a tiny share of hidden-service
+ descriptors, and there is no way to increase that share significantly.
+
+ Here are some numbers: there are about 3000 directories, and each
+ descriptor is stored on three directories. So, each directory is
+ responsible for roughly 1/1000 of descriptor identifiers. There are
+ two replicas for each descriptor, and descriptor identifiers change
+ once per day. Hence, each descriptor is stored at four places in
+ identifier space throughout a 24-hour period. The probability that
+ any given directory sees a given hidden-service identity is
+ 1-(1-1/1000)^4 = 0.00399, or roughly 1/250. This approximation is an
+ upper bound, because it assumes that services are running all day.
+ An extrapolation based on this formula will therefore undercount the
+ total number of hidden services.
+
+ A possible inaccuracy in the estimation algorithm comes from the fact
+ that a relay may not be acting as hidden-service directory during the
+ full statistics interval. We suggest adding the following line to
+ handle this case better.
+
+ Tor relays also add the following line to their extra-info descriptor,
+ preceding any "hidserv-dir-*" lines:
+
+ "hidserv-dir-start" YYYY-MM-DD HH:00:00 NL
+ [At most once.]
+
+ YYYY-MM-DD HH:00:00 defines the first hour when this
+ hidden-service directory accepted either a publish or fetch
+ request for a hidden-service descriptor.
+
+ Finally, the intentionally added randomness leads to either under- or
+ overcounting hidden services by up to 10%.
3. Discussion
3.1. Count only RP cells? Or also IP cells?
+ As discussed on IRC, counting only RP cells should be fine for now.
+ Everything else is protocol overhead, which includes HSDir traffic,
+ IPo traffic, RPo traffic before the first RELAY cell, etc. We can
+ always be smarter later. -KL
3.2. Why obfuscation on HSDirs stats? And how much?
-
+ As discussed on IRC, maybe we should obfuscate small numbers more than
+ large numbers by adding a random number in [-20, 20]. Or we could
+ require a reporting threshold, if we can figure out how to keep an
+ adversary from gaming it by making the required number of requests
+ themselves. Let's ask Aaron Johnson. -KL
[XXX]: guard discovery: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html
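
The rendezvous-point statistic in section 2.1 of the revised proposal can be sketched as follows. This is only an illustration under stated assumptions, not code from tor: the function names are hypothetical, rounding the obfuscated value to an integer is an assumption (the proposal only says "num"), and the rendezvous-selection probability is taken as an input because the proposal does not say how it is computed.

```python
import random

# Default cell size named in the proposal, in bytes.
CELL_SIZE_BYTES = 512


def obfuscate(actual_count, rng=random):
    """Multiply the observed count by a random factor in [0.9, 1.1].

    Rounding to an integer is an assumption made for this sketch.
    """
    return round(actual_count * rng.uniform(0.9, 1.1))


def extrapolate_cells(reported_cells, rend_probability):
    """Scale one relay's reported cell count to a network total.

    rend_probability is the (assumed known) probability that a client
    picks this relay as rendezvous point.
    """
    if not 0.0 < rend_probability <= 1.0:
        raise ValueError("rend_probability must be in (0, 1]")
    return reported_cells / rend_probability


def cells_to_bytes(cells):
    """Convert a cell count to bytes, for comparison with byte histories."""
    return cells * CELL_SIZE_BYTES


if __name__ == "__main__":
    reported = obfuscate(1_000_000)
    network_cells = extrapolate_cells(reported, rend_probability=0.002)
    print(reported, network_cells, cells_to_bytes(network_cells))
```

Because of the multiplicative noise, any single extrapolated total carries up to 10% error from obfuscation alone, on top of the sampling error the proposal mentions.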
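
The arithmetic behind the "roughly 1/250" figure in section 2.2 of the revised proposal can be reproduced with a few lines. This is a sketch using the proposal's round numbers (about 3000 directories, three directories per descriptor, four places in identifier space per 24 hours); the function and parameter names are hypothetical.

```python
def sighting_probability(num_hsdirs=3000, dirs_per_descriptor=3, placements=4):
    """Probability that one directory sees a given service within 24 hours.

    Each placement lands on this directory with probability
    dirs_per_descriptor / num_hsdirs (about 1/1000); the service is seen
    if at least one of the placements lands here.
    """
    share = dirs_per_descriptor / num_hsdirs
    return 1.0 - (1.0 - share) ** placements


def extrapolate_services(reported_ids, probability):
    """Scale one directory's reported identity count to a network total."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return reported_ids / probability


if __name__ == "__main__":
    p = sighting_probability()
    print(p, 1.0 / p)
    print(extrapolate_services(40, p))
```

As the proposal notes, the probability is an upper bound because it assumes services run all day, so dividing by it tends to undercount the total.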