From 8579a06da38c3e5821c5145c9037892861aa75da Mon Sep 17 00:00:00 2001
From: George Kadianakis <desnacked@riseup.net>
Date: Tue, 25 Nov 2014 17:33:19 +0000
Subject: Tidy up the proposal.

---
 proposals/238-hs-relay-stats.txt | 196 +++++++++++++++++++--------------------
 1 file changed, 97 insertions(+), 99 deletions(-)

diff --git a/proposals/238-hs-relay-stats.txt b/proposals/238-hs-relay-stats.txt
index 2f24434..bcadd52 100644
--- a/proposals/238-hs-relay-stats.txt
+++ b/proposals/238-hs-relay-stats.txt
@@ -1,9 +1,8 @@
-Filename: 238-hs-relay-stats
-
+Filename: 238-hs-relay-stats.txt
 Title: Better hidden service stats from Tor relays
 Author: George Kadianakis, David Goulet, Karsten Loesing, Aaron Johnson
 Created: 2014-11-17
-Status: Incomplete
+Status: Draft
 
 0. Motivation
 
@@ -24,13 +23,13 @@ Status: Incomplete
    network traffic or 90% of the Tor network traffic. This info can
    also help us during load balancing, for example if we change the
    path building of hidden services to mitigate guard discovery
-   attacks [XXX].
-   # XXX Is "HS purposes" only RP traffic? Or also IP traffic?
+   attacks [0].
 
-   Also, learning the number of hidden services, can help us
-   understand how widespread hidden services are. It will also help us
-   understand approximately how much load is put in the network by
-   hidden service logistics, like introduction point circuits etc.
+   Also, learning the number of hidden services can give us an
+   understanding of how widespread hidden services are. It will also
+   help us understand approximately how much load is put on the
+   network by hidden service logistics, such as introduction point
+   circuits.
 
 1. Design
 
@@ -45,7 +44,7 @@ Status: Incomplete
 
 2. Implementation
 
-2.0. Hidden service statistics interval
+2.1. Hidden service statistics interval
 
    We want relays to report hidden-service statistics over a long-enough
    time period to not put users at risk.  Similar to other statistics, we
@@ -65,7 +64,7 @@ Status: Incomplete
         is first added after the relay has been running for at least 24
         hours.
 
-2.1. Hidden service traffic statistics
+2.2. Hidden service traffic statistics
 
    We want to learn how much of the total Tor network traffic is caused by
    hidden service usage.  There are three phases in the rendezvous
@@ -83,51 +82,17 @@ Status: Incomplete
     "hidserv-rend-relayed-cells" SP num NL
         [At most once.]
 
-        Approximate number of RELAY cells seen in either direction on a
-        circuit after receiving and successfully processing a RENDEZVOUS1
-        cell.  The actual number observed by the directory is multiplied
-        with a random number in [0.9, 1.1] before being reported.
+        Approximate number of RELAY cells seen in either direction on
+        a circuit after receiving and successfully processing a
+        RENDEZVOUS1 cell.  The actual number observed by the relay
+        is multiplied by a random number in [0.9, 1.1] and then
+        floored before being reported.
 
    The keyword indicates that this line is part of hidden-service
    statistics ("hidserv") and contains aggregate data from the relay
    acting as rendezvous point ("rend").
 
-   We plan to extrapolate reported values to network totals by dividing
-   values by the probability of clients picking relays as rendezvous
-   point.  This approach should become more precise on faster relays and
-   the more relays report these statistics.
-
-   We also plan to compare reported values with "cell-*" statistics to
-   learn what fraction of traffic can be attributed to hidden services.
-
-   Ideally, we'd be able to compare values to "write-history" and
-   "read-history" lines to compute similar fractions of traffic used for
-   hidden services.  The goal would be to avoid enabling "cell-*"
-   statistics by default.  In order for this to work we'll have to
-   multiply reported cell numbers with the default cell size of 512 bytes
-   (we cannot infer the actual number of bytes, because cells are
-   end-to-end encrypted between client and service).
-
-   A possible alternative to multiplying the number of cells with a random
-   factor is to introduce additive noise.  Let's suppose that we would
-   like to obscure any individual connection that contains C cells or
-   fewer (obscuring extremely and unusually large connections seems
-   hopeless but unnecessary).  That is, we don't want the (distribution
-   of) the cell count from any relay to change by much whether or not C
-   cells are removed.  The standard differential privacy approach would be
-   to *add* noise from the Laplace distribution Lap(\epsilon/C), where
-   \epsilon controls how much the statistics *distribution* can
-   multiplicatively differ.  This is not to say that we need to add noise
-   exactly from that distribution (maybe we weaken the guarantee slightly
-   to get better accuracy), but the same idea applies.  This would apply
-   the same to both large and small relays.  We *want* to learn roughly
-   how much hidden-service traffic each relay has - we just want to
-   obscure the exact number within some tolerance.  We'll probably want to
-   include the algorithm and parameters used for adding noise in the
-   "hidserv-rend-relayed-cells" line, as in, "lap=x" with x being
-   \epsilon/C.
-
-2.2. HSDir hidden service counting
+2.3. HSDir hidden service counting
 
    We also want to learn how many hidden services exist in the network.
    The best place to learn this is at hidden service directories where
@@ -141,8 +106,8 @@ Status: Incomplete
         Approximate number of unique hidden-service identities seen in
         descriptors published to and accepted by this hidden-service
         directory.  The actual number observed by the directory is
-        multiplied with a random number in [0.9, 1.1] before being
-        reported.
+        multiplied by a random number in [0.9, 1.1] and then
+        floored before being reported.
 
    This statistic requires keeping a separate data structure with unique
    identities seen during the current statistics interval.  We could, in
@@ -155,6 +120,65 @@ Status: Incomplete
    possibly even a probabilistic one, seems like the more accurate
    approach.
 
+3. Security
+
+   The main security considerations that need discussion are what an
+   adversary could do with reported statistics that they couldn't do
+   without them.  In the following, we're going through things the
+   adversary could learn, how plausible that is, and how much we care.
+   (All these things refer to hidden-service traffic, not to
+   hidden-service counting.  We should think about the latter, too.)
+
+3.1. Identify rendezvous point of high-volume and long-lived connection
+
+   The adversary could identify the rendezvous point of a very large and
+   very long-lived HS connection by observing a relay with unexpectedly
+   large relay cell count.
+
+3.2. Identify number of users of a hidden service
+
+   The adversary may be able to identify the number of users
+   of an HS if they know the amount of traffic on a connection to that HS
+   (which they can potentially determine themselves) and know when that
+   service goes up or down. They can look at the change in the total
+   reported RP traffic to estimate how many fewer HS users there
+   are when that HS is down.
+
+4. Discussion
+
+4.1. Why count only RP cells? Why not also count IP cells?
+
+   As discussed on IRC, counting only RP cells should be fine for now.
+   Everything else is protocol overhead, which includes HSDir traffic,
+   introduction point traffic, or rendezvous point traffic before the
+   first RELAY cell, etc.
+
+   Furthermore, introduction points correspond to specific HSes, so
+   publishing IP cell stats could reveal the popularity of specific
+   HSes.
+
+4.2. How to use these stats?
+
+ 4.2.1. How to use RP Cell statistics
+
+   We plan to extrapolate reported values to network totals by dividing
+   each value by the probability that clients pick the reporting relay
+   as their rendezvous point.  The estimate should become more precise
+   for faster relays and as more relays report these statistics.
+
+   We also plan to compare reported values with "cell-*" statistics to
+   learn what fraction of traffic can be attributed to hidden services.
+
+   Ideally, we'd be able to compare values to "write-history" and
+   "read-history" lines to compute similar fractions of traffic used for
+   hidden services.  The goal would be to avoid enabling "cell-*"
+   statistics by default.  In order for this to work we'll have to
+   multiply reported cell numbers with the default cell size of 512 bytes
+   (we cannot infer the actual number of bytes, because cells are
+   end-to-end encrypted between client and service).
+
+ 4.2.2. How to use HSDir HS statistics
+
    We plan to extrapolate this value to network totals by calculating what
    fraction of hidden-service identities this relay was supposed to see.
    This extrapolation will be very rough, because each hidden-service
@@ -183,51 +207,25 @@ Status: Incomplete
    consider the part of the statistics interval following the valid-after
    time of that consensus.
 
-   Finally, the intentionally added randomness leads to either under- or
-   overcounting hidden services by up to 10%.  A probably better
-   alternative for adding noise is to use the Laplace approach suggested
-   above.
-
-3. Security
-
-   The main security considerations that need discussion are what an
-   adversary could do with reported statistics that they couldn't do
-   without them.  In the following, we're going through things the
-   adversary could learn, how plausible that is, and how much we care.
-   (All these things refer to hidden-service traffic, not to
-   hidden-service counting.  We should think about the latter, too.)
-
-3.1. Identify rendezvous point of high-volume and long-lived connection
-
-   The adversary could identify the rendezvous point of a very large and
-   very long-lived HS connection by observing a relay with unexpectedly
-   large relay cell count.
-
-3.2. Identify hard-coded rendezvous points
-
-   The adversary could observe if there are RPs that consistently report
-   large cell counts. These might be HS clients with hardcoded RPs, and
-   that would allow the adversary to identify this behavior and
-   potentially link that with a known HS client of known behavior (e.g.
-   a botnet client). Then the adversary could figure out which RPs to
-   target.
-
-3.3. Identify number of users of a hidden service
-
-   The adversary may be able to identify the number of users
-   of an HS if he knows the amount of traffic on a connection to that HS
-   (which he potentially can determine himself) and knows when that
-   service goes up or down. He can look at the change in the total
-   reported RP traffic to determine about how many fewer HS users there
-   are when that HS is down.
-
-4. Discussion
-
-4.1. Count only RP cells? Or also IP cells?
-   As discussed on IRC, counting only RP cells should be fine for now.
-   Everything else is protocol overhead, which includes HSDir traffic,
-   IPo traffic, RPo traffic before the first RELAY cell, etc.  We can
-   always be smarter later. -KL
+4.3. Multiplicative or additive noise?
 
+   A possible alternative to multiplying the number of cells with a random
+   factor is to introduce additive noise.  Let's suppose that we would
+   like to obscure any individual connection that contains C cells or
+   fewer (obscuring extremely and unusually large connections seems
+   hopeless but unnecessary).  That is, we don't want the (distribution
+   of) the cell count from any relay to change by much whether or not C
+   cells are removed.  The standard differential privacy approach would be
+   to *add* noise from the Laplace distribution Lap(\epsilon/C), where
+   \epsilon controls how much the statistics *distribution* can
+   multiplicatively differ.  This is not to say that we need to add noise
+   exactly from that distribution (maybe we weaken the guarantee slightly
+   to get better accuracy), but the same idea applies.  This would apply
+   the same to both large and small relays.  We *want* to learn roughly
+   how much hidden-service traffic each relay has - we just want to
+   obscure the exact number within some tolerance.  We'll probably want to
+   include the algorithm and parameters used for adding noise in the
+   "hidserv-rend-relayed-cells" line, as in, "lap=x" with x being
+   \epsilon/C.
 
-[XXX]: guard discovery: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html
+[0]: guard discovery: https://lists.torproject.org/pipermail/tor-dev/2014-September/007474.html
-- 
cgit v1.2.3-54-g00ecf