author Nick Mathewson <nickm@torproject.org> 2008-05-08 04:13:36 +0000
committer Nick Mathewson <nickm@torproject.org> 2008-05-08 04:13:36 +0000
commit 32065813ac34437971cb9c8a95a1923557d0557d (patch)
tree 2fe16f2f91ea0d16de7e2cca2a1673cdd88d21c6
parent 2238d8008d6c1e71e23fa52fbf51dc8773966abe (diff)
Add proposed methodology for tracking national usage trends.
svn:r14578
-rw-r--r-- doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt 88
1 files changed, 88 insertions, 0 deletions
diff --git a/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt b/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt
new file mode 100644
index 0000000000..08612aa468
--- /dev/null
+++ b/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt
@@ -0,0 +1,88 @@
+
+
+Abstract
+
+ This document explains how to estimate roughly how many Tor users
+ there are, and how many of them are in each country. Statistics are
+ involved.
+
+Motivation
+
+ There are a few reasons we need to keep track of which countries
+ Tor users (in aggregate) are coming from:
+
+ - Resource allocation. Knowing which underserved countries have
+ lots of users tells us where we need to direct translation and
+ outreach efforts.
+
+ - Anticensorship. Sudden drops in usage on a national basis can
+ indicate the arrival of a censorious firewall.
+
+ - Sponsor outreach and self-evaluation. Many people and
+ organizations who are interested in funding The Tor Project's
+ work want to know that we're successfully serving parts of the
+ world they're interested in, and that efforts to expand our
+ userbase are actually succeeding. So, when you come right
+ down to it, do we.
+
+Goals
+
+ We want to know roughly how many Tor users there are, and which
+ countries they're in, even in the presence of a hypothetical
+ "directory guard" feature. Some uncertainty is okay, but we'd like
+ to be able to put a bound on the uncertainty.
+
+ We need to make sure this information isn't exposed in a way that
+ helps an adversary.
+
+Methods
+
+ Every client downloads network status documents. There are
+ currently three methods (one hypothetical) for clients to get them.
+ - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
+ document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
+ minutes].
+
+ - 0.2.0.x clients fetch a v3 networkstatus consensus document
+ at a random interval between when their current document is no
+ longer freshest, and when their current document is about to
+ expire.
+
+ [In both of the above cases, clients choose a directory cache at
+ random with odds roughly proportional to its bandwidth.]
+
+ - In some future version, clients will choose directory caches
+ to serve as their "directory guards" to avoid profiling
+ attacks, similarly to how clients currently start all their
+ circuits at guard nodes.
+
+ We assume that a directory cache can tell which of these three
+ categories a client is in by the format of its status request.
+
+ A directory cache can be made to count distinct client IP
+ addresses that make a certain request of it in a given timeframe.
+ For the first two cases, a cache can get a picture of the overall
+ number and countries of users in the network by dividing its IP
+ count by the probability with which it would be chosen. Suppose
+ the cache's listed bandwidth is such that it expects to be chosen
+ with probability P for any given request, and it has been
+ counting IPs for long enough that the average client is expected
+ to have made N requests. Then each client will have visited the
+ cache at least once with probability P' = 1-(1-P)^N, so the cache
+ divides the IP counts it has seen by P' to get its estimate.
+
+ If directory guards are in use, directory guards get a picture of
+ all those users who chose them as a guard when they were listed
+ as a good choice for a guard, and who are also on the network
+ now. The cleanest data here will come from nodes that were listed
+ as good new-guard choices for a while, and have not been so for a
+ while longer (to study decay rates); nodes that have been listed
+ as good new-guard choices consistently for a long time (to get a
+ sample of the network); and nodes that have been listed as good
+ new-guard choices only recently (to get a sample of new users and
+ users whose guards have died out).
+
+ Note that these measurements *shouldn't* be taken at directory
+ authorities: their picture of the network is too skewed by the
+ special cases in which clients fetch from them directly.
+
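
The estimate described under Methods for the first two client categories can be sketched as below. This is only an illustration of the arithmetic, not code from Tor: the function names and the per-country dictionary layout are hypothetical, and it assumes the cache has already mapped each distinct client IP to a country code and knows (or has estimated) P and N.

```python
def visit_probability(p, n):
    """Return P' = 1 - (1 - P)^N, the chance that a client picked this
    cache at least once in N independent requests.

    p: probability the cache is chosen for a single request (assumed known).
    n: expected number of requests per client in the counting window.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - p) ** n


def estimate_users(ip_counts, p, n):
    """Scale per-country distinct-IP counts seen at one cache up to a
    network-wide estimate by dividing each count by P'.

    ip_counts: hypothetical dict mapping country code -> distinct IPs seen.
    """
    p_seen = visit_probability(p, n)
    return {country: count / p_seen for country, count in ip_counts.items()}
```

For instance, a cache chosen with probability 0.5 per request, counting over a window in which clients make about 2 requests each, has P' = 0.75, so 75 distinct IPs seen from one country would suggest roughly 100 users there. Real values of P are far smaller, which is why the counting window (and hence N) matters for bounding the uncertainty.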