come up with a plan for publishing ip-to-country usage summaries

svn:r12642
author: Roger Dingledine <arma@torproject.org> 2007-12-03 06:03:56 +0000
committer: Roger Dingledine <arma@torproject.org> 2007-12-03 06:03:56 +0000
commit: 8df81ce8bce115a31dc78c9c1a2389e38a46de20 (patch)
tree: a579ac883f1456dfcb4dc7ce55f79000c8306779
parent: f46142e66414d5d6217691b1a8b6931948aade97 (diff)
download: torspec-8df81ce8bce115a31dc78c9c1a2389e38a46de20.tar.gz
torspec-8df81ce8bce115a31dc78c9c1a2389e38a46de20.zip
1 files changed, 161 insertions, 2 deletions
diff --git a/proposals/126-geoip-reporting.txt b/proposals/126-geoip-reporting.txt
index d2da4dc..57480ff 100644
--- a/proposals/126-geoip-reporting.txt
+++ b/proposals/126-geoip-reporting.txt
@@ -205,7 +205,7 @@ Status: Needs-Research
 
 6. Controllers use the IP-to-country db for mapping and for path building
 
-  Down the road, vidalia can use the IP-to-country mappings for placing
+  Down the road, Vidalia could use the IP-to-country mappings for placing
   on its map:
   - The location of the client
   - The location of the bridges, or other relays not in the
@@ -222,6 +222,14 @@ Status: Needs-Research
     GETINFO ip-to-country/128.31.0.34
     250+ip-to-country/128.31.0.34="US","USA","UNITED STATES"
 
+6.1. Other interfaces
+
+  Robert Hogan has also suggested a
+    GETINFO relays-by-country/cn
+
+  as well as torrc options for ExitCountryCodes, EntryCountryCodes,
+  ExcludeCountryCodes, etc.
+
 7. Relays and bridges use the IP-to-country db for usage summaries
 
   Once bridges have a GeoIP database locally, they can start to publish
@@ -231,5 +239,156 @@ Status: Needs-Research
   switch to using directory guards for all users by default.
 
   But how to safely summarize this information without opening too many
-  anonymity leaks seems hard...
+  anonymity leaks?
+
+7.1. Attacks to think about
+
+  First, note that we need to have a large enough time window that we're
+  not aiding correlation attacks much. I hope 24 hours is enough. So
+  that means no publishing stats until you've been up at least 24 hours.
+  And you can't publish follow-up stats more often than every 24 hours,
+  or people could look at the differential.
+
+  Second, note that we need to be sufficiently vague about the IP
+  addresses we're reporting. We are hoping that just specifying the
+  country will be vague enough. But a) what about active attacks where
+  we convince a bridge to use a GeoIP db that labels each suspect IP
+  address as a unique country? We have to assume that the consensus GeoIP
+  db won't be malicious in this way. And b) could such singling-out
+  attacks occur naturally, for example because of countries that have
+  a very small IP space? We should investigate that.
+
+7.2. Granularity of users
+
+  Do we want to avoid reporting countries that have a very small anonymity set
+  (that is, number of users) for the day? For example, we might avoid
+  listing any countries that have seen fewer than five addresses over
+  the 24 hour period. This approach would be helpful in reducing the
+  singling-out opportunities -- in the extreme case, we could imagine a
+  situation where one blogger from the Sudan used Tor on a given day, and
+  we can discover which entry guard she used.
+
+  But I fear that especially for bridges, seeing only one hit from a
+  given country in a given day may be quite common.
+
+  As a compromise, we should start out with an "Other" category in
+  the reported stats, which is the sum of unlisted countries; if that
+  category is consistently interesting, we can think harder about how
+  to get the right data from it safely.
+
+  But note that bridge summaries will not be made public individually,
+  since doing so would help people enumerate bridges. Whereas summaries
+  from normal relays will be public. So perhaps that means we can afford
+  to be more specific in bridge summaries? In particular, I'm thinking the
+  "other" category should be used by public relays but not for bridges
+  (or if it is, used with a lower threshold).
+
+  Even for countries that have many Tor users, we might not want to be
+  too specific about how many users we've seen. For example, we might
+  round down the number of users we report to the nearest multiple of 5.
+  My instinct for now is that this won't be that useful.
+
+7.3. Other issues
+
+  Another note: we'll likely be overreporting in the case of users with
+  dynamic IP addresses: if they rotate to a new address over the course
+  of the day, we'll count them twice. So be it.
+
+7.4. Where to publish the summaries?
+
+  We designed extrainfo documents for information like this. So they
+  should just be more entries in the extrainfo doc.
+
+  But if we want to publish summaries every 24 hours (no more often,
+  no less often), aren't we tied to the router descriptor publishing
+  schedule? That is, if we publish a new router descriptor at the 18
+  hour mark, and nothing much has changed at the 24 hour mark, won't
+  the new descriptor get dropped as being "cosmetically similar", and
+  then nobody will know to ask about the new extrainfo document?
+
+  One solution would be to make and remember the 24 hour summary at the
+  24 hour mark, but not actually publish it anywhere until we happen to
+  publish a new descriptor for other reasons. If we happen to go down
+  before publishing a new descriptor, then so be it, at least we tried.
+
+7.5. What if the relay is unreachable or goes to sleep?
+
+  Even if you've been up for 24 hours, if you were hibernating for 18
+  of them, then we're not getting as much fuzziness as we'd like. So
+  I guess that means that we need a 24-hour period of being "awake"
+  before we're willing to publish a summary. A similar attack works if
+  you've been awake but unreachable for the first 18 of the 24 hours. As
+  another example, a bridge that's on a laptop might be suspended for
+  some of each day.
+
+  This implies that some relays and bridges will never publish summary
+  stats, because they're not ever reliably working for 24 hours in
+  a row. If a significant percentage of our reporters end up being in
+  this boat, we should investigate whether we can accumulate 24 hours of
+  "usefulness", even if there are holes in the middle, and publish based
+  on that.
+
+  What other issues are like this? It seems that just moving to a new
+  IP address shouldn't be a reason to cancel stats publishing, assuming
+  we were usable at each address.
+
+7.6. IP addresses that aren't in the geoip db
+
+  Some IP addresses aren't in the public geoip databases. In particular,
+  I've found that a lot of African countries are missing, but there
+  are also some common ones in the US that are missing, like parts of
+  Comcast. We could just lump unknown IP addresses into the "other"
+  category, but it might be useful to gather a general sense of how many
+  lookups are failing entirely, by adding a separate "Unknown" category.
+
+  We could also contribute back to the geoip db, by letting bridges set
+  a config option to report the actual IP addresses that failed their
+  lookup. Then the bridge authority operators can manually make sure
+  the correct answer will be in later geoip files. This config option
+  should be disabled by default.
+
+7.7. Bringing it all together
+
+  So here's the plan:
+
+  24 hours after starting up (modulo Section 7.5 above), bridges and
+  relays should construct a daily summary of client countries they've
+  seen, including the above "Unknown" category (Section 7.6) as well.
+
+  Non-bridge relays lump all countries with fewer than K (e.g. K=5) users
+  into the "Other" category (see Sec 7.2 above), whereas bridge relays are
+  willing to list a country even when it has only one user for the day.
+
+  Whenever we have a daily summary on record, we include it in our
+  extrainfo document whenever we publish one. The daily summary we
+  remember locally gets replaced with a newer one when another 24
+  hours pass.
+
+7.8. Some forward secrecy
+
+  How should we remember addresses locally? If we convert them into
+  country-codes immediately, we will count them again if we see them
+  again. On the other hand, we don't really want to keep a list hanging
+  around of all IP addresses we've seen in the past 24 hours.
+
+  Step one is that we should never write this stuff to disk. Keeping it
+  only in RAM will make things somewhat better. Step two is to avoid
+  keeping any timestamps associated with it: rather than a rolling
+  24-hour window, which would require us to remember the various times
+  we've seen that address, we can instead just throw out the whole list
+  every 24 hours and start over.
+
+  We could hash the addresses, and then compare hashes when deciding if
+  we've seen a given address before. We could even do keyed hashes. Or
+  Bloom filters. But if our goal is to defend against an adversary
+  who steals a copy of our RAM while we're running and then does
+  guess-and-check on whatever blob we're keeping, we're in bad shape.
+
+  We could drop the last octet of the IP address as soon as we see
+  it. That would cause us to undercount some users from cablemodem and
+  DSL networks that have a high density of Tor users. And it wouldn't
+  really help that much -- indeed, the extent to which it does help is
+  exactly the extent to which it makes our stats less useful.
+
+  Other ideas?
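
Illustration (not part of the patch): a minimal Python sketch of the daily
summary described in Section 7.7, under stated assumptions. The function name
daily_summary, the use of None for a failed GeoIP lookup, and the choice to
keep "Unknown" out of the "Other" bucket regardless of its count are all
hypothetical; the proposal only fixes the idea of a threshold K for non-bridge
relays and per-country listing for bridges.

```python
from collections import Counter


def daily_summary(country_codes, is_bridge, k=5):
    """Build a per-country client count for one 24-hour period.

    country_codes: one entry per unique client address seen; a country
    code string, or None when the GeoIP lookup failed (assumption).
    is_bridge: bridges list every country, even with a single user.
    k: threshold below which non-bridge relays fold a country into "Other".
    """
    counts = Counter()
    for cc in country_codes:
        counts["Unknown" if cc is None else cc] += 1

    if is_bridge:
        # Bridge summaries are not published individually, so the
        # proposal allows them to be more specific.
        return dict(counts)

    summary = {}
    other = 0
    for cc, n in counts.items():
        # Assumption: "Unknown" stays a separate category (Section 7.6)
        # and is never folded into "Other".
        if cc != "Unknown" and n < k:
            other += n
        else:
            summary[cc] = n
    if other:
        summary["Other"] = other
    return summary


seen = ["us"] * 7 + ["de"] * 5 + ["sd"] + ["cn"] * 2 + [None] * 3
relay_view = daily_summary(seen, is_bridge=False)
bridge_view = daily_summary(seen, is_bridge=True)
```

With this input the non-bridge view folds "sd" and "cn" into a single "Other"
count of 3, while the bridge view lists both countries separately.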
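
Illustration (not part of the patch): a rough sketch of the in-memory
bookkeeping suggested in Section 7.8, assuming the simplest reading of "throw
out the whole list every 24 hours and start over". The class name
SeenAddresses and its note() method are hypothetical. Only one timestamp is
kept (the start of the current window), never one per address, and nothing is
written to disk. This does not address the RAM-theft adversary the section
worries about.

```python
class SeenAddresses:
    """In-memory set of client addresses for the current 24-hour window."""

    def __init__(self, window_seconds=24 * 60 * 60, now=0.0):
        self.window = window_seconds
        self.started = now       # single timestamp for the whole window
        self.addresses = set()   # kept only in memory, never persisted

    def note(self, address, now):
        """Record an address; return True if it is new in this window."""
        if now - self.started >= self.window:
            # Discard everything at once rather than expiring entries
            # individually, so no per-address timestamps are needed.
            self.addresses.clear()
            self.started = now
        if address in self.addresses:
            return False
        self.addresses.add(address)
        return True


tracker = SeenAddresses(now=0.0)
first = tracker.note("128.31.0.34", now=10.0)
repeat = tracker.note("128.31.0.34", now=20.0)
after_reset = tracker.note("128.31.0.34", now=86400.0)
```

A caller would look up the country and increment its count only when note()
returns True, which avoids double-counting an address within one window.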