Filename: 104-short-descriptors.txt
Title: Long and Short Router Descriptors
Version: $Revision$
Last-Modified: $Date$
Author: Nick Mathewson
Created:
Status: Open

Overview:

  This document proposes moving unused-by-clients information from regular
  router descriptors into a special "long form" router descriptor.

  It presents options; it is not yet a complete proposal.

Proposal:

  Some of the costliest fields in the current directory protocol are ones
  that no client actually uses.  In particular, the "read-history" and
  "write-history" fields are used only by the authorities for monitoring the
  status of the network.  If we took them out, the size of a compressed list
  of all the routers would fall by about 60%.  (No other disposable field
  would save more than 2%.)
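
  One way to check a savings figure like this is to compress a set of
  descriptors with and without the candidate fields and compare sizes.  The
  sketch below is only an illustration: the helper names are hypothetical,
  zlib stands in for whatever compression the directory protocol uses, and
  it assumes each field sits on a single line beginning with its keyword.

```python
import zlib


def strip_fields(descriptor, drop_keywords):
    """Return the descriptor text without lines whose keyword is dropped.

    Assumption: every field is one line that starts with its keyword.
    """
    kept = []
    for line in descriptor.splitlines(keepends=True):
        words = line.split(None, 1)
        if words and words[0] in drop_keywords:
            continue
        kept.append(line)
    return "".join(kept)


def compressed_savings(descriptors, drop_keywords):
    """Fraction by which the compressed list shrinks once fields are dropped."""
    full = "".join(descriptors).encode("utf-8")
    stripped = "".join(
        strip_fields(d, drop_keywords) for d in descriptors
    ).encode("utf-8")
    full_size = len(zlib.compress(full))
    stripped_size = len(zlib.compress(stripped))
    return 1.0 - stripped_size / full_size
```

  Calling compressed_savings(descriptors, {"read-history", "write-history"})
  on a real list of descriptors would give a figure comparable to the one
  quoted above; the result depends entirely on the input data.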

  One possible solution here is that routers should generate and upload a
  short-form and long-form descriptor.  Only the short-form descriptor should
  ever be used by anybody for routing.  The long-form descriptor should be
  used only for analytics and other tools.  (If we allowed people to route
  with long descriptors, we'd have to ensure that they stayed in sync with
  the short ones somehow.  So let's not do that.)  We can ensure that the
  short descriptors are used by only recommending those in the network
  statuses.
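
  As a rough sketch of the long/short split, a router could derive the short
  form by leaving out the history fields.  The function name is
  hypothetical, the example descriptor is made up, and signing is omitted
  entirely (each form would presumably need its own signature).

```python
# Fields the proposal identifies as unused by clients.
LONG_ONLY_KEYWORDS = {"read-history", "write-history"}


def split_descriptor(descriptor_text):
    """Return (short_form, long_only_lines) for a descriptor given as text.

    Assumption: each field occupies one line that starts with its keyword.
    Signing of either form is not modeled here.
    """
    short_lines = []
    long_only = []
    for line in descriptor_text.splitlines():
        parts = line.split(None, 1)
        keyword = parts[0] if parts else ""
        if keyword in LONG_ONLY_KEYWORDS:
            long_only.append(line)
        else:
            short_lines.append(line)
    return "\n".join(short_lines) + "\n", long_only


# Made-up descriptor, for illustration only.
example = (
    "router example 192.0.2.1 9001 0 0\n"
    "bandwidth 1000 2000 1500\n"
    "read-history 2007-01-01 00:00:00 (900 s) 100,200\n"
    "write-history 2007-01-01 00:00:00 (900 s) 300,400\n"
    "reject *:*\n"
)
short_form, long_only = split_descriptor(example)
```

  Here the long form is simply the original text, and short_form is what
  clients would fetch for routing.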

  Another possible solution would be to drop these fields from descriptors,
  and have them uploaded as a part of a separate "bandwidth report" to the
  authorities.  This could help prevent the mistake of using long descriptors
  in the place of short ones. It could also be generalized later to be an
  overall status report, to include sanitized GeoIP information and whatever
  else comes up.

Other disposable fields:

  Clients don't need these fields, but removing them doesn't help bandwidth
  enough to be worthwhile.
    contact (save about 1%)
    fingerprint (save about 3%)

  We could represent these fields more succinctly, but removing them would
  only save 1%.  (!)
    reject
    accept
  (Apparently, exit policies are highly compressible.)

  [Does size-on-disk matter to anybody? Some clients and servers don't
   have much disk, or have really slow disk (e.g. USB). And we don't
   store caches compressed right now. -RD]

Issues:

  Indexing long descriptors or bandwidth reports presents an issue: right now
  the way to make sure you have the same copy of a descriptor as everyone
  else is to request the descriptor by its digest, and to make sure that
  the digest you request is the one that the authorities like.

  Authorities should presumably list the digests of short descriptors, since
  that's what most everybody will be using.  Including a second digest for
  long descriptors/bandwidth reports in the networkstatus would only bloat it
  with information nobody wants.

  Possible solutions are:
   1) Drop the property that you can be sure of having the same long
      descriptor as others.  This seems suboptimal, but if nobody caches
      long descriptors so you have to go to the authority to get them,
      maybe it's not so bad.
   2) Have a separate extra-information-status that also gets generated by the
      authorities; use it to tell which long descriptors others have.  Also a
      pain.
   3) Have short descriptors include a hash of the corresponding long
      descriptor/extra-info.  This would keep nearly all of the size
      reduction (~59.2% savings as opposed to 61% savings).
      This would require longdesc/extra-info downloaders to fetch
      router data before they could know which longdescs/extra info to fetch.
   4) Have each authority make a signed concatenated "extra info" document,
      and hope we never need to reconcile them.
   5) ????
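
  Solution 3 can be sketched as follows.  This is a minimal illustration
  only: the "extra-info-digest" keyword, the choice of SHA-1, and the
  function names are assumptions made for the example, not something this
  proposal specifies.

```python
import hashlib


def extra_info_digest(extra_info_text):
    """Hex digest of the long descriptor/extra-info text (SHA-1 assumed)."""
    return hashlib.sha1(extra_info_text.encode("utf-8")).hexdigest().upper()


def add_digest_line(short_descriptor, extra_info_text):
    """Append a (hypothetical) digest line to an unsigned short descriptor."""
    digest = extra_info_digest(extra_info_text)
    return short_descriptor + "extra-info-digest " + digest + "\n"


def matches(short_descriptor, extra_info_text):
    """Check that the extra-info text is the one the short descriptor names."""
    for line in short_descriptor.splitlines():
        if line.startswith("extra-info-digest "):
            return line.split()[1] == extra_info_digest(extra_info_text)
    return False
```

  A downloader would first fetch the short descriptor, read the digest line,
  and only then know which long descriptor/extra-info to request, which is
  the ordering cost noted in solution 3.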

Migration:

  For long/short descriptor approach:
     * First:
       * Authorities should accept both, now, and silently drop short
         descriptors.
       * Routers should upload both once authorities accept them.
       * There should be a "long descriptor" URL named
         /tor/server/fp-detailed/ and the current "normal" URL.
         Authorities should serve long descriptors from both URLs.
         There's no such thing as asking for a long descriptor by
         its digest.
     * Once tools that want long descriptors support fetching them from the
       "long descriptor" URL:
       * Have authorities remember short descriptors, and serve them from the
         'normal' URL.
       These tools include:
         lefkada's exit.py script.
         tor26's noreply script and general directory cache.
         https://nighteffect.us/tns/ for its graphs
         and check with or-talk for the rest, once it's time.

  For bandwidth info approach:
     * First:
       * Rename it; it won't be just bandwidth forever.
       * Authorities should accept bandwidth info
       * Routers should upload bandwidth info once authorities accept it.
       * There should be a way to download bandwidth info
     * Once tools that want bandwidth info support fetching it:
       * Have routers stop including bandwidth info in their router
         descriptors.

Discussion:

  Solution 4 seems like a nice plan: in many cases, the external services
  that use read-history and write-history are directory authorities
  themselves, so they just use their local opinion.

  Roger thinks we should go with the long/short descriptor plan, along
  with solution 4. We don't want to just upload a bandwidth message,
  because that involves new data structures for every new piece of
  information we decide to upload. I suspect we'll realize once this
  is deployed that there is other info we want to put in the long
  descriptors.

  This won't solve the future sanitized GeoIP uploading question, but
  who knows where we'll actually want to send that data, and whether
  we'll want to handle it with the same privacy constraints as this data,
  so let's not try to solve that yet.

  However, we may still need some basic reconciling algorithms between
  authorities -- otherwise, if a router uploads to four authorities
  and fails to reach the fifth, then that fifth will never have the new
  descriptor. This will mean that the best strategy for external tools
  is to fetch full concatenated-style long-descriptor lists from every
  single authority, and merge them locally. So each authority should
  periodically fetch the list from the others and take the new ones.
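
  The merge step described above might look like the sketch below, whether
  it is run by an external tool or by an authority pulling from its peers.
  The tuple layout (fingerprint, published time, text) and the function name
  are assumptions for illustration; published times are assumed to compare
  correctly as given (for example "YYYY-MM-DD HH:MM:SS" strings).

```python
def merge_long_descriptors(per_authority_lists):
    """Merge long-descriptor lists, keeping the newest entry per router.

    per_authority_lists: an iterable of lists, one per authority, whose
    items are (fingerprint, published, text) tuples.
    Returns a dict mapping fingerprint -> (published, text).
    """
    newest = {}
    for entries in per_authority_lists:
        for fingerprint, published, text in entries:
            current = newest.get(fingerprint)
            # Keep this entry if the router is new to us or it is more recent.
            if current is None or published > current[0]:
                newest[fingerprint] = (published, text)
    return newest
```

  With this rule, an authority that missed an upload picks up the newer
  long descriptor the next time it fetches the lists from the others.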