aboutsummaryrefslogtreecommitdiff
path: root/doc/spec/proposals/166-statistics-extra-info-docs.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/spec/proposals/166-statistics-extra-info-docs.txt')
-rw-r--r--doc/spec/proposals/166-statistics-extra-info-docs.txt391
1 files changed, 391 insertions, 0 deletions
diff --git a/doc/spec/proposals/166-statistics-extra-info-docs.txt b/doc/spec/proposals/166-statistics-extra-info-docs.txt
new file mode 100644
index 0000000000..ab2716a71c
--- /dev/null
+++ b/doc/spec/proposals/166-statistics-extra-info-docs.txt
@@ -0,0 +1,391 @@
+Filename: 166-statistics-extra-info-docs.txt
+Title: Including Network Statistics in Extra-Info Documents
+Author: Karsten Loesing
+Created: 21-Jul-2009
+Target: 0.2.2
+Status: Accepted
+
+Change history:
+
+ 21-Jul-2009 Initial proposal for or-dev
+
+
+Overview:
+
+ The Tor network has grown to almost two thousand relays and millions
+ of casual users over the past few years. With growth has come
+ increasing performance problems and attempts by some countries to
+ block access to the Tor network. In order to address these problems,
+ we need to learn more about the Tor network. This proposal suggests to
+ measure additional statistics and include them in extra-info documents
+ to help us understand the Tor network better.
+
+
+Introduction:
+
+ As of May 2009, relays, bridges, and directories gather the following
+ data for statistical purposes:
+
+ - Relays and bridges count the number of bytes that they have pushed
+ in 15-minute intervals over the past 24 hours. Relays and bridges
+ include these data in extra-info documents that they send to the
+ directory authorities whenever they publish their server descriptor.
+
+ - Bridges further include a rough number of clients per country that
+ they have seen in the past 48 hours in their extra-info documents.
+
+ - Directories can be configured to count the number of clients they
+ see per country in the past 24 hours and to write them to a local
+ file.
+
+ Since then we extended the network statistics in Tor. These statistics
+ include:
+
+ - Directories now gather more precise statistics about connecting
+ clients. Fixes include measuring in intervals of exactly 24 hours,
+ counting unsuccessful requests, measuring download times, etc. The
+ directories append their statistics to a local file every 24 hours.
+
+ - Entry guards count the number of clients per country per day like
+ bridges do and write them to a local file every 24 hours.
+
+ - Relays measure statistics of the number of cells in their circuit
+ queues and how much time these cells spend waiting there. Relays
+ write these statistics to a local file every 24 hours.
+
+ - Exit nodes count the number of read and written bytes on exit
+ connections per port as well as the number of opened exit streams
+ per port in 24-hour intervals. Exit nodes write their statistics to
+ a local file.
+
+ The following four sections contain descriptions for adding these
+ statistics to the relays' extra-info documents.
+
+
+Directory request statistics:
+
+ The first type of statistics aims at measuring directory requests sent
+ by clients to a directory mirror or directory authority. More
+ precisely, these statistics aim at requests for v2 and v3 network
+ statuses only. These directory requests are sent non-anonymously,
+ either via HTTP-like requests to a directory's Dir port or tunneled
+ over a 1-hop circuit.
+
+ Measuring directory request statistics is useful for several reasons:
+ First, the number of locally seen directory requests can be used to
+ estimate the total number of clients in the Tor network. Second, the
+ country-wise classification of requests using a GeoIP database can
+ help counting the relative and absolute number of users per country.
+ Third, the download times can give hints on the available bandwidth
+ capacity at clients.
+
+ Directory requests do not give any hints on the contents that clients
+ send or receive over the Tor network. Every client requests network
+ statuses from the directories, so that there are no anonymity-related
+ concerns to gather these statistics. It might be, though, that clients
+ wish to hide the fact that they are connecting to the Tor network.
+ Therefore, IP addresses are resolved to country codes in memory,
+ events are accumulated over 24 hours, and numbers are rounded up to
+ multiples of 4 or 8.
+
+ "dirreq-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
+ [At most once.]
+
+ YYYY-MM-DD HH:MM:SS defines the end of the included measurement
+ interval of length NSEC seconds (86400 seconds by default).
+
+ A "dirreq-stats-end" line, as well as any other "dirreq-*" line,
+ is only added when the relay has opened its Dir port and after 24
+ hours of measuring directory requests.
+
+ "dirreq-v2-ips" CC=N,CC=N,... NL
+ [At most once.]
+ "dirreq-v3-ips" CC=N,CC=N,... NL
+ [At most once.]
+
+ List of mappings from two-letter country codes to the number of
+ unique IP addresses that have connected from that country to
+ request a v2/v3 network status, rounded up to the nearest multiple
+ of 8. Only those IP addresses are counted that the directory can
+ answer with a 200 OK status code.
+
+ "dirreq-v2-reqs" CC=N,CC=N,... NL
+ [At most once.]
+ "dirreq-v3-reqs" CC=N,CC=N,... NL
+ [At most once.]
+
+ List of mappings from two-letter country codes to the number of
+ requests for v2/v3 network statuses from that country, rounded up
+ to the nearest multiple of 8. Only those requests are counted that
+ the directory can answer with a 200 OK status code.
+
+ "dirreq-v2-share" num% NL
+ [At most once.]
+ "dirreq-v3-share" num% NL
+ [At most once.]
+
+ The share of v2/v3 network status requests that the directory
+ expects to receive from clients based on its advertised bandwidth
+ compared to the overall network bandwidth capacity. Shares are
+ formatted in percent with two decimal places. Shares are
+ calculated as means over the whole 24-hour interval.
+
+ "dirreq-v2-resp" status=num,... NL
+ [At most once.]
+ "dirreq-v3-resp" status=nul,... NL
+ [At most once.]
+
+ List of mappings from response statuses to the number of requests
+ for v2/v3 network statuses that were answered with that response
+ status, rounded up to the nearest multiple of 4. Only response
+ statuses with at least 1 response are reported. New response
+ statuses can be added at any time. The current list of response
+ statuses is as follows:
+
+ "ok": a network status request is answered; this number
+ corresponds to the sum of all requests as reported in
+ "dirreq-v2-reqs" or "dirreq-v3-reqs", respectively, before
+ rounding up.
+ "not-enough-sigs: a version 3 network status is not signed by a
+ sufficient number of requested authorities.
+ "unavailable": a requested network status object is unavailable.
+ "not-found": a requested network status is not found.
+ "not-modified": a network status has not been modified since the
+ If-Modified-Since time that is included in the request.
+ "busy": the directory is busy.
+
+ "dirreq-v2-direct-dl" key=val,... NL
+ [At most once.]
+ "dirreq-v3-direct-dl" key=val,... NL
+ [At most once.]
+ "dirreq-v2-tunneled-dl" key=val,... NL
+ [At most once.]
+ "dirreq-v3-tunneled-dl" key=val,... NL
+ [At most once.]
+
+ List of statistics about possible failures in the download process
+ of v2/v3 network statuses. Requests are either "direct"
+ HTTP-encoded requests over the relay's directory port, or
+ "tunneled" requests using a BEGIN_DIR cell over the relay's OR
+ port. The list of possible statistics can change, and statistics
+ can be left out from reporting. The current list of statistics is
+ as follows:
+
+ Successful downloads and failures:
+
+ "complete": a client has finished the download successfully.
+ "timeout": a download did not finish within 10 minutes after
+ starting to send the response.
+ "running": a download is still running at the end of the
+ measurement period for less than 10 minutes after starting to
+ send the response.
+
+ Download times:
+
+ "min", "max": smallest and largest measured bandwidth in B/s.
+ "d[1-4,6-9]": 1st to 4th and 6th to 9th decile of measured
+ bandwidth in B/s. For a given decile i, i/10 of all downloads
+ had a smaller bandwidth than di, and (10-i)/10 of all downloads
+ had a larger bandwidth than di.
+ "q[1,3]": 1st and 3rd quartile of measured bandwidth in B/s. One
+ fourth of all downloads had a smaller bandwidth than q1, one
+ fourth of all downloads had a larger bandwidth than q3, and the
+ remaining half of all downloads had a bandwidth between q1 and
+ q3.
+ "md": median of measured bandwidth in B/s. Half of the downloads
+ had a smaller bandwidth than md, the other half had a larger
+ bandwidth than md.
+
+
+Entry guard statistics:
+
+ Entry guard statistics include the number of clients per country and
+ per day that are connecting directly to an entry guard.
+
+ Entry guard statistics are important to learn more about the
+ distribution of clients to countries. In the future, this knowledge
+ can be useful to detect if there are or start to be any restrictions
+ for clients connecting from specific countries.
+
+ The information which client connects to a given entry guard is very
+ sensitive. This information must not be combined with the information
+ what contents are leaving the network at the exit nodes. Therefore,
+ entry guard statistics need to be aggregated to prevent them from
+ becoming useful for de-anonymization. Aggregation includes resolving
+ IP addresses to country codes, counting events over 24-hour intervals,
+ and rounding up numbers to the next multiple of 8.
+
+ "entry-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
+ [At most once.]
+
+ YYYY-MM-DD HH:MM:SS defines the end of the included measurement
+ interval of length NSEC seconds (86400 seconds by default).
+
+ An "entry-stats-end" line, as well as any other "entry-*"
+ line, is first added after the relay has been running for at least
+ 24 hours.
+
+ "entry-ips" CC=N,CC=N,... NL
+ [At most once.]
+
+ List of mappings from two-letter country codes to the number of
+ unique IP addresses that have connected from that country to the
+ relay and which are no known other relays, rounded up to the
+ nearest multiple of 8.
+
+
+Cell statistics:
+
+ The third type of statistics have to do with the time that cells spend
+ in circuit queues. In order to gather these statistics, the relay
+ memorizes when it puts a given cell in a circuit queue and when this
+ cell is flushed. The relay further notes the life time of the circuit.
+ These data are sufficient to determine the mean number of cells in a
+ queue over time and the mean time that cells spend in a queue.
+
+ Cell statistics are necessary to learn more about possible reasons for
+ the poor network performance of the Tor network, especially high
+ latencies. The same statistics are also useful to determine the
+ effects of design changes by comparing today's data with future data.
+
+ There are basically no privacy concerns from measuring cell
+ statistics, regardless of a node being an entry, middle, or exit node.
+
+ "cell-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
+ [At most once.]
+
+ YYYY-MM-DD HH:MM:SS defines the end of the included measurement
+ interval of length NSEC seconds (86400 seconds by default).
+
+ A "cell-stats-end" line, as well as any other "cell-*" line,
+ is first added after the relay has been running for at least 24
+ hours.
+
+ "cell-processed-cells" num,...,num NL
+ [At most once.]
+
+ Mean number of processed cells per circuit, subdivided into
+ deciles of circuits by the number of cells they have processed in
+ descending order from loudest to quietest circuits.
+
+ "cell-queued-cells" num,...,num NL
+ [At most once.]
+
+ Mean number of cells contained in queues by circuit decile. These
+ means are calculated by 1) determining the mean number of cells in
+ a single circuit between its creation and its termination and 2)
+ calculating the mean for all circuits in a given decile as
+ determined in "cell-processed-cells". Numbers have a precision of
+ two decimal places.
+
+ "cell-time-in-queue" num,...,num NL
+ [At most once.]
+
+ Mean time cells spend in circuit queues in milliseconds. Times are
+ calculated by 1) determining the mean time cells spend in the
+ queue of a single circuit and 2) calculating the mean for all
+ circuits in a given decile as determined in
+ "cell-processed-cells".
+
+ "cell-circuits-per-decile" num NL
+ [At most once.]
+
+ Mean number of circuits that are included in any of the deciles,
+ rounded up to the next integer.
+
+
+Exit statistics:
+
+ The last type of statistics affects exit nodes counting the number of
+ bytes written and read and the number of streams opened per port and
+ per 24 hours. Exit port statistics can be measured from looking at
+ headers of BEGIN and DATA cells. A BEGIN cell contains the exit port
+ that is required for the exit node to open a new exit stream.
+ Subsequent DATA cells coming from the client or being sent back to the
+ client contain a length field stating how many bytes of application
+ data are contained in the cell.
+
+ Exit port statistics are important to measure in order to identify
+ possible load-balancing problems with respect to exit policies. Exit
+ nodes that permit more ports than others are very likely overloaded
+ with traffic for those ports plus traffic for other ports. Improving
+ load balancing in the Tor network improves the overall utilization of
+ bandwidth capacity.
+
+ Exit traffic is one of the most sensitive parts of network data in the
+ Tor network. Even though these statistics do not require looking at
+ traffic contents, statistics are aggregated so that they are not
+ useful for de-anonymizing users. Only those ports are reported that
+ have seen at least 0.1% of exiting or incoming bytes, numbers of bytes
+ are rounded up to full kibibytes (KiB), and stream numbers are rounded
+ up to the next multiple of 4.
+
+ "exit-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL
+ [At most once.]
+
+ YYYY-MM-DD HH:MM:SS defines the end of the included measurement
+ interval of length NSEC seconds (86400 seconds by default).
+
+ An "exit-stats-end" line, as well as any other "exit-*" line, is
+ first added after the relay has been running for at least 24 hours
+ and only if the relay permits exiting (where exiting to a single
+ port and IP address is sufficient).
+
+ "exit-kibibytes-written" port=N,port=N,... NL
+ [At most once.]
+ "exit-kibibytes-read" port=N,port=N,... NL
+ [At most once.]
+
+ List of mappings from ports to the number of kibibytes that the
+ relay has written to or read from exit connections to that port,
+ rounded up to the next full kibibyte.
+
+ "exit-streams-opened" port=N,port=N,... NL
+ [At most once.]
+
+ List of mappings from ports to the number of opened exit streams
+ to that port, rounded up to the nearest multiple of 4.
+
+
+Implementation notes:
+
+ Right now, relays that are configured accordingly write similar
+ statistics to those described in this proposal to disk every 24 hours.
+ With this proposal being implemented, relays include the contents of
+ these files in extra-info documents.
+
+ The following steps are necessary to implement this proposal:
+
+ 1. The current format of [dirreq|entry|buffer|exit]-stats files needs
+ to be adapted to the description in this proposal. This step
+ basically means renaming keywords.
+
+ 2. The timing of writing the four *-stats files should be unified, so
+ that they are written exactly 24 hours after starting the
+ relay. Right now, the measurement intervals for dirreq, entry, and
+ exit stats starts with the first observed request, and files are
+ written when observing the first request that occurs more than 24
+ hours after the beginning of the measurement interval. With this
+ proposal, the measurement intervals should all start at the same
+ time, and files should be written exactly 24 hours later.
+
+ 3. It is advantageous to cache statistics in local files in the data
+ directory until they are included in extra-info documents. The
+ reason is that the 24-hour measurement interval can be very
+ different from the 18-hour publication interval of extra-info
+ documents. When a relay crashes after finishing a measurement
+ interval, but before publishing the next extra-info document,
+ statistics would get lost. Therefore, statistics are written to
+ disk when finishing a measurement interval and read from disk when
+ generating an extra-info document. Only the statistics that were
+ appended to the *-stats files within the past 24 hours are included
+ in extra-info documents. Further, the contents of the *-stats files
+ need to be checked in the process of generating extra-info documents.
+
+ 4. With the statistics patches being tested, the ./configure options
+ should be removed and the statistics code be compiled by default.
+ It is still required for relay operators to add configuration
+ options (DirReqStatistics, ExitPortStatistics, etc.) to enable
+ gathering statistics. However, in the near future, statistics shall
+ be enabled gathered by all relays by default, where requiring a
+ ./configure option would be a barrier for many relay operators.