diff options
Diffstat (limited to 'doc/spec/proposals/166-statistics-extra-info-docs.txt')
-rw-r--r-- | doc/spec/proposals/166-statistics-extra-info-docs.txt | 391 |
1 files changed, 0 insertions, 391 deletions
diff --git a/doc/spec/proposals/166-statistics-extra-info-docs.txt b/doc/spec/proposals/166-statistics-extra-info-docs.txt deleted file mode 100644 index ab2716a71c..0000000000 --- a/doc/spec/proposals/166-statistics-extra-info-docs.txt +++ /dev/null @@ -1,391 +0,0 @@ -Filename: 166-statistics-extra-info-docs.txt -Title: Including Network Statistics in Extra-Info Documents -Author: Karsten Loesing -Created: 21-Jul-2009 -Target: 0.2.2 -Status: Accepted - -Change history: - - 21-Jul-2009 Initial proposal for or-dev - - -Overview: - - The Tor network has grown to almost two thousand relays and millions - of casual users over the past few years. With growth has come - increasing performance problems and attempts by some countries to - block access to the Tor network. In order to address these problems, - we need to learn more about the Tor network. This proposal suggests to - measure additional statistics and include them in extra-info documents - to help us understand the Tor network better. - - -Introduction: - - As of May 2009, relays, bridges, and directories gather the following - data for statistical purposes: - - - Relays and bridges count the number of bytes that they have pushed - in 15-minute intervals over the past 24 hours. Relays and bridges - include these data in extra-info documents that they send to the - directory authorities whenever they publish their server descriptor. - - - Bridges further include a rough number of clients per country that - they have seen in the past 48 hours in their extra-info documents. - - - Directories can be configured to count the number of clients they - see per country in the past 24 hours and to write them to a local - file. - - Since then we extended the network statistics in Tor. These statistics - include: - - - Directories now gather more precise statistics about connecting - clients. Fixes include measuring in intervals of exactly 24 hours, - counting unsuccessful requests, measuring download times, etc. The - directories append their statistics to a local file every 24 hours. - - - Entry guards count the number of clients per country per day like - bridges do and write them to a local file every 24 hours. - - - Relays measure statistics of the number of cells in their circuit - queues and how much time these cells spend waiting there. Relays - write these statistics to a local file every 24 hours. - - - Exit nodes count the number of read and written bytes on exit - connections per port as well as the number of opened exit streams - per port in 24-hour intervals. Exit nodes write their statistics to - a local file. - - The following four sections contain descriptions for adding these - statistics to the relays' extra-info documents. - - -Directory request statistics: - - The first type of statistics aims at measuring directory requests sent - by clients to a directory mirror or directory authority. More - precisely, these statistics aim at requests for v2 and v3 network - statuses only. These directory requests are sent non-anonymously, - either via HTTP-like requests to a directory's Dir port or tunneled - over a 1-hop circuit. - - Measuring directory request statistics is useful for several reasons: - First, the number of locally seen directory requests can be used to - estimate the total number of clients in the Tor network. Second, the - country-wise classification of requests using a GeoIP database can - help counting the relative and absolute number of users per country. - Third, the download times can give hints on the available bandwidth - capacity at clients. - - Directory requests do not give any hints on the contents that clients - send or receive over the Tor network. Every client requests network - statuses from the directories, so that there are no anonymity-related - concerns to gather these statistics. It might be, though, that clients - wish to hide the fact that they are connecting to the Tor network. - Therefore, IP addresses are resolved to country codes in memory, - events are accumulated over 24 hours, and numbers are rounded up to - multiples of 4 or 8. - - "dirreq-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL - [At most once.] - - YYYY-MM-DD HH:MM:SS defines the end of the included measurement - interval of length NSEC seconds (86400 seconds by default). - - A "dirreq-stats-end" line, as well as any other "dirreq-*" line, - is only added when the relay has opened its Dir port and after 24 - hours of measuring directory requests. - - "dirreq-v2-ips" CC=N,CC=N,... NL - [At most once.] - "dirreq-v3-ips" CC=N,CC=N,... NL - [At most once.] - - List of mappings from two-letter country codes to the number of - unique IP addresses that have connected from that country to - request a v2/v3 network status, rounded up to the nearest multiple - of 8. Only those IP addresses are counted that the directory can - answer with a 200 OK status code. - - "dirreq-v2-reqs" CC=N,CC=N,... NL - [At most once.] - "dirreq-v3-reqs" CC=N,CC=N,... NL - [At most once.] - - List of mappings from two-letter country codes to the number of - requests for v2/v3 network statuses from that country, rounded up - to the nearest multiple of 8. Only those requests are counted that - the directory can answer with a 200 OK status code. - - "dirreq-v2-share" num% NL - [At most once.] - "dirreq-v3-share" num% NL - [At most once.] - - The share of v2/v3 network status requests that the directory - expects to receive from clients based on its advertised bandwidth - compared to the overall network bandwidth capacity. Shares are - formatted in percent with two decimal places. Shares are - calculated as means over the whole 24-hour interval. - - "dirreq-v2-resp" status=num,... NL - [At most once.] - "dirreq-v3-resp" status=nul,... NL - [At most once.] - - List of mappings from response statuses to the number of requests - for v2/v3 network statuses that were answered with that response - status, rounded up to the nearest multiple of 4. Only response - statuses with at least 1 response are reported. New response - statuses can be added at any time. The current list of response - statuses is as follows: - - "ok": a network status request is answered; this number - corresponds to the sum of all requests as reported in - "dirreq-v2-reqs" or "dirreq-v3-reqs", respectively, before - rounding up. - "not-enough-sigs: a version 3 network status is not signed by a - sufficient number of requested authorities. - "unavailable": a requested network status object is unavailable. - "not-found": a requested network status is not found. - "not-modified": a network status has not been modified since the - If-Modified-Since time that is included in the request. - "busy": the directory is busy. - - "dirreq-v2-direct-dl" key=val,... NL - [At most once.] - "dirreq-v3-direct-dl" key=val,... NL - [At most once.] - "dirreq-v2-tunneled-dl" key=val,... NL - [At most once.] - "dirreq-v3-tunneled-dl" key=val,... NL - [At most once.] - - List of statistics about possible failures in the download process - of v2/v3 network statuses. Requests are either "direct" - HTTP-encoded requests over the relay's directory port, or - "tunneled" requests using a BEGIN_DIR cell over the relay's OR - port. The list of possible statistics can change, and statistics - can be left out from reporting. The current list of statistics is - as follows: - - Successful downloads and failures: - - "complete": a client has finished the download successfully. - "timeout": a download did not finish within 10 minutes after - starting to send the response. - "running": a download is still running at the end of the - measurement period for less than 10 minutes after starting to - send the response. - - Download times: - - "min", "max": smallest and largest measured bandwidth in B/s. - "d[1-4,6-9]": 1st to 4th and 6th to 9th decile of measured - bandwidth in B/s. For a given decile i, i/10 of all downloads - had a smaller bandwidth than di, and (10-i)/10 of all downloads - had a larger bandwidth than di. - "q[1,3]": 1st and 3rd quartile of measured bandwidth in B/s. One - fourth of all downloads had a smaller bandwidth than q1, one - fourth of all downloads had a larger bandwidth than q3, and the - remaining half of all downloads had a bandwidth between q1 and - q3. - "md": median of measured bandwidth in B/s. Half of the downloads - had a smaller bandwidth than md, the other half had a larger - bandwidth than md. - - -Entry guard statistics: - - Entry guard statistics include the number of clients per country and - per day that are connecting directly to an entry guard. - - Entry guard statistics are important to learn more about the - distribution of clients to countries. In the future, this knowledge - can be useful to detect if there are or start to be any restrictions - for clients connecting from specific countries. - - The information which client connects to a given entry guard is very - sensitive. This information must not be combined with the information - what contents are leaving the network at the exit nodes. Therefore, - entry guard statistics need to be aggregated to prevent them from - becoming useful for de-anonymization. Aggregation includes resolving - IP addresses to country codes, counting events over 24-hour intervals, - and rounding up numbers to the next multiple of 8. - - "entry-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL - [At most once.] - - YYYY-MM-DD HH:MM:SS defines the end of the included measurement - interval of length NSEC seconds (86400 seconds by default). - - An "entry-stats-end" line, as well as any other "entry-*" - line, is first added after the relay has been running for at least - 24 hours. - - "entry-ips" CC=N,CC=N,... NL - [At most once.] - - List of mappings from two-letter country codes to the number of - unique IP addresses that have connected from that country to the - relay and which are no known other relays, rounded up to the - nearest multiple of 8. - - -Cell statistics: - - The third type of statistics have to do with the time that cells spend - in circuit queues. In order to gather these statistics, the relay - memorizes when it puts a given cell in a circuit queue and when this - cell is flushed. The relay further notes the life time of the circuit. - These data are sufficient to determine the mean number of cells in a - queue over time and the mean time that cells spend in a queue. - - Cell statistics are necessary to learn more about possible reasons for - the poor network performance of the Tor network, especially high - latencies. The same statistics are also useful to determine the - effects of design changes by comparing today's data with future data. - - There are basically no privacy concerns from measuring cell - statistics, regardless of a node being an entry, middle, or exit node. - - "cell-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL - [At most once.] - - YYYY-MM-DD HH:MM:SS defines the end of the included measurement - interval of length NSEC seconds (86400 seconds by default). - - A "cell-stats-end" line, as well as any other "cell-*" line, - is first added after the relay has been running for at least 24 - hours. - - "cell-processed-cells" num,...,num NL - [At most once.] - - Mean number of processed cells per circuit, subdivided into - deciles of circuits by the number of cells they have processed in - descending order from loudest to quietest circuits. - - "cell-queued-cells" num,...,num NL - [At most once.] - - Mean number of cells contained in queues by circuit decile. These - means are calculated by 1) determining the mean number of cells in - a single circuit between its creation and its termination and 2) - calculating the mean for all circuits in a given decile as - determined in "cell-processed-cells". Numbers have a precision of - two decimal places. - - "cell-time-in-queue" num,...,num NL - [At most once.] - - Mean time cells spend in circuit queues in milliseconds. Times are - calculated by 1) determining the mean time cells spend in the - queue of a single circuit and 2) calculating the mean for all - circuits in a given decile as determined in - "cell-processed-cells". - - "cell-circuits-per-decile" num NL - [At most once.] - - Mean number of circuits that are included in any of the deciles, - rounded up to the next integer. - - -Exit statistics: - - The last type of statistics affects exit nodes counting the number of - bytes written and read and the number of streams opened per port and - per 24 hours. Exit port statistics can be measured from looking at - headers of BEGIN and DATA cells. A BEGIN cell contains the exit port - that is required for the exit node to open a new exit stream. - Subsequent DATA cells coming from the client or being sent back to the - client contain a length field stating how many bytes of application - data are contained in the cell. - - Exit port statistics are important to measure in order to identify - possible load-balancing problems with respect to exit policies. Exit - nodes that permit more ports than others are very likely overloaded - with traffic for those ports plus traffic for other ports. Improving - load balancing in the Tor network improves the overall utilization of - bandwidth capacity. - - Exit traffic is one of the most sensitive parts of network data in the - Tor network. Even though these statistics do not require looking at - traffic contents, statistics are aggregated so that they are not - useful for de-anonymizing users. Only those ports are reported that - have seen at least 0.1% of exiting or incoming bytes, numbers of bytes - are rounded up to full kibibytes (KiB), and stream numbers are rounded - up to the next multiple of 4. - - "exit-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL - [At most once.] - - YYYY-MM-DD HH:MM:SS defines the end of the included measurement - interval of length NSEC seconds (86400 seconds by default). - - An "exit-stats-end" line, as well as any other "exit-*" line, is - first added after the relay has been running for at least 24 hours - and only if the relay permits exiting (where exiting to a single - port and IP address is sufficient). - - "exit-kibibytes-written" port=N,port=N,... NL - [At most once.] - "exit-kibibytes-read" port=N,port=N,... NL - [At most once.] - - List of mappings from ports to the number of kibibytes that the - relay has written to or read from exit connections to that port, - rounded up to the next full kibibyte. - - "exit-streams-opened" port=N,port=N,... NL - [At most once.] - - List of mappings from ports to the number of opened exit streams - to that port, rounded up to the nearest multiple of 4. - - -Implementation notes: - - Right now, relays that are configured accordingly write similar - statistics to those described in this proposal to disk every 24 hours. - With this proposal being implemented, relays include the contents of - these files in extra-info documents. - - The following steps are necessary to implement this proposal: - - 1. The current format of [dirreq|entry|buffer|exit]-stats files needs - to be adapted to the description in this proposal. This step - basically means renaming keywords. - - 2. The timing of writing the four *-stats files should be unified, so - that they are written exactly 24 hours after starting the - relay. Right now, the measurement intervals for dirreq, entry, and - exit stats starts with the first observed request, and files are - written when observing the first request that occurs more than 24 - hours after the beginning of the measurement interval. With this - proposal, the measurement intervals should all start at the same - time, and files should be written exactly 24 hours later. - - 3. It is advantageous to cache statistics in local files in the data - directory until they are included in extra-info documents. The - reason is that the 24-hour measurement interval can be very - different from the 18-hour publication interval of extra-info - documents. When a relay crashes after finishing a measurement - interval, but before publishing the next extra-info document, - statistics would get lost. Therefore, statistics are written to - disk when finishing a measurement interval and read from disk when - generating an extra-info document. Only the statistics that were - appended to the *-stats files within the past 24 hours are included - in extra-info documents. Further, the contents of the *-stats files - need to be checked in the process of generating extra-info documents. - - 4. With the statistics patches being tested, the ./configure options - should be removed and the statistics code be compiled by default. - It is still required for relay operators to add configuration - options (DirReqStatistics, ExitPortStatistics, etc.) to enable - gathering statistics. However, in the near future, statistics shall - be enabled gathered by all relays by default, where requiring a - ./configure option would be a barrier for many relay operators. |