From d26c11be77bdd379e3c4709b4b3cc2403b1448bd Mon Sep 17 00:00:00 2001 From: Karsten Loesing Date: Thu, 23 Jul 2009 10:59:00 -0400 Subject: Proposal: Including Network Statistics in Extra-Info Documents --- proposals/166-statistics-extra-info-docs.txt | 391 +++++++++++++++++++++++++++ 1 file changed, 391 insertions(+) create mode 100644 proposals/166-statistics-extra-info-docs.txt (limited to 'proposals/166-statistics-extra-info-docs.txt') diff --git a/proposals/166-statistics-extra-info-docs.txt b/proposals/166-statistics-extra-info-docs.txt new file mode 100644 index 0000000..3716c04 --- /dev/null +++ b/proposals/166-statistics-extra-info-docs.txt @@ -0,0 +1,391 @@ +Filename: 166-statistics-extra-info-docs.txt +Title: Including Network Statistics in Extra-Info Documents +Author: Karsten Loesing +Created: 21-Jul-2009 +Target: 0.2.2 +Status: Open + +Change history: + + 21-Jul-2009 Initial proposal for or-dev + + +Overview: + + The Tor network has grown to almost two thousand relays and millions + of casual users over the past few years. With growth has come + increasing performance problems and attempts by some countries to + block access to the Tor network. In order to address these problems, + we need to learn more about the Tor network. This proposal suggests to + measure additional statistics and include them in extra-info documents + to help us understand the Tor network better. + + +Introduction: + + As of May 2009, relays, bridges, and directories gather the following + data for statistical purposes: + + - Relays and bridges count the number of bytes that they have pushed + in 15-minute intervals over the past 24 hours. Relays and bridges + include these data in extra-info documents that they send to the + directory authorities whenever they publish their server descriptor. + + - Bridges further include a rough number of clients per country that + they have seen in the past 48 hours in their extra-info documents. + + - Directories can be configured to count the number of clients they + see per country in the past 24 hours and to write them to a local + file. + + Since then we extended the network statistics in Tor. These statistics + include: + + - Directories now gather more precise statistics about connecting + clients. Fixes include measuring in intervals of exactly 24 hours, + counting unsuccessful requests, measuring download times, etc. The + directories append their statistics to a local file every 24 hours. + + - Entry guards count the number of clients per country per day like + bridges do and write them to a local file every 24 hours. + + - Relays measure statistics of the number of cells in their circuit + queues and how much time these cells spend waiting there. Relays + write these statistics to a local file every 24 hours. + + - Exit nodes count the number of read and written bytes on exit + connections per port as well as the number of opened exit streams + per port in 24-hour intervals. Exit nodes write their statistics to + a local file. + + The following four sections contain descriptions for adding these + statistics to the relays' extra-info documents. + + +Directory request statistics: + + The first type of statistics aims at measuring directory requests sent + by clients to a directory mirror or directory authority. More + precisely, these statistics aim at requests for v2 and v3 network + statuses only. These directory requests are sent non-anonymously, + either via HTTP-like requests to a directory's Dir port or tunneled + over a 1-hop circuit. + + Measuring directory request statistics is useful for several reasons: + First, the number of locally seen directory requests can be used to + estimate the total number of clients in the Tor network. Second, the + country-wise classification of requests using a GeoIP database can + help counting the relative and absolute number of users per country. + Third, the download times can give hints on the available bandwidth + capacity at clients. + + Directory requests do not give any hints on the contents that clients + send or receive over the Tor network. Every client requests network + statuses from the directories, so that there are no anonymity-related + concerns to gather these statistics. It might be, though, that clients + wish to hide the fact that they are connecting to the Tor network. + Therefore, IP addresses are resolved to country codes in memory, + events are accumulated over 24 hours, and numbers are rounded up to + multiples of 4 or 8. + + "dirreq-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL + [At most once.] + + YYYY-MM-DD HH:MM:SS defines the end of the included measurement + interval of length NSEC seconds (86400 seconds by default). + + A "dirreq-stats-end" line, as well as any other "dirreq-*" line, + is only added when the relay has opened its Dir port and after 24 + hours of measuring directory requests. + + "dirreq-v2-ips" CC=N,CC=N,... NL + [At most once.] + "dirreq-v3-ips" CC=N,CC=N,... NL + [At most once.] + + List of mappings from two-letter country codes to the number of + unique IP addresses that have connected from that country to + request a v2/v3 network status, rounded up to the nearest multiple + of 8. Only those IP addresses are counted that the directory can + answer with a 200 OK status code. + + "dirreq-v2-reqs" CC=N,CC=N,... NL + [At most once.] + "dirreq-v3-reqs" CC=N,CC=N,... NL + [At most once.] + + List of mappings from two-letter country codes to the number of + requests for v2/v3 network statuses from that country, rounded up + to the nearest multiple of 8. Only those requests are counted that + the directory can answer with a 200 OK status code. + + "dirreq-v2-share" num% NL + [At most once.] + "dirreq-v3-share" num% NL + [At most once.] + + The share of v2/v3 network status requests that the directory + expects to receive from clients based on its advertised bandwidth + compared to the overall network bandwidth capacity. Shares are + formatted in percent with two decimal places. Shares are + calculated as means over the whole 24-hour interval. + + "dirreq-v2-resp" status=num,... NL + [At most once.] + "dirreq-v3-resp" status=nul,... NL + [At most once.] + + List of mappings from response statuses to the number of requests + for v2/v3 network statuses that were answered with that response + status, rounded up to the nearest multiple of 4. Only response + statuses with at least 1 response are reported. New response + statuses can be added at any time. The current list of response + statuses is as follows: + + "ok": a network status request is answered; this number + corresponds to the sum of all requests as reported in + "dirreq-v2-reqs" or "dirreq-v3-reqs", respectively, before + rounding up. + "not-enough-sigs: a version 3 network status is not signed by a + sufficient number of requested authorities. + "unavailable": a requested network status object is unavailable. + "not-found": a requested network status is not found. + "not-modified": a network status has not been modified since the + If-Modified-Since time that is included in the request. + "busy": the directory is busy. + + "dirreq-v2-direct-dl" key=val,... NL + [At most once.] + "dirreq-v3-direct-dl" key=val,... NL + [At most once.] + "dirreq-v2-tunneled-dl" key=val,... NL + [At most once.] + "dirreq-v3-tunneled-dl" key=val,... NL + [At most once.] + + List of statistics about possible failures in the download process + of v2/v3 network statuses. Requests are either "direct" + HTTP-encoded requests over the relay's directory port, or + "tunneled" requests using a BEGIN_DIR cell over the relay's OR + port. The list of possible statistics can change, and statistics + can be left out from reporting. The current list of statistics is + as follows: + + Successful downloads and failures: + + "complete": a client has finished the download successfully. + "timeout": a download did not finish within 10 minutes after + starting to send the response. + "running": a download is still running at the end of the + measurement period for less than 10 minutes after starting to + send the response. + + Download times: + + "min", "max": smallest and largest measured bandwidth in B/s. + "d[1-4,6-9]": 1st to 4th and 6th to 9th decile of measured + bandwidth in B/s. For a given decile i, i/10 of all downloads + had a smaller bandwidth than di, and (10-i)/10 of all downloads + had a larger bandwidth than di. + "q[1,3]": 1st and 3rd quartile of measured bandwidth in B/s. One + fourth of all downloads had a smaller bandwidth than q1, one + fourth of all downloads had a larger bandwidth than q3, and the + remaining half of all downloads had a bandwidth between q1 and + q3. + "md": median of measured bandwidth in B/s. Half of the downloads + had a smaller bandwidth than md, the other half had a larger + bandwidth than md. + + +Entry guard statistics: + + Entry guard statistics include the number of clients per country and + per day that are connecting directly to an entry guard. + + Entry guard statistics are important to learn more about the + distribution of clients to countries. In the future, this knowledge + can be useful to detect if there are or start to be any restrictions + for clients connecting from specific countries. + + The information which client connects to a given entry guard is very + sensitive. This information must not be combined with the information + what contents are leaving the network at the exit nodes. Therefore, + entry guard statistics need to be aggregated to prevent them from + becoming useful for de-anonymization. Aggregation includes resolving + IP addresses to country codes, counting events over 24-hour intervals, + and rounding up numbers to the next multiple of 8. + + "entry-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL + [At most once.] + + YYYY-MM-DD HH:MM:SS defines the end of the included measurement + interval of length NSEC seconds (86400 seconds by default). + + An "entry-stats-end" line, as well as any other "entry-*" + line, is first added after the relay has been running for at least + 24 hours. + + "entry-ips" CC=N,CC=N,... NL + [At most once.] + + List of mappings from two-letter country codes to the number of + unique IP addresses that have connected from that country to the + relay and which are no known other relays, rounded up to the + nearest multiple of 8. + + +Cell statistics: + + The third type of statistics have to do with the time that cells spend + in circuit queues. In order to gather these statistics, the relay + memorizes when it puts a given cell in a circuit queue and when this + cell is flushed. The relay further notes the life time of the circuit. + These data are sufficient to determine the mean number of cells in a + queue over time and the mean time that cells spend in a queue. + + Cell statistics are necessary to learn more about possible reasons for + the poor network performance of the Tor network, especially high + latencies. The same statistics are also useful to determine the + effects of design changes by comparing today's data with future data. + + There are basically no privacy concerns from measuring cell + statistics, regardless of a node being an entry, middle, or exit node. + + "cell-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL + [At most once.] + + YYYY-MM-DD HH:MM:SS defines the end of the included measurement + interval of length NSEC seconds (86400 seconds by default). + + A "cell-stats-end" line, as well as any other "cell-*" line, + is first added after the relay has been running for at least 24 + hours. + + "cell-processed-cells" num,...,num NL + [At most once.] + + Mean number of processed cells per circuit, subdivided into + deciles of circuits by the number of cells they have processed in + descending order from loudest to quietest circuits. + + "cell-queued-cells" num,...,num NL + [At most once.] + + Mean number of cells contained in queues by circuit decile. These + means are calculated by 1) determining the mean number of cells in + a single circuit between its creation and its termination and 2) + calculating the mean for all circuits in a given decile as + determined in "cell-processed-cells". Numbers have a precision of + two decimal places. + + "cell-time-in-queue" num,...,num NL + [At most once.] + + Mean time cells spend in circuit queues in milliseconds. Times are + calculated by 1) determining the mean time cells spend in the + queue of a single circuit and 2) calculating the mean for all + circuits in a given decile as determined in + "cell-processed-cells". + + "cell-circuits-per-decile" num NL + [At most once.] + + Mean number of circuits that are included in any of the deciles, + rounded up to the next integer. + + +Exit statistics: + + The last type of statistics affects exit nodes counting the number of + bytes written and read and the number of streams opened per port and + per 24 hours. Exit port statistics can be measured from looking of + headers of BEGIN and DATA cells. A BEGIN cell contains the exit port + that is required for the exit node to open a new exit stream. + Subsequent DATA cells coming from the client or being sent back to the + client contain a length field stating how many bytes of application + data are contained in the cell. + + Exit port statistics are important to measure in order to identify + possible load-balancing problems with respect to exit policies. Exit + nodes that permit more ports than others are very likely overloaded + with traffic for those ports plus traffic for other ports. Improving + load balancing in the Tor network improves the overall utilization of + bandwidth capacity. + + Exit traffic is one of the most sensitive parts of network data in the + Tor network. Even though these statistics do not require looking at + traffic contents, statistics are aggregated so that they are not + useful for de-anonymizing users. Only those ports are reported that + have seen at least 0.1% of exiting or incoming bytes, numbers of bytes + are rounded up to full kibibytes (KiB), and stream numbers are rounded + up to the next multiple of 4. + + "exit-stats-end" YYYY-MM-DD HH:MM:SS (NSEC s) NL + [At most once.] + + YYYY-MM-DD HH:MM:SS defines the end of the included measurement + interval of length NSEC seconds (86400 seconds by default). + + An "exit-stats-end" line, as well as any other "exit-*" line, is + first added after the relay has been running for at least 24 hours + and only if the relay permits exiting (where exiting to a single + port and IP address is sufficient). + + "exit-kibibytes-written" port=N,port=N,... NL + [At most once.] + "exit-kibibytes-read" port=N,port=N,... NL + [At most once.] + + List of mappings from ports to the number of kibibytes that the + relay has written to or read from exit connections to that port, + rounded up to the next full kibibyte. + + "exit-streams-opened" port=N,port=N,... NL + [At most once.] + + List of mappings from ports to the number of opened exit streams + to that port, rounded up to the nearest multiple of 4. + + +Implementation notes: + + Right now, relays that are configured accordingly write similar + statistics to those described in this proposal to disk every 24 hours. + With this proposal being implemented, relays include the contents of + these files in extra-info documents. + + The following steps are necessary to implement this proposal: + + 1. The current format of [dirreq|entry|buffer|exit]-stats files needs + to be adapted to the description in this proposal. This step + basically means renaming keywords. + + 2. The timing of writing the four *-stats files should be unified, so + that they are written exactly after 24 hours after starting the + relay. Right now, the measurement intervals for dirreq, entry, and + exit stats starts with the first observed request, and files are + written when observing the first request that occurs more than 24 + hours after the beginning of the measurement interval. With this + proposal, the measurement intervals should all start at the same + time, and files should be written exactly 24 hours later. + + 3. It is advantageous to cache statistics in local files in the data + directory until they are included in extra-info documents. The + reason is that the 24-hour measurement interval can be very + different from the 18-hour publication interval of extra-info + documents. When a relay crashed after finishing a measurement + interval, but before publishing the next extra-info document, + statistics would get lost. Therefore, statistics are written to + disk when finishing a measurement interval and read from disk when + generating an extra-info document. As a result, the *-stats files + need to be overwritten after 24 hours, rather than appending new + statistics to them. Further, the contents of the *-stats files need + to be checked in the process of generating extra-info documents. + + 4. With the statistics patches being tested, the ./configure options + should be removed and the statistics code be compiled by default. + It is still required for relay operators to add configuration + options (DirReqStatistics, ExitPortStatistics, etc.) to enable + gathering statistics. However, in the near future, statistics shall + be enabled gathered by all relays by default, where requiring a + ./configure option would be a barrier for many relay operators. -- cgit v1.2.3-54-g00ecf