# BridgeDB specification

<a id="bridgedb-spec.txt-0"></a>

This document specifies how BridgeDB processes bridge descriptor files
to learn about new bridges, maintains persistent assignments of bridges
to distributors, and decides which bridges to give out upon user
requests.

Some of the decisions here may be suboptimal: this document is meant to
specify current behavior as of August 2013, not to specify ideal
behavior.

<a id="bridgedb-spec.txt-1"></a>

## Importing bridge network statuses and bridge descriptors { #importing }

BridgeDB learns about bridges by parsing bridge network statuses,
bridge descriptors, and extra-info documents as specified in Tor's
directory protocol. BridgeDB parses one bridge network status file
first, then at least one bridge descriptor file, and potentially one
extra-info file afterwards.

BridgeDB rescans its files on SIGHUP.

BridgeDB does not validate signatures on descriptors or network status
files: the operator needs to make sure that these documents have come
from a Tor instance that did the validation for us.

<a id="bridgedb-spec.txt-1.1"></a>

### Parsing bridge network statuses { #parsing-network-status }

Bridge network status documents contain the information of which
bridges are known to the bridge authority and which flags the bridge
authority assigns to them.
We expect bridge network statuses to contain at least the "r" and "s"
lines below for every bridge, in the given order, possibly with one or
more "a" lines between them (the format is fully specified in Tor's
directory protocol):

```text
  "r" SP nickname SP identity SP digest SP publication SP IP SP ORPort
      SP DirPort NL
  "a" SP address ":" port NL (no more than 8 instances)
  "s" SP Flags NL
```

BridgeDB parses the identity and the publication timestamp from the "r"
line, the OR address(es) and ORPort(s) from the "a" line(s), and the
assigned flags from the "s" line, specifically checking the assignment
of the "Running" and "Stable" flags.
BridgeDB memorizes all bridges that have the Running flag as the set of
running bridges that can be given out to bridge users.
BridgeDB memorizes assigned flags so that it can ensure that sets of
bridges given out contain at least a given number of bridges with these
flags.

<a id="bridgedb-spec.txt-1.2"></a>

### Parsing bridge descriptors { #parsing-bridge-descriptors }

BridgeDB learns about a bridge's most recent IP address and OR port
from parsing bridge descriptors.
In theory, both the IP address and OR port of a bridge are also
contained in the "r" line of the bridge network status, so there is no
mandatory reason for parsing bridge descriptors. But the functionality
described in this section is still implemented in case we need data
from the bridge descriptor in the future.

Bridge descriptor files may contain one or more bridge descriptors.
We expect a bridge descriptor to contain at least the following lines
in the stated order:

```text
  "@purpose" SP purpose NL
  "router" SP nickname SP IP SP ORPort SP SOCKSPort SP DirPort NL
  "published" SP timestamp NL
  ["opt" SP] "fingerprint" SP fingerprint NL
  "router-signature" NL Signature NL
```

BridgeDB parses the purpose, IP, ORPort, nickname, and fingerprint
from these lines.
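As a rough illustration, a minimal parser for exactly these fields
might look like the Python sketch below. This is not BridgeDB's actual
parser, which handles many more cases; the function name and dictionary
keys are invented for the example.

```python
def parse_bridge_descriptor(text):
    """Extract purpose, nickname, IP, ORPort, and fingerprint from a
    single bridge descriptor shaped like the format above."""
    desc = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "@purpose":
            desc["purpose"] = parts[1]
        elif parts[0] == "router":
            # "router" nickname IP ORPort SOCKSPort DirPort
            desc["nickname"] = parts[1]
            desc["ip"] = parts[2]
            desc["orport"] = int(parts[3])
        elif parts[0] == "published":
            # The timestamp is "YYYY-MM-DD HH:MM:SS", i.e. two tokens.
            desc["published"] = " ".join(parts[1:3])
        elif parts[0] == "fingerprint" or (
                parts[0] == "opt" and parts[1:2] == ["fingerprint"]):
            # The fingerprint may be prefixed with "opt" and is often
            # written as space-separated groups of hex digits.
            start = 2 if parts[0] == "opt" else 1
            desc["fingerprint"] = "".join(parts[start:])
    return desc
```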
BridgeDB skips bridge descriptors if the fingerprint is not contained
in the bridge network status parsed earlier or if the bridge does not
have the Running flag.
BridgeDB discards bridge descriptors which have a different purpose
than "bridge". BridgeDB can be configured to accept only descriptors
with another purpose, or to not discard descriptors based on purpose
at all.
BridgeDB memorizes the IP addresses and OR ports of the remaining
bridges.
If there is more than one bridge descriptor with the same fingerprint,
BridgeDB memorizes the IP address and OR port of the most recently
parsed bridge descriptor.
If BridgeDB does not find a bridge descriptor for a bridge contained in
the bridge network status parsed before, it does not add that bridge
to the set of bridges to be given out to bridge users.

<a id="bridgedb-spec.txt-1.3"></a>

### Parsing extra-info documents { #parsing-extra-info }

BridgeDB learns if a bridge supports a pluggable transport by parsing
extra-info documents.
Extra-info documents contain the name of the bridge (but only if it is
named), the bridge's fingerprint, the type of pluggable transport(s) it
supports, and the IP address and port number on which each transport
listens.

Extra-info documents may contain zero or more entries per bridge. We
expect an extra-info entry to contain the following lines in the stated
order:

```text
  "extra-info" SP name SP fingerprint NL
  "transport" SP transport SP IP ":" PORT ARGS NL
```

BridgeDB parses the fingerprint, transport type, IP address, port, and
any arguments that are specified on these lines. BridgeDB skips the
name. If the fingerprint is invalid, BridgeDB skips the entry.
BridgeDB memorizes the transport type, IP address, port number, and any
arguments that are provided, and then assigns them to the corresponding
bridge based on the fingerprint. Arguments are comma-separated and are
of the form k=v,k=v. Bridges that do not have an associated extra-info
entry are not invalid.

<a id="bridgedb-spec.txt-2"></a>

## Assigning bridges to distributors { #assigning-to-distributors }

A "distributor" is a mechanism by which bridges are given (or not
given) to clients. The current distributors are "email", "https",
and "unallocated".

BridgeDB assigns bridges to distributors based on an HMAC of the
bridge's ID and a secret, and makes these assignments persistent.
Persistence is achieved by using a database to map node ID to
distributor.
Each bridge is assigned to exactly one distributor (including
the "unallocated" distributor).
BridgeDB may be configured to support only a non-empty subset of the
distributors specified in this document.
BridgeDB may be configured to use different probabilities for assigning
new bridges to distributors.
BridgeDB does not change existing assignments of bridges to
distributors, even if probabilities for assigning bridges to
distributors change or distributors are disabled entirely.
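A minimal sketch of such a persistent, HMAC-based assignment is shown
below. The distributor ratios, the choice of SHA-256, and the database
interface are assumptions for illustration, not BridgeDB's actual
configuration or internals.

```python
import hmac
from hashlib import sha256

# Illustrative probabilities; the real values are operator-configured.
DISTRIBUTORS = [("https", 0.4), ("email", 0.4), ("unallocated", 0.2)]

def assign_distributor(bridge_id, secret, db):
    """Return the distributor for a bridge, assigning one persistently
    on first sight and never changing it afterwards."""
    if bridge_id in db:
        # Existing assignments survive probability changes and
        # disabled distributors.
        return db[bridge_id]
    digest = hmac.new(secret, bridge_id, sha256).digest()
    # Map the HMAC output to a point in [0, 1) and pick the
    # distributor whose probability interval contains it.
    point = int.from_bytes(digest[:8], "big") / 2**64
    upper = 0.0
    for name, share in DISTRIBUTORS:
        upper += share
        if point < upper:
            db[bridge_id] = name
            break
    else:
        db[bridge_id] = DISTRIBUTORS[-1][0]
    return db[bridge_id]
```

Because the choice depends only on the HMAC of the bridge ID and the
secret, re-importing the same bridge always yields the same
distributor, which is what makes the stored assignment stable.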
<a id="bridgedb-spec.txt-3"></a>

## Giving out bridges upon requests { #distributing }

Upon receiving a client request, a BridgeDB distributor provides a
subset of the bridges assigned to it.
BridgeDB only gives out bridges that are contained in the most recently
parsed bridge network status and that have the Running flag set (see
Section 1).
BridgeDB may be configured to give out a different number of bridges
(typically 4) depending on the distributor.
BridgeDB may define an arbitrary number of rules. These rules may
specify the criteria by which a bridge is selected. Specifically, the
available rules restrict the IP address version, OR port number,
transport type, bridge relay flag, or country in which the bridge
should not be blocked.

<a id="bridgedb-spec.txt-4"></a>

## Selecting bridges to be given out based on IP addresses { #ip-based }

BridgeDB may be configured to support one or more distributors which
give out bridges based on the requestor's IP address. Currently, this
is how the HTTPS distributor works.
The goal is to avoid handing out all the bridges to users in a similar
IP space and time.

> Someone else should look at proposals/ideas/old/xxx-bridge-disbursement
> to see if this section is missing relevant pieces from it. -KL

BridgeDB fixes the set of bridges to be returned for a defined time
period.
BridgeDB considers all IP addresses coming from the same /24 network
as the same IP address and returns the same set of bridges. From here
on, this non-unique address will be referred to as the IP address's
"area".
BridgeDB divides the IP address space equally into a small number of
disjoint clusters (typically 4) and returns different results for
requests coming from addresses that are placed into different clusters.

> Note, changed term from "areas" to "disjoint clusters" -MF

> I found that BridgeDB is not strict in returning only bridges for a
> given area. If a ring is empty, it considers the next one. Is this
> expected behavior? -KL
>
> This does not appear to be the case, anymore. If a ring is empty, then
> BridgeDB simply returns an empty set of bridges. -MF
>
> I also found that BridgeDB does not make the assignment to areas
> persistent in the database. So, if we change the number of rings, it
> will assign bridges to other rings. I assume this is okay? -KL

BridgeDB maintains a list of proxy IP addresses and returns the same
set of bridges to requests coming from these IP addresses.
The bridges returned to proxy IP addresses do not come from the same
set as those for the general IP address space.

BridgeDB can be configured to include bridge fingerprints in replies
along with bridge IP addresses and OR ports.
BridgeDB can be configured to display a CAPTCHA which the user must
solve prior to returning the requested bridges.

The current algorithm is as follows. An IP-based distributor splits
the bridges uniformly into a set of "rings" based on an HMAC of their
ID. Some of these rings are "area" rings for parts of IP space; some
are "category" rings for categories of IPs (like proxies). When a
client makes a request from an IP address, the distributor first sees
whether the IP address is in one of the categories it knows. If so,
the distributor returns a bridge from the category rings. If not, the
distributor maps the IP address into an "area" (that is, a /24), and
then uses an HMAC to map the area to one of the area rings.
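For example, the area computation and the HMAC-based mapping of an
area to an area ring might look like this sketch; the key and the use
of SHA-1 are assumptions, chosen to match the 160-bit values mentioned
below.

```python
import hmac
from hashlib import sha1

def area_of(ip):
    """Collapse an IPv4 address to its /24 'area', e.g.
    '203.0.113.55' -> '203.0.113.0/24'."""
    return ".".join(ip.split(".")[:3]) + ".0/24"

def ring_for_area(area, key, num_rings):
    """Map an area to one of the disjoint area rings via an HMAC of
    the area string."""
    digest = hmac.new(key, area.encode(), sha1).digest()
    return int.from_bytes(digest, "big") % num_rings

# ring_for_area(area_of("203.0.113.55"), b"example-key", 4) -> 0..3
```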
When the IP-based distributor determines from which area ring it is
handing out bridges, it identifies which rules it will use to choose
appropriate bridges. Using this information, it searches its cache of
rings for one that already adheres to the criteria specified in this
request. If one exists, then BridgeDB maps the current "epoch" (N-hour
period) and the IP address's area (/24) to a point on the ring based
on an HMAC, and hands out bridges at that point. If no ring exists
which satisfies this request, then a new ring is created and filled
with bridges that fulfill the requirements. This ring is then used to
select bridges as described.

"Mapping X to Y based on an HMAC" above means the following:

- We keep all of the elements of Y in some order, with a mapping
  from all 160-bit strings to positions in Y.
- We take an HMAC of X using some fixed string as a key to get a
  160-bit value. We then map that value to the next position of Y.

When giving out bridges based on a position in a ring, BridgeDB first
looks at flag requirements and port requirements. For example,
BridgeDB may be configured to "give out at least L bridges with port
443, at least M bridges with the Stable flag, and at most N bridges
total." To do this, BridgeDB combines the results of:

- the first L bridges in the ring after the position that have
  port 443,
- the first M bridges in the ring after the position that have the
  Stable flag and that it has not already decided to give out, and
- the first N-L-M bridges in the ring after the position that it
  has not already decided to give out.

After BridgeDB selects appropriate bridges to return to the requestor,
it prioritises their ordering in a list so that as many criteria as
possible are fulfilled within the first few bridges. This list is then
truncated to N bridges, if possible. N is currently defined as a
piecewise function of the number of bridges in the ring:

```text
      / 1, if len(ring) < 20
      |
  N = | 2, if 20 <= len(ring) <= 100
      |
      \ 3, if len(ring) > 100
```

The bridges in this sublist, containing no more than N bridges, are the
bridges returned to the requestor.
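The following sketch shows one way to implement this selection,
assuming the ring is an ordered list of bridge objects with
illustrative `port` and `flags` attributes; the prioritised reordering
and the piecewise truncation to N described above are left out for
brevity.

```python
def select_bridges(ring, start, L, M, N):
    """Pick at least L port-443 bridges, at least M Stable bridges,
    and at most N bridges total, scanning forward from `start`."""
    ordered = ring[start:] + ring[:start]  # walk the ring from `start`
    chosen = []

    # First L bridges after the position that listen on port 443.
    for b in ordered:
        if sum(1 for x in chosen if x.port == 443) >= L:
            break
        if b.port == 443:
            chosen.append(b)

    # First M bridges with the Stable flag not already chosen.
    for b in ordered:
        if sum(1 for x in chosen if "Stable" in x.flags) >= M:
            break
        if "Stable" in b.flags and b not in chosen:
            chosen.append(b)

    # Fill the remaining slots, up to N total, with any bridges
    # not already chosen.
    for b in ordered:
        if len(chosen) >= N:
            break
        if b not in chosen:
            chosen.append(b)

    return chosen[:N]
```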
<a id="bridgedb-spec.txt-5"></a>

## Selecting bridges to be given out based on email addresses

BridgeDB can be configured to support one or more distributors that
give out bridges based on the requestor's email address. Currently,
this is how the email distributor works.
The goal is to bootstrap on top of one or more popular email services'
Sybil-prevention mechanisms.

> Someone else should look at proposals/ideas/old/xxx-bridge-disbursement
> to see if this section is missing relevant pieces from it. -KL

BridgeDB rejects email addresses containing characters other than
those that RFC 2822 allows.
BridgeDB may be configured to reject email addresses containing other
characters it might not process correctly.

> I don't think we do this, is it worthwhile? -MF

BridgeDB rejects email addresses coming from domains other than a
configured set of permitted domains.
BridgeDB normalizes email addresses by removing "." characters and by
removing parts after the first "+" character.
BridgeDB can be configured to discard requests that do not have the
value "pass" in their X-DKIM-Authentication-Result header or that do
not have this header at all. The X-DKIM-Authentication-Result header
is set by the incoming mail stack that needs to check DKIM
authentication.

BridgeDB does not return a new set of bridges to the same email address
until a given time period (typically a few hours) has passed.

> Why don't we fix the bridges we give out for a global 3-hour time period
> like we do for IP addresses? This way we could avoid storing email
> addresses. -KL
>
> The 3-hour value is probably much too short anyway. If we take longer
> time values, then people get new bridges when bridges show up, as
> opposed to when we decide to reset the bridges we give them. (Yes, this
> problem exists for the IP distributor.) -NM
>
> I'm afraid I don't fully understand what you mean here. Can you
> elaborate? -KL
>
> Assuming an average churn rate, if we use short time periods, then a
> requestor will receive new bridges based on rate-limiting and will
> (likely) eventually work their way around the ring, eventually
> exhausting all bridges available to them from this distributor. If we
> use a longer time period, then each time the period expires there will
> be more bridges in the ring, thus reducing the likelihood of all
> bridges being blocked and increasing the time and effort required to
> enumerate all bridges. (This is my understanding, not from Nick.) -MF
>
> Also, we presently need the cache to prevent replays and because, if a
> user sent multiple requests with different criteria in each, we would
> otherwise leak additional bridges. -MF

BridgeDB can be configured to include bridge fingerprints in replies
along with bridge IP addresses and OR ports.
BridgeDB can be configured to sign all replies using a PGP signing key.
BridgeDB periodically discards old email-address-to-bridge mappings.
BridgeDB rejects overly frequent email requests coming from the same
normalized address.

To map previously unseen email addresses to a set of bridges, BridgeDB
proceeds as follows:

- It normalizes the email address as above, by stripping out dots,
  removing all of the localpart after the "+", and putting it all
  in lowercase. (Example: "John.Doe+bridges@example.COM" becomes
  "johndoe@example.com".)
- It maps an HMAC of the normalized address to a position on its ring
  of bridges (see the sketch below).
- It hands out bridges starting at that position, based on the
  port/flag requirements, as specified at the end of Section 4.

See Section 4 for the details of how bridges are selected from the ring
and returned to the requestor.
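A sketch of that normalization and ring-position mapping follows; the
HMAC key is an assumption, and SHA-1 again matches the 160-bit ring
positions of Section 4.

```python
import hmac
from hashlib import sha1

def normalize_email(addr):
    """Strip dots from the localpart, drop everything after '+',
    and lowercase: 'John.Doe+bridges@example.COM' ->
    'johndoe@example.com'."""
    local, _, domain = addr.partition("@")
    local = local.split("+", 1)[0].replace(".", "")
    return (local + "@" + domain).lower()

def ring_position(addr, key):
    """Map a normalized address to a 160-bit ring position."""
    digest = hmac.new(key, normalize_email(addr).encode(), sha1).digest()
    return int.from_bytes(digest, "big")

# Bridges are then handed out starting at the first bridge whose ring
# position follows ring_position(addr, key), as in Section 4.
```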
<a id="bridgedb-spec.txt-6"></a>

## Selecting unallocated bridges to be stored in file buckets { #unallocated-buckets }

> Kaner should have a look at this section. -NM

BridgeDB can be configured to reserve a subset of bridges and not give
them out via any of the distributors.
BridgeDB assigns reserved bridges to one or more file buckets of fixed
sizes and writes these file buckets to disk for manual distribution.
BridgeDB ensures that a file bucket always contains the requested
number of running bridges.
If the requested number of bridges in a file bucket is reduced or the
file bucket is no longer required, the unassigned bridges are returned
to the reserved set of bridges.
If a bridge stops running, BridgeDB replaces it with another bridge
from the reserved set of bridges.

> I'm not sure if there's a design bug in file buckets. What happens if
> we add a bridge X to file bucket A, and X goes offline? We would add
> another bridge Y to file bucket A. OK, but what if X comes back? We
> cannot put it back in file bucket A, because it's full. Are we going to
> add it to a different file bucket? Doesn't that mean that most bridges
> will be contained in most file buckets over time? -KL
>
> This should be handled the same as if the file bucket is reduced in
> size. If X returns, then it should be added to the appropriate
> distributor. -MF

<a id="bridgedb-spec.txt-7"></a>

## Displaying Bridge Information { #formatting }

After bridges are selected using one of the methods described in
Sections 4-6, they are output in one of two formats. Bridges are
formatted as:

`<address:port> NL`

Pluggable transports are formatted as:

`<transportname> SP <address:port> [SP arglist] NL`

where arglist is an optional space-separated list of key-value pairs in
the form of k=v.

Previously, each line was prepended with the "bridge" keyword, such as:

`"bridge" SP <address:port> NL`

`"bridge" SP <transportname> SP <address:port> [SP arglist] NL`

> We don't do this anymore because Vidalia and TorLauncher don't expect it.
> See the commit message for b70347a9c5fd769c6d5d0c0eb5171ace2999a736.

<a id="bridgedb-spec.txt-8"></a>

## Writing bridge assignments for statistics

BridgeDB can be configured to write bridge assignments to disk for
statistical analysis.
The start of a bridge assignment is marked by the following line:

`"bridge-pool-assignment" SP YYYY-MM-DD HH:MM:SS NL`

YYYY-MM-DD HH:MM:SS is the time, in UTC, when BridgeDB completed
loading new bridges and assigning them to distributors.

For every running bridge there is a line with the following format:

`fingerprint SP distributor (SP key "=" value)* NL`

The distributor is one of "email", "https", or "unallocated".

Both the "email" and "https" distributors support adding keys for
"port", "flag", and "transport"; the values are, respectively, the port
number, flag name, and transport type. These are used to indicate that
a bridge matches certain port, flag, or transport criteria of requests.

The "https" distributor also allows the key "ring" with a number as its
value to indicate to which IP address area ring the bridge is returned.

The "unallocated" distributor allows the key "bucket" with the file
bucket name as its value to indicate which file bucket a bridge is
assigned to.
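As an illustration of these rules, one section of a bridge assignment
file could look like the following; the fingerprints, timestamp, and
key values are made up for the example.

```text
bridge-pool-assignment 2013-08-01 16:00:00
0012345678ABCDEF0012345678ABCDEF00123456 https ring=2 port=443 flag=stable
FEDCBA9876543210FEDCBA9876543210FEDCBA98 email transport=obfs2
00FF00FF00FF00FF00FF00FF00FF00FF00FF00FF unallocated bucket=bucket1
```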