aboutsummaryrefslogtreecommitdiff
path: root/proposals/288-privcount-with-shamir.txt
diff options
context:
space:
mode:
authorNick Mathewson <nickm@torproject.org>2017-12-01 15:25:10 -0500
committerNick Mathewson <nickm@torproject.org>2017-12-01 15:25:10 -0500
commit8f56a246f8c2501e1c52594f804292ad6b7cca06 (patch)
tree9ec4075e85dd1bc7ad4c3261e5be58c797503377 /proposals/288-privcount-with-shamir.txt
parent4f809e03629cd372e5b6ab5abbbd213f0e70195f (diff)
downloadtorspec-8f56a246f8c2501e1c52594f804292ad6b7cca06.tar.gz
torspec-8f56a246f8c2501e1c52594f804292ad6b7cca06.zip
Add privcount-with-shamir proposal
Diffstat (limited to 'proposals/288-privcount-with-shamir.txt')
-rw-r--r--proposals/288-privcount-with-shamir.txt564
1 files changed, 564 insertions, 0 deletions
diff --git a/proposals/288-privcount-with-shamir.txt b/proposals/288-privcount-with-shamir.txt
new file mode 100644
index 0000000..4489048
--- /dev/null
+++ b/proposals/288-privcount-with-shamir.txt
@@ -0,0 +1,564 @@
+Filename: 288-privcount-with-shamir.txt
+Title: Privacy-Preserving Statistics with Privcount in Tor (Shamir version)
+Author: Nick Mathewson, Tim Wilson-Brown, Aaron Johnson
+Created: 1-Dec-2017
+Supercedes: 280
+Status: Draft
+
+0. Acknowledgments
+
+ Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented
+ the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended
+ PrivEx's differential privacy guarantees to multiple counters in
+ PrivCount:
+
+ https://github.com/privcount/privcount/blob/master/README.markdown#research-background
+
+ Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental
+ PrivCount code, based on the PrivEx secret-sharing variant. This
+ implementation includes contributions from the PrivEx authors, and
+ others:
+
+ https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown
+
+ This research was supported in part by NSF grants CNS-1111539,
+ CNS-1314637, CNS-1526306, CNS-1619454, and CNS-1640548.
+
+ The use of a Shamir secret-sharing-based approach is due to a
+ suggestion by Aaron Johnson (iirc); Carolin Zöbelein did some helpful
+ analysis here.
+
+ Aaron Johnson and Tim Wilson-Brown made improvements to the draft proposal.
+
+1. Introduction and scope
+
+ PrivCount is a privacy-preserving way to collect aggregate statistics
+ about the Tor network without exposing the statistics from any single
+ Tor relay.
+
+ This document describes the behavior of the in-Tor portion of the
+ PrivCount system. It DOES NOT describe the counter configurations,
+ or any other parts of the system. (These will be covered in separate
+ proposals.)
+
+2. PrivCount overview
+
+ Here follows an oversimplified summary of PrivCount, with enough
+ information to explain the Tor side of things. The actual operation
+ of the non-Tor components is trickier than described below.
+
+ In PrivCount, a Data Collector (DC, in this case a Tor relay) shares
+ numeric data with N different Tally Reporters (TRs). (A Tally Reporter
+ performs the summing and unblinding roles of the Tally Server and Share
+ Keeper from experimental PrivCount.)
+
+ All N Tally Reporters together can reconstruct the original data, but
+ no (N-1)-sized subset of the Tally Reporters can learn anything about
+ the data.
+
+ (In reality, the Tally Reporters don't reconstruct the original data
+ at all! Instead, they will reconstruct a _sum_ of the original data
+ across all participating relays.)
+
+ In brief, the system works as follow:
+
+ To share data, for each counter value V to be shared, the Data Collector
+ first adds Gaussian noise to V in order to produce V', uses (K,N) Shamir
+ secret-sharing to generate N shares of V' (K<=N, K being the
+ reconstruction threshold), encrypts each share to a different Tally
+ Reporter, and sends each encrypted share to the Tally Reporter it
+ is encrypted for.
+
+ The Tally Reporters then agree on the set S of Data Collectors that sent
+ data to all of them, and each Tally Reporter forms a share of the aggregate
+ value by decrypting the shares it received from the Data Collectors in S
+ and adding them together. The Tally Reporters then, collectively, perform
+ secret reconstruction, thereby learning the sum of all the different
+ values V'.
+
+ The use of Shamir secret sharing lets us survive up to N-K crashing TRs.
+ Waiting until the end to agree on a set S of surviving relays lets us
+ survive an arbitrary number of crashing DCs. In order to prevent bogus
+ data from corrupting the tally, the Tally Reporters can perform the
+ aggregation step multiple times, each time proceeding with a different
+ subset of S and taking the median of the resulting values.
+
+ Relay subsets should be chosen at random to avoid relays manipulating their
+ subset membership(s). If an shared random value is required, all relays must
+ submit their results, and then the next revealed shared random value can
+ be used to select relay subsets. (Tor's shared random value can be
+ calculated as soon as all commits have been revealed. So all relay results
+ must be received *before* any votes are cast in the reveal phase for that
+ shared random value.)
+
+ Below we describe the algorithm in more detail, and describe the data
+ format to use.
+
+3. The algorithm
+
+ All values below are B-bit integers modulo some prime P; we suggest
+ B=62 and P = 2**62 - 2**30 - 1 (hex 0x3fffffffbfffffff). The size of
+ this field is an upper limit on the largest sum we can calculate; it
+ is not a security parameter.
+
+ There are N Tally Reporters: every participating relay must agree on
+ which N exist, and on their current public keys. We suggest listing
+ them in the consensus networkstatus document. All parties must also
+ agree on some ordering the Tally Reporters. Similarly, all parties
+ must also agree on some value K<=N.
+
+ There are a number of well-known "counters", identified known by ASCII
+ identifiers. Each counter is a value that the participating relays
+ will know how to count. Let C be the number of counters.
+
+3.1. Data Collector (DC) side
+
+ At the start of each period, every Data Collector ("client" below)
+ initializes their state as follows
+
+ 1. For every Tally Reporter with index i, the client constructs a
+ random 32-byte random value SEED_i. The client then generates
+ a pseudorandom bitstream of C*B bits using the SHAKE-256
+ XOF with SEED_i as its input, and divides this stream into
+ C values, with the c'th value denoted by MASK(i, c).
+
+ [Because P is very close to a power of 2, nearly all seeds will
+ produce MASK values in range 0...(P-1). If any does not, the
+ client picks a new seed.]
+
+ 2. The client encrypts SEED_i using the public key of Tally
+ Reporter i, and remembers this encrypted value. It discards
+ SEED_i.
+
+ 3. For every counter c, the client generates a noise value Z_c
+ from an appropriate Gaussian distribution. If the noise value is
+ negative, the client adds P to bring Z_c into the range 0...(P-1).
+ (The noise MUST be sampled using the procedure in Appendix C.)
+
+ The client then uses Shamir secret sharing to generate
+ N shares (x,y) of Z_c, 1 <= x <= N, with the x'th share to be used by
+ the x'th Tally Reporter. See Appendix A for more on Shamir secret
+ sharing. See Appendix B for another idea about X coordinates.
+
+ The client picks a random value CTR_c and stores it in the counter,
+ which serves to locally blind the counter.
+
+ The client then subtracts (MASK(x, c)+CTR_c) from y, giving
+ "encrypted shares" of (x, y0) where y0 = y-CTR_c.
+
+ The client then discards all MASK values, all CTR values, and all
+ original shares (x,y), all CTR and the noise value Z_c. For each
+ counter c, it remembers CTR_c, and N shares of the form (x, y).
+
+ To increment a counter by some value "inc":
+
+ 1. The client adds "inc" to counter value, modulo P.
+
+ (This step is chosen to be optimal, since it will happen more
+ frequently than any other step in the computation.)
+
+ Aggregate counter values that are close to P/2 MUST be scaled to
+ avoid overflow. See Appendix D for more information. (We do not think
+ that any counters on the current Tor network will require scaling.)
+
+ To publish the counter values:
+
+ 1. The client publishes, in the format described below:
+
+ The list of counters it knows about
+ The list of TRs it knows about
+ For each TR:
+ For each counter c:
+ A list of (i, y-CTR_c-MASK(x,c)), which corresponds
+ to the share for the i'th TR of counter c.
+ SEED_i as encrypted earlier to the i'th TR's public key.
+
+3.2. Tally Reporter (TR) side
+
+ This section is less completely specified than the Data Collector's
+ behavior: I expect that the TRs will be easier to update as we proceed.
+
+ (Each TR has a long-term identity key (ed25519). It also has a
+ sequence of short-term curve25519 keys, each associated with a single
+ round of data collection.)
+
+ 1. When a group of TRs receives information from the Data Collectors,
+ they collectively chose a set S of DCs and a set of counters such
+ that every TR in the group has a valid entry for every counter,
+ from every DC in the set.
+
+ To be valid, an entry must not only be well-formed, but must also
+ have the x coordinate in its shares corresponding to the
+ TR's position in the list of TRs.
+
+ 2. For each Data Collector's report, the i'th TR decrypts its part of
+ the client's report using its curve25519 key. It uses SEED_i and
+ SHAKE-256 to regenerate MASK(0) through MASK(C-1). Then for each
+ share (x, y-CTR_c-MASK(x,c)) (note that x=i), the TR reconstructs the
+ true share of the value for that DC and counter c by adding
+ V+MASK(x,c) to the y coordinate to yield the share (x, y_final).
+
+ 3. For every counter in the set, each TR computes the sum of the
+ y_final values from all clients.
+
+ 4. For every counter in the set, each TR publishes its a share of
+ the sum as (x, SUM(y_final)).
+
+ 5. If at least K TRs publish correctly, then the sum can be
+ reconstructed using Lagrange polynomial interpolation. (See
+ Appendix A).
+
+ 6. If the reconstructed sum is greater than P/2, it is probably a negative
+ value. The value can be obtained by subtracting P from the sum.
+ (Negative values are generated when negative noise is added to small
+ signals.)
+
+ 7. If scaling has been applied, the sum is scaled by the scaling factor.
+ (See Appendix D.)
+
+4. The document format
+
+4.1. The counters document.
+
+ This document format builds on the line-based directory format used
+ for other tor documents, described in Tor's dir-spec.txt.
+
+ Using this format, we describe a "counters" document that publishes
+ the shares collected by a given DC, for a single TR.
+
+ The "counters" document has these elements:
+
+ "privctr-dump-format" SP VERSION SP SigningKey
+
+ [At start, exactly once]
+
+ Describes the version of the dump format, and provides an ed25519
+ signing key to identify the relay. The signing key is encoded in
+ base64 with padding stripped. VERSION is "alpha" now, but should
+ be "1" once this document is finalized.
+
+ "starting-at" SP IsoTime
+
+ [Exactly once]
+
+ The start of the time period when the statistics here were
+ collected.
+
+ "ending-at" SP IsoTime
+
+ [Exactly once]
+
+ The end of the time period when the statistics here were
+ collected.
+
+ "share-parameters" SP Number SP Number
+
+ [Exactly once]
+
+ The number of shares needed to reconstruct the client's
+ measurements (K), and the number of shares produced (N),
+ respectively.
+
+ "tally-reporter" SP Identifier SP Integer SP Key
+
+ [At least twice]
+
+ The curve25519 public key of each Tally Reporter that the relay
+ believes in. (If the list does not match the list of
+ participating Tally Reporters, they won't be able to find the
+ relay's values correctly.) The identifiers are non-space,
+ non-nul character sequences. The Key values are encoded in
+ base64 with padding stripped; they must be unique within each
+ counters document. The Integer values are the X coordinate of
+ the shares associated with each Tally Reporter.
+
+ "encrypted-to-key" SP Key
+
+ [Exactly once]
+
+ The curve25519 public key to which the report below is encrypted.
+ Note that it must match one of the Tally Reporter options above.
+
+
+ "report" NL
+ "----- BEGIN ENCRYPTED MESSAGE-----" NL
+ Base64Data
+ "----- END ENCRYPTED MESSAGE-----" NL
+
+ [Exactly once]
+
+ An encrypted document, encoded in base64. The plaintext format is
+ described in section 4.2. below. The encryption is as specified in
+ section 5 below, with STRING_CONSTANT set to "privctr-shares-v1".
+
+ "signature" SP Signature
+
+ [At end, exactly once]
+
+ The Ed25519 signature of all the fields in the document, from the
+ first byte, up to but not including the "signature" keyword here.
+ The signature is encoded in base64 with padding stripped.
+
+4.2. The encrypted "shares" document.
+
+ The shares document is sent, encrypted, in the "report" element above.
+ Its plaintext contents include these fields:
+
+ "encrypted-seed" NL
+ "----- BEGIN ENCRYPTED MESSAGE-----" NL
+ Base64Data
+ "----- END ENCRYPTED MESSAGE-----" NL
+
+ [At start, exactly once.]
+
+ An encrypted document, encoded in base64. The plaintext value is
+ the 32-byte value SEED_i for this TR. The encryption is as
+ specified in section 5 below, with STRING_CONSTANT set to
+ "privctr-seed-v1".
+
+ "d" SP Keyword SP Integer
+
+ [Any number of times]
+
+ For each counter, the name of the counter, and the obfuscated Y
+ coordinate of this TR's share for that counter. (The Y coordinate
+ is calculated as y-CTR_c as in 3.1 above.) The order of counters
+ must correspond to the order used when generating the MASK() values;
+ different clients do not need to choose the same order.
+
+5. Hybrid encryption
+
+ This scheme is taken from rend-spec-v3.txt, section 2.5.3, replacing
+ "secret_input" and "STRING_CONSTANT". It is a hybrid encryption
+ method for encrypting a message to a curve25519 public key PK.
+
+ We generate a new curve25519 keypair (sk,pk).
+
+ We run the algorithm of rend-spec-v3.txt 2.5.3, replacing
+ "secret_input" with Curve25519(sk,PK) | SigningKey, where
+ SigningKey is the DC's signing key. (Including the DC's SigningKey
+ here prevents one DC from replaying another one's data.)
+
+ We transmit the encrypted data as in rend-spec-v3.txt 2.5.3,
+ prepending pk.
+
+
+Appendix A. Shamir secret sharing for the impatient
+
+ In Shamir secret sharing, you want to split a value in a finite
+ field into N shares, such that any K of the N shares can
+ reconstruct the original value, but K-1 shares give you no
+ information at all.
+
+ The key insight here is that you can reconstruct a K-degree
+ polynomial given K+1 distinct points on its curve, but not given
+ K points.
+
+ So, to split a secret, we going to generate a (K-1)-degree
+ polynomial. We'll make the Y intercept of the polynomial be our
+ secret, and choose all the other coefficients at random from our
+ field.
+
+ Then we compute the (x,y) coordinates for x in [1, N]. Now we
+ have N points, any K of which can be used to find the original
+ polynomial.
+
+ Moreover, we can do what PrivCount wants here, because adding the
+ y coordinates of N shares gives us shares of the sum: If P1 is
+ the polynomial made to share secret A and P2 is the polynomial
+ made to share secret B, and if (x,y1) is on P1 and (x,y2) is on
+ P2, then (x,y1+y2) will be on P1+P2 ... and moreover, the y
+ intercept of P1+P2 will be A+B.
+
+ To reconstruct a secret from a set of shares, you have to either
+ go learn about Lagrange polynomials, or just blindly copy a
+ formula from your favorite source.
+
+ Here is such a formula, as pseudocode^Wpython, assuming that
+ each share is an object with a _x field and a _y field.
+
+ def interpolate(shares):
+ for sh in shares:
+ product_num = FE(1)
+ product_denom = FE(1)
+ for sh2 in shares:
+ if sh2 is sh:
+ continue
+ product_num *= sh2._x
+ product_denom *= (sh2._x - sh._x)
+
+ accumulator += (sh._y * product_num) / product_denom
+
+ return accumulator
+
+
+Appendix B. An alternative way to pick X coordinates
+
+ Above we describe a system where everybody knows the same TRs and
+ puts them in the same order, and then does Shamir secret sharing
+ using "x" as the x coordinate for the x'th TR.
+
+ But what if we remove that requirement by having x be based on a hash
+ of the public key of the TR? Everything would still work, so long as
+ all users chose the same K value. It would also let us migrate TR
+ sets a little more gracefully.
+
+
+Appendix C. Sampling floating-point Gaussian noise for differential privacy
+
+ Background:
+
+ When we add noise to a counter value (signal), we want the added noise to
+ protect all of the bits in the signal, to ensure differential privacy.
+
+ But because noise values are generated from random double(s) using
+ floating-point calculations, the resulting low bits are not distributed
+ evenly enough to ensure differential privacy.
+
+ As implemented in the C "double" type, IEEE 754 double-precision
+ floating-point numbers contain 53 significant bits in their mantissa. This
+ means that noise calculated using doubles can not ensure differential
+ privacy for client activity larger than 2**53:
+ * if the noise is scaled to the magnitude of the signal using
+ multiplication, then the low bits are unprotected,
+ * if the noise is not scaled, then the high bits are unprotected.
+
+ But the operations in the noise transform also suffer from floating-point
+ inaccuracy, further affecting the low bits in the mantissa. So we can only
+ protect client activity up to 2**46 with Laplacian noise. (We assume that
+ the limit for Gaussian noise is similar.)
+
+ Our noise generation procedure further reduces this limit to 2**42. For
+ byte counters, 2**42 is 4 Terabytes, or the observed bandwidth of a 1 Gbps
+ relay running at full speed for 9 hours. It may be several years before we
+ want to protect this much client activity. However, since the mitigation is
+ relatively simple, we specify that it MUST be implemented.
+
+ Procedure:
+
+ Data collectors MUST sample noise as follows:
+ 1. Generate random double(s) in [0, 1] that are integer multiples of
+ 2**-53.
+ TODO: the Gaussian transform in step 2 may require open intervals
+ 2. Generate a Gaussian floating-point noise value at random with sigma 1,
+ using the random double(s) generated in step 1.
+ 3. Multiply the floating-point noise by the floating-point sigma value.
+ 4. Truncate the scaled noise to an integer to remove the fractional bits.
+ (These bits can never correspond to signal bits, because PrivCount only
+ collects integer counters.)
+ 5. If the floating-point sigma value from step 3 is large enough that any
+ noise value could be greater than or equal to 2**46, we need to
+ randomise the low bits of the integer scaled noise value. (This ensures
+ that the low bits of the signal are always hidden by the noise.)
+
+ If we use the sample_unit_gaussian() transform in nickm/privcount_nm:
+ A. The maximum r value is sqrt(-2.0*ln(2**-53)) ~= 8.57, and the
+ maximal sin(theta) values are +/- 1.0. Therefore, the generated
+ noise values can be greater than or equal to 2**46 when the sigma
+ value is greater than 2**42.
+ B. Therefore, the number of low bits that need to be randomised is:
+ N = floor(sigma / 2**42)
+ C. We randomise the lowest N bits of the integer noise by replacing them
+ with a uniformly distributed N-bit integer value in 0...(2**N)-1.
+ 6. Add the integer noise to the integer counter, before the counter is
+ incremented in response to events. (This ensures that the signal value
+ is always protected.)
+
+ This procedure is security-sensitive: changing the order of
+ multiplications, truncations, or bit replacements can expose the low or
+ high bits of the signal or noise.
+
+ As long as the noise is sampled using this procedure, the low bits of the
+ signal are protected. So we do not need to "bin" any signals.
+
+ The impact of randomising more bits than necessary is minor, but if we fail
+ to randomise an unevenly distributed bit, client activity can be exposed.
+ Therefore, we choose to randomise all bits that could potentially be affected
+ by floating-point inaccuracy.
+
+ Justification:
+
+ Although this analysis applies to Laplacian noise, we assume a similar
+ analysis applies to Gaussian noise. (If we add Laplacian noise on DCs,
+ the total ends up with a Gaussian distribution anyway.)
+
+ TODO: check that the 2**46 limit applies to Gaussian noise.
+
+ This procedure results in a Gaussian distribution for the higher ~42 bits
+ of the noise. We can safely ignore the value of the lower bits of the noise,
+ because they are insignificant for our reporting.
+
+ This procedure is based on section 5.2 of:
+ "On Significance of the Least Significant Bits For Differential Privacy"
+ Ilya Mironov, ACM CCS 2012
+ https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/lsbs.pdf
+
+ We believe that this procedure is safe, because we neither round nor smooth
+ the noise values. The truncation in step 4 has the same effect as Mironov's
+ "safe snapping" procedure. Randomising the low bits removes the 2**46 limit
+ on the sigma value, at the cost of departing slightly from the ideal
+ infinite-precision Gaussian distribution. (But we already know that these
+ bits are distributed poorly, due to floating-point inaccuracy.)
+
+ Mironov's analysis assumes that a clamp() function is available to clamp
+ large signal and noise values to an infinite floating-point value.
+ Instead of clamping, PrivCount's arithmetic wraps modulo P. We believe that
+ this is safe, because any reported values this large will be meaningless
+ modulo P. And they will not expose any client activity, because "modulo P"
+ is an arithmetic transform of the summed noised signal value.
+
+ Alternatives:
+
+ We could round the encrypted value to the nearest multiple of the
+ unprotected bits. But this relies on the MASK() value being a uniformly
+ distributed random value, and it is less generic.
+
+ We could also simply fail when we reach the 2**42 limit on the sigma value,
+ but we do not want to design a system with a limit that low.
+
+ We could use a pure-integer transform to create Gaussian noise, and avoid
+ floating-point issues entirely. But we have not been able to find an
+ efficient pure-integer Gaussian or Laplacian noise transform. Nor do we
+ know if such a transform can be used to ensure differential privacy.
+
+
+Appendix D. Scaling large counters
+
+ We do not believe that scaling will be necessary to collect PrivCount
+ statistics in Tor. As of November 2017, the Tor network advertises a
+ capacity of 200 Gbps, or 2**51 bytes per day. We can measure counters as
+ large as ~2**61 before reaching the P/2 counter limit.
+
+ If scaling becomes necessary, we can scale event values (and noise sigmas)
+ by a scaling factor before adding them to the counter. Scaling may introduce
+ a bias in the final result, but this should be insignificant for reporting.
+
+
+Appendix Z. Remaining client-side uncertainties
+
+ [These are the uncertainties at the client side. I'm not considering
+ TR-only operations here unless they affect clients.]
+
+ Should we do a multi-level thing for the signing keys? That is, have
+ an identity key for each TR and each DC, and use those to sign
+ short-term keys?
+
+ How to tell the DCs the parameters of the system, including:
+ - who the TRs are, and what their keys are?
+ - what the counters are, and how much noise to add to each?
+ - how do we impose a delay when the noise parameters change?
+ (this delay ensures differential privacy even when the old and new
+ counters are compared)
+ - or should we try to monotonically increase counter noise?
+ - when the collection intervals start and end?
+ - what happens in networks where some relays report some counters, and
+ other relays report other counters?
+ - do we just pick the latest counter version, as long as enough relays
+ support it?
+ (it's not safe to report multiple copies of counters)
+
+ How the TRs agree on which DCs' counters to collect?
+
+ How data is uploaded to DCs?
+
+ What to say about persistence on the DC side?