Add privcount-with-shamir proposal

author: Nick Mathewson <nickm@torproject.org> 2017-12-01 15:25:10 -0500
committer: Nick Mathewson <nickm@torproject.org> 2017-12-01 15:25:10 -0500
commit: 8f56a246f8c2501e1c52594f804292ad6b7cca06 (patch)
tree: 9ec4075e85dd1bc7ad4c3261e5be58c797503377 /proposals/288-privcount-with-shamir.txt
parent: 4f809e03629cd372e5b6ab5abbbd213f0e70195f (diff)
download: torspec-8f56a246f8c2501e1c52594f804292ad6b7cca06.tar.gz
torspec-8f56a246f8c2501e1c52594f804292ad6b7cca06.zip
1 files changed, 564 insertions, 0 deletions
diff --git a/proposals/288-privcount-with-shamir.txt b/proposals/288-privcount-with-shamir.txt
new file mode 100644
index 0000000..4489048
--- /dev/null
+++ b/proposals/288-privcount-with-shamir.txt
@@ -0,0 +1,564 @@
+Filename: 288-privcount-with-shamir.txt
+Title: Privacy-Preserving Statistics with Privcount in Tor (Shamir version)
+Author: Nick Mathewson, Tim Wilson-Brown, Aaron Johnson
+Created: 1-Dec-2017
+Supercedes: 280
+Status: Draft
+
+0. Acknowledgments
+
+  Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented
+  the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended
+  PrivEx's differential privacy guarantees to multiple counters in
+  PrivCount:
+
+  https://github.com/privcount/privcount/blob/master/README.markdown#research-background
+
+  Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental
+  PrivCount code, based on the PrivEx secret-sharing variant. This
+  implementation includes contributions from the PrivEx authors, and
+  others:
+
+  https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown
+
+  This research was supported in part by NSF grants CNS-1111539,
+  CNS-1314637, CNS-1526306, CNS-1619454, and CNS-1640548.
+
+  The use of a Shamir secret-sharing-based approach is due to a
+  suggestion by Aaron Johnson (iirc); Carolin Zöbelein did some helpful
+  analysis here.
+
+  Aaron Johnson and Tim Wilson-Brown made improvements to the draft proposal.
+
+1. Introduction and scope
+
+  PrivCount is a privacy-preserving way to collect aggregate statistics
+  about the Tor network without exposing the statistics from any single
+  Tor relay.
+
+  This document describes the behavior of the in-Tor portion of the
+  PrivCount system.  It DOES NOT describe the counter configurations,
+  or any other parts of the system. (These will be covered in separate
+  proposals.)
+
+2. PrivCount overview
+
+  Here follows an oversimplified summary of PrivCount, with enough
+  information to explain the Tor side of things.  The actual operation
+  of the non-Tor components is trickier than described below.
+
+  In PrivCount, a Data Collector (DC, in this case a Tor relay) shares
+  numeric data with N different Tally Reporters (TRs). (A Tally Reporter
+  performs the summing and unblinding roles of the Tally Server and Share
+  Keeper from experimental PrivCount.)
+
+  All N Tally Reporters together can reconstruct the original data, but
+  no (N-1)-sized subset of the Tally Reporters can learn anything about
+  the data.
+
+  (In reality, the Tally Reporters don't reconstruct the original data
+  at all! Instead, they will reconstruct a _sum_ of the original data
+  across all participating relays.)
+
+  In brief, the system works as follow:
+
+  To share data, for each counter value V to be shared, the Data Collector
+  first adds Gaussian noise to V in order to produce V', uses (K,N) Shamir
+  secret-sharing to generate N shares of V' (K<=N, K being the
+  reconstruction threshold), encrypts each share to a different Tally
+  Reporter, and sends each encrypted share to the Tally Reporter it
+  is encrypted for.
+
+  The Tally Reporters then agree on the set S of Data Collectors that sent
+  data to all of them, and each Tally Reporter forms a share of the aggregate
+  value by decrypting the shares it received from the Data Collectors in S
+  and adding them together. The Tally Reporters then, collectively, perform
+  secret reconstruction, thereby learning the sum of all the different
+  values V'.
+
+  The use of Shamir secret sharing lets us survive up to N-K crashing TRs.
+  Waiting until the end to agree on a set S of surviving relays lets us
+  survive an arbitrary number of crashing DCs. In order to prevent bogus
+  data from corrupting the tally, the Tally Reporters can perform the
+  aggregation step multiple times, each time proceeding with a different
+  subset of S and taking the median of the resulting values.
+
+  Relay subsets should be chosen at random to avoid relays manipulating their
+  subset membership(s). If an shared random value is required, all relays must
+  submit their results, and then the next revealed shared random value can
+  be used to select relay subsets. (Tor's shared random value can be
+  calculated as soon as all commits have been revealed. So all relay results
+  must be received *before* any votes are cast in the reveal phase for that
+  shared random value.)
+
+  Below we describe the algorithm in more detail, and describe the data
+  format to use.
+
+3. The algorithm
+
+  All values below are B-bit integers modulo some prime P; we suggest
+  B=62 and P = 2**62 - 2**30 - 1 (hex 0x3fffffffbfffffff).  The size of
+  this field is an upper limit on the largest sum we can calculate; it
+  is not a security parameter.
+
+  There are N Tally Reporters: every participating relay must agree on
+  which N exist, and on their current public keys.  We suggest listing
+  them in the consensus networkstatus document.  All parties must also
+  agree on some ordering the Tally Reporters.  Similarly, all parties
+  must also agree on some value K<=N.
+
+  There are a number of well-known "counters", identified known by ASCII
+  identifiers.  Each counter is a value that the participating relays
+  will know how to count.  Let C be the number of counters.
+
+3.1. Data Collector (DC) side
+
+  At the start of each period, every Data Collector ("client" below)
+  initializes their state as follows
+
+      1. For every Tally Reporter with index i, the client constructs a
+         random 32-byte random value SEED_i.  The client then generates
+         a pseudorandom bitstream of C*B bits using the SHAKE-256
+         XOF with SEED_i as its input, and divides this stream into
+         C values, with the c'th value denoted by MASK(i, c).
+
+         [Because P is very close to a power of 2, nearly all seeds will
+         produce MASK values in range 0...(P-1).  If any does not, the
+         client picks a new seed.]
+
+      2. The client encrypts SEED_i using the public key of Tally
+         Reporter i, and remembers this encrypted value.  It discards
+         SEED_i.
+
+      3. For every counter c, the client generates a noise value Z_c
+         from an appropriate Gaussian distribution. If the noise value is
+         negative, the client adds P to bring Z_c into the range 0...(P-1).
+         (The noise MUST be sampled using the procedure in Appendix C.)
+
+         The client then uses Shamir secret sharing to generate
+         N shares (x,y) of Z_c, 1 <= x <= N, with the x'th share to be used by
+         the x'th Tally Reporter.  See Appendix A for more on Shamir secret
+         sharing.  See Appendix B for another idea about X coordinates.
+
+         The client picks a random value CTR_c and stores it in the counter,
+         which serves to locally blind the counter.
+
+         The client then subtracts (MASK(x, c)+CTR_c) from y, giving
+         "encrypted shares" of (x, y0) where y0 = y-CTR_c.
+
+         The client then discards all MASK values, all CTR values, and all
+         original shares (x,y), all CTR and the noise value Z_c. For each
+         counter c, it remembers CTR_c, and N shares of the form (x, y).
+
+  To increment a counter by some value "inc":
+
+      1. The client adds "inc" to counter value, modulo P.
+
+         (This step is chosen to be optimal, since it will happen more
+         frequently than any other step in the computation.)
+
+         Aggregate counter values that are close to P/2 MUST be scaled to
+         avoid overflow. See Appendix D for more information. (We do not think
+         that any counters on the current Tor network will require scaling.)
+
+  To publish the counter values:
+
+      1. The client publishes, in the format described below:
+
+         The list of counters it knows about
+         The list of TRs it knows about
+         For each TR:
+            For each counter c:
+                A list of (i, y-CTR_c-MASK(x,c)), which corresponds
+                to the share for the i'th TR of counter c.
+            SEED_i as encrypted earlier to the i'th TR's public key.
+
+3.2. Tally Reporter (TR) side
+
+  This section is less completely specified than the Data Collector's
+  behavior: I expect that the TRs will be easier to update as we proceed.
+
+  (Each TR has a long-term identity key (ed25519).  It also has a
+  sequence of short-term curve25519 keys, each associated with a single
+  round of data collection.)
+
+   1. When a group of TRs receives information from the Data Collectors,
+      they collectively chose a set S of DCs and a set of counters such
+      that every TR in the group has a valid entry for every counter,
+      from every DC in the set.
+
+      To be valid, an entry must not only be well-formed, but must also
+      have the x coordinate in its shares corresponding to the
+      TR's position in the list of TRs.
+
+   2. For each Data Collector's report, the i'th TR decrypts its part of
+      the client's report using its curve25519 key.  It uses SEED_i and
+      SHAKE-256 to regenerate MASK(0) through MASK(C-1).  Then for each
+      share (x, y-CTR_c-MASK(x,c)) (note that x=i), the TR reconstructs the
+      true share of the value for that DC and counter c by adding
+      V+MASK(x,c) to the y coordinate to yield the share (x, y_final).
+
+   3. For every counter in the set, each TR computes the sum of the
+      y_final values from all clients.
+
+   4. For every counter in the set, each TR publishes its a share of
+      the sum as (x, SUM(y_final)).
+
+   5. If at least K TRs publish correctly, then the sum can be
+      reconstructed using Lagrange polynomial interpolation. (See
+      Appendix A).
+
+   6. If the reconstructed sum is greater than P/2, it is probably a negative
+      value. The value can be obtained by subtracting P from the sum.
+      (Negative values are generated when negative noise is added to small
+      signals.)
+
+   7. If scaling has been applied, the sum is scaled by the scaling factor.
+      (See Appendix D.)
+
+4. The document format
+
+4.1. The counters document.
+
+  This document format builds on the line-based directory format used
+  for other tor documents, described in Tor's dir-spec.txt.
+
+  Using this format, we describe a "counters" document that publishes
+  the shares collected by a given DC, for a single TR.
+
+  The "counters" document has these elements:
+
+    "privctr-dump-format" SP VERSION SP SigningKey
+
+       [At start, exactly once]
+
+       Describes the version of the dump format, and provides an ed25519
+       signing key to identify the relay. The signing key is encoded in
+       base64 with padding stripped. VERSION is "alpha" now, but should
+       be "1" once this document is finalized.
+
+    "starting-at" SP IsoTime
+
+       [Exactly once]
+
+       The start of the time period when the statistics here were
+       collected.
+
+    "ending-at" SP IsoTime
+
+       [Exactly once]
+
+       The end of the time period when the statistics here were
+       collected.
+
+    "share-parameters" SP Number SP Number
+
+       [Exactly once]
+
+       The number of shares needed to reconstruct the client's
+       measurements (K), and the number of shares produced (N),
+       respectively.
+
+    "tally-reporter" SP Identifier SP Integer SP Key
+
+       [At least twice]
+
+       The curve25519 public key of each Tally Reporter that the relay
+       believes in.  (If the list does not match the list of
+       participating Tally Reporters, they won't be able to find the
+       relay's values correctly.)  The identifiers are non-space,
+       non-nul character sequences.  The Key values are encoded in
+       base64 with padding stripped; they must be unique within each
+       counters document.  The Integer values are the X coordinate of
+       the shares associated with each Tally Reporter.
+
+    "encrypted-to-key" SP Key
+
+       [Exactly once]
+
+       The curve25519 public key to which the report below is encrypted.
+       Note that it must match one of the Tally Reporter options above.
+
+
+    "report" NL
+      "----- BEGIN ENCRYPTED MESSAGE-----" NL
+      Base64Data
+      "----- END ENCRYPTED MESSAGE-----" NL
+
+      [Exactly once]
+
+      An encrypted document, encoded in base64. The plaintext format is
+      described in section 4.2. below. The encryption is as specified in
+      section 5 below, with STRING_CONSTANT set to "privctr-shares-v1".
+
+    "signature" SP Signature
+
+       [At end, exactly once]
+
+       The Ed25519 signature of all the fields in the document, from the
+       first byte, up to but not including the "signature" keyword here.
+       The signature is encoded in base64 with padding stripped.
+
+4.2. The encrypted "shares" document.
+
+  The shares document is sent, encrypted, in the "report" element above.
+  Its plaintext contents include these fields:
+
+   "encrypted-seed" NL
+      "----- BEGIN ENCRYPTED MESSAGE-----" NL
+      Base64Data
+      "----- END ENCRYPTED MESSAGE-----" NL
+
+      [At start, exactly once.]
+
+      An encrypted document, encoded in base64. The plaintext value is
+      the 32-byte value SEED_i for this TR. The encryption is as
+      specified in section 5 below, with STRING_CONSTANT set to
+      "privctr-seed-v1".
+
+   "d" SP Keyword SP Integer
+
+      [Any number of times]
+
+      For each counter, the name of the counter, and the obfuscated Y
+      coordinate of this TR's share for that counter.  (The Y coordinate
+      is calculated as y-CTR_c as in 3.1 above.)  The order of counters
+      must correspond to the order used when generating the MASK() values;
+      different clients do not need to choose the same order.
+
+5. Hybrid encryption
+
+   This scheme is taken from rend-spec-v3.txt, section 2.5.3, replacing
+   "secret_input" and "STRING_CONSTANT".  It is a hybrid encryption
+   method for encrypting a message to a curve25519 public key PK.
+
+     We generate a new curve25519 keypair (sk,pk).
+
+     We run the algorithm of rend-spec-v3.txt 2.5.3, replacing
+     "secret_input" with Curve25519(sk,PK) | SigningKey, where
+     SigningKey is the DC's signing key.  (Including the DC's SigningKey
+     here prevents one DC from replaying another one's data.)
+
+     We transmit the encrypted data as in rend-spec-v3.txt 2.5.3,
+     prepending pk.
+
+
+Appendix A. Shamir secret sharing for the impatient
+
+   In Shamir secret sharing, you want to split a value in a finite
+   field into N shares, such that any K of the N shares can
+   reconstruct the original value, but K-1 shares give you no
+   information at all.
+
+   The key insight here is that you can reconstruct a K-degree
+   polynomial given K+1 distinct points on its curve, but not given
+   K points.
+
+   So, to split a secret, we going to generate a (K-1)-degree
+   polynomial.  We'll make the Y intercept of the polynomial be our
+   secret, and choose all the other coefficients at random from our
+   field.
+
+   Then we compute the (x,y) coordinates for x in [1, N].  Now we
+   have N points, any K of which can be used to find the original
+   polynomial.
+
+   Moreover, we can do what PrivCount wants here, because adding the
+   y coordinates of N shares gives us shares of the sum:  If P1 is
+   the polynomial made to share secret A and P2 is the polynomial
+   made to share secret B, and if (x,y1) is on P1 and (x,y2) is on
+   P2, then (x,y1+y2) will be on P1+P2 ... and moreover, the y
+   intercept of P1+P2 will be A+B.
+
+   To reconstruct a secret from a set of shares, you have to either
+   go learn about Lagrange polynomials, or just blindly copy a
+   formula from your favorite source.
+
+   Here is such a formula, as pseudocode^Wpython, assuming that
+   each share is an object with a _x field and a _y field.
+
+     def interpolate(shares):
+        for sh in shares:
+           product_num = FE(1)
+           product_denom = FE(1)
+           for sh2 in shares:
+               if sh2 is sh:
+                   continue
+               product_num *= sh2._x
+               product_denom *= (sh2._x - sh._x)
+
+           accumulator += (sh._y * product_num) / product_denom
+
+       return accumulator
+
+
+Appendix B. An alternative way to pick X coordinates
+
+   Above we describe a system where everybody knows the same TRs and
+   puts them in the same order, and then does Shamir secret sharing
+   using "x" as the x coordinate for the x'th TR.
+
+   But what if we remove that requirement by having x be based on a hash
+   of the public key of the TR?  Everything would still work, so long as
+   all users chose the same K value.  It would also let us migrate TR
+   sets a little more gracefully.
+
+
+Appendix C. Sampling floating-point Gaussian noise for differential privacy
+
+   Background:
+
+   When we add noise to a counter value (signal), we want the added noise to
+   protect all of the bits in the signal, to ensure differential privacy.
+
+   But because noise values are generated from random double(s) using
+   floating-point calculations, the resulting low bits are not distributed
+   evenly enough to ensure differential privacy.
+
+   As implemented in the C "double" type, IEEE 754 double-precision
+   floating-point numbers contain 53 significant bits in their mantissa. This
+   means that noise calculated using doubles can not ensure differential
+   privacy for client activity larger than 2**53:
+     * if the noise is scaled to the magnitude of the signal using
+       multiplication, then the low bits are unprotected,
+     * if the noise is not scaled, then the high bits are unprotected.
+
+   But the operations in the noise transform also suffer from floating-point
+   inaccuracy, further affecting the low bits in the mantissa. So we can only
+   protect client activity up to 2**46 with Laplacian noise. (We assume that
+   the limit for Gaussian noise is similar.)
+
+   Our noise generation procedure further reduces this limit to 2**42. For
+   byte counters, 2**42 is 4 Terabytes, or the observed bandwidth of a 1 Gbps
+   relay running at full speed for 9 hours. It may be several years before we
+   want to protect this much client activity. However, since the mitigation is
+   relatively simple, we specify that it MUST be implemented.
+
+   Procedure:
+
+   Data collectors MUST sample noise as follows:
+     1. Generate random double(s) in [0, 1] that are integer multiples of
+        2**-53.
+        TODO: the Gaussian transform in step 2 may require open intervals
+     2. Generate a Gaussian floating-point noise value at random with sigma 1,
+        using the random double(s) generated in step 1.
+     3. Multiply the floating-point noise by the floating-point sigma value.
+     4. Truncate the scaled noise to an integer to remove the fractional bits.
+        (These bits can never correspond to signal bits, because PrivCount only
+        collects integer counters.)
+     5. If the floating-point sigma value from step 3 is large enough that any
+        noise value could be greater than or equal to 2**46, we need to
+        randomise the low bits of the integer scaled noise value. (This ensures
+        that the low bits of the signal are always hidden by the noise.)
+
+        If we use the sample_unit_gaussian() transform in nickm/privcount_nm:
+        A. The maximum r value is sqrt(-2.0*ln(2**-53)) ~=  8.57, and the
+           maximal sin(theta) values are +/- 1.0. Therefore, the generated
+           noise values can be greater than or equal to 2**46 when the sigma
+           value is greater than 2**42.
+        B. Therefore, the number of low bits that need to be randomised is:
+               N = floor(sigma / 2**42)
+        C. We randomise the lowest N bits of the integer noise by replacing them
+           with a uniformly distributed N-bit integer value in 0...(2**N)-1.
+     6. Add the integer noise to the integer counter, before the counter is
+        incremented in response to events. (This ensures that the signal value
+        is always protected.)
+
+   This procedure is security-sensitive: changing the order of
+   multiplications, truncations, or bit replacements can expose the low or
+   high bits of the signal or noise.
+
+   As long as the noise is sampled using this procedure, the low bits of the
+   signal are protected. So we do not need to "bin" any signals.
+
+   The impact of randomising more bits than necessary is minor, but if we fail
+   to randomise an unevenly distributed bit, client activity can be exposed.
+   Therefore, we choose to randomise all bits that could potentially be affected
+   by floating-point inaccuracy.
+
+   Justification:
+
+   Although this analysis applies to Laplacian noise, we assume a similar
+   analysis applies to Gaussian noise. (If we add Laplacian noise on DCs,
+   the total ends up with a Gaussian distribution anyway.)
+
+   TODO: check that the 2**46 limit applies to Gaussian noise.
+
+   This procedure results in a Gaussian distribution for the higher ~42 bits
+   of the noise. We can safely ignore the value of the lower bits of the noise,
+   because they are insignificant for our reporting.
+
+   This procedure is based on section 5.2 of:
+   "On Significance of the Least Significant Bits For Differential Privacy"
+   Ilya Mironov, ACM CCS 2012
+   https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/lsbs.pdf
+
+   We believe that this procedure is safe, because we neither round nor smooth
+   the noise values. The truncation in step 4 has the same effect as Mironov's
+   "safe snapping" procedure. Randomising the low bits removes the 2**46 limit
+   on the sigma value, at the cost of departing slightly from the ideal
+   infinite-precision Gaussian distribution. (But we already know that these
+   bits are distributed poorly, due to floating-point inaccuracy.)
+
+   Mironov's analysis assumes that a clamp() function is available to clamp
+   large signal and noise values to an infinite floating-point value.
+   Instead of clamping, PrivCount's arithmetic wraps modulo P. We believe that
+   this is safe, because any reported values this large will be meaningless
+   modulo P. And they will not expose any client activity, because "modulo P"
+   is an arithmetic transform of the summed noised signal value.
+
+   Alternatives:
+
+   We could round the encrypted value to the nearest multiple of the
+   unprotected bits. But this relies on the MASK() value being a uniformly
+   distributed random value, and it is less generic.
+
+   We could also simply fail when we reach the 2**42 limit on the sigma value,
+   but we do not want to design a system with a limit that low.
+
+   We could use a pure-integer transform to create Gaussian noise, and avoid
+   floating-point issues entirely. But we have not been able to find an
+   efficient pure-integer Gaussian or Laplacian noise transform. Nor do we
+   know if such a transform can be used to ensure differential privacy.
+
+
+Appendix D. Scaling large counters
+
+   We do not believe that scaling will be necessary to collect PrivCount
+   statistics in Tor. As of November 2017, the Tor network advertises a
+   capacity of 200 Gbps, or 2**51 bytes per day. We can measure counters as
+   large as ~2**61 before reaching the P/2 counter limit.
+
+   If scaling becomes necessary, we can scale event values (and noise sigmas)
+   by a scaling factor before adding them to the counter. Scaling may introduce
+   a bias in the final result, but this should be insignificant for reporting.
+
+
+Appendix Z. Remaining client-side uncertainties
+
+   [These are the uncertainties at the client side. I'm not considering
+    TR-only operations here unless they affect clients.]
+
+   Should we do a multi-level thing for the signing keys?  That is, have
+   an identity key for each TR and each DC, and use those to sign
+   short-term keys?
+
+   How to tell the DCs the parameters of the system, including:
+      - who the TRs are, and what their keys are?
+      - what the counters are, and how much noise to add to each?
+      - how do we impose a delay when the noise parameters change?
+        (this delay ensures differential privacy even when the old and new
+        counters are compared)
+        - or should we try to monotonically increase counter noise?
+      - when the collection intervals start and end?
+      - what happens in networks where some relays report some counters, and
+        other relays report other counters?
+        - do we just pick the latest counter version, as long as enough relays
+          support it?
+          (it's not safe to report multiple copies of counters)
+
+   How the TRs agree on which DCs' counters to collect?
+
+   How data is uploaded to DCs?
+
+   What to say about persistence on the DC side?
author	Nick Mathewson <nickm@torproject.org>	2017-12-01 15:25:10 -0500
committer	Nick Mathewson <nickm@torproject.org>	2017-12-01 15:25:10 -0500
commit	8f56a246f8c2501e1c52594f804292ad6b7cca06 (patch)
tree	9ec4075e85dd1bc7ad4c3261e5be58c797503377 /proposals/288-privcount-with-shamir.txt
parent	4f809e03629cd372e5b6ab5abbbd213f0e70195f (diff)
download	torspec-8f56a246f8c2501e1c52594f804292ad6b7cca06.tar.gz torspec-8f56a246f8c2501e1c52594f804292ad6b7cca06.zip