From ea41a664476e7bc3690f080fbd3c13e2e32629fc Mon Sep 17 00:00:00 2001
From: Nick Mathewson <nickm@torproject.org>
Date: Fri, 22 Oct 2021 17:36:04 -0400
Subject: Add proposals 336 and 337.

---
 proposals/336-randomize-guard-retries.md |  87 +++++++++++++++++++
 proposals/337-simpler-guard-usability.md | 138 +++++++++++++++++++++++++++++++
 2 files changed, 225 insertions(+)
 create mode 100644 proposals/336-randomize-guard-retries.md
 create mode 100644 proposals/337-simpler-guard-usability.md

(limited to 'proposals')

diff --git a/proposals/336-randomize-guard-retries.md b/proposals/336-randomize-guard-retries.md
new file mode 100644
index 0000000..5ee5b71
--- /dev/null
+++ b/proposals/336-randomize-guard-retries.md
@@ -0,0 +1,87 @@
+```
+Filename: 336-randomize-guard-retries.md
+Title: Randomized schedule for guard retries
+Author: Nick Mathewson
+Created: 2021-10-22
+Status: Open
+```
+
+# Introduction
+
+When we notice that a guard isn't working, we don't mark it as retriable
+until a certain interval has passed.  Currently, these intervals are
+fixed, as described in the documentation for `GUARDS_RETRY_SCHED` in
+`guard-spec` appendix A.1.  Here we propose using a randomized retry
+interval instead, based on the same decorrelated-jitter algorithm we use
+for directory retries.
+
+The upside of this approach is that it makes our behavior in
+the presence of an unreliable network a bit harder for an attacker to
+predict. It also means that if a guard goes down for a while, its
+clients will notice that it is up at staggered times, rather than
+probing it in lock-step.
+
+The downside of this approach is that we can, if we get unlucky
+enough, completely fail to notice that a preferred guard is online when
+we would otherwise have noticed sooner.
+
+Note that when a guard is marked retriable, it isn't necessarily retried
+immediately.  Instead, its status is changed from "Unreachable" to
+"Unknown", which will cause it to get retried.
+
+For reference, our previous schedule was:
+
+```
+   {param:PRIMARY_GUARDS_RETRY_SCHED}
+      -- every 10 minutes for the first six hours,
+      -- every 90 minutes for the next 90 hours,
+      -- every 4 hours for the next 3 days,
+      -- every 9 hours thereafter.
+
+   {param:GUARDS_RETRY_SCHED} --
+      -- every hour for the first six hours,
+      -- every 4 hours for the next 90 hours,
+      -- every 18 hours for the next 3 days,
+      -- every 36 hours thereafter.
+```
+
+# The new algorithm
+
+We re-use the decorrelated-jitter algorithm from `dir-spec` section 5.5.
+The specific formula used to compute the 'i+1'th delay is:
+
+```
+Delay_{i+1} = MIN(cap, random_between(lower_bound, upper_bound))
+where upper_bound = MAX(lower_bound+1, Delay_i * 3)
+      lower_bound = MAX(1, base_delay).
+```
+
+For primary guards, we set base_delay to 30 seconds and cap to 6 hours.
+
+For non-primary guards, we set base_delay to 10 minutes and cap to 36
+hours.
+
+(These parameters were selected by simulating the results of using them
+until they looked "a bit more aggressive" than the current algorithm, but
+not too much.)
+
+The average behavior for the new primary schedule is:
+
+```
+First 1.0 hours: 10.14283 attempts. (Avg delay 4m 47.41s)
+First 6.0 hours: 19.02377 attempts. (Avg delay 15m 36.95s)
+First 96.0 hours: 56.11173 attempts. (Avg delay 1h 40m 3.13s)
+First 168.0 hours: 83.67091 attempts. (Avg delay 1h 58m 43.16s)
+Steady state: 2h 36m 44.63s between attempts.
+```
+
+The average behavior for the new non-primary schedule is:
+
+```
+First 1.0 hours: 3.08069 attempts. (Avg delay 14m 26.08s)
+First 6.0 hours: 8.1473 attempts. (Avg delay 35m 25.27s)
+First 96.0 hours: 22.57442 attempts. (Avg delay 3h 49m 32.16s)
+First 168.0 hours: 29.02873 attempts. (Avg delay 5h 27m 2.36s)
+Steady state: 11h 15m 28.47s between attempts.
+```
+
diff --git a/proposals/337-simpler-guard-usability.md b/proposals/337-simpler-guard-usability.md
new file mode 100644
index 0000000..d75c4d8
--- /dev/null
+++ b/proposals/337-simpler-guard-usability.md
@@ -0,0 +1,138 @@
+```
+Filename: 337-simpler-guard-usability.md
+Title: A simpler way to decide, "Is this guard usable?"
+Author: Nick Mathewson
+Created: 2021-10-22
+Status: Open
+```
+
+# Introduction
+
+The current `guard-spec` describes a mechanism for how to behave when
+our primary guards are unreachable, and we don't know which other guards
+are reachable.  This proposal describes a simpler method, currently
+implemented in [Arti](https://gitlab.torproject.org/tpo/core/arti/).
+
+(Note that this method might not actually give different results: its
+only advantage is that it is much simpler to implement.)
+
+## The task at hand
+
+For illustration, we'll assume that our primary guards are P1, P2, and
+P3, and our subsequent guards (in preference order) are G1, G2, G3, and
+so on.  The status of each guard is Reachable (we think we can connect
+to it), Unreachable (we think it's down), or Unknown (we haven't tried
+it recently).
+
+The question becomes, "What should we do when P1, P2, and P3 are
+Unreachable, and G1, G2, ... are all Unknown"?
+
+In this circumstance, we _could_ say that we only build circuits to G1,
+wait for them to succeed or fail, and only try G2 if we see that the
+circuits to G1 have failed completely.  But that delays in the case that
+G1 is down.
+
+Instead, the first time we get a circuit request, we try to build one
+circuit to G1.  On the next circuit request, if the circuit to G1 isn't
+done yet, we launch a circuit to G2 instead.  The next request (if the
+G1 and G2 circuits are still pending) goes to G3, and so on.  But
+(here's the critical part!) we don't actually _use_ the circuit to G2
+unless the circuit to G1 fails, and we don't actually _use_ the circuit
+to G3 unless the circuits to G1 and G2 both fail.
+
+This approach causes Tor clients to check the status of multiple
+possible guards in parallel, while not actually _using_ any guard until
+we're sure that all the guards we'd rather use are down.
+
+## The current algorithm and its drawbacks
+
+For the current algorithm, see `guard-spec` section 4.9: circuits are
+exploratory if they are not using a primary guard.  If such an
+exploratory circuit is `waiting_for_better_guard`, then we advance it
+(or not) depending on the status of all other _circuits_ using guards that
+we'd rather be using.
+
+In other words, the current algorithm is described in terms of actions
+to take with given circuits.
+
+For Arti (and for other modular Tor implementations), however, this
+algorithm is a bit of a pain: it introduces dependencies between the
+guard code and the circuit handling code, requiring each one to mess
+with the other.
+
+# Proposal
+
+I suggest that we describe an alternative algorithm for handing circuits
+to non-primary guards, to be used in preference to the current
+algorithm.  Unlike the existing approach, it isolates the guard logic a
+bit better from the circuit logic.
+
+## Handling exploratory circuits
+
+When all primary guards are Unreachable, we need to try non-primary
+guards.  We select the first such guard (in preference order) that is
+neither Unreachable nor Pending.  Whenever we give out such a guard, if
+the guard's status is Unknown, then we call that guard "Pending" until
+the attempt to use it succeeds or fails.  We remember when the guard
+became Pending.
+
+> Aside: None of the above is a change from our existing specification.
+
+After completing a circuit, the implementation must check whether
+its guard is usable.  A guard is usable according to these rules:
+
+Primary guards are always usable.
+
+Non-primary guards are usable for a given circuit if every guard earlier
+in the preference list is either unsuitable for that circuit
+(e.g. because of family restrictions), or marked as Unreachable, or has
+been pending for at least `{NONPRIMARY_GUARD_CONNECT_TIMEOUT}`.
+
+Non-primary guards are unusable for a given circuit if some guard earlier
+in the preference list is suitable for the circuit _and_ Reachable.
+
+Non-primary guards are unusable if they have not become usable after
+`{NONPRIMARY_GUARD_IDLE_TIMEOUT}` seconds.
+
+If a circuit's guard is neither usable nor unusable immediately, the
+circuit is not discarded; instead, it is kept (but not used) until it
+becomes usable or unusable.
+
+> I am not 100% sure whether this description produces the same behavior
+> as the current guard-spec, but it is simpler to describe, and has
+> proven to be simpler to implement.
+
+## Implications for program design.
+
+(This entire section is implementation detail to explain why this is a
+simplification from the previous algorithm. It is for explanatory
+purposes only and is not part of the spec.)
+
+With this algorithm, we cut down the interaction between the guard code
+and the circuit code considerably, but we do not remove it entirely.
+Instead, there remains (in Arti terms) a pair of communication channels
+between the circuit manager and the guard manager:
+
+ * Whenever a guard is given to the circuit manager, the circuit manager
+   receives the write end of a single-use channel to
+   report whether the guard has succeeded or failed.
+
+ * Whenever a non-primary guard is given to the circuit manager, the
+   circuit receives the read end of a single-use channel that will tell
+   it whether the guard is usable or unusable.  This channel doesn't
+   report anything until the guard has one status or the other.
+
+With this design, the circuit manager never needs to look at the list of
+guards, and the guard manager never needs to look at the list of
+circuits.
+
+## Subtleties concerning "guard success"
+
+Note that the above definitions of a Reachable guard depend on reporting
+when the _guard_ is successful or failed. This is not necessarily the
+same as reporting whether the _circuit_ is successful or failed.  For
+example, a circuit that fails after the first hop does not necessarily
+indicate that there's anything wrong with the guard.  Similarly, we can
+reasonably conclude that the guard is working (at least somewhat) as
+long as we have an open channel to it.
+
-- 
cgit v1.2.3-54-g00ecf