aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorMatt Traudt <sirmatt@ksu.edu>2017-09-27 16:11:05 -0400
committerDavid Goulet <dgoulet@torproject.org>2017-09-29 11:06:31 -0400
commit3ef7e6f1870bf6bc2c26076c6abb545a576f9a42 (patch)
tree2b9759e1a993becc1fdbbefe8fa6046e870b6ba9
parentfc6c0b46fb2741d58e4c8afc22800c76c304b0dc (diff)
downloadtor-3ef7e6f1870bf6bc2c26076c6abb545a576f9a42.tar.gz
tor-3ef7e6f1870bf6bc2c26076c6abb545a576f9a42.zip
sched: Don't get KIST stuck in an infinite loop
When a channel is scheduled and flush cells returns 0 that is no cells to flush, we flag it back in waiting for cells so it doesn't get stuck in a possible infinite loop. It has been observed on moria1 where a closed channel end up in the scheduler where the flush process returned 0 cells but it was ultimately kept in the scheduling loop forever. We suspect that this is due to a more deeper problem in tor where the channel_more_to_flush() is actually looking at the wrong queue and was returning 1 for an empty channel thus putting the channel in the "Case 4" of the scheduler which is to go back in pending state thus re-considered at the next iteration. This is a fix that allows the KIST scheduler to recover properly from a not entirelly diagnosed problem in tor. Fixes #23676 Signed-off-by: David Goulet <dgoulet@torproject.org>
-rw-r--r--changes/bug236766
-rw-r--r--src/or/scheduler_kist.c17
2 files changed, 22 insertions, 1 deletions
diff --git a/changes/bug23676 b/changes/bug23676
new file mode 100644
index 0000000000..90f9e31809
--- /dev/null
+++ b/changes/bug23676
@@ -0,0 +1,6 @@
+ o Major bugfixes (scheduler):
+ If a channel is put into the scheduler's pending list, then it starts
+ closing, and then if the scheduler runs before it finishes closing, the
+ scheduler will get stuck trying to flush its cells while the lower layers
+ refuse to cooperate. Fix that race condition by given the scheduler an
+ escape method. Fixes bug 23676; bugfix on 0.3.2.1-alpha
diff --git a/src/or/scheduler_kist.c b/src/or/scheduler_kist.c
index 9e960cdef9..2dddd3e9a4 100644
--- a/src/or/scheduler_kist.c
+++ b/src/or/scheduler_kist.c
@@ -591,8 +591,23 @@ kist_scheduler_run(void)
if (flush_result > 0) {
update_socket_written(&socket_table, chan, flush_result *
(CELL_MAX_NETWORK_SIZE + TLS_PER_CELL_OVERHEAD));
+ } else {
+ /* XXX: This can happen because tor sometimes does flush in an
+ * opportunistic way cells from the circuit to the outbuf so the
+ * channel can end up here without having anything to flush nor needed
+ * to write to the kernel. Hopefully we'll fix that soon but for now
+ * we have to handle this case which happens kind of often. */
+ log_debug(LD_SCHED,
+ "We didn't flush anything on a chan that we think "
+ "can write and wants to write. The channel's state is '%s' "
+ "and in scheduler state %d. We're going to mark it as "
+ "waiting_for_cells (as that's most likely the issue) and "
+ "stop scheduling it this round.",
+ channel_state_to_string(chan->state),
+ chan->scheduler_state);
+ chan->scheduler_state = SCHED_CHAN_WAITING_FOR_CELLS;
+ continue;
}
- /* XXX What if we didn't flush? */
}
/* Decide what to do with the channel now */