aboutsummaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorRuss Cox <rsc@golang.org>2017-02-15 15:41:50 -0500
committerRuss Cox <rsc@golang.org>2017-02-16 15:25:04 +0000
commitbcda91c18d8b4f6ed67897eba2b62a2c21b584a9 (patch)
tree2e39f87e85f9ba8a8426c74777688db22fb3ac36
parent7d7a0a9d64b12056d5a73beb8eea2970ee15f88e (diff)
downloadgo-bcda91c18d8b4f6ed67897eba2b62a2c21b584a9.tar.gz
go-bcda91c18d8b4f6ed67897eba2b62a2c21b584a9.zip
[release-branch.go1.8] runtime: do not call wakep from enlistWorker, to avoid possible deadlock
We have seen one instance of a production job suddenly spinning to 100% CPU and becoming unresponsive. In that one instance, a SIGQUIT was sent after 328 minutes of spinning, and the stacks showed a single goroutine in "IO wait (scan)" state. Looking for things that might get stuck if a goroutine got stuck in scanning a stack, we found that injectglist does: lock(&sched.lock) var n int for n = 0; glist != nil; n++ { gp := glist glist = gp.schedlink.ptr() casgstatus(gp, _Gwaiting, _Grunnable) globrunqput(gp) } unlock(&sched.lock) and that casgstatus spins on gp.atomicstatus until the _Gscan bit goes away. Essentially, this code locks sched.lock and then while holding sched.lock, waits to lock gp.atomicstatus. The code that is doing the scan is: if castogscanstatus(gp, s, s|_Gscan) { if !gp.gcscandone { scanstack(gp, gcw) gp.gcscandone = true } restartg(gp) break loop } More analysis showed that scanstack can, in a rare case, end up calling back into code that acquires sched.lock. For example: runtime.scanstack at proc.go:866 calls runtime.gentraceback at mgcmark.go:842 calls runtime.scanstack$1 at traceback.go:378 calls runtime.scanframeworker at mgcmark.go:819 calls runtime.scanblock at mgcmark.go:904 calls runtime.greyobject at mgcmark.go:1221 calls (*runtime.gcWork).put at mgcmark.go:1412 calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127 calls runtime.wakep at mgc.go:632 calls runtime.startm at proc.go:1779 acquires runtime.sched.lock at proc.go:1675 This path was found with an automated deadlock-detecting tool. There are many such paths but they all go through enlistWorker -> wakep. The evidence strongly suggests that one of these paths is what caused the deadlock we observed. We're running those jobs with GOTRACEBACK=crash now to try to get more information if it happens again. Further refinement and analysis shows that if we drop the wakep call from enlistWorker, the remaining few deadlock cycles found by the tool are all false positives caused by not understanding the effect of calls to func variables. The enlistWorker -> wakep call was intended only as a performance optimization, it rarely executes, and if it does execute at just the wrong time it can (and plausibly did) cause the deadlock we saw. Comment it out, to avoid the potential deadlock. Fixes #19112. Unfixes #14179. Change-Id: I6f7e10b890b991c11e79fab7aeefaf70b5d5a07b Reviewed-on: https://go-review.googlesource.com/37093 Run-TryBot: Russ Cox <rsc@golang.org> Reviewed-by: Austin Clements <austin@google.com> Reviewed-on: https://go-review.googlesource.com/37022 TryBot-Result: Gobot Gobot <gobot@golang.org>
-rw-r--r--src/runtime/mgc.go10
1 files changed, 6 insertions, 4 deletions
diff --git a/src/runtime/mgc.go b/src/runtime/mgc.go
index 0b996d8950..cd57720917 100644
--- a/src/runtime/mgc.go
+++ b/src/runtime/mgc.go
@@ -628,10 +628,12 @@ func (c *gcControllerState) endCycle() {
//go:nowritebarrier
func (c *gcControllerState) enlistWorker() {
// If there are idle Ps, wake one so it will run an idle worker.
- if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
- wakep()
- return
- }
+ // NOTE: This is suspected of causing deadlocks. See golang.org/issue/19112.
+ //
+ // if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
+ // wakep()
+ // return
+ // }
// There are no idle Ps. If we need more dedicated workers,
// try to preempt a running P so it will switch to a worker.