Understanding Real-World Concurrency Bugs in Go: Tengfei Tu Xiaoyu Liu
In total, we have studied 171 concurrency bugs in these applications. We analyzed their root causes, performed experiments to reproduce them, and examined their fixing patches. Finally, we tested them with two existing Go concurrency bug detectors (the only publicly available ones).

Our study focuses on a long-standing and fundamental question in concurrent programming: between message passing [27, 37] and shared memory, which of these inter-thread communication mechanisms is less error-prone [2, 11, 48]. Go is a perfect language to study this question, since it provides frameworks for both shared memory and message passing. However, it encourages the use of channels over shared memory with the belief that explicit message passing is less error-prone [1, 2, 21].

To understand Go concurrency bugs and the comparison between message passing and shared memory, we propose to categorize concurrency bugs along two orthogonal dimensions: the cause of bugs and their behavior. Along the cause dimension, we categorize bugs into those caused by misuse of shared memory and those caused by misuse of message passing. Along the second dimension, we separate bugs into those that involve (any number of) goroutines that cannot proceed (we call them blocking bugs) and those that do not involve any blocking (non-blocking bugs).

Surprisingly, our study shows that it is as easy to make concurrency bugs with message passing as with shared memory, sometimes even easier. For example, around 58% of blocking bugs are caused by message passing. In addition to violations of Go's channel usage rules (e.g., waiting on a channel that no one sends data to or closes), many concurrency bugs are caused by the mixed usage of message passing and other new semantics and new libraries in Go, which can easily be overlooked but are hard to detect.

1  func finishReq(timeout time.Duration) r ob {
2  -    ch := make(chan ob)
3  +    ch := make(chan ob, 1)
4       go func() {
5           result := fn()
6           ch <- result // block
7       }()
8       select {
9       case result = <- ch:
10          return result
11      case <- time.After(timeout):
12          return nil
13      }
14 }

Figure 1. A blocking bug caused by channel.

To demonstrate errors in message passing, we use a blocking bug from Kubernetes in Figure 1. The finishReq function creates a child goroutine using an anonymous function at line 4 to handle a request—a common practice in Go server programs. The child goroutine executes fn() and sends the result back to the parent goroutine through channel ch at line 6. The child will block at line 6 until the parent pulls the result from ch at line 9. Meanwhile, the parent will block at select until either the child sends the result to ch (line 9) or a timeout happens (line 11). If the timeout happens earlier, or if the Go runtime (non-deterministically) chooses the case at line 11 when both cases are valid, the parent will return from finishReq() at line 12, and no one else can pull the result from ch any more, leaving the child blocked forever. The fix is to change ch from an unbuffered channel to a buffered one, so that the child goroutine can always send the result even after the parent has exited.

This bug demonstrates the complexity of using new features in Go and the difficulty of writing correct Go programs like this. Programmers have to have a clear understanding of goroutine creation with anonymous functions, a feature Go proposes to ease the creation of goroutines, the usage of buffered vs. unbuffered channels, the non-determinism of waiting for multiple channel operations using select, and the special library time. Although each of these features was designed to ease multi-threaded programming, in reality, it is difficult to write correct Go programs with them.

Overall, our study reveals new practices and new issues of Go concurrent programming, and it sheds light on an answer to the debate of message passing vs. shared memory accesses. Our findings improve the understanding of Go concurrency and can provide valuable guidance for future tool design.

This paper makes the following key contributions.
• We performed the first empirical study of Go concurrency bugs with six real-world, production-grade Go applications.
• We made nine high-level key observations of Go concurrency bug causes, fixes, and detection. They can be useful references for Go programmers. We further make eight insights into the implications of our study results to guide future research in the development, testing, and bug detection of Go.
• We proposed new methods to categorize concurrency bugs along two dimensions: bug causes and behaviors. This taxonomy methodology helped us to better compare different concurrency mechanisms and correlations of bug causes and fixes. We believe other bug studies can utilize similar taxonomy methods as well.

All our study results and studied commit logs can be found at https://github.com/system-pclub/go-concurrency-bugs.

2 Background and Applications

Go is a statically-typed programming language that is designed for concurrent programming from day one [60]. Almost all major Go revisions include improvements in its concurrency packages [23]. This section gives a brief background on Go's concurrency mechanisms, including its thread model, inter-thread communication methods, and thread synchronization mechanisms. We also introduce the six Go applications we chose for this study.
Understanding Real-World Concurrency Bugs in Go ASPLOS’19, April 13–17, 2019, Providence, RI, USA
This section presents our static and dynamic analysis results of goroutine usage and Go concurrency primitive usage in our selected six applications.

Table 2. Number of goroutine/thread creation sites. The number of goroutine/thread creation sites using normal functions and anonymous functions, total number of creation sites, and creation sites per thousand lines of code.

Application  | Normal F. | Anonymous F. | Total | Per KLOC
Docker       | 33        | 112          | 145   | 0.18
Kubernetes   | 301       | 233          | 534   | 0.23
etcd         | 86        | 211          | 297   | 0.67
CockroachDB  | 27        | 125          | 152   | 0.29
gRPC-Go      | 14        | 30           | 44    | 0.83
BoltDB       | 2         | 0            | 2     | 0.22
gRPC-C       | 5         | -            | 5     | 0.03

Table 3. Dynamic information when executing RPC benchmarks. The ratio of goroutine number divided by thread number and the average goroutine execution time normalized by the whole application's execution time.

Workload          | Goroutines/Threads (client) | Goroutines/Threads (server) | Ave. Execution Time (client-Go) | Ave. Execution Time (server-Go)
g_sync_ping_pong  | 7.33   | 2.67 | 63.65% | 76.97%
sync_ping_pong    | 7.33   | 4    | 63.23% | 76.57%
qps_unconstrained | 201.46 | 6.36 | 91.05% | 92.73%

Concurrency primitive usage per application. Mutex, atomic, Once, WaitGroup, and Cond are shared-memory primitives; chan and Misc. are message-passing primitives.

Application  | Mutex  | atomic | Once  | WaitGroup | Cond  | chan   | Misc. | Total
Docker       | 62.62% | 1.06%  | 4.75% | 1.70%     | 0.99% | 27.87% | 0.99% | 1410
Kubernetes   | 70.34% | 1.21%  | 6.13% | 2.68%     | 0.96% | 18.48% | 0.20% | 3951
etcd         | 45.01% | 0.63%  | 7.18% | 3.95%     | 0.24% | 42.99% | 0     | 2075
CockroachDB  | 55.90% | 0.49%  | 3.76% | 8.57%     | 1.48% | 28.23% | 1.57% | 3245
gRPC-Go      | 61.20% | 1.15%  | 4.20% | 7.00%     | 1.65% | 23.03% | 1.78% | 786
BoltDB       | 70.21% | 2.13%  | 0     | 0         | 0     | 23.40% | 4.26% | 47
Figure 2. Usages of Shared-Memory Primitives over Time. For each application, we calculate the proportion of shared-memory primitives over all primitives. (x-axis: Feb 2015 to May 2018; y-axis: usage proportion, 0 to 1.)

Figure 3. Usages of Message-Passing Primitives over Time. For each application, we calculate the proportion of message-passing primitives over all primitives. (x-axis: Feb 2015 to May 2018; y-axis: usage proportion, 0 to 1.)

Figure 4. Bug Life Time. The CDF of the life time of all shared-memory bugs and all message-passing bugs. (x-axis: bug life time in days, 0 to 700.)
from Feb 2015 to May 2018. Overall, the usages tend to be stable over time, which also implies that our study results will be valuable for future Go programmers.

Observation 2: Although traditional shared-memory thread communication and synchronization remain heavily used, Go programmers also use a significant amount of message-passing primitives.

Implication 1: With heavier usage of goroutines and new types of concurrency primitives, Go programs may potentially introduce more concurrency bugs.

4 Bug Study Methodology

This section discusses how we collected, categorized, and reproduced concurrency bugs in this study.

Collecting concurrency bugs. To collect concurrency bugs, we first filtered the GitHub commit histories of the six applications by searching their commit logs for concurrency-related keywords, including "race", "deadlock", "synchronization", "concurrency", "lock", "mutex", "atomic", "compete", "context", "once", and "goroutine leak". Some of these keywords were used in previous works to collect concurrency bugs in other languages [40, 42, 45]. Some of them are related to new concurrency primitives or libraries introduced by Go, such as "once" and "context". One of them, "goroutine leak", is related to a problem unique to Go. In total, we found 3211 distinct commits that match our search criteria.

We then randomly sampled the filtered commits, identified commits that fix concurrency bugs, and manually studied them. Many bug-related commit logs also mention the corresponding bug reports, and we also studied these reports for our bug analysis. We studied 171 concurrency bugs in total.

Bug taxonomy. We propose a new method to categorize Go concurrency bugs according to two orthogonal dimensions. The first dimension is based on the behavior of bugs. If one or more goroutines are unintentionally stuck in their execution and cannot move forward, we call such concurrency issues blocking bugs. If instead all goroutines can finish their tasks but their behaviors are not desired, we call them non-blocking ones. Most previous concurrency bug studies [24, 43, 45] categorize bugs into deadlock bugs and non-deadlock bugs, where deadlocks include situations where there is a circular wait across multiple threads. Our definition of blocking is broader than deadlock and includes situations where there is no circular wait but one (or more) goroutines wait for resources that no other goroutines supply. As we will show in Section 5, quite a few Go concurrency bugs are of this kind. We believe that with the new programming habits and semantics of new languages like Go, we should pay more attention to these non-deadlock blocking bugs and extend the traditional concurrency bug categorization mechanism.

The second dimension is along the cause of concurrency bugs. Concurrency bugs happen when multiple threads try to communicate and errors happen during such communication. Our idea is thus to categorize causes of concurrency bugs by how different goroutines communicate: by accessing shared memory or by passing messages. This categorization can help programmers and researchers choose better ways to perform inter-thread communication and to detect and avoid potential errors when performing such communication.

According to our categorization method, there are a total of 85 blocking bugs and 86 non-blocking bugs, and there are a total of 105 bugs caused by wrong shared memory protection and 66 bugs caused by wrong message passing. Table 5 shows the detailed breakdown of bug categories across each application.

Table 5. Taxonomy. This table shows how our studied bugs distribute across different categories and applications.

Application  | Behavior: blocking | Behavior: non-blocking | Cause: shared memory | Cause: message passing
Docker       | 21 | 23 | 28  | 16
Kubernetes   | 17 | 17 | 20  | 14
etcd         | 21 | 16 | 18  | 19
CockroachDB  | 12 | 16 | 23  | 5
gRPC         | 11 | 12 | 12  | 11
BoltDB       | 3  | 2  | 4   | 1
Total        | 85 | 86 | 105 | 66

We further analyzed the life time of our studied bugs, i.e., the time from when the buggy code was added (committed) to the software to when it was fixed (a bug-fixing patch is committed). As shown in Figure 4, most bugs we study (both shared memory and message passing) have long life times. We also found the time when these bugs were
out from the loop.

Although condition variables and thread group wait are both traditional concurrency techniques, we suspect Go's new programming model to be one of the reasons why programmers made these concurrency bugs. For example, unlike pthread_join, which is a function call that explicitly waits on the completion of (named) threads, WaitGroup is a variable that can be shared across goroutines, and its Wait function implicitly waits for calls to the Done function.

Observation 4: Most blocking bugs that are caused by shared memory synchronization have the same causes and same fixes as in traditional languages. However, a few of them are different, either because of Go's new implementation of existing primitives or because of its new programming semantics.

5.1.2 Misuse of Message Passing

We now discuss blocking bugs caused by errors in message passing, which, contrary to common belief, are the main type of blocking bugs in our studied applications.

Channel. Mistakes in using channel to pass messages across goroutines cause 29 blocking bugs. Many of the channel-related blocking bugs are caused by a missing send to (or receive from) a channel, or by failing to close a channel, which results in the blocking of a goroutine that waits to receive from (or send to) the channel. One such example is Figure 1.

Figure 7. A blocking bug caused by wrong usage of channel with lock.

Channel and other blocking primitives. For 16 blocking bugs, one goroutine is blocked at a channel operation, and another goroutine is blocked at a lock or a wait. For example, as shown in Figure 7, goroutine1 is blocked at sending a request to channel ch, while goroutine2 is blocked at m.Lock(). The fix is to add a select with a default branch for goroutine1 so that the send on ch no longer blocks.

Messaging libraries. Go provides several libraries to pass data or messages, like Pipe. These special library calls can also cause blocking bugs when not used correctly. For example, similar to channel, if a Pipe is not closed, a goroutine can be blocked when it tries to send data to or pull data from the unclosed Pipe. There are 4 collected blocking bugs caused by special Go message-passing library calls.

Observation 5: All blocking bugs caused by message passing are related to Go's new message passing semantics like channel. They can be difficult to detect, especially when message passing operations are used together with other synchronization mechanisms.

Implication 2: Contrary to common belief, message passing can cause more blocking bugs than shared memory. We call for attention to the potential danger in programming with message passing and raise the research question of bug detection in this area.
Table 7. Fix strategies for blocking bugs. The subscript s stands for synchronization.

Root Cause      | Adds | Moves | Changes | Removes | Misc.
Shared Memory:
Mutex           | 9    | 7     | 2       | 8       | 2
Wait            | 0    | 1     | 0       | 1       | 1
RWMutex         | 0    | 2     | 0       | 3       | 0
Message Passing:
Chan            | 15   | 1     | 5       | 4       | 4
Chan w/ s       | 6    | 3     | 2       | 4       | 1
Messaging Lib   | 1    | 0     | 0       | 1       | 2
Total           | 31   | 14    | 9       | 21      | 10

Table 8. Benchmarks and evaluation results of the deadlock detector.

Root Cause          | # of Used Bugs | # of Detected Bugs
Mutex               | 7              | 1
Chan                | 8              | 0
Chan w/ s           | 4              | 1
Messaging Libraries | 2              | 0
Total               | 21             | 2

5.2 Fixes of Blocking Bugs

After understanding the causes of blocking bugs in Go, we now analyze how Go programmers fixed these bugs in the real world.

Eliminating the blocking cause of a hanging goroutine will unblock it, and this is the general approach to fixing blocking bugs. To achieve this goal, Go developers often adjust synchronization operations, including adding missing ones, moving or changing misplaced/misused ones, and removing extra ones. Table 7 summarizes these fixes.

Most blocking bugs caused by mistakes in protecting shared memory accesses were fixed by methods similar to traditional deadlock fixes. For example, among the 33 Mutex- or RWMutex-related bugs, 8 were fixed by adding a missing unlock; 9 were fixed by moving lock or unlock operations to proper locations; and 11 were fixed by removing an extra lock operation.

11 blocking bugs caused by wrong message passing were fixed by adding a missing message or closing operation to a channel (and on two occasions, to a pipe) on a goroutine different from the blocking one. 8 blocking bugs were fixed by adding a select with a default option (e.g., Figure 7) or with a case operating on a different channel. Another common fix of channel-related blocking bugs is to replace an unbuffered channel with a buffered channel (e.g., Figure 1). Other channel-related blocking bugs can be fixed by strategies such as moving a channel operation out of a critical section or replacing a channel with shared variables.

To understand the relationship between the cause of a blocking bug and its fix, we apply a statistical metric called lift, following previous empirical studies on real-world bugs [29, 41]. lift is calculated as lift(A, B) = P(AB) / (P(A) P(B)), where A denotes a root cause category, B denotes a fix strategy category, and P(AB) denotes the probability that a blocking bug is caused by A and fixed by B. When the lift value is equal to 1, root cause A is independent of fix strategy B. When the lift value is larger than 1, A and B are positively correlated, which means that if a blocking bug is caused by A, it is more likely to be fixed by B. When lift is smaller than 1, A and B are negatively correlated.

Among all the bug categories that have more than 10 blocking bugs (we omit categories that have fewer than 10 bugs because of their statistical insignificance), Mutex is the category that has the strongest correlation to a type of fix—it correlates with Moves with lift value 1.52. The correlation between Chan and Adds is the second highest, with lift value 1.42. All other categories that have more than 10 blocking bugs have lift values below 1.16, showing no strong correlation.

We also analyzed the fixes of blocking bugs according to the type of concurrency primitive used in the patches. As expected, most bugs whose causes are related to a certain type of primitive were also fixed by adjusting that primitive. For example, all Mutex-related bugs were fixed by adjusting Mutex primitives.

The high correlation between bug causes and the primitives and strategies used to fix them, plus the limited types of synchronization primitives in Go, suggests a promising avenue for investigating automatic correction of blocking bugs in Go. We further find that the patch size of our studied blocking bugs is small, with an average of 6.8 lines of code. Around 90% of studied blocking bugs are fixed by adjusting synchronization primitives.

Observation 6: Most blocking bugs in our study (both traditional shared-memory ones and message passing ones) can be fixed with simple solutions, and many fixes are correlated with bug causes.

Implication 3: The high correlation between causes and fixes in Go blocking bugs and the simplicity of their fixes suggest that it is promising to develop fully automated or semi-automated tools to fix blocking bugs in Go.

5.3 Detection of Blocking Bugs

Go provides a built-in deadlock detector that is implemented in the goroutine scheduler. The detector is always enabled during Go runtime, and it reports a deadlock when no goroutines in a running process can make progress. We tested all our reproduced blocking bugs with Go's built-in deadlock detector to evaluate what bugs it can find. For every tested bug, the blocking can be triggered deterministically in every run. Therefore, for each bug, we only ran it once in this experiment. Table 8 summarizes our test results.

The built-in deadlock detector can only detect two blocking bugs, BoltDB#392 and BoltDB#240, and fails in all other cases (although the detector does not report any false positives [38, 39]). There are two reasons why the built-in detector failed to detect other blocking bugs. First, it does not consider the monitored system as blocking when there are still some running goroutines. Second, it only examines whether or not goroutines are blocked at Go concurrency primitives but does not consider goroutines that wait for other systems
resources. These two limitations were largely due to the design goal of the built-in detector—minimal runtime overhead. When implemented in the runtime scheduler, it is very hard for a detector to effectively identify complex blocking bugs without sacrificing performance.

Implication 4: A simple runtime deadlock detector is not effective in detecting Go blocking bugs. Future research should focus on building novel blocking bug detection techniques, for example, with a combination of static and dynamic blocking pattern detection.

6 Non-Blocking Bugs

This section presents our study on non-blocking bugs. Similar to what we did in Section 5, we studied the root causes and fixes of non-blocking bugs and evaluated a built-in race detector of Go.

6.1 Root Causes of Non-blocking Bugs

Similar to blocking bugs, we also categorize our collected non-blocking bugs into those that were caused by failing to protect shared memory and those that have errors with message passing (Table 9).

Table 9. Root causes of non-blocking bugs. traditional: traditional non-blocking bugs; anonymous function: non-blocking bugs caused by anonymous function; waitgroup: misusing WaitGroup; lib: Go library; chan: misusing channel.

6.1.1 Failing to Protect Shared Memory

Previous work [8, 14, 16, 17, 46, 47, 52, 62–64] found that unprotected shared memory accesses, or errors in such protection, are the main causes of data races and other non-deadlock bugs. Similarly, we found that around 80% of our collected non-blocking bugs are due to unprotected or wrongly protected shared memory accesses. However, not all of them share the same causes as non-blocking bugs in traditional languages.

Traditional bugs. More than half of our collected non-blocking bugs are caused by traditional problems that also happen in classic languages like C and Java, such as atomicity violation [8, 16, 46], order violation [17, 47, 62, 64], and data race [14, 52, 63]. This result shows that the same mistakes are made by developers across different languages. It also indicates that it is promising to apply existing concurrency bug detection algorithms to look for new bugs in Go.

Interestingly, we found seven non-blocking bugs whose root causes are traditional but are largely caused by the lack of a clear understanding of new Go features. For example, Docker#22985 and CockroachDB#6111 are caused by a data race on a shared variable whose reference is passed across goroutines through a channel.

Anonymous function. Go designers make goroutine declaration similar to a regular function call (which does not even need to have a "function name") so as to ease the creation of goroutines. All local variables declared before a Go anonymous function are accessible by the anonymous function. Unfortunately, this ease of programming can increase the chance of data-race bugs when goroutines are created with anonymous functions, since developers may not pay enough attention to protecting such shared local variables.

Figure 8. A data race caused by anonymous function.

We found 11 bugs of this type, 9 of which are caused by a data race between a parent goroutine and a child goroutine created using an anonymous function. The other two are caused by a data race between two child goroutines. One example from Docker is shown in Figure 8. Local variable i is shared between the parent goroutine and the goroutines it creates at line 2. The developer intends each child goroutine to use a distinct i value to initialize the string apiVersion at line 4. However, the values of apiVersion are non-deterministic in the buggy program. For example, if the child goroutines begin after the whole loop of the parent goroutine finishes, the values of apiVersion are all equal to 'v1.21'. The buggy program only produces the desired result when each child goroutine initializes the string apiVersion immediately after its creation and before i is assigned a new value. Docker developers fixed this bug by making a copy of the shared variable i in every iteration and passing the copied value to the new goroutines.

(a) func1:
1  func (p *peer) send() {
2      p.mu.Lock()
3      defer p.mu.Unlock()
4      switch p.status {
5      case idle:
6  +       p.wg.Add(1)
7          go func() {
8  -           p.wg.Add(1)
9              ...
10             p.wg.Done()
11         }()
12     case stopped:
13     }
14 }

(b) func2:
1  func (p *peer) stop() {
2      p.mu.Lock()
3      p.status = stopped
4      p.mu.Unlock()
5      p.wg.Wait()
6  }

Figure 9. A non-blocking bug caused by misusing WaitGroup.

Misusing WaitGroup. There is an underlying rule when using WaitGroup, which is that Add has to be invoked before Wait. The violation of this rule causes 6 non-blocking bugs. Figure 9
shows one such bug in etcd, where there is no guarantee that Add at line 8 of func1 happens before Wait at line 5 of func2. The fix is to move Add into a critical section, which ensures that Add will either be executed before Wait or not be executed at all.

Special libraries. Go provides many new libraries, some of which use objects that are implicitly shared by multiple goroutines. If they are not used correctly, data races may happen. For example, the context object type is designed to be accessed by multiple goroutines that are attached to the context. etcd#7816 is a data-race bug caused by multiple goroutines accessing the string field of a context object.

Another example is the testing package, which is designed to support automated testing. A testing function (identified by a function name beginning with "Test") takes only one parameter of type testing.T, which is used to pass testing states such as errors and logs. Three data-race bugs are caused by accesses to a testing.T variable from the goroutine running the testing function and other goroutines created inside the testing function.

Observation 7: About two-thirds of shared-memory non-blocking bugs are caused by traditional causes. Go's new multi-thread semantics and new libraries contribute to the remaining one-third.

Implication 5: New programming models and new libraries that Go introduced to ease multi-thread programming can themselves be the reason for more concurrency bugs.

6.1.2 Errors during Message Passing

Errors during message passing can also cause non-blocking bugs, and they comprise around 20% of our collected non-blocking bugs.

Misusing channel. As discussed in Section 2, there are several rules when using channel, and violating them can lead to non-blocking bugs in addition to blocking ones. There are 16 non-blocking bugs caused by misuse of channel.

1 - select {
2 - case <- c.closed:
3 - default:
4 +     Once.Do(func() {
5           close(c.closed)
6 +     })
7 - }

Figure 10. A bug caused by closing a channel twice.

As an example, Docker#24007 in Figure 10 is caused by the violation of the rule that a channel can only be closed once. When multiple goroutines execute this piece of code, more than one of them can execute the default clause and try to close the channel at line 5, causing a runtime panic in Go. The fix is to use the Once package to enforce that the channel is only closed once.

Another type of concurrency bug happens when using channel and select together. In Go, when multiple messages are received by a select, there is no guarantee which one will be processed first. This non-deterministic implementation of select caused 3 bugs. Figure 11 shows one such example.

1  ticker := time.NewTicker()
2  for {
3  +    select {
4  +    case <- stopCh:
5  +        return
6  +    default:
7  +    }
8      f()
9      select {
10     case <- stopCh:
11         return
12     case <- ticker:
13     }
14 }

Figure 11. A non-blocking bug caused by select and channel.

The loop at line 2 executes a heavy function f() at line 8 whenever the ticker ticks at line 12 (case 2) and stops its execution when receiving a message from channel stopCh at line 10 (case 1). If a message from stopCh arrives and the ticker ticks at the same time, there is no guarantee which one will be chosen by select. If select chooses case 2, f() will be executed unnecessarily one more time. The fix is to add another select at the beginning of the loop to handle the unprocessed signal from stopCh.

Special libraries. Some of Go's special libraries use channels in a subtle way, which can also cause non-blocking bugs. Figure 12 shows one such bug related to the time package, which is designed for measuring time.

1 - timer := time.NewTimer(0)
2 + var timeout <- chan time.Time
3   if dur > 0 {
4 -     timer = time.NewTimer(dur)
5 +     timeout = time.NewTimer(dur).C
6   }
7   select {
8 - case <- timer.C:
9 + case <- timeout:
10  case <- ctx.Done():
11      return nil
12  }

Figure 12. A non-blocking bug caused by Timer.

Here, a timer is created with timeout duration 0 at line 1. At the creation time of a Timer object, the Go runtime (implicitly) starts a library-internal goroutine which begins the timer countdown. The timer is set with a timeout value dur at line 4. The developers here intended to return from the current function only when dur is larger than 0 or when ctx.Done(). However, when dur is not greater than 0, the library-internal goroutine signals the timer.C channel as soon as the timer is created, causing the function to return prematurely (line 8). The fix is to avoid the Timer creation at line 1.

Observation 8: There are far fewer non-blocking bugs caused by message passing than by shared memory accesses. Rules of channel and complexity of using channel with other
The data race detector successfully detected 7/13 traditional bugs and 3/4 bugs caused by anonymous functions. For six of these successes, the data race detector reported bugs on every run, while for the remaining four, around 100 runs were needed before the detector reported a bug.

There are three possible reasons why the data race detector failed to report many non-blocking bugs. First, not all non-blocking bugs are data races; the race detector was not designed to detect these other types. Second, the effectiveness of the underlying happens-before algorithm depends on the interleaving of concurrent goroutines. Finally, with only four shadow words for each memory object, the detector cannot keep a long history and may miss data races.

Implication 8: A simple traditional data race detector cannot effectively detect all types of Go non-blocking bugs. Future research can leverage our bug analysis to develop more informative, Go-specific non-blocking bug detectors.

7 Discussion and Future Work

Go advocates for making thread creation easy and lightweight and for using message passing over shared memory for inter-thread communication. Indeed, we saw more goroutines created in Go programs than traditional threads, and there are significant usages of Go channel and other message passing mechanisms. However, our study shows that if not used correctly, these two programming practices can potentially cause concurrency bugs.

Shared memory vs. message passing. Our study found that message passing does not necessarily make multi-threaded programs less error-prone than shared memory. In fact, message passing is the main cause of blocking bugs. To make it worse, when combined with traditional synchronization primitives or with other new language features and libraries, message passing can cause blocking bugs that are very hard to detect. Message passing causes fewer non-blocking bugs than shared memory synchronization and, surprisingly, was even used to fix bugs that are caused by wrong shared memory synchronization. We believe that message passing offers a clean form of inter-thread communication and can be useful in passing data and signals. But they are

in detecting bugs that are caused by the combination of channel and locks, such as the one in Figure 7. Misusing Go libraries can cause both blocking and non-blocking bugs. We summarized several patterns of misusing Go libraries in our study. Detectors can leverage the patterns we learned to reveal previously unknown bugs. Our study also found that the violation of the rules Go enforces with its concurrency primitives is one major reason for concurrency bugs. A novel dynamic technique can try to enforce such rules and detect violations at runtime.

8 Related Works

Studying Real-World Bugs. There are many empirical studies on real-world bugs [9, 24, 25, 29, 40, 44, 45]. These studies have successfully guided the design of various bug-combating techniques. To the best of our knowledge, our work is the first study focusing on concurrency bugs in Go and the first to compare bugs caused by errors when accessing shared memory and errors when passing messages.

Combating Blocking Bugs. As a traditional problem, there are many research works fighting deadlocks in C and Java [7, 28, 33–35, 51, 54, 55, 58, 59]. Although useful, our study shows that there are many non-deadlock blocking bugs in Go, which are not the goal of these techniques. Some techniques have been proposed to detect blocking bugs caused by misusing channel [38, 39, 49, 56]. However, blocking bugs can be caused by other primitives. Our study reveals many code patterns for blocking bugs that can serve as the basis for future blocking bug detection techniques.

Combating Non-Blocking Bugs. Much previous research has been conducted to detect, diagnose, and fix non-deadlock bugs caused by failing to synchronize shared memory accesses [4, 5, 8, 14, 16, 17, 30–32, 43, 46, 47, 52, 62–64]. These techniques are promising candidates to apply to Go concurrency bugs. However, our study finds that there is a non-negligible portion of non-blocking bugs caused by errors during message passing, and these bugs are not covered by previous works. Our study emphasizes the need for new techniques to fight errors during
message passing.
only useful if used correctly, which requires programmers
to not only understand message passing mechanisms well
but also other synchronization mechanisms of Go.
Implication on bug detection. Our study reveals many 9 Conclusion
buggy code patterns that can be leveraged to conduct con- As a programming language designed for concurrency, Go
currency bug detection. As a preliminary effort, we built a provides lightweight goroutines and channel-based message
detector targeting the non-blocking bugs caused by anony- passing between goroutines. Facing the increasing usage of
mous functions (e.g. Figure 8). Our detector has already dis- Go in various types of applications, this paper conducts the
covered a few new bugs, one of which has been confirmed first comprehensive, empirical study on 171 real-world Go
by real application developers [12]. concurrency bugs from two orthogonal dimensions. Many
More generally, we believe that static analysis plus pre- interesting findings and implications are provided in our
vious deadlock detection algorithms will still be useful in study. We expect our study to deepen the understanding
detecting most Go blocking bugs caused by errors in shared of Go concurrency bugs and bring more attention to Go
memory synchornization. Static technologies can also help concurrency bugs.
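To make the anonymous-function bug pattern referenced above concrete, the following is a minimal sketch of our own (the function names are ours, not from any studied application). A goroutine closure captures loop variables shared with the parent goroutine, which races with the loop's own writes; note that Go 1.22 changed loop-variable scoping to per-iteration, so the buggy version only misbehaves on earlier Go versions.

```go
package main

import (
	"fmt"
	"sync"
)

// buggySquares: the goroutine closures capture the loop variables
// i and n, which (before Go 1.22) are shared across iterations, so
// the goroutines race with the loop and may read the wrong values.
func buggySquares(nums []int) []int {
	var wg sync.WaitGroup
	out := make([]int, len(nums))
	for i, n := range nums {
		wg.Add(1)
		go func() { // BUG before Go 1.22: i and n are shared
			defer wg.Done()
			out[i] = n * n
		}()
	}
	wg.Wait()
	return out
}

// fixedSquares applies the common fix: pass the loop variables as
// arguments so each goroutine receives its own copies.
func fixedSquares(nums []int) []int {
	var wg sync.WaitGroup
	out := make([]int, len(nums))
	for i, n := range nums {
		wg.Add(1)
		go func(i, n int) {
			defer wg.Done()
			out[i] = n * n
		}(i, n)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(fixedSquares([]int{1, 2, 3, 4}))
}
```

Running the buggy variant under `go run -race` on a pre-1.22 toolchain reports the race; the fixed variant is race-free on any version.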
Understanding Real-World Concurrency Bugs in Go ASPLOS’19, April 13–17, 2019, Providence, RI, USA
[40] Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, and Haryadi S. Gunawi. Taxdc: A taxonomy of non-deterministic concurrency bugs in datacenter distributed systems. In Proceedings of the 21st International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '16), Atlanta, Georgia, USA, April 2016.
[41] Zhenmin Li, Lin Tan, Xuanhui Wang, Shan Lu, Yuanyuan Zhou, and Chengxiang Zhai. Have things changed now?: An empirical study of bug characteristics in modern open source software. In Proceedings of the 1st workshop on Architectural and system support for improving software dependability (ASID '06), San Jose, California, USA, October 2006.
[42] Ziyi Lin, Darko Marinov, Hao Zhong, Yuting Chen, and Jianjun Zhao. Jacontebe: A benchmark suite of real-world java concurrency bugs. In 30th IEEE/ACM International Conference on Automated Software Engineering (ASE '15), Lincoln, Nebraska, USA, November 2015.
[43] Haopeng Liu, Yuxi Chen, and Shan Lu. Understanding and generating high quality patches for concurrency bugs. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE '16), Seattle, Washington, USA, November 2016.
[44] Lanyue Lu, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Shan Lu. A study of linux file system evolution. In Proceedings of the 11th USENIX Conference on File and Storage Technologies (FAST '13), San Jose, California, USA, February 2013.
[45] Shan Lu, Soyeon Park, Eunsoo Seo, and Yuanyuan Zhou. Learning from mistakes – a comprehensive study of real world concurrency bug characteristics. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '08), Seattle, Washington, USA, March 2008.
[46] Shan Lu, Joseph Tucek, Feng Qin, and Yuanyuan Zhou. Avio: Detecting atomicity violations via access interleaving invariants. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '06), San Jose, California, USA, October 2006.
[47] Brandon Lucia and Luis Ceze. Finding concurrency bugs with context-aware communication graphs. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '09), New York, USA, December 2009.
[48] Kedar S. Namjoshi. Are concurrent programs that are easier to write also easier to check? In Workshop on Exploiting Concurrency Efficiently and Correctly, 2008.
[49] Nicholas Ng and Nobuko Yoshida. Static deadlock detection for concurrent go by global session graph synthesis. In Proceedings of the 25th International Conference on Compiler Construction (CC '16), Barcelona, Spain, March 2016.
[50] Rob Pike. Go Concurrency Patterns. URL: https://talks.golang.org/2012/concurrency.slide.
[51] Dawson R. Engler and Ken Ashcraft. Racerx: Effective, static detection of race conditions and deadlocks. In Proceedings of the 19th ACM symposium on Operating systems principles (SOSP '03), Bolton Landing, New York, USA, October 2003.
[52] Stefan Savage, Michael Burrows, Greg Nelson, Patrick Sobalvarro, and Thomas Anderson. Eraser: A dynamic data race detector for multithreaded programs. ACM Transactions on Computer Systems, 15(4):391-411, 1997.
[53] Konstantin Serebryany and Timur Iskhodzhanov. Threadsanitizer: Data race detection in practice. In Proceedings of the Workshop on Binary Instrumentation and Applications (WBIA '09), New York, USA, December 2009.
[54] Vivek K Shanbhag. Deadlock-detection in java-library using static-analysis. In 15th Asia-Pacific Software Engineering Conference (APSEC '08), Beijing, China, December 2008.
[55] Francesco Sorrentino. Picklock: A deadlock prediction approach under nested locking. In Proceedings of the 22nd International Symposium on Model Checking Software (SPIN '15), Stellenbosch, South Africa, August 2015.
[56] Kai Stadtmüller, Martin Sulzmann, and Peter Thiemann. Static trace-based deadlock analysis for synchronous mini-go. In 14th Asian Symposium on Programming Languages and Systems (APLAS '16), Hanoi, Vietnam, November 2016.
[57] Jie Wang, Wensheng Dou, Yu Gao, Chushu Gao, Feng Qin, Kang Yin, and Jun Wei. A comprehensive study on real world concurrency bugs in node.js. In Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE '17), Urbana-Champaign, Illinois, USA, October 2017.
[58] Yin Wang, Terence Kelly, Manjunath Kudlur, Stéphane Lafortune, and Scott A. Mahlke. Gadara: Dynamic deadlock avoidance for multithreaded programs. In Proceedings of the 8th USENIX Conference on Operating systems design and implementation (OSDI '08), San Diego, California, USA, December 2008.
[59] Yin Wang, Stéphane Lafortune, Terence Kelly, Manjunath Kudlur, and Scott A. Mahlke. The theory of deadlock avoidance via discrete control. In Proceedings of the 36th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages (POPL '09), Savannah, Georgia, USA, January 2009.
[60] Wikipedia. Go (programming language). URL: https://en.wikipedia.org/wiki/Go_(programming_language).
[61] Weiwei Xiong, Soyeon Park, Jiaqi Zhang, Yuanyuan Zhou, and Zhiqiang Ma. Ad hoc synchronization considered harmful. In Proceedings of the 9th USENIX Conference on Operating systems design and implementation (OSDI '10), Vancouver, British Columbia, Canada, October 2010.
[62] Jie Yu and Satish Narayanasamy. A case for an interleaving constrained shared-memory multi-processor. In Proceedings of the 36th annual International symposium on Computer architecture (ISCA '09), Austin, Texas, USA, June 2009.
[63] Yuan Yu, Tom Rodeheffer, and Wei Chen. Racetrack: Efficient detection of data race conditions via adaptive tracking. In Proceedings of the 20th ACM symposium on Operating systems principles (SOSP '05), Brighton, United Kingdom, October 2005.
[64] Wei Zhang, Chong Sun, and Shan Lu. Conmem: detecting severe concurrency bugs through an effect-oriented approach. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '10), Pittsburgh, Pennsylvania, USA, March 2010.