Data Races, what are they, and why are they evil? – Part 1
Race Conditions and Data Races
Most readers will be aware of “Race Conditions”, or “Races”. They are a common cause of bugs when writing concurrent programs that are not seen in single-threaded programs.
You might also hear about people talking about a type of bug known as a “Data Race”. They have caused quite a stir online recently when the accuracy of a simulation program for the COVID-19 pandemic was called into question because of suspected potential “data races”.
So, are data races the same as race conditions?
Simply put, race conditions and data races are different. One is not even a subset of the other, which is a common misunderstanding even among experienced developers given that they both have “race” in their names, and share similar code patterns.
Race condition, according to its wiki page:
… is the condition of an electronics, software, or other system where the system’s substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.
This definition implies race condition is not inherently a bug. Race conditions are inevitable as long as concurrent events exist within a system.
Consider the following example:
Thread 1 // x is a shared variable x = 1 print x
Thread 2 // x is a shared variable x = 999
There is clearly a race condition going on, where the result of
print x is dependent on the sequence of assignments to
x. However, this behavior may be intended and would not be considered a bug if both results are acceptable outcomes.
On the other hand, a data race is clearly defined as a bug. A data race occurs when two conflicting memory accesses:
- are accessing the same piece of memory;
- are performed concurrently;
- have at least one write;
- are not protected by synchronization or locks.
The definition of concurrency is language-dependent. For example, the C++ Standard defines two actions as potentially concurrent if:
- they are performed by different threads, or
- they are unsequenced, and at least one is performed by a signal handler.
The second rule is idiosyncratic to C++. The Java Language Specification gives a different definition.
So, in contrast to a possibly desired race condition, a data race can be clearly defined as a bug that might result in undefined or unexpected behavior.
Now, upon re-examining the example above, we can discern that it is also a “data race”. The concurrent accesses to
x are unprotected, so it is a BUG!
A valid fix for this bug is to simply protect each access with a lock:
Thread 1 // x is a shared variable // l is a global lock lock(l) x = 1 unlock(l) lock(l) print x unlock(l)
Thread 2 // x is a shared variable // l is a global lock lock(l) x = 999 unlock(l)
Now, all data races in this example are gone, but a race condition still exists. The critical sections, the code segment between each lock and unlock pairing, are still unsequenced. The code can still be buggy, say if printing out “999” is undesirable.
Benign or non-benign? Wherein lies the evilness?
Let’s suppose printing out “999” is indeed undesired buggy behavior.
Clearly, the buggy printing of “999” is not introduced by the data race but the race condition. To get rid of the bug, we can either join thread 2 before the write operation to
x in thread 1; or put the two operations on
x in thread 1 into the same critical section. Both fixes ensure there can be no unexpected operation between
x = 1 and
Now, some of you may wonder: if the code is buggy before and after our first fix for the data race, what’s the point of fixing it, and what is the harm actually brought by a data race?
We answer those questions in Part 2 of this blog.