Race Conditions: Threads Passing in the Night
Have you ever written some code that…didn’t quite work? It works MOST of the time, but every once in a while it stops working or fails under load in production, and you can’t recreate it. This could be the result of a race condition, which might occur once in a thousand cycles, once in a million, or, unfortunately…never again. If you’re here specifically to solve this problem, “this article is the sign you’ve been looking for”; otherwise, be prepared for a lot of information that might keep you up at night. After reading this article, you should be able to describe, identify, and mitigate race conditions within any sufficiently concurrent application.
TLDR; race conditions are a side effect of concurrent processes accessing shared data. They can be mitigated (and in some cases resolved) by applying synchronization that enforces the order of operations and protects the state of the shared resource with respect to time.
A race condition is a situation where concurrent processes access a shared resource and their perspective of it (in terms of time) isn’t consistent. It’s a “timing” problem: the shared resource (or rather the state of that resource) isn’t protected, so the timing of operations affects the output in unexpected ways. Race conditions generally present when the timing of those concurrent processes is incredibly similar…”like strangers, waitin’, up and down the boulevard” or “ships passing in the night”.

Here is a more concrete example: an application that has four concurrent processes/threads operating on a shared resource (an integer), two readers and two writers (each pair sharing the same logic). At a rate of 1Hz (once per second), the readers print the current value of the shared integer while the writers increment it by one.
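Here’s a minimal sketch of that application in Go (my choice of language; the article isn’t tied to any particular one, and the names `reader`, `writer`, and `shared` are just for illustration). Note that nothing protects the shared value yet; that’s the subject of the rest of the article.

```go
package main

import (
	"fmt"
	"time"
)

var shared int // the shared resource

func reader(name string) {
	for {
		fmt.Printf("read(%s): %d\n", name, shared)
		time.Sleep(time.Second) // 1Hz
	}
}

func writer(name string) {
	for {
		shared++ // NOT atomic: read, modify, write
		fmt.Printf("write(%s): %d\n", name, shared)
		time.Sleep(time.Second) // 1Hz
	}
}

func main() {
	go reader("one")
	go reader("two")
	go writer("one")
	go writer("two")
	time.Sleep(3 * time.Second) // let it run for a few cycles
}
```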
These are expected/valid example outputs for that application running for two seconds, printing one line per process per second. The most important thing to understand is that, because all of the processes share the same resource, their operations on it must NEVER execute in parallel; in these valid runs, access to the resource is effectively serialized:
read(one): 0
read(two): 0
write(one): 1
write(two): 2
read(two): 2
read(one): 2
write(one): 3
write(two): 4

read(one): 0
write(one): 1
read(two): 1
write(two): 2
read(two): 2
write(one): 3
read(one): 3
write(two): 4

write(one): 1
write(two): 2
read(one): 2
read(two): 2
write(one): 3
write(two): 4
read(two): 4
read(one): 4
The examples above are well within the spectrum of valid outputs; because all four processes run at the same 1Hz rate, they execute in a more or less random order. Although the ordering is all over the place, the output is 100% consistent, meaning the reads only ever show the “last” value written by one of the write processes.
The logs below show what race conditions look like: the output is inconsistent. These examples show race conditions that present VERY clearly:
read(one): 0
read(two): 0
write(one): 1
write(two): 1
read(two): 1
read(one): 1
write(one): 2
write(two): 3
This example shows a read-modify-write race condition, and in this case data is LOST: the concurrent writers “read” the data at the same time, and because they both read it when the value is 0, they both “increment” it to 1.
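If you want to see this lost-update behavior for yourself, here’s a small sketch (again in Go, purely for illustration) where two goroutines increment a shared counter many times without synchronization; the final value regularly comes out lower than expected, and Go’s race detector (`go run -race`) will flag the access:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const n = 100000
	var counter int
	var wg sync.WaitGroup

	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < n; j++ {
				counter++ // read, modify, write: not atomic
			}
		}()
	}

	wg.Wait()
	// Expected 200000, but increments are routinely lost to the race.
	fmt.Println("expected:", 2*n, "got:", counter)
}
```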
write(one): 1
write(two): 2
read(one): 1
read(two): 2
read(two): 2
read(one): 2
write(one): 3
write(two): 4
In this example, we have a read race condition: even though the shared value is already “2”, read(one) reads (and prints) the value from BEFORE it was incremented by 1. In this case, there’s no guarantee that even if one process reads before another, they print in the order of execution (there are some interesting implications here for implementation). While no data is “lost” in this second example, it’s somewhat more nefarious because data IS lost from the perspective of the reader (within specific contexts).

Mitigation and resolution of race conditions are a function of the business logic they involve; some race conditions can be totally resolved, while others can only be mitigated. Both come down to adding synchronization that serializes the business logic susceptible to the race. These are the general methods to resolve/mitigate race conditions:
- read-modify-write race conditions can be mitigated by having a single writer
- read-modify-write race conditions can be resolved by using a mutex to protect the resource being accessed concurrently
- read race conditions can be mitigated by having a single reader
- read race conditions can be resolved by using a buffer to enforce ordering and define what constitutes valid data
To simplify, you’re going to do one of three things to handle a race condition:
- Do nothing
- Use a mutex to protect the shared resource
- Use a buffer with logic to prevent data loss on read
Doing nothing is 100% OK; if you start here, you can ALWAYS decide to do something later. This is the best option when you’re confident that mitigation/resolution is outside the scope of the work effort, OR when the complexity introduced by mitigating/resolving the race condition doesn’t line up with the probability of it occurring. Although technical debt gets a bad rap, it’s the mark of a good developer to constantly revisit a decision and determine whether or not it deserves attention (over and over again); it’s okay to err on the side of not fixing something as long as you keep it in mind. Lazy programmers can still be good programmers.

A mutex can be used to enforce what can access the shared resource and when; this protects the “state” of the resource. For example, you can be sure that no other process can write until AFTER a given write operation has completed (e.g., the earlier example where two increments executed but only +1 was observed could never happen). Mutexes have the side effect of serializing execution, so there is often a performance hit under heavy load because other threads block while waiting on the mutex. In situations where read race conditions are OK for the context of the application, caching can be an appropriate solution to reduce mutex contention and mitigate the performance hit for reads.
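As a sketch of what that looks like (continuing the Go example from earlier; the mutex and variable names are mine), every access to the shared integer, reads included, goes through a `sync.Mutex`:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

var (
	mu     sync.Mutex
	shared int
)

func reader(name string) {
	for {
		mu.Lock()
		value := shared // snapshot the state while holding the lock
		mu.Unlock()
		fmt.Printf("read(%s): %d\n", name, value)
		time.Sleep(time.Second)
	}
}

func writer(name string) {
	for {
		mu.Lock()
		shared++ // the read-modify-write is now serialized
		value := shared
		mu.Unlock()
		fmt.Printf("write(%s): %d\n", name, value)
		time.Sleep(time.Second)
	}
}

func main() {
	go reader("one")
	go reader("two")
	go writer("one")
	go writer("two")
	time.Sleep(3 * time.Second)
}
```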
Although read race conditions can be benign (i.e., the data is eventually correct), they can also be catastrophic; what if the read process is used to store data for access later? In these cases, a read race condition can ruin a “perspective” of the data: although the data was “correct” (what was actually generated), from the perspective of the reader storing it, some of it is missing. A solution to this problem is to use a buffer/queue; this allows you to create logic that determines when data has been “mutated” and to store a copy of that state in the order it occurred. Buffers and queues can solve problems with mutex contention and data loss, and they’re a starting point for having multiple readers that actually “work”. This problem is commonly solved with the producer/consumer design pattern.
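Here’s one way that could look (a Go sketch under my own assumptions about the buffer): each writer publishes every new state onto a buffered channel, and a single consumer drains it in order, so no mutation is missed:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	updates := make(chan int, 16) // the buffer/queue

	var (
		mu     sync.Mutex
		shared int
		wg     sync.WaitGroup
	)

	// Producers: increment the shared value and publish each new state.
	producer := func(name string) {
		defer wg.Done()
		for i := 0; i < 3; i++ {
			mu.Lock()
			shared++
			fmt.Printf("write(%s): %d\n", name, shared)
			updates <- shared // published while holding the lock, so order is preserved
			mu.Unlock()
			time.Sleep(time.Second)
		}
	}

	wg.Add(2)
	go producer("one")
	go producer("two")

	// Close the channel once both producers finish so the consumer can exit.
	go func() {
		wg.Wait()
		close(updates)
	}()

	// Single consumer: observes every mutation, in order, with nothing missed.
	for value := range updates {
		fmt.Println("observed:", value)
	}
}
```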
Although I’ve avoided focusing on implementation details, there’s value in mentioning that race conditions can also happen at macro levels: between applications, databases, file systems, or pods that scale horizontally. Synchronization is limited to the scope in which it’s implemented; for example, if you use a mutex scoped within an application that has two instances, neither instance can protect operations that occur in the other. The same applies to file system locks: they work well, but only if all interested parties share the same file system. This even extends to databases; sometimes the implied mutex a database gives you isn’t enough to ensure a race condition can’t occur, ESPECIALLY if the process you’re trying to protect spans multiple databases. In microservices architecture, this is addressed with the saga pattern or distributed transactions.
Race conditions are an expected side effect of concurrent processes; the two are inextricably linked.