Can We Detect and Debug Multi-Threading Issues?


I recently came across a question on StackOverflow asking, “How to detect and debug multi-threading problems?”. This is an interesting, albeit broad, question. Those who took computer science classes in college likely remember how easy it is to quickly end up with cryptic errors or erratic behavior as soon as threads are introduced.

While the StackOverflow question has some good answers from clearly smart people, none truly addresses the question being asked, because there is no real answer.

Is it possible to detect and debug problems coming from multi-threaded code?

This is a sub question quoted from the StackOverflow question, and fortunately, it does have an answer. In theory, it is definitely possible to detect and debug problems coming from multi-threaded code. In practice, less so.

There are many different types of concurrency bugs, and each comes with its own techniques and challenges for detecting and debugging. One of the most commonly discussed types of concurrency bug is the data race, which involves two threads accessing the same memory in parallel.

Data races are notoriously difficult to detect and debug because they depend on a specific timing to be observed. Not only can you as the programmer not reliably control the timing, but trying to debug can even change the timing and prevent the race from occurring.


In my experience, data races are often exposed only when some non-deterministic Heisenbug cannot be explained and someone manually inspects the code searching for a potential data race. The manual inspection can be tedious, but there is a concrete set of circumstances to search for.

A data race occurs when:

  1. Two threads access the same location in memory (with at least one access being a write)
  2. Both threads are running in parallel
  3. The accesses are un-synchronized (e.g. they are not guarded by a lock)
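As a minimal illustration (my own sketch, not from the StackOverflow thread), here is a C++ snippet that meets all three criteria:

#include <thread>

int counter = 0;  // shared memory location

int main() {
    // Both threads write to `counter` in parallel with no lock,
    // satisfying all three criteria above.
    std::thread t1([] { counter++; });
    std::thread t2([] { counter++; });
    t1.join();
    t2.join();
    // The final value may be 1 or 2 depending on timing.
}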

Once some non-deterministic behavior is observed, you can manually search through your code for anything that matches the three criteria listed above. This process is generally tedious, and in complex software there is no guarantee that manual effort alone will succeed.

Are there any special logging frameworks, debugging techniques, or code inspectors to solve multi-threading issues?

This question is a paraphrased quote from the original StackOverflow question, and is why I stated earlier that the question has no real answer.

While there are plenty of logging frameworks, debugging techniques, and code inspectors designed for multi-threaded issues, none solves the problem. Multi-threaded bugs remain among the most challenging problems faced by software today. With that said, there are some great tools that can be extremely helpful, including a few that aid in data race detection specifically.


Tools like Intel Inspector and Google’s ThreadSanitizer can profile the execution of a program and report if a data race is observed at runtime. However, these tools rely on the data race actually occurring during the profiled execution.

As mentioned above, data races are notoriously difficult to detect and debug because they depend on a specific timing to be observed. If the race is not triggered during the execution, these tools will not report the data race.


We are working on a data race detector that addresses the limitations of tools like Intel Inspector. Our tool, Coderrect Scanner, does not rely on profiling the execution, but scans the source code directly and reasons about all possible races. By not relying on actually executing the code, we are able to eliminate the most difficult part of detecting data races: their non-deterministic nature during execution.

Conclusion

Even in the year 2020, with all of the advancements in computing and software development, the question “How to detect and debug multi-threading problems?” is nearly unanswerable. Software is increasingly relying on parallelism, yet concurrency bugs remain one of the hardest problems in computing. As software moves towards a highly parallel future, developers will likely rely more and more on tools like Thread Sanitizer and the Coderrect Scanner.

Announcing Coderrect Scanner v1.0.0

March 3rd, 2021 marks the first official release of the Coderrect Scanner. After over a year of developing a prototype and adding support for various features, the Coderrect Scanner has reached version 1.

The release of Coderrect Scanner v1 is one of the biggest releases so far. In addition to the usual core algorithm and detection improvements that come with each release, version 1 also comes with a complete overhaul of the report UI and a ton of new convenience features that make the scanner easier to use.

There are too many changes to cover them all in detail, like our new GitHub Action (see the Release Notes for the full list), so this post will focus on some of the more exciting updates to the report UI.

New Design

The report has been completely overhauled in both look and functionality. The new Index page lists a quick summary of all the targets that have been scanned for the current project. Clicking on any individual target shows the full race report for that target.

Diff View

One of the most exciting new features in the report is the diff view. This view allows you to choose between viewing all detected races, or only new races. New races are races that were detected in the newest commit, but not the previous commit.

This feature allows you to quickly see if any new races have been added since a previous version. For example, the diff view showed that the Coderrect Scanner detected 14 races in version 6.2.1 of redis that were not present in version 6.2.0.

Filtering Races

The new report also adds a variety of new options for filtering races.

Hovering the mouse over any line of code displayed in the report brings up an option to filter the race. Clicking it opens a menu that allows you to decide how to filter the race.

Races can be filtered by variable name, source file, variable type, or by any combination. After selecting one or more options and clicking save, any races matching the added criteria will be hidden from the current report.

Filtering races is only a temporary UI change. If you just want to get a quick idea of what kind of races are being reported, or filter races you have already reviewed, this is a perfect way to remove those races from the UI without worrying about hiding any future races.

However, if you do want the scanner to filter races permanently, because either you are confident there will never be a race on a certain variable or you don’t care about races in a certain file, the filter can be made permanent by updating Coderrect Scanner’s JSON config file. Manually updating the JSON can be a little tricky, so the report UI offers a way to automatically generate the correct JSON configuration based on any filters set via the report UI.

The “Create a Filter List” button at the top of the report automatically generates the correct JSON. Simply paste it into a file named coderrect.json in the same directory that you run coderrect from, and the filter will be made permanent. If at a later date you want to remove the filter, simply delete this JSON configuration.

These are just a few of the highlights of the new features included in release v1.0.0. Check out the newest changes by heading over to the download page and trying out v1.0.0.

Why Every Company Should Care About Concurrency Bugs


By: Chris

This August will see the tenth anniversary of Marc Andreessen’s claim that software is eating the world, a bold prediction that has proven to be remarkably insightful. Ten years later, we have witnessed software transform our lives and economy in a profound way, and it is expected to continue to do so at an accelerating pace for decades to come, with help from advances in big data, artificial intelligence, and faster computers.

Today the software-driven digital transformation is global in scope and touches each and every industry. Even traditional manufacturing companies relying on machines, and service companies historically depending on person-to-person interactions, are embracing the software revolution. As an example, Goldman Sachs Group, where bankers used to win the trust of clients through close relationships and insightful advice, now has over 10,000 software developers, more than many of the leading technology companies. No wonder the US Bureau of Labor Statistics projects the number of highly paid software development jobs will grow 22% in the next decade, compared to 4% for all jobs. Allow me to make a prediction here: every company, or at least every large company, will eventually become a software company in some way, shape, or form.

Ever since there was software, there have been software bugs. The software industry has come a long way in dealing with them: on average, coding defects have been reduced from 12-50 per 1,000 lines of delivered code (KLOC) to 0.434 per KLOC, thanks mostly to better tools and development processes, more rigorous testing, and safer programming languages.

However, some software bugs seem determined to take a stab at the motto that “software is eating the world”. Given how pervasively software powers our lives, a software bug these days can cause far more damage than just crashing one’s desktop. Nasdaq and a number of market makers learned this expensive lesson. On May 18th, 2012, Facebook went public on Nasdaq in the 3rd largest IPO in the US, and the largest for Nasdaq at the time. The Facebook stock came alive for trading at 11:30am, half an hour later than planned, and only after a software glitch preventing the delivery of order confirmations was mitigated. As a result, market maker UBS reported losing $350mn, and Nasdaq was fined $10mn and paid $40mn in compensation claims. Knight Capital Group, Citadel, and Citigroup also lost millions in this incident.

Nasdaq’s software was considered well-engineered and had executed millions of trades every day almost flawlessly before the incident. So, what went wrong? It turned out that the expensive glitch was caused by a time-bomb-like class of bugs called concurrency bugs. A concurrent software program runs on multiple threads to take advantage of multiple processing units and achieve high performance and low latency. Without correct synchronization, multiple threads can update shared data at the same time, and the order of their execution is unpredictable. The exponential number of possible interleavings between multiple threads can quickly go beyond what a human developer’s brain can handle, making such code error prone. A tricky part is that the manifestation of such bugs depends on the timing and sequence of the software threads involved; that is why a seemingly perfect piece of software can execute well for years before something catastrophic suddenly happens.

Can software testing come to the rescue? Yes and no. If the test cases used happen to trigger the problematic interleavings, the bug will be exposed in the testing phase. However, the space of potential interleavings is typically astronomical, making comprehensive test coverage impossible. To make it even worse, even if a test triggers a bug, that bug will not necessarily show up again the next time the test is run. This phenomenon is called non-determinism. These challenges make software testing for concurrency bugs unreliable at best, and a total miss many times. It is not uncommon for developers to spend weeks and even months manually examining source code to debug concurrency issues.

These all sound scary, and unfortunately they remain largely true in today’s software. Because concurrency issues can be very painful to deal with, further exacerbated by the lack of advanced tools to assure quality for concurrent software, they often slip through quality assurance measures. A quick search in the National Vulnerability Database produced 15 reported security vulnerabilities caused by concurrency issues in this month (January 2021) alone. Affected software includes Apple macOS, Android, Google ChromeOS, and NVIDIA vGPU. If these well-resourced and most technologically sophisticated companies still suffer from concurrency bugs, the situation is likely to be even more grim elsewhere.

Saying “every company should care” may be a slight exaggeration. However, if you are running a software team or a company that is leading or embracing digital transformation, it would be wise to look into concurrency issues and make sure the best quality assurance measures are put in place to give you multithreading peace of mind.

Note 1: “Code Complete: A Practical Handbook of Software Construction”, Steve McConnell

Note 2: “Measuring Software Quality – A Study of Open Source Software”, Coverity

New Concurrency APIs in C++ 17 (Parallel Algorithms in STL)


By: Jie

Introduction

Prior to the introduction of concurrency APIs in C++ 11, programmers had to call platform APIs or third-party library APIs (e.g., POSIX Thread APIs or Boost Thread APIs) to write a multi-threading program.

To make code more portable, C++ started introducing concurrency APIs in C++ 11, when std::thread became available so programmers could use a uniform API rather than call pthread_create() in POSIX or CreateThread() on Windows to create threads. Mutexes, future/async, and condition variables were also introduced, which are the necessary low-level APIs to develop a concurrent program.

C++ 14 didn’t do much in terms of concurrency. It mainly introduced reader-writer locks, which are still low-level concurrency APIs.

There are 150+ algorithms (e.g., search, sort) in the Standard Template Library (STL), and they are all sequential. To leverage the power of multiple cores, programmers have to parallelize these algorithms using the low-level concurrency APIs mentioned above. That is not an easy task – programmers need to understand complex strategies such as work stealing and thread pools.

Now, C++ 17 comes to the rescue. It not only parallelizes almost all STL algorithms but also supports vectorization. This article will show you how to use parallelized/vectorized versions of STL algorithms through a couple of examples.

Prerequisites

I’ll use Clang 10 on Ubuntu 18.04 to compile and run the samples throughout this article. To install Clang:

$ sudo apt update
$ sudo apt install clang

Clang relies on TBB (Threading Building Blocks, a library from Intel) to implement the C++ 17 parallel algorithms. To install TBB:

$ sudo apt install libtbb-dev

The following listing shows the CMakeLists.txt we use:

cmake_minimum_required(VERSION 3.16)
project(cbegin)

set(CMAKE_CXX_COMPILER "/usr/bin/clang++")
set(CMAKE_CXX_STANDARD 17)

add_executable(blog main.cpp)
target_link_libraries(blog tbb)

The command on line 8 (target_link_libraries) links the program with TBB.

An example of sorting

STL implements a sort algorithm std::sort(). The following program shows how to sort one billion integers using the default sequential std::sort.

#include <iostream>
#include <algorithm>
#include <chrono>

// 1G integers
constexpr long COUNT = 1024*1024*1024;

int main() {
    int32_t *data{new int32_t[COUNT]};

    typedef std::chrono::high_resolution_clock Clock;

    auto t1 = Clock::now();
    std::sort(data, data+COUNT);
    auto t2 = Clock::now();

    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    std::cout << elapsed.count() << '\n';
}

It takes about 178.764 seconds to sort the array on my 128-thread (64-core) machine, with only one thread doing the sorting (a CPU monitor showed two busy threads, but one of them belonged to another program).

The following code snippet shows a parallelized version of std::sort.

#include <iostream>
#include <algorithm>
#include <execution>
#include <chrono>

// 1G integers
constexpr long COUNT = 1024*1024*1024;

int main() {
    int32_t *data{new int32_t[COUNT]};

    typedef std::chrono::high_resolution_clock Clock;

    auto t1 = Clock::now();
    std::sort(std::execution::par, data, data+COUNT);
    auto t2 = Clock::now();

    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    std::cout << elapsed.count() << '\n';
}

It takes only about 3.035 seconds to finish the same job on the same machine, with all 128 threads busy.

What are the differences between these two versions?

  • Line 3 – includes a new header file called “execution” that defines several execution policies. These policies tell std::sort how developers want to run the sorting – sequentially or in parallel.
  • Line 15 – uses a new overload of std::sort whose first parameter is the execution policy. In our example, we use “std::execution::par” to tell the STL to sort the integers in parallel.

Execution Policies

C++ 17 offers execution policies for most of the algorithms:

  • sequenced_policy (and the corresponding global object std::execution::seq) – tells the STL to execute your algorithm sequentially.
  • parallel_policy (and the corresponding global object std::execution::par) – tells the STL to execute your algorithm in parallel.
  • parallel_unsequenced_policy (and the corresponding global object std::execution::par_unseq) – tells the STL to execute your algorithm in parallel, with the additional ability to use vector instructions (like SSE, AVX).

The following example illustrates how to execute std::find in parallel.

#include <iostream>
#include <algorithm>
#include <execution>

// 1G integers
constexpr size_t COUNT = 1024*1024*1024;

int main() {
    int32_t *data{new int32_t[COUNT]};
    auto itr = std::find(std::execution::par, data, data+COUNT, 101);
    if (itr != data+COUNT)
        std::cout << *itr << '\n';
}

Race Condition

For the parallel_policy and parallel_unsequenced_policy execution policies, the STL uses a pool of threads to process data simultaneously. It is therefore possible to introduce race conditions into your program.

The following example increments a counter after processing each data item.

#include <iostream>
#include <algorithm>
#include <execution>
#include <chrono>

constexpr size_t COUNT = 8L*1024*1024*1024;

int main() {
    int32_t *data1{new int32_t[COUNT]};
    int32_t *data2{new int32_t[COUNT]};
    int32_t *res{new int32_t[COUNT]};
    int64_t counter = 0;

    typedef std::chrono::high_resolution_clock Clock;

    auto t1 = Clock::now();
    std::transform(std::execution::par, data1, data1+COUNT, data2, res, [&counter](int i, int j){
        counter++;
        return i + j; });
    auto t2 = Clock::now();

    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    std::cout << elapsed.count() << '\n';

    std::cout << "counter=" << counter << '\n';
}

Line 12 – defines a counter.

Lines 17-19 – the lambda increases the counter after it finishes processing a data item. Since multiple threads execute this piece of logic in parallel, they may update the “counter” at the same time.

I ran the program on the same machine, and the printed counter was much smaller than the correct value of 8,589,934,592. Coderrect is working on detecting this kind of race; stay tuned for more details soon.
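For this particular example, one straightforward fix (a sketch; only the counter’s type and the lambda body change) is to make the counter a std::atomic so that concurrent increments are synchronized:

#include <atomic>

std::atomic<int64_t> counter{0};

std::transform(std::execution::par, data1, data1+COUNT, data2, res,
               [&counter](int i, int j) {
    // fetch_add makes the increment indivisible, removing the race
    counter.fetch_add(1, std::memory_order_relaxed);
    return i + j;
});

Note that the atomic increments serialize on the counter, so this restores correctness at some cost in scalability.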

More are coming …

This article quickly goes through the parallel STL algorithms introduced by C++ 17. What’s next?

C++ 20 was approved in September 2020, which introduces more concurrency APIs such as latches, barriers, and coroutines. Our next article will cover its extensions to std::future.

C++ 20 is here! What is new for multithreaded programming?


C++ 20 will bring the most significant change to modern C++ since C++ 11. Many exciting new features like modules and concepts are going to change C++ fundamentally. It also introduces several interesting new features for multithreaded programming, and we briefly introduce them in this blog!

std::jthread and stop tokens

From the start (pre-C++11), many (including me) had wanted threads to have what is now jthread’s behavior.

BJARNE STROUSTRUP, Thriving in a Crowded and Changing World: C++ 2006–2020

The jthread (short for “joining thread”) is a thread that obeys RAII; that is, when a jthread goes out of scope, its destructor joins the thread rather than terminating the program.

inline jthread::~jthread() {
  if (joinable()) {   // if not joined/detached, signal stop and wait for end:
    request_stop();
    join();
  }
}

While std::thread requires an explicit call to std::thread::join to join threads, std::jthread joins implicitly upon destruction, so jthread can make your program much simpler in some cases. Consider the following example (from the original jthread proposal), which is common in many scenarios:

void compareWithStdThreadJoining() {
  std::thread t([] { /*...*/ });
  try {
    //...
  } catch (...) {
    t.join();
    throw; // rethrow
  }
  t.join();
}

To make sure the thread is joined under all conditions, a call to join() has to be inserted on every path through the code. Using jthread, the code can be greatly simplified, as the jthread is automatically joined when it goes out of scope.
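For instance, the function above might reduce to something like this sketch:

void compareWithJThreadJoining() {
  std::jthread t([] { /*...*/ });
  //... may throw; no explicit join() calls are needed,
  // because ~jthread() joins on every exit path
}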

std::jthread is cooperatively interruptible

While “cooperatively interruptible” might sound complicated, it is actually an old and simple (yet effective) solution that has already been widely adopted. The basic idea is: if I want a jthread to stop, I set its stop token. It is called “cooperative” because, after the stop token is set, it is the jthread's obligation to occasionally test the token and, if the token has been set, clean up and exit. For example, the following code snippet won’t work as one might expect – the jthread won’t stop after calling request_stop().

void interruptJThread() {
   std::jthread t([]() {
      while (true) { cout << "running..." << endl; }
   });
   t.request_stop();
}

To make it work, you need to use the std::stop_token introduced in C++ 20, similar to the following:

void interruptJThread() {
   std::jthread t([](std::stop_token token) {
      while (!token.stop_requested()) { cout << "running..." << endl; }
   });
   t.request_stop();
}

At a high level, this is much the same as using an atomic boolean flag to stop the thread, though a hand-rolled flag would not be as clean as the standard API. More advanced uses, such as registering a callback for when a specific stop_token is set, are also possible using std::stop_callback.
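As a brief sketch (assuming a standard library with full C++20 support), a callback can be attached to a token like this:

#include <iostream>
#include <thread>
#include <stop_token>

int main() {
    std::jthread worker([](std::stop_token token) {
        // The callback runs as soon as a stop is requested.
        std::stop_callback cb(token, [] { std::cout << "stop requested\n"; });
        while (!token.stop_requested()) { /* do work */ }
    });
    worker.request_stop();  // fires the callback and ends the loop
}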

atomic<shared_ptr<T>>

It might surprise you that std::shared_ptr<T> is not fully thread safe before C++ 20! The reference count itself is updated atomically, but concurrent, unsynchronized reads and writes of the same shared_ptr object (say, one thread reassigning it while another copies it) are data races.

Now C++ 20 finally offers a specialization for atomic<shared_ptr<T>>. By using atomic<shared_ptr<T>>, a std::shared_ptr<T> can finally be manipulated safely by multiple threads!
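A minimal sketch of the new specialization in use (assuming a standard library that already ships it):

#include <atomic>
#include <memory>
#include <thread>

std::atomic<std::shared_ptr<int>> ptr = std::make_shared<int>(0);

int main() {
    std::thread writer([] { ptr.store(std::make_shared<int>(42)); });
    std::thread reader([] {
        // load() takes a consistent snapshot; *p is 0 or 42, never torn
        std::shared_ptr<int> p = ptr.load();
    });
    writer.join();
    reader.join();
}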

Conclusion

There are many other new concurrency-related features that we do not cover in this blog (such as std::atomic_ref<T>). Besides, although C++ 20 has made significant improvements in many aspects, the hoped-for general concurrency model (“executors”) still wasn’t ready for C++ 20, despite valiant efforts and an emerging broad consensus. While there is still a long way to go in improving how people do concurrent programming, we hope that the Coderrect Scanner is able to make your life easier!

TOCTOU: Funny Name for a Serious Bug


What is TOCTOU

Time-of-check, time-of-use — or TOCTOU — is a type of software bug that can lead to serious security vulnerabilities. At the time of writing, searching the keyword “TOCTOU” in the Common Vulnerabilities Database returns 94 cases where a TOCTOU bug could be exploited maliciously. These cases show examples of arbitrary code execution, privilege escalation, unintended file deletion, and many other exploits in widely used software.

TOCTOU is a specific type of race condition. For a full technical description, Mitre’s list of common software weaknesses offers the following:

The software checks the state of a resource before using that resource, but the resource’s state can change between the check and the use in a way that invalidates the results of the check. This can cause the software to perform invalid actions when the resource is in an unexpected state.

In simpler terms, the program wants to do something but only if some condition is true.

if (condition) {
    doSomething();
}

However, a time-of-check to time-of-use bug may allow doSomething to execute while condition is not currently true. A TOCTOU bug occurs when another thread or process running in parallel changes the value of condition after the if check, but before the call to doSomething. The end result is that doSomething is called when condition is false, and this can lead to disastrous consequences.

Take, for example, a pseudo code program to prevent a checking account from being overdrawn.

if (accountBalance() > 0) {
   withdrawMoney();
}

Assume the account starts with $1. The account owner makes two transactions simultaneously, causing this code block to be executed twice in parallel. If both threads check the accountBalance before either thread has withdrawn any money, both threads will see $1 in the account. Then both threads will withdraw money, likely causing the account to become overdrawn.

This is a time-of-check to time-of-use bug because the state (accountBalance) is changed between the check (accountBalance() > 0) and the use (withdrawMoney()).
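In the same pseudo code style, one fix is to hold a lock across both the check and the use, so they execute as one atomic step:

lock(accountLock);
if (accountBalance() > 0) {
   withdrawMoney();
}
unlock(accountLock);

Now a second transaction can never observe the balance between another transaction’s check and withdrawal.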

Classic Example

Wikipedia gives an excellent classical example of a Time of Check Time of Use bug. This example shows a race condition between multiple processes that allows an attacker to access a protected file without permission.

Victim
if (access("file", W_OK) != 0) {
   exit(1);
}

fd = open("file", O_WRONLY);
// Actually writing over /etc/passwd
write(fd, buffer, sizeof(buffer));

Attacker


// After the access check
symlink("/etc/passwd", "file");
// Before the open, 
//  "file" -> password database

In this case, the vulnerable program intends to check if the user can write to a file using access and then if — and only if — the user has permission, it will open and write to the file.

However, the permission check and the actual opening of the file are not atomic, making it possible for some other process to interleave the permission “check” and “use” of that permission.

An attacker may be able to change the file to point to something that the user does not have permission to write to, in this case /etc/passwd, but because the victim program has already succeeded in the call to access, it continues and opens the file anyway.

Assuming file was initially a symlink to a file that the attacker created and has write access to, the chain of events leading to a vulnerability are:

  1. The victim program calls access to see if the attacker has write permission to file
  2. The access check succeeds because file currently points to the text file created by the attacker
  3. In another process, the attacker changes file to point to /etc/passwd which they do not have permission to write to
  4. The victim program calls open("file", O_WRONLY) and allows the attacker to write to /etc/passwd

Thus the attacker was able to write to the security-critical /etc/passwd file.

This bug’s root cause is a race condition involving the filesystem, where the victim program expected the execution to be atomic. Inter-process race conditions like this have led to many security vulnerabilities. However, TOCTOU is possible anywhere parallelism can occur.

Multi-threaded TOCTOU

Although the classic examples of TOCTOU usually show a race on the filesystem between different processes, TOCTOU can be just as dangerous in multi-threaded software.

Null Pointer Dereference

Consider the following example.

Object *global;

// Thread 1
if (global != nullptr) {

     // null dereference
    auto value = *global;
}
// Thread 2
global = nullptr;

In this example, the programmer attempted to avoid a nullptr dereference in thread one by only dereferencing Object *global if it is not null. However, as the check and dereference are not done atomically, thread two can set global to be null after the check, but before the dereference. This results in thread one attempting to dereference nullptr.

One potential fix is to ensure that the check and use are made atomic and cannot be interleaved.

Object *global;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

// Thread 1
pthread_mutex_lock(&lock);
if (global != nullptr) {        
    auto value = *global;
}
pthread_mutex_unlock(&lock);
// Thread 2
pthread_mutex_lock(&lock);
global = nullptr;
pthread_mutex_unlock(&lock);

Now the locks ensure that thread two cannot interleave the check and dereference on thread one.

Although this particular case may seem relatively straightforward, this pattern can lead to serious security vulnerabilities in real software.

TOCTOU in Windows

A time-of-check to time-of-use race triggered a critical use after free vulnerability in Windows XP. The vulnerability allowed attackers to crash the system and potentially even execute arbitrary code with elevated privileges.

Based on the description, a rough recreation of what the vulnerability might have looked like is shown below.

struct Procedure {
  bool isProcessing;
  void (*function)(); // function pointer
};

std::list<Procedure*> pendingProcedures;

void* workerThread(void* arg) {
    while (!pendingProcedures.empty()) {
        auto it = pendingProcedures.begin();
        while (it != pendingProcedures.end()) {
            Procedure* proc = *it;
            if (proc->isProcessing) {
                ++it;  // skip procedures another thread is processing
                continue;
            }
            proc->isProcessing = true;

            // Process Procedure
            proc->function();

            // Update "it" to next procedure in list
            // Remove proc from pendingProcedures

            delete proc;
        }
    }
}

First, notice there is a Procedure struct that contains a flag isProcessing. Next, there is a list of Procedure* called pendingProcedures. Lastly, there is a function called workerThread that loops over the list of pending Procedures and processes them.

The worker thread searches for a procedure that is not being processed, at the line if (proc->isProcessing). Once a procedure with isProcessing set to false is found, the thread “acquires” the procedure by setting isProcessing to true.

The trouble here is a race on the isProcessing flag. Multiple worker threads can acquire the same procedure through the following chain of events.

// Thread 1
auto proc = *it; [proc = 0xabc123]

if (proc->isProcessing) 
proc->isProcessing = true;

// Process proc
// ...
delete proc;
// Thread 2
auto proc = *it; [proc = 0xabc123]
if (proc->isProcessing) 


proc->isProcessing = true;


// proc has already been deleted
// Dangerous Use After Free!
proc->function();

When both worker threads acquire the same procedure for processing, it is likely that one thread will process the procedure after another thread has already called delete proc. This causes a use after free error that can potentially be exploited by malicious agents to execute arbitrary code.
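One common fix, sketched below, is to make acquiring a procedure an atomic test-and-set, so only one thread can win it:

#include <atomic>

struct Procedure {
    std::atomic<bool> isProcessing{false};
    void (*function)();
};

// Inside the worker loop: exchange() atomically sets the flag to true
// and returns its previous value, so exactly one thread sees `false`.
if (proc->isProcessing.exchange(true)) {
    ++it;       // another thread already owns this procedure
    continue;
}
proc->function();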

Preventing TOCTOU

Despite the dire consequences, there is no consensus on how to detect and prevent TOCTOU bugs reliably.

For inter-process and filesystem-level TOCTOU race conditions, file locks, transactional operating systems, and other approaches have been proposed, but none has yet emerged as the de-facto solution.

Multi-threaded TOCTOU seems to be even more difficult to detect and prevent. Despite a wealth of research, there are very few, if any, production-ready tools for detecting TOCTOU bugs in concurrent programs. Tools like Valgrind or Intel Inspector may be able to detect the side effect of a TOCTOU bug (e.g., a use after free, race condition, or double delete), but neither can detect the TOCTOU directly. However, newer tools like Coderrect’s code scanner offer some support for detecting TOCTOU directly, as well as other types of concurrency bugs.

Overall, it seems that for now the best approach for preventing TOCTOU is developer awareness. As more developers become familiar with, and aware of, TOCTOU-style race conditions, they will be less likely to inadvertently allow TOCTOU bugs into critical code. Nonetheless, mistakes are inevitable, and for those cases we can rely on tools like Coderrect, Valgrind, and Intel Inspector to assist developers in detecting problems in their code.

New Tool Ensures HPC Code will Compute as Intended


HPC, what and why?

Over a century ago, the U.S. population had grown so large that it took several years to compute the U.S. Census results on a tabulating machine. Nowadays, it would take less than a second. The reason is simple: today’s computers are far faster than the machines of 100 years ago.

People call the ability to compute at high speeds high-performance computing (HPC). One of the best-known types of HPC solution is the supercomputer. Today, the world’s fastest supercomputer can perform over 4 × 10^17 complex calculations in one second — that is, more than four thousand million million times faster than a tabulating machine. HPC benefits the world in many ways, such as predicting weather in “real time”, enabling advanced artificial intelligence (AI), and finding drugs for COVID-19.

HPC programming is different, and is hard

I have a piece of C/C++/Python code running on my PC, but it is too slow. Can I run it on a supercomputer and get a faster compute result? Yes, that’s possible but some hard work is needed.

HPC programming is different from programming for a PC or a mobile phone. A supercomputer contains many compute nodes that work together to complete one or more tasks. This is called parallel processing. It’s similar to having millions of PCs networked together, combining compute power to complete tasks faster. Historically, parallel programming has been a very buggy process, because concurrency bugs such as race conditions and deadlocks can easily be introduced into parallel programs, and these bugs are notoriously difficult to detect and debug.

These bugs are nasty. Weeks can easily be lost, if not months, tracking down a race condition at scale.

Jeff Huang, Professor of Computer Science, Texas A&M University

“You can port your code to run on a supercomputer and get good speedups using parallelization models such as OpenMP,” says Jeff Huang, a professor of Computer Science at Texas A&M. “However, be cautious of race conditions. These bugs are nasty. You can get non-deterministic results that are non-reproducible because of them. Weeks can easily be lost, if not months, tracking down a race condition at scale.”

Coderrect: a new tool to rescue HPC code

Huang and a team of researchers recently published a new technique that helps ensure that HPC programs written with OpenMP are race-free — that is, the code will not have any erroneous race conditions. The technique and tool, named OMPRacer, was developed at Coderrect Inc. and will be presented at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’20). OMPRacer is now a part of Coderrect, a powerful debugging tool for multithreaded programs.

“What’s really exciting about Coderrect is that it’s designed to catch all possible races at compile time so you can fix the races before the code actually runs,” says Huang. “You then have peace of mind before the code starts that it’s going to do the right thing and not mess anything up because of a race condition.”

Because Coderrect does not rely on any run time information, it is also privacy preserving. The user’s data will never be disclosed to Coderrect.

“Even though Coderrect knows nothing about the computing data at run time, it is very good at detecting races by only looking at the code itself, and it is extremely fast in doing so, thanks to a new type of data-flow analysis we invented in this work,” Huang adds.

Huang believes that Coderrect will benefit the HPC industry in a huge range of applications. 

“From scientific simulations to drug discovery to countless data-driven applications in AI – they all need large amounts of computing resources,” says Huang. “HPC is becoming more and more critical.”

Paper reference 

OMPRacer: A Scalable and Precise Static Race Detector for OpenMP Programs

  • Bradley Swain, Texas A&M University and Coderrect Inc.
  • Yanze Li, Texas A&M University
  • Peiming Liu, Texas A&M University
  • Ignacio Laguna, Lawrence Livermore National Laboratory
  • Giorgis Georgakoudis, Lawrence Livermore National Laboratory
  • Jeff Huang, Texas A&M University
100x Faster Calculation of Pi with just One Line of Code?


Every year on March 14th (3/14), we celebrate Pi (Greek letter “π”) Day. Most people know that the value of π is approximately 3.14 – and some are able to remember it to a few more decimal places – but do you know how to calculate it?

In this article, we break down a simple numeric method to calculate π using C. We also show you how to make the code 100x faster by adding a single line of OpenMP parallelization.

Pi numeric integration

The method is based on the identity π = ∫ from 0 to 1 of 4/(1+x²) dx. Approximating the integral with N rectangles of width Δ = 1/N and midpoints xᵢ = (i+0.5)·Δ gives:

π ≈ Δ · Σ (from i=0 to N−1) 4/(1 + xᵢ²)

When N is large enough, the calculated value will be infinitely close to π.

The code shown below implements this numeric method in C:

#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;
    double sum = 0.0;
    for (int i=0; i < N; i++) {
        double x = (i+0.5)*delta;
        sum += 4.0 / (1.0+x*x);
    }
    double pi = delta * sum;
    printf("PI = %.12g\n", pi);
}

Compile and run this code in your terminal:

$ gcc pi.c -o pi
$ ./pi

It will print the result:

PI = 3.14159265359

The calculation takes around 3-6 seconds using a standard computer.


100X Speedup with OpenMP Parallelization

There are many ways to make the above code run faster. But one extremely easy way is to use OpenMP parallelization: adding the following line right before the “for” loop.

#pragma omp parallel for reduction(+:sum)

The line above looks like a comment in the code. However, it actually tells the compiler to parallelize the “for” loop using multithreading; this way, the code will run much faster when run on a multicore machine.

The code shown below is the full version, with an outer loop added to try different numbers of threads:

#include <omp.h>
#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;
    int MAX_THREADS = omp_get_max_threads();
    // Compute parallel compute times for 1-MAX_THREADS
    for (int j=1; j<= MAX_THREADS; j++) {
        printf(" running on %d threads: ", j);
        omp_set_num_threads(j);
        double sum = 0.0;
        double start = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (int i=0; i < N; i++) {
            double x = (i+0.5)*delta;
            sum += 4.0 / (1.0+x*x);
        }
        // Out of the parallel region, finalize computation
        double pi = delta * sum;
        double time_lapse = omp_get_wtime() - start;
        printf("PI = %.12g computed in %.4g seconds\n", pi, time_lapse);
    }
}

If you pass the flag “-fopenmp” to gcc, and run the code above with:

$ gcc -fopenmp pi.c -o pi
$ ./pi

It will print the following results:

running on 1 threads: PI = 3.14159265359 computed in 3.4 seconds
running on 2 threads: PI = 3.14159265359 computed in 1.705 seconds
running on 3 threads: PI = 3.14159265359 computed in 1.137 seconds
running on 4 threads: PI = 3.14159265359 computed in 0.858 seconds
running on 5 threads: PI = 3.14159265359 computed in 0.687 seconds
running on 6 threads: PI = 3.14159265359 computed in 0.574 seconds
running on 7 threads: PI = 3.14159265359 computed in 0.4946 seconds
running on 8 threads: PI = 3.14159265359 computed in 0.4324 seconds
...
running on 126 threads: PI = 3.14159265359 computed in 0.03336 seconds
running on 127 threads: PI = 3.14159265359 computed in 0.03072 seconds
running on 128 threads: PI = 3.14159265359 computed in 0.0281 seconds

As we can see, the code calculates the same result, but it runs more than 100X faster than before when running 128 threads.


reduction(+:sum)

Pay attention to reduction(+:sum). It is very important here: it is a reduction clause provided by OpenMP, and it tells the compiler that sum is a reduction variable on the + operator. Each thread then accumulates into its own private copy of sum, and the partial sums from different threads are combined, with proper synchronization, into the final value.
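Conceptually, the reduction behaves roughly like the following hand-written version (a sketch reusing N and delta from the program above; the actual OpenMP runtime may implement it differently):

double sum = 0.0;
#pragma omp parallel
{
    double local = 0.0;                  /* per-thread partial sum */
    #pragma omp for
    for (int i = 0; i < N; i++) {
        double x = (i + 0.5) * delta;
        local += 4.0 / (1.0 + x * x);
    }
    #pragma omp critical                 /* one synchronized combine per thread */
    sum += local;
}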

If reduction(+:sum) is not added — a common mistake for OpenMP beginners — the code will not only calculate an incorrect result but will also run slower than without parallelization:

running on 1 threads: PI = 3.14159265359 computed in 3.403 seconds
running on 2 threads: PI = 1.43071963533 computed in 5.701 seconds
running on 3 threads: PI = 1.34809292029 computed in 5.424 seconds
running on 4 threads: PI = 1.08601649484 computed in 5.929 seconds
running on 5 threads: PI = 0.903822885536 computed in 6.059 seconds
running on 6 threads: PI = 0.62312876914 computed in 5.552 seconds
running on 7 threads: PI = 0.596798835465 computed in 5.976 seconds
running on 8 threads: PI = 0.50340501324 computed in 6.086 seconds

When reduction(+:sum) is not added, a data race exists on the shared variable sum. It is this data race that leads to both incorrect results and poor cache performance.

To give readers peace of mind, we’d like to introduce Coderrect as an easy solution to catch this race, and others like it. Next up, we show how to use Coderrect with this example.


Prerequisites

This tutorial assumes you have successfully installed the Coderrect software following our quick start guide.


Detect the race using Coderrect

Compile the code with coderrect:

$ coderrect -t gcc -fopenmp pi.c -o pi

The coderrect -t command will compile and analyze the program, with the flag -t printing the race report to the terminal.

==== Found a data race between:
line 22, column 13 in pi.c AND line 22, column 17 in pi.c
Shared variable:
at line 16 of pi.c
16| double sum = 0.0;
Thread 1:
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
>>>Stacktrace:
Thread 2:
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
>>>Stacktrace:

1 OpenMP races

Interpret the Results

The reported race starts with a summary of where the race was found.

==== Found a data race between:
line 22, column 13 in pi.c AND line 22, column 17 in pi.c

Next, the report shows the name and location of the variable on which the race occurs.

Shared variable:
at line 16 of pi.c
16| double sum = 0.0;

This output shows that the race occurs on the variable sum allocated at line 16 and reports information about the two unsynchronized accesses to sum.

20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|

The code snippet shows that line 22 has racy accesses, both read and write, to sum.

Data Races, what are they, and why are they evil? – Part 2


In Part 1 of this blog, we described the differences between data races and race conditions and demonstrated some simple fixes through examples. You may have observed that fixing data races does not necessarily eliminate the race condition. If so, what harm is actually done by a data race? And why do we want to get rid of them?

Before we step into the murky world of data races, let’s first take a step back to understand a new concept: “The benign data race”.

There is a strong and widely held perception that data races are acceptable in certain contexts, and thus commonly used in existing code.

This, as it turns out, is very wrong. Not only are these so-called benign data races technically incorrect and harmful, but new bugs may be introduced by future re-compilation if the code is meant to be portable.

We present some common C/C++ “benign” patterns from a paper by Hans-J. Boehm below, and briefly explain how these common patterns might go wrong:

1. Double-checked locking

Example 1
if (!init_flag) {
    lock(); 
    if (!init_flag) {
        my_data = ...;
        init_flag = true;
    }
    unlock(); 
} 
tmp = my_data;

This is a very famous pattern (known as DCLP) and is also well-known to be incorrect. The code is potentially buggy if compiler optimization or hardware reordering is applied, which, in most real cases, they are.

For example, there’s nothing to prevent the compiler from advancing init_flag = true before the setting of my_data. Thus a second thread may end up reading an uninitialized value.

A general solution for this pattern did not exist until Java 1.5 (the volatile keyword) and C++11 (memory fences and atomics). Without these, we have to protect the whole pattern with a lock, at the cost of degraded performance.

For more details about DCLP, you can check the wiki, this page for Java, and Scott and Andrei’s paper about C++.
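For reference, here is a sketch of a correct C++11 version of the pattern, with init_flag changed to a std::atomic<bool> and an assumed mutex mu (<atomic> and <mutex> included):

Example 1 (corrected with C++11 atomics)
std::atomic<bool> init_flag{false};
std::mutex mu;

if (!init_flag.load(std::memory_order_acquire)) {
    std::lock_guard<std::mutex> guard(mu);
    if (!init_flag.load(std::memory_order_relaxed)) {
        my_data = ...;
        // release ordering publishes my_data before the flag is seen
        init_flag.store(true, std::memory_order_release);
    }
}
tmp = my_data;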

2. Reading an imprecise value is allowed

There are cases where a read from a thread does not care whether another thread has updated the value or not, i.e., either the old value or the new value will result in a valid computation. In such cases, is a data race benign?

The answer, of course, is NO, simply because data races can potentially cause the read operation to obtain a meaningless value that corresponds to neither the old value nor the new one.

Example 2 (original code)
// Read global
int my_counter = counter;
int (* my_func) (int);
if (my_counter > my_old_counter) {
    ...
    my_func = ...;
}
...
if (my_counter > my_old_counter) {
    ...
    my_func(...)
}

Example 2 (after a possible compiler transformation)
// Read global
int my_counter = counter;
int (* my_func) (int);
if (my_counter > my_old_counter) {
    ...
    my_func = ...;
}
...
// ANOTHER read global
int my_counter = counter;
if (my_counter > my_old_counter) {
    ...
    my_func(...)
}

Consider the original code pattern, where my_counter reads from the global variable counter without protection. The global counter may be updated by other threads running in parallel. Here the developer may think reading an old value of counter is acceptable due to the if (my_counter > my_old_counter) check.

However, suppose the compiler decides the register holding my_counter needs to be spilled between the two if-checks. In that case, it is legitimate for the compiler to simply re-read my_counter from the global counter to avoid the spill, as shown in the transformed version. Under this optimized code, the two if-checks may not yield the same result, which can lead to unexpected behavior when calling my_func.

Going a bit further, down to the hardware level: assume we want to maintain an imprecise global 64-bit integer counter, and we make counter++ concurrent and unprotected. If your system only supports 32-bit indivisible writes, then while one thread is updating the counter value from 2^32 - 1 to 2^32, another thread could read a value whose high bits are already updated but whose low bits are not. This makes the counter value totally meaningless.
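Even for a deliberately imprecise counter, the safe pattern is a relaxed atomic (a sketch): each individual load and store is indivisible, so readers can never observe a half-written value.

#include <atomic>
#include <cstdint>

std::atomic<int64_t> counter{0};

void tick() {
    // Relaxed ordering is cheap and imposes no ordering on other
    // data, but the counter itself can never be torn.
    counter.fetch_add(1, std::memory_order_relaxed);
}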

3. Incorrect ad hoc synchronizations

Ad hoc (or user-constructed) synchronization should not use ordinary memory operations. Otherwise, we will generally face issues similar to those mentioned above.

Example 3 (original code)
// `flag` is intended to be an ad hoc lock
while (!flag) {
    ...
}

Example 3 (after a possible compiler transformation)
tmp = flag
while (!tmp) {
    ...
}

For example, the busy-waiting loop in the original code can be optimized into something like the transformed version, due to register spilling. The optimized loop is likely to be infinite, yet it is equivalent to the original under sequential semantics.

Or if we want to use a busy-waiting loop as a memory barrier, such as in the example below:

Example 4
while (!flag) {}; // waiting for `data` to be updated
tmp = data;

Compilers or hardware could potentially reverse the order of the two instructions. Thus, this ad hoc memory barrier is mostly meaningless.

4. Redundant write

Assume we have a function f(x):

Example 5
// count is a shared variable between threads
f(x) {
    ...;
    for (p = x; p != 0; p = p -> next) {
        if (p->data < 0) count++;
    }
}

Notice that count is a shared variable and is only updated when p points to a node with a negative value. Now suppose we use this function in the following way:

Example 6
// Thread 1
count = 0;
f(positives);

// Thread 2
f(positives);
count = 0;

The argument positives is a linked list whose nodes contain only positive values. Hence, there is only one data race: between the two count = 0 statements. You may wonder how this could possibly go wrong, since all we are doing is writing the same value to count.

However, things are different if the compiler inlines f(positives) and promotes the variable count to a thread-local register, writing it back unconditionally at the end (a legitimate transformation, since count would be written unconditionally on other inputs). Now the actual instructions executed by each thread become something like the following:

Example 7
// Thread 1
count = 0; // step 2
reg1 = count // step 4
count = reg1 // step 6
// Thread 2
reg2 = count; // step 1
count = reg2; // step 3
count = 0; // step 5

Now under this specific interleaving, we find it is possible for count to remain its original value instead of being zeroed out.

5. Disjoint bit manipulation

This kind of “benign” data race is described as:

… data races between two memory operations where the programmer knows for sure that the two operations use or modify different bits in a shared variable.

Such manipulations are not generally benign. Updating bits usually entails 1) reading the whole value; 2) updating the relevant bits; and 3) writing the value back. We can clearly see how a specific interleaving may cause some updates to be lost.

Even in the case where only one thread updates the bits, we can come up with the following example:

Example 8 (original code)
x.bits1 = 1;
switch(i) {
    case 1:
        x.bits2 = 1;
        ...;
        break;
    case 2:
        x.bits2 = 1;
        ...;
        break;
    case 3:
        ...;
        break;
    default:
        x.bits2 = 1;
        ...;
        break;
}

Example 8 (after a possible compiler transformation)
x.bits1 = 1;
tmp = x.bits2;
x.bits2 = 1;
switch(i) {
    case 1:
        ...;
        break;
    case 2:
        ...;
        break;
    case 3:
        x.bits2 = tmp;
        ...;
        break;
    default:
        ...;
        break;
}

Our original code is shown first. x is a global variable whose bits represent different flags; bits1 and bits2 are two bit fields of x.

Notice that in our original code, case 3 does not update x.bits2. It also happens to be the branch we are taking (assume the compiler does not know this fact, so the other branches will not be pruned). Now suppose a second thread only reads the value of x.bits2. This seems to be a perfectly safe situation.

However, existing compilers may consider adjacent bit-fields as belonging to the same memory location (here, x.bits1 and x.bits2). The write to x.bits1 makes the compiler think the target memory location is data-race free, thus justifying its optimizations on x.bits2. This leads to the transformed code, where the potential updates to x.bits2 are hoisted before the switch to shrink the code size (imagine more branches that update x.bits2). Now, if we are still taking case 3, the read from the other thread will conflict with the write in case 3, causing the second thread to read invalid values of x.bits2.

In summary, the evilness of data races mostly lies in compiler optimizations, hardware reordering, and the underlying memory models. Most compiler optimizations assume the program is data-race free; thus, compilers may introduce undefined behavior when data races exist. The worst thing about undefined behavior is that it could lead to any action — even launching a missile — instead of just entering an invalid path of your program.

If you are interested in further reading and discussion of data races, there’s another excellent blog explaining this.

How do I prevent my program from launching missiles unexpectedly?

Having made it this far, hopefully you are not utterly terrified by what data races can cause. To be fair, there could be “benign” data races in more concrete contexts, such as when the specific architecture, compiler version, optimization passes, and memory model are all known, and the code is not intended to be portable.

Nevertheless, we strongly believe developers should pay as much attention to potential data races as they do to compiler warnings.

Properly protecting concurrent shared memory accesses not only prevents programs from falling foul of all the subtle bugs we presented above, but also significantly increases the readability of your code by making its concurrent semantics explicit. Besides, there are tools like ThreadSanitizer, Intel Inspector, and the Coderrect Racedetector that can efficiently reveal potential data races at different phases of development.

Let’s just make sure we are not turning our software into an accidental missile launcher ;).

Data Races, what are they, and why are they evil? – Part 1


Race Conditions and Data Races

Most readers will be aware of “Race Conditions”, or “Races”. They are a common cause of bugs when writing concurrent programs that are not seen in single-threaded programs.

You might also hear people talking about a type of bug known as a “Data Race”. Data races caused quite a stir online recently when the accuracy of a simulation program for the COVID-19 pandemic was called into question because of suspected potential “data races”.

So, are data races the same as race conditions?

Simply put, race conditions and data races are different things. One is not even a subset of the other, which is a common misunderstanding even among experienced developers, given that both have “race” in their names and share similar code patterns.

Race condition, according to its wiki page:

… is the condition of an electronics, software, or other system where the system’s substantive behavior is dependent on the sequence or timing of other uncontrollable events. It becomes a bug when one or more of the possible behaviors is undesirable.

This definition implies that a race condition is not inherently a bug. Race conditions are inevitable as long as concurrent events exist within a system.

Consider the following example:

Example 1
Thread 1
// x is a shared variable
x = 1

print x

Thread 2
// x is a shared variable

x = 999

There is clearly a race condition going on, where the result of print x is dependent on the sequence of assignments to x. However, this behavior may be intended and would not be considered a bug if both results are acceptable outcomes.

On the other hand, a data race is clearly defined as a bug. A data race occurs when two memory accesses:

  • are accessing the same piece of memory;
  • are performed concurrently;
  • have at least one write;
  • are not protected by synchronization or locks.

The definition of concurrency is language-dependent. For example, the C++ Standard defines two actions as potentially concurrent if:

  • they are performed by different threads, or
  • they are unsequenced, and at least one is performed by a signal handler.

The second rule is idiosyncratic to C++. The Java Language Specification gives a different definition.

So, in contrast to a possibly desired race condition, a data race can be clearly defined as a bug that might result in undefined or unexpected behavior.

Now, upon re-examining the example above, we can discern that it is also a “data race”. The concurrent accesses to x are unprotected, so it is a BUG!

A valid fix for this bug is to simply protect each access with a lock:

Example 2
Thread 1
// x is a shared variable
// l is a global lock
lock(l)
x = 1
unlock(l)

lock(l)
print x
unlock(l)

Thread 2
// x is a shared variable
// l is a global lock
lock(l)
x = 999
unlock(l)



Now, all data races in this example are gone, but a race condition still exists. The critical sections (the code segments between each lock and unlock pair) are still unsequenced. The code can still be buggy, say if printing out “999” is undesirable.

Benign or non-benign? Wherein lies the evilness?

Let’s suppose printing out “999” is indeed undesired buggy behavior.

Clearly, the buggy printing of “999” is not introduced by the data race but by the race condition. To get rid of the bug, we can either join thread 2 before the write to x in thread 1, or put the two operations on x in thread 1 into the same critical section. Both fixes ensure there can be no unexpected operation between x = 1 and print x.
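In the same style as Example 2, the second fix looks like this:

Example 3
Thread 1
// x is a shared variable
// l is a global lock
lock(l)
x = 1
print x
unlock(l)

Thread 2
// x is a shared variable
// l is a global lock
lock(l)
x = 999
unlock(l)

Whether thread 2 runs before or after thread 1’s critical section, print x can now only output 1.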

Now, some of you may wonder: if the code is buggy before and after our first fix for the data race, what’s the point of fixing it, and what is the harm actually brought by a data race?

We answer those questions in Part 2 of this blog.
