100x Faster Calculation of Pi with just One Line of Code?
Every year on March 14th (3/14), we celebrate Pi Day (named after the Greek letter “π”). Most people know that the value of π is approximately 3.14 – and some can remember a few more decimal places – but do you know how to calculate it?
In this article, we break down a simple numeric method to calculate π using C. We also show you how to make the code 100x faster by adding a single line of OpenMP parallelization.
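The method rests on integrating 4/(1+x²) from 0 to 1 and approximating the integral with a midpoint (rectangle) sum. Written out, the formula the code below implements is (our reconstruction, matching the code):

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta,
\qquad \Delta = \frac{1}{N},\quad x_i = \Bigl(i+\tfrac{1}{2}\Bigr)\Delta .
\]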

The formula shown above uses numeric integration to calculate π. When N is large enough, the calculated value gets arbitrarily close to π.
The code shown below implements this numeric method in C:
#include <stdio.h>

#define N 1000000000

int main() {
    double delta = 1.0/(double) N;
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double x = (i + 0.5) * delta;
        sum += 4.0 / (1.0 + x*x);
    }
    double pi = delta * sum;
    printf("PI = %.12g\n", pi);
}
Compile and run this code in your terminal:
$ gcc pi.c -o pi
$ ./pi
It will print the result:
PI = 3.14159265359
The calculation takes around 3-6 seconds using a standard computer.
100X Speedup with OpenMP Parallelization
There are many ways to make the above code run faster, but one extremely easy way is OpenMP parallelization: add the following line right before the “for” loop.
#pragma omp parallel for reduction(+:sum)
The line above looks like a comment in the code, but it actually tells the compiler to parallelize the “for” loop using multithreading; this way the code will run much faster than before when run on a multicore machine.
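Applied to the serial program above, the entire change is that single pragma line; everything else stays the same (a minimal sketch, compiled with the -fopenmp flag shown below):

#include <stdio.h>

#define N 1000000000

int main() {
    double delta = 1.0/(double) N;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)   // the single added line
    for (int i = 0; i < N; i++) {
        double x = (i + 0.5) * delta;
        sum += 4.0 / (1.0 + x*x);
    }

    double pi = delta * sum;
    printf("PI = %.12g\n", pi);
}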
The code shown below is the full version; we added an outer loop that times the computation with different numbers of threads:
#include <omp.h>
#include <stdio.h>

#define N 1000000000

int main() {
    double delta = 1.0/(double) N;
    int MAX_THREADS = omp_get_max_threads();

    // Compute parallel compute times for 1-MAX_THREADS
    for (int j = 1; j <= MAX_THREADS; j++) {
        printf(" running on %d threads: ", j);
        omp_set_num_threads(j);

        double sum = 0.0;
        double start = omp_get_wtime();

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * delta;
            sum += 4.0 / (1.0 + x*x);
        }

        // Out of the parallel region, finalize computation
        double pi = delta * sum;
        double time_lapse = omp_get_wtime() - start;
        printf("PI = %.12g computed in %.4g seconds\n", pi, time_lapse);
    }
}
Pass the “-fopenmp” flag to gcc and run the code above:
$ gcc -fopenmp pi.c -o pi
$ ./pi
It will print the following results:
 running on 1 threads: PI = 3.14159265359 computed in 3.4 seconds
 running on 2 threads: PI = 3.14159265359 computed in 1.705 seconds
 running on 3 threads: PI = 3.14159265359 computed in 1.137 seconds
 running on 4 threads: PI = 3.14159265359 computed in 0.858 seconds
 running on 5 threads: PI = 3.14159265359 computed in 0.687 seconds
 running on 6 threads: PI = 3.14159265359 computed in 0.574 seconds
 running on 7 threads: PI = 3.14159265359 computed in 0.4946 seconds
 running on 8 threads: PI = 3.14159265359 computed in 0.4324 seconds
 ...
 running on 126 threads: PI = 3.14159265359 computed in 0.03336 seconds
 running on 127 threads: PI = 3.14159265359 computed in 0.03072 seconds
 running on 128 threads: PI = 3.14159265359 computed in 0.0281 seconds
As we can see, the code calculates the same result, but with 128 threads it runs more than 100X faster than before (3.4 s / 0.0281 s ≈ 121x).
reduction(+:sum)
Pay attention to reduction(+:sum). It is very important here: it is a reduction clause provided by OpenMP, and it tells the compiler that sum is a reduction variable on the + operator. The compiler then gives each thread its own private copy of sum; at the end of the loop, the per-thread partial sums are added to the shared sum inside a synchronized (critical) update, so the final stores to sum from different threads are properly synchronized and correctly summed.
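Conceptually, the reduction behaves roughly like the following hand-written equivalent (a sketch of the idea, not the exact code the compiler generates; the name local_sum is ours):

#include <omp.h>
#include <stdio.h>

#define N 1000000000

// Hand-written equivalent of reduction(+:sum): each thread accumulates
// into a private partial sum, and the partial sums are combined once per
// thread under a critical section.  Compile with: gcc -fopenmp
int main() {
    double delta = 1.0/(double) N;
    double sum = 0.0;

    #pragma omp parallel
    {
        double local_sum = 0.0;                // private per-thread accumulator
        #pragma omp for
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * delta;
            local_sum += 4.0 / (1.0 + x*x);    // no shared writes, no race
        }
        #pragma omp critical                   // one synchronized update per thread
        sum += local_sum;
    }

    printf("PI = %.12g\n", delta * sum);
}

Because each thread only touches its own local_sum inside the loop, there is no shared write and therefore no race; the only synchronized operation is the single combine per thread.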
If reduction(+:sum) is not added — a common mistake for OpenMP beginners — the code will not only calculate an incorrect result but will also run slower than without parallelization:
 running on 1 threads: PI = 3.14159265359 computed in 3.403 seconds
 running on 2 threads: PI = 1.43071963533 computed in 5.701 seconds
 running on 3 threads: PI = 1.34809292029 computed in 5.424 seconds
 running on 4 threads: PI = 1.08601649484 computed in 5.929 seconds
 running on 5 threads: PI = 0.903822885536 computed in 6.059 seconds
 running on 6 threads: PI = 0.62312876914 computed in 5.552 seconds
 running on 7 threads: PI = 0.596798835465 computed in 5.976 seconds
 running on 8 threads: PI = 0.50340501324 computed in 6.086 seconds
When reduction(+:sum) is not added, a data race exists on the shared variable sum: every thread reads and writes the same memory location concurrently. This data race leads to incorrect results, and the constant contention for the cache line holding sum is what makes the parallel code even slower than the serial version.
To give readers peace of mind, we’d like to introduce Coderrect as an easy way to catch this race and others like it. Next, we show how to use Coderrect on this example.
Prerequisites
This tutorial assumes you have successfully installed the Coderrect software following our quick start guide.
Detect the race using Coderrect
Compile the code with coderrect:
$ coderrect -t gcc -fopenmp pi.c -o pi
The coderrect command will compile and analyze the program; the -t flag prints the race report to the terminal.
==== Found a data race between: line 22, column 13 in pi.c AND line 22, column 17 in pi.c

Shared variable: at line 16 of pi.c
16|        double sum = 0.0;

Thread 1:
20|        for (int i=0; i < N; i++) {
21|            double x = (i+0.5)*delta;
>22|            sum += 4.0 / (1.0+x*x);
23|        }
24|
>>>Stacktrace:

Thread 2:
20|        for (int i=0; i < N; i++) {
21|            double x = (i+0.5)*delta;
>22|            sum += 4.0 / (1.0+x*x);
23|        }
24|
>>>Stacktrace:

1 OpenMP races
Interpret the Results
The race report starts with a summary of where the race was found.
==== Found a data race between: line 22, column 13 in pi.c AND line 22, column 17 in pi.c
Next, the report shows the name and location of the variable on which the race occurs.
Shared variable: at line 16 of pi.c
16|        double sum = 0.0;
This output shows that the race occurs on the variable sum allocated at line 16, and reports information about the two unsynchronized accesses to sum.
20|        for (int i=0; i < N; i++) {
21|            double x = (i+0.5)*delta;
>22|            sum += 4.0 / (1.0+x*x);
23|        }
24|
The code snippet shows that line 22 has racy accesses, both read and write, to sum.