100x Faster Calculation of Pi with just One Line of Code?

Every year on March 14th (3/14), we celebrate Pi (Greek letter “π”) Day. Most people know that the value of π is approximately 3.14 – and some can recite a few more decimal places – but do you know how to calculate it?

In this article, we break down a simple numeric method to calculate π using C. We also show you how to make the code 100x faster by adding a single line of OpenMP parallelization.

Pi numeric integration

The method is numeric integration: π equals the integral of 4/(1+x²) over the interval [0, 1], and we approximate that integral with a midpoint Riemann sum of N rectangles. When N is large enough, the computed value gets arbitrarily close to π.
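Written out, the approximation that the code below implements is:

$$\pi \;=\; \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \Delta \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}, \qquad x_i = (i+0.5)\,\Delta, \quad \Delta = \frac{1}{N}.$$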

The code shown below implements this numeric method in C:

#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;    // width of each rectangle
    double sum = 0.0;
    // Midpoint Riemann sum of 4/(1+x^2) over [0, 1]
    for (int i=0; i < N; i++) {
        double x = (i+0.5)*delta;     // midpoint of the i-th rectangle
        sum += 4.0 / (1.0+x*x);
    }
    double pi = delta * sum;
    printf("PI = %.12g\n", pi);
    return 0;
}

Compile and run this code in your terminal:

$ gcc pi.c -o pi
$ ./pi

It will print the result:

PI = 3.14159265359

The calculation takes around 3-6 seconds on a typical modern machine.
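If you want to measure the runtime yourself, one simple option on a Unix-like system is the shell's time command:

$ time ./pi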


100X Speedup with OpenMP Parallelization

There are many ways to make the above code run faster, but one extremely easy way is OpenMP parallelization: add the following line right before the “for” loop.

#pragma omp parallel for reduction(+:sum)

The line above looks like a comment in the code, but it is actually a directive that tells the compiler to parallelize the “for” loop using multithreading; this way the code will run much faster than before when run on a multicore machine.
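Applied to the first program, the change really is a single added line (a minimal sketch; the full, timed version is shown next):

#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   // the one added line
    for (int i=0; i < N; i++) {
        double x = (i+0.5)*delta;
        sum += 4.0 / (1.0+x*x);
    }
    double pi = delta * sum;
    printf("PI = %.12g\n", pi);
    return 0;
}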

The full version is shown below; we added an outer loop that repeats the computation with different numbers of threads, from 1 up to the maximum the machine supports, and times each run:

#include <omp.h>
#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;
    int MAX_THREADS = omp_get_max_threads();
    // Time the parallel loop with 1..MAX_THREADS threads
    for (int j=1; j<= MAX_THREADS; j++) {
        printf(" running on %d threads: ", j);
        omp_set_num_threads(j);
        double sum = 0.0;
        double start = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (int i=0; i < N; i++) {
            double x = (i+0.5)*delta;
            sum += 4.0 / (1.0+x*x);
        }
        // Out of the parallel region, finalize computation
        double pi = delta * sum;
        double time_lapse = omp_get_wtime() - start;
        printf("PI = %.12g computed in %.4g seconds\n", pi, time_lapse);
    }
}

Pass the “-fopenmp” flag to gcc to enable OpenMP, then compile and run the code above with:

$ gcc -fopenmp pi.c -o pi
$ ./pi

It will print results similar to the following (the run below used a machine with 128 hardware threads; your exact timings will differ):

running on 1 threads: PI = 3.14159265359 computed in 3.4 seconds
running on 2 threads: PI = 3.14159265359 computed in 1.705 seconds
running on 3 threads: PI = 3.14159265359 computed in 1.137 seconds
running on 4 threads: PI = 3.14159265359 computed in 0.858 seconds
running on 5 threads: PI = 3.14159265359 computed in 0.687 seconds
running on 6 threads: PI = 3.14159265359 computed in 0.574 seconds
running on 7 threads: PI = 3.14159265359 computed in 0.4946 seconds
running on 8 threads: PI = 3.14159265359 computed in 0.4324 seconds
...
running on 126 threads: PI = 3.14159265359 computed in 0.03336 seconds
running on 127 threads: PI = 3.14159265359 computed in 0.03072 seconds
running on 128 threads: PI = 3.14159265359 computed in 0.0281 seconds

As we can see, the code calculates the same result, but with 128 threads it runs more than 100X faster than the single-threaded version.
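Taking the one-thread and 128-thread timings from the run above:

$$\text{speedup} = \frac{T_1}{T_{128}} = \frac{3.4\text{ s}}{0.0281\text{ s}} \approx 121\times$$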


reduction(+:sum)

Pay attention to reduction(+:sum). It is very important here: it is a reduction clause provided by OpenMP, and it tells the compiler that sum is a reduction variable combined with the + operator. Each thread then accumulates into its own private copy of sum, and at the end of the parallel loop OpenMP combines the per-thread partial sums into the shared sum under proper synchronization, so the final result is computed correctly.
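Conceptually, the reduction clause behaves roughly like the following hand-written version, where every thread keeps a private partial sum and the partial sums are combined under synchronization at the end (a sketch for illustration only; the one-line reduction clause is the idiomatic way to write this):

#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;
    double sum = 0.0;
    #pragma omp parallel
    {
        double local_sum = 0.0;               // private to each thread
        #pragma omp for
        for (int i=0; i < N; i++) {
            double x = (i+0.5)*delta;
            local_sum += 4.0 / (1.0+x*x);     // no shared state touched in the loop
        }
        #pragma omp atomic                    // one synchronized update per thread
        sum += local_sum;
    }
    printf("PI = %.12g\n", delta * sum);
    return 0;
}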

If reduction(+:sum) is not added — a common mistake for OpenMP beginners — the code will not only calculate an incorrect result but will also run slower than without parallelization:

running on 1 threads: PI = 3.14159265359 computed in 3.403 seconds
running on 2 threads: PI = 1.43071963533 computed in 5.701 seconds
running on 3 threads: PI = 1.34809292029 computed in 5.424 seconds
running on 4 threads: PI = 1.08601649484 computed in 5.929 seconds
running on 5 threads: PI = 0.903822885536 computed in 6.059 seconds
running on 6 threads: PI = 0.62312876914 computed in 5.552 seconds
running on 7 threads: PI = 0.596798835465 computed in 5.976 seconds
running on 8 threads: PI = 0.50340501324 computed in 6.086 seconds

When reduction(+:sum) is not added, there is a data race on the shared variable sum: each sum += ... is an unsynchronized read-modify-write, so updates from different threads can overwrite one another, which is why the result is wrong. The same contention also forces the cache line holding sum to bounce between cores on every iteration, which is why the parallel runs end up slower than the serial one.
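For contrast, protecting each update with an atomic fixes the correctness problem but not the contention: every iteration still synchronizes on the same shared variable, so a version like the sketch below computes the right value yet typically remains far slower than the reduction version (shown for illustration only, not as a recommended fix):

#include <stdio.h>
#define N 1000000000
int main () {
    double delta = 1.0/(double) N;
    double sum = 0.0;
    #pragma omp parallel for
    for (int i=0; i < N; i++) {
        double x = (i+0.5)*delta;
        #pragma omp atomic                    // correct, but one synchronization per iteration
        sum += 4.0 / (1.0+x*x);
    }
    printf("PI = %.12g\n", delta * sum);
    return 0;
}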

To give readers peace of mind, we’d like to introduce Coderrect as an easy solution to catch this race, and others like it. Next up, we show how to use Coderrect with this example.


Prerequisites

This tutorial assumes you have successfully installed the Coderrect software following our quick start guide.


Detect the race using Coderrect

Compile the code with coderrect:

$ coderrect -t gcc -fopenmp pi.c -o pi

The coderrect command compiles and analyzes the program; the -t flag prints the race report directly to the terminal.

==== Found a data race between:
line 22, column 13 in pi.c AND line 22, column 17 in pi.c
Shared variable:
at line 16 of pi.c
16| double sum = 0.0;
Thread 1:
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
>>>Stacktrace:
Thread 2:
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
>>>Stacktrace:

1 OpenMP races

Interpret the Results

The race report starts with a summary of where the race was found.

==== Found a data race between:
line 22, column 13 in pi.c AND line 22, column 17 in pi.c

Next, the report shows the name and location of the variable on which the race occurs.

Shared variable:
at line 16 of pi.c
16| double sum = 0.0;

This output shows that the race occurs on the variable sum allocated at line 16 and reports information about the two unsynchronized accesses to sum.

20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|

The code snippet shows that line 22 contains the racy accesses to sum: the read and the write in sum += 4.0 / (1.0+x*x) are unsynchronized, so the statement executed by one thread can race with the same statement executed by another.
