{"id":2415,"date":"2020-08-28T11:13:34","date_gmt":"2020-08-28T17:13:34","guid":{"rendered":"https:\/\/coderrect.com\/?p=2415"},"modified":"2022-04-18T09:15:45","modified_gmt":"2022-04-18T15:15:45","slug":"100x-faster-pi-calculation-with-one-line-of-parallelization","status":"publish","type":"post","link":"https:\/\/coderrect.com\/100x-faster-pi-calculation-with-one-line-of-parallelization\/","title":{"rendered":"100x Faster Calculation of Pi with just One Line of Code?"},"content":{"rendered":"
\n\t\n\t\n\t\n\t
\n\t\t\n\t\t

Archives<\/h1>\n\t<\/div>\n<\/header>\n\n\n

<\/h3>\n\n\n\n

Every year on March 14th (3/14), we celebrate Pi Day, named for the Greek letter "π". Most people know that the value of π is approximately 3.14, and some can recite it to a few more decimal places, but do you know how to calculate it?

In this article, we break down a simple numeric method for calculating π in C. We also show how to make the code 100x faster by adding a single line of OpenMP parallelization.

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \sum_{i=0}^{N-1} \frac{4}{1+x_i^2}\,\Delta, \qquad x_i = (i+0.5)\,\Delta, \quad \Delta = \frac{1}{N}$$

*Pi numeric integration*

The formula shown above uses numeric integration (a midpoint-rule approximation of the integral) to calculate π. When N is large enough, the computed sum gets arbitrarily close to π.
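To see why the integral itself equals π, note that the antiderivative of $\frac{4}{1+x^2}$ is $4\arctan(x)$:

$$\int_0^1 \frac{4}{1+x^2}\,dx = 4\arctan(x)\Big|_0^1 = 4\left(\frac{\pi}{4} - 0\right) = \pi$$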

The code shown below implements this numeric method in C:

```c
#include <stdio.h>

#define N 1000000000

int main() {
    double delta = 1.0 / (double)N;    // width of each interval
    double sum = 0.0;
    for (int i = 0; i < N; i++) {
        double x = (i + 0.5) * delta;  // midpoint of the i-th interval
        sum += 4.0 / (1.0 + x * x);    // height of the i-th rectangle
    }
    double pi = delta * sum;
    printf("PI = %.12g\n", pi);
    return 0;
}
```

Compile and run this code in your terminal:

```
$ gcc pi.c -o pi
$ ./pi
```

It will print the result:

```
PI = 3.14159265359
```

The calculation takes around 3-6 seconds on a typical modern machine.
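If you want to measure the runtime yourself, the shell's standard `time` utility is enough (the exact numbers will vary with your hardware and compiler optimization flags):

```
$ time ./pi
```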



## 100X Speedup with OpenMP Parallelization

There are many ways to make the above code run faster, but one *extremely* easy way is OpenMP parallelization: add the following line right before the "for" loop.

```c
#pragma omp parallel for reduction(+:sum)
```

The line above looks like a comment in the code. However, it actually tells the compiler to parallelize the "for" loop using multithreading, so the code runs much faster than before when run on a multicore machine.
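If you want to convince yourself that such a pragma really spawns threads, a tiny probe program helps. This is a minimal sketch using the standard OpenMP runtime calls `omp_get_thread_num()` and `omp_get_num_threads()`:

```c
#include <omp.h>
#include <stdio.h>

int main() {
    // Every thread in the parallel region prints its own id.
    #pragma omp parallel
    {
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Compiled with `gcc -fopenmp`, this typically prints one line per available hardware thread.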

The code shown below is the full version; we added an outer loop that re-runs the computation with different numbers of threads:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000000

int main() {
    double delta = 1.0 / (double)N;
    int MAX_THREADS = omp_get_max_threads();
    // Measure the parallel compute time for 1..MAX_THREADS threads
    for (int j = 1; j <= MAX_THREADS; j++) {
        printf(" running on %d threads: ", j);
        omp_set_num_threads(j);
        double sum = 0.0;
        double start = omp_get_wtime();
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * delta;
            sum += 4.0 / (1.0 + x * x);
        }
        // Out of the parallel region, finalize the computation
        double pi = delta * sum;
        double time_lapse = omp_get_wtime() - start;
        printf("PI = %.12g computed in %.4g seconds\n", pi, time_lapse);
    }
    return 0;
}
```

Pass the "-fopenmp" flag to gcc, then run the code:

```
$ gcc -fopenmp pi.c -o pi
$ ./pi
```

It will print the following results:

```
running on 1 threads: PI = 3.14159265359 computed in 3.4 seconds
running on 2 threads: PI = 3.14159265359 computed in 1.705 seconds
running on 3 threads: PI = 3.14159265359 computed in 1.137 seconds
running on 4 threads: PI = 3.14159265359 computed in 0.858 seconds
running on 5 threads: PI = 3.14159265359 computed in 0.687 seconds
running on 6 threads: PI = 3.14159265359 computed in 0.574 seconds
running on 7 threads: PI = 3.14159265359 computed in 0.4946 seconds
running on 8 threads: PI = 3.14159265359 computed in 0.4324 seconds
...
running on 126 threads: PI = 3.14159265359 computed in 0.03336 seconds
running on 127 threads: PI = 3.14159265359 computed in 0.03072 seconds
running on 128 threads: PI = 3.14159265359 computed in 0.0281 seconds
```

As we can see, the code computes the same result, but with 128 threads it runs more than 100x faster than before (3.4 s / 0.0281 s ≈ 121x).



## reduction(+:sum)

Pay attention to `reduction(+:sum)`. It is very important here: it is a **reduction clause** provided by OpenMP, and it tells the compiler that `sum` is a reduction variable combined with the `+` operator. The compiler then gives each thread its own private copy of `sum` and, at the end of the loop, combines the per-thread partial sums into the shared `sum` inside a synchronized (critical) section, so the final result is correct.
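Conceptually, the reduction clause is equivalent to privatizing the accumulator and doing the combine step under synchronization by hand. Here is a simplified sketch of that hand-written version (what any particular compiler actually emits may differ):

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000000

int main() {
    double delta = 1.0 / (double)N;
    double sum = 0.0;
    #pragma omp parallel
    {
        double local_sum = 0.0;                // private partial sum per thread
        #pragma omp for                        // split iterations across threads
        for (int i = 0; i < N; i++) {
            double x = (i + 0.5) * delta;
            local_sum += 4.0 / (1.0 + x * x);  // no shared state, no race
        }
        #pragma omp critical                   // synchronized combine step
        sum += local_sum;
    }
    printf("PI = %.12g\n", delta * sum);
    return 0;
}
```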

If `reduction(+:sum)` is not added (a common mistake for OpenMP beginners), the code will not only calculate an incorrect result but will also run slower than the serial version:

```
running on 1 threads: PI = 3.14159265359 computed in 3.403 seconds
running on 2 threads: PI = 1.43071963533 computed in 5.701 seconds
running on 3 threads: PI = 1.34809292029 computed in 5.424 seconds
running on 4 threads: PI = 1.08601649484 computed in 5.929 seconds
running on 5 threads: PI = 0.903822885536 computed in 6.059 seconds
running on 6 threads: PI = 0.62312876914 computed in 5.552 seconds
running on 7 threads: PI = 0.596798835465 computed in 5.976 seconds
running on 8 threads: PI = 0.50340501324 computed in 6.086 seconds
```

When `reduction(+:sum)` is not added, a data race exists on the shared variable `sum`. The race makes the results wrong, and because every thread keeps writing to the same cache line that holds `sum`, it also ruins cache performance, which is why the parallel runs end up slower than the serial one.
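To see why updates get lost, note that `sum += ...` is not a single atomic operation; it is a load-add-store sequence. A minimal illustration (the function name `racy_update` is ours, and the interleaving in the comments is just one possible schedule):

```c
// What "sum += 4.0 / (1.0 + x * x)" really does under the hood:
void racy_update(double *sum, double x) {
    double tmp = *sum;            // 1. load the shared value
    tmp += 4.0 / (1.0 + x * x);   // 2. add in a local/register
    *sum = tmp;                   // 3. store back, possibly overwriting
                                  //    another thread's concurrent store
}
// One possible interleaving, with *sum initially 0:
//   Thread 1: tmp = 0        Thread 2: tmp = 0
//   Thread 1: *sum = a       Thread 2: *sum = b   <-- Thread 1's update is lost
```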

To give readers peace of mind, we'd like to introduce Coderrect as an easy way to catch this race, and others like it. Next up, we show how to use Coderrect on this example.



## Prerequisites

This tutorial assumes you have successfully installed the Coderrect software following our quick start guide.



## Detect the race using Coderrect

Compile the code with `coderrect`:

```
$ coderrect -t gcc -fopenmp pi.c -o pi
```

The `coderrect` command will compile and analyze the program; the `-t` flag prints the race report to the terminal.

```
==== Found a data race between:
line 22, column 13 in pi.c AND line 22, column 17 in pi.c
Shared variable:
at line 16 of pi.c
16| double sum = 0.0;
Thread 1:
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
>>>Stacktrace:
Thread 2:
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
>>>Stacktrace:

1 OpenMP races
```

## Interpret the Results

The reported race starts with a summary of where the race was found.

```
==== Found a data race between:
line 22, column 13 in pi.c AND line 22, column 17 in pi.c
```

Next, the report shows the name and location of the variable on which the race occurs.

```
Shared variable:
at line 16 of pi.c
16| double sum = 0.0;
```

This output shows that the race occurs on the variable `sum` allocated at line 16, and reports information about the two unsynchronized accesses to `sum`.

```
20| for (int i=0; i < N; i++) {
21| double x = (i+0.5)*delta;
>22| sum += 4.0 / (1.0+x*x);
23| }
24|
```

The code snippet shows that line 22 has racy accesses, both read and write, to `sum`.
