Why Every Company Should Care About Concurrency Bugs

By: Chris

This August marks the tenth anniversary of Marc Andreessen’s claim that software is eating the world, a bold prediction that has proven remarkably insightful. Ten years on, we have watched software transform our lives and our economy in profound ways, and it is expected to keep doing so at an accelerating pace for decades to come, helped by advances in big data, artificial intelligence and faster computers.

Today, software-driven digital transformation is global in scope and touches every industry. Even traditional manufacturing companies that rely on machines, and service companies that have historically depended on person-to-person interaction, are embracing the software revolution. As an example, Goldman Sachs Group, where bankers used to win clients’ trust through close relationships and insightful advice, now has over 10,000 software developers, more than many of the leading technology companies. No wonder the US Bureau of Labor Statistics projects that the number of highly paid software development jobs will grow 22% over the next decade, compared with 4% for all jobs. Allow me to make a prediction here: every company, or at least every large company, will eventually become a software company in some way, shape, or form.

Ever since there has been software, there have been software bugs. The software industry has come a long way in dealing with them: on average, coding defects have fallen from 12-50 per 1,000 lines of delivered code (KLOC) (Note 1) to 0.434 per KLOC (Note 2), thanks mostly to better tools and development processes, more rigorous testing, and safer programming languages.

However, some software bugs seem determined to take a stab at the motto that “software is eating the world”. Given how pervasively software powers our lives, a software bug these days can do far more damage than crashing someone’s desktop. Nasdaq and a number of market makers learned this lesson the expensive way. On May 18th, 2012, Facebook went public on Nasdaq in the 3rd largest IPO in the US, and the largest for Nasdaq at the time. Facebook stock opened for trading at 11:30am, half an hour later than planned and only after a software glitch that prevented the delivery of order confirmations had been mitigated. As a result, market maker UBS reported a loss of $350mn, and Nasdaq was fined $10mn and paid $40mn in compensation claims. Knight Capital Group, Citadel and Citigroup also lost millions in the incident.

Nasdaq’s software was considered well-engineered and had executed millions of trades every day almost flawlessly before the incident. So, what went wrong? It turned out that the expensive glitch was caused by a time-bomb-like class of bugs called concurrency bugs. A concurrent program runs on multiple threads to take advantage of multiple processing units and achieve high performance and low latency. Without correct synchronization, multiple threads can update shared data at the same time, and the order in which they execute is unpredictable. The number of possible interleavings between threads grows exponentially and quickly exceeds what a human developer can reason about, which makes concurrent code error-prone. The tricky part is that whether such a bug manifests depends on the timing and ordering of the threads involved. That is why seemingly perfect software can run well for years before something catastrophic suddenly happens.
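To make the idea concrete, here is a minimal sketch of the classic “lost update” problem. It is not Nasdaq’s code and does not use real threads: the two logical threads and the `run` helper are hypothetical, and the schedule is replayed by hand so that the outcome is reproducible. Each thread reads a shared counter and then writes back the value plus one, which is what an unsynchronized increment amounts to.

```python
def increment(shared):
    """One logical thread: read the counter, then write back the value plus one."""
    local = shared["counter"]      # step 1: read the shared value
    yield                          # a context switch may happen here
    shared["counter"] = local + 1  # step 2: write, possibly using a stale value


def run(schedule):
    """Replay a fixed schedule such as "ABAB" over two logical threads A and B."""
    shared = {"counter": 0}
    threads = {"A": increment(shared), "B": increment(shared)}
    for name in schedule:
        # Advance the named thread by one step; the default value avoids
        # StopIteration when the thread finishes its last step.
        next(threads[name], None)
    return shared["counter"]
```

With the schedule "AABB" each thread finishes before the other starts, and `run` returns 2. With "ABAB" both threads read 0 before either writes, so one increment is lost and `run` returns 1. On real hardware the scheduler picks the order, which is why the same program can be correct on one run and wrong on the next.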

Can software testing come to the rescue? Yes and no. If the test cases happen to trigger a problematic interleaving, the bug will be exposed in the testing phase. However, the space of possible interleavings is typically astronomical, which makes comprehensive test coverage impossible. To make matters worse, even if a test triggers a bug, that bug will not necessarily show up again the next time the test is run. This phenomenon is called non-determinism. These challenges make testing for concurrency bugs unreliable at best, and a complete miss much of the time. It is not uncommon for developers to spend weeks or even months manually examining source code to debug concurrency issues.
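A rough back-of-the-envelope calculation shows how fast that space grows. The sketch below assumes a simplified model that is not from any particular tool: every thread runs the same number of atomic steps, and any ordering that preserves each thread’s own step order is possible. Under that assumption the number of interleavings is (threads × steps)! / (steps!)^threads, and the function name `interleavings` is made up for illustration.

```python
from math import factorial


def interleavings(threads, steps):
    """Count the orderings of `threads` threads with `steps` atomic steps each.

    Simplifying assumption: every ordering that keeps each thread's own
    steps in program order is possible.
    """
    total_steps = threads * steps
    # Multinomial coefficient: (threads * steps)! / (steps!) ** threads
    return factorial(total_steps) // factorial(steps) ** threads
```

Two threads with two steps each can interleave in only 6 ways, but two threads with ten steps each already have 184,756 possible orderings, and real programs have far more threads and steps than that. A test run exercises just one of those orderings at a time.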

All of this sounds scary, and unfortunately it is still largely true of today’s software. Because concurrency issues can be very painful to deal with, and because advanced tools for assuring the quality of concurrent software are lacking, they often slip through quality assurance measures. A quick search of the National Vulnerability Database produced 15 reported security vulnerabilities caused by concurrency issues in this month (January 2021) alone. Affected software includes Apple macOS, Android, Google Chrome OS and NVIDIA vGPU. If these well-resourced and technologically sophisticated companies still suffer from concurrency bugs, the situation is likely to be even grimmer elsewhere.

Saying “every company should care” may be a slight exaggeration. However, if you are running a software team or a company that is leading or embracing digital transformation, it would be wise to look into concurrency issues and make sure the best quality assurance measures are in place to give you multithreading peace of mind.

Note 1: “Code Complete: A Practical Handbook of Software Construction”, Steve McConnell

Note 2: “Measuring Software Quality – A Study of Open Source Software”, Coverity
