Detect races in GPU kernels

This tutorial showcases Coderrect on detecting block-level and warp-level race hazards in GPU/CUDA kernels. Both examples are from NVIDIA’s official documentation: block_error.cu and warp_error.cu. Note that if you do not already have nvidia-cuda-toolkit installed on your machine you must use the command listed below.

$ sudo apt install nvidia-cuda-toolkit
$ coderrect -t nvcc block_error.cu
==== Found a race between: 
line 9, column 5 in block_error.cu AND line 14, column 25 in block_error.cu
Shared variable: 
smem at line 3 of block_error.cu
 3|__shared__ int smem[THREADS];
Thread 1: 
 7|{
 8|    int tx = threadIdx.x;
>9|    smem[tx] = data_in[tx] + tx;
 10|
 11|    if (tx == 0) {
>>>Stack Trace:
Thread 2: 
 12|        *sum_out = 0;
 13|        for (int i = 0; i < THREADS; ++i)
>14|            *sum_out += smem[i];
 15|    }
 16|}
>>>Stack Trace:
The OpenMP region this bug occurs:
/CUDA/benchmarks/t/block_error.cu
>27|    sumKernel<<<1, THREADS>>>(data_in, sum_out);
 28|    cudaDeviceSynchronize();
 29|
 30|    cudaFree(data_in);
 31|    cudaFree(sum_out);
 32|    return 0;
Gets called from:
>>>main
detected 1 races in total.
To check the race report, please open '/CUDA/benchmarks/t/.coderrect/report/index.html' in your browser

$ coderrect -t nvcc wrap_error.cu
==== Found a race between: 
line 12, column 5 in wrap_error.cu AND line 19, column 32 in wrap_error.cu
Shared variable: 
smem_first at line 5 of wrap_error.cu
 5|__shared__ int smem_first[THREADS];
Thread 1: 
 10|{
 11|    int tx = threadIdx.x;
>12|    smem_first[tx] = data_in[tx] + tx;
 13|    //__syncwarp();
 14|    if (tx % WARP_SIZE == 0) {
>>>Stack Trace:
Thread 2: 
 17|        smem_second[wx] = 0;
 18|        for (int i = 0; i < WARP_SIZE; ++i)
>19|            smem_second[wx] += smem_first[wx * WARP_SIZE + i];
 20|    }
 21|
>>>Stack Trace:
The OpenMP region this bug occurs:
/CUDA/benchmarks/t/wrap_error.cu
>40|    sumKernel<<<1, THREADS>>>(data_in, sum_out);
 41|    cudaDeviceSynchronize();
 42|
 43|    cudaFree(data_in);
 44|    cudaFree(sum_out);
 45|    return 0;

Note that in the above code line 13 is commented out, which disables the warp-level synchronization.

13| //__syncwarp();

If line 13 is uncommented, the race will be fixed, and the tool will report no races:

detected 0 races in total.