

# Faculty of Mathematics and Information Science

WARSAW UNIVERSITY OF TECHNOLOGY

# Graphic Processors in Computational Applications

Part 4 - Extended CUDA Features

dr inż. Krzysztof Kaczmarski 2021









# Materialy sponsorowane przez:

Projekt "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca" współfinansowany jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego

Zadanie 10 pn. "Modyfikacja programów studiów na kierunkach prowadzonych przez Wydział Matematyki i Nauk Informacyjnych", realizowane w ramach projektu "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca", współfinansowanego jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego





Politechnika Warszawska







## Part 4 – Extended CUDA Features

### Advanced Warp-level Functions

Programming Model Extensions Independent Thread Scheduling Compatibility Cooperative Groups

CUDA 11 and Ampere Architecture Compute Sanitizer

### Advanced Warp-level Functions

compute capability 7.x or higher

\_\_all\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

### Advanced Warp-level Functions

compute capability 7.x or higher

\_\_all\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

\_\_any\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

### Advanced Warp-level Functions

compute capability 7.x or higher

\_\_all\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

\_\_any\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

\_\_ballot\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

### Advanced Warp-level Functions

compute capability 7.x or higher

\_\_all\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for all of them.

\_\_any\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return non-zero if and only if predicate evaluates to non-zero for any of them.

\_\_ballot\_sync(unsigned mask, predicate): Evaluate predicate for all non-exited threads in mask and return an integer whose Nth bit is set if and only if predicate evaluates to non-zero for the Nth thread of the warp and the Nth thread is active.

\_\_activemask(): Returns a 32-bit integer mask of all currently active threads in the calling warp. The Nth bit is set if the Nth lane in the warp is active when it is called. Inactive threads are represented by 0 bits in the returned mask. Threads which have exited the program are always marked as inactive.

# Warp match functions

### Advanced Warp-level Functions

### compute capability 7.x or higher

\_\_match\_any\_sync(unsigned mask, T value): Returns mask of threads that have same value of value in mask

\_\_match\_all\_sync(unsigned mask, T value, int \*pred): Returns mask if all threads in mask have the same value for value; otherwise 0 is returned. Predicate pred is set to true if all threads in mask have the same value of value; otherwise the predicate is set to false.

```
T can be int, unsigned int, long, unsigned long, long long, unsigned long long, float, double
```

# Warp reduce functions

Advanced Warp-level Functions

### compute capability 8.x

T \_\_reduce\_\*\_sync(unsigned mask, T value): intrinsics perform a reduction operation on the data provided in value after synchronizing threads named in mask. T can be unsigned or signed for add, min, max and unsigned only for and, or, xor operations.

# Warp shuffle functions

### Advanced Warp-level Functions

compute capability 3.x or higher \_\_shfl\_sync, \_\_shfl\_\*\_sync: exchange a variable between threads within a warp (up, down, xor).

# Warp broadcast without shared memory

Advanced Warp-level Functions

# Inclusive scan across sub-partitions of 8 threads

Advanced Warp-level Functions

```
__global__ void scan4() {
       int laneId = threadIdx.x & 0x1f;
       // Seed sample starting value (inverse of lane ID)
       int value = 31 - laneId;
       // Loop to accumulate scan within my partition.
       // Scan requires log2(n) == 3 steps for 8 threads
       // It works by an accumulated sum up the warp
       // by 1, 2, 4, 8 etc. steps.
10
       for (int i=1; i<=4; i*=2) {
           // We do the __shfl_sync unconditionally so that we
11
           // can read even from threads which won't do a
12
           // sum, and then conditionally assign the result.
13
           int n = __shfl_up_sync(0xfffffffff, value, i, 8);
14
           if ((laneId & 7) >= i)
15
              value += n:
16
17
       printf("Thread, \( \) \( d_1 \) final, \( value_1 = 1 \) \( d \) n", threadIdx.x, value);
18
19
```

# Reduction across a warp

### Advanced Warp-level Functions

```
1 __global__ void warpReduce() {
2    int laneId = threadIdx.x & 0x1f;
3    // Seed starting value as inverse lane ID
4    int value = 31 - laneId;
5    // Use XOR mode to perform butterfly reduction
7    for (int i=16; i>=1; i/=2)
8        value += __shfl_xor_sync(0xfffffffff, value, i, 32);
9    // "value" now contains the sum across all threads
11    printf("Thread_\%d\final\uvalue\u=\u'\d\n", threadIdx.x, value);
12 }
```

# Warp matrix functions

Advanced Warp-level Functions

warp matrix operations leverage Tensor Cores to accelerate matrix problems of the form  $D=A\cdot B+C$ . These operations are supported on mixed-precision floating point data for devices of compute capability 7.0 or higher. This requires co-operation from all threads in a warp.

# Warp matrix functions

### Advanced Warp-level Functions

warp matrix operations leverage Tensor Cores to accelerate matrix problems of the form  $D=A\cdot B+C$ . These operations are supported on mixed-precision floating point data for devices of compute capability 7.0 or higher. This requires co-operation from all threads in a warp.

Sub-byte WMMA operations provide a way to access the low-precision capabilities of Tensor Cores. They are considered a preview feature i.e. the data structures and APIs for them are subject to change and may not be compatible with future releases.

# Tensor cores matrix multiplication

### Advanced Warp-level Functions

16x16x16 matrix multiplication in a single warp.

```
#include <mma.h>
   using namespace nvcuda;
   __global__ void wmma_ker(half *a, half *b, float *c) {
      // Declare the fragments
      wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major>
           a frag;
      wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major>
6
           b_frag;
      wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
      // Initialize the output to zero
      wmma::fill_fragment(c_frag, 0.0f);
      // Load the inputs
10
      wmma::load_matrix_sync(a_frag, a, 16);
11
      wmma::load_matrix_sync(b_frag, b, 16);
12
      // Perform the matrix multiplication
13
      wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
14
      // Store the output
15
      wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
16
17
```



# Part 4 – Extended CUDA Features

Advanced Warp-level Functions

Programming Model Extensions
Independent Thread Scheduling Compatibility
Cooperative Groups

CUDA 11 and Ampere Architecture Compute Sanitizer



### Part 4 – Extended CUDA Features

Advanced Warp-level Functions

Programming Model Extensions
Independent Thread Scheduling Compatibility
Cooperative Groups

CUDA 11 and Ampere Architecture Compute Sanitizer

# Independent Thread Scheduling Compatibility

Programming Model Extensions

The Volta and Turing architectures feature Independent Thread Scheduling among threads in a warp. If the developer made assumptions about warp-synchronicity, 1 this feature can alter the set of threads participating in the executed code compared to previous architectures.

# Independent Thread Scheduling Compatibility

### Programming Model Extensions

- ➤ To avoid data corruption, applications using warp intrinsics (\_\_shfl\*, \_\_any, \_\_all, and \_\_ballot) should transition to the new, safe, synchronizing counterparts, with the \*\_sync suffix. The new warp intrinsics take in a mask of threads that explicitly define which lanes (threads of a warp) must participate in the warp intrinsic.
- ▶ Applications that assume reads and writes are implicitly visible to other threads in the same warp need to insert the new \_\_syncwarp() warp-wide barrier synchronization instruction between steps where data is exchanged between threads via global or shared memory. Assumptions that code is executed in lockstep or that reads/writes from separate threads are visible across a warp without synchronization are invalid.
- ▶ Applications using \_\_syncthreads() or the PTX bar.sync (and their derivatives) in such a way that a barrier will not be reached by some non-exited thread in the thread block must be modified to ensure that all non-exited threads reach the barrier.



### Part 4 – Extended CUDA Features

Advanced Warp-level Functions

Programming Model Extensions
Independent Thread Scheduling Compatibility
Cooperative Groups

CUDA 11 and Ampere Architecture Compute Sanitizer

# Cooperative Groups

Programming Model Extensions

Cooperative Groups is an extension to the CUDA programming model, introduced in CUDA 9, for organizing groups of communicating threads. Cooperative Groups allows developers to express the granularity at which threads are communicating, helping them to express richer, more efficient parallel decompositions.



# Part 4 – Extended CUDA Features

Advanced Warp-level Functions

Programming Model Extensions
Independent Thread Scheduling Compatibility
Cooperative Groups

CUDA 11 and Ampere Architecture Compute Sanitizer



### Part 4 – Extended CUDA Features

Advanced Warp-level Functions

Programming Model Extensions
Independent Thread Scheduling Compatibility
Cooperative Groups

CUDA 11 and Ampere Architecture Compute Sanitizer

# A new tool to check memory accesses

CUDA 11 and Ampere Architecture

### pre CUDA 11

cuda-memcheck too

### from CUDA 11

Compute Sanitizer, a next-generation, functional correctness checking tool that provides runtime checking for out-of-bounds memory accesses and race condition

# Compute Sanitizer I

### Out-of-bounds array access

```
__global__ void oobAccess(int* in, int* out)
        int bid = blockIdx.x:
        int tid = threadIdx.x;
        if (bid == 4)
            out[tid] = in[dMem[tid]]:
     int main()
10
11
12
        // Array of 8 elements, where element 4 causes the OOB
13
        std::array<int, Size> hMem = {0, 1, 2, 10, 4, 5, 6, 7};
14
        cudaMemcpy(d_mem, hMem.data(), size, cudaMemcpyHostToDevice);
15
16
        oobAccess <<<10, Size>>>(d in, d out):
17
        cudaDeviceSynchronize();
18
19
20
     $ /usr/local/cuda-11.0/Sanitizer/compute-sanitizer --destroy-on-device-error kernel --show-backtrace
           no basic
     ====== COMPUTE-SANITIZER
    Device: Tesla T4
     ======= Invalid global read of size 4 bytes
23
24
    ====== at 0x480 in /tmp/CUDA11.0/ComputeSanitizer/Tests/Memcheck/basic/basic.cu:40:oobAccess(int*,
           int*)
25
     ======= by thread (3,0,0) in block (4,0,0)
26
    ====== Address 0x7f551f200028 is out of bounds
```

# Compute Sanitizer II

### Race condition

```
__global__ void Basic()
        __shared__ volatile int i;
        i = threadIdx.x;
     int main()
        Basic<<<1.2>>>():
10
        cudaDeviceSvnchronize():
11
12
13
14
    $ /usr/local/cuda-11.0/Sanitizer/compute-sanitizer --destroy-on-device-error kernel --show-backtrace
           no --tool racecheck --racecheck-report hazard raceBasic
15
     ====== COMPUTE-SANITIZER
16
     ====== ERROR: Potential WAW hazard detected at shared 0x0 in block (0.0.0) :
17
     ====== Write Thread (0,0,0) at 0x100 in /tmp/CUDA11.0/ComputeSanitizer/Tests/Racecheck/raceBasic/
           raceBasic.cu:11:Basic(void)
18
     ====== Write Thread (1.0.0) at 0x100 in /tmp/CUDA11.0/ComputeSanitizer/Tests/Racecheck/raceBasic/
           raceBasic.cu:11:Basic(void)
19
     ====== Current Value : 0, Incoming Value : 1
20
     -----
     ====== RACECHECK SUMMARY: 1 hazard displayed (1 error, 0 warnings)
```

# Roofline model visualization

### CUDA 11 and Ampere Architecture

Arithmetic Intensity is the most important concept in Roofline.

- Ratio of Total FLOPs performed to Total Bytes moved
- Total Bytes to/from DRAM and includes all cache and prefetcher effects
- ► Can be very different from total loads/stores (bytes requested) due to cache reuse



Pramod Ramarao. Cuda 11 features revealed. https://developer.nvidia.com/blog/cuda-11-features-revealed/, May 2020

# L2 persistence cache access

### CUDA 11 and Ampere Architecture

When a CUDA kernel accesses a data region in the global memory repeatedly, such data accesses can be considered to be persisting. On the other hand, if the data is only accessed once, such data accesses can be considered to be streaming.

Starting with CUDA 11.0, devices of compute capability 8.0 and above have the capability to influence persistence of data in the L2 cache, potentially providing higher bandwidth and lower latency accesses to global memory.

### For details please consult:

NVIDIA. Cuda c++ programming guide. www.nvidia.com/cuda

NVIDIA CUDA Toolkit. Cuda c++ best practices guide.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html, 2020

# **Bibliography**

- NVIDIA. Cuda c++ programming guide. www.nvidia.com/cuda.
- Pramod Ramarao. Cuda 11 features revealed. https://developer.nvidia.com/blog/cuda-11-features-revealed/, May 2020.
- NVIDIA CUDA Toolkit. Cuda c++ best practices guide. https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html, 2020.

# Materialy sponsorowane przez:

Projekt "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca" współfinansowany jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego

Zadanie 10 pn. "Modyfikacja programów studiów na kierunkach prowadzonych przez Wydział Matematyki i Nauk Informacyjnych", realizowane w ramach projektu "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca", współfinansowanego jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego





Politechnika Warszawska



