

#### Faculty of Mathematics and Information Science

WARSAW UNIVERSITY OF TECHNOLOGY

# Graphic Processors in Computational Applications

Part 2 – CUDA Advances

dr inż. Krzysztof Kaczmarski 2021



Rzeczpospolita Polska Politechnika Warszawska

Unia Europejska Europejski Fundusz Społeczny



Materiały sponsorowane przez:

Projekt "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca" współfinansowany jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego

Zadanie 10 pn. "Modyfikacja programów studiów na kierunkach prowadzonych przez Wydział Matematyki i Nauk Informacyjnych", realizowane w ramach projektu "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca", współfinansowanego jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego



Rzeczpospolita Polska Politechnika Warszawska

Unia Europejska Europejski Fundusz Społeczny



#### Goals for today:

- Understand advanced CUDA techniques
- Get familiar with pitfalls of parallel programming



WARSAW UNIVERSITY OF TECHNOLOGY

#### Part 2 – CUDA Advances

#### Warp threads scheduling

Advanced synchronization

Variables and Memory Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

- Asynchronous operations
- Problems of parallelism Race conditions Volatile
- Time Measurements

- 1 instruction 1
- 2 if (threadIdx.x<4)</pre>
- 3 instruction 2
- 4 else
- 5 instruction 3
- 6 instruction 4

- 1 instruction 1
- 2 if (threadIdx.x<4)</pre>
- 3 instruction 2
- 4 else
- 5 instruction 3
- 6 instruction 4



Warp threads scheduling

- 1 instruction 1
- 2 if (threadIdx.x<4)
- 3 instruction 2
- 4 else
- 5 instruction 3
- 6 instruction 4

Instr. for Threads 1-3: 1 2 (3) 4 Instr. for Threads 4-8: 1 (2) 3 4



Warp threads scheduling

- 1 instruction 1
- 2 if (threadIdx.x<4)
- 3 instruction 2
- 4 else
- 5 instruction 3
- 6 instruction 4

Instr. for Threads 1-3: 1 2 (3) 4 Instr. for Threads 4-8: 1 (2) 3 4



A single thread is assigned to a single ALU. Waste of bandwidth – some ALUs do nothing.

Warp threads scheduling

- 1 instruction 1
- 2 if (threadIdx.x<4)
- 3 instruction 2
- 4 else
- 5 instruction 3
- 6 instruction 4

Instr. for Threads 1-3: 1 2 (3) 4 Instr. for Threads 4-8: 1 (2) 3 4

Common mistake: Instr 2. before Instr. 3

Warp of Threads Instr. 1 Active threads Inactive t Instr. 2 Instr. 3 Instr. 4

A single thread is assigned to a single ALU. Waste of bandwidth – some ALUs do nothing.

- 1 instruction 1
- 2 if (threadIdx.x<32)</pre>
- 3 instruction 2
- 4 else
- 5 instruction 3
- 6 instruction 4

- instruction 1 1 if (threadIdx.x<32)</pre> 2 3
  - instruction 2
- else 4
- instruction 3 5
- instruction 4 6



| 1 | instruction 1                     |
|---|-----------------------------------|
| 2 | <pre>if (threadIdx.x&lt;32)</pre> |
| 3 | instruction 2                     |
| 4 | else                              |
| 5 | instruction 3                     |
| 6 | instruction 4                     |
|   | Instr. for Warp 0: 1 2 4          |
|   | Instr. for Warp 1: 1 3 4          |



Warp threads scheduling

| 1 | instruction 1                                        |
|---|------------------------------------------------------|
| 2 | <pre>if (threadIdx.x&lt;32)</pre>                    |
| 3 | instruction 2                                        |
| 4 | else                                                 |
| 5 | instruction 3                                        |
| 6 | instruction 4                                        |
|   | Instr. for Warp 0: 1 2 4<br>Instr. for Warp 1: 1 3 4 |



Warp-level control saves bandwidth in conditional operations.



WARSAW UNIVERSITY OF TECHNOLOGY

#### Part 2 – CUDA Advances

#### Warp threads scheduling

#### Advanced synchronization

Variables and Memory Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements

Advanced synchronization

Device side:

int \_\_syncthreads\_count(int predicate); is identical to \_\_syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

Advanced synchronization

Device side:

- int \_\_syncthreads\_count(int predicate); is identical to \_\_syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
- int \_\_syncthreads\_and(int predicate); similarly but evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

Advanced synchronization

Device side:

- int \_\_syncthreads\_count(int predicate); is identical to \_\_syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
- int \_\_syncthreads\_and(int predicate); similarly but evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.
- int \_\_syncthreads\_or(int predicate); ...similarly but returns non-zero if predicate evaluates to non-zero for any of the threads.

Advanced synchronization

Device side:

- int \_\_syncthreads\_count(int predicate); is identical to \_\_syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.
- int \_\_syncthreads\_and(int predicate); similarly but evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.
- int \_\_syncthreads\_or(int predicate); ...similarly but returns non-zero if predicate evaluates to non-zero for any of the threads.
- void \_\_syncwarp(unsigned mask=0xffffffff); will cause the executing thread to wait until all warp lanes named in mask have executed a \_\_syncwarp() (with the same mask) before resuming execution. All non-exited threads named in mask must execute a corresponding \_\_syncwarp() with the same mask, or the result is undefined.

Advanced synchronization

Device side memory fence functions:

void \_\_threadfence\_block(); waits until all global and shared memory accesses made by the calling thread before are visible to all threads in the thread block.

Advanced synchronization

Device side memory fence functions:

- void \_\_threadfence\_block(); waits until all global and shared memory accesses made by the calling thread before are visible to all threads in the thread block.
- void \_\_threadfence(); waits until all global and shared memory accesses made by the calling thread prior to \_\_threadfence() are visible to:

Advanced synchronization

Device side memory fence functions:

- void \_\_threadfence\_block(); waits until all global and shared memory accesses made by the calling thread before are visible to all threads in the thread block.
- void \_\_threadfence(); waits until all global and shared memory accesses made by the calling thread prior to \_\_threadfence() are visible to:
  - All threads in the thread block for shared memory accesses,

Advanced synchronization

Device side memory fence functions:

- void \_\_threadfence\_block(); waits until all global and shared memory accesses made by the calling thread before are visible to all threads in the thread block.
- void \_\_threadfence(); waits until all global and shared memory accesses made by the calling thread prior to \_\_threadfence() are visible to:
  - All threads in the thread block for shared memory accesses,
  - All threads in the device for global memory accesses.



WARSAW UNIVERSITY OF TECHNOLOGY

#### Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

Variables and Memory

Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements



WARSAW UNIVERSITY OF TECHNOLOGY

#### Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

#### Variables and Memory

#### Memory types

Global Memory Access Shared Memory Example of shared memory utilization – matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements

### Accessing different types of memory

#### Variables and Memory



Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

Stored in device global memory (large, high latency)

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

- device\_\_
  - Stored in device global memory (large, high latency)
  - Accessible by all threads

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

- device\_\_
  - Stored in device global memory (large, high latency)
  - Accessible by all threads
  - ► Lifetime: application

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application

\_\_constant\_\_

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application

\_\_constant\_\_

Stored in constant memory space

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application

\_\_constant\_\_

- Stored in constant memory space
- Accessible by all threads

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- \_\_constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- \_\_constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created

\_\_shared\_\_

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- \_\_constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created

#### \_\_shared\_\_

Stored in on-chip shared memory (very low latency)

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- \_\_constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created

#### \_\_shared\_\_

- Stored in on-chip shared memory (very low latency)
- Allocated by execution configuration or declared at compile time

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created
- \_\_shared\_\_
  - Stored in on-chip shared memory (very low latency)
  - Allocated by execution configuration or declared at compile time
  - Accessible by all threads in the same thread block

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created
- \_\_shared\_\_
  - Stored in on-chip shared memory (very low latency)
  - Allocated by execution configuration or declared at compile time
  - Accessible by all threads in the same thread block
  - Lifetime: kernel execution

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created
- \_\_shared\_\_
  - Stored in on-chip shared memory (very low latency)
  - Allocated by execution configuration or declared at compile time
  - Accessible by all threads in the same thread block
  - Lifetime: kernel execution

\_\_managed\_\_

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created
- \_\_shared\_\_
  - Stored in on-chip shared memory (very low latency)
  - Allocated by execution configuration or declared at compile time
  - Accessible by all threads in the same thread block
  - Lifetime: kernel execution
- \_\_managed\_\_
  - Can be referenced by both device and host

Variables and Memory

A variable declared in a kernel generally is stored in registers if possible. Exceptions (memory space specifiers):

\_\_device\_\_

- Stored in device global memory (large, high latency)
- Accessible by all threads
- Lifetime: application
- constant\_\_
  - Stored in constant memory space
  - Accessible by all threads
  - Lifetime: the CUDA context in which it is created
- \_\_shared\_\_
  - Stored in on-chip shared memory (very low latency)
  - Allocated by execution configuration or declared at compile time
  - Accessible by all threads in the same thread block
  - Lifetime: kernel execution

#### \_\_managed\_\_

- Can be referenced by both device and host
- ► Lifetime: application



WARSAW UNIVERSITY OF TECHNOLOGY

### Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

#### Variables and Memory

Memory types Global Memory Access

Shared Memory Example of shared memory utilization – matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements

#### **Global Memory Operations**

Variables and Memory

Memory operations are executed per warp

- 32 threads in a warp provide memory addresses
- Hardware determines into which lines those addresses fall
- Stores:

Invalidate L1, go at least to L2, 32-byte granularity

- Three types of loads:
  - Caching (default)
  - Non-caching
  - Read-only

# Memory Load

Variables and Memory

#### Caching (default mode)

- Attempts to hit in L1, then L2, then GMEM
- Load granularity is 128-byte line
- Non-caching
  - Compile with -Xptxas -dlcm=cg option to nvcc
  - Attempts to hit in L2, then GMEM (Does not hit in L1, invalidates the line if it's in L1 already)
  - Load granularity is 32 bytes

#### Read-only

- Loads via read-only cache: (Attempts to hit in Read-only cache, then L2, then GMEM)
- Load granularity is 32 bytes

#### Coalesced Global Memory Access

Perhaps the most important optimization

Global memory loads and stores by threads of a warp are coalesced by the device into as few as possible transactions.

#### Compute capability $\geq 6.0$ (since pascal)

The concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of 32-byte transactions necessary to service all of the threads of the warp.

#### Compute capability < 5.2 (before pascal)

L1-caching of accesses to global memory can be optionally enabled. If L1-caching is enabled on these devices, the number of required transactions is equal to the number of required 128-byte aligned segments.

## Simple Access Pattern (cc $\geq$ 6.0)

Variables and Memory



- The k-th thread accesses the k-th word in a 32-byte aligned array.
- If the threads of a warp access adjacent 4-byte words
- ...and not all equally participate
- ...and/or random permuted access inside the block
- then still only four 32-byte transactions would have been performed by a device.

NVIDIA CUDA Toolkit. Cuda c++ best practices guide.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html, 2020

#### Misaligned Sequential Access Pattern (cc>6.0)



- The sequential threads accesses sequential memory but not aligned with a 32-byte segment,
- then five 32-byte transactions would have been performed by a device.

#### Misaligned Sequential Access Pattern (cc>6.0)



- The sequential threads accesses sequential memory but not aligned with a 32-byte segment,
- then five 32-byte transactions would have been performed by a device.

```
1 __global__ void offsetCopy(float *odata,
2 float *idata,
3 int offset)
4 {
5 int xid = blockIdx.x * blockDim.x +
6 threadIdx.x + offset;
7 odata[xid] = idata[xid];
8 }
```

#### Misaligned Sequential Access Pattern (cc>6.0)

#### Variables and Memory



- The sequential threads accesses sequential memory but not aligned with a 32-byte segment,
- then five 32-byte transactions would have been performed by a device.

```
1 __global__ void offsetCopy(float *odata,
2 float *idata,
3 int offset)
4 {
5 int xid = blockIdx.x * blockDim.x +
6 threadIdx.x + offset;
7 odata[xid] = idata[xid];
8 }
```



NVIDIA CUDA Toolkit. Cuda c++ best practices guide.

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html, 2020

#### Strided Access Pattern



- A stride of 2 results in a 50% of load/store efficiency since half the elements in the transaction are not used and represent wasted bandwidth.
- As the stride increases, the effective bandwidth decreases until the point where 32 32-byte segments are loaded for the 32 threads in a warp.

## Strided Access Pattern

Variables and Memory



1

2

3

4

5

6

7

8

- A stride of 2 results in a 50% of load/store efficiency since half the elements in the transaction are not used and represent wasted bandwidth.
- As the stride increases, the effective bandwidth decreases until the point where 32 32-byte segments are loaded for the 32 threads in a warp.

## Strided Access Pattern

Variables and Memory



1

2

3

4 5

6

7

8

- A stride of 2 results in a 50% of load/store efficiency since half the elements in the transaction are not used and represent wasted bandwidth.
- As the stride increases, the effective bandwidth decreases until the point where 32 32-byte segments are loaded for the 32 threads in a warp.



 Avoid non-unit-stride global memory accesses – use shared memory.

NVIDIA CUDA Toolkit. Cuda c++ best practices guide. https://docs.nvidia.com/cuda/cuda-c-best-practicesguide/index.html, 2020

#### In older architectures

(not supported now)







 $21 \, / \, 60$ 

Variables and Memory

 Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays

Variables and Memory

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays

array of type\* accessed by BaseAddress + tid

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays
  - array of type\* accessed by BaseAddress + tid
  - type\* must meet the size and alignment requirements

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays
  - array of type\* accessed by BaseAddress + tid
  - type\* must meet the size and alignment requirements
  - if size of type\* is larger than 16 it must be treated with additional care

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays
  - array of type\* accessed by BaseAddress + tid
  - type\* must meet the size and alignment requirements
  - if size of type\* is larger than 16 it must be treated with additional care
- For two-dimensional arrays

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays
  - array of type\* accessed by BaseAddress + tid
  - type\* must meet the size and alignment requirements
  - if size of type\* is larger than 16 it must be treated with additional care
- For two-dimensional arrays
  - array of type\* accessed by BaseAddress + width\*tiy + tix

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays
  - array of type\* accessed by BaseAddress + tid
  - type\* must meet the size and alignment requirements
  - if size of type\* is larger than 16 it must be treated with additional care
- For two-dimensional arrays
  - array of type\* accessed by BaseAddress + width\*tiy + tix
  - width is a multiply of 16

- Align data to fit equal segments in memory (arrays allocated with cudaMalloc... are positioned to appropriate addresses automatically)
- For single-dimensional arrays
  - array of type\* accessed by BaseAddress + tid
  - type\* must meet the size and alignment requirements
  - if size of type\* is larger than 16 it must be treated with additional care
- For two-dimensional arrays
  - array of type\* accessed by BaseAddress + width\*tiy + tix
  - width is a multiply of 16
  - The width of the thread block is a multiple of half the warp size

Variables and Memory

► If proper memory alignment is impossible:

Variables and Memory

► If proper memory alignment is impossible:

Use structures of arrays instead of arrays of structures

Variables and Memory

► If proper memory alignment is impossible:

Use structures of arrays instead of arrays of structures

Variables and Memory

If proper memory alignment is impossible:

Use structures of arrays instead of arrays of structures

| AoS | $x_1$ | $y_1$ | $z_1$ | $w_1$ | $x_2$ | $y_2$ | $z_2$ | $w_2$ |
|-----|-------|-------|-------|-------|-------|-------|-------|-------|
|     | $x_3$ | $y_3$ | $z_3$ | $w_3$ | $x_4$ | $y_4$ | $z_4$ |       |

Variables and Memory

If proper memory alignment is impossible:

Use structures of arrays instead of arrays of structures

| AoS | $x_1$ | $y_1$ | $z_1$ | $w_1$ | $x_2$ | $y_2$ | $z_2$ | $w_2$ |
|-----|-------|-------|-------|-------|-------|-------|-------|-------|
|     | $x_3$ | $y_3$ | $z_3$ | $w_3$ | $x_4$ | $y_4$ | $z_4$ |       |

| SoA | $x_1$ | $x_2$ | $x_3$ | $x_4$ | <br> | <br> |
|-----|-------|-------|-------|-------|------|------|
|     | $y_1$ | $y_2$ | $y_3$ | $y_4$ | <br> | <br> |
|     | $z_1$ | $z_2$ | $z_3$ | $z_4$ | <br> | <br> |
|     | $w_1$ | $w_2$ | $w_3$ | $w_4$ | <br> | <br> |

Variables and Memory

If proper memory alignment is impossible:

Use structures of arrays instead of arrays of structures

| AoS | $x_1$ | $y_1$ | $z_1$ | $w_1$ | $x_2$ | $y_2$ | $z_2$ | $w_2$ |
|-----|-------|-------|-------|-------|-------|-------|-------|-------|
|     | $x_3$ | $y_3$ | $z_3$ | $w_3$ | $x_4$ | $y_4$ | $z_4$ |       |

| SoA | $x_1$ | $x_2$ | $x_3$ | $x_4$ | <br> | <br> |
|-----|-------|-------|-------|-------|------|------|
|     | $y_1$ | $y_2$ | $y_3$ | $y_4$ | <br> | <br> |
|     | $z_1$ | $z_2$ | $z_3$ | $z_4$ | <br> | <br> |
|     | $w_1$ | $w_2$ | $w_3$ | $w_4$ | <br> | <br> |

Use \_\_align(4), \_\_align(8) or \_\_align(16) in structure declarations

## Coalescing example I

Variables and Memory

```
Misaligned memory access with float3 data
```

```
global void accessFloat3(float3 *d_in, float3 *d_out)
1
  Ł
2
     int index = blockIdx.x * blockDim.x + threadIdx.x;
3
     float3 a = d_in[index];
4
     a.x += 2;
   a.v += 2;
6
     a.z += 2:
7
     d out[index] = a;
8
  }
g
```

 Each thread reads 3 floats = 12 bytes
 Half warp reads 16 \* 12 = 192 bytes (three 64B non-contiguous segments)

NVIDIA. Cuda whitepapers. www.nvidia.com/cuda

## Coalescing example II

Variables and Memory

Coalesced memory access with float3 data

```
__global__ void accessFloat3Shared(float *g_in, float *g_out)
 1
   Ł
 2
      int index = 3 * blockIdx.x * blockDim.x + threadIdx.x;
3
      __shared__ float s_data[256*3];
 4
      s_data[threadIdx.x] = g_in[index];
 5
      s_data[threadIdx.x+256] = g_in[index+256];
6
      s_data[threadIdx.x+512] = g_in[index+512];
7
      _____syncthreads();
8
      float3 a = ((float3*)s data)[threadIdx.x];
9
10
      a.x += 2:
11
      a.v += 2;
      a.z += 2:
12
      ((float3*)s_data)[threadIdx.x] = a;
13
      __syncthreads();
14
      g_out[index] = s_data[threadIdx.x];
15
      g_out[index+256] = s_data[threadIdx.x+256];
16
      g_out[index+512] = s_data[threadIdx.x+512];
17
   }
18
```



WARSAW UNIVERSITY OF TECHNOLOGY

## Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

#### Variables and Memory

Memory types Global Memory Access

#### Shared Memory

Example of shared memory utilization - matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements

#### Allocating shared memory

Variables and Memory

#### Static way

Device side:



Host side:

1 kernel <<< nBlocks, blockSize >>>(...);

## Allocating shared memory

Variables and Memory

#### Static way Device side:

```
constant uint blockSize = 64;
2
    __global__ void kernel(...)
3
5
      shared short array0[blockSize];
6
      __shared__ float array1[blockSize];
7
      shared int array2[blockSize];
8
9
   }
```

Host side:

kernel<<< nBlocks, blockSize >>>(...); 1

#### Dynamic way Device side:

2

3 4

6

7

8

9

10

1

4

```
constant uint blockSize = 64;
__global__ void kernel(...)
  extern shared float array[];
  //All variables declared in this fashion,
        start at the same address in memory, so:
  short* array0 = (short*)array;
  float* arrav1 = (float*)&arrav0[blockSize];
  int* array2 = (int*)&array1[blockSize];
```

#### Host side:

```
smBytes = blockSize*sizeof(float)
2
           + blockSize*sizeof(short)
3
           + blockSize*sizeof(int):
5
```

kernel<<< nBlocks, blockSize, smBytes >>>(...);

## Allocating shared memory

Variables and Memory

# Static way

```
constant uint blockSize = 64;
                                                      constant uint blockSize = 64;
2
    __global__ void kernel(...)
                                                  2
                                                      __global__ void kernel(...)
3
                                                  3
                                                  4
                                                         extern shared float array[];
5
      shared short array0[blockSize];
                                                         //All variables declared in this fashion,
6
      __shared__ float array1[blockSize];
                                                               start at the same address in memory, so:
7
      shared int array2[blockSize];
                                                  6
8
                                                  7
                                                         short* array0 = (short*)array;
9
                                                  8
                                                         float* arrav1 = (float*)&arrav0[blockSize];
                                                  9
                                                         int* array2 = (int*)&array1[blockSize];
                                                 10
      Host side:
                                                         Host side:
    kernel<<< nBlocks, blockSize >>>(...);
1
                                                      smBytes = blockSize*sizeof(float)
                                                  2
                                                             + blockSize*sizeof(short)
                                                  3
                                                             + blockSize*sizeof(int):
                                                  4
                                                  5
                                                      kernel<<< nBlocks, blockSize, smBytes >>>(...);
```

Note that pointers need to be aligned to the type they point to. Error: array1 is not aligned to 4 bytes:

```
1 short* array0 = (short*)array;
```

```
2 float* array1 = (float*)&array0[127];
```

#### **Dynamic way** Device side:

Variables and Memory

Shared memory is divided into equally sized memory modules, called **banks**.

Variables and Memory

Shared memory is divided into equally sized memory modules, called banks.

Different banks can be accessed simultaneously.

- Shared memory is divided into equally sized memory modules, called banks.
- Different banks can be accessed simultaneously.
- Read or write to n addresses in n banks multiplies bandwidth of a single bank by n.

- Shared memory is divided into equally sized memory modules, called banks.
- Different banks can be accessed simultaneously.
- Read or write to n addresses in n banks multiplies bandwidth of a single bank by n.
- If many threads refers the same bank the access is serialized hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary.

- Shared memory is divided into equally sized memory modules, called banks.
- Different banks can be accessed simultaneously.
- Read or write to n addresses in n banks multiplies bandwidth of a single bank by n.
- If many threads refers the same bank the access is serialized hardware splits a memory request that has bank conflicts into as many separate conflict-free requests as necessary.
- There is one exception if all threads within a half-warp accesses the same address.

#### Bank conflicts

Variables and Memory

Shared memory banks are organized in such a way that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per clock cycle. The bandwidth of shared memory is 32 bits per bank per clock cycle.

## Access with no bank conflicts

Variables and Memory



left: stride = 1 right: stride random

NVIDIA. Cuda whitepapers. www.nvidia.com/cuda

### Access with bank conflicts

#### Variables and Memory



left: stride = 2 (2 way bank conflict) right: stride = 8 (8 way bank conflict)

NVIDIA. Cuda whitepapers. www.nvidia.com/cuda

Padding – adding extra space between array elements in order to brake cyclic access to same bank.

## Example of bank conflicts removal in reduction I

#### Variables and Memory

### Addressing Without Padding

int ai = offset\*(2\*thid+1)-1; int bi = offset\*(2\*thid+2)-1; temp[bi] += temp[ai];

offset = 1: Address (ai) stride is 2, resulting in 2-way bank conflicts







Mark Harris. Parallel prefix sum (scan) with CUDA. www.nvidia.com/cuda, 2007

#### Example of bank conflicts removal in reduction II

#### Variables and Memory

## Addressing With Padding

```
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += ai / NUM_BANKS;
bi += bi / NUM_BANKS;
temp[bi] += temp[ai];
```





Offset = 2: Padding addresses every 16 elements removes bank conflicts



Padding increment:



Variables and Memory

We need more space in shared memory:

unsigned int extra\_space = num\_elements / NUM\_BANKS;

Variables and Memory

We need more space in shared memory:

```
unsigned int extra_space = num_elements / NUM_BANKS;
```

#### Padding macro:

- 1 #define NUM\_BANKS 16
- 2 #define LOG\_NUM\_BANKS 4
- 3
- 4 **#ifdef** ZERO\_BANK\_CONFLICTS
- 5 #define CONFLICT\_FREE\_OFFSET(index) ((index) >> LOG\_NUM\_BANKS \

+ (index) >> (2 \* LOG NUM BANKS))

- 6
- 7 **#else**
- 8 #define CONFLICT\_FREE\_OFFSET(index) ((index) >> LOG\_NUM\_BANKS)
- 9 **#endif**

Variables and Memory

We need more space in shared memory:

```
unsigned int extra_space = num_elements / NUM_BANKS;
1
    Padding macro:
  #define NUM BANKS 16
1
  #define LOG_NUM_BANKS 4
2
3
  #ifdef ZERO_BANK_CONFLICTS
4
  #define CONFLICT_FREE_OFFSET(index) ((index) >> LOG_NUM_BANKS \
5
                                   + (index) >> (2 * LOG_NUM_BANKS))
6
  #else
7
  #define CONFLICT FREE OFFSET(index) ((index) >> LOG NUM BANKS)
8
  #endif
a
```

Zero bank conflicts requires even more additional space:

1 **#ifdef** ZERO\_BANK\_CONFLICTS

```
2 extra_space += extra_space / NUM_BANKS;
```

```
3 #endif
```

Mark Harris. Parallel prefix sum (scan) with CUDA. www.nvidia.com/cuda, 2007

Variables and Memory

Loading data into shared memory:

```
int ai = thid, bi = thid + (n/2);
// compute spacing to avoid bank conflicts
int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
int bankOffsetB = CONFLICT_FREE_OFFSET(bi);
TEMP(ai + bankOffsetA) = g_idata[ai];
TEMP(bi + bankOffsetB) = g_idata[bi];
```

Variables and Memory

Loading data into shared memory:

```
int ai = thid, bi = thid + (n/2);
1
2
   // compute spacing to avoid bank conflicts
3
   int bankOffsetA = CONFLICT FREE OFFSET(ai);
   int bankOffsetB = CONFLICT_FREE_OFFSET(bi);
5
6
  TEMP(ai + bankOffsetA) = g_idata[ai];
7
  TEMP(bi + bankOffsetB) = g_idata[bi];
8
    Algorithm:
  int ai = offset*(2*thid+1)-1:
1
   int bi = offset*(2*thid+2)-1;
2
3
   ai += CONFLICT_FREE_OFFSET(ai);
4
   bi += CONFLICT FREE OFFSET(bi);
6
  TEMP(bi) += TEMP(ai);
7
```

Mark Harris. Parallel prefix sum (scan) with CUDA. www.nvidia.com/cuda, 2007



WARSAW UNIVERSITY OF TECHNOLOGY

## Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

#### Variables and Memory

Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

#### Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements

No shared memory used



#### Host program for clarity

| void MatMul(const Matrix A,                     | 22      |
|-------------------------------------------------|---------|
| const Matrix B,                                 | 23      |
| Matrix C)                                       | 24      |
| {                                               | 25      |
| <pre>// Load A and B to device memory</pre>     | 26      |
| Matrix d_A;                                     | 27      |
| d_A.width = A.width; d_A.height = A.heig        | ght; 28 |
| size_t size = A.width * A.height                | 29      |
| <pre>* sizeof(float);</pre>                     | 30      |
|                                                 | 31      |
| <pre>cudaMalloc(&amp;d_A.elements, size);</pre> | 32      |
| cudaMemcpy(d_A.elements, A.elements, siz        | ze, 33  |
| cudaMemcpyHostToDevice);                        | 34      |
| Matrix d_B;                                     | 35      |
| d_B.width = B.width;                            | 36      |
| d_B.height = B.height;                          | 37      |
| size = B.width * B.height * sizeof(float        | ; 38    |
|                                                 | 39      |
| <pre>cudaMalloc(&amp;d_B.elements, size);</pre> | 40      |
| cudaMemcpy(d_B.elements, B.elements, siz        | ze, 41  |
| cudaMemcpyHostToDevice);                        | 42      |
|                                                 | 43      |

```
// Allocate C in device memory
Matrix d C:
d_C.width = C.width; d_C.height = C.height;
size = C.width * C.height * sizeof(float);
cudaMalloc(&d C.elements, size):
```

```
// Invoke kernel
dim3 dimBlock(BLOCK SIZE, BLOCK SIZE);
dim3 dimGrid(B.width / dimBlock.x,
            A.height / dimBlock.y);
MatMulKernel <<< dimGrid. dimBlock>>>
             (d A, d B, d C):
```

```
// Read C from device memory
cudaMemcpy(C.elements, d_C.elements, size,
          cudaMemcpyDeviceToHost);
```

```
// Free device memory
cudaFree(d_A.elements);
cudaFree(d B.elements);
cudaFree(d C.elements);
```

#### with shared memory used I



NVIDIA. Cuda c++ programming guide.

www.nvidia.com/cuda

```
typedef struct {
        int width;
 3
        int height;
 Δ
        int stride;
 5
        float* elements;
 6
     } Matrix:
     device float GetElement(const Matrix A,
                              int row. int col)
     ſ
        return A.elements[row * A.stride + col];
     }
     device void SetElement(Matrix A, int row,
                             int col, float value)
     ſ
        A.elements[row * A.stride + col] = value:
     3
     // Get the BLOCK_SIZExBLOCK_SIZE sub-matrix Asub
     // of A located col sub-matrices to the right and
     // row sub-matrices down from the upper-left corner
     // of A
      device Matrix GetSubMatrix(Matrix A, int row,
            int col)
     ſ
        Matrix Asub:
        Asub.width = BLOCK_SIZE;
28
        Asub.height = BLOCK SIZE:
29
        Asub.stride = A.stride:
30
        Asub.elements = &A.elements[A.stride
31
                                  * BLOCK SIZE * row
32
                                  + BLOCK SIZE * coll:
33
        return Asub;
34
```

3

with shared memory used II





#### www.nvidia.com/cuda

```
__global__ void MatMulKernel(Matrix A,
Matrix B,
Matrix C)
{
    // Block row and columm
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    // Each thread block computes one sub-matrix
    // Csub of C
Matrix Csub = GetSubMatrix(C, blockRow.
```

// Each thread computes one element of Csub
// by accumulating results into Cvalue
float Cvalue = 0;

```
// Thread row and column within Csub
int row = threadIdx.y;
int col = threadIdx.x;
```

51

52

53

54

55

81

82 }

with shared memory used III



www.nvidia.com/cuda

```
for (int m = 0; m < (A.width / BLOCK_SIZE); ++m)</pre>
   // Get sub-matrix Asub of A
   Matrix Asub = GetSubMatrix(A, blockRow, m);
   // Get sub-matrix Bsub of B
   Matrix Bsub = GetSubMatrix(B, m, blockCol);
   // Shared memory used to store Asub and Bsub
   __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
   shared float Bs[BLOCK SIZE][BLOCK SIZE];
   // Each thread loads one element of each sub
          -matrix
   As[row][col] = GetElement(Asub, row, col);
   Bs[row][col] = GetElement(Bsub, row, col);
   // Synchronize to make sure the sub-matrices
   // are loaded before starting the
          computation
   _____syncthreads();
   // Multiply Asub and Bsub together
   for (int e = 0; e < BLOCK_SIZE; ++e)</pre>
       Cvalue += As[row][e] * Bs[e][col]:
   // Synchronize to asure that the preceding
   // computation is done before loading new
   // sub-matrices of A and B
   syncthreads();
3
// Write Csub to device memory, one thread, one
      element
SetElement(Csub, row, col, Cvalue);
```



WARSAW UNIVERSITY OF TECHNOLOGY

### Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

Variables and Memory Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

#### Asynchronous operations

Problems of parallelism Race conditions Volatile

Time Measurements

Asynchronous operations

• Applications manage concurrency through streams.

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.



Asynchronous operations

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.

cudaStream\_t - stream type

cudaStreamCreate( &stream )

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.
- cudaStream\_t stream type
- cudaStreamCreate( &stream )
- cudaStreamDestroy( &stream ) waits for all tasks to complete before destroying a stream;

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.
- cudaStream\_t stream type
- cudaStreamCreate( &stream )
- cudaStreamDestroy( &stream ) waits for all tasks to complete before destroying a stream;
- cudaStreamQuery() checks if all preceding commands in a stream have completed

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.
- cudaStream\_t stream type
- cudaStreamCreate( &stream )
- cudaStreamDestroy( &stream ) waits for all tasks to complete before destroying a stream;
- cudaStreamQuery() checks if all preceding commands in a stream have completed
- cudaStreamSynchronize() forces the run-time to wait until all preceding commands in a stream have completed.

- Applications manage concurrency through streams.
- ► A stream is a sequence of commands that execute in order.
- Different streams may execute their commands out of order with respect to one another or concurrently.
- cudaStream\_t stream type
- cudaStreamCreate( &stream )
- cudaStreamDestroy( &stream ) waits for all tasks to complete before destroying a stream;
- cudaStreamQuery() checks if all preceding commands in a stream have completed
- cudaStreamSynchronize() forces the run-time to wait until all preceding commands in a stream have completed.
- cudaThreadSynchronize() forces the run-time to wait until all preceding device tasks in all streams have completed

#### Streams API – example

Asynchronous operations

NVIDIA. Cuda c++ programming guide. www.nvidia.com/cuda

```
cudaStream t stream[2];
1
   for (int i = 0; i < 2; ++i)</pre>
2
      cudaStreamCreate(&stream[i]);
 3
   float* hostPtr:
4
   cudaMallocHost((void**)&hostPtr, 2 * size, cudaHostAllocDefault);
5
   cudaMalloc((void**)&inputDevPtr, 2 * size);
6
   cudaMalloc((void**)&outputDevPtr, 2 * size);
7
   for (int i = 0; i < 2; ++i)
8
      cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
9
                size, cudaMemcpyHostToDevice, stream[i]);
10
   for (int i = 0; i < 2; ++i)
11
      myKernel <<<100, 512, 0, stream[i]>>>
12
                (outputDevPtr + i * size, inputDevPtr + i * size, size);
13
   for (int i = 0; i < 2; ++i)
14
      cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
15
                size, cudaMemcpyDeviceToHost, stream[i]);
16
   cudaThreadSynchronize();
17
   for (int i = 0; i < 2; ++i)</pre>
18
      cudaStreamDestroy(&stream[i]);
19
```



WARSAW UNIVERSITY OF TECHNOLOGY

## Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

Variables and Memory Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile



WARSAW UNIVERSITY OF TECHNOLOGY

# Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

Variables and Memory

Memory types

Global Memory Access

Shared Memory

Example of shared memory utilization - matrices

Asynchronous operations

### Problems of parallelism Race conditions Volatile

Problems of parallelism

Simplest possible operation (\*x is a global memory pointer)

Let int \*x point to global memory. \*x++ happens in 3 steps:

- 1. Read the value in \*x into a register.
- 2. Add 1 to the value in the register.
- 3. Write the result back to \*x.

Problems of parallelism

Simplest possible operation (\*x is a global memory pointer)

Let int \*x point to global memory. \*x++ happens in 3 steps:

- 1. Read the value in \*x into a register.
- 2. Add 1 to the value in the register.
- 3. Write the result back to \*x.

- 1 A: \*x++
- 2 B: \*x++

Problems of parallelism

Simplest possible operation (\*x is a global memory pointer)

Let int \*x point to global memory. \*x++ happens in 3 steps:

- 1. Read the value in \*x into a register.
- 2. Add 1 to the value in the register.
- 3. Write the result back to \*x.

```
1 A: *x++

2 B: *x++

1 A: a = *x //a=7

2 B: b = *x //b=7

3 A: a++ //8

4 A: *x = a //8

5 B: b++ //8

6 B: *x = b //8
```

#### Problems of parallelism

```
#include <stdio.h>
 1
 2
     #include <stdlib.h>
     #include <cuda.h>
 4
     #include <cuda runtime.h>
 5
 6
     global void colonel(int *d a){
 7
       *d_a += 1;
8
     }
9
10
     int main(){
11
       int a = 0, *d a:
12
       cudaMalloc((void**) &a_d, sizeof(int));
13
       cudaMemcpy(d a, &a, sizeof(int), cudaMemcpyHostToDevice);
      float elapsedTime:
14
15
      cudaEvent t start, stop:
16
       cudaEventCreate(&start);
17
       cudaEventCreate(&stop);
18
       cudaEventRecord( start, 0 );
19
20
       colonel<<<1000,1000>>>(d_a);
21
22
       cudaEventRecord( stop, 0 );
23
       cudaEventSynchronize( stop );
       cudaEventElapsedTime( &elapsedTime, start, stop );
24
25
       cudaEventDestrov( start ):
26
       cudaEventDestroy( stop );
27
       printf("GPU_Time_::/%f_ms\n", elapsedTime);
28
29
       cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
30
       printf("a_{l}=_{l}%d n", a):
31
32
       cudaFree(d a):
33
     3
```

### Race condition results

Problems of parallelism

#### Output:

- 1 > nvcc race\_condition.cu -o race\_condition
- 2 > ./race\_condition
- 3 GPU Time : 0.148 ms
- 4 a = 88

### Race condition results

Problems of parallelism

### Output:

- 1 > nvcc race\_condition.cu -o race\_condition
- 2 > ./race\_condition
- 3 GPU Time : 0.148 ms
- 4 a = 88

#### Modification:

```
1 __global__ void colonel(int *d_a){
2 atomicAdd(d_a, 1);
3 }
```

### Race condition results

Problems of parallelism

### Output:

- 1 > nvcc race\_condition.cu -o race\_condition
- 2 > ./race\_condition
- 3 GPU Time : 0.148 ms
- 4 a = 88

#### Modification:

```
1 __global__ void colonel(int *d_a){
2 atomicAdd(d_a, 1);
3 }
```

#### Output:

1 GPU Time : 14.85 ms

2 a = 1000000

### Atomic functions can only be used in device functions.

### Atomic operations I

(for all devices CC>2.0)

**Device-wide atomics:** atomic for all CUDA threads in the current program executing in the same compute device as the current thread:

- 1 atomicAdd()
- 2 atomicSub()
- 3 atomicMin()
- 4 atomicMax()
- 5 atomicInc()
- 6 atomicDec()
- 7 atomicAdd()
- 8 atomicExch()
- 9 atomicAnd()
- 10 atomicOr()
- 11 atomicXor()

12 int atomicCAS(int\* address, int compare, int val); // Compare And Swap (returns old value)

An atomic function performs a read-modify-write atomic operation on one 32-bit or 64-bit word residing in global or shared memory.

### Atomic operations II (for all devices CC≥6.0)

**System-wide atomics:** atomic for all threads in the current program including other CPUs and GPUs in the system. These are suffixed with \_system. Like: atomicAdd\_system().

**Block-wide atomics:** atomic for all CUDA threads in the current program executing in the same thread block as the current thread. These are suffixed with \_block. Like: atomicAdd\_block().



WARSAW UNIVERSITY OF TECHNOLOGY

# Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

Variables and Memory

Memory types

Global Memory Access

Shared Memory

Example of shared memory utilization - matrices

Asynchronous operations

### Problems of parallelism

Race conditions Volatile

Problems of parallelism

 One of the compiler's tricks: reuse references to memory location

Problems of parallelism

- One of the compiler's tricks: reuse references to memory location
- Result: A reused value may be changed by another thread in the background

Problems of parallelism

- One of the compiler's tricks: reuse references to memory location
- Result: A reused value may be changed by another thread in the background

Problems of parallelism

- One of the compiler's tricks: reuse references to memory location
- Result: A reused value may be changed by another thread in the background

```
// myArray is an array of non-zero integers
1
   // located in global or shared memory
2
   __global__ void myKernel(int* result)
3
   Ł
4
      int tid = threadIdx.x;
5
      int ref1 = myArray[tid] * 1;
6
      myArray[tid + 1] = 2;
7
      int ref2 = myArray[tid] * 1;
8
      result[tid] = ref1 * ref2;
9
10
  }
```

the first reference to myArray[tid] compiles into a memory read instruction

Problems of parallelism

- One of the compiler's tricks: reuse references to memory location
- Result: A reused value may be changed by another thread in the background

```
// myArray is an array of non-zero integers
1
   // located in global or shared memory
2
   __global__ void myKernel(int* result)
3
   Ł
4
      int tid = threadIdx.x;
5
      int ref1 = myArray[tid] * 1;
6
      myArray[tid + 1] = 2;
7
      int ref2 = myArray[tid] * 1;
8
      result[tid] = ref1 * ref2;
9
10
  }
```

- the first reference to myArray[tid] compiles into a memory read instruction
- the second reference does not as the compiler simply reuses the result of the first read



WARSAW UNIVERSITY OF TECHNOLOGY

## Part 2 – CUDA Advances

Warp threads scheduling

Advanced synchronization

Variables and Memory Memory types Global Memory Access Shared Memory Example of shared memory utilization – matrices

Asynchronous operations

Problems of parallelism Race conditions Volatile

### **Timers** API

- cudaEvent\_t event type
- cudaEventSynchronize() blocks CPU until given event records
- cudaEventRecord() records given event in given stream
- cudaEventElapsedTime() calculates time in milliseconds between events
- cudaEventCreate() creates an event
- cudaEventDestroy() destroys an event

### **Timers Example**

```
cudaEvent_t start, stop; float time;
1
   cudaEventCreate(&start):
 2
   cudaEventCreate(&stop);
3
 4
   cudaEventRecord( start, 0 );
6
   kernel<<<grid,threads>>> ( d_odata, d_idata, size_x, size_y);
 7
8
   cudaEventRecord( stop, 0 );
9
10
   cudaEventSynchronize( stop );
11
12
   cudaEventElapsedTime( &time, start, stop );
13
14
   cudaEventDestroy( start );
15
   cudaEventDestroy( stop );
16
```

### Theoretical Bandwidth Calculation

$$\mathsf{TB} = (\mathsf{Clock} \times 10^6 \times \mathsf{MemInt} \times 2)/10^9$$

- ► TB theoretical bandwidth [GB/s]
- Clock memory clock rate [MHz]
- MemInt width of memory interface [B]
- 2 DDR Double Data Rate Memory

### Theoretical Bandwidth Calculation

Time Measurements

$$\mathsf{TB} = (\mathsf{Clock} \times 10^6 \times \mathsf{MemInt} \times 2)/10^9$$

- ► TB theoretical bandwidth [GB/s]
- Clock memory clock rate [MHz]
- MemInt width of memory interface [B]
- 2 DDR Double Data Rate Memory

For NVIDIA GeForce GTX 280 we get:

 $(1107 \times 10^6 \times 512/8 \times 2)/10^9 = 141.6 \ [GB/s]$ 

### Effective Bandwidth Calculation

$$EB = \frac{(B_r + B_w) \times 10^{-9}}{t}$$

- ► EB effective bandwidth [GB/s]
- ►  $B_r$  bytes read
- $B_w$  bytes written
- ► *t* time of the test [s]

# Bibliography

Mark Harris. Parallel prefix sum (scan) with CUDA. www.nvidia.com/cuda, 2007.

- NVIDIA. Cuda c++ programming guide. www.nvidia.com/cuda.
- NVIDIA. Cuda whitepapers. www.nvidia.com/cuda.
- NVIDIA CUDA Toolkit. Cuda c++ best practices guide. https://docs.nvidia.com/cuda/cuda-c-best-practicesguide/index.html, 2020.

Materiały sponsorowane przez:

Projekt "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca" współfinansowany jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego

Zadanie 10 pn. "Modyfikacja programów studiów na kierunkach prowadzonych przez Wydział Matematyki i Nauk Informacyjnych", realizowane w ramach projektu "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca", współfinansowanego jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego



Rzeczpospolita Polska Politechnika Warszawska

Unia Europejska Europejski Fundusz Społeczny

