Graphic Processors in Computational Applications

Part 1 – Introduction

dr inż. Krzysztof Kaczmarski 2024 Materiały sponsorowane przez:

Projekt "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca" współfinansowany jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego

Zadanie 10 pn. "Modyfikacja programów studiów na kierunkach prowadzonych przez Wydział Matematyki i Nauk Informacyjnych", realizowane w ramach projektu "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca", współfinansowanego jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego



Rzeczpospolita Polska Politechnika Warszawska

Unia Europejska Europejski Fundusz Społeczny



### Goals for today:

- Understand course passing requirements
- Get basic knowledge on GPU programming

### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

#### Semester Schedule

#### GPU and modern HPC

Introduction to CUDA and GPGPU Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting Example

#### Lectures

Technical part:

- 1. GPU threads basics
  - Process/Thread/Kernel, Host/Device
- 2. Memory management Global/Local/Shared/Registers/Constant
- 3. Threads synchronization
- 4. Advanced memory management
- 5. Multiple GPU HPC
- 6. Advanced parallel execution problems
- 7. Inter-warp communication
- 8. Thrust API

# Lectures

Algorithms:

- 1. Model of vector processing
- 2. Parallel scalability models
- 3. Prefix-sums
- 4. Parallel sorting
- 5. Optimal matrix multiplication
- 6. Particle interactions

# Obligatory Laboratories

Semester Schedule

- $1\,$  Tutorial: Play in the playground choose your toys
- 2 Tutorial: Can you reduce? (3p)
- 3 Tutorial: Touch a fractal border (3p)
- 4 Tutorial: Trust in Thrust (3p)
- 5-9 Project 1 (40-60p)

10-14 Project 2 (40-60p)

Choose two projects from the list:

- A (easy): 40 points
- ► B (moderate): 60 points
- You must report progress every two weeks.
- Deadline for the projects: the last week of the semester.

#### Projects Grading II

- If a project contains no mistakes it gets 100% of the possible points.
- There are penalty points for misuse of GPU concepts:
  - -10% : processor occupancy not achieved or too few threads running
  - -10% : memory allocation or deallocation problems
  - -10% : AoS if SoA is possible
    - -5% : shared memory conflicts
    - $-5\%\,$  : ugly code, no comments, mess in files
    - -5% : no makefile (cmake is ok)

## Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

#### GPU and modern HPC

Introduction to CUDA and GPGPU Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting Example

### The most powerful computers use GPU devices

#### GPU and modern HPC

| Site:                                                             | RIKEN Center for Comp. Sci.                                                                | DOE/SC/Oak Ridge Nat. Lab.                                                         | DOE/NNSA/LLNL                                                                                         |  |  |  |
|-------------------------------------------------------------------|--------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|--|--|--|
| Manufacturer:<br>Cores:<br>Memory:<br>Processor:<br>Interconnect: | Fujitsu<br>7,299,072<br>4,866,048 GB<br>A64FX 48C 2.2GHz<br>Tofu interconnect D            | IBM<br>2,414,592<br>2,801,664 GB<br>IBM POWER9 22C 3.07GHz<br>Dual-rail Infiniband | IBM / NVIDIA / Mellanox<br>1,572,480<br>1,382,400 GB<br>IBM POWER9 22C 3.1GHz<br>Dual-rail Infiniband |  |  |  |
| Performance                                                       |                                                                                            |                                                                                    |                                                                                                       |  |  |  |
| Linpack<br>Theoretical Peak<br>Nmax<br>HPCG [TFlop/s]             | 415,530 TFlop/s<br>513,855 TFlop/s<br>20,459,520<br>13,366.4                               | 148,600 TFlop/s<br>200,795 TFlop/s<br>16,473,600<br>2,925.75                       | 94,640 TFlop/s<br>125,712 TFlop/s<br>11,902,464<br>1,795.67                                           |  |  |  |
| Power Consumption                                                 | n                                                                                          |                                                                                    |                                                                                                       |  |  |  |
| Power:                                                            | 28,334.50 kW                                                                               | 10,096.00 kW                                                                       | 7,438.28 kW                                                                                           |  |  |  |
| Software                                                          |                                                                                            |                                                                                    |                                                                                                       |  |  |  |
| Operating System:<br>Compiler:<br>Math Library:<br>MPI:           | Red Hat Enterprise Linux<br>FUJITSU Soft. V4.0<br>FUJITSU Soft. V4.0<br>FUJITSU Soft. V4.0 | RHEL 7.4<br>XLC, nvcc<br>ESSL, CUBLAS 9.2<br>Spectrum MPI                          | RHEL 7.4<br>IBM XLC<br>ESSL, CUBLAS 9.2<br>IBM Spectrum MPI                                           |  |  |  |

#### Table: June 2020: www.top500.org

In order to read about FUGAKU get the report: Jack Dongarra's FUGAKU Report, 22 June 2020

#### GPU and modern HPC

Accelerator/Co-Processor - Systems Share



## **NVIDIA Supercomputer**

**POD** Architecture



**NVIDIA** website

## DGX A100 HPC Server

GPU and modern HPC

8X NVIDIA A100 GPUS WITH 320 GB TOTAL GPU MEMORY 12 NVLinks/GPU, 600 GB/s GPU-to-GPU Bi-directonal Bandwidth

6X NVIDIA NVSWITCHES
 4.8 TB/s Bi-directional Bandwidth, 2X More than Previous
 Generation NVSwitch

9x MELLANOX CONNECTX-6 2006b/S NETWORK INTERFACE 450 GB/s Peak Bi-directional Bandwidth

OUAL 64-CORE AMD CPUs AND 1 TB SYSTEM MEMORY 3.2X More Cores to Power the Most Intensive AI Jobs

5 15 TB GEN4 NVME SSD 25GB/s Peak Bandwidth, 2X Faster than Gen3 NVME SSDs

#### **NVIDIA** website

## GPU computing applications

#### GPU and modern HPC

| GPU Computing Applications                               |                                  |                           |              |                          |        |                             |                            |             |                |                             |  |
|----------------------------------------------------------|----------------------------------|---------------------------|--------------|--------------------------|--------|-----------------------------|----------------------------|-------------|----------------|-----------------------------|--|
| Libraries and Middleware                                 |                                  |                           |              |                          |        |                             |                            |             |                |                             |  |
| cuDNN<br>TensorRT                                        | cuFF<br>cuBLA<br>cuRAN<br>cuSPAR | S (<br>D M                | CULA<br>AGMA | Thrust<br>NPP            | ust SV |                             | VSIPL<br>SVM<br>penCurrent |             | sX<br>IX<br>Y  | MATLAB<br>Mathematica       |  |
| Programming Languages                                    |                                  |                           |              |                          |        |                             |                            |             |                |                             |  |
|                                                          | C C++                            |                           |              |                          |        | lava<br>/thon Dir<br>appers |                            | rectCompute |                | Directives<br>(e.g. OpenACC |  |
|                                                          | CUDA-Enabled NVIDIA GPUs         |                           |              |                          |        |                             |                            |             |                |                             |  |
| NVIDIA Ampere Architecture<br>(compute capabilities 8.x) |                                  |                           |              |                          |        |                             |                            |             | Tesla A Series |                             |  |
| NVIDIA Turing Architecture<br>(compute capabilities 7.x) |                                  |                           | G            | GeForce 2000 Series      |        | Quadro RTX Series           |                            | eries       | Tesla T Series |                             |  |
| NVIDIA Volta Architecture<br>(compute capabilities 7.x)  |                                  | DRIVE/JETSO<br>AGX Xavier | DN           |                          |        | Quadro GV Series            |                            | ies         | Tesla V Series |                             |  |
| NVIDIA Pascal Architecture<br>(compute capabilities 6.x) |                                  | Tegra X2                  | G            | GeForce 1000 Series      |        | Quadro P Series             |                            |             | Tesla F        | P Series                    |  |
|                                                          |                                  | Ember                     | ded          | Consumer<br>Desktop/Lapt |        | PR                          | ofessio                    | nal         | 16             | ata Center                  |  |

**NVIDIA** website

#### **NVIDIA** Processors Evolution

GPU and modern HPC

| Product Architecture     | P100               | V100                | A100                | H100                |
|--------------------------|--------------------|---------------------|---------------------|---------------------|
| Announcement date        | April 2016         | December 2017       | May 2020            | September 2022      |
| GPU Codename             | GP100              | GV100               | GA100               | GH100               |
| GPU Architecture         | Pascal             | Volta               | Ampere              | Hopper              |
| SMs                      | 56                 | 80                  | 108                 | 132                 |
| TPCs                     | 28                 | 40                  | 54                  | 66                  |
| FP32 Cores / SM          | 64                 | 64                  | 64                  | 128                 |
| FP32 Cores / GPU         | 3584               | 5120                | 6912                | 16896               |
| FP64 Cores / SM          | 32                 | 32                  | 32                  | 64                  |
| FP64 Cores / GPU         | 1792               | 2560                | 3456                | 8448                |
| INT32 Cores / SM         | NA                 | 64                  | 64                  | 64                  |
| INT32 Cores / GPU        | NA                 | 5120                | 6912                | 8448                |
| Tensor Cores / SM        | NA                 | 8                   | 4                   | 4                   |
| Tensor Cores / GPU       | NA                 | 640                 | 432                 | 576                 |
| GPU Boost Clock          | 1480 MHz           | 1530 MHz            | 1410 MHz            | Not finalized       |
| Peak FP16 TFLOPS         | 21.2               | 31.4                | 78                  | 120                 |
| Peak FP32 TFLOPS         | 10.6               | 15.7                | 19.5                | 60                  |
| Peak FP64 TFLOPS         | 5.3                | 7.8                 | 9.7                 | 30                  |
| Texture Units            | 224                | 320                 | 432                 | 528                 |
| Memory Interface         | 4096-bit HBM2      | 4096-bit HBM2       | 5120-bit HBM2       | 5120-bit HBM3       |
| Memory Size              | 16 GB              | 32 GB / 16 GB       | 40 GB / 80 GB       | 80 GB               |
| Memory Data Rate         | 703 MHz DDR        | 877.5 MHz DDR       | 1215 MHz DDR        | Not finalized       |
| Memory Bandwidth         | 720 GB/sec         | 900 GB/sec          | 1.6 TB/sec          | 3.0 TB/sec          |
| L2 Cache Size            | 4096 KB            | 6144 KB             | 40960 KB            | 50 MB               |
| Shared Memory Size / SM  | 64 KB              | up to 96 KB         | up to 164 KB        | 228 KB              |
| Register File Size / SM  | 256 KB             | 256 KB              | 256 KB              | 256 KB              |
| Register File Size / GPU | 14336 KB           | 20480 KB            | 27648 KB            | 33792 KB            |
| TDP                      | 300 Watts          | 300 Watts           | 400 Watts           | 700 Watts           |
| Transistors              | 15.3 billion       | 21.1 billion        | 54.2 billion        | 80 billion          |
| GPU Die Size             | $610 \text{ mm}^2$ | 815 mm <sup>2</sup> | 826 mm <sup>2</sup> | 814 mm2             |
| TSMC Manufact. Proc.     | 16 nm FinFET+      | 12 nm FFN           | 7 nm N7             | 4N cust. for NVIDIA |
|                          |                    |                     |                     |                     |

16 / 51

## Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

#### GPU and modern HPC

Introduction to CUDA and GPGPU Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting Example

#### General Components of a GPU Processor Introduction to CUDA and GPGPU



#### **NVIDIA CUDA Programming Guide**

# Architecture of GA100 Processor

Introduction to CUDA and GPGPU

#### GA100 Full GPU with 128 SMs (A100 Tensor Core GPU has 108 SMs)



**NVIDIA A100 Tensor Core GPU Architecture** 

#### GP100 Streaming Multiprocessor Internals Introduction to CUDA and GPGPU

| M                               |                    |            |      |      |            |         | Instructi | on Cache                        |                |                    |      |      |            |       |     |  |  |  |
|---------------------------------|--------------------|------------|------|------|------------|---------|-----------|---------------------------------|----------------|--------------------|------|------|------------|-------|-----|--|--|--|
|                                 | Instruction Buffer |            |      |      |            |         |           |                                 |                | Instruction Buffer |      |      |            |       |     |  |  |  |
|                                 | Warp Scheduler     |            |      |      |            |         |           |                                 | Warp Scheduler |                    |      |      |            |       |     |  |  |  |
|                                 | Dispeto            |            |      |      |            | ch Unit |           | Dispetch Unit Dispetch Unit     |                |                    |      |      |            |       |     |  |  |  |
| Register File (32,768 x 32-bit) |                    |            |      |      |            |         |           | Register File (32,768 x 32-bit) |                |                    |      |      |            |       |     |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit |         | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit |       | SFU |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit |         | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit |       | sru |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit |         | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit |       | SFU |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit |         | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit |       | SFU |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit |         | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit |       | SFU |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit |         | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit |       | SFU |  |  |  |
| Core                            | Core               | DP<br>Unit | Core | Core | DP<br>Unit | LDIST   | SFU       | Core                            | Core           | DP<br>Unit         | Core | Core | DP<br>Unit | LDIST | SFU |  |  |  |
| Core                            | Core               | Unit       | Core | Core | DP<br>Unit | LDIST   | SFU       | Core                            | Core           | Unit               | Core | Core | DP<br>Unit | LDIST | SFU |  |  |  |
|                                 | Texture / L1 Cache |            |      |      |            |         |           |                                 |                |                    |      |      |            |       |     |  |  |  |
|                                 | Te                 | ×          |      |      | т          | ex      |           | Tex Tex                         |                |                    |      |      |            |       |     |  |  |  |
| 64KB Shared Memory              |                    |            |      |      |            |         |           |                                 |                |                    |      |      |            |       |     |  |  |  |

#### **NVIDIA**

#### Pascal SM consists of:

- 64 (cc 6.0) or 128 (6.1 and 6.2) CUDA cores for arithmetic operations,
- 16 (cc 6.0) or 32 (6.1 and 6.2) special function units for single-precision floating-point,
- 2 (6.0) or 4 (6.1 and 6.2) warp schedulers.

#### GA100 Streaming Multiprocessor Internals Introduction to CUDA and GPGPU



Ampere SM consists of:

- 64 FP32 cores for single-precision arithmetic operations,
- 32 FP64 cores for double-precision arithmetic operations,
- ▶ 64 INT32 cores for integer math,
- 4 mixed-precision Tensor Cores,
- 16 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.

#### **NVIDIA**

### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

GPU and modern HPC

#### Introduction to CUDA and GPGPU Threads and Processes

CUDA Programming Language Memory Management Synchronization Error reporting Example

Introduction to CUDA and GPGPU

Simplification:

1. Threads are coupled in groups called *warps* 

Introduction to CUDA and GPGPU

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction

Introduction to CUDA and GPGPU

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction
- 3. A warp is build of 32 threads

Introduction to CUDA and GPGPU

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction
- 3. A warp is build of 32 threads
- 4. Warps are gathered in *blocks*

Introduction to CUDA and GPGPU

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction
- 3. A warp is build of 32 threads
- 4. Warps are gathered in *blocks*
- 5. One block is assigned to single SM only

Introduction to CUDA and GPGPU

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction
- 3. A warp is build of 32 threads
- 4. Warps are gathered in *blocks*
- 5. One block is assigned to single SM only
- 6. One SM may execute many blocks

Introduction to CUDA and GPGPU

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction
- 3. A warp is build of 32 threads
- 4. Warps are gathered in *blocks*
- 5. One block is assigned to single SM only
- 6. One SM may execute many blocks

Introduction to CUDA and GPGPU

#### Simplification:

- 1. Threads are coupled in groups called *warps*
- 2. Threads in a warp can only perform the same instruction
- 3. A warp is build of 32 threads
- 4. Warps are gathered in *blocks*
- 5. One block is assigned to single SM only
- 6. One SM may execute many blocks



#### **NVIDIA**

# Kernels – Threads definitions

Introduction to CUDA and GPGPU

- ▶ special C++ function with \_\_global\_\_ declaration
- $\blacktriangleright$  compiler runs N CUDA threads in parallel

Kernels – Threads definitions

Simplification:

- ▶ special C++ function with \_\_global\_\_ declaration
- compiler runs N CUDA threads in parallel

Definition of a kernel:

```
1 __global__ void VecAdd(float* A, float* B, float* C)
2 {
3    int i = threadIdx.x;
4    C[i] = A[i] + B[i];
5 }
```

#### Invocation of a kernel:

```
1 int main()
2 {
3     VecAdd<<<1, N>>>(A, B, C);
4 }
```

# SIMD processing model



# SIMD processing model



# SIMD processing model



#### SISD, MIMD, MISD - Flynn Taxonomy



# SISD, MIMD, MISD - Flynn Taxonomy



# SISD, MIMD, MISD - Flynn Taxonomy



#### Automatic threads scalability

Introduction to CUDA and GPGPU

1. Thread blocks are automatically assigned to SMs.



**NVIDIA** 

#### Automatic threads scalability

- Thread blocks are automatically assigned to SMs.
- 2. Programmers have no control on this process.



#### Automatic threads scalability

Introduction to CUDA and GPGPU

- 1. Thread blocks are automatically assigned to SMs.
- 2. Programmers have no control on this process.
- Subsequent kernel execution may result in different assignment.



**NVIDIA** 

# Heterogeneous programming with host and device Introduction to CUDA and GPGPU



#### **NVIDIA**









Parallel memory copying and kernel execution requires asynchronous (non-blocking) memory copying and execution streams (cuda streams).

#### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

GPU and modern HPC

Introduction to CUDA and GPGPU Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting Example

Introduction to CUDA and GPGPU

► Modified C++ language

- Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)

- ► Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.

- ► Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:

- ► Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- ► Kernel properties:
  - can only access GPU memory or CPU memory with special allocation

- ► Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:
  - can only access GPU memory or CPU memory with special allocation
  - no variable number of arguments

- ► Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:
  - can only access GPU memory or CPU memory with special allocation
  - no variable number of arguments
  - no static variables

- Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:
  - can only access GPU memory or CPU memory with special allocation
  - no variable number of arguments
  - no static variables
  - Iimited recursion

- Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:
  - can only access GPU memory or CPU memory with special allocation
  - no variable number of arguments
  - no static variables
  - Iimited recursion
  - must be void

- Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:
  - can only access GPU memory or CPU memory with special allocation
  - no variable number of arguments
  - no static variables
  - Iimited recursion
  - must be void
- Kernel launches are asynchronous (return to CPU immediately).

- ► Modified C++ language
- A program is build of C++ functions (executed in CPU or GPU)
- Function running in GPU (streaming processor) is called kernel.
- Kernel properties:
  - can only access GPU memory or CPU memory with special allocation
  - no variable number of arguments
  - no static variables
  - Iimited recursion
  - must be void
- Kernel launches are asynchronous (return to CPU immediately).
- ▶ Kernel executes after all previous CUDA calls have completed.



Threads Identification



Each kernel contains local variables defining the execution context:

threadIdx – three dimensional value unique within a block



- threadIdx three dimensional value unique within a block
- blockIdx three dimensional value unique within a grid

- Each kernel contains local variables defining the execution context:
  - threadIdx three dimensional value unique within a block
  - blockIdx three dimensional value unique within a grid
  - blockDim three dimensional value describing a block dimensions

- Each kernel contains local variables defining the execution context:
  - threadIdx three dimensional value unique within a block
  - blockIdx three dimensional value unique within a grid
  - blockDim three dimensional value describing a block dimensions
  - gridDim three dimensional value describing a grid dimensions

Defining Grid and Blocks

Thread block (composed of thread warps) is a group of threads that can:

- Thread block (composed of thread warps) is a group of threads that can:
  - synchronize their execution

- Thread block (composed of thread warps) is a group of threads that can:
  - synchronize their execution
  - communicate via shared memory

- Thread block (composed of thread warps) is a group of threads that can:
  - synchronize their execution
  - communicate via shared memory
- ► Single block is assigned to a single SM for all its lifetime.

- Thread block (composed of thread warps) is a group of threads that can:
  - synchronize their execution
  - communicate via shared memory
- ► Single block is assigned to a single SM for all its lifetime.
- ► Grid = all blocks for given launch

Introduction to CUDA and GPGPU

Kernel launch syntax:

kernel\_name<<<gridDim, blockDim, sharedMem, strId>>>(p1,... pN)

kernel\_name - name of a kernel function with \_\_global\_\_
declaration

Introduction to CUDA and GPGPU

Kernel launch syntax:

kernel\_name<<<gridDim, blockDim, sharedMem, strId>>>(p1,... pN)

- kernel\_name name of a kernel function with \_\_global\_\_
  declaration
- ▶ gridDim dim3 value describing number of blocks in a grid

Introduction to CUDA and GPGPU

Kernel launch syntax:

kernel\_name<<<gridDim, blockDim, sharedMem, strId>>>(p1,... pN)

- kernel\_name name of a kernel function with \_\_global\_\_
  declaration
- ▶ gridDim dim3 value describing number of blocks in a grid
- blockDim dim3 value describing number of threads in each block

Introduction to CUDA and GPGPU

Kernel launch syntax:

kernel\_name<<<gridDim, blockDim, sharedMem, strId>>>(p1,... pN)

- kernel\_name name of a kernel function with \_\_global\_\_
  declaration
- ▶ gridDim dim3 value describing number of blocks in a grid
- blockDim dim3 value describing number of threads in each block
- sharedMem (optional) size of shared memory allocated for each block in bytes

Introduction to CUDA and GPGPU

Kernel launch syntax:

kernel\_name<<<gridDim, blockDim, sharedMem, strId>>>(p1,... pN)

- kernel\_name name of a kernel function with \_\_global\_\_
  declaration
- ▶ gridDim dim3 value describing number of blocks in a grid
- blockDim dim3 value describing number of threads in each block
- sharedMem (optional) size of shared memory allocated for each block in bytes
- strId (optional) identification of a stream for parallel kernel execution (default 0)

Introduction to CUDA and GPGPU

Kernel launch syntax:

kernel\_name<<<gridDim, blockDim, sharedMem, strId>>>(p1,... pN)

- kernel\_name name of a kernel function with \_\_global\_\_
  declaration
- ▶ gridDim dim3 value describing number of blocks in a grid
- blockDim dim3 value describing number of threads in each block
- sharedMem (optional) size of shared memory allocated for each block in bytes
- strId (optional) identification of a stream for parallel kernel execution (default 0)

p1,... pN - kernel parameters
 (automatically copied to a device through the constant memory)

Introduction to CUDA and GPGPU

#### dim3 type:

used for indexing and describing blocks of threads and grids

- can be constructed from one, two and three values
- based on uint[3], default value: (1,1,1)

Introduction to CUDA and GPGPU

#### dim3 type:

- used for indexing and describing blocks of threads and grids
- can be constructed from one, two and three values
- based on uint[3], default value: (1,1,1)
- other built-in vector types:
  - [u]{char,short,int,long}{1..4}, float{1..4}
  - Structures accessed with x, y, z, w fields: uint4 param;
    - int y = param.y;
  - They all come with a constructor, for example: int2 make\_int2(int x, int y);

Introduction to CUDA and GPGPU

#### functions qualifiers:

\_\_global\_\_ launched by CPU on device (must return void)
 \_\_device\_\_ called from other GPU functions (never CPU)
 \_\_host\_\_ can be executed by CPU
 (can be used together with \_\_device\_\_)

# Two dimensional block execution I (one block only)

```
global___void MatAdd(float A[N][N], float B[N][N], float C[N][N])
2 {
      int i = threadIdx.x;
3
      int j = threadIdx.y;
4
      C[i][j] = A[i][j] + B[i][j];
5
6 }
7
8 int main()
9 {
10
      . . .
      // Kernel invocation with one block of N * N * 1 threads
11
12
      int numBlocks = 1;
13
      dim3 threadsPerBlock(N, N);
14
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
15
16
      . . .
17 }
```

#### Two dimensional block execution II

(more blocks require global threads identification)

```
1 __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
  {
2
      int i = blockIdx.x * blockDim.x + threadIdx.x;
3
      int j = blockIdx.y * blockDim.y + threadIdx.y;
4
      if (i < N && j < N)
5
          C[i][j] = A[i][j] + B[i][j];
6
7 }
8
9 int main()
10 {
11
      . . .
      // Kernel invocation with multiple blocks according to the
12
           problem size (please note integer division)
13
      dim3 threadsPerBlock(16, 16);
14
      dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
15
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
16
17
      . . .
18 }
```

#### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

#### GPU and modern HPC

#### Introduction to CUDA and GPGPU

Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting Example

Classical (manual) approach

```
1 int n = 1024;
2 int nbytes = n*sizeof(int);
3 int *d_array = 0;
```

cudaMalloc((void\*\*)&d\_array, nbytes)

Classical (manual) approach

```
1 \text{ int } n = 1024;
```

```
2 int nbytes = n*sizeof(int);
```

```
3 int *d_array = 0;
```

cudaMalloc((void\*\*)&d\_array, nbytes)

cudaMemset(d\_array, 0, nbytes)

Classical (manual) approach

```
1 int n = 1024;
```

```
2 int nbytes = n*sizeof(int);
```

```
3 int *d_array = 0;
```

cudaMalloc((void\*\*)&d\_array, nbytes)

cudaMemset(d\_array, 0, nbytes)

cudaFree(d\_array)

Classical (manual) approach

```
1 int n = 1024;
```

```
2 int nbytes = n*sizeof(int);
```

```
3 int *d_array = 0;
```

cudaMalloc((void\*\*)&d\_array, nbytes)

```
cudaMemset(d_array, 0, nbytes)
```

```
cudaFree(d_array)
```

cudaMemcpy(void \*dst, void \*src, size\_t nBytes, enum cudaMemcpyKind direction)

Classical (manual) approach

```
1 int n = 1024;
```

- 2 int nbytes = n\*sizeof(int);
- 3 int \*d\_array = 0;

cudaMalloc((void\*\*)&d\_array, nbytes)

- cudaMemset(d\_array, 0, nbytes)
- cudaFree(d\_array)
- cudaMemcpy(void \*dst, void \*src, size\_t nBytes, enum cudaMemcpyKind direction)
  - HostToDevice

Classical (manual) approach

```
1 int n = 1024;
```

```
2 int nbytes = n*sizeof(int);
```

```
3 int *d_array = 0;
```

cudaMalloc((void\*\*)&d\_array, nbytes)

cudaMemset(d\_array, 0, nbytes)

```
cudaFree(d_array)
```

cudaMemcpy(void \*dst, void \*src, size\_t nBytes, enum cudaMemcpyKind direction)



DeviceToHost

Classical (manual) approach

```
1 int n = 1024;
```

- 2 int nbytes = n\*sizeof(int);
- 3 int \*d\_array = 0;

cudaMalloc((void\*\*)&d\_array, nbytes)

cudaMemset(d\_array, 0, nbytes)

cudaFree(d\_array)

cudaMemcpy(void \*dst, void \*src, size\_t nBytes, enum cudaMemcpyKind direction)

- HostToDevice
- DeviceToHost
- DeviceToDevice

Classical (manual) approach

```
1 int n = 1024;
```

- 2 int nbytes = n\*sizeof(int);
- 3 int \*d\_array = 0;

cudaMalloc((void\*\*)&d\_array, nbytes)

cudaMemset(d\_array, 0, nbytes)

cudaFree(d\_array)

cudaMemcpy(void \*dst, void \*src, size\_t nBytes, enum cudaMemcpyKind direction)

- HostToDevice
- DeviceToHost
- DeviceToDevice

Classical (manual) approach

```
1 int n = 1024;
```

```
2 int nbytes = n*sizeof(int);
```

```
3 int *d_array = 0;
```

cudaMalloc((void\*\*)&d\_array, nbytes)

```
cudaMemset(d_array, 0, nbytes)
```

```
cudaFree(d_array)
```

cudaMemcpy(void \*dst, void \*src, size\_t nBytes, enum cudaMemcpyKind direction)

- HostToDevice
- DeviceToHost
- DeviceToDevice

CPU blocking version (also assures that kernels have completed).

#### Memory Management

Classical (manual) approach

De-referencing normal CPU pointer on GPU will crash (and vice versa).

Good naming practices

- a\_ device pointers
- $h_-$  host pointers
- $\mathtt{s}\_-\mathsf{shared}$  memory

#### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

GPU and modern HPC

#### Introduction to CUDA and GPGPU

Threads and Processes CUDA Programming Language Memory Management

#### Synchronization

Error reporting Example

Basics

Device side: \_\_syncthreads()

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block
- Host side: cudaDeviceSynchronize()

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block
- Host side: cudaDeviceSynchronize()
  - Blocks the current CPU thread until all GPU calls are finished.

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block
- Host side: cudaDeviceSynchronize()
  - Blocks the current CPU thread until all GPU calls are finished.
  - Including all streams.

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a **block**
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block
- Host side: cudaDeviceSynchronize()
  - Blocks the current CPU thread until all GPU calls are finished.
  - Including all streams.
  - (formerly cudaThreadSynchronize())

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a **block**
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block
- Host side: cudaDeviceSynchronize()
  - Blocks the current CPU thread until all GPU calls are finished.
  - Including all streams.
  - (formerly cudaThreadSynchronize())

Basics

- Device side: \_\_syncthreads()
  - Synchronizes all threads in a block
  - No thread can pass this barrier until all threads in the block reach it
  - Used to avoid conflicts when accessing shared memory
  - Allowed in conditional code only if the conditional is uniform across the entire thread block
- Host side: cudaDeviceSynchronize()
  - Blocks the current CPU thread until all GPU calls are finished.
  - Including all streams.
  - (formerly cudaThreadSynchronize())

#### Note

There are other more advanced device synchronization methods which will be discussed later

#### Device Threads Synchronization

Deprecation Warning

cudaThreadSynchronize() is now deprecated:

"Note that this function is deprecated because its name does not reflect its behavior. Its functionality is similar to the non-deprecated function cudaDeviceSynchronize(), which should be used instead."

NVIDIA. Cuda toolkit documentation. https://docs.nvidia.com/cuda/

#### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

GPU and modern HPC

#### Introduction to CUDA and GPGPU

Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting

Example

Introduction to CUDA and GPGPU

 All CUDA calls return error code: cudaError\_t (Except for kernel launches)

Introduction to CUDA and GPGPU

 All CUDA calls return error code: cudaError\_t (Except for kernel launches)

cudaError\_t cudaGetLastError(void)

- Returns the code for the last error

Introduction to CUDA and GPGPU

 All CUDA calls return error code: cudaError\_t (Except for kernel launches)

- cudaError\_t cudaGetLastError(void)
  - Returns the code for the last error
- char\* cudaGetErrorString(cudaError\_t code)
   Returns a null-terminated character string describing the

error

printf("%s\n", cudaGetErrorString( cudaGetLastError()));

Introduction to CUDA and GPGPU

 All CUDA calls return error code: cudaError\_t (Except for kernel launches)

- cudaError\_t cudaGetLastError(void)
  - Returns the code for the last error
- char\* cudaGetErrorString(cudaError\_t code)
   Returns a null-terminated character string describing the

error

printf("%s\n", cudaGetErrorString( cudaGetLastError()));

Introduction to CUDA and GPGPU

 All CUDA calls return error code: cudaError\_t (Except for kernel launches)

- cudaError\_t cudaGetLastError(void)
  - Returns the code for the last error
- char\* cudaGetErrorString(cudaError\_t code)

   Returns a null-terminated character string describing the error
   printf("%s\n", cudaGetErrorString( cudaGetLastError()));

Check for the error only after a kernel has finished executing – kernel calls are asynchronous.

# CUDA Debugging

```
#ifdef DEBUG
    cudaThreadSynchronize();
2
    cudaError_t error = cudaGetLastError();
3
    if(error != cudaSuccess)
4
    ſ
5
       printf("CUDA error: %s\n", cudaGetErrorString(error));
6
       exit(-1);
7
    ን
8
9 #endif
```

Compile with: \$ nvcc -DDEBUG program.cu

#### Part 1 – Introduction



WARSAW UNIVERSITY OF TECHNOLOGY

Semester Schedule

GPU and modern HPC

#### Introduction to CUDA and GPGPU

Threads and Processes CUDA Programming Language Memory Management Synchronization Error reporting Example

## First kernel - Host code completed

```
Introduction to CUDA and GPGPU
```

```
1 #include<cuda.h>
2
3 int main()
4 {
      cudaSetDevice(cutGetMaxGflopsDeviceId());
5
      int N = 4096:
6
      int numBytes = N*N * sizeof(int);
7
      cudaMalloc((void**)&d_A, numbytes);
8
      cudaMalloc((void**)&d_B, numbytes);
9
      cudaMalloc((void**)&d_C, numbytes);
10
11
      cudaMemcpy(d_A, h_A, numBytes, cudaMemcpyHostToDevice);
12
      cudaMemcpy(d_B, h_B, numBytes, cudaMemcpyHostToDevice);
13
      cudaMemset(d C, 0, numBytes);
14
15
      dim3 threadsPerBlock(16, 16);
16
      dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
17
      MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
18
19
      cudaMemcpy(h_C, d_C, numBytes, cudaMemcpyDeviceToHost);
20
21
      cudaFree(d_A);
22
      cudaFree(d B):
23
      cudaFree(d_C);
24
25 }
```

# Bibliography



#### Multiple Authors. GPU Gems.

https://developer.nvidia.com/gpugems/gpugems/contributors.

NVIDIA. Cuda toolkit documentation. https://docs.nvidia.com/cuda/.

NVIDIA CUDA Toolkit. Cuda c++ best practices guide. https://docs.nvidia.com/cuda/cuda-c-best-practicesguide/index.html, 2020. Materiały sponsorowane przez:

Projekt "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca" współfinansowany jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego

Zadanie 10 pn. "Modyfikacja programów studiów na kierunkach prowadzonych przez Wydział Matematyki i Nauk Informacyjnych", realizowane w ramach projektu "NERW 2 PW. Nauka – Edukacja – Rozwój – Współpraca", współfinansowanego jest ze środków Unii Europejskiej w ramach Europejskiego Funduszu Społecznego



Rzeczpospolita Polska Politechnika Warszawska

Unia Europejska Europejski Fundusz Społeczny

