Introduction to CUDA

GPU vs. CPU

  • The GPU is a processor specialized for parallel processing.
  • A GPU has far more cores than a CPU, but each individual GPU core is much weaker than a CPU core.
  • However, by running many threads across those cores in parallel, the GPU dramatically accelerates simple, repetitive computations.
  • In contrast, the CPU excels in single-core performance and handles complex, branch-heavy computations efficiently with fewer cores.

Concepts of CUDA

Model Structure

  • Thread: A thread is the smallest unit of computation in parallel processing. Every thread executes the same instruction but operates on different data, an approach called SIMT (Single Instruction, Multiple Threads).
  • Block: A block is a group of threads that share a fast on-chip memory (shared memory) and can synchronize with one another.
  • Grid: A grid is the collection of blocks launched for a single kernel. Together, a thread's block and thread indices let it locate its own data, as the sketch below shows.
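As an illustration, here is a minimal sketch of how a thread derives a unique global index from its block and thread identifiers (the kernel name scale and its parameters are assumptions chosen for illustration):

    // Each thread handles one element: same instruction, different data (SIMT).
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the grid
        if (i < n)                                      // guard: the grid may have spare threads
            data[i] *= factor;
    }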

Memory Structure

  • Global Memory: The largest memory space, accessible by all threads, but relatively slow to access.
  • Shared Memory: Fast on-chip memory allocated per block, accessible by all threads within that block (see the sketch after this list).
  • Local Memory: Memory private to a single thread.
  • Constant & Texture Memory: Read-only memory spaces that are cached on-chip, used for constants and texture data.
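To show why shared memory matters, here is a sketch of a per-block sum: each block stages its slice of global memory into the fast shared buffer, reduces it there, and writes one partial result back. It assumes a launch with exactly 256 threads per block and an input length that is a multiple of 256; the names sumBlock, in, and out are illustrative.

    __global__ void sumBlock(const float *in, float *out)
    {
        __shared__ float buf[256];                     // shared memory: visible to the whole block
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];  // stage global memory into shared memory
        __syncthreads();                               // wait until every thread has written

        // Tree reduction inside the fast shared buffer.
        for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
            if (tid < stride)
                buf[tid] += buf[tid + stride];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = buf[0];                  // one partial sum per block
    }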

Kernel Functions

  • Every function that runs on the GPU as a kernel must be declared with the __global__ qualifier.
  • When a kernel is launched, the execution configuration syntax <<<blocks, threads>>> specifies how many blocks and how many threads per block to run; a complete example follows below.
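Putting the pieces together, here is a minimal end-to-end sketch: a __global__ vector-addition kernel launched with the <<< >>> execution configuration. The kernel name vecAdd, the sizes, and the use of cudaMallocManaged (unified memory) are illustrative choices, not the only way to do this.

    #include <cuda_runtime.h>
    #include <stdio.h>

    // __global__ marks a kernel: called from the host, executed on the GPU.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);             // unified memory, visible to host and device
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;                        // threads per block
        int blocks = (n + threads - 1) / threads; // blocks per grid, rounded up
        vecAdd<<<blocks, threads>>>(a, b, c, n);  // execution configuration: <<<grid, block>>>
        cudaDeviceSynchronize();                  // wait for the GPU to finish

        printf("c[0] = %f\n", c[0]);              // expect 3.000000
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }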
