Thread Hierarchy
CUDA Thread Hierarchy
- CUDA organizes threads in a hierarchical structure to efficiently manage parallel execution. This hierarchy is divided into:
- Grid
- Block
- Warp
Grids and Blocks
- Grid
- A grid is created whenever a kernel is launched, and it is composed of blocks.
- When launching a kernel, you must specify the number of blocks in the grid and the number of threads per block.
- Grids can be 1D, 2D, or 3D, depending on the computational purpose.
- Block
- A block is a group of threads. Threads inside a block can collaborate through shared memory and can synchronize their execution.
- Like grids, blocks can be 1D, 2D, or 3D depending on the application (see the launch sketch after this section).
- Key Characteristics
- A block is the unit that is scheduled onto a Streaming Multiprocessor (SM).
- Threads inside a block can communicate and synchronize with each other.
- Blocks are independent of one another and cannot directly communicate.
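- As a sketch of how these grid and block configurations are specified at launch time (the kernel name, arguments, and sizes below are placeholders, not fixed by CUDA):

__global__ void myKernel2D(float *data, int width, int height) {
    // kernel body omitted
}

void launch2D(float *d_data, int width, int height) {
    dim3 block(16, 16);                              // 16 x 16 = 256 threads per block (2D block)
    dim3 grid((width  + block.x - 1) / block.x,      // enough blocks to cover every column
              (height + block.y - 1) / block.y);     // enough blocks to cover every row
    myKernel2D<<<grid, block>>>(d_data, width, height);
}

- A 1D launch looks the same with scalar values, e.g. myKernel<<<numBlocks, threadsPerBlock>>>(...).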
Warps
- Warp
- A warp is a group of 32 threads that execute the same instruction simultaneously on an SM.
- Warps are the smallest execution unit in GPU computation.
- Key Characteristics
- Under SIMT execution, all threads in a warp follow the same instruction.
- Excessive branching (e.g., data-dependent conditionals) can cause threads within a warp to diverge; the divergent paths are executed one after another, reducing performance (illustrated in the sketch below).
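- A minimal sketch of the difference (kernel and buffer names are illustrative only): in the first kernel, even and odd threads of the same warp take different branches, so the two paths run serially; in the second, the branch condition is identical for all 32 threads of a warp, so there is no divergence.

__global__ void divergentKernel(float *out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    if (idx % 2 == 0) {              // threads of the same warp disagree here -> divergence
        out[idx] = idx * 2.0f;
    } else {
        out[idx] = idx * 0.5f;
    }
}

__global__ void warpAlignedKernel(float *out, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;
    if ((idx / 32) % 2 == 0) {       // condition is the same for every thread in a warp
        out[idx] = idx * 2.0f;
    } else {
        out[idx] = idx * 0.5f;
    }
}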
Thread Indexing
- Thread indexing is essential for mapping data correctly to each thread.
Global Indexing
- Global indexing assigns a unique index to every thread in the entire grid, allowing data mapping to individual threads.
- Example:
int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
- Variables:
- blockIdx.x = index of the block within the grid
- blockDim.x = number of threads per block
- threadIdx.x = index of the thread within its block
- Why this calculation? Suppose there are 3 blocks, each with 4 threads:
- Block 0 → Threads 0 ~ 3
- Block 1 → Threads 0 ~ 3
- Block 2 → Threads 0 ~ 3
- Using the formula, the global indices become:
| blockIdx.x | threadIdx.x | globalIdx calculation | Result |
|------------|-------------|-----------------------|--------|
| 0          | 0           | 0 * 4 + 0             | 0      |
| 0          | 1           | 0 * 4 + 1             | 1      |
| 0          | 2           | 0 * 4 + 2             | 2      |
| 1          | 0           | 1 * 4 + 0             | 4      |
| 1          | 1           | 1 * 4 + 1             | 5      |
| 2          | 3           | 2 * 4 + 3             | 11     |
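- A small sketch to verify this mapping on a real device (the kernel name is arbitrary; device-side printf requires <cstdio>):

#include <cstdio>

__global__ void printGlobalIdx() {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("blockIdx.x=%d threadIdx.x=%d -> globalIdx=%d\n",
           blockIdx.x, threadIdx.x, globalIdx);
}

int main() {
    printGlobalIdx<<<3, 4>>>();      // 3 blocks x 4 threads, matching the table above
    cudaDeviceSynchronize();         // wait so the device-side output is flushed
    return 0;
}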
Block Indexing
- Blocks can also be indexed to determine their position.
int blockIndex = blockIdx.x;   // note: the name blockIdx itself is reserved for the built-in variable
// For 2D or 3D configurations:
int blockIdx_x = blockIdx.x;
int blockIdx_y = blockIdx.y;
int blockIdx_z = blockIdx.z;
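- For 2D launches, block and thread indices are usually combined into row/column coordinates, as in this sketch (matrix and size names are illustrative):

__global__ void indexKernel2D(float *M, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;        // x direction covers columns
    int row = blockIdx.y * blockDim.y + threadIdx.y;        // y direction covers rows
    if (row < height && col < width) {
        M[row * width + col] = (float)(row * width + col);  // linearize (row, col) into the 1D array
    }
}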
Example
- Each thread can be mapped to different data with code like this:
__global__ void exampleKernel(float *A, int N) {
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;  // unique index across the entire grid
    if (globalIdx < N) {                                    // guard against threads beyond the data range
        A[globalIdx] = globalIdx;                           // each thread writes its own element
    }
}
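- A host-side sketch of how this kernel might be allocated and launched (the host code is not part of the original example; error checking omitted for brevity):

#include <cuda_runtime.h>

int main() {
    const int N = 1000;
    float *d_A;
    cudaMalloc(&d_A, N * sizeof(float));               // device buffer written by the kernel

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    exampleKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, N);

    float h_A[N];
    cudaMemcpy(h_A, d_A, N * sizeof(float), cudaMemcpyDeviceToHost);  // h_A[i] == i afterwards
    cudaFree(d_A);
    return 0;
}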
Specifying Grid and Block Sizes
- Choosing appropriate grid and block sizes is critical for efficient GPU utilization and optimal performance.
Factors to Consider
- Number of Threads
- The total number of threads should be at least the number of data elements so every element is covered; rounding the grid size up usually produces a few extra threads, which are masked off by a bounds check.
- Hardware Specs
- GPUs have limits on threads per block, usually up to 1,024.
- Register count and shared memory size affect how many blocks can run simultaneously.
- Occupancy
- The ratio of active warps on an SM to the maximum number of warps the SM can support; higher occupancy helps hide memory latency (a query sketch follows this list).
- Memory Access Patterns
- Threads within a warp should access consecutive (coalesced) memory addresses so their loads and stores can be combined into fewer memory transactions.
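- These hardware limits can be queried at runtime; a sketch for device 0 (field names are from cudaDeviceProp):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0
    printf("Max threads per block:   %d\n", prop.maxThreadsPerBlock);
    printf("Warp size:               %d\n", prop.warpSize);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}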
Optimal Settings
- Powers of 2
- Computer systems generally favor powers of 2. Block sizes are best set as powers of 2 (e.g., 128, 256, 512).
- Multiples of 32
- Since a warp consists of 32 threads, block sizes should be multiples of 32 for full warp utilization.
- Hardware Constraints
- Block size must not exceed hardware limits.
Example
int threadsPerBlock = 256; // 512, 1024 also possible
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
- This is the most common pattern: the rounding-up in blocksPerGrid guarantees every one of the N elements gets a thread even when N is not a multiple of threadsPerBlock.
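- As an alternative to hand-picking 256, the runtime can suggest an occupancy-friendly block size via cudaOccupancyMaxPotentialBlockSize; a sketch assuming the exampleKernel shown earlier:

#include <cuda_runtime.h>

void launchWithSuggestedSize(float *d_A, int N) {
    int minGridSize = 0;   // minimum grid size needed to reach full occupancy
    int blockSize   = 0;   // suggested threads per block
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, exampleKernel, 0, 0);

    int blocksPerGrid = (N + blockSize - 1) / blockSize;   // still round up to cover all N elements
    exampleKernel<<<blocksPerGrid, blockSize>>>(d_A, N);
}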
Block Synchronization
- As mentioned earlier, threads inside a block can synchronize with one another.
- This is especially important when using shared memory.
__syncthreads()
- This function acts as a barrier: every thread in the block must reach it before any thread is allowed to continue.
- It is commonly used around shared memory operations to avoid race conditions, such as reading a value before another thread has finished writing it.
__global__ void synchronizedKernel(float *A, float *B, float *C, int N) {
    __shared__ float sharedData[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        sharedData[threadIdx.x] = A[idx] * 2.0f;   // each thread fills its own shared-memory slot
    }
    __syncthreads();   // the barrier must be reached by every thread in the block, so it sits outside the guard
    if (idx < N) {
        C[idx] = sharedData[threadIdx.x] + B[idx];
    }
}
- In this example, __syncthreads() ensures all threads have written to shared memory before any thread proceeds to the next operation.
- Note: While other synchronization methods exist, synchronization across blocks is not possible since blocks are independent.
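- A classic pattern where this matters is a block-level sum reduction in shared memory; a sketch (a block size of 256 is assumed): each round of the loop needs __syncthreads() so every partial sum is written before it is read, and because blocks cannot synchronize, each block only produces a partial result that must be combined in a second kernel or on the host.

__global__ void blockSum(const float *in, float *blockResults, int N) {
    __shared__ float partial[256];                 // one slot per thread (blockDim.x == 256 assumed)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    partial[threadIdx.x] = (idx < N) ? in[idx] : 0.0f;
    __syncthreads();                               // all writes to shared memory must finish first

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();                           // barrier sits outside the if, so every thread reaches it
    }

    if (threadIdx.x == 0) {
        blockResults[blockIdx.x] = partial[0];     // one partial sum per block
    }
}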