CPU vs. GPU Architecture
Basics of CPU Architecture
Core Structure
- CPUs have relatively few cores (mainstream parts typically range from 2 to 64), and each core delivers high single-thread performance.
- CPUs also have their own cache hierarchy, which is as follows:
- L1 Cache: The smallest level (16–64 KB per core, often split into separate instruction and data caches), private to each core and the fastest to access.
- L2 Cache: Larger than L1 (typically a few hundred KB to a few MB per core) but slower. It acts as a bridge between L1 and L3.
- L3 Cache: Shared by all cores, with capacity in the megabytes to tens of megabytes. It is slower than L1 and L2 but still far faster than main memory.
- CPUs have the following features:
- Branch Prediction: Because CPU cores execute instructions in a deep pipeline, a conditional branch would normally stall the pipeline until its outcome is known. To avoid this, the CPU predicts the outcome and speculatively executes down the predicted path; if the prediction is correct, no time is lost, and if not, the speculative work is discarded and execution restarts on the correct branch (see the sketch after this list).
- Out-of-Order Execution: If instructions were processed strictly in program order, one long-latency instruction (such as a cache miss) would stall everything behind it. To avoid this, the CPU executes later, independent instructions ahead of time and commits their results in program order.
- Instruction Pipeline: The CPU splits instruction execution into stages (fetch, decode, execute, and so on) and overlaps different instructions in different stages, much like an assembly line works on several items at once.
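To make branch prediction concrete, the classic experiment below counts elements above a threshold: with sorted input the branch outcome almost never changes, so the predictor is nearly always right; with random input it mispredicts often and the pipeline is repeatedly flushed. A minimal, hypothetical C++ sketch (function and variable names are illustrative):
#include <algorithm>
#include <cstdint>
#include <vector>

// The branch inside the loop is the interesting part: its cost depends on how
// predictable the data makes it, not on the arithmetic itself.
std::int64_t countAboveThreshold(const std::vector<int>& data, int threshold) {
    std::int64_t count = 0;
    for (int v : data) {
        if (v > threshold) {   // easy to predict if data is sorted, hard if it is random
            ++count;
        }
    }
    return count;
}

// Sorting the input first (std::sort(data.begin(), data.end())) typically makes
// this loop noticeably faster on large random arrays, purely because the branch
// becomes predictable.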
Strengths of CPUs
- Low Latency: Each core has high performance, allowing for very fast instruction processing.
- Complex Control Flow: Thanks to branch prediction and out-of-order execution, CPUs can efficiently handle branching and complex instruction sequences.
Basics of GPU Architecture
Core Structure
- Streaming Multiprocessors (SMs): The building blocks of the GPU. Each SM groups many simple cores together with its own schedulers, register file, and shared memory; a GPU contains many SMs, giving it hundreds to thousands of cores in total.
- CUDA Cores: The fundamental computational units of the GPU, optimized for simple, repetitive operations.
- GPUs also feature a diverse memory hierarchy:
- Global Memory: Shared by all threads, large in size but relatively slow.
- Shared Memory: Shared at the block level, located directly on the GPU chip. It is smaller than global memory but faster.
- Registers: The smallest memory units, private to each thread. Extremely fast but very limited in size.
- Constant & Texture Memory: Specialized read-only memory types for particular workloads, optimized for caching.
- Warp Scheduling: Instead of scheduling individual threads, GPUs issue instructions to groups of 32 threads called warps; the 32 threads of a warp execute the same instruction in lockstep (a sketch of the memory spaces and warp indexing follows this list).
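To tie the pieces above together, here is a minimal CUDA sketch of where data can live and how a thread finds its warp (the kernel name, the 256-thread block-size limit, and the constant scale are illustrative assumptions; scale would be filled in from the host with cudaMemcpyToSymbol):
__constant__ float scale;                      // constant memory: read-only on the device, cached

__global__ void memorySpacesExample(const float* in, float* out, int n) {
    __shared__ float tile[256];                // shared memory: one copy per block, on-chip
                                               // (assumes the kernel is launched with <= 256 threads per block)
    int idx  = blockIdx.x * blockDim.x + threadIdx.x;   // idx lives in a per-thread register
    int lane = threadIdx.x % warpSize;         // position inside the 32-thread warp
    int warp = threadIdx.x / warpSize;         // which warp of the block this thread belongs to
    (void)lane; (void)warp;                    // computed only to illustrate the warp layout

    if (idx < n) {
        tile[threadIdx.x] = in[idx];           // global memory (large, slow) -> shared memory (small, fast)
        out[idx] = tile[threadIdx.x] * scale;  // each thread reads back its own element, so no barrier is needed here
    }
}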
Strengths of GPUs
- High Throughput: Unlike CPUs, GPUs keep thousands of threads in flight at once, so aggregate throughput is high even though each individual thread runs more slowly.
- Data Parallelism: GPUs excel at workloads that can be split into identical operations across large datasets.
- Memory Bandwidth: GPUs feature much higher memory bandwidth than CPUs, enabling them to handle large volumes of data in parallel.
SIMD vs. SIMT
SIMD (Single Instruction Multiple Data)
- SIMD is the data-parallel execution model used by CPU vector units (e.g., SSE/AVX on x86): one instruction operates on several data elements packed into a wide register.
- Example:
// Add 4 float elements at once with one SSE instruction (conceptually C[0:3] = A[0:3] + B[0:3]);
// assumes float *A, *B, *C and #include <immintrin.h>
__m128 sum = _mm_add_ps(_mm_loadu_ps(A), _mm_loadu_ps(B));
_mm_storeu_ps(C, sum);
- SIMD enables fast operations on multiple data elements, but it has limitations:
- All lanes must execute the same instruction; there is no per-element control flow.
- Branching therefore reduces SIMD efficiency: both sides of a condition are typically computed for every lane and then merged with a mask (see the sketch below).
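For example, an element-wise C[i] = (A[i] > B[i]) ? A[i] : B[i] cannot branch per element in SIMD; instead every lane computes a comparison mask and the two candidates are blended. A hedged sketch using SSE4.1 intrinsics (the intrinsics are standard; the wrapper function is illustrative):
#include <immintrin.h>

// "Branching" in SIMD: compute a per-lane mask, then select between the two
// candidate results. Both sides are effectively evaluated for all four lanes.
void maxPerElement4(const float* A, const float* B, float* C) {
    __m128 a    = _mm_loadu_ps(A);             // load 4 floats from A
    __m128 b    = _mm_loadu_ps(B);             // load 4 floats from B
    __m128 mask = _mm_cmpgt_ps(a, b);          // per-lane mask: A[i] > B[i]
    __m128 r    = _mm_blendv_ps(b, a, mask);   // take A where the mask is set, otherwise B (SSE4.1)
    _mm_storeu_ps(C, r);                       // store the 4 selected results
}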
SIMT (Single Instruction Multiple Threads)
- SIMT is the parallel execution model used in GPUs.
- It executes the same instruction across multiple threads simultaneously.
- Example (a host-side launch for this kernel is sketched at the end of this subsection):
__global__ void add(float* A, float* B, float* C) {
    int idx = threadIdx.x;      // one thread per element (indexes within a single block)
    C[idx] = A[idx] + B[idx];
}
- Unlike SIMD, SIMT allows branching: each thread can follow its own path.
- However, there are still limitations:
- Excessive branching causes divergence, reducing performance.
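For completeness, a minimal host-side launch of the add kernel above might look like the sketch below (the array size is an illustrative assumption and error checking is omitted; because the kernel indexes with threadIdx.x only, a single block covers the whole array):
#include <cuda_runtime.h>

int main() {
    const int N = 256;                         // small enough for one block of N threads
    const size_t bytes = N * sizeof(float);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);                    // allocate global memory on the device
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    // ... copy input data into dA and dB with cudaMemcpy ...

    add<<<1, N>>>(dA, dB, dC);                 // 1 block, N threads, one element per thread
    cudaDeviceSynchronize();                   // wait for the kernel to finish

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}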
CUDA Optimization Techniques
Coalesced Access
- GPUs access global memory at the warp level. When the 32 threads of a warp touch consecutive addresses, the hardware coalesces their loads and stores into a few wide transactions, so memory bandwidth is fully utilized and access latency is minimized.
__global__ void coalescedAccess(float *A, float *B, float *C, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}
- In this example, neighboring threads of a warp access neighboring elements, so each warp's accesses are coalesced, bandwidth is maximized, and performance improves. For contrast, a strided pattern that cannot be coalesced is sketched below.
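A hypothetical strided kernel like the following makes the 32 threads of a warp touch widely separated addresses, so the hardware cannot merge them into a few wide transactions and effective bandwidth drops:
// Non-coalesced access: neighboring threads read elements `stride` apart,
// so each warp's 32 accesses land in many different memory segments.
__global__ void stridedAccess(float *A, float *B, float *C, int N, int stride) {
    int idx = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}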
Shared Memory Utilization
- Frequently accessed data should be stored in shared memory rather than repeatedly fetched from global memory.
__global__ void sharedMemoryExample(float *A, float *B, float *C, int N) {
    extern __shared__ float sharedData[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < N) {
        sharedData[threadIdx.x] = A[idx];    // stage the block's slice of A in shared memory
    }
    __syncthreads();                         // every thread in the block must reach this barrier

    if (idx < N) {
        C[idx] = sharedData[threadIdx.x] + B[idx];
    }
}
- Use the __shared__ keyword to declare shared memory.
- The extern keyword indicates that the array size is specified dynamically at kernel launch:
sharedMemoryExample<<<blocks, threadsPerBlock, sharedMemSize>>>(...);
- The line sharedData[threadIdx.x] = A[idx]; has each thread store its own element. Since shared memory is allocated per block, indexing with threadIdx.x gives each block its own shared memory region.
- __syncthreads(); forces all threads in the block to synchronize before any of them read the shared array, preventing reads of data that has not been written yet. It must be reached by every thread of the block, which is why it sits outside the if in the kernel above (a launch sketch showing how sharedMemSize can be computed follows this list).
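As a concrete example of the dynamic launch, the third launch parameter can be derived from the block size (dA, dB, dC, and N are assumed to come from the usual device-side setup):
int threadsPerBlock = 256;
int blocks = (N + threadsPerBlock - 1) / threadsPerBlock;          // enough blocks to cover all N elements
size_t sharedMemSize = threadsPerBlock * sizeof(float);            // one float of shared memory per thread in the block
sharedMemoryExample<<<blocks, threadsPerBlock, sharedMemSize>>>(dA, dB, dC, N);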
Occupancy
- Occupancy is the ratio of warps actually resident on an SM to the maximum number of warps the SM can hold.
- Since GPUs swap between warps when one stalls (e.g., on a memory access), having enough active warps prevents idle time.
- Increasing threads per block can raise occupancy, but if the block's shared memory or register usage exceeds what the SM can provide, fewer blocks fit per SM and occupancy drops instead (a sketch using CUDA's occupancy API follows this list).
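The CUDA runtime provides an occupancy API that suggests a block size for a given kernel; here is a hedged sketch reusing the coalescedAccess kernel from above (the wrapper function is illustrative):
#include <cuda_runtime.h>

void launchWithSuggestedBlockSize(float* dA, float* dB, float* dC, int N) {
    int minGridSize = 0;    // minimum grid size needed to reach full occupancy
    int blockSize   = 0;    // block size suggested by the occupancy calculator
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, coalescedAccess, 0, 0);

    int blocks = (N + blockSize - 1) / blockSize;                   // cover all N elements
    coalescedAccess<<<blocks, blockSize>>>(dA, dB, dC, N);
}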
Thread Divergence
- While SIMT allows branching, excessive branching within a warp reduces performance.
- It is often better to use a conditional assignment instead of if/else, since it can compile to a predicated select with no divergent branch (for branches that cannot be removed, a warp-aligned alternative is sketched after the snippet below):
float result = condition ? valueA : valueB;
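When a branch cannot be removed, aligning it to warp boundaries at least keeps every warp on a single path; a hedged sketch (the kernel and the transformation are purely illustrative):
// Branching on (threadIdx.x % 2) would split every warp in half and serialize
// the two halves. Branching on the warp index keeps all 32 threads of a warp
// on the same path, so the if/else below causes no divergence.
__global__ void warpAlignedBranch(float *data, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    if ((threadIdx.x / warpSize) % 2 == 0) {   // whole warps take the same branch
        data[idx] *= 2.0f;
    } else {
        data[idx] += 1.0f;
    }
}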
Balancing Workload
- Threads and blocks should be assigned evenly sized tasks.
- If some threads finish much earlier than others, their execution resources sit idle while the rest catch up, reducing GPU utilization; a grid-stride loop (sketched below) is a common way to spread work evenly.
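A common way to keep work evenly spread regardless of how N relates to the launch configuration is a grid-stride loop, sketched below (the kernel name is illustrative):
// Each thread handles elements idx, idx + gridSize, idx + 2*gridSize, ...,
// so every thread gets roughly the same number of elements even when N is
// much larger than the total number of launched threads.
__global__ void addGridStride(const float *A, const float *B, float *C, int N) {
    int gridSize = blockDim.x * gridDim.x;
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < N; idx += gridSize) {
        C[idx] = A[idx] + B[idx];
    }
}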