Kernel Configuration

As mentioned earlier, CUDA performance varies widely depending on hardware capabilities.
When programming CUDA, you should choose the number of threads per block with hardware limits in mind.
You should consider occupancy: high occupancy generally indicates good utilization of the hardware, but excessively high occupancy is not always best.
Register pressure and shared memory limits can throttle performance and actually slow execution.

Example

To compare configurations, we perform vector addition while varying the number of threads per block.

// kernelTuning.cu
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Declaration of the vector addition kernel.
__global__ void vectorAddKernel(const float *A, const float *B, float *C, int N) {
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < N) {
        C[idx] = A[idx] + B[idx];
    }
}

int main() {
    // Vector length.
    int N = 1 << 20;  // 1M elements
    size_t size = N * sizeof(float);

    // Allocate host memory for input vectors and output vector.
    float *h_A = (float*)malloc(size);
    float *h_B = (float*)malloc(size);
    float *h_C = (float*)malloc(size);

    // Initialize host arrays with random values.
    srand(time(NULL));
    for (int i = 0; i < N; i++) {
        h_A[i] = (float)(rand() % 100) / 10.0f;
        h_B[i] = (float)(rand() % 100) / 10.0f;
    }

    // Allocate device memory for vectors.
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, size);
    cudaMalloc((void**)&d_B, size);
    cudaMalloc((void**)&d_C, size);

    // Copy input data from host to device.
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Define a range of block sizes to test (e.g., multiples of warp size 32).
    int blockSizes[] = {32, 64, 128, 256, 512};
    int numBlockSizes = sizeof(blockSizes) / sizeof(blockSizes[0]);
    float timeTaken;

    // Create CUDA events for timing.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Loop through the different block sizes.
    for (int i = 0; i < numBlockSizes; i++) {
        int threadsPerBlock = blockSizes[i];
        int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

        // Record the start event.
        cudaEventRecord(start);

        // Launch the kernel with the current configuration.
        vectorAddKernel<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

        // Record the stop event.
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        // Calculate the elapsed time.
        cudaEventElapsedTime(&timeTaken, start, stop);
        printf("Block Size: %d, Blocks Per Grid: %d, Time Taken: %f ms\n", threadsPerBlock, blocksPerGrid, timeTaken);
    }

    // Optionally, copy the result back to host for correctness verification.
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory.
    free(h_A);
    free(h_B);
    free(h_C);

    // Destroy CUDA events.
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return 0;
}

Sample output:

Block Size: 32, Blocks Per Grid: 32768, Time Taken: 165.995926 ms
Block Size: 64, Blocks Per Grid: 16384, Time Taken: 208.894424 ms
Block Size: 128, Blocks Per Grid: 8192, Time Taken: 289.929138 ms
Block Size: 256, Blocks Per Grid: 4096, Time Taken: 294.194855 ms
Block Size: 512, Blocks Per Grid: 2048, Time Taken: 207.752121 ms

In this experiment, a block size of 32 (one warp per block) yields the fastest runtime.
However, the kernel is extremely simple and occupancy is quite low, so overall GPU resources are likely underutilized.