Host-Device Synchronization

  • As discussed earlier, the host refers to the CPU, and the device refers to the GPU.
  • By default, CUDA launches kernels asynchronously. This means that the CPU does not wait for the GPU to finish and immediately continues executing subsequent instructions.
  • In such cases, the host may read results the GPU has not yet produced, leading to incorrect results.
  • To avoid these issues, we need to learn how to explicitly synchronize between the CPU and GPU.

Overview

  • CUDA kernels are executed asynchronously by default. This implies:
    • The CPU does not wait for the GPU computation to finish.
    • Memory access may occur before the kernel finishes, potentially copying only partial data.
    • Explicit synchronization is required to prevent such issues.
  • Example of an error caused by asynchronous execution (note: a plain cudaMemcpy implicitly waits for prior work in the same stream, so an asynchronous copy is used here):
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost); // Incorrect: the copy is only enqueued; h_C may still be stale when the host reads it
  • Corrected version with synchronization:
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaDeviceSynchronize(); // Wait until GPU computation is done
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
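The corrected pattern can be put into a complete program. A minimal sketch, assuming the usual vector-add kernel (the kernel body, sizes, and initial values are illustrative); cudaDeviceSynchronize() also surfaces asynchronous kernel errors as its return value:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vectorAddKernel(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

int main() {
    const int N = 1 << 20;
    const size_t size = N * sizeof(float);

    float *h_A = (float*)malloc(size), *h_B = (float*)malloc(size), *h_C = (float*)malloc(size);
    for (int i = 0; i < N; ++i) { h_A[i] = 1.0f; h_B[i] = 2.0f; }

    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size); cudaMalloc(&d_B, size); cudaMalloc(&d_C, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    int numThreads = 256;
    int numBlocks = (N + numThreads - 1) / numThreads;
    vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);

    // Block the host until all queued GPU work has finished;
    // this also reports any asynchronous kernel error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("h_C[0] = %f\n", h_C[0]); // expected 3.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}
```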

Understanding Host-Device Synchronization

Problems with Asynchronous Execution

  • The CPU may proceed before the GPU has completed its computation.
    • This can result in incomplete or incorrect data being returned to the host.
  • Kernels launched into different streams may run concurrently.
    • Without explicit ordering, a kernel may start before the data it depends on has been produced. (Kernels in the same stream are serialized automatically.)
  • Memory access may occur before GPU computations are complete.
    • Results may be invalid or corrupted.
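When dependent kernels live in different streams, the missing ordering can be imposed on the device with an event, without blocking the host. A sketch under stated assumptions: kernelA, kernelB, d_buf, grid, and block are illustrative names, and kernelB is assumed to read what kernelA wrote.

```cuda
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

cudaEvent_t aDone;
cudaEventCreate(&aDone);

kernelA<<<grid, block, 0, s1>>>(d_buf);   // runs in stream s1
cudaEventRecord(aDone, s1);               // mark the point after kernelA in s1
cudaStreamWaitEvent(s2, aDone, 0);        // s2 waits on the device, not the host
kernelB<<<grid, block, 0, s2>>>(d_buf);   // now guaranteed to run after kernelA
```

Because cudaStreamWaitEvent() queues the wait on the GPU, the CPU keeps running; this is often preferable to cudaDeviceSynchronize() when only two streams need ordering.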

Solutions

  • Explicit synchronization ensures clear ordering between host and device operations.

Synchronization Methods

| Method | Description | When to Use |
|---|---|---|
| cudaDeviceSynchronize() | Blocks the CPU until all previously launched CUDA work has finished. | When host operations depend on the completion of all CUDA operations. |
| cudaMemcpy() | Synchronous copies block the host and implicitly wait for prior work in the same stream. | When the host needs the data only after it has been computed. |
| cudaEventSynchronize(event) | Blocks the CPU only until the given event has completed on the device. | When only a specific point in the queued work needs to be waited on. |
  • While frequent use of cudaDeviceSynchronize() guarantees correct execution order, overuse unnecessarily stalls CPU execution, leading to performance degradation.
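As a lighter-weight alternative to cudaDeviceSynchronize(), the host can wait on a single event; recording a pair of events also yields kernel timing. A minimal sketch, reusing the vector-add launch from above (d_A, d_B, d_C, N, numBlocks, numThreads are assumed to be set up already):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaEventRecord(stop);

// Block the host only until 'stop' completes, not until all device work finishes.
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed time between the two events
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```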
