Host-Device Synchronization
- As discussed earlier, the host refers to the CPU, and the device refers to the GPU.
- By default, CUDA launches kernels asynchronously. This means that the CPU does not wait for the GPU to finish and immediately continues executing subsequent instructions.
- If host code then uses results that the GPU has not finished producing, it can read stale or partial data and produce incorrect results.
- To avoid these issues, we need to learn how to explicitly synchronize between the CPU and GPU.
Overview
- CUDA kernels are executed asynchronously by default. This implies:
  - The CPU does not wait for the GPU computation to finish.
  - Host code that reads results through a non-blocking path (for example, `cudaMemcpyAsync` into pinned memory, or mapped/managed memory) may run before the kernel finishes and see only partial data.
  - Explicit synchronization is required to prevent such issues.
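- A blocking `cudaMemcpy` issued on the same stream as the kernel already waits for the kernel, so the error appears when the host reads results through a non-blocking path. Example of an error caused by asynchronous execution (h_C is assumed to be pinned host memory, allocated with `cudaMallocHost`):
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost); // May return before the copy has completed
auto first = h_C[0]; // Incorrect: the kernel and the copy may not be done
- Corrected version with synchronization:
vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
cudaMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize(); // Wait until the kernel and the copy are done
auto first = h_C[0]; // Safe: h_C now holds the full result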
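As a fuller illustration of synchronizing before the host uses the results, here is a minimal sketch of a complete vector-add program. The kernel body, the `CUDA_CHECK` macro, and the `countMismatches` helper are assumptions made for this example and do not come from the text above. Calling `cudaDeviceSynchronize()` before the copy also gives a single place where errors raised during kernel execution are reported.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical error-checking helper (an assumption for this sketch).
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Assumed kernel body: element-wise C = A + B.
__global__ void vectorAddKernel(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) C[i] = A[i] + B[i];
}

// Runs the vector add and returns how many elements differ from A[i] + B[i].
int countMismatches(int N) {
    size_t size = static_cast<size_t>(N) * sizeof(float);
    std::vector<float> h_A(N), h_B(N), h_C(N, 0.0f);
    for (int i = 0; i < N; ++i) { h_A[i] = float(i); h_B[i] = 2.0f * float(i); }

    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    CUDA_CHECK(cudaMalloc(&d_A, size));
    CUDA_CHECK(cudaMalloc(&d_B, size));
    CUDA_CHECK(cudaMalloc(&d_C, size));
    CUDA_CHECK(cudaMemcpy(d_A, h_A.data(), size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_B, h_B.data(), size, cudaMemcpyHostToDevice));

    int numThreads = 256;
    int numBlocks = (N + numThreads - 1) / numThreads;
    vectorAddKernel<<<numBlocks, numThreads>>>(d_A, d_B, d_C, N);
    CUDA_CHECK(cudaGetLastError());       // reports invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // wait for the kernel; reports execution errors
    CUDA_CHECK(cudaMemcpy(h_C.data(), d_C, size, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_C[i] != h_A[i] + h_B[i]) ++mismatches;

    CUDA_CHECK(cudaFree(d_A));
    CUDA_CHECK(cudaFree(d_B));
    CUDA_CHECK(cudaFree(d_C));
    return mismatches;
}

int main() {
    std::printf("mismatches: %d\n", countMismatches(1 << 20));
    return 0;
}
```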
Understanding Host-Device Synchronization
Problems with Asynchronous Execution
- The CPU may proceed before the GPU has completed its computation.
  - This can result in incomplete or incorrect data being returned to the host.
- Kernels launched into different streams may execute concurrently.
  - Kernels in the same stream run in launch order, but without explicit ordering between streams, dependent kernels may overlap or run in the wrong sequence.
- Memory access may occur before GPU computations are complete.
  - Results may be invalid or corrupted.
Solutions
- Explicit synchronization ensures clear ordering between host and device operations.
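One way to impose explicit ordering between kernels in different streams is to record an event in the producing stream and make the consuming stream wait on it. The sketch below is an illustration under stated assumptions: the kernels `fillKernel` and `addOneKernel`, the stream and event names, and the helper `countWrongAfterOrderedStreams` are all hypothetical, and error checking is omitted for brevity. It uses `cudaStreamWaitEvent`, which orders work on the device without blocking the CPU.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical producer kernel: writes `value` into every element.
__global__ void fillKernel(int* data, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

// Hypothetical consumer kernel: depends on fillKernel having finished.
__global__ void addOneKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Returns the number of elements that do not hold the expected value 42.
int countWrongAfterOrderedStreams(int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));

    // Non-blocking streams do not synchronize implicitly with the default stream.
    cudaStream_t producer, consumer;
    cudaStreamCreateWithFlags(&producer, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&consumer, cudaStreamNonBlocking);
    cudaEvent_t filled;
    cudaEventCreate(&filled);

    fillKernel<<<blocks, threads, 0, producer>>>(d_data, n, 41);
    cudaEventRecord(filled, producer);         // marks the end of fillKernel
    cudaStreamWaitEvent(consumer, filled, 0);  // consumer work waits for that mark
    addOneKernel<<<blocks, threads, 0, consumer>>>(d_data, n);

    cudaStreamSynchronize(consumer);           // host waits for the final kernel

    std::vector<int> h_data(n);
    cudaMemcpy(h_data.data(), d_data, n * sizeof(int), cudaMemcpyDeviceToHost);

    int wrong = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != 42) ++wrong;

    cudaEventDestroy(filled);
    cudaStreamDestroy(producer);
    cudaStreamDestroy(consumer);
    cudaFree(d_data);
    return wrong;
}

int main() {
    std::printf("wrong elements: %d\n", countWrongAfterOrderedStreams(1 << 20));
    return 0;
}
```

Without the `cudaStreamWaitEvent` call, nothing would prevent `addOneKernel` from starting before `fillKernel` has written its values.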
Synchronization Methods
| Method | Description | When to Use |
|---|---|---|
| `cudaDeviceSynchronize()` | Blocks the CPU until all previously launched CUDA operations are finished. | When host operations depend on the completion of all CUDA operations. |
| `cudaMemcpy()` | Blocking memory transfers synchronize implicitly: the copy is ordered after earlier work in the same stream. This does not apply to `cudaMemcpyAsync()`. | When memory access is required only after computations are finished. |
| `cudaEventSynchronize(event)` | Blocks the CPU until the specified event, and the work recorded before it, has completed. | When only a specific operation among multiple operations requires synchronization. |
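
To make the `cudaEventSynchronize(event)` row concrete, the following sketch waits only for the first result while later work stays queued on the GPU. The kernel `scaleKernel`, the helper `firstResultAfterEvent`, and all buffer, stream, and event names are assumptions made for illustration. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiplies every element by `factor`.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the first element of the first result, read as soon as it is ready.
float firstResultAfterEvent(int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float* h_first = nullptr;               // pinned, so async copies can overlap
    cudaMallocHost(&h_first, bytes);
    for (int i = 0; i < n; ++i) h_first[i] = 1.0f;

    float *d_first = nullptr, *d_second = nullptr;
    cudaMalloc(&d_first, bytes);
    cudaMalloc(&d_second, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t firstReady;
    cudaEventCreate(&firstReady);

    // First operation: upload, scale by 3, download.
    cudaMemcpyAsync(d_first, h_first, bytes, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<blocks, threads, 0, stream>>>(d_first, n, 3.0f);
    cudaMemcpyAsync(h_first, d_first, bytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(firstReady, stream);    // marks the end of the first operation

    // Second operation: queued after the event; the host does not wait for it yet.
    cudaMemsetAsync(d_second, 0, bytes, stream);
    scaleKernel<<<blocks, threads, 0, stream>>>(d_second, n, 2.0f);

    cudaEventSynchronize(firstReady);       // wait only for work recorded before the event
    float value = h_first[0];               // safe: the first download has completed

    cudaStreamSynchronize(stream);          // finish remaining work before cleanup
    cudaEventDestroy(firstReady);
    cudaStreamDestroy(stream);
    cudaFree(d_first);
    cudaFree(d_second);
    cudaFreeHost(h_first);
    return value;
}

int main() {
    std::printf("first result: %f\n", firstResultAfterEvent(1 << 20));
    return 0;
}
```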
- While frequent use of `cudaDeviceSynchronize()` guarantees correct execution order, overuse unnecessarily stalls CPU execution, leading to performance degradation.
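A common way to avoid this overhead is to rely on in-stream ordering and synchronize once, at the point where the host actually needs the result. The sketch below assumes a hypothetical `incrementKernel` and helper `runIncrements`. All launches go to the default stream, so they run in launch order without any synchronization inside the loop. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 1 to every element.
__global__ void incrementKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Launches the kernel `steps` times and returns the first element afterwards.
int runIncrements(int n, int steps) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    int* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMemset(d_data, 0, n * sizeof(int));

    for (int s = 0; s < steps; ++s) {
        // No cudaDeviceSynchronize() here: launches in one stream are already ordered.
        incrementKernel<<<blocks, threads>>>(d_data, n);
    }

    cudaDeviceSynchronize();  // a single wait, just before the host needs the data

    int first = -1;
    cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return first;
}

int main() {
    std::printf("value after 100 increments: %d\n", runIncrements(1 << 20, 100));
    return 0;
}
```

Placing `cudaDeviceSynchronize()` inside the loop would give the same result, but the CPU would idle after every launch instead of queuing the next one.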