Let's have a fun pop quiz as if you are still in the class. You write this:
cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);
my_kernel<<<grid, block>>>(d_input, d_output);
cudaMemcpy(h_output, d_output, size, cudaMemcpyDeviceToHost);

Quick, how much of this overlaps? The copy up, the kernel, the copy back: surely the GPU is smart enough to pipeline these, right? It's 2026. We have tensor cores. We have NVLink. We landed hardware-accelerated ray tracing. Surely a memcpy doesn't just stop everything.
It does. Every single cudaMemcpy in that snippet is a full, blocking, host-synchronous operation. The CPU thread issues the copy, then sits there and waits for it to finish before moving to the next line. Your "parallel" code is executing as three sequential steps with zero overlap. The timeline looks like a sad, single-file line at the DMV.
And the worst part? Nobody tells you. There's no warning. No performance hint. The function name has no Sync suffix; it just is synchronous, and you're expected to know that.
The Expectation vs. What Actually Happens
When you first learn CUDA, the mental model goes something like this: the GPU is an asynchronous coprocessor. You fire work at it, and it runs independently while the CPU does other things. Kernel launches are asynchronous, my_kernel<<<>>>() returns immediately, and the GPU chews on it in the background. This is true, and it's beautiful, and it builds the reasonable expectation that other GPU operations work the same way.
They don't. cudaMemcpy breaks the model. It issues a DMA transfer and then blocks the host thread until every last byte has landed. That means:
- CPU → GPU copy: blocks until the transfer completes.
- Kernel launch: finally dispatched, runs on the GPU, returns to the CPU immediately (async, the one thing that works as expected).
- GPU → CPU copy: blocks until the kernel finishes AND the transfer completes.
Steps 1 and 3 are synchronization barriers. Your CPU can't enqueue anything else. Your GPU can't start the kernel until the CPU gets past step 1. Nothing overlaps. You've built a three-stage pipeline with two brick walls in it.
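You can see the blocking directly from the host side with a quick timing sketch. This is illustrative, not a benchmark: the buffer names and sizes are mine, and the exact numbers depend on your PCIe link.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    const size_t size = 256u * 1024 * 1024;  // 256 MiB: big enough to time
    char *h_buf, *d_buf;
    cudaMallocHost(&h_buf, size);            // pinned, so the async path is a fair comparison
    cudaMalloc(&d_buf, size);

    auto t0 = std::chrono::steady_clock::now();
    cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);  // host blocks here
    auto t1 = std::chrono::steady_clock::now();

    cudaStream_t s;
    cudaStreamCreate(&s);
    auto t2 = std::chrono::steady_clock::now();
    cudaMemcpyAsync(d_buf, h_buf, size, cudaMemcpyHostToDevice, s);  // returns at once
    auto t3 = std::chrono::steady_clock::now();
    cudaStreamSynchronize(s);                // the wait happens here, where you choose

    printf("blocking copy: %.2f ms, async enqueue: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count(),
           std::chrono::duration<double, std::milli>(t3 - t2).count());
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(s);
    return 0;
}
```

The first interval is the full transfer time; the second is just the cost of putting the copy in a queue.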
For small transfers, this doesn't matter. For anything real, like streaming data, inference serving, or multi-stage processing, it's a silent throughput killer. You're leaving entire milliseconds of potential overlap on the table, and on a GPU where kernels might run in microseconds, that's orders of magnitude of waste.
Why This Is So Disorienting
The API design is what gets you. Consider the function signatures:
cudaMemcpy(dst, src, count, kind); // synchronous — blocks
cudaMemcpyAsync(dst, src, count, kind, stream); // asynchronous — doesn't block

The default is synchronous. The async version is the opt-in variant with an extra parameter. In an API that was built for parallel hardware, the copy function that blocks all parallelism is the one with the shorter name, fewer arguments, and more prominent documentation placement.
This would be like if std::async in C++ blocked by default, and you needed std::async_actually_async_this_time to get concurrency. Every language design instinct says the common-case, short-name function should do the expected thing on the platform it was designed for. In CUDA, the expected thing is parallelism. cudaMemcpy does the opposite.
To make it worse, transitioning to cudaMemcpyAsync isn't just "add Async and a stream." The host memory must be pinned (page-locked) via cudaMallocHost or cudaHostAlloc. Regular malloc'd memory can't do true async DMA because the OS might page it out mid-transfer. So the "fix" requires you to also change your memory allocation strategy, which ripples through your entire host-side codebase. One synchronous call, and now you're refactoring your allocator.
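The allocation change itself is small; it's the ripple effect that hurts. A sketch of the before and after (error checking omitted; `N` is a stand-in for your problem size):

```cuda
size_t bytes = N * sizeof(float);
float *h_data = nullptr;

// Before: pageable memory. cudaMemcpyAsync on this silently degrades to a
// staged, effectively synchronous copy.
//   h_data = (float *)malloc(bytes);

// After: pinned (page-locked) memory the DMA engine can access directly.
cudaMallocHost(&h_data, bytes);  // or cudaHostAlloc(&h_data, bytes, cudaHostAllocDefault)

// ... fill and use h_data as before ...

cudaFreeHost(h_data);            // not free(): mismatched allocators corrupt the heap
```

Every allocation site that feeds a transfer has to make this switch, which is why the refactor spreads further than the one line you thought you were fixing.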
The Pipeline You Were Supposed to Build
Here's what the actual overlap pattern looks like when you do it right:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
// Chunk 1 on stream1
cudaMemcpyAsync(d_in1, h_in1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream1>>>(d_in1, d_out1);
cudaMemcpyAsync(h_out1, d_out1, size, cudaMemcpyDeviceToHost, stream1);
// Chunk 2 on stream2 — overlaps with chunk 1!
cudaMemcpyAsync(d_in2, h_in2, size, cudaMemcpyHostToDevice, stream2);
kernel<<<grid, block, 0, stream2>>>(d_in2, d_out2);
cudaMemcpyAsync(h_out2, d_out2, size, cudaMemcpyDeviceToHost, stream2);

Now the GPU's copy engines and compute engines can run simultaneously. While stream1's kernel executes, stream2's upload can happen in parallel on the DMA engine. While stream2's kernel runs, stream1's download overlaps. The timeline goes from three sequential blocks to a properly pipelined waterfall. This is what GPU programming is supposed to feel like.
But you had to: create explicit streams, switch to async copies, pin all your host memory, manually partition your data into chunks, and reason about which operations can overlap across which engines. That's five architectural decisions that weren't in the original three-line version. The distance between "works correctly" and "works efficiently" is enormous, and cudaMemcpy is the default-shaped pit you fall into on the way.
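Generalized to many chunks round-robined over a small pool of streams, the two-stream pattern above looks roughly like this. It's a sketch: `kernel`, the buffer names, and the assumption that chunks are independent are all mine.

```cuda
const int NUM_STREAMS = 4;
const int NUM_CHUNKS  = 16;
cudaStream_t streams[NUM_STREAMS];
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamCreate(&streams[i]);

size_t chunk_elems = total_elems / NUM_CHUNKS;      // assume it divides evenly
size_t chunk_bytes = chunk_elems * sizeof(float);

for (int c = 0; c < NUM_CHUNKS; ++c) {
    cudaStream_t s = streams[c % NUM_STREAMS];      // round-robin chunks over streams
    size_t off = (size_t)c * chunk_elems;
    cudaMemcpyAsync(d_in + off, h_in + off, chunk_bytes, cudaMemcpyHostToDevice, s);
    kernel<<<grid, block, 0, s>>>(d_in + off, d_out + off);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk_bytes, cudaMemcpyDeviceToHost, s);
}
for (int i = 0; i < NUM_STREAMS; ++i)
    cudaStreamSynchronize(streams[i]);              // wait once, at the end
```

Within a stream the three steps still run in order; across streams they interleave, which is exactly the overlap the hardware was built for.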
The Copy Engine Rabbit Hole
Here's where it gets genuinely interesting. Modern NVIDIA GPUs have separate hardware copy engines (typically one for host-to-device transfers and one for device-to-host) that operate independently of the compute engine (the SMs). This means three things can potentially happen at once: a copy up, a kernel, and a copy down. It's a three-lane highway built into the silicon.
cudaMemcpy uses one lane at a time and puts traffic cones across the other two.
The hardware was designed for overlap. The DMA engines exist specifically so that data movement and computation aren't competing for the same resources. When you use synchronous copies, you're not just being "suboptimal"; you're leaving dedicated hardware units completely idle. It's like buying a dual-GPU workstation and only using one GPU, except the second GPU is inside the first one, and you're ignoring it because the API default told you to.
And there's another twist: even cudaMemcpyAsync won't overlap with kernels on the default stream (stream 0) unless you compile with --default-stream per-thread. The default stream has implicit synchronization with all other streams. So even after you switch to async copies, if you forget to use explicit streams, you're still serialized. The layers of implicit synchronization in CUDA are nested like a matryoshka doll of performance traps.
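There are two ways out of the legacy default stream's implicit barriers, sketched below (the flag and option names are from the CUDA runtime API, but verify them against your toolkit version):

```cuda
// Option 1: at compile time, give each host thread its own default stream
// that doesn't synchronize with explicitly created streams:
//   nvcc --default-stream per-thread app.cu

// Option 2: at runtime, create streams that opt out of synchronizing
// with the legacy default stream:
cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
```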
The cudaMemcpy problem is really about defaults shaping architecture. The vast majority of CUDA code in tutorials, textbooks, and quick prototypes uses synchronous copies because they're the path of least resistance. They work. They're simple. They produce correct results. And they silently establish a performance ceiling that many applications never break through because the developer never realizes the ceiling exists.
The deeper lesson is about the CUDA execution model itself. The GPU is not a magic box you throw work at. It's a multi-engine system (copy engines, compute engines, scheduling units), and getting real performance means understanding which engines are active, which are stalled, and which are sitting dark because you didn't ask them to do anything. cudaMemcpy is the first place most people encounter this, and it's a rude awakening.
But once you see it, once you open Nsight Systems and watch those timelines light up with overlapping copies and kernels across multiple streams, something clicks. The GPU isn't just a wide processor. It's an orchestra, and the streams are the sheet music. cudaMemcpy is what happens when you hand every section the same part and make them play one at a time.
The music is a lot better when they play together. It's just infuriating that the conductor's default is to not tell them that.
If you've ever wondered why your GPU utilization is 30% despite "using CUDA," go open Nsight Systems. Look at the timeline. Count the white space. That white space has a name, and its name is cudaMemcpy.
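Concretely, something along these lines captures a timeline (flags from memory; check `nsys --help` on your install):

```shell
# Record CUDA API calls, kernel executions, and memory transfers
nsys profile -o my_app_report ./my_app

# Open my_app_report.nsys-rep in the Nsight Systems GUI and look at the
# CUDA hardware rows: the gaps between copies and kernels are the white space
```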
(The DMV comparison, by the way, comes from the experience of others. I don't drive.)