Every CUDA tutorial, every conference talk, every blog post and every well-meaning Stack Overflow answer eventually says the same thing: "Increase your occupancy." Launch more threads. Fill those SMs. Keep the hardware busy. Occupancy is the north star, and 100% is the promised land. So you do it. You refactor your kernel to use fewer registers. You shrink your shared memory footprint. You cram more warps onto each Streaming Multiprocessor. Your occupancy climbs from 50% to 100%. You run the benchmark.
It's slower.
Not by a little. Measurably, reproducibly slower. You double-check. You triple-check. You question your profiler, your methodology, your career choices. But the numbers don't lie: you doubled occupancy and got worse performance.
Welcome to the occupancy trap, the most widely repeated half-truth in GPU computing.
What Occupancy Actually Means
Let's back up. Occupancy is the ratio of active warps on an SM to the maximum number of warps that SM can support. An SM on an A100-class Ampere GPU (compute capability 8.0) can hold up to 64 warps (2048 threads). If your kernel's resource usage only allows 32 warps to be resident, you're at 50% occupancy.
The logic for why higher is better seems airtight: GPUs hide memory latency by switching between warps. When one warp stalls waiting for a global memory fetch, another warp steps in and executes. More resident warps means more candidates to switch to, which means more latency hiding, which means the execution units stay fed. Simple. Elegant. And correct only in theory.
The problem is that this model treats the GPU like a perfectly fungible resource pool where warps are free and interchangeable. They aren't. To see why, look at what those "optimizations" actually cost.
When you "optimize for occupancy," you're typically doing one or more of the following:
Reducing register usage per thread. Registers are the fastest memory on the GPU, literally single-cycle access. When you force the compiler to use fewer of them (via __launch_bounds__ or maxrregcount), the excess variables get spilled to local memory, which is actually DRAM. You've traded single-cycle access for hundreds-of-cycles access. Your occupancy spreadsheet looks great. Your actual memory traffic pattern is a disaster.
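Here's a sketch of how that plays out (kernel name and numbers are hypothetical):

```cuda
// Forcing a register cap with __launch_bounds__ (or -maxrregcount) makes
// the compiler spill locals to "local" memory, which is thread-private
// DRAM, not anything actually local to the SM.
__global__ void __launch_bounds__(256, 8)   // ask for >= 8 blocks per SM
tile_kernel(const float* __restrict__ in, float* __restrict__ out)
{
    float tile[16];  // wants ~16 extra registers; under the cap, it spills
    // ... fill tile from `in`, compute, write `out` ...
}

// Check what actually happened with:  nvcc -O3 -Xptxas -v tile.cu -c
// Nonzero "spill stores" / "spill loads" in the ptxas report means every
// spilled access is now a round trip through the memory hierarchy.
```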
Shrinking shared memory per block. Shared memory is your explicitly managed L1-adjacent scratchpad. Reducing it to fit more blocks per SM means each block has less fast storage to work with, pushing data back to global memory. You've increased the number of warps available to hide latency while simultaneously increasing the amount of latency that needs hiding. It's the computational equivalent of drilling holes in a boat, so the water drains out faster.
Launching smaller thread blocks. More blocks mean more scheduling overhead, more barrier synchronization edge cases, and less opportunity for intra-block data reuse. The SM is busier, but it's busy doing bookkeeping.
In each case, you've traded per-thread efficiency for aggregate parallelism, and that trade is often a bad one, because the underlying assumption (that more resident warps always means better latency hiding) is usually wrong. Time to face the counterintuitive truth.
Here's what makes this genuinely fascinating rather than just annoying: the optimal occupancy for many real-world kernels is somewhere between 25% and 50%. Not 100%. Not even close. Vasily Volkov's landmark GTC 2010 talk, "Better Performance at Lower Occupancy" (practically required reading at this point), demonstrated that carefully tuned kernels with high register usage, low occupancy, and instruction-level parallelism (ILP) consistently outperformed their high-occupancy counterparts. The insight was that a single thread doing more useful work per instruction is often better than many threads each doing less work while fighting over shared resources.
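A minimal sketch of that ILP idea, using a hypothetical dot-product kernel (the block-level reduction is elided): two independent accumulator chains let one warp issue back-to-back FMAs instead of waiting on a single dependency chain, so fewer resident warps are needed to keep the pipelines full.

```cuda
__global__ void dot_ilp(const float* a, const float* b, float* partial, int n)
{
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float acc0 = 0.0f, acc1 = 0.0f;            // two independent chains
    for (int i = tid; i < n; i += 2 * stride) {
        acc0 = fmaf(a[i], b[i], acc0);         // chain 0
        int j = i + stride;                    // chain 1: no dependency on acc0
        if (j < n) acc1 = fmaf(a[j], b[j], acc1);
    }
    float sum = acc0 + acc1;
    // ... reduce `sum` across the block (e.g. __shfl_down_sync) into partial ...
}
```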
Consider a kernel that does a matrix multiply tile. At 50% occupancy with 64 registers per thread, each thread can hold an entire tile in registers and blast through the computation with minimal memory traffic. At 100% occupancy with 32 registers per thread, half the tile spills to local memory, and every iteration triggers DRAM accesses that didn't exist before. The SM is "busier" in the second case. It's also slower.
The GPU doesn't care how many warps are resident. It cares how many warps are ready to execute. If your high-occupancy warps are all stalled on the memory traffic you created by reducing their registers, you haven't hidden latency, you've just distributed it across more threads.
So Why Does the Myth Persist?
There are three main reasons, and I think all of them are understandable:
The CUDA Occupancy Calculator. NVIDIA shipped a literal spreadsheet tool (its functionality now lives in Nsight Compute) that computes your occupancy and implicitly frames higher numbers as better. It's a useful diagnostic for understanding resource bottlenecks, but generations of developers have internalized it as an optimization target rather than a diagnostic. The map became the territory.
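For what it's worth, the same calculation is exposed programmatically, and that's the right way to use it: as a diagnostic. A sketch, assuming a kernel symbol my_kernel and A100-class limits:

```cuda
__global__ void my_kernel(float*);  // whatever you actually launch

// Predicted occupancy for a given block size; purely informational.
float predicted_occupancy(int blockSize)
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, my_kernel, blockSize, /*dynamicSmemBytes=*/0);

    int maxWarps = 64;  // or cudaDeviceProp::maxThreadsPerMultiProcessor / 32
    return (blocksPerSM * blockSize / 32) / (float)maxWarps;
}
```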
It works for naive kernels. If your kernel is simple, memory-bound, and doesn't use many registers, the "hello world" of CUDA, then yes, more occupancy usually helps. The advice is correct for the first kernels you ever write, which means it gets cemented as gospel before you encounter cases where it breaks down.
It's easy to measure. Occupancy is a clean, single number between 0 and 100. "How efficiently is each thread using its registers relative to its arithmetic intensity and the L2 cache hit rate under this data layout" is not. People optimize what they can measure, especially under deadline pressure.
What Can We Do?
There's no single fix that covers every scenario, so take the following as guidelines rather than prescriptions:
Profile first, mythologize never. Use ncu to look at your kernel's actual bottleneck. Is it compute-bound? Memory-bound? Latency-bound? The answer determines whether occupancy matters at all. If you're compute-bound, occupancy above ~50% is almost certainly irrelevant.
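Concretely, something like this (kernel and binary names are placeholders; exact section names vary by Nsight Compute version):

```shell
# Collect the full section set for one kernel, then read the limiter
# analysis ("Speed Of Light", warp-stall reasons), not just Occupancy.
ncu --set full -k my_kernel ./my_app
```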
Let the compiler use registers. Unless you have a specific, profiler-backed reason to restrict them, let nvcc allocate as many registers as your kernel wants. Register spilling is almost always worse than lower occupancy. Trust the fastest memory you have.
Think in terms of achieved throughput, not theoretical parallelism. The question isn't "how many warps can I fit?" It's "how many bytes per second am I actually moving?" and "how many FLOPS am I actually achieving?" If those numbers go up while occupancy goes down, you are winning. The scoreboard is throughput. Occupancy is a proxy, and proxies lie.
Experiment with fewer, fatter threads. Try giving each thread more work. Unroll loops. Use more registers. Let each thread process multiple elements. This is often called "thread coarsening," and it flies in the face of the "launch as many threads as possible" instinct. It's also frequently faster.
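A sketch of what coarsening can look like for a hypothetical element-wise kernel: launch roughly a quarter of the threads and let each one walk the array with a grid stride. Coalescing is preserved because consecutive threads still touch consecutive elements on every iteration.

```cuda
__global__ void scale_coarsened(float* x, float s, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    #pragma unroll 4            // independent iterations buy some ILP too
    for (; i < n; i += stride)
        x[i] *= s;
}

// Host side: size the grid for roughly n/4 total threads instead of n, e.g.
//   int block = 256;
//   int grid  = (n / 4 + block - 1) / block;
//   scale_coarsened<<<grid, block>>>(d_x, 2.0f, n);
```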
The Deeper Lesson
The occupancy myth is really a lesson about Goodhart's Law applied to GPU programming: when a measure becomes a target, it ceases to be a good measure. Occupancy is a useful signal for understanding how your kernel maps onto hardware resources. It's a terrible optimization objective.
What's beautiful about this, and I mean that without irony, is that it forces you to actually understand what the hardware is doing. You can't cargo-cult your way to a fast kernel. You have to sit down with the profiler, understand the memory hierarchy, reason about register pressure, and think about what each thread is actually doing with its cycles. It's harder than checking a percentage. It's also where the real performance lives.
The GPU doesn't reward you for keeping it busy. It rewards you for keeping it useful. Those are very different things, and the gap between them is where the interesting engineering happens.
If you've ever proudly hit 100% occupancy and then watched your kernel get outperformed by one running at 25%, congratulations, you've graduated from the tutorial into the real world. It's worse here, but the benchmarks are honest.