If you are working with CUDA and GPU programming, you have probably heard the term "register spilling" at some point. Register spilling is a phenomenon that occurs when a CUDA kernel uses more registers than are available on the GPU, causing some of the register data to be spilled to local memory. This can lead to significant performance degradation, as accessing local memory is much slower than accessing registers.
But to understand register spilling, we must first understand what registers are in the context of a GPU, and why they occupy such a privileged position in the memory hierarchy.
A GPU Streaming Multiprocessor (SM) contains a register file which is a large, flat bank of 32-bit registers shared among all threads concurrently resident on that SM. On modern NVIDIA architectures (Ampere, Hopper), each SM provides 65,536 32-bit registers. These are the fastest storage available to a thread: access latency is effectively zero cycles (operands are read in the same cycle the instruction is issued), and bandwidth is enormous, on the order of tens of terabytes per second aggregate across the chip.
Every thread executing on the SM is allocated a contiguous slice of this register file at launch time. The key constraint is this: the register file is statically partitioned among all resident warps. If each thread in a kernel uses 32 registers, and each warp has 32 threads, then each warp consumes 32 × 32 = 1024 registers. An SM with 65,536 registers can therefore host at most 64 warps simultaneously. If each thread uses 64 registers, that drops to 32 warps, halving occupancy.
This creates the fundamental tension that makes register spilling interesting: the compiler must balance per-thread register usage (which determines computational throughput for each thread) against occupancy (which determines the SM's ability to hide memory latency through warp-level parallelism).
To reiterate: register spilling occurs when a kernel's live variable set exceeds the number of physical registers the compiler has allocated for each thread. When this happens, the compiler must evict some register values to a slower level of the memory hierarchy: specifically, to local memory, which despite its name resides in the same off-chip DRAM (or L2 cache) as global memory.
Concretely, a "spill" manifests as a pair of instructions:
- Spill store (STL): Write a register value to the thread's local memory stack frame.
- Spill load (LDL): Later, read that value back from local memory into a register when it is needed again.
Each of these instructions has a latency of hundreds of cycles (200–800 cycles depending on L1/L2 cache hit rates), compared to the zero-cycle access of a register read. This is why spilling is costly: it transforms what should be a free operand access into a memory transaction that can stall the warp's execution pipeline.
Each CUDA thread has a private local memory region. NVCC uses this region to store:
- Spilled register values.
- Large arrays declared within a kernel that cannot be kept in registers.
- Compiler-generated temporaries for complex expressions.
The local memory address for thread t in block b is computed as an offset from a per-thread base address. The hardware coalesces local memory accesses across threads in a warp: thread 0 accesses address base + offset, thread 1 accesses base + offset + stride, and so on, so that a warp's spill loads/stores hit contiguous cache lines. This is important: it means spills are at least coalesced, but they still pay the latency penalty of an L1/L2 access.
NVCC's register allocation is a graph-coloring problem operating on the intermediate representation (IR) after the PTX (Parallel Thread Execution) virtual ISA has been lowered to SASS (the actual machine ISA). The process unfolds in several phases:
Phase 1: Liveness Analysis
The compiler performs a classic dataflow analysis to determine, at each program point, which virtual registers are live, meaning their values will be used by some future instruction before being overwritten.
Consider this simplified kernel:
__global__ void example(float *A, float *B, float *C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
float a = A[idx]; // v1 = load
float b = B[idx]; // v2 = load
float c = a * b; // v3 = v1 * v2
float d = sinf(c); // v4 = sin(v3)
float e = a + d; // v5 = v1 + v4 <- v1 is still live here!
float f = b * e; // v6 = v2 * v5 <- v2 is still live here!
C[idx] = f; // store v6
}
}
The live ranges are:
| Instruction | Live-in set |
|---|---|
| v1 = load | {idx, A, B, C, N} |
| v2 = load | {idx, v1, B, C, N} |
| v3 = v1*v2 | {v1, v2, C, idx} |
| v4 = sin(v3) | {v1, v2, v3, C, idx} |
| v5 = v1+v4 | {v1, v2, v4, C, idx} |
| v6 = v2*v5 | {v2, v5, C, idx} |
| store v6 | {v6, C, idx} |
The maximum register pressure occurs at instruction v4 = sin(v3), where five virtual registers (v1, v2, v3, C, idx) are simultaneously live. If the physical register budget is 4, the compiler must spill at least one.
Phase 2: Interference Graph Construction
The compiler builds an interference graph where each node represents a virtual register and an edge connects two nodes if their live ranges overlap. Two virtual registers that are simultaneously live cannot share the same physical register.
For the example above, v1 and v2 interfere: v1 is live from instruction 1 through instruction 5, v2 from instruction 2 through instruction 6, and the two ranges overlap. The chromatic number of this graph tells us the minimum number of physical registers needed.
Phase 3: Graph Coloring with Spilling
NVCC uses a variant of the Chaitin-Briggs graph coloring algorithm, adapted for the GPU's architectural constraints. The algorithm proceeds:
- Simplify: Iteratively remove nodes with degree less than k (the number of available physical registers) from the graph, pushing them onto a stack.
- Potential spill: If no node has degree < k, select a node to be a potential spill candidate based on heuristics (discussed below), remove it, and mark it.
- Select: Pop nodes from the stack and assign colors (physical registers). If a potential spill node cannot be colored, it becomes an actual spill: its value is stored to local memory.
- Rewrite: Insert STL and LDL instructions for each actual spill and re-run allocation if needed.
When the interference graph is too dense to color, the allocator must choose which virtual register to spill, and the decision is critical. NVCC employs several heuristics:
Cost-based spilling: The compiler estimates the "spill cost" of each candidate as a function of:
- Frequency of use: A register used inside a loop body has high spill cost because every iteration would incur a spill load.
- Definition-use distance: A value defined far from its use is a better spill candidate than one used immediately after definition.
- Rematerialization potential: If the value can be cheaply recomputed (e.g., it is a constant, an address calculation, or a simple arithmetic expression of other live values), spilling it is effectively free: the compiler can rematerialize it instead of loading from local memory.
Loop-aware analysis: NVCC gives significant weight to loop nesting depth. A variable live across a loop body but only used outside the loop is a prime spill candidate: it can be spilled once before the loop and reloaded once after, rather than incurring per-iteration cost.
Why NVCC Spills: Architectural Motivations
NVCC's register allocation strategy is driven by several GPU-specific considerations that distinguish it from CPU register allocation:
1. The Occupancy Cliff
The register file is a hard-partitioned resource. The relationship between per-thread register count and maximum warps per SM is a step function that can be visualized as follows (for Ampere architecture with 65,536 registers):
Registers/thread Max warps (Ampere SM, 65536 regs)
≤ 32 64
≤ 40 48 (<- occupancy drops by 25%)
≤ 48 40
≤ 64 32 (<- occupancy halved)
≤ 80 24
≤ 96 20
≤ 128 16 (<- occupancy quartered)
≤ 255 8 (<- absolute minimum)
Notice the non-linearity: going from 32 to 33 registers per thread drops maximum warps from 64 to 48, a 25% occupancy reduction from a single additional register. NVCC is aware of these thresholds and may deliberately spill a few variables to keep register count at or below a step boundary.
2. Launch Bounds and Explicit Hints
CUDA provides the __launch_bounds__ qualifier to give the compiler information about the intended block size:
__global__ void __launch_bounds__(256, 4)
my_kernel(float *data) {
// ...
}
Here, 256 is the maximum threads per block and 4 is the minimum blocks per SM. From minBlocks = 4 and threadsPerBlock = 256, the compiler computes that at least 4 × (256/32) = 32 warps must be resident simultaneously, requiring at most 65536 / (32 × 32) = 64 registers per thread. NVCC will then aggressively spill to enforce this limit, even if the natural register usage would be higher.
Without __launch_bounds__, NVCC uses a default heuristic (typically targeting ~32 registers per thread on recent architectures) and makes less aggressive spilling decisions.
3. The maxrregcount Flag
The compiler flag --maxrregcount=N globally caps register usage per thread at N. When a kernel's natural register demand exceeds N, NVCC must spill the difference. This is a blunt instrument: it applies uniformly and can cause excessive spilling in register-hungry kernels, but it is commonly used to tune occupancy across an entire compilation unit.
4. Predication and Divergence Pressure
GPU kernels frequently contain conditional code where both branches must be considered for liveness, because threads in a warp may diverge. Consider:
if (condition) {
float x = expensive_computation_1();
use(x);
} else {
float y = expensive_computation_2();
use(y);
}
On a CPU, only one branch's registers are live at a time. On a GPU, predicated execution or warp-level divergence means the compiler may conservatively assume that variables from both branches are simultaneously live, inflating register pressure and causing spills that would not occur in scalar compilation.
Modern NVCC versions perform predication-aware liveness analysis that is more precise about this, but deeply nested divergent control flow still tends to inflate register pressure.
Analyzing Spills: Practical Techniques
Now that we understand why spills happen, how can we analyze them in practice? NVCC and NVIDIA's profiling tools provide several ways to observe and quantify spilling. These techniques are essential for diagnosing performance issues and guiding optimization efforts. One caveat: the tools and metrics can be overwhelming, so I will focus on the ones most informative for spill analysis rather than covering the full breadth of Nsight Compute's capabilities or the various occupancy and warp-level metrics that also matter for performance tuning.
Compiler Flag: --ptxas-options=-v
The most direct way to observe spills is the verbose output from ptxas, the PTX assembler:
nvcc --ptxas-options=-v -o kernel kernel.cu
This produces output like:
ptxas info : Compiling entry function '_Z9my_kernelPfS_S_i'
ptxas info : Function properties for _Z9my_kernelPfS_S_i
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 28 registers, 360 bytes cmem[0]
When spilling occurs, you see nonzero values:
ptxas info : Function properties for _Z15heavy_kernelPfS_S_i
128 bytes stack frame, 96 bytes spill stores, 88 bytes spill loads
ptxas info : Used 64 registers, 380 bytes cmem[0]
The asymmetry between spill stores (96 bytes) and spill loads (88 bytes) is normal: some spilled values may be dead along certain paths or rematerialized instead of reloaded.
SASS Inspection with cuobjdump
To see the actual spill instructions, disassemble the binary:
cuobjdump -sass kernel.o | grep -E 'STL|LDL'
STL (Store to Local) and LDL (Load from Local) are the SASS instructions corresponding to spill stores and loads. You can count their frequency, observe their placement relative to loop structures, and infer which variables were spilled.
Nsight Compute Profiling
NVIDIA Nsight Compute provides detailed metrics for spill analysis; it is the most powerful tool in your arsenal. Key metrics to look at include:
- l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_read: counts local memory read transactions (spill loads).
- l1tex__data_pipe_lsu_wavefronts_mem_lg_cmd_write: counts local memory write transactions (spill stores).
- smsp__sass_inst_executed_op_local_ld and smsp__sass_inst_executed_op_local_st: direct counts of local load/store instructions executed.
A high ratio of local memory traffic to global memory traffic is a strong indicator that spills are the performance bottleneck. But there are many nuances: if the spilled values are reused frequently and hit in L1 cache, the performance impact may be less severe than if they cause L1 misses. And you might miss the fact that some spills are rematerialized, so the local memory traffic metrics may undercount the true spill cost.
Nsight Compute Source Correlation
Using nvcc -lineinfo, Nsight Compute can correlate SASS instructions back to source lines. This allows you to identify which source-level variables are being spilled, which is critical for targeted optimization.
A Detailed Example: Spill Pathology and Resolution
Let's look at a realistic kernel with high register pressure and see how we can analyze and optimize it. Consider a kernel performing a stencil computation with many intermediate values, something like this:
__global__ void stencil_3d(
const float *__restrict__ input,
float *__restrict__ output,
int Nx, int Ny, int Nz)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
int j = blockIdx.y * blockDim.y + threadIdx.y;
int k = blockIdx.z * blockDim.z + threadIdx.z;
if (i >= 1 && i < Nx-1 && j >= 1 && j < Ny-1 && k >= 1 && k < Nz-1) {
int idx = i + j*Nx + k*Nx*Ny;
// Load 7-point stencil
float center = input[idx];
float xm = input[idx - 1];
float xp = input[idx + 1];
float ym = input[idx - Nx];
float yp = input[idx + Nx];
float zm = input[idx - Nx*Ny];
float zp = input[idx + Nx*Ny];
// Compute second derivatives
float d2x = xp - 2.0f*center + xm;
float d2y = yp - 2.0f*center + ym;
float d2z = zp - 2.0f*center + zm;
// Nonlinear diffusion coefficient
float grad_sq = (xp-xm)*(xp-xm) + (yp-ym)*(yp-ym) + (zp-zm)*(zp-zm);
float kappa = 1.0f / (1.0f + grad_sq);
// Cross-derivative terms (13-point stencil extension)
float xy_pp = input[idx + 1 + Nx];
float xy_pm = input[idx + 1 - Nx];
float xy_mp = input[idx - 1 + Nx];
float xy_mm = input[idx - 1 - Nx];
float d2xy = 0.25f * (xy_pp - xy_pm - xy_mp + xy_mm);
float xz_pp = input[idx + 1 + Nx*Ny];
float xz_pm = input[idx + 1 - Nx*Ny];
float xz_mp = input[idx - 1 + Nx*Ny];
float xz_mm = input[idx - 1 - Nx*Ny];
float d2xz = 0.25f * (xz_pp - xz_pm - xz_mp + xz_mm);
float yz_pp = input[idx + Nx + Nx*Ny];
float yz_pm = input[idx + Nx - Nx*Ny];
float yz_mp = input[idx - Nx + Nx*Ny];
float yz_mm = input[idx - Nx - Nx*Ny];
float d2yz = 0.25f * (yz_pp - yz_pm - yz_mp + yz_mm);
output[idx] = center + kappa * (d2x + d2y + d2z + d2xy + d2xz + d2yz);
}
}
This kernel has enormous register pressure. At the point where d2yz is being computed, the live set includes: idx, Nx, Ny, center, d2x, d2y, d2z, kappa, d2xy, d2xz, plus the four yz_* temporaries, plus the output pointer, plus several address-computation intermediates. Compiling with -v:
ptxas info : Used 42 registers, 48 bytes spill stores, 40 bytes spill loads
42 registers puts us in the "max 40 warps" occupancy bucket. The spills push some pressure to local memory. That's a problem because this kernel is likely memory-bound, and the spill-induced local memory traffic will further reduce effective bandwidth. But how do we fix it? We can think of several strategies:
Strategy 1: Reduce Live Range Overlap
This is the most obvious and often the most effective strategy: restructure the computation to minimize the number of simultaneously live intermediate values, reducing register pressure without changing the algorithm. In our case, it would look like the following:
// Compute and accumulate terms incrementally
float laplacian = 0.0f;
// X-derivative block
{
float xm = input[idx - 1];
float xp = input[idx + 1];
laplacian += xp - 2.0f*center + xm;
// xm and xp are dead after this scope
}
// Y-derivative block
{
float ym = input[idx - Nx];
float yp = input[idx + Nx];
laplacian += yp - 2.0f*center + ym;
}
And so on for each term. By scoping intermediate values tightly, we reduce the maximum live set at any program point. The compiler can reuse the physical registers that held xm and xp for ym and yp.
Strategy 2: Recompute Instead of Store
If kappa depends on gradient values and those gradient values are also needed for cross-terms, it may be cheaper to recompute the gradient components rather than keeping them live across many instructions. This is the rematerialization strategy, trading ALU cycles (which are cheap on a GPU) for register pressure reduction.
Strategy 3: __launch_bounds__ Tuning
If the kernel is latency-bound rather than throughput-bound, you might accept lower occupancy in exchange for zero spills:
__global__ void __launch_bounds__(128, 2)
stencil_3d(const float *__restrict__ input, ...) {
// With minBlocks=2 and 128 threads, the compiler has more
// registers per thread to work with, potentially eliminating spills
}
This is a deliberate architectural tradeoff: fewer concurrent warps, but each warp runs at full speed with no spill-induced stalls.
NVCC's PTX-to-SASS Pipeline and Spill Decisions
There is a common misconception that register spilling is a direct consequence of the PTX code generated by NVCC. In reality, the spilling decision does not happen at the PTX level. PTX uses an unlimited virtual register set: a kernel's PTX may reference hundreds of virtual registers (%f0, %f1, ..., %f127, ...) without concern for physical limits. The register allocation and spilling decisions are made later, during the PTX-to-SASS compilation phase performed by ptxas. This means that the PTX code you see is not a reliable indicator of whether spills will occur or how many registers will be used in the final SASS. The reasons are:
- PTX optimizations (CSE, dead code elimination, constant propagation) may reduce or inflate the virtual register count before ptxas sees it.
- ptxas performs its own optimizations: instruction scheduling, register coalescing, and live-range splitting that can substantially change the spilling outcome relative to a naive analysis of the PTX.
- SASS-level instruction scheduling is interleaved with register allocation: ptxas may reorder instructions to reduce live-range overlaps, but some reorderings can increase register pressure if they bring two previously non-overlapping live ranges into conflict.
This is why analyzing spills from PTX alone is insufficient: the PTX register count bears little relation to the SASS register count. Always inspect the ptxas verbose output or the SASS disassembly.
The Register Pressure vs. Occupancy Tradeoff: A Quantitative View
People will always ask: "How many registers per thread should I use?" The answer is: it depends. And it is tied to our beloved, often misleading metric of occupancy. The relationship between register pressure, occupancy, and performance is non-monotonic and workload-dependent. Consider a kernel characterized by its arithmetic intensity (FLOPs per byte of memory traffic):
- Memory-bound kernels (low arithmetic intensity): Performance scales with occupancy because the SM needs many warps in flight to saturate memory bandwidth. Spilling a few registers to increase occupancy from 50% to 75% can yield a net speedup, even though each individual thread is slower.
- Compute-bound kernels (high arithmetic intensity): Performance scales with per-thread throughput. Additional warps provide diminishing returns because the SM's compute pipelines are already saturated. Here, spilling hurts: each spill load occupies a memory pipeline slot that could be used for useful data, and the stall cycles directly reduce throughput.
- Latency-bound kernels (insufficient parallelism to hide any latency): Occupancy is critical, and moderate spilling is acceptable as long as the spill traffic hits L1 cache.
The roofline model provides a framework for this analysis. At the ridge point where compute and memory ceilings intersect, the optimal register allocation strategy changes qualitatively.
Spills and the L1 Cache
A critical architectural detail: spilled values go to local memory addresses, but these addresses are cached in the L1 data cache (unified with shared memory on Volta+ architectures). If a warp spills a value and reloads it shortly after, the reload will likely hit L1 with a latency of ~30 cycles rather than the ~200+ cycles of an L2 or DRAM access.
This means that not all spills are equally expensive. A spill-reload pair within a tight loop, where the reloaded value stays hot in L1, costs ~30 cycles per access: painful, but manageable. A spill at the top of a long computation with a reload at the bottom, where intervening memory traffic has evicted the spilled value from L1, costs 200–800 cycles: devastating.
NVCC's spill heuristics attempt to account for this by preferring to spill values with short spill-reload distances (likely L1 hits) over values with long distances (likely L1 misses). There are also cases where the compiler chooses to spill a value used only once after a long computation, accepting the high latency because the alternative (keeping it live in a register) would cost even more through reduced occupancy.
A separate source of pressure is double precision. Double-precision (double, long long) values require two consecutive 32-bit registers (a "register pair"), so a kernel using double arithmetic faces roughly twice the register pressure of an equivalent float kernel. On architectures where double-precision throughput is already reduced (consumer GPUs: 1/32 of the FP32 rate on Ampere), the additional register pressure and spilling compound the performance penalty. The compiler must also respect alignment constraints for register pairs, further restricting allocation flexibility and increasing the likelihood of spills.
The full sequence, from source to spilled SASS, is:
1. C++ Frontend (cudafe++): Parses CUDA, separates host/device code.
2. Device IR Optimization: Inlining, loop unrolling, constant propagation, all of which can dramatically change register pressure.
3. PTX Generation (cicc): Produces PTX with virtual (unlimited) registers.
4. PTX Optimization (ptxas frontend): CSE, dead code elimination, peephole optimizations on PTX.
5. Liveness Analysis (ptxas): Computes live ranges for all virtual registers.
6. Interference Graph Construction: Builds the conflict graph.
7. Graph Coloring with Spilling: A Chaitin-Briggs variant allocates physical registers and introduces spills.
8. Spill Code Insertion: STL/LDL instructions are inserted.
9. Post-Allocation Scheduling: Instructions (including spill code) are scheduled to hide latencies.
10. SASS Emission: Final machine code with concrete register assignments and spill instructions.
Understanding this pipeline, and knowing where in it to intervene (source restructuring at step 2, __launch_bounds__ at step 7, --maxrregcount at step 7, manual PTX at step 3), is the key to effective register spill analysis and optimization on NVIDIA GPUs.
As a final remark, I want to emphasize that register spilling is not inherently bad. It is a compiler-managed tradeoff between per-thread performance and SM-level parallelism. The goal is not to eliminate all spills, but to ensure that the spilling pattern aligns with the kernel's computational characteristics. A memory-bound kernel can afford, and may even benefit from, moderate spilling to increase occupancy. A compute-bound kernel with complex register-heavy arithmetic should be tuned to minimize spills, even at the cost of reduced occupancy. The tools exist to measure both: ptxas -v, cuobjdump -sass, and Nsight Compute metrics. The optimization loop is: measure register count and spill volume, profile actual performance, adjust source structure or compiler hints, and measure again.