Mohamed Elashri

Branching on GPUs

Here is a quick thought about something you will encounter while doing GPU programming. It is not a new idea, but it is a useful one to keep in mind, and it illustrates the difference that dealing with GPUs can make and the mindset shift it requires of a programmer. It is easy to miss, and then you spend a couple of hours or days debugging why performance is tanking.

On a CPU, an if-else really picks a path. On an NVIDIA GPU, the 32 threads of a warp execute in lockstep. If some lanes take the if and others take the else, the warp serializes both paths under execution masks: the hardware walks each path in turn, so we pay for both while the lanes not on the current path sit idle. That effect is warp divergence, and it quietly erases throughput.

Here is the trap in minimal form:

__device__ float f_pos(float x) { return sqrtf(x) + 1.0f; }
__device__ float f_neg(float x) { return -expf(-x); }

__global__ void branchy(const float* x, float* y, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float xi = x[i];
  if (xi >= 0.0f) y[i] = f_pos(xi);   // path A
  else            y[i] = f_neg(xi);   // path B
}

If a warp holds a mix of xi signs, it will execute A and B sequentially. When the work per path is small, you can often win by removing control flow entirely and letting the compiler emit a select. Every lane computes both candidates, then chooses without branching:

__global__ void predicated(const float* x, float* y, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float xi = x[i];
  float a = f_pos(xi);
  float b = f_neg(xi);
  y[i] = (xi >= 0.0f) ? a : b;  // select, no divergent branch
}
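
For completeness, here is a minimal host-side launch sketch, assuming d_x and d_y are existing device allocations of n floats (the names are placeholders; error checking omitted):

int threads = 256;
int blocks  = (n + threads - 1) / threads;    // round up so every element gets a thread
predicated<<<blocks, threads>>>(d_x, d_y, n);
cudaDeviceSynchronize();                      // wait for the kernel before timing or copying back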

The same idea can be written as arithmetic masks. It looks odd to CPU eyes, but it is often faster on mixed data because the warp stays uniform. One caveat the select form does not have: the mask multiplies the unused candidate instead of discarding it, and 0.0f * NaN is still NaN, so both paths must stay finite for every input. Here sqrtf produces NaN for negative inputs, hence the clamp:

float m = (xi >= 0.0f) ? 1.0f : 0.0f;
y[i] = m * f_pos(fmaxf(xi, 0.0f))      // clamp: sqrtf of a negative is NaN,
     + (1.0f - m) * f_neg(xi);         // and 0.0f * NaN would stay NaN

When one path is expensive, computing both is wasteful. The cure is to make warps agree or make kernels homogeneous. One practical pattern is to separate orchestration from heavy math, build two index lists for elements that need path A vs path B, then launch specialized kernels that do no branching:

// Build pos_idx and neg_idx via a flag + scan or warp ballots (a sketch follows below).

template<bool POS>
__global__ void process_subset(const float* x, float* y, const int* idx, int m){
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k >= m) return;
  int i = idx[k];
  float xi = x[i];
  y[i] = POS ? f_pos(xi) : f_neg(xi);
}
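
The comment above leaves the list-building out, so here is one possible version of that pass, a sketch rather than a canonical implementation: each thread claims a slot in its list with an atomic counter. pos_count and neg_count are assumed to be device integers zeroed before launch. The branch here is harmless because this pass does almost no math; the heavy work still runs in the branch-free kernels. Note that atomics do not preserve element order, so a scan-based compaction usually coalesces better on large inputs.

__global__ void build_lists(const float* x, int* pos_idx, int* neg_idx,
                            int* pos_count, int* neg_count, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  if (x[i] >= 0.0f) {
    pos_idx[atomicAdd(pos_count, 1)] = i;   // claim the next slot in list A
  } else {
    neg_idx[atomicAdd(neg_count, 1)] = i;   // claim the next slot in list B
  }
}

After copying the two counters back to the host as pos_m and neg_m, each subset gets its own divergence-free launch:

process_subset<true><<<(pos_m + 255) / 256, 256>>>(d_x, d_y, d_pos_idx, pos_m);
process_subset<false><<<(neg_m + 255) / 256, 256>>>(d_x, d_y, d_neg_idx, neg_m);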

If your data tends to be locally uniform, you can take a cheaper route by fast-pathing the warp-uniform cases and falling back to predication only when the warp disagrees:

__global__ void hybrid(const float* x, float* y, int n){
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  unsigned mask = __activemask();
  bool cond = (x[i] >= 0.0f);

  if (__all_sync(mask, cond)) {
    y[i] = f_pos(x[i]);            // all lanes take A
  } else if (__all_sync(mask, !cond)) {
    y[i] = f_neg(x[i]);            // all lanes take B
  } else {
    float a = f_pos(x[i]);         // mixed warp: avoid branching
    float b = f_neg(x[i]);
    y[i] = cond ? a : b;
  }
}
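
Two notes on the intrinsics: __activemask() returns a bitmask of the warp's currently active lanes, which after the early-exit guard is exactly the in-bounds lanes, and __all_sync(mask, cond) votes true only when cond holds for every lane in mask. The uniform cases cost one cheap vote each; the mixed case pays for both paths, which it would have done under divergence anyway.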

Another lever is data order. Sort or bucket elements so that neighboring items share the same predicate; the simple branchy kernel then behaves as if there were no divergence. This trades a pass of reordering for higher warp execution efficiency and better cache behavior, which often nets a win for heavy kernels.
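
Here is a minimal sketch of that reordering pass, assuming Thrust is available: partition an index array on the predicate so that same-sign elements become contiguous, then read through it with the same indirection process_subset uses. As a bonus, thrust::stable_partition returns the split point, which doubles as the size of the path-A subset:

#include <thrust/device_vector.h>
#include <thrust/partition.h>
#include <thrust/sequence.h>

struct is_pos {
  const float* x;
  __host__ __device__ bool operator()(int i) const { return x[i] >= 0.0f; }
};

// Fill idx with 0..n-1, then move the indices of non-negative elements to
// the front. Warps that walk idx now see long runs of a single predicate.
int reorder_by_sign(const float* d_x, thrust::device_vector<int>& idx, int n){
  idx.resize(n);
  thrust::sequence(idx.begin(), idx.end());
  auto split = thrust::stable_partition(idx.begin(), idx.end(), is_pos{d_x});
  return static_cast<int>(split - idx.begin());  // number of path-A elements
}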

The mindset shift is straightforward: stop thinking in terms of choosing a path and start thinking in terms of making paths uniform. On GPUs, control flow is a collective property. We can either remove the branch (predication), isolate it to a lightweight orchestration pass (compaction plus specialized kernels), or restructure the data so warps agree by construction. Measure with branch efficiency and warp execution efficiency; prefer designs where whole warps do the same work and heavy math runs in kernels that never ask a divergent question.
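
On older toolkits nvprof reported both of those numbers directly; it has since been deprecated in favor of Nsight Compute, so take this invocation as a legacy example, with ./app standing in for your binary:

nvprof --metrics branch_efficiency,warp_execution_efficiency ./app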