A PhD's hate journey with CUDA
Let me paint you a picture. You’re three years into your PhD, working on some fancy parallel algorithm. Your advisor says “just port it to CUDA, should take a week.” That was six months ago. You haven’t slept properly since.
You start with beautiful, clean C++ code. Templates, RAII, all the good stuff:
auto result = parallel_reduce(data.begin(), data.end(),
                              [](auto x, auto y) { return x + y; });
Then CUDA enters the chat:
float* d_data;
cudaMalloc(&d_data, size * sizeof(float));
cudaMemcpy(d_data, h_data, size * sizeof(float), cudaMemcpyHostToDevice);

// 47 lines of kernel launch code
my_reduction_kernel<<<gridSize, blockSize, sharedMemSize>>>(d_data, d_result, size);

// Don't forget error checking! (Spoiler: you will forget)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    // Welcome to debugging hell
}
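What eventually keeps you sane, if anything does, is wrapping every runtime call in a checking macro so failures at least tell you where they happened. A minimal sketch; the CUDA_CHECK name and the exit-on-error policy are my own convention, not anything the toolkit ships:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so a failure is loud and carries file:line.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",                 \
                         cudaGetErrorString(err_), __FILE__, __LINE__);       \
            std::exit(EXIT_FAILURE);                                          \
        }                                                                     \
    } while (0)

// Usage with the snippet above:
// CUDA_CHECK(cudaMalloc(&d_data, size * sizeof(float)));
// my_reduction_kernel<<<gridSize, blockSize, sharedMemSize>>>(d_data, d_result, size);
// CUDA_CHECK(cudaGetLastError());        // catches bad launch configurations
// CUDA_CHECK(cudaDeviceSynchronize());   // catches errors from inside the kernel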
The real fun starts when you try to be clever. “I’ll use Unified Memory!” you think. “It’ll be just like regular pointers! How hard could it be?”
float* data;
cudaMallocManaged(&data, size * sizeof(float));
// This works great!
// ...until you profile and discover it's slower than CPU code
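What the profiler is usually telling you is that every access is a page fault migrating data one touch at a time. The usual patch is to prefetch the whole allocation before the kernel runs. A rough sketch, assuming a single GPU (device 0); my_kernel, gridSize, blockSize, and size are placeholders:

float* data;
size_t bytes = size * sizeof(float);
cudaMallocManaged(&data, bytes);

for (size_t i = 0; i < size; ++i) data[i] = 1.0f;   // touch on the host first

// Migrate the pages to the GPU in bulk before the kernel touches them,
// instead of paying a demand fault per access inside the kernel.
cudaMemPrefetchAsync(data, bytes, /*dstDevice=*/0);
my_kernel<<<gridSize, blockSize>>>(data, size);

// Bring the result back before the host reads it.
cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);
cudaDeviceSynchronize();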
Then there’s the classic PhD experience: your algorithm needs dynamic data structures. In C++, no problem! In CUDA? Time to implement your own GPU-friendly hash table from scratch. Stack Overflow posts from 2012 become your bible. You start having dreams about warp divergence.
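For the record, “from scratch” in practice means open addressing built on atomicCAS, because that is roughly the only primitive that survives thousands of threads inserting at once. A stripped-down sketch (fixed capacity, no deletion, no resizing; every name in it is hypothetical, not your advisor's code):

// Open-addressing insert with linear probing. EMPTY_KEY marks free slots;
// capacity is assumed to be a power of two.
#define EMPTY_KEY 0xffffffffu

__global__ void hash_insert(unsigned int* keys, unsigned int* values,
                            unsigned int capacity,
                            const unsigned int* new_keys,
                            const unsigned int* new_values, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int key  = new_keys[i];
    unsigned int slot = key & (capacity - 1);   // cheap hash: mask

    while (true) {
        // Claim the slot only if it is still empty; otherwise probe onward.
        unsigned int prev = atomicCAS(&keys[slot], EMPTY_KEY, key);
        if (prev == EMPTY_KEY || prev == key) {
            values[slot] = new_values[i];
            return;
        }
        slot = (slot + 1) & (capacity - 1);
    }
}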
The debugging experience deserves its own horror story:
__global__ void definitely_working_kernel(float* data) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    data[idx] = data[idx + 1]; // Segfault somewhere in here
    // Good luck finding which of the 100 threads caused it
}
Your options:
- printf debugging (breaks everything because 100 threads are printing)
- cuda-gdb (I hope you like command lines from 1985)
- Nsight (crashes every 20 minutes and needs more memory than your kernel)
- Just stare at the code until you achieve enlightenment
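What actually finds it, nine times out of ten, is compute-sanitizer (cuda-memcheck on older toolkits) plus the bounds check you should have written on day one. A sketch of the guarded kernel, assuming the caller passes the array length n:

__global__ void definitely_working_kernel(float* data, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard both the read at idx + 1 and the write at idx.
    if (idx + 1 < n) {
        data[idx] = data[idx + 1];
    }
}

// On the host, the sanitizer points at the offending access
// (build with -lineinfo to get file and line numbers):
//   compute-sanitizer --tool memcheck ./my_app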
And don’t get me started on the build system. Your CMakeLists.txt looks like someone’s ransom note:
find_package(CUDA REQUIRED)
# Wait no, that's deprecated
enable_language(CUDA)
# But this doesn't work with your version
# 200 lines of increasingly desperate CMake hacks
The worst part? You’ll spend weeks optimizing your kernel, fighting shared memory bank conflicts, unrolling loops, achieving perfect coalesced access… only to discover your CPU baseline was faster because memory transfers dominated everything.
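When there is a fix at all, it is to stop pretending transfers are free: pin the host buffer so copies can actually run asynchronously, and overlap them with compute on streams. A sketch under the assumption that the work splits into independent chunks; my_kernel, gridSize, and blockSize are again placeholders:

// Pinned (page-locked) host memory makes cudaMemcpyAsync genuinely
// asynchronous, and two streams let chunk 1 upload while chunk 0 computes.
float *h_data, *d_data;
cudaMallocHost(&h_data, size * sizeof(float));   // pinned host buffer
cudaMalloc(&d_data, size * sizeof(float));

cudaStream_t streams[2];
cudaStreamCreate(&streams[0]);
cudaStreamCreate(&streams[1]);

size_t chunk = size / 2;
for (int i = 0; i < 2; ++i) {
    size_t off = i * chunk;
    cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    my_kernel<<<gridSize, blockSize, 0, streams[i]>>>(d_data + off, chunk);
}
cudaDeviceSynchronize();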
But here’s the thing they don’t tell you: eventually, Stockholm syndrome kicks in. You start to enjoy thinking in warps and blocks. You dream in thread indices. You see a nested loop and immediately think “can I parallelize this?”
Your non-CUDA friends show you their “parallel” code using std::async, and you have to suppress a laugh. “That’s cute,” you think, while mentally calculating how many teraflops your GPU could push on their problem.
Another three years later, you defend your thesis. The committee asks about your “novel GPU acceleration techniques.” You smile, knowing they’ll never understand the blood sacrifice required to make thrust::reduce work with your custom allocator.
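For the curious committee member, the rough shape of that blood sacrifice: hand thrust an execution policy that carries your allocator, so its temporary scratch memory comes from your pool instead of repeated cudaMalloc calls. A hedged sketch, with the stock thrust::device_allocator standing in for whatever custom pool you actually wrote:

#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/device_allocator.h>
#include <thrust/system/cuda/execution_policy.h>

// Stand-in for the custom allocator; yours presumably sub-allocates from a pool.
using my_alloc = thrust::device_allocator<float>;

float reduce_with_allocator(const thrust::device_vector<float>& d_vec) {
    my_alloc alloc;
    // thrust::cuda::par(alloc) routes thrust's internal temporary allocations
    // through the given allocator instead of raw cudaMalloc/cudaFree.
    return thrust::reduce(thrust::cuda::par(alloc),
                          d_vec.begin(), d_vec.end(), 0.0f);
}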
Your code works. It’s fast. It’s also an ungodly chimera of C++98 device code, C++17 host code, raw CUDA APIs, thrust, cub, and that one crucial kernel you copied from a random GitHub repo at 3 AM.
You graduate. Your advisor uses your code for the next paper. The new PhD student opens your codebase, sees the mixture of unified memory, pinned memory, and manual memory management, and asks “why is it like this?”
You could explain. But instead, you just smile and say: “Just port it to CUDA. Should take a week.”
The cycle continues.