CUDA: Where Your C++ Features Go to Die
So you've got this beautiful modern C++ codebase. Smart pointers, STL containers, lambdas everywhere. Life is good. Then your supervisor says, "Hey, make it run on the GPU," and suddenly you're coding like it's 1999.
Want to use `std::vector` in your kernel? Haha, no. `std::shared_ptr`? Get out of here. That fancy `std::optional` you love? Forget it.
```cpp
__global__ void scale(std::vector<float> data) { data[0] *= 2.0f; }
// nvcc: "calling a __host__ function from a __global__ function is not allowed"
```
The "solution" is to write everything twice:
```cpp
// Host code: Living in 2025
std::vector<float> host_data = {1.0f, 2.0f, 3.0f};

// Device code: Welcome to 1999 (Your childhood years are back!)
float* device_data;
cudaMalloc(&device_data, host_data.size() * sizeof(float));
cudaMemcpy(device_data, host_data.data(),
           host_data.size() * sizeof(float), cudaMemcpyHostToDevice);
// Don't forget to cudaFree!
```
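And that's only the upload. For completeness, here's a rough sketch of the rest of the ceremony the duplication implies: a hand-written kernel, a launch with magic numbers, a copy back, and the cleanup you were just warned about. (The `scale_kernel` name and the block size are made up for illustration, not part of any canonical recipe.)

```cpp
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // raw pointers and manual bounds checks, like it's 1999
}

// Back on the host: launch, copy back, clean up -- all by hand.
int n = static_cast<int>(host_data.size());
scale_kernel<<<(n + 255) / 256, 256>>>(device_data, n, 2.0f);
cudaMemcpy(host_data.data(), device_data, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(device_data);   // told you
```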
Oh, and you wanted your kernel lambda to capture by value?
```cpp
int multiplier = 5;
auto kernel = [=] __device__ (float x) { return x * multiplier; };
// Good luck getting this to work reliably
```
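For the record, it can be made to work, but only as an "extended lambda": an explicit `__device__` annotation, the `nvcc --extended-lambda` flag, and a templated kernel to feed it through. A minimal sketch of the hoops (the `apply` kernel, `run` wrapper, and launch sizes here are illustrative assumptions):

```cpp
// Compile with: nvcc --extended-lambda ...
template <typename F>
__global__ void apply(float* data, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = f(data[i]);
}

void run(float* device_data, int n) {
    int multiplier = 5;
    // The by-value capture works only because of the __device__ annotation
    // and the --extended-lambda flag; plain C++ lambdas need not apply.
    auto times = [=] __device__ (float x) { return x * multiplier; };
    apply<<<(n + 255) / 256, 256>>>(device_data, n, times);
}
```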
The real comedy starts when CUDA advertises "C++17 support" and you actually try to use C++17 library features:
```cpp
__device__ auto maybe_sqrt(float x) { return std::optional<float>{sqrtf(x)}; }
// nvcc: "calling a __host__ function from a __device__ function is not allowed"
```
NVIDIA keeps adding C++ standard support, but it's like getting a Ferrari engine with bicycle wheels. Sure, technically it's C++20 compliant... except for everything that makes C++20 actually useful.
The gap between "host" and "device" code is where your sanity goes to die. You end up writing this weird dual-personality codebase that's neither good C++ nor good CUDA, just a Frankenstein monster of manual memory management and preprocessor hacks.
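Those preprocessor hacks tend to look something like the sketch below: the classic "works on both sides" macro sandwich. (The `HD` macro and `clamp01` function are placeholders, not an established convention beyond "everybody reinvents some version of this.")

```cpp
#include <algorithm>

// Define the annotation away when a plain host compiler sees the header.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

HD inline float clamp01(float x) {
#ifdef __CUDA_ARCH__
    return fminf(fmaxf(x, 0.0f), 1.0f);          // device pass: CUDA math functions
#else
    return std::min(std::max(x, 0.0f), 1.0f);    // host pass: the STL you actually wanted
#endif
}
```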
But hey, at least it's fast! (After you spend a couple of weeks debugging why your kernel launch is returning `cudaErrorIllegalAddress`.)
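(Speaking of that debugging marathon: kernel launches don't return errors directly, so most codebases end up with some variant of the check-everything macro below. The `CUDA_CHECK` name is just a common convention, not an official API; the runtime calls it wraps are standard.)

```cpp
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call, because errors are otherwise silently sticky.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                         cudaGetErrorString(err));                        \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Kernel launches need their own two-step check after the <<<...>>> call:
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // surfaces the cudaErrorIllegalAddress itself
```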