CUDA: Where Your C++ Features Go to Die
So you’ve got this beautiful modern C++ codebase. Smart pointers, STL containers, lambdas everywhere. Life is good. Then your supervisor says, “Hey, make it run on the GPU,” and suddenly you’re coding like it’s 1999.
Want to use std::vector in your kernel? Haha, no. std::shared_ptr? Get outta here. That fancy std::optional you love? Forget it.
__global__ void myKernel(std::vector<float> data) { // NOPE
    auto ptr = std::make_unique<float[]>(10);       // NOPE
    std::array<float, 10> arr;                      // MAYBE? (spoiler: probably breaks)
}
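What a kernel actually accepts is raw pointers and trivially copyable values. Here’s a minimal sketch of the same idea written the way device code expects it; the kernel body is made up purely for illustration:

// Raw pointer + element count: the calling convention kernels actually like
__global__ void myKernel(const float* data, int n, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = data[i] * 2.0f; // placeholder computation
    }
}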
The “solution” is to write everything twice:
// Host code: Living in 2025
std::vector<float> host_data = {1.0f, 2.0f, 3.0f};

// Device code: Welcome to 1999 (your childhood years are back!)
float* device_data;
cudaMalloc(&device_data, sizeof(float) * 3);
cudaMemcpy(device_data, host_data.data(), sizeof(float) * 3, cudaMemcpyHostToDevice);
// Don't forget to cudaFree!
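For the full flavor, here’s a sketch of the entire round trip that replaces one line of std::vector arithmetic. The scale kernel is a hypothetical example, not anything CUDA ships:

#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply every element in place
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    std::vector<float> host_data = {1.0f, 2.0f, 3.0f};
    const size_t bytes = host_data.size() * sizeof(float);

    float* device_data = nullptr;
    cudaMalloc(&device_data, bytes);
    cudaMemcpy(device_data, host_data.data(), bytes, cudaMemcpyHostToDevice);

    scale<<<1, 256>>>(device_data, static_cast<int>(host_data.size()), 2.0f);

    cudaMemcpy(host_data.data(), device_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device_data); // the part everyone forgets

    for (float v : host_data) std::printf("%f\n", v);
    return 0;
}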
Oh, and you wanted a lambda that captures by value and runs in your kernel?
int multiplier = 5;
auto kernel = [=] __device__ (int x) { return x * multiplier; };
// Good luck getting this to work reliably
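To be fair, device lambdas defined in host code do exist, but only behind nvcc’s --extended-lambda flag, and only when funneled through a template kernel. A hedged sketch of how that usually looks; apply, launch, and the launch configuration are made up for illustration:

// Compile with: nvcc --extended-lambda ...
template <typename F>
__global__ void apply(F f, int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = f(i);
}

void launch(int* device_out, int n) {
    int multiplier = 5;
    // Captures by value; capture a host pointer by value and the device
    // gets an address it can't dereference, which is where "reliably" dies.
    auto kernel = [=] __device__ (int x) { return x * multiplier; };
    apply<<<(n + 255) / 256, 256>>>(kernel, device_out, n);
}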
The real comedy starts when CUDA advertises “C++17 support” and you actually try to use C++17 features:
__device__ auto process(int x) {
    if constexpr (sizeof(int) == 4) {       // This might work!
        return std::variant<int, float>{x}; // This definitely won't
    }
}
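The split is roughly core language versus standard library: if constexpr compiles fine in device code, while library types like std::variant are host-only. So you end up hand-rolling what the library would have given you. A sketch of what that degenerates into; IntOrFloat and its layout are made up for illustration:

// What std::variant<int, float> becomes on the device: a hand-rolled tagged union
struct IntOrFloat {
    bool is_int;
    union { int i; float f; };
};

__device__ IntOrFloat process(int x) {
    if constexpr (sizeof(int) == 4) {   // core-language C++17: fine on device
        IntOrFloat v;
        v.is_int = true;
        v.i = x;
        return v;
    } else {
        IntOrFloat v;
        v.is_int = false;
        v.f = static_cast<float>(x);
        return v;
    }
}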
NVIDIA keeps adding C++ standard support, but it’s like getting a Ferrari engine with bicycle wheels. Sure, the core language is technically C++20 compliant… except the standard library, which is most of what makes C++20 actually useful, stays largely off-limits in device code.
The gap between “host” and “device” code is where your sanity goes to die. You end up writing this weird dual-personality codebase that’s neither good C++ nor good CUDA, just a Frankenstein monster of manual memory management and preprocessor hacks.
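The preprocessor hacks in question usually look something like this: one function, two personalities, selected by whether __CUDA_ARCH__ happens to be defined. A minimal sketch; clamp01 and the logging are hypothetical:

#include <cstdio>

// One function that has to behave on both sides of the fence.
__host__ __device__ inline float clamp01(float x) {
    float y = x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
#ifdef __CUDA_ARCH__
    // Device-side pass: no printf spam from a million threads, please.
#else
    // Host-side pass: fine to log, allocate, throw, use the STL...
    if (x != y) std::fprintf(stderr, "clamped %f\n", x);
#endif
    return y;
}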
But hey, at least it’s fast! (After you spend a couple of weeks debugging why your kernel launch keeps returning cudaErrorIllegalAddress.)