Mohamed Elashri

CUDA: Where Your C++ Features Go to Die

So you’ve got this beautiful modern C++ codebase. Smart pointers, STL containers, lambdas everywhere. Life is good. Then your supervisor says, “Hey, make it run on the GPU,” and suddenly you’re coding like it’s 1999.

Want to use std::vector in your kernel? Haha, no. std::shared_ptr? Get outta here. That fancy std::optional you love? Forget it.

__global__ void myKernel(std::vector<float> data) {  // NOPE
    auto ptr = std::make_unique<float[]>(10);        // NOPE
    std::array<float, 10> arr;                       // MAYBE? (spoiler: probably breaks)
}

The “solution” is to write everything twice:

// Host code: Living in 2025
std::vector<float> host_data = {1.0f, 2.0f, 3.0f};

// Device code: Welcome to 1999 (Your childhood years are back!)
float* device_data;
cudaMalloc(&device_data, sizeof(float) * 3);
cudaMemcpy(device_data, host_data.data(), sizeof(float) * 3, cudaMemcpyHostToDevice);
// Don't forget to cudaFree!
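
The usual coping mechanism is to wrap this in RAII so the cudaFree at least happens automatically. A minimal sketch of what people write (DeviceBuffer is my name for it, not anything official):

// Minimal RAII sketch for a device allocation; DeviceBuffer is a made-up name.
#include <cuda_runtime.h>
#include <cstddef>
template <typename T>
class DeviceBuffer {
public:
    explicit DeviceBuffer(std::size_t n) : n_(n) {
        cudaMalloc(&ptr_, n * sizeof(T));  // error checking omitted for brevity
    }
    ~DeviceBuffer() { cudaFree(ptr_); }    // the part everyone forgets
    DeviceBuffer(const DeviceBuffer&) = delete;             // no accidental double-free
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
    T* get() { return ptr_; }
    std::size_t size() const { return n_; }
private:
    T* ptr_ = nullptr;
    std::size_t n_ = 0;
};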

Oh, and if you want to capture a lambda by value in your kernel?

int multiplier = 5;
auto kernel = [=] __device__ (int x) { return x * multiplier; };
// Good luck getting this to work reliably
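
To be fair, nvcc can do this, but only if you compile with --extended-lambda (previously --expt-extended-lambda) and funnel the closure through a templated kernel. Something along these lines:

// Sketch: requires nvcc --extended-lambda; the kernel takes the closure by value.
template <typename F>
__global__ void apply(F f, int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = f(data[i]);
}
void launch(int* d_data, int n) {
    int multiplier = 5;
    // Capture by value only: capturing host stack memory by reference
    // hands the device a dangling pointer.
    auto f = [=] __device__ (int x) { return x * multiplier; };
    apply<<<(n + 255) / 256, 256>>>(f, d_data, n);
}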

The real comedy starts when CUDA advertises “C++17 support” and you actually try to use C++17 features:

__device__ auto process(int x) {
    if constexpr (sizeof(int) == 4) {  // This might work!
        return std::variant<int, float>{x};  // This definitely won't
    }
}
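
The if constexpr half of that joke is real, though: C++17 language features generally do compile in device code. So the standard dodge is to dispatch on types at compile time and pretend you never wanted std::variant. Roughly:

// Sketch: compile-time dispatch works on the device; the library type doesn't.
#include <type_traits>
template <typename T>
__device__ float process(T x) {
    if constexpr (std::is_integral_v<T>) {
        return static_cast<float>(x) * 2.0f;  // integer path
    } else {
        return x * 0.5f;                      // floating-point path
    }
}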

NVIDIA keeps adding C++ standard support, but it’s like getting a Ferrari engine with bicycle wheels. Sure, nvcc technically accepts C++20… except that most of the standard library doesn’t exist in device code, and that’s most of what makes C++20 actually useful.
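
In fairness, NVIDIA ships libcu++, the cuda::std:: heterogeneous standard library, precisely to bolt some wheels back on; whether your toolkit version has the specific piece you need is its own gamble. A sketch, assuming a toolkit recent enough to bundle <cuda/std/array>:

// Sketch, assuming a toolkit that bundles libcu++ (CUDA 11+ ships it):
// cuda::std:: types are built to compile in both host and device code.
#include <cuda/std/array>
__global__ void sum_kernel(float* out) {
    cuda::std::array<float, 4> arr{1.0f, 2.0f, 3.0f, 4.0f};
    float sum = 0.0f;
    for (float v : arr) sum += v;  // range-for is fine on the device
    *out = sum;
}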

The gap between “host” and “device” code is where your sanity goes to die. You end up writing this weird dual-personality codebase that’s neither good C++ nor good CUDA, just a Frankenstein’s monster of manual memory management and preprocessor hacks.
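
“Preprocessor hacks” meaning things like the classic annotation macro, so one function body can serve both personalities (HD is a made-up name; __CUDACC__ is real):

// Under nvcc, annotate for both sides; under a plain host compiler,
// the macro vanishes and you get ordinary C++.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif
HD inline float lerp(float a, float b, float t) {
    return a + t * (b - a);  // same body, two compilers, one macro
}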

But hey, at least it’s fast! (After you spend a couple of weeks debugging why your kernel launch is returning cudaErrorIllegalAddress.)
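
Most of that debugging time goes to learning that kernel launches fail silently unless you interrogate the runtime afterward. Hence the error-checking macro every CUDA codebase reinvents (CUDA_CHECK is a convention, not an official API):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
// Kernel launches don't return errors directly; you have to ask the runtime.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                         cudaGetErrorString(err), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)
// After a launch: check the launch itself, then synchronize, because
// errors like cudaErrorIllegalAddress only surface asynchronously.
// myKernel<<<blocks, threads>>>(args);
// CUDA_CHECK(cudaGetLastError());
// CUDA_CHECK(cudaDeviceSynchronize());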