CUDA: Where Your C++ Features Go to Die
So you've got this beautiful modern C++ codebase. Smart pointers, STL containers, lambdas everywhere. Life is good. Then your supervisor says, "Hey, make it run on the GPU," and suddenly you're coding like it's 1999.
Want to use `std::vector` in your kernel? Haha, no. `std::shared_ptr`? Get out of here. That fancy `std::optional` you love? Forget it.
```cpp
__global__ void scale(std::vector<float> data) { data[0] *= 2.0f; }
// nvcc: "calling a __host__ function from a __global__ function is not allowed"
```
The "solution" is to write everything twice:
```cpp
// Host code: Living in 2025
std::vector<float> host_data = {1.0f, 2.0f, 3.0f};

// Device code: Welcome to 1999 (Your childhood years are back!)
float* device_data;
cudaMalloc(&device_data, host_data.size() * sizeof(float));
cudaMemcpy(device_data, host_data.data(),
           host_data.size() * sizeof(float), cudaMemcpyHostToDevice);
// Don't forget to cudaFree!
```
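And that's only the upload. For completeness, here's a rough sketch of the rest of the ceremony the duplication implies: a hand-written kernel, a launch with magic numbers, a copy back, and the cleanup you were just warned about. (The `scale_kernel` name and the block size are made up for illustration, not part of any canonical recipe.)

```cpp
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // raw pointers and manual bounds checks, like it's 1999
}

// Back on the host: launch, copy back, clean up -- all by hand.
int n = static_cast<int>(host_data.size());
scale_kernel<<<(n + 255) / 256, 256>>>(device_data, n, 2.0f);
cudaMemcpy(host_data.data(), device_data, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(device_data);   // told you
```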
Oh, and you wanted your kernel lambda to capture by value?
```cpp
int multiplier = 5;
auto kernel = [=] __device__ (float x) { return x * multiplier; };
// Good luck getting this to work reliably
```
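For the record, it can be made to work, but only as an "extended lambda": an explicit `__device__` annotation, the `nvcc --extended-lambda` flag, and a templated kernel to feed it through. A minimal sketch of the hoops (the `apply` kernel, `run` wrapper, and launch sizes here are illustrative assumptions):

```cpp
// Compile with: nvcc --extended-lambda ...
template <typename F>
__global__ void apply(float* data, int n, F f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = f(data[i]);
}

void run(float* device_data, int n) {
    int multiplier = 5;
    // The by-value capture works only because of the __device__ annotation
    // and the --extended-lambda flag; plain C++ lambdas need not apply.
    auto times = [=] __device__ (float x) { return x * multiplier; };
    apply<<<(n + 255) / 256, 256>>>(device_data, n, times);
}
```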
The real comedy starts when CUDA advertises "C++17 support" and you actually try to use C++17 library features:
```cpp
__device__ auto maybe_sqrt(float x) { return std::optional<float>{sqrtf(x)}; }
// nvcc: "calling a __host__ function from a __device__ function is not allowed"
```
NVIDIA keeps adding C++ standard support, but it's like getting a Ferrari engine with bicycle wheels. Sure, technically it's C++20 compliant... except for everything that makes C++20 actually useful.
The gap between "host" and "device" code is where your sanity goes to die. You end up writing this weird dual-personality codebase that's neither good C++ nor good CUDA, just a Frankenstein monster of manual memory management and preprocessor hacks.
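Those preprocessor hacks tend to look something like the sketch below: the classic "works on both sides" macro sandwich. (The `HD` macro and `clamp01` function are placeholders, not an established convention beyond "everybody reinvents some version of this.")

```cpp
#include <algorithm>

// Define the annotation away when a plain host compiler sees the header.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

HD inline float clamp01(float x) {
#ifdef __CUDA_ARCH__
    return fminf(fmaxf(x, 0.0f), 1.0f);          // device pass: CUDA math functions
#else
    return std::min(std::max(x, 0.0f), 1.0f);    // host pass: the STL you actually wanted
#endif
}
```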
But hey, at least it's fast! (After you spend a couple of weeks debugging why your kernel launch is returning `cudaErrorIllegalAddress`.)
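(Speaking of that debugging marathon: kernel launches don't return errors directly, so most codebases end up with some variant of the check-everything macro below. The `CUDA_CHECK` name is just a common convention, not an official API; the runtime calls it wraps are standard.)

```cpp
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call, because errors are otherwise silently sticky.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                         cudaGetErrorString(err));                        \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Kernel launches need their own two-step check after the <<<...>>> call:
// CUDA_CHECK(cudaGetLastError());        // catches launch-configuration errors
// CUDA_CHECK(cudaDeviceSynchronize());   // surfaces the cudaErrorIllegalAddress itself
```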