Mohamed Elashri

CCCL for HEP Triggers

Trigger development at the LHC sits at the intersection of physics, computing, and real time systems engineering. Every experiment must reduce raw collision data rates by orders of magnitude within a tight latency budget. For decades this has been achieved through a combination of specialized hardware triggers and highly tuned software pipelines. Today only LHCb has made the full leap to a GPU resident software trigger at the first stage, but ATLAS, CMS, and ALICE are all investigating how GPUs can play a larger role in their trigger strategies.

When I work on GPU code for physics, I often find myself reimplementing standard patterns: compaction of candidates, prefix scans to build offsets, warp level histograms for vertexing, and fine grained atomics. This is partly deliberate. We want to control memory layouts, tune kernels for specific architectures, and reduce our dependence on external libraries that might change underneath us. Still, when I look at NVIDIA’s CUDA C++ Core Libraries (CCCL) I cannot help but think that they could provide a consistent baseline for many of these building blocks. Even if we never adopt them wholesale for production, they could be powerful tools for prototyping and for sharing ideas across experiments.
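The warp and block level histogramming mentioned above is a good example of a pattern CCCL already packages. The sketch below uses `cub::BlockHistogram` to vote on track z positions for vertexing; the bin count, block size, z range, and the handling of padding threads are illustrative assumptions, not values from any experiment's code.

```cuda
#include <cub/block/block_histogram.cuh>

constexpr int kThreads = 256;
constexpr int kBins    = 128;

__global__ void vertex_vote(const float* z0, int n, unsigned* counts)
{
    using BlockHistogram =
        cub::BlockHistogram<unsigned char, kThreads, 1, kBins>;
    __shared__ typename BlockHistogram::TempStorage temp;
    __shared__ unsigned hist[kBins];

    // Map each track's z0 (assumed within [-150, 150] mm) to a bin index.
    // Padding threads fall into bin 0 in this sketch.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned char bin[1];
    bin[0] = (i < n)
               ? static_cast<unsigned char>((z0[i] + 150.f) / 300.f * kBins)
               : 0;

    // Cooperative block-wide histogram into shared memory.
    BlockHistogram(temp).Histogram(bin, hist);
    __syncthreads();

    // Accumulate the block-local histogram into the global one.
    for (int b = threadIdx.x; b < kBins; b += kThreads)
        atomicAdd(&counts[b], hist[b]);
}
```

The point is not that this beats a hand written kernel, but that the cooperative part is already tuned per architecture and can be swapped in while prototyping.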

What CCCL is and where it came from

CCCL is NVIDIA’s combined home for three core CUDA libraries:

  - Thrust: high level, STL style algorithms for expressing data parallel operations.
  - CUB: low level cooperative primitives (scans, reductions, histograms) tuned for each GPU generation.
  - libcudacxx (libcu++): an implementation of the C++ standard library usable inside CUDA kernels.

Each of these libraries had its own history. Thrust has been used for more than a decade as a high level way to express data parallel operations. CUB was written to expose low level primitives tuned for each GPU generation. libcudacxx was developed to make modern C++ available inside CUDA kernels. Around 2023 NVIDIA decided to consolidate these into one monorepo, CCCL, so that the libraries would evolve together with consistent versioning and shared development practices.

Most CUDA toolkits already ship with Thrust and CUB. libcu++ headers are also bundled with recent releases. That means that in practice most of us already have CCCL available on our systems without adding new runtime dependencies. The monorepo mainly provides a way to track the latest features directly from GitHub if the bundled versions lag behind.
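Since the bundled versions can lag behind the monorepo, it is worth knowing how to check what a given toolkit actually ships. Thrust and CUB expose version macros in their headers; the snippet below only prints them (the macro names are real, the program itself is just a convenience).

```cuda
#include <thrust/version.h>
#include <cub/version.cuh>
#include <cstdio>

int main()
{
    // THRUST_VERSION encodes major * 100000 + minor * 100 + subminor.
    std::printf("Thrust %d.%d.%d\n", THRUST_MAJOR_VERSION,
                THRUST_MINOR_VERSION, THRUST_SUBMINOR_VERSION);
    std::printf("CUB    %d\n", CUB_VERSION);
    return 0;
}
```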

Why would CCCL be relevant to trigger development?

Trigger algorithms in HEP almost always boil down to a set of recurring patterns:

  1. Filtering and compaction: deciding which candidates (tracks, clusters, hits) should be kept for more expensive processing.
  2. Prefix sums and scans: computing offsets or event level counts, used in everything from track indexing to calorimeter clustering.
  3. Histogramming and reductions: voting for primary vertex positions, computing pileup densities, or binning calorimeter energies.
  4. Synchronization and atomics: ensuring thread safety when multiple threads contribute to a shared result.

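The first two patterns map almost directly onto Thrust calls. The sketch below filters candidates by a transverse momentum cut and turns per-event counts into offsets; `Candidate`, the 500 MeV threshold, and the function names are illustrative stand-ins for whatever an experiment's event model actually uses.

```cuda
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/scan.h>

struct Candidate { float pt; int id; };

struct PassesPtCut {
    __host__ __device__ bool operator()(const Candidate& c) const {
        return c.pt > 0.5f;  // hypothetical 500 MeV threshold
    }
};

void select_candidates(const thrust::device_vector<Candidate>& in,
                       thrust::device_vector<Candidate>& kept,
                       const thrust::device_vector<int>& counts_per_event,
                       thrust::device_vector<int>& offsets)
{
    // Pattern 1: stream compaction of candidates that pass the cut.
    kept.resize(in.size());
    auto end = thrust::copy_if(in.begin(), in.end(),
                               kept.begin(), PassesPtCut{});
    kept.resize(end - kept.begin());

    // Pattern 2: exclusive prefix sum turning per-event counts into offsets.
    offsets.resize(counts_per_event.size());
    thrust::exclusive_scan(counts_per_event.begin(), counts_per_event.end(),
                           offsets.begin());
}
```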
Most experiments today reinvent these pieces (for CPU, GPU and even FPGA). There are reasons for this. We want to tune memory access patterns for our particular data layouts. We want to avoid shipping dependencies into long lived cluster environments. And we sometimes gain performance by writing kernels that fuse multiple steps together.

Still, CCCL offers a set of primitives that are well tested, optimized for each GPU generation, and already integrated with the CUDA toolchain. That makes it an interesting candidate for exploration, especially in the early stages of algorithm design.
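For the synchronization pattern in particular, libcu++ brings `std::atomic` semantics onto plain device memory, which is tidier than hand rolled `atomicCAS` loops. A minimal sketch, assuming an energy array in GeV and a fixed-point result slot (both hypothetical):

```cuda
#include <cuda/atomic>

// Several threads contribute to a shared maximum (say, the
// highest-energy cluster) through a well-defined atomic.
__global__ void max_energy(const float* energy, int n, int* best_milli_gev)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // cuda::atomic_ref wraps existing device memory with atomic semantics.
    cuda::atomic_ref<int, cuda::thread_scope_device> best(*best_milli_gev);
    best.fetch_max(static_cast<int>(energy[i] * 1000.f),
                   cuda::memory_order_relaxed);
}
```

The explicit thread scope and memory order are part of what makes such code easier to reason about than raw CUDA atomics.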

How would CCCL fit into the current LHC software landscape?

Even if no experiment other than LHCb runs a fully GPU resident trigger today, all of them are experimenting with GPU acceleration in parts of the trigger. CCCL could provide benefits in several areas:

One of the main reasons we reimplement so much is that we manage memory very carefully. We allocate workspaces once, reuse them across events, and ensure predictable memory access patterns. CCCL can fit into this model to some extent: CUB exposes its temporary storage so that it can be preallocated and reused, for example. But in other cases CCCL abstractions may hide details that matter in a real time trigger. That is why I do not expect us to adopt it wholesale. Instead I see it as a complementary tool, useful for prototyping, for shared learning, and maybe for selected building blocks where its performance and stability are on par with our custom solutions.
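The temporary storage point deserves a concrete illustration. CUB device-wide algorithms use a two-phase call: a first call with a null pointer only reports the workspace size, which can then be allocated once and reused event after event. The wrapper names below are illustrative, not from any experiment's framework.

```cuda
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

struct ScanWorkspace {
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;

    // One-time setup: size the workspace for the largest expected event.
    void init(const int* d_in, int* d_out, int max_n) {
        // With a null temp pointer, CUB only computes the required size.
        cub::DeviceScan::ExclusiveSum(nullptr, temp_bytes, d_in, d_out, max_n);
        cudaMalloc(&d_temp, temp_bytes);
    }

    // Per-event path: no allocation happens here.
    void run(const int* d_in, int* d_out, int n, cudaStream_t stream) {
        cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n,
                                      stream);
    }
};
```

This is exactly the allocate-once, reuse-across-events discipline described above, so at least this part of CCCL is compatible with a real time memory model.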

A possible path forward

For me the logical next step is to set up a series of prototypes that use CCCL for common patterns and benchmark them against our custom implementations. The questions I want to answer are:

  1. Does CCCL match the throughput of our hand tuned kernels on the patterns that dominate trigger workloads?
  2. Can its primitives operate within our preallocated, reused workspaces without hidden allocations?
  3. Are its APIs stable enough across CUDA releases to depend on in long lived trigger software?

If the answers are positive, CCCL could become a shared foundation for exploratory work across experiments, even if production code continues to use custom kernels for critical paths.

CCCL represents a consolidation of libraries that many GPU developers have used independently for years. For HEP, it offers the possibility of a more unified and maintainable approach to the building blocks of trigger development. I have not yet integrated it into my workflow, but I am interested in exploring its potential. The balance we need to strike is between productivity and control: CCCL can accelerate prototyping and sharing of ideas, while custom kernels will likely remain necessary where every microsecond counts. As more experiments explore GPUs in their triggers, having a shared set of primitives like those in CCCL may prove to be an advantage for the entire community.