CCCL for HEP Triggers
Trigger development at the LHC sits at the intersection of physics, computing, and real-time systems engineering. Every experiment must reduce raw collision data rates by orders of magnitude within a tight latency budget. For decades this has been achieved through a combination of specialized hardware triggers and highly tuned software pipelines. Today only LHCb has made the full leap to a GPU-resident software trigger at the first stage, but ATLAS, CMS, and ALICE are all investigating how GPUs can play a larger role in their trigger strategies.
When I work on GPU code for physics, I often find myself reimplementing standard patterns: compaction of candidates, prefix scans to build offsets, warp-level histograms for vertexing, and fine-grained atomics. This is partly deliberate. We want to control memory layouts, tune kernels for specific architectures, and reduce our dependence on external libraries that might change underneath us. Still, when I look at NVIDIA's CUDA C++ Core Libraries (CCCL) I cannot help but think that they could provide a consistent baseline for many of these building blocks. Even if we never adopt them wholesale for production, they could be powerful tools for prototyping and for sharing ideas across experiments.
What CCCL is and where it came from
CCCL is NVIDIA's combined home for three core CUDA libraries:
- Thrust: a parallel algorithms library that resembles the C++ STL. It provides `sort`, `reduce`, `scan`, and `copy_if` on both host and device, designed to integrate cleanly with CUDA streams.
- CUB: a collection of tuned primitives for block- and warp-level collectives, as well as device-wide operations like sorting and prefix sums. These are the tools you reach for inside kernels when you need to coordinate across a warp or block.
- libcu++: an implementation of standard C++ features adapted to GPU device code, including `<cuda/std/span>`, `<cuda/std/atomic>`, and synchronization primitives. A small sketch of how this looks in a kernel follows this list.
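To make the libcu++ item concrete, here is a minimal sketch of a kernel taking a `cuda::std::span`, assuming a recent toolkit whose bundled libcu++ provides `<cuda/std/span>`. The kernel, its name, and the calibration use case are illustrative assumptions, not code from any experiment:

```cpp
// Hypothetical kernel: scale calorimeter energies by a calibration factor.
// cuda::std::span gives device code a bounds-aware, non-owning view over a
// raw buffer; it is trivially copyable, so it can be passed by value.
#include <cuda/std/span>

__global__ void scale_energies(cuda::std::span<float> energies, float factor)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < energies.size())
        energies[i] *= factor;
}

// Host-side launch over an existing device buffer d_energies of length n:
//   scale_energies<<<(n + 255) / 256, 256>>>(
//       cuda::std::span<float>(d_energies, n), 1.02f);
```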
Each of these libraries had its own history. Thrust has been used for more than a decade as a high-level way to express data-parallel operations. CUB was written to expose low-level primitives tuned for each GPU generation. libcudacxx was developed to make modern C++ available inside CUDA kernels. Around 2023 NVIDIA decided to consolidate these into one monorepo, CCCL, so that the libraries would evolve together with consistent versioning and shared development practices.
Most CUDA toolkits already ship with Thrust and CUB, and libcu++ headers are also bundled with recent releases. In practice, that means most of us already have CCCL available on our systems without adding new runtime dependencies. The monorepo mainly provides a way to track the latest features directly from GitHub if the bundled versions lag behind.
Why CCCL is relevant to trigger development
Trigger algorithms in HEP almost always boil down to a set of recurring patterns:
- Filtering and compaction: deciding which candidates (tracks, clusters, hits) should be kept for more expensive processing.
- Prefix sums and scans: computing offsets or event-level counts, used in everything from track indexing to calorimeter clustering.
- Histogramming and reductions: voting for primary vertex positions, computing pileup densities, or binning calorimeter energies (sketched after this list).
- Synchronization and atomics: ensuring thread safety when multiple threads contribute to a shared result.
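To show what these hand-rolled pieces typically look like, here is a minimal sketch of a histogramming kernel that combines the last two patterns. The bin count, the z window, and all names are hypothetical, not taken from any experiment's code:

```cpp
#include <cuda_runtime.h>

constexpr int   kNumBins = 256;     // hypothetical number of vertex-position bins
constexpr float kZMin    = -300.f;  // hypothetical beamline window in mm
constexpr float kZMax    =  300.f;

// Vote for primary vertex positions by binning track z0 values.
__global__ void vertex_vote_histogram(const float* z0, int n, unsigned int* hist)
{
    // Per-block histogram in shared memory to keep atomics cheap.
    __shared__ unsigned int local[kNumBins];
    for (int i = threadIdx.x; i < kNumBins; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    // Each thread bins a strided subset of the track candidates.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
    {
        int bin = static_cast<int>((z0[i] - kZMin) / (kZMax - kZMin) * kNumBins);
        if (bin >= 0 && bin < kNumBins)
            atomicAdd(&local[bin], 1u);
    }
    __syncthreads();

    // Merge the per-block histograms into the global one.
    for (int i = threadIdx.x; i < kNumBins; i += blockDim.x)
        atomicAdd(&hist[i], local[i]);
}
```

Each block accumulates into shared memory first so that global atomics fire only once per bin per block, which is exactly the kind of memory-traffic tuning we care about in trigger code.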
Most experiments today reinvent these pieces (for CPUs, GPUs, and even FPGAs). There are reasons for this. We want to tune memory access patterns for our particular data layouts. We want to avoid shipping dependencies into long-lived cluster environments. And we sometimes gain performance by writing kernels that fuse multiple steps together.
Still, CCCL offers a set of primitives that are well tested, optimized for each GPU generation, and already integrated with the CUDA toolchain. That makes it an interesting candidate for exploration, especially in the early stages of algorithm design.
How CCCL would fit into the current LHC software landscape
Even if no experiment other than LHCb runs a fully GPU-resident trigger today, all of them are experimenting with GPU acceleration in parts of the trigger. CCCL could provide benefits in several areas:
- Prototyping new algorithms: Thrust allows fast iteration. A researcher can write a filtering step with `copy_if` in a few lines (as in the sketch after this list), test it against real data, and only later decide whether it is worth replacing with a hand-tuned kernel. This reduces time to insight.
- Cross-experiment sharing: one of the difficulties in HEP software is that each collaboration builds its own infrastructure. If CMS expresses candidate filtering with Thrust and ATLAS uses the same idioms for tower clustering, developers moving between collaborations will have a shared vocabulary.
- Stable primitives across architectures: NVIDIA tunes CUB primitives for each GPU generation. Instead of each experiment reoptimizing its scans and reductions when a new GPU generation arrives, we could rely on CCCL to carry that burden.
- Safer device code: libcu++ brings spans, atomics, and barriers into device code. Using these could reduce the risk of subtle bugs in kernels and make code easier to read for new collaborators.
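Here is what the prototyping bullet might look like in code: a candidate-filtering step written with `thrust::copy_if`. The `TrackCandidate` struct and the 0.5 GeV threshold are illustrative assumptions:

```cpp
#include <thrust/device_vector.h>
#include <thrust/copy.h>

struct TrackCandidate {
    float pt;   // transverse momentum in GeV
    int   id;   // index into the full candidate list
};

struct PassesPtCut {
    __host__ __device__ bool operator()(const TrackCandidate& c) const {
        return c.pt > 0.5f;  // hypothetical trigger threshold
    }
};

// Compact the candidates that pass the cut into a new device vector.
thrust::device_vector<TrackCandidate>
filter_candidates(const thrust::device_vector<TrackCandidate>& in)
{
    thrust::device_vector<TrackCandidate> out(in.size());
    auto end = thrust::copy_if(in.begin(), in.end(), out.begin(), PassesPtCut{});
    out.resize(end - out.begin());  // shrink to the surviving candidates
    return out;
}
```

The point is not that this is the fastest possible compaction, but that it takes minutes to write and can be benchmarked against real data before anyone commits to a custom kernel.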
One of the main reasons we reimplement so much is that we manage memory very carefully. We allocate workspaces once, reuse them across events, and ensure predictable memory access patterns. CCCL can fit into this model to some extent: CUB, for example, exposes its temporary storage so that it can be preallocated and reused, as the sketch below shows.
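A minimal sketch of that pattern, building per-event offsets with `cub::DeviceScan::ExclusiveSum`; the function, buffer names, and the fixed maximum item count are assumptions for illustration:

```cpp
#include <cub/device/device_scan.cuh>
#include <cuda_runtime.h>

// Turn per-candidate counts into offsets, reusing a preallocated workspace.
void build_offsets(const int* d_counts, int* d_offsets, int max_items,
                   void*& d_temp, size_t& temp_bytes, cudaStream_t stream)
{
    if (d_temp == nullptr) {
        // First (query) call: CUB only reports the workspace size it needs.
        cub::DeviceScan::ExclusiveSum(nullptr, temp_bytes,
                                      d_counts, d_offsets, max_items, stream);
        cudaMalloc(&d_temp, temp_bytes);  // allocate once, outside the event loop
    }
    // Subsequent calls reuse the preallocated workspace, so no allocation
    // happens while events are being processed.
    cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes,
                                  d_counts, d_offsets, max_items, stream);
}
```

Because the workspace is sized for the maximum item count up front, nothing is allocated on the critical path once the event loop is running.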
But in other cases CCCL abstractions may hide details that matter in a real-time trigger. That is why I do not expect us to adopt it wholesale. Instead I see it as a complementary tool, useful for prototyping, for shared learning, and maybe for selected building blocks where its performance and stability are on par with our custom solutions.
A possible path forward
For me the logical next step is to set up a series of prototypes that use CCCL for common patterns and benchmark them against our custom implementations. The questions I want to answer are:
- How much development time do we save by using CCCL primitives in early stages?
- Do CCCL routines achieve comparable throughput and latency to our tuned kernels when integrated into a trigger-like pipeline?
- How easy is it to control memory allocation and reuse temporary storage in CCCL-based code?
- Can CCCL abstractions serve as a teaching tool for new students entering trigger development, giving them a clean way to understand common patterns before diving into low-level kernels?
If the answers are positive, CCCL could become a shared foundation for exploratory work across experiments, even if production code continues to use custom kernels for critical paths.
CCCL represents a consolidation of libraries that many GPU developers have used independently for years.
For HEP, it offers the possibility of a more unified and maintainable approach to the building blocks of trigger development.
I have not yet integrated it into my workflow, but I am interested in exploring its potential.
The balance we need to strike is between productivity and control: CCCL can accelerate prototyping and the sharing of ideas, while custom kernels will likely remain necessary where every microsecond counts. As more experiments bring GPUs into their triggers, having a shared set of primitives like those in CCCL may prove to be an advantage for the entire community.