Columns over Rows
I still remember the first time I instrumented my analysis with a profiler. Every line looked harmless, yet the wall clock crawled, and the tool highlighted nothing but bright red cache misses. The culprit was the way I handled data: one event at a time, in strict rows. That style matches how the detector records information, yet it collides head-on with how modern CPUs move bytes. The solution was to flip perspective and treat each branch as a long column. Since that day I have written every new study with column thinking first; the difference feels like moving from walking through mud to gliding on ice.
The raw files that reach me are ordinary ROOT files. Inside each file a structure called a TTree holds branches; imagine a spreadsheet where every column keeps numbers of the same type, px1 values in one branch, py1 in another. Those branches are already stored contiguously on disk thanks to ROOT's design, and uproot in Python streams them straight into contiguous NumPy or awkward array buffers. Once in that form I can ask the CPU to add a million numbers with a handful of vectorized instructions instead of a million Python loop iterations; code that once took minutes now completes while I read the next line.
import uproot
import numpy as np

MU_MASS = 0.10566  # muon mass, assuming momenta are stored in GeV

tree = uproot.open("dimuon.root:Candidates")
px1, py1, pz1, px2, py2, pz2 = tree.arrays(
    ["mu1_px", "mu1_py", "mu1_pz", "mu2_px", "mu2_py", "mu2_pz"],
    library="np",
).values()

# the invariant mass needs the energies: E^2 = |p|^2 + m^2 for each muon
e1 = np.sqrt(px1**2 + py1**2 + pz1**2 + MU_MASS**2)
e2 = np.sqrt(px2**2 + py2**2 + pz2**2 + MU_MASS**2)
mass = np.sqrt((e1 + e2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2)
The entire array of invariant masses appears in one shot, and Matplotlib plots it instantly. The same transformation written as a per-event loop over row-shaped records would touch distant bytes on every iteration; the cache would thrash and performance would collapse.
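The plotting step really is a single call; a minimal sketch, where the bin count and axis labels are illustrative choices rather than anything from the original analysis:

import matplotlib.pyplot as plt

# one call consumes the whole mass column; bins=100 is an arbitrary choice
plt.hist(mass, bins=100)
plt.xlabel("dimuon invariant mass [GeV]")
plt.ylabel("candidates")
plt.show()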
Rows look comfortable until the data become jagged. Many experiments store variable-sized vectors: an event might have two tracks or two hundred. awkward array keeps such vectors efficient by pairing a single offsets buffer with one content buffer, and operations like "take the highest transverse-momentum track in each event" remain pure array arithmetic, so we don't need explicit loops.
import awkward as ak
import vector

# read the jagged per-track branches; the result is one record per event
arrays = tree.arrays(["track_px", "track_py", "track_pz"])
# build momentum vectors from the components, keeping the jagged structure
vec = vector.zip({"px": arrays.track_px, "py": arrays.track_py, "pz": arrays.track_pz})
# per-event maximum pt; empty events yield None, which we replace with zero
leading_pt = ak.fill_none(ak.max(vec.pt, axis=1), 0.0)
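To make the offsets/content pairing concrete, here is a jagged array built by hand from its two buffers; a minimal sketch using awkward's layout classes, with invented values:

import awkward as ak
import numpy as np

# three events with 2, 0, and 3 tracks: one flat content buffer, plus one
# offsets buffer marking where each event's slice begins and ends
content = ak.contents.NumpyArray(np.array([1.1, 2.2, 3.3, 4.4, 5.5]))
offsets = ak.index.Index64(np.array([0, 2, 2, 5], dtype=np.int64))
jagged = ak.Array(ak.contents.ListOffsetArray(offsets, content))
# jagged is [[1.1, 2.2], [], [3.3, 4.4, 5.5]]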
Edge cases appear when empty events or wildly unbalanced multiplicities sneak in. Filling missing values with zeros, or bucketing events by multiplicity, keeps kernels aligned; those tiny preprocessing steps rescue throughput without contorting the physics logic.
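Bucketing can be plain boolean masking over the record array from the previous snippet; a sketch, where the ten-track threshold is an arbitrary assumption:

import awkward as ak

# split events by track multiplicity so each batch has a similar shape
n_tracks = ak.num(arrays.track_px, axis=1)
sparse_events = arrays[n_tracks < 10]
busy_events = arrays[n_tracks >= 10]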
Pure Python remains great for exploration, yet heavy fits still crave compiled speed. Rather than rewriting the full chain, I open the same tree through PyROOT, whose bound objects expose a real C++ address, and hand that pointer to a short C++ extension built around ROOT's RDataFrame; both runtimes then work on the same in-memory object, no file rewrite, no copy. The code below shows a C++ function that fills a histogram from the TTree, using Eigen for the matrix work, and the Python side that calls it:
#include <ROOT/RDataFrame.hxx>
#include <Eigen/Dense>
#include <TFile.h>
#include <TTree.h>

extern "C" void vertex_fit(void* tree_ptr) {
  ROOT::RDataFrame df(*static_cast<TTree*>(tree_ptr));
  // toy determinant of a 3x3 matrix built from the vertex coordinates
  auto det = [](float x, float y, float z) {
    Eigen::Matrix3f m;
    m << x, x, y,
         x, y, z,
         y, z, x;
    return m.determinant();
  };
  // open an output file so Write() has somewhere to go
  // (the name "fit_result.root" is an arbitrary choice)
  TFile out("fit_result.root", "RECREATE");
  auto h = df.Define("det", det, {"vx", "vy", "vz"})
             .Histo1D({"h", "det", 100, -1, 1}, "det");
  h->Write();
}
import ctypes
import ROOT  # PyROOT's bound objects expose a real C++ address

lib = ctypes.cdll.LoadLibrary("fit.so")
lib.vertex_fit.argtypes = [ctypes.c_void_p]

f = ROOT.TFile.Open("vertices.root")
tree = f.Get("Tree")  # a live TTree, not a Python-side copy
lib.vertex_fit(ctypes.c_void_p(ROOT.addressof(tree)))
While the C++ thread pool processes data, I keep Python from reading; a simple flag avoids races. When the compiled kernel returns, the histogram file becomes just another uproot object ready for Matplotlib: one data source, two runtimes, zero duplication.
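When the kernel runs on its own thread, a threading.Event is enough for that flag. A minimal sketch, reusing lib and tree from the snippet above; the file name matches the one assumed inside vertex_fit:

import threading
import uproot
import matplotlib.pyplot as plt

done = threading.Event()

def run_fit():
    lib.vertex_fit(ctypes.c_void_p(ROOT.addressof(tree)))
    done.set()  # signal that the output file is complete

threading.Thread(target=run_fit).start()
done.wait()  # Python reads nothing until the C++ side finishes

# the histogram written by the kernel is now an ordinary uproot object
values, edges = uproot.open("fit_result.root")["h"].to_numpy()
plt.stairs(values, edges)
plt.show()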
Mixed-precision pitfalls lurk. ROOT branches often store float32, while NumPy defaults to float64 for freshly created arrays. I always cast the NumPy view to match the on-disk type, so any later C++ lambda receives exactly what it expects and life stays predictable.
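A sketch of that habit, reusing the dimuon file from earlier; the per-event scale factor is invented for illustration:

import numpy as np
import uproot

muons = uproot.open("dimuon.root:Candidates")
px = muons["mu1_px"].array(library="np")     # float32, exactly as stored on disk
scale = np.ones(len(px))                     # np.ones defaults to float64
corrected = (px * scale).astype(np.float32)  # cast back to the on-disk dtype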
Readers outside HEP sometimes ask why we bother with this dance. The answer is that column layouts map directly onto the way hardware streams memory. Whether the dataset describes muons or housing prices, the principle holds: keep like numbers together, feed them to vectorized kernels, and avoid row loops unless you really need to inspect one record; the rest is syntax.
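The same idea in three lines, with made-up housing prices:

import numpy as np

prices = np.random.default_rng(0).uniform(1e5, 1e6, 1_000_000)  # one contiguous column
taxed = prices * 1.08  # a single vectorized pass over the whole buffer
# the row-style version, [p * 1.08 for p in prices], funnels every element
# through Python object machinery and runs far slower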
I still dip into row mode when debugging a rare pathological event; walking through it field by field can reveal a corrupted value or a misaligned index. But once the cause is clear I pivot back to columns. The payoff in speed and clarity is too large to ignore: rows tell the story, columns do the math.