Columns over Rows
I still remember the first time I instrumented my analysis with a profiler. Every line looked harmless, yet the wall clock crawled, and the tool highlighted nothing but bright red cache misses. The culprit was the way I handled data: one event at a time, in strict rows. That style matches how the detector records information, yet it collides head-on with how modern CPUs move bytes. The solution was to flip perspective and treat each branch as a long column. Since that day I have written every new study with column thinking first; the difference feels like moving from walking through mud to gliding on ice.
The raw files that reach me are ordinary ROOT files. Inside each file a structure called a TTree holds branches; imagine a spreadsheet where every column keeps numbers of the same type, px1 values in one branch, py1 in another. Those branches are already stored contiguously on disk thanks to ROOT's design, and uproot in Python streams them straight into contiguous NumPy or awkward array buffers. Once in that form I can ask the CPU to add a million numbers with a handful of vectorized instructions instead of a million Python loop iterations; code that once took minutes now completes while I read the next line.
import uproot
import numpy as np

MU_MASS = 0.10566  # muon mass, assuming momenta are stored in GeV

tree = uproot.open("dimuon.root:Candidates")
px1, py1, pz1, px2, py2, pz2 = tree.arrays(
    ["mu1_px", "mu1_py", "mu1_pz", "mu2_px", "mu2_py", "mu2_pz"],
    library="np",
).values()

# the invariant mass needs the energies: E^2 = |p|^2 + m^2 for each muon
e1 = np.sqrt(px1**2 + py1**2 + pz1**2 + MU_MASS**2)
e2 = np.sqrt(px2**2 + py2**2 + pz2**2 + MU_MASS**2)
mass = np.sqrt((e1 + e2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2)
The entire array of invariant masses appears in one shot, and Matplotlib plots it instantly. The same transformation written as a per-event loop over row-shaped records would touch distant bytes on every iteration; the cache would thrash and performance would collapse.
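The plotting step really is a single call; a minimal sketch, where the bin count and axis labels are illustrative choices rather than anything from the original analysis:

import matplotlib.pyplot as plt

# one call consumes the whole mass column; bins=100 is an arbitrary choice
plt.hist(mass, bins=100)
plt.xlabel("dimuon invariant mass [GeV]")
plt.ylabel("candidates")
plt.show()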
Rows look comfortable until the data become jagged. Many experiments store variable-sized vectors: an event might have two tracks or two hundred. awkward array keeps such vectors efficient by pairing a single offsets buffer with one content buffer, and operations like "take the highest transverse-momentum track in each event" remain pure array arithmetic, so we don't need explicit loops.
import awkward as ak
import vector

# read the jagged per-track branches; the result is one record per event
arrays = tree.arrays(["track_px", "track_py", "track_pz"])
# build momentum vectors from the components, keeping the jagged structure
vec = vector.zip({"px": arrays.track_px, "py": arrays.track_py, "pz": arrays.track_pz})
# per-event maximum pt; empty events yield None, which we replace with zero
leading_pt = ak.fill_none(ak.max(vec.pt, axis=1), 0.0)
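To make the offsets/content pairing concrete, here is a jagged array built by hand from its two buffers; a minimal sketch using awkward's layout classes, with invented values:

import awkward as ak
import numpy as np

# three events with 2, 0, and 3 tracks: one flat content buffer, plus one
# offsets buffer marking where each event's slice begins and ends
content = ak.contents.NumpyArray(np.array([1.1, 2.2, 3.3, 4.4, 5.5]))
offsets = ak.index.Index64(np.array([0, 2, 2, 5], dtype=np.int64))
jagged = ak.Array(ak.contents.ListOffsetArray(offsets, content))
# jagged is [[1.1, 2.2], [], [3.3, 4.4, 5.5]]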
Edge cases appear when empty events or wildly unbalanced multiplicities sneak in. Filling missing values with zeros, or bucketing events by multiplicity, keeps kernels aligned; those tiny preprocessing steps rescue throughput without contorting the physics logic.
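Bucketing can be plain boolean masking over the record array from the previous snippet; a sketch, where the ten-track threshold is an arbitrary assumption:

import awkward as ak

# split events by track multiplicity so each batch has a similar shape
n_tracks = ak.num(arrays.track_px, axis=1)
sparse_events = arrays[n_tracks < 10]
busy_events = arrays[n_tracks >= 10]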
Pure Python remains great for exploration, yet heavy fits still crave compiled speed. Rather than rewriting the full chain, I open the same tree through PyROOT, whose bound objects expose a real C++ address, and hand that pointer to a short C++ extension built around ROOT's RDataFrame; both runtimes then work on the same in-memory object, no file rewrite, no copy. The code below shows a C++ function that fills a histogram from the TTree, using Eigen for the matrix work, and the Python side that calls it:
#include <ROOT/RDataFrame.hxx>
#include <Eigen/Dense>
#include <TFile.h>
#include <TTree.h>

extern "C" void vertex_fit(void* tree_ptr) {
  ROOT::RDataFrame df(*static_cast<TTree*>(tree_ptr));
  // toy determinant of a 3x3 matrix built from the vertex coordinates
  auto det = [](float x, float y, float z) {
    Eigen::Matrix3f m;
    m << x, x, y,
         x, y, z,
         y, z, x;
    return m.determinant();
  };
  // open an output file so Write() has somewhere to go
  // (the name "fit_result.root" is an arbitrary choice)
  TFile out("fit_result.root", "RECREATE");
  auto h = df.Define("det", det, {"vx", "vy", "vz"})
             .Histo1D({"h", "det", 100, -1, 1}, "det");
  h->Write();
}
import ctypes
import ROOT  # PyROOT's bound objects expose a real C++ address

lib = ctypes.cdll.LoadLibrary("fit.so")
lib.vertex_fit.argtypes = [ctypes.c_void_p]

f = ROOT.TFile.Open("vertices.root")
tree = f.Get("Tree")  # a live TTree, not a Python-side copy
lib.vertex_fit(ctypes.c_void_p(ROOT.addressof(tree)))
While the C++ thread pool processes data, I keep Python from reading; a simple flag avoids races. When the compiled kernel returns, the histogram file becomes just another uproot object ready for Matplotlib: one data source, two runtimes, zero duplication.
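When the kernel runs on its own thread, a threading.Event is enough for that flag. A minimal sketch, reusing lib and tree from the snippet above; the file name matches the one assumed inside vertex_fit:

import threading
import uproot
import matplotlib.pyplot as plt

done = threading.Event()

def run_fit():
    lib.vertex_fit(ctypes.c_void_p(ROOT.addressof(tree)))
    done.set()  # signal that the output file is complete

threading.Thread(target=run_fit).start()
done.wait()  # Python reads nothing until the C++ side finishes

# the histogram written by the kernel is now an ordinary uproot object
values, edges = uproot.open("fit_result.root")["h"].to_numpy()
plt.stairs(values, edges)
plt.show()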
Mixed-precision pitfalls lurk. ROOT branches often store float32, while NumPy defaults to float64 for freshly created arrays. I always cast the NumPy view to match the on-disk type, so any later C++ lambda receives exactly what it expects and life stays predictable.
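A sketch of that habit, reusing the dimuon file from earlier; the per-event scale factor is invented for illustration:

import numpy as np
import uproot

muons = uproot.open("dimuon.root:Candidates")
px = muons["mu1_px"].array(library="np")     # float32, exactly as stored on disk
scale = np.ones(len(px))                     # np.ones defaults to float64
corrected = (px * scale).astype(np.float32)  # cast back to the on-disk dtype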
Readers outside HEP sometimes ask why we bother with this dance. The answer is that column layouts map directly onto the way hardware streams memory. Whether the dataset describes muons or housing prices, the principle holds: keep like numbers together, feed them to vectorized kernels, and avoid row loops unless you really need to inspect one record; the rest is syntax.
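The same idea in three lines, with made-up housing prices:

import numpy as np

prices = np.random.default_rng(0).uniform(1e5, 1e6, 1_000_000)  # one contiguous column
taxed = prices * 1.08  # a single vectorized pass over the whole buffer
# the row-style version, [p * 1.08 for p in prices], funnels every element
# through Python object machinery and runs far slower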
I still dip into row mode when debugging a rare pathological event; walking through it field by field can reveal a corrupted value or a misaligned index. But once the cause is clear I pivot back to columns. The payoff in speed and clarity is too large to ignore: rows tell the story, columns do the math.