Two Uproot Mysteries
So I was parsing some ntuples for the gazillionth time and naturally decided to stick with uproot, the so-called modern solution for reading ROOT files in Python. Because in 2025 we obviously need yet another way to handle TTree structures, from a library that hates you just a little less than ROOT itself.
Let’s begin with the classic: eager vs. lazy loading. You’d think that calling .arrays() would get you arrays, right? But no. Depending on whether you passed a single branch or several, or whether you asked for library="np" or library="ak", you might get a NumPy array, an Awkward Array, a dictionary of arrays, or an empty array wrapped in sixteen layers of abstraction that looks like data but isn’t. And of course, none of this is consistent unless you read the fine print of the documentation, where they casually mention that sometimes you get a RecordArray pretending to be a NumPy structured array. Which is cool, because what you really wanted was a flat array of floats from a leaf called Track_PT, but you ended up with an object that fails silently when passed to your histogramming code because “iteration” means something different now.
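
For what it's worth, here is a minimal sketch of the defensive wrapper I end up writing, so the histogramming code only ever sees a flat list of floats. The helper names (flat_floats, load_track_pt) and the tree name "Events" are made up for illustration; only Track_PT comes from my files. It assumes that library="np" on a tree hands back a dict keyed by branch name, while asking a single branch hands back the array itself.

```python
from collections.abc import Mapping
from numbers import Real


def flat_floats(result, branch="Track_PT"):
    """Flatten whatever came back (dict of arrays, array, nested lists) into floats."""
    # A dict-like result maps branch names to arrays, so pick the branch first.
    if isinstance(result, Mapping):
        result = result[branch]
    return list(_walk(result))


def _walk(value):
    # Scalars (Python or NumPy numbers) are yielded; anything iterable is descended into.
    if isinstance(value, Real):
        yield float(value)
    elif isinstance(value, (str, bytes)):
        raise TypeError(f"expected numbers, got {value!r}")
    else:
        for item in value:
            yield from _walk(item)


def load_track_pt(path, tree_name="Events"):
    """Hypothetical loader: tree_name is an assumption, adjust to your file."""
    import uproot  # imported here so the helpers above work without uproot installed

    with uproot.open(path) as root_file:
        tree = root_file[tree_name]
        as_dict = tree.arrays(["Track_PT"], library="np")  # dict: name -> array
        as_array = tree["Track_PT"].array(library="np")    # just the array
    return flat_floats(as_dict), flat_floats(as_array)
```

Both return values of load_track_pt should come out identical, which is the whole point: the shape of what uproot gives you stops leaking into everything downstream. It is slow for big trees, since it walks the data in pure Python, so treat it as a debugging aid rather than an analysis tool.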
Then there’s the string decoding. Ah yes, TObjString and its wonderful legacy. You’d think reading a string from a ROOT file would be straightforward. After all, ASCII is solved, UTF-8 is solved, and even ROOT has had decades to figure it out. But uproot will happily give you a byte string that looks like it was read from a VAX system via a serial port. Sometimes it decodes fine. Other times it’s a bytes object that mysteriously needs decode("utf-8"), but only after you’ve cast it through four Awkward Array conversions and flattened it twice. And don’t get me started on reading a list of strings, because suddenly you’re handed an ak.Array where every entry is a list of one string, but sometimes it’s a list of zero strings, or worse, an object array with dtype=object that breaks every downstream NumPy function you try to use.
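
The least painful workaround I have found is to stop caring which flavor of string arrived and normalize every entry by hand. The sketch below is plain Python with no uproot calls; first_string is a name I made up, and it assumes each entry is either text, bytes, or a list holding zero or one of those.

```python
def first_string(entry, encoding="utf-8"):
    """Return one str per entry, or None when the entry is an empty list."""
    # Raw bytes (including NumPy byte strings, which subclass bytes) need decoding.
    if isinstance(entry, (bytes, bytearray)):
        return bytes(entry).decode(encoding)
    # Already text: nothing to do.
    if isinstance(entry, str):
        return entry
    # Otherwise treat it as a list-like wrapper around zero or one strings.
    items = list(entry)
    if not items:
        return None
    return first_string(items[0], encoding)


# Example column mixing the cases described above.
column = [[b"muon"], [], "electron", b"caf\xc3\xa9"]
names = [first_string(entry) for entry in column]
```

Empty entries become None instead of silently vanishing, so the output stays aligned with the event index. Whether None is the right placeholder depends on what your downstream code tolerates.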
In the end, uproot is still the only sane choice if you want to avoid diving into PyROOT and compiling things at 2 AM on lxplus. But sane doesn’t mean it won’t drive you slightly insane. It’s like talking to a reasonable person who insists on answering every question with a riddle. Sure, you get the answer, but you have to guess the context first.