A little chat about jagged arrays

Hey there, fellow physics enthusiasts and data aficionados! It’s Mohamed here, and today I want to chat about something that might not seem super thrilling at first glance but is important for us. Yep, I’m talking about arrays! These data structures are the unsung heroes that help us make sense of large datasets, be it from the LHC or your local weather station.

Now, arrays come in two flavors: regular (or, if you want to get all formal, “rectangular”) and jagged. The difference might seem subtle, but trust me, it can have a big impact on your data manipulation and analysis.

The Good Ol’ Regular Array

So, what’s a regular array? Imagine a well-organized bookshelf where each shelf has the same number of books. In Python, you could visualize this as a list of lists, where each inner list (the shelf) has the same number of elements (the books).

Here’s how you’d set one up in Python:

# Picture-perfect array, like a well-organized bookshelf
regular_array = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]

Or if you’re feeling fancy and want to use NumPy (which you absolutely should for heavy-duty tasks):

import numpy as np

# A NumPy array
regular_array_np = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

Accessing elements? A piece of cake! Here’s how:

# That moment when you precisely know where your favorite book is!
element = regular_array[0][1]  # Output will be 2

The Intriguing Jagged Array

Now let’s get to the oddball of the array family—the jagged array. Imagine a bookshelf where each shelf can have a different number of books. Some might be packed; others might just have a lone book gathering dust.

In Python, that’d look something like this:

# The rebellious jagged array
jagged_array = [
    [1],
    [2, 3],
    [4, 5, 6]
]

Fetching an element from this structure requires a bit more caution. You don’t want to reach for a book that isn’t there!

# Ah, the mystery of finding the right book!
element_jagged = jagged_array[2][1]  # Output will be 5
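
If you’re not sure a shelf is long enough, a quick bounds check keeps you from reaching for a book that isn’t there (a tiny sketch; the indices are arbitrary):

# Guarded access into the jagged array from above
row, col = 0, 2
if col < len(jagged_array[row]):
    element_safe = jagged_array[row][col]
else:
    element_safe = None  # that shelf is too short!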

Let’s now move to something with more of a physics flavor. For experimental particle physicists, that of course means tracks.

Jagged Arrays and Particle Tracks

Okay, so if you’re into high-energy physics like me, you’re no stranger to particle tracks. When particles whiz through a detector (shoutout to my LHCb peeps!), they often produce a series of hits that we record. Now, here’s the kicker—not all particles generate the same number of hits. Some might produce just a couple, while others could result in a whole chain.

Enter jagged arrays.

Imagine each particle as a row in your jagged array, and the hits as the elements in that row. This way, you don’t have to waste space on empty hits. It’s like custom-fitting your data structure to your data. Neat, huh?

# Particle tracks as a jagged array
particle_tracks = [
    [0.1, 0.2],          # Particle 1 with 2 hits
    [0.3, 0.4, 0.5],     # Particle 2 with 3 hits
    [0.6]                # Particle 3 with 1 hit
]

Before we extend this discussion, I want to take a short detour to a different idea, a spell called vectorization.

Now, let’s talk about something that makes Python supercharged: vectorization. If you’ve ever had to grind through a large dataset with explicit, entry-by-entry loops (I’m looking at you, ROOT event loops), you know that approach can be a slog, and in plain Python it’s slow as a snail.

Vectorization lets you perform operations on entire arrays at once, without the need for explicit loops. It’s like sending your calculations to the gym and getting them back all beefed up!

Python libraries like NumPy and Awkward Array have got this down to a science. They’re optimized to handle operations in a way that makes the most of your hardware capabilities, often outperforming traditional loop-based implementations.

import numpy as np

# Regular array in NumPy
array_np = np.array([1, 2, 3, 4, 5])

# Vectorized addition
addition = array_np + 5  # This adds 5 to all elements, and it's super fast!
# Another vectorized operation
multiplication = array_np * 2  # This multiplies all elements by 2

You’ve probably heard the term “vectorization” thrown around before. Here’s how you can add 5 to every energy level without using a for-loop:

import numpy as np

# Using NumPy for vectorized operations
energy_levels = np.array([100, 200, 150])
new_energy_levels = energy_levels + 5

But if we wanted to do it the plain Python way, we would have to do it like this:


# Using a for-loop
energy_levels = [100, 200, 150]
new_energy_levels = []
for energy in energy_levels:
    new_energy_levels.append(energy + 5)

This will be a lot slower than the vectorized version. Try it on a large dataset and you’ll see what I mean.
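
If you want to see it for yourself, here’s a minimal timing sketch (the array size and repetition count are just illustrative; the exact numbers will depend on your machine):

import numpy as np
import timeit

# A large-ish dataset so the difference is visible
energies_np = np.random.rand(1_000_000) * 200
energies_list = energies_np.tolist()

# Time the plain-Python loop against the vectorized version
loop_time = timeit.timeit(lambda: [e + 5 for e in energies_list], number=10)
vec_time = timeit.timeit(lambda: energies_np + 5, number=10)

print(f"loop: {loop_time:.3f} s, vectorized: {vec_time:.3f} s")

The vectorized version typically comes out one to two orders of magnitude faster.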

I get it, ROOT and C++ have been the go-to for high-energy physics for ages. But with the advent of Python libraries that support vectorization, the game’s changing. One particular benefit is that we get to use Python syntax, which is like a breath of fresh air compared to C++. Less boilerplate, more action. And all this without sacrificing performance: with vectorization, you can get close to, or even match, the performance of traditional C++ code in many cases.

Let’s say you’re working with an array of particle energies and you want to normalize them. In C++, you’d do something like:

#include <vector>
#include <algorithm>

int main() {
    // Use doubles: with integer division the normalized values
    // would all truncate to 0 or 1
    std::vector<double> energies = {100, 200, 150};
    double max_energy = *std::max_element(energies.begin(), energies.end());

    for (double &energy : energies) {
        energy /= max_energy;
    }
}

In Python with NumPy, it’s way shorter:

import numpy as np

energies = np.array([100, 200, 150])
max_energy = np.max(energies)
normalized_energies = energies / max_energy

Now, let’s move back to our discussion about jagged arrays. We already know a regular array is like a well-balanced bookshelf. Every row has the same number of elements. Let’s look at another example, this time involving particle energies:

# Array to hold energies of particles in MeV
particle_energies = [
    [100, 200, 150],
    [120, 180, 170],
    [110, 190, 160]
]

And hey, you can even calculate the average energy for each set of particles:

# Calculating the average energy
average_energies = [sum(row) / len(row) for row in particle_energies]

But jagged arrays are perfect for when each particle has a different number of hits in the detector. Let’s look at an example:

# Array to hold hits for different particles
particle_hits = [
    [0.1, 0.2],
    [0.3, 0.4, 0.5],
    [0.6]
]

Want to find the maximum number of hits any particle has? Sure thing:

# Finding max number of hits
max_hits = max(len(row) for row in particle_hits)

Again, let’s get back to tracks. Consider a scenario where each row represents a particle and the entries in that row are its track coordinates (x, y, z).

# Particle tracks as a jagged array
particle_tracks = [
    [(0, 1, 2)],
    [(3, 4, 5), (6, 7, 8)],
    [(9, 10, 11), (12, 13, 14), (15, 16, 17)]
]

Want to find the average x-coordinate for the second particle? Easy peasy!

# Calculating the average x-coordinate for the second particle
average_x = sum(coord[0] for coord in particle_tracks[1]) / len(particle_tracks[1])

Now you might think that life is easy with jagged arrays, but unfortunately that’s not the case. For starters, jagged arrays are not supported by NumPy. There is a library called Awkward Array that supports them, but I don’t want to get into the details of that library just yet. Instead, let’s dig into the problems from a developer’s perspective rather than a library user’s. While jagged arrays are a wonderful tool for handling non-uniform datasets, they come with their own set of issues.
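
To see what “not supported” means in practice, here’s a quick sketch (the exact behavior depends on your NumPy version; recent releases refuse ragged input outright):

import numpy as np

# Recent NumPy raises ValueError for ragged nested sequences
try:
    np.array([[1], [2, 3], [4, 5, 6]])
except ValueError as err:
    print(err)

# dtype=object "works", but you get a slow array of Python lists
# with none of NumPy's fast vectorized machinery
obj = np.array([[1], [2, 3], [4, 5, 6]], dtype=object)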

Problem 1: Data Alignment and Padding

In languages that are keen on type and memory safety, managing jagged arrays can be a mess. They often involve a layer of indirection, which can lead to inefficient memory usage and cache misses. The data isn’t stored contiguously, making it a poor candidate for SIMD (Single Instruction, Multiple Data) optimizations.
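
To make that concrete, here’s the classic workaround: padding the data back into a rectangle so it can live contiguously, at the price of wasted memory (the fill value and data are just illustrative):

import numpy as np

jagged = [[0.1, 0.2], [0.3, 0.4, 0.5], [0.6]]

# Pad every row to the longest length with NaN, trading wasted
# memory for one contiguous rectangular block
max_len = max(len(row) for row in jagged)
padded = np.full((len(jagged), max_len), np.nan)
for i, row in enumerate(jagged):
    padded[i, :len(row)] = row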

Problem 2: Inconsistent API and Data Contracts

When working with libraries that expect regular arrays (e.g. NumPy), the API and data contracts can become messy once you introduce jagged arrays. You may need to write a lot of custom glue code to handle them, which can be error-prone.
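
One typical piece of such custom code is a hand-rolled “flat buffer plus offsets” contract, sketched below; every consumer of the data then has to know and honor this convention:

import numpy as np

jagged = [[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]]

# Flatten the values into one contiguous buffer...
flat = np.concatenate([np.asarray(row) for row in jagged])
# ...and record where each row starts and ends
offsets = np.cumsum([0] + [len(row) for row in jagged])

# Reconstructing row i means slicing with the offsets yourself
row_1 = flat[offsets[1]:offsets[2]]  # array([2., 3.])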

Problem 3: Lack of Native Operations

Standard mathematical operations like matrix multiplications or element-wise additions are not directly applicable to jagged arrays. This means you often end up writing custom, complex loops, sacrificing both readability and performance.
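
For instance, even “add 5 to every element” pushes you back into explicit loops:

# No native "jagged + 5": you write the nested loop yourself
jagged = [[1, 2], [3, 4, 5], [6]]
shifted = [[x + 5 for x in row] for row in jagged]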

Taking these problems (and more!) into consideration, let’s see how Awkward Array comes to the rescue.

Awkward Array is not just a clever name; it’s a library designed to address these very issues. But how does it do that? Let’s look under the hood.

Efficient Memory Management

Awkward Array employs a columnar data layout, storing data contiguously wherever possible. This makes it cache-efficient and allows for SIMD operations. The data structure keeps metadata separately, which is used to reconstruct the original jagged shape when needed.
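
You can peek at this yourself. The sketch below assumes a reasonably recent Awkward Array, where an array’s internal structure is exposed through its layout attribute:

import awkward as ak

arr = ak.Array([[1], [2, 3], [4, 5, 6]])

# The printed layout shows one contiguous buffer of the six values
# plus a small offsets array ([0, 1, 3, 6]) marking the row boundaries
print(arr.layout)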

Unified API for Different Data Shapes

From a developer’s standpoint, one of the best things about Awkward Array is its unified API. Whether you’re working with regular or jagged arrays, the library provides a consistent way to manipulate data. This reduces the cognitive load and the likelihood of bugs.
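
Here’s a small sketch of what that looks like in practice, using ak.sum (the example arrays are mine, not from any particular dataset):

import awkward as ak

regular = ak.Array([[1, 2, 3], [4, 5, 6]])
jagged = ak.Array([[1], [2, 3], [4, 5, 6]])

# The exact same call works regardless of shape
print(ak.sum(regular, axis=1))  # [6, 15]
print(ak.sum(jagged, axis=1))   # [1, 5, 15]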

Vectorized Operations

Awkward Array is designed to leverage vectorized operations even on jagged data structures. It does so by translating high-level operations into low-level, contiguous array manipulations, essentially “flattening” the computation but not the data. This way, you can perform complex mathematical operations without resorting to explicit loops, making your code both faster and more readable.
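
In practice, that means NumPy-style expressions, and even NumPy ufuncs, apply element-wise across a jagged array with no explicit loop (a minimal sketch):

import awkward as ak
import numpy as np

hits = ak.Array([[0.1, 0.2], [0.3, 0.4, 0.5], [0.6]])

# Whole-array arithmetic, no Python loop in sight
scaled = hits * 2 + 1

# NumPy ufuncs dispatch straight to Awkward's contiguous buffers
roots = np.sqrt(hits)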

Awkward Array uses a combination of different data structures under the hood:

  1. Lists: These handle the jaggedness of the data, essentially the varying lengths of the sub-arrays.
  2. NumPy arrays: For the actual data storage, making it efficient.
  3. Records and tables: These are used for more complex, nested structures (see the sketch below).
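
Here’s a minimal sketch of that third kind of structure: a record array mixing a per-particle scalar with a jagged list of hits (the field names pt and hits are made up for illustration):

import awkward as ak

particles = ak.Array([
    {"pt": 25.1, "hits": [0.1, 0.2]},
    {"pt": 10.4, "hits": [0.3]},
])

# Fields are accessed by name, and each keeps its own shape
print(particles.pt)    # [25.1, 10.4]
print(particles.hits)  # [[0.1, 0.2], [0.3]]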

The library utilizes lazy evaluation, meaning it doesn’t perform computations until absolutely necessary. This allows for better optimization and reduced computational overhead.

I want to conclude this chat by saying that I’ve been working more closely, and more deeply, with jagged arrays recently. It’s part of my work on writing an inference engine for a deep-learning-based primary vertex reconstruction algorithm that is being developed for LHCb (and other LHC experiments). I will be writing more about this in the future; I just wanted to share my thoughts for now!