@jpivarski
Created October 15, 2024 23:34
Curated Awkward/Uproot questions from StackOverflow

-- Question -----------------------------------------------

I am trying to rebin my hist objects from linear bins to log(x) bins. I see in the UHI docs on indexing (https://uhi.readthedocs.io/en/latest/indexing.html#) that it is possible to rebin linearly with h[::rebin(2)], but this still gives all bins the same width. Is it possible to rebin with user-defined bin edges? In this case, I want log(x) bins. If it is not possible, I am also thinking about building a new hist object with log(x) bins by slicing the original histogram in a loop, one iteration per bin edge.

-- Answer ------------

Linear to log won't work because edges can't move. Ultimately, variable-width rebinning boils down to "map x bins into one" over and over, where x changes, so the final set of bin edges will always be a subset of the original bin edges. You can make them sort-of log-spaced by rounding each desired edge to the nearest linear bin edge (but I'd highly recommend waiting for the next release of boost-histogram, as full UHI support, including variable rebinning, is currently an open PR that I'm going to be working on this week).

It is much better to do a log binning on the original data, then you'll have actual log-spaced bin edges.
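For instance, here's a minimal sketch of that approach using hist (the raw values x, the positivity of x, and the bin count are assumptions, not from the question):

import numpy as np
from hist import Hist

# x is assumed to be the original raw data (all values > 0, so log10 is defined)
edges = np.logspace(np.log10(x.min()), np.log10(x.max()), 26)  # 25 log-spaced bins
h = Hist.new.Variable(edges, name="x").Double()
h.fill(x=x)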

-- Question -----------------------------------------------

I'm trying to access the elements of an awkward array that do not correspond to some particular set of indices. I have 3 events in total with one jet per event and some number of leptons. Each lepton has a particular flag associated with it. For each jet I keep track of the indices of the leptons in that jet:

jet_lepton_indices = ak.Array([[0, 2], [1], [2,3]])
print(f'jet_lepton_indices\n{jet_lepton_indices}\n')

lepton_flags = ak.Array([[0, 10, 20, 30], [0, 10, 20, 30], [0, 10, 20, 30, 40]])
print(f'lepton_flags\n{lepton_flags}\n')

The output:

jet_lepton_indices
[[0, 2], [1], [2, 3]]

lepton_flags
[[0, 10, 20, 30], [0, 10, 20, 30], [0, 10, 20, 30, 40]]

If I want the flags of the leptons that are in each jet I do lepton_flags[jet_lepton_indices] and get:

[[0, 20],
 [10],
 [20, 30]]

But I also need to access all the lepton flags associated with leptons that are not in the jets. I'd like to be able to produce:

[[10, 30],
 [0, 20, 30],
 [0, 10, 40]]

I thought I could do lepton_flags[~jet_lepton_indices], but that has behavior I don't understand. I also thought there might be a way to do it by flattening/unflattening, but I can't figure that out either.

-- Answer ------------

(The "TL;DR" is at the bottom, below the horizontal line.)

The ~ (bitwise not) didn't work on your array of integers because it just inverted the bits in the integers:

>>> jet_lepton_indices
<Array [[0, 2], [1], [2, 3]] type='3 * var * int64'>
>>> ~jet_lepton_indices
<Array [[-1, -3], [-2], [-3, -4]] type='3 * var * int64'>

Ultimately, what you want is to convert your integer array slice into a boolean array slice. As slices, integer arrays have strictly more information than boolean arrays: they can duplicate and change the order of elements from the sliced array in addition to just dropping elements. Thus, integer array slices can always be converted into boolean array slices, but not the other way around. In fact, there has been a request for such a function, #497, and that issue describes several ways of getting it, all different from the one I worked out below. (I'm still going to show the example that I just worked out because it's simpler and demonstrates a common pattern: cartesian to increase dimensions, do something in the new dimension, then aggregate over it to get back to the old number of dimensions.)

Another fact about boolean array slices is that they have to agree with the list-lengths of the array that they slice. (Footnote: to invert a selection, you need to know the universal set, so that's why it's only possible to invert a boolean array slice.) Therefore, to convert an integer array slice into a boolean array slice, we need to use the array to be sliced, lepton_flags. We can use ak.local_index to make an integer array of integer indexes for all elements that exist in lepton_flags:

>>> all_indices = ak.local_index(lepton_flags)
>>> all_indices
<Array [[0, 1, 2, 3], [0, ..., 3], [0, 1, 2, 3, 4]] type='3 * var * int64'>

Now the goal will be to find booleans for each one of these indices that say whether the index is in jet_lepton_indices or not. That kind of question has the form, "for each item in X (the local index), is there any item in Y (jet_lepton_indices) for which Z (they're equal)?" The "for each" of one array with another array is handled by ak.cartesian, and since we'll want to aggregate over everything associated with a single item of X ("is ak.any item equal?") we'll need nested=True to make a new dimension, to later aggregate over.

>>> pairs = ak.cartesian([all_indices, jet_lepton_indices], nested=True)
>>> pairs.show(type=True)
type: 3 * var * var * (
    int64,
    int64
)
[[[(0, 0), (0, 2)], [(1, 0), (1, 2)], [(2, ...), ...], [(3, 0), (3, 2)]],
 [[(0, 1)], [(1, 1)], [(2, 1)], [(3, 1)]],
 [[(0, 2), (0, 3)], [(1, 2), (1, 3)], ..., [(3, ...), ...], [(4, 2), (4, 3)]]]

The pairs are more deeply nested (var * var *) than all_indices and jet_lepton_indices (var *) because we asked for the results to be grouped by same-first-index (nested=True).

The left item in each of these pairs is from all_indices and the right is from jet_lepton_indices, for all combinations. To separate them, use ak.unzip:

>>> whole_set, in_set = ak.unzip(pairs)
>>> whole_set
<Array [[[0, 0], [1, 1], [...], [3, 3]], ...] type='3 * var * var * int64'>
>>> in_set
<Array [[[0, 2], [0, 2], [...], [0, 2]], ...] type='3 * var * var * int64'>

The whole_set and in_set line up because they come from the same pairs. Since they line up, we can use == on them, to get a boolean that's True if and only if a member of the whole_set is in the in_set.

>>> whole_set == in_set
<Array [[[True, False], ..., [False, ...]], ...] type='3 * var * var * bool'>

If any (ak.any) of these innermost lists (axis=-1) is True, then we want to say that the whole group, representing an item from all_indices, is in jet_lepton_indices.

>>> jet_lepton_boolean = ak.any(whole_set == in_set, axis=-1)
>>> jet_lepton_boolean
<Array [[True, False, True, False], ..., [False, ...]] type='3 * var * bool'>

This jet_lepton_boolean is a boolean array that can be used as a slice to produce the same elements as jet_lepton_indices. As a boolean, it can be negated with ~.

>>> lepton_flags[~jet_lepton_boolean]
<Array [[10, 30], [0, 20, 30], [0, 10, 40]] type='3 * var * int64'>

That's the selection of lepton_flags that you want: it's everything except what was in

>>> lepton_flags[jet_lepton_indices]
<Array [[0, 20], [10], [20, 30]] type='3 * var * int64'>

As an alternative, you could have constructed the negated booleans directly by using != instead of == (and ak.all instead of ak.any, since "not any are equal" is the same as "all are not equal").


Here's a summary of this method, as a function:

def indices_to_booleans(indices, array_to_slice):
    whole_set, in_set = ak.unzip(ak.cartesian([
        ak.local_index(array_to_slice), indices
    ], nested=True))
    return ak.any(whole_set == in_set, axis=-1)
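
For example, applying it to the arrays from the question reproduces the results above:

mask = indices_to_booleans(jet_lepton_indices, lepton_flags)
print(lepton_flags[mask])   # [[0, 20], [10], [20, 30]]
print(lepton_flags[~mask])  # [[10, 30], [0, 20, 30], [0, 10, 40]]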

This solution depends on the fact that your original arrays are only one level deep (var *), though I think it might generalize if you pass the appropriate axis argument to ak.cartesian, but I haven't thought about it enough to be sure.

Also, #497 provides more ways of doing it.

-- Question -----------------------------------------------

Note: I am using awkward version 1.10.3.

So, the general overview is that I have a set of data that is in awkward arrays, and I want to be able to pass this data to a simple feedforward pytorch model. I believe that pytorch doesn't natively handle awkward arrays so I am planning on converting the data to either torch or numpy arrays before passing through to the model. I should also note that whilst the data is stored in awkward arrays, at this point the data is not jagged.

Here is an example of the input data and of what I am looking for:

import awkward as ak
import numpy as np
import torch


arr = ak.Array({"MET_pt" : [0.0, 100.0, 20.0, 30.0, 4.0],
                 "MET_phi" : [0, 0.1, 0.2, 0.3, 0.4],
                 "class" : [0, 1, 0, 1, 0]})

# These are my input features
x = arr[['MET_pt', 'MET_phi']]
# These are my class labels
y = arr['class']
#
## Here would be the code converting to torch tensors 
# 
x_torch = torch.tensor([[0, 0], [100, 0.1], [20, 0.2], [30, 0.3], [4, 0.4]])

y_torch = torch.tensor([0, 1, 0, 1, 0])

However, I cannot find an easy way to convert x from an Awkward Array to a torch tensor. I can easily convert y to a torch tensor by simply doing:

torch.tensor(y)
> tensor([0, 1, 0, 1, 0])

But I am unable to do this for the x array:

torch.tensor(x)
> TypeError: object of type 'Record' has no len()

This led me to the idea of converting to NumPy arrays first:

torch.tensor(ak.to_numpy(x))
> TypeError: can't convert np.ndarray of type numpy.void. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool

But as you can see this doesn't work either.

I think the problem lies in the fact that the ak.to_numpy() function converts the x array to:

ak.to_numpy(x)
> array([(  0., 0. ), (100., 0.1), ( 20., 0.2), ( 30., 0.3), (  4., 0.4)],
      dtype=[('MET_pt', '<f8'), ('MET_phi', '<f8')])

where I want it to convert like:

ak.to_numpy(x)

> [[0, 0], [100, 0.1], [20, 0.2], [30, 0.3], [4, 0.4]]

Is there any way of converting an N-dimensional, non-jagged Awkward Array such as x into the format shown immediately above? Or is there a smarter way to convert directly to torch tensors?

Sorry if this is a stupid question! Thanks!

-- Answer ------------

You've already converted y; the problem with x is that it's not a purely numerical array. You could convert x to a NumPy structured array,

>>> ak.to_numpy(x)
array([(  0., 0. ), (100., 0.1), ( 20., 0.2), ( 30., 0.3), (  4., 0.4)],
      dtype=[('MET_pt', '<f8'), ('MET_phi', '<f8')])

which can then be viewed and reshaped to get the array that you want:

>>> ak.to_numpy(x).view("<f8").reshape(-1, 2)
array([[  0. ,   0. ],
       [100. ,   0.1],
       [ 20. ,   0.2],
       [ 30. ,   0.3],
       [  4. ,   0.4]])

but this relies strongly on the fact that all of the fields are the same type, "<f8" (doubles). If you had a mix of floating-point numbers and integers (charge?), or numbers of different bit-widths, then this wouldn't work.

Here's a better method: break up x (or the original arr) into its two fields, first.

>>> x["MET_pt"]
<Array [0, 100, 20, 30, 4] type='5 * float64'>
>>> x["MET_phi"]
<Array [0, 0.1, 0.2, 0.3, 0.4] type='5 * float64'>

What you want to do is interleave these so that you get one value from "MET_pt", followed by one value from "MET_phi", then the next value from "MET_pt", and so on. If you first put the values in length-1 lists, which is a reshaping (can be done in Awkward or NumPy, with the same syntax),

>>> x["MET_pt", :, np.newaxis]
<Array [[0], [100], [20], [30], [4]] type='5 * 1 * float64'>
>>> x["MET_phi", :, np.newaxis]
<Array [[0], [0.1], [0.2], [0.3], [0.4]] type='5 * 1 * float64'>

then what you want is to concatenate each of these length-1 lists from the first array with each of the length-1 lists from the second array. That is, you want to concatenate them not at axis=0, but at axis=1, the first level deep of lists (see ak.concatenate or np.concatenate).

>>> np.concatenate((x["MET_pt", :, np.newaxis], x["MET_phi", :, np.newaxis]), axis=1)
<Array [[0, 0], [100, 0.1], ..., [30, ...], [4, 0.4]] type='5 * 2 * float64'>

Now you can pass it to Torch.

>>> torch.tensor(np.concatenate((
...     x["MET_pt", :, np.newaxis], x["MET_phi", :, np.newaxis]
... ), axis=1))
tensor([[  0.0000,   0.0000],
        [100.0000,   0.1000],
        [ 20.0000,   0.2000],
        [ 30.0000,   0.3000],
        [  4.0000,   0.4000]], dtype=torch.float64)
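
If you had more than two fields, the same idea can be written as a loop over the fields. Here's a hedged generalization (assuming every field is numeric, so NumPy can promote them to a common dtype):

import numpy as np
import torch

columns = [ak.to_numpy(x[field]) for field in ak.fields(x)]  # one 1-D array per field
x_torch = torch.tensor(np.stack(columns, axis=1))            # shape (n_rows, n_fields)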

-- Question -----------------------------------------------

I'm thinking of migrating from awkward 1 to 2.

I used lazy reading from parquet file using awkward1.

tree = ak.from_parquet(filename, lazy=True) # data is not read
print(np.max(tree["x"]),end=" ") # "x" is just read here.

How can I realize this with awkward2?

-- Answer ------------

The one major interface change from Awkward 1.x to Awkward 2.x is that lazy arrays (PartitionedArray of VirtualArray) have been replaced by dask-awkward. The motivation for this was to give users more control over when the array-reading happens (when you call .compute(), and not before), as well as where (distributed on any cluster that runs Dask jobs).

Here's an example:

>>> import awkward as ak
>>> import dask_awkward as dak
>>> tree = dak.from_parquet("https://pivarski-princeton.s3.amazonaws.com/chicago-taxi.parquet")
>>> tree
dask.awkward<from-parquet, npartitions=1>

This object represents data that have not been read yet. It's a kind of lazy array. It has all of the metadata, such as field names and data types.

>>> ak.fields(tree)
['trip', 'payment', 'company']
>>> ak.fields(tree.trip)
['sec', 'km', 'begin', 'end', 'path']
>>> tree.trip.type.show()
?? * var * {
    sec: ?float32,
    km: ?float32,
    begin: {
        lon: ?float64,
        lat: ?float64,
        time: ?datetime64[ms]
    },
    end: {
        lon: ?float64,
        lat: ?float64,
        time: ?datetime64[ms]
    },
    path: var * {
        londiff: float32,
        latdiff: float32
    }
}

You can perform a lazy computation (all ak.* functions work on dak.Arrays), and it remains lazy, unlike an Awkward 1.x array:

>>> result = ak.max(tree.trip.sec)
>>> result
dask.awkward<max, type=Scalar, dtype=float32>

But when you call .compute(), you get fully evaluated data.

>>> result.compute()
86400.0

See dask.org for distributing and scaling up processes, either on one computer or on a cluster of computers.
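
So the original snippet translates roughly into the following sketch (filename and the field "x" are carried over from the question):

import awkward as ak
import dask_awkward as dak

tree = dak.from_parquet(filename)            # data is not read yet
print(ak.max(tree["x"]).compute(), end=" ")  # "x" is only read when .compute() is called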

-- Question -----------------------------------------------

I am working in jupyter notebook and have a pandas dataframe from which I would like to fill a ROOT TH3F histogram and save it to a ROOT file using uproot. I haven't been able to find much in the way of examples that would illustrate how to do this, but here is what I assume is the procedure:

  1. Declare a ROOT TH3F and iterate over the dataframe to fill the histogram.
  2. Open ("recreate") a new ROOT file with uproot and write this histogram to it.

Below is some example code that shows how I tried to go about it (incorrectly, because it segfaults).

import ROOT as R
import uproot as ur
import numpy as np
import pandas as pd

# Example dataframe
data = {
'x': [9.5, 5.0, 2.2, 8.1, 5.5, 1.4, 2.5, 9.2, 3.0, 7.9],
'y': [2.0, 5.7, 1.3, 9.1, 6.0, 6.2, 5.8, 1.8, 5.8, 3.1],
'z': [7.5, 4.1, 3.1, 1.6, 2.4, 8.2, 1.3, 4.4, 2.3, 5.0]
}
df = pd.DataFrame(data)

# Fill TH3F
xyz_hist = R.TH3F('xyz', 'xyz', 100, 0, 10, 100, 0, 10, 100, 0, 10)
for index, row in df.iterrows():
     xyz_hist.Fill(row['x'], row['y'], row['z'])

# Open file and write histogram
outfile = ur.recreate('outfile.root')
outfile['xyz'] = xyz_hist

Could someone please clarify what is the correct way to go about it? Or is this wrong because I am trying to use uproot for something that it wasn't intended/built for, and the solution is to just use ROOT for opening the file, storing the histogram, etc.?

-- Answer ------------

I executed exactly your code and encountered no issues, regardless of whether I read the histogram back into ROOT:

import ROOT
f = ROOT.TFile("outfile.root")
h = f.Get("xyz")
h.Draw()

or Uproot and hist:

import uproot
f = uproot.open("outfile.root")
h = f["xyz"]
h.to_hist()

so you might just have an old version of one of the packages and be seeing a bug that has since been fixed. Here are the versions with which I successfully tested the above:

  • root 6.30.2 (from conda-forge)
  • uproot 5.2.3
  • hist 2.7.2
  • numpy 1.26.3
  • pandas 2.2.0

More generally, I'd like to point out a few things.

  1. It's not necessary to iterate over the DataFrame with df.iterrows(); doing so defeats the purpose of putting data into arrays that can be manipulated with precompiled routines. I'll show an example in a moment.
  2. It's not necessary to make a ROOT object (TH3) in order to save the data with Uproot. If ROOT is available, you can save it with ROOT. Uproot is a pure Python alternative to ROOT, so there's not much point in mixing them (although it works because Uproot asks ROOT to serialize its data and Uproot knows how to deserialize it, so this TH3 is effectively being "saved" in memory and then loaded from that in-memory file, before Uproot writes it to a disk-file).

Here's a way that it can be done entirely with Uproot and hist:

import uproot
from hist import Hist
import pandas as pd

data = {
    "x": [9.5, 5.0, 2.2, 8.1, 5.5, 1.4, 2.5, 9.2, 3.0, 7.9],
    "y": [2.0, 5.7, 1.3, 9.1, 6.0, 6.2, 5.8, 1.8, 5.8, 3.1],
    "z": [7.5, 4.1, 3.1, 1.6, 2.4, 8.2, 1.3, 4.4, 2.3, 5.0],
}
df = pd.DataFrame(data)

xyz_hist = Hist.new.Reg(100, 0, 10).Reg(100, 0, 10).Reg(100, 0, 10).Double()
xyz_hist.fill(data["x"], data["y"], data["z"])

outfile = uproot.recreate("outfile.root")
outfile["xyz"] = xyz_hist

and here's how it can be done entirely with ROOT:

import ROOT
import pandas as pd

data = {
    "x": [9.5, 5.0, 2.2, 8.1, 5.5, 1.4, 2.5, 9.2, 3.0, 7.9],
    "y": [2.0, 5.7, 1.3, 9.1, 6.0, 6.2, 5.8, 1.8, 5.8, 3.1],
    "z": [7.5, 4.1, 3.1, 1.6, 2.4, 8.2, 1.3, 4.4, 2.3, 5.0],
}
df = pd.DataFrame(data)

rdf = ROOT.RDF.FromNumpy({
    "x": df["x"].values, "y": df["y"].values, "z": df["z"].values
})
h = rdf.Histo3D(("xyz", "", 100, 0, 10, 100, 0, 10, 100, 0, 10), "x", "y", "z")

outfile = ROOT.TFile("outfile.root", "RECREATE")
h.Write()
outfile.Close()

(In both cases, the Pandas DataFrame is also superfluous; both the Hist.fill and the ROOT.RDF.FromNumpy methods actually want NumPy arrays. In the ROOT case, I have to explicitly pull NumPy arrays out of the DataFrame. However, I assume that you have a reason for wanting to use Pandas that goes beyond this example.)

-- Question -----------------------------------------------

I'm new to the uproot library and just a user.
I want to write string-type data into a ROOT file, but I get the error below:

Error: TypeError: cannot write Awkward Array type to ROOT file: 46 * string

Aim: I want to extract some information from 10 ROOT files via an algorithm and merge it into one ROOT file. I stored this information in a list.

The ROOT file and the code can be found here (Google Drive): https://drive.google.com/file/d/1U6-Ug-uq3sHMwsY9ISJj8K2Dw9ssuLlF/view?usp=sharing and https://drive.google.com/file/d/1a32KQw30iLBv_H9qWcNUEsBAfEDOSTVM/view?usp=sharing

Does anyone know how to solve this problem? I would be very grateful for any help.

-- Answer ------------

The ability to write strings was added in PR #940 (Oct 2023), which is in releases 5.0.13, 5.1.0, and later. Here's a demo:

>>> import awkward as ak
>>> import uproot
>>> strings = ak.Array([f"str{i}" for i in range(46)])
>>> print(strings.type)
46 * string
>>> with uproot.recreate("output.root") as file:
...     file["tree"] = {"branch": strings}
... 

Then, restart Python (to prove that it's really in the file) and read it back:

>>> import uproot
>>> with uproot.open("output.root") as file:
...     strings = file["tree"]["branch"].array()
... 
>>> print(strings.type)
46 * string
>>> strings
<Array ['str0', 'str1', 'str2', ..., 'str44', 'str45'] type='46 * string'>

So you should be able to just upgrade Uproot and it will work.
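
If the end goal is the stated aim (collecting strings from several files and merging them into one output file), a hedged sketch might look like this; input_files, "tree", and "branch" are placeholder names, not from the question:

import awkward as ak
import uproot

collected = []
for path in input_files:  # the 10 input ROOT files
    with uproot.open(path) as f:
        collected.append(f["tree"]["branch"].array())

with uproot.recreate("merged.root") as out:
    out["tree"] = {"branch": ak.concatenate(collected)}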

-- Question -----------------------------------------------

I am experimenting with the dask functionality of uproot, i.e. loading branches into dask arrays.

Unfortunately I am not understanding why things happen when trying to perform computations on these arrays, e.g.

import dask.array as da
import uproot

tree = uproot.dask("file.root:tree", library='np')
branch_data = tree["testbranch"]
mean = da.mean(branch_data).compute()

The branch data is a 2-dimensional array and I would like to compute the mean along axis=1, i.e. for each row. Strangely, the output is the same as if I did:

np.mean(branch_data.T, axis = 1)

It somehow computes the mean of the columns instead. Trying to add axis=1 results in an error saying "axis out of bounds for array with dimension 1". But when I call compute() on the branch to print the actual data, it clearly is the expected 2-dimensional array. This also happens with other methods like da.sum().

EDIT: Here I provide an example that reproduces the problem:

import numpy as np
import uproot

# creating sample file with tree
with uproot.recreate("test.root") as file:
    file["test_tree"] = {"test_branch": np.random.random((100,10))}

# Standard uproot (output I aim for with the dask option)
tree = uproot.open("./test.root:test_tree")
branch = tree["test_branch"].array(library = 'np')
mean = np.mean(branch, axis = 1)
print(mean)

# Uproot-Dask (Will compute mean columnwise but would expect a single scalar. Strange...)
tree = uproot.dask("./test.root:test_tree", library = 'np')
branch = tree["test_branch"]
mean = np.mean(branch).compute()
print(mean)

#This should correspond to the standard uproot output but does not work. Also strange
mean = np.mean(branch, axis = 1).compute()
print(mean)

-- Answer ------------

According to the uproot manual, the uproot.dask() function "returns an unevaluated Dask array from TTrees."

Playing with the branch object you created in your sample code, I was able to convert it to a plain Numpy array by calling compute() on it:

import numpy as np
import uproot

with uproot.recreate("test.root") as file:
    file["test_tree"] = {"test_branch": np.random.random((100,10))}

tree = uproot.dask("./test.root:test_tree", library = 'np')
branch = tree["test_branch"]
b_as_array = branch.compute()

Continuing interactively:

>>> branch
dask.array<test_branch-from-uproot, shape=(100,), dtype=float64, chunksize=(100,), chunktype=numpy.ndarray>
>>> b_as_array = branch.compute()
>>> b_as_array.shape
(100, 10)
>>> np.mean(b_as_array, axis=0)
array([0.54450986, 0.48361194, 0.52477069, 0.50902231, 0.52925032,
       0.47309532, 0.49022969, 0.48736406, 0.5027256 , 0.56298907])

Note that I had to use axis=0 to get the mean you want, since np.mean(b_as_array) returned just one number: the average of all 1000 numbers.

Now, you probably want to keep the efficiency of Dask arrays, and Dask itself provides the same operations. You should probably use the Dask implementations rather than the equivalent Numpy ones.

E.g.

>>> branch.mean().compute()
array([0.54450986, 0.48361194, 0.52477069, 0.50902231, 0.52925032,
       0.47309532, 0.49022969, 0.48736406, 0.5027256 , 0.56298907])

Lots more details at https://docs.dask.org/en/stable/array.html

-- Question -----------------------------------------------

The problem popped up when I tried to use the function awkward.load(): I get the message "module 'awkward' has no attribute 'load'". I think the problem is that the newest version of awkward has removed this attribute, but I cannot find any alternative. Does anyone know how to fix it? The version of awkward is 2.4.3.

I've already imported the module.

-- Answer ------------

You're right, the awkward.load function is part of the Awkward 0.x interface, and you're using Awkward 2.x. The serialization format used in Awkward 0.x is no longer supported. Since Awkward 1.0.1 (December 2020), the pickle format has been kept backward-compatible.

Since you're trying to load files, I can tell you there is a way to do it across the three major versions. A package named awkward0 can be installed and imported alongside other versions of Awkward Array. You can use that to load the data from the 0.x file format. Awkward 1.x has an awkward.from_awkward0 function that can turn an Awkward 0 array in memory into an Awkward 1 array in memory. Then pickle that array in a file and load it into a Python process that has Awkward 2 installed. Awkward 1 and Awkward 2 can't both be installed in the same Python environment, because they're just two versions of the same package. (That's normally the case for Python packages, and the trick we pulled with making awkward0 a package on its own caused other problems.)
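
As a hedged sketch of that migration path (the filenames are placeholders, and it assumes the old file was written with Awkward 0.x's save function):

# Step 1: run in an environment with awkward0 and Awkward 1.x installed
import pickle
import awkward0
import awkward as ak1  # this import must resolve to Awkward 1.x here

old_array = awkward0.load("data_awkward0.awkd")  # placeholder filename
new_array = ak1.from_awkward0(old_array)
with open("migrated.pkl", "wb") as f:
    pickle.dump(new_array, f)

# Step 2: run in an environment with Awkward 2.x installed
import pickle
with open("migrated.pkl", "rb") as f:
    array = pickle.load(f)  # loads as an Awkward 2 array, thanks to the pickle compatibility above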

But be aware that this is not the only interface change. In each major version of a software package, semantic versioning allows for interface changes. The interface changes from Awkward 0 to Awkward 1 were significant. The above technique, complicated though it is, will allow you to restore all the data, but code needs to be updated if you're going to use the new version.

-- Question -----------------------------------------------

I am trying to use Uproot to read some existing trees. I can open the ROOT file with trees, I can "get" the trees and "show" their branches. I've come across branches that I don't know how to "dump" and/or whose content I don't know how to get.

Some branches contain standard ROOT objects:

Vtx/Vtx.NuFileName              | TObjString*[]                        | AsObjects(AsArray(True, False, AsPointer(Model_TObjString), ()))
SYST/SYST._cov_matrix           | TMatrixT<double>[]                   | AsObjects(AsArray(True, False, Model_TMatrixT_3c_double_3e_, ()))

Some branches are "experiment specific" / "custom" (I don't have ROOT "dictionaries" for them):

POTInfo_v2                      | Header                               | AsObjects(Model_Header)
SEL/SEL._rootStep               | StepBase*[]                          | AsObjects(AsArray(True, False, AsPointer(Model_StepBase), ()))
SEL/SEL._firstSteps             | std::vector<StepBase*>[]             | AsObjects(AsArray(True, False, AsVector(False, AsPointer(Model_StepBase)), ()))
CATEG/CATEG._types              | std::vector<TrackTypeDefinition>[]   | AsObjects(AsArray(True, False, AsVector(False, Model_TrackTypeDefinition), ()))
CONF/CONF._toys                 | std::vector<ToyVariationWrite>[]     | AsObjects(AsArray(True, False, AsVector(False, Model_ToyVariationWrite), ()))

Thanks in advance.

-- Answer ------------

The TTree.show method only displays a summary of the TBranches, a tabular view of TBranch.typename (C++ type name) and TBranch.interpretation (Pythonic interpretation). In both cases, this will show you some levels of depth (e.g. std::vector containing XYZ), but classes are opaque.

If you want to know what fields are in a class definition, even if it is a user-defined class, you can get that information from the ROOT file's "streamer info." Here's an example:

>>> import uproot
>>> import skhep_testdata
>>> root_directory = uproot.open(skhep_testdata.data_path("uproot-issue283.root"))

This file, uproot-issue283.root, was from a search for supernovas in IceCube, and it has custom data types with names like SN_Analysis_Configuration_t, I3Eval_t, and SN_File_t. Knowing the C++ type name of the class (not the Python name, which starts with "Model_"), you can look it up with uproot.ReadOnlyFile.streamer_named:

>>> root_directory.file.streamer_named("I3Eval_t").show()
I3Eval_t (v7): TObject (v1)
    theDataArray: Sni3DataArray* (TStreamerObjectAnyPointer)
    NumberOfChannels: int (TStreamerBasicType)
    NoAvailableSlices: int (TStreamerBasicType)
    AvailableDataSize: int (TStreamerBasicType)
    mGPSCardId: int (TStreamerBasicType)
    mGPSPrescale: int (TStreamerBasicType)
    mGPSEventNo: int (TStreamerBasicType)
    mScalerCardId: int (TStreamerBasicType)
    mScalerStartChannel: int (TStreamerBasicType)
    StartUTC: long (TStreamerBasicType)
    MaxChannels: int (TStreamerBasicType)
    mMaxJitterLogs: int (TStreamerBasicType)
    Channel: I3Eval_t::ChannelContainer_t* (TStreamerObjectAnyPointer)
    ChannelIDMap: map<long,int> (TStreamerSTL)
    BadChannelIDSet: set<long> (TStreamerSTL)
    ChannelID: long* (TStreamerBasicPointer)
    Deadtime: double* (TStreamerBasicPointer)
    Efficiency: double* (TStreamerBasicPointer)

The information here includes the class version (v7), its inheritance (TObject (v1)), its class attributes, their C++ types, and the ROOT streamer that maintains them (e.g. TStreamerBasicType). Note that ROOT can only store attributes, which are usually private in C++, not any methods or functions. When you get an instance of one of these C++ classes, you can access any member with the uproot.Model.member method (see also members, all_members to include superclasses, and has_member):

>>> root_directory["config/detector"].member("theDataArray")
<Sni3DataArray (version 1) at 0x7efd043d1ca0>

>>> root_directory["config/detector"].member("NumberOfChannels")
5160

>>> root_directory["config/detector"].member("Efficiency")
array([1.  , 1.  , 1.  , ..., 1.35, 1.35, 1.35])

If you're getting the objects from a TTree, rather than directly out of a TDirectory as in the above example, you'll (probably) have to get the data with library="np" (NumPy arrays of Python objects) instead of the default library="ak" (Awkward Arrays).

To see all of the streamers, there's an uproot.ReadOnlyFile.show_streamers method:

>>> root_directory.file.show_streamers()
SN_Streaming_All_t (v3): SN_Streaming_t (v2)
    iamdaq: int (TStreamerBasicType)
    no_channels: int (TStreamerBasicType)
    data: unsigned short* (TStreamerBasicPointer)

SN_Streaming_t (v2)
    base_time: unsigned long (TStreamerBasicType)
    gps_time_high: unsigned long (TStreamerBasicType)
    gps_time_low: unsigned long (TStreamerBasicType)
    buf_pos: unsigned long (TStreamerBasicType)
    buf_loop: unsigned long (TStreamerBasicType)

SN_Streaming_GPS_t (v3): SN_Streaming_t (v2)
    real_gps_low: unsigned long long (TStreamerBasicType)
    real_gps_high: unsigned long long (TStreamerBasicType)
    event_no: unsigned long long (TStreamerBasicType)

...

and you can limit its output to a class and the classes that class depends on by passing the C++ name:

>>> root_directory.file.show_streamers("I3Eval_t")
I3Eval_t::ChannelContainer_t (v1)

Sni3DataArray (v1)

TObject (v1)
    fUniqueID: unsigned int (TStreamerBasicType)
    fBits: unsigned int (TStreamerBasicType)

I3Eval_t (v7): TObject (v1)
    theDataArray: Sni3DataArray* (TStreamerObjectAnyPointer)
    NumberOfChannels: int (TStreamerBasicType)
    NoAvailableSlices: int (TStreamerBasicType)
    AvailableDataSize: int (TStreamerBasicType)
    mGPSCardId: int (TStreamerBasicType)
    mGPSPrescale: int (TStreamerBasicType)
    mGPSEventNo: int (TStreamerBasicType)
    mScalerCardId: int (TStreamerBasicType)
    mScalerStartChannel: int (TStreamerBasicType)
    StartUTC: long (TStreamerBasicType)
    MaxChannels: int (TStreamerBasicType)
    mMaxJitterLogs: int (TStreamerBasicType)
    Channel: I3Eval_t::ChannelContainer_t* (TStreamerObjectAnyPointer)
    ChannelIDMap: map<long,int> (TStreamerSTL)
    BadChannelIDSet: set<long> (TStreamerSTL)
    ChannelID: long* (TStreamerBasicPointer)
    Deadtime: double* (TStreamerBasicPointer)
    Efficiency: double* (TStreamerBasicPointer)

-- Question -----------------------------------------------

I'm reading a root file using uproot and converting parts of it into a DataFrame using the arrays method.
This works fine, until I try to save to parquet using the to_parquet method on the dataframe. Sample code is given below.

# First three lines are here to rename the columns and choose what data to keep
data = pd.read_csv(dictFile, header = None, delim_whitespace=True)
dataFile, dataKey = data[0], data[1]
content_ele = {dataKey[i]: dataFile[i] for i in np.arange(len(dataKey))}

# We run over the different files to save a simplified version of them.
file_list = pd.read_csv(file_list_loc, names=["Loc"])

for file_loc in file_list.Loc:

    tree = uproot.open(f"{file_path}/{file_loc}:CollectionTree")

    arrays = tree.arrays(dataKey, library="pd").rename(columns=content_ele)

    save_loc = f"{save_path}/{file_loc[:-6]}reduced.parquet"
    arrays.to_parquet(path=save_loc)

Doing so results in the following error: __arrow_array__() got an unexpected keyword argument 'type'
It seems to originate from pa.array, if that helps out.

Of note, the simplest DataFrame I've had this error with has 2 columns containing Awkward Arrays (awkward.highlevel.Array) whose lengths vary from row to row, but are the same in both columns of a given row. An example is given below.

           A                      B
0	[31, 26, 17, 23]	[-2.1, 1.3, 0.5, -0.4]
1	[75, 15, 49]	    [2.4, -1.8, 0.8] 
2	[58, 45, 64, 47]	[-1.9, -0.4, -2.5, 1.3]
3	[26]	            [-1.1] 

I've tried reducing which elements I run on (such as only integers) and reducing the number of columns, as above.
However, running this exact same method with to_json gives no errors. The problem with that method is that once I read the file back, what were previously Awkward Arrays are now just lists, making it much more impractical to work with whenever I want to calculate something like array.A/2. Yes, I could just convert it, but it seems wiser to keep the original format, and it is easier since I don't have to do it each time.

-- Answer ------------

Solution: Upgrade your awkward-pandas package. When I first tried to reproduce your problem with awkward-pandas version 2022.12a1, I saw the same error, then I upgraded to 2023.8.0 and it's gone.

Detective work: I'm writing all of this down because I'm so proud of myself. :)

I'm guessing that the data in f"{file_path}/{file_loc}:CollectionTree" is ragged. There's no indication of this in your example, but if it were purely numerical data types (no variable-length lists or nested data structures), then the arrays would be a normal Pandas DataFrame. If, in that case, you got an error, it would be a Pandas error—possible, but less likely because someone else would have noticed it first.

So assuming that arrays is a DataFrame of ragged data (and this is Uproot >= 5.0), the data types in each column are managed with awkward-pandas. If so, I should be able to reproduce the error like this:

>>> import awkward as ak
>>> import pandas as pd
>>> import awkward_pandas
>>> ragged_array = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> ak_ext_array = awkward_pandas.AwkwardExtensionArray(ragged_array)
>>> df = pd.DataFrame({"column": ak_ext_array})
>>> df
         column
0     [0, 1, 2]
1            []
2        [3, 4]
3           [5]
4  [6, 7, 8, 9]
>>> df.to_parquet("/tmp/file.parquet")

and I do (with awkward-pandas version 2022.12a1):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/core/frame.py", line 2889, in to_parquet
    return to_parquet(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 411, in to_parquet
    impl.write(
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pandas/io/parquet.py", line 159, in write
    table = self.api.Table.from_pandas(df, **from_pandas_kwargs)
  File "pyarrow/table.pxi", line 3480, in pyarrow.lib.Table.from_pandas
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in dataframe_to_arrays
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 609, in <listcomp>
    arrays = [convert_column(c, f)
  File "/home/jpivarski/mambaforge/lib/python3.9/site-packages/pyarrow/pandas_compat.py", line 590, in convert_column
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
  File "pyarrow/array.pxi", line 263, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
TypeError: __arrow_array__() got an unexpected keyword argument 'type'

(For the future: including a whole stack trace would remove a lot of guesswork.)

I first thought, "Maybe awkward-pandas hasn't implemented the __arrow_array__ protocol." But no, the AwkwardExtensionArray has an __arrow_array__ method:

>>> ak_ext_array.__arrow_array__()
<pyarrow.lib.ChunkedArray object at 0x7ff422d0b9f0>
[
  [
    [
      0,
      1,
      2
    ],
    [],
    ...
    [
      5
    ],
    [
      6,
      7,
      8,
      9
    ]
  ]
]

Then, "Maybe it has an __arrow_array__ method, but that method doesn't take a type argument," which is what the error message is saying.

>>> help(ak_ext_array.__arrow_array__)
Help on method __arrow_array__ in module awkward_pandas.array:
__arrow_array__() method of awkward_pandas.array.AwkwardExtensionArray instance

Aha! That's it! So I was about to write an issue on awkward-pandas, and in so doing, point out the function definition that's missing a type argument. But the function definition isn't missing a type argument.

https://github.com/intake/awkward-pandas/blob/1f8cf19fdc9cb0786642f39cfaf7c084c3c5c9bc/src/awkward_pandas/array.py#L148-L151

It's just that my copy of the package was old. This is an old bug that has since been fixed.

I upgraded my awkward-pandas and it all works now:

>>> df.to_parquet("/tmp/file.parquet")

(no errors)

>>> ak.from_parquet("/tmp/file.parquet").show()
[{column: [0, 1, 2]},
 {column: []},
 {column: [3, 4]},
 {column: [5]},
 {column: [6, 7, 8, 9]}]

(reads back appropriately)
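
As a quick sanity check before and after upgrading (not part of the original detective work, just a convenience):

import awkward_pandas
print(awkward_pandas.__version__)  # the fix described above is present in 2023.8.0 and later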

-- Question -----------------------------------------------

I am using uproot to read root files in Python. My files are such that I am doing this:

ifile = uproot.open(path_to_root_file)

metadata = ifile['Metadata']
waveforms = ifile['Waveforms']

waveforms.show()
waveforms_of_event_50 = waveforms['voltages'].array()[50]
print(waveforms_of_event_50)

and I get as output

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
event                | int32_t                  | AsDtype('>i4')
producer             | std::string              | AsStrings()
voltages             | std::vector<std::vect... | AsObjects(AsVector(True, As...
[[0.00647, 0.00647, 0.00671, 0.00647, ..., 0.00769, 0.00769, 0.00647], ...]

Since waveforms['voltages'] is an array of arrays of waveforms, it is heavy and, consequently, the line waveforms_of_event_50 = waveforms['voltages'].array()[50] takes a long time, because it has to load all the waveforms into memory only to discard all of them but the 50th. What's more, for some of the files this is not even possible because they simply don't fit in memory. What I want instead is to get the ith waveform without loading all of them into memory, which I understand is one of the things ROOT files are good for, i.e. something like waveforms['voltages'].get_element_number(50). Is this possible? How?

-- Answer ------------

I could point to my answer here, but StackOverflow likes to have the content locally (in case we ever delete that GitHub repo or anything). Here's a copy of what I said there:

The best you can do is

waveforms_of_event_50 = waveforms['voltages'].array(entry_start=50, entry_stop=51)[0]

This will read the minimum physically possible, which is one TBasket. The TBasket might be several kB or maybe a few MB of data. Data in TTrees are not stored in smaller granularity than this, and each chunk is generally compressed, so you have to read the whole thing to decompress it. It will definitely solve your problem with tens of GB, though: I don't think it's possible for a single TBasket to get that large.

This is not a good pattern if, right after this, you want to read the waveform of event 51, because it's probably in the same TBasket that you just read, and

waveforms_of_event_51 = waveforms['voltages'].array(entry_start=51, entry_stop=52)[0]

would read it again. If you want to load just one TBasket at a time, see TBranch.basket_entry_start_stop to know where to put your entry_start, entry_stop boundaries.
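
For instance, here's a hedged sketch of that pattern (it assumes TBranch.entry_offsets lists the entry boundaries between TBaskets, which is how I read the Uproot API):

import numpy as np

branch = waveforms["voltages"]
offsets = np.asarray(branch.entry_offsets)                 # entry boundaries between TBaskets
basket_i = np.searchsorted(offsets, 50, side="right") - 1  # which TBasket contains entry 50
start, stop = branch.basket_entry_start_stop(basket_i)
chunk = branch.array(entry_start=start, entry_stop=stop)   # read exactly that one TBasket
waveform_50 = chunk[50 - start]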

-- Question -----------------------------------------------

I am very new to awkward arrays (and python in general) and just want to know how I could add the first elements of each array in an ak.array together.

E.g. my_array = [[1, 2, 3], [4, 5, 6], [7, 8, 9], ...]

and I want 1 + 4 + 7...

-- Answer ------------

If you only want the sum of first elements from all nested lists, then @moaz ahmad's answer is best. (See also my comment on it.)
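
(For completeness, here is one way to get just that sum, though not necessarily the same as the linked answer: take the first element of each list with ak.firsts, which tolerates empty lists, and sum the result.)

import awkward as ak

array = ak.Array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
total = ak.sum(ak.firsts(array))  # 1 + 4 + 7 = 12; empty lists become None, which ak.sum skips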

But if this is a stepping stone toward the sum over all firsts, the sum over all seconds, the sum over all thirds, etc., then what you're describing is an axis=0 summation. (See the documentation for ak.sum.)

>>> array = ak.Array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
>>> ak.sum(array, axis=0)
<Array [12, 15, 18] type='3 * int64'>

The first element of the result is 1 + 4 + 7 = 12, the second is 2 + 5 + 8 = 15, and the third is 3 + 6 + 9 = 18. You could take the first element of this result, but @moaz ahmad's answer involves less unnecessary summation.

This also works if the lengths of the lists in the array are not all equal to each other, as they are in your example. (Arrays of equal-length lists are regular 2-dimensional arrays, and NumPy is a better tool for that.) Consider the following irregular example:

>>> array = ak.Array([
...     [   1, 2, 3         ],
...     [                   ],
...     [   3, 4            ],
...     [None, 5, 6, 7, 8, 9],
... ])
>>> ak.sum(array, axis=0)
<Array [4, 11, 9, 7, 8, 9] type='6 * int64'>

The first element of the result is 1 + 3 = 4 (the None gets skipped; only relevant if your dataset has missing values), the second element is 2 + 4 + 5 = 11, the third element is 3 + 6 = 9, and the remaining elements are 7, 8, and 9 because they line up with blank spaces.

-- Question -----------------------------------------------

Reading from an ntuple and saving into a new ntuple, after many iterations over events, getting the following:

    utils.writeTree(outf, ltree, outtreename, brs)
  File "/srv01/agrp/roybr/zprimeplusxntr/UprootFramework/utils.py", line 273, in writeTree
    outf[treename].extend({ br[2] : events[br[0]][br[1]] for br in branches })
  File "/srv01/agrp/roybr/.local/lib/python3.9/site-packages/uproot3/write/objects/TTree.py", line 109, in extend
    raise Exception("Basket data should be added to all branches")
Exception: Basket data should be added to all branches

Unclear as to why this is happening. Any help would be highly appreciated!

I expected the code to run smoothly. Not clear why extending the tree and writing into it should create such an exception, especially after iterating over many events without special problems.

EDIT Let me share a longer snippet of the lines prior to the one shown above, mentioned by the trace:

elif "fakeWeight" in ltree.weight.fields and "em" not in lep:
    print ("extending tree for fakes")
    brs = brlist(lep, True)
    brs.extend([ ["weight", "fakeWeight"] ])
    brs.extend([ ["weight", "leptonSF_iso"] ])
    #utils.writeTree(outf, ltree, outtreename, brlist(lep))
    utils.writeTree(outf, ltree, outtreename, brs)
else:
    #utils.writeTree(outf, ltree, outtreename, brlist(lep))
    utils.writeTree(outf, ltree, outtreename, brs)

-- Answer ------------

One thing that would be needed to diagnose this is the code that led to the error. While the end of the stack trace is useful, it's not enough.

The stack trace is showing two things, though. One is that this is in a library called UprootFramework, which seems to be built on top of Uproot. In principle, the error (whatever it is) could be a user mistake, it could be in the UprootFramework, or it could be in Uproot itself. (Or it could be in mismatched assumptions between two of these layers, which amounts to the same thing.)

The other thing that I noticed is that this is Uproot3, a legacy version of Uproot that was superseded by Uproot 4.0 in 2020 and then Uproot 5.0 in 2022. In particular, the file-writing code was thoroughly overhauled because it was experimental and buggy in Uproot3. This long thread chronicles the rewrite effort. If you go to the Uproot3 GitHub page, you'll see that it's archived and has warning messages about being deprecated. Even if there are bugs in file-writing in Uproot3, it would be counterproductive now to investigate them. This is a version 3 → 4+ update guide and this is how Uproot-writing works in version 4+.

Good luck!

-- Question -----------------------------------------------

The problem I am facing is to create a new array from a set of indices. Namely, I have a set of particles and jets, for each jet there is a list of indices of which particles belong to the given jet.

import awkward as ak
import vector

particles = ak.Array({
    "mass": ak.Array([1, 2, 3, 4, 5]),
    "px": ak.Array([1, 2, 3, 4, 5]),
    "py": ak.Array([1, 2, 3, 4, 5]),
    "pz": ak.Array([1, 2, 3, 4, 5]),
})

particles_p4 = vector.awk(ak.zip({
    "mass": particles["mass"],
    "x": particles["px"],
    "y": particles["py"],
    "z": particles["pz"]
}))

jet = ak.Array({
    "constituents": ak.Array([[1, 2, 4, 5], [3]]),
    "energy": ak.Array([1.2, 3.4])
})

What I would like to get is for each jet the particle_p4 values in a new array like so:

<Array [[{x: 1, y: 1, z: 1, ... z: 4, tau: 4}]] type='2 * var * {"x": int64, "y"...'>

where the first element of that would be:

<Array [{x: 1, y: 1, z: 1, ... z: 5, tau: 5}] type='4 * {"x": int64, "y": int64,...'>

Doing this with a for loop is trivial, however I have the notion that this can be done in a more efficient way with the tools available in Awkward array.

Bonus: What about even more nested arrays, for example where each event has multiple jets?

-- Answer ------------

First, I think you mean you have

jet = ak.Array({
    "constituents": ak.Array([[0, 1, 3, 4], [2]]),
    "energy": ak.Array([1.2, 3.4])
})

because I'd expect the "constituents" indexes to be 0-based, not 1-based. But even if it is 1-based, just start by subtracting 1.

>>> jet.constituents - 1
<Array [[0, 1, 3, 4], [2]] type='2 * var * int64'>

The biggest problem here is that these indexes are nested one level deeper than the particles_p4 that you want to slice. You want the 0, 1, 3, 4, and also the 2, in your jet.constituents to be indexes in the not-nested list, particles_p4.

If we just arbitrarily flatten them (axis=-1 means to squash the last/deepest dimension):

>>> ak.flatten(jet.constituents, axis=-1)
<Array [0, 1, 3, 4, 2] type='5 * int64'>

these indexes are exactly what you'd need to apply to particles_p4. Here, I'm using the current (2.x) version of Awkward Array, so that I can use .show(), but the integer-array slice works in any version of Awkward Array.

>>> particles_p4[ak.flatten(jet.constituents, axis=-1)].show(type=True)
type: 5 * Momentum4D[
    x: int64,
    y: int64,
    z: int64,
    tau: int64
]
[{x: 1, y: 1, z: 1, tau: 1},
 {x: 2, y: 2, z: 2, tau: 2},
 {x: 4, y: 4, z: 4, tau: 4},
 {x: 5, y: 5, z: 5, tau: 5},
 {x: 3, y: 3, z: 3, tau: 3}]

If we take that as a partial solution, all we need to do now is put the nested structure back into the result.

ak.flatten has an opposite, ak.unflatten, which takes a flat array and adds nestedness from an array of list lengths. You can get the list lengths from the original jet.constituents with ak.num. Again, I'll use axis=-1 so that this answer will generalize to deeper nestings.

>>> lengths = ak.num(jet.constituents, axis=-1)
>>> lengths
<Array [4, 1] type='2 * int64'>

>>> rearranged = particles_p4[ak.flatten(jet.constituents, axis=-1)]
>>> rearranged
<MomentumArray4D [{x: 1, y: 1, z: 1, tau: 1}, ..., {...}] type='5 * Momentu...'>

>>> result = ak.unflatten(rearranged, lengths, axis=-1)
>>> result.show(type=True)
type: 2 * var * Momentum4D[
    x: int64,
    y: int64,
    z: int64,
    tau: int64
]
[[{x: 1, y: 1, z: 1, tau: 1}, {x: 2, ...}, ..., {x: 5, y: 5, z: 5, tau: 5}],
 [{x: 3, y: 3, z: 3, tau: 3}]]
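
Packaged as a small helper, the recipe above (flatten the indices, slice, then unflatten with the original list lengths) looks like this; it is just the lines above collected into a function:

import awkward as ak

def gather(values, nested_indices):
    flat = values[ak.flatten(nested_indices, axis=-1)]
    return ak.unflatten(flat, ak.num(nested_indices, axis=-1), axis=-1)

result = gather(particles_p4, jet.constituents)  # same result as shown above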

For the bonus round, if all of the above arrays (particles_p4 and jet) were arrays of lists, where each list represents one event, rather than an array representing one event, then the above would hold. I'm taking it as a given that the length of the particles_p4_by_event is equal to the length of the jet_by_event arrays, and the values of jet_by_event.constituents are indexes within each event in particles_p4_by_event (not global indexes; each event should restart at zero). That is, all of your arrays agree on how many events there are, and each event is handled individually, with no cross-over between events.

-- Question -----------------------------------------------

I'm trying to recreate a root file that contains a char* type branch (which is interpreted by uproot as AsStrings())

When using mktree, uproot doesn't recognize np.dtype('string'), and when I try np.dtype('S') I get:

TypeError: cannot write NumPy dtype |S0 in TTree

Is it possible to do this, or is it simply not implemented in the package?

-- Answer ------------

"Strings" is not one of the data types that WritableTTree supports. See the blue box under https://uproot.readthedocs.io/en/latest/basic.html#writing-ttrees-to-a-file for a full list.

However, it's possible to write some string-like data. Awkward Arrays of strings are just lists of uint8 type with special metadata (the __array__: "strings" parameter) indicating that it should be interpreted as a string. There are actually two types, "string" and "bytestring", in which we assume that the former is UTF-8 encoded and the latter is not.

These data can be written to ROOT files by removing the parameters from the array, so that it looks like a plain array of integers:

>>> import awkward as ak
>>> array = ak.Array(["one", "two", "three", "four", "five"])
>>> ak.without_parameters(array)
<Array [[111, 110, 101], ..., [102, 105, 118, 101]] type='5 * var * uint8'>

Here's a way to write these data into a ROOT file:

>>> import uproot
>>> file = uproot.recreate("/tmp/some.root")
>>> file["tree"] = {"branch": ak.without_parameters(array)}
>>> file["tree"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nbranch              | int32_t                  | AsDtype('>i4')
branch               | uint8_t[]                | AsJagged(AsDtype('uint8'))

When you read them back, the uint8_t* array could be cast as a char* array, but watch out! The strings are not null-terminated (they do not end with a \x00 byte). Many string-interpreting functions in C and C++ won't be expecting that. There are some functions, like strncpy and std::string's two-argument constructor, that can be given string-length information so that they don't look for a null terminator. The string-length information is the counter branch, nbranch in the above.

I recognize that that's unpleasant. I just opened a feature request on Uproot for writing string data in a natural way, using ROOT's TLeafC, rather than this hack.

-- Question -----------------------------------------------

I want to add a new field to an already zipped jagged array. For example, if I zip 4D info into a muons object, then I can access pt, eta, phi, charge like this: muons.Muon.pt. However, if I want to add a new field such as 2*pt to this muons object, I can't do it with muons.Muon['pt2'] = 2 * arrays["Muon_pt"]. Is there anything I misunderstand, or how can I add a new field to this jagged array? Could you please help me? Thanks.

muons = ak.zip({
    "pt": arrays["Muon_pt"],
    "eta": arrays["Muon_eta"],
    "phi": arrays["Muon_phi"],
    "charge": arrays["Muon_charge"],
})

I expected to be able to add a new field to the zipped jagged array, like muons.Muon['pt2'], and then access this field with muons.Muon.pt2.

-- Answer ------------

What you've described is the right way to go about it, and it should work.

As a walkthrough, we first get some arrays,

>>> import uproot
>>> import skhep_testdata
>>> import awkward as ak
>>> arrays = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"].arrays()

Then zip them into a structure,

>>> muons = ak.zip({
...     "px": arrays["Muon_Px"],
...     "py": arrays["Muon_Py"],
...     "pz": arrays["Muon_Pz"],
... })
>>> muons.show(type=True)
type: 2421 * var * {
    px: float32,
    py: float32,
    pz: float32
}
[[{px: -52.9, py: -11.7, pz: -8.16}, {px: 37.7, py: 0.693, pz: ..., ...}],
 [{px: -0.816, py: -24.4, pz: 20.2}],
 [{px: 49, py: -21.7, pz: 11.2}, {px: 0.828, py: 29.8, pz: 37}],
 [{px: 22.1, py: -85.8, pz: 404}, {px: 76.7, py: -14, pz: 335}],
 [{px: 45.2, py: 67.2, pz: -89.7}, {px: 39.8, py: 25.4, pz: 20.1}],
 [{px: 9.23, py: 40.6, pz: -14.6}, {px: -5.79, py: -30.3, pz: 43}],
 [{px: 12.5, py: -42.5, pz: -124}, {px: 29.5, py: -4.45, pz: -26.4}],
 [{px: 34.9, py: -16, pz: 156}],
 [{px: -53.2, py: 92, pz: 35.6}, {px: 11.5, py: -4.42, pz: -17.5}],
 [{px: -67, py: 53.2, pz: 54.4}, {px: -18.1, py: -35.1, pz: 58}],
 ...,
 [{px: 14.9, py: 32, pz: -156}],
 [{px: -24.2, py: -35, pz: -19.2}],
 [{px: -9.2, py: -42.2, pz: -64.3}],
 [{px: 34.5, py: 28.8, pz: -151}, {px: -31.6, py: -10.4, pz: -111}],
 [{px: -39.3, py: -14.6, pz: 61.7}],
 [{px: 35.1, py: -14.2, pz: 161}],
 [{px: -29.8, py: -15.3, pz: -52.7}],
 [{px: 1.14, py: 63.6, pz: 162}],
 [{px: 23.9, py: -35.7, pz: 54.7}]]

And now add the new field:

>>> muons["px2"] = 2 * arrays["Muon_Px"]

By construction, this px2 has the same number of elements (deeply, for all sublists) as px. That should also be true in your pt example. So the above line should successfully add the new field; if it doesn't, then I think you want to submit a bug report.

Here's what the new muons looks like for me:

>>> muons.show(type=True)
type: 2421 * var * {
    px: float32,
    py: float32,
    pz: float32,
    px2: float32
}
[[{px: -52.9, py: -11.7, pz: -8.16, px2: -106}, {px: 37.7, py: ..., ...}],
 [{px: -0.816, py: -24.4, pz: 20.2, px2: -1.63}],
 [{px: 49, py: -21.7, pz: 11.2, px2: 98}, {px: 0.828, py: 29.8, ...}],
 [{px: 22.1, py: -85.8, pz: 404, px2: 44.2}, {px: 76.7, py: -14, ...}],
 [{px: 45.2, py: 67.2, pz: -89.7, px2: 90.3}, {px: 39.8, py: 25.4, ...}],
 [{px: 9.23, py: 40.6, pz: -14.6, px2: 18.5}, {px: -5.79, py: -30.3, ...}],
 [{px: 12.5, py: -42.5, pz: -124, px2: 25.1}, {px: 29.5, py: -4.45, ...}],
 [{px: 34.9, py: -16, pz: 156, px2: 69.8}],
 [{px: -53.2, py: 92, pz: 35.6, px2: -106}, {px: 11.5, py: -4.42, ...}],
 [{px: -67, py: 53.2, pz: 54.4, px2: -134}, {px: -18.1, py: -35.1, ...}],
 ...,
 [{px: 14.9, py: 32, pz: -156, px2: 29.8}],
 [{px: -24.2, py: -35, pz: -19.2, px2: -48.3}],
 [{px: -9.2, py: -42.2, pz: -64.3, px2: -18.4}],
 [{px: 34.5, py: 28.8, pz: -151, px2: 69}, {px: -31.6, py: -10.4, ...}],
 [{px: -39.3, py: -14.6, pz: 61.7, px2: -78.6}],
 [{px: 35.1, py: -14.2, pz: 161, px2: 70.1}],
 [{px: -29.8, py: -15.3, pz: -52.7, px2: -59.5}],
 [{px: 1.14, py: 63.6, pz: 162, px2: 2.28}],
 [{px: 23.9, py: -35.7, pz: 54.7, px2: 47.8}]]

px2 has been added to the record type and you can see it in the values. This is Uproot 5 and Awkward Array 2, by the way, but this should also have worked in the previous major version.
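
As a follow-up sketch, new fields can also be computed directly from existing fields of the zipped array; NumPy ufuncs broadcast through the jagged structure, so the new field lines up automatically:

import numpy as np

muons["p"] = np.sqrt(muons.px**2 + muons.py**2 + muons.pz**2)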

-- Question -----------------------------------------------

I'm trying to extract cycles/revisions ("TreeName;3" etc.) from one ROOT file and make them their own trees in a new one. I tried doing it by creating a new file and assigning each tree to a new name, but I get an error telling me that the TTree is not writable:

with uproot.open("old_file.root") as in_file:
    with uproot.recreate("new_file.root") as out_file:
        for key in in_file.keys():
            ttree = in_file[key]
            new_name = key.replace(";","_")
            out_file[new_name] = ttree

This resulted in NotImplementedError: this ROOT type is not writable: TTree. I'm kind of confused because when I print out out_file it tells me that it is a <WritableDirectory '/' ...>, and I expected it to assign out_file[new_name] to ttree by value. However, digging into the documentation, uproot.writing.identify.add_to_directory says it will raise this error if the object to be added is not writable, so I guess it doesn't just make a copy in memory like I expected it to.

Next I tried to make a new tree first and then move the data in chunk by chunk. However this also didn't work because the tree creation failed:

out_file[new_name] = ttree.typenames()

ValueError: 'extend' must fill every branch with the same number of entries; 'name2' has 7 entries

The typenames are something like {'name1': 'double', 'name2': 'int32_t', 'name3': 'double[]', 'name4': 'int32_t[]', 'name5': 'bool[]'}.

Trying to debug it, I noticed some very strange behavior:

out_file[new_name] = {'name1': 'double', 'name2': 'float32'}

yields the exact same error, while

out_file[new_name] = {'name1': 'float64', 'name2': 'float32'}
out_file[new_name].show()

gives

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
name1                | uint8_t                  | AsDtype('uint8')
name2                | uint8_t                  | AsDtype('uint8')

so at this point I don't know what a datatype is anymore.

Finally, I tried doing it by writing the arrays, but this failed, too:

arrays = ttree.arrays(ttree.keys(),library='np')
out_file[key.replace(";","_")] = arrays

giving TypeError: cannot write Awkward Array type to ROOT file: unknown

Similar issues arise when using Awkward Array or pandas.

-- Answer ------------

I decided to give a complete working example (following up on comments, above), but found that there are a lot of choices to be made. All you want to do is to copy the input TTree—you don't want to make choices—so you really want a high-level "copy whole TTree" function, but such a function does not exist. (That would be a good addition to Uproot or a new module that uses Uproot to do hadd-type work. A good project if anyone is interested!)

I'm starting with this file, which may be obtained in a variety of ways:

file_path = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"

file_path = "http://opendata.cern.ch/record/12341/files/Run2012BC_DoubleMuParked_Muons.root"

file_path = "/tmp/Run2012BC_DoubleMuParked_Muons.root"

It's big enough that it should be copied in chunks, not all at once. The first chunk sets the types, so it can be performed with an assignment of new branch names to arrays, but subsequent chunks have to call WritableTree.extend because you don't want to replace the new TTree, you want to add to it. Neither of these explicitly deals with types; the types are picked up from the array.

Here's a first attempt, using "100 MB" as a chunk size. (This will be the sum of TBasket sizes across TBranches in the output TTree. What we're doing here is more than copying; it's repartitioning the data into a new chunk size.)

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for arrays_chunk in input_ttree.iterate(step_size="100 MB"):
            if first_chunk:
                output_file["Events"] = arrays_chunk
                first_chunk = False
            else:
                output_file["Events"].extend(arrays_chunk)

However, it fails because assignment and extend expect a dict of arrays, not a single array.

So we could ask TTree.iterate to give us a dict of Awkward Arrays, one for each TBranch, rather than a single Awkward Array that represents all of the TBranches. That would look like this:

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for dict_of_arrays in input_ttree.iterate(step_size="100 MB", how=dict):
            if first_chunk:
                output_file["Events"] = dict_of_arrays
                first_chunk = False
            else:
                output_file["Events"].extend(dict_of_arrays)

It copies the file, but whereas the original file had TBranches like

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | uint32_t                 | AsDtype('>u4')
Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))
Muon_phi             | float[]                  | AsJagged(AsDtype('>f4'))
Muon_mass            | float[]                  | AsJagged(AsDtype('>f4'))
Muon_charge          | int32_t[]                | AsJagged(AsDtype('>i4'))

the new file has TBranches like

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nMuon                | uint32_t                 | AsDtype('>u4')
nMuon_pt             | int32_t                  | AsDtype('>i4')
Muon_pt              | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_eta            | int32_t                  | AsDtype('>i4')
Muon_eta             | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_phi            | int32_t                  | AsDtype('>i4')
Muon_phi             | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_mass           | int32_t                  | AsDtype('>i4')
Muon_mass            | float[]                  | AsJagged(AsDtype('>f4'))
nMuon_charge         | int32_t                  | AsDtype('>i4')
Muon_charge          | int32_t[]                | AsJagged(AsDtype('>i4'))

What happened is that Uproot didn't know that each of the Awkward Arrays has the same number of items per entry (that the number of pt values in one event is the same as the number of eta values in one event). If the TBranches hadn't all been muons, but some were muons and some were electrons or jets, that wouldn't be true.

The reason these nMuon_pt, nMuon_eta, etc. TBranches are included at all is because ROOT needs them. The Muon_pt, Muon_eta, etc. TBranches are read, in ROOT, as C++ arrays of variable length, and a C++ user needs to know how big to preallocate an array and after which array entry the contents are uninitialized junk. These are not needed in Python (Awkward Array prevents users from seeing uninitialized junk).

So you could ignore them. But if you really need to/want to get rid of them, here's a way: build exactly the array you want to write. Now that we're dealing with types, we'll use WritableDirectory.mktree and specify types explicitly. Since every write is an extend, we won't have to keep track of whether we're writing the first_chunk or a subsequent chunk anymore.

For the Muon_pt, Muon_eta, etc. TBranches to share a counter TBranch, nMuons, you want a Muon field to be an array of variable-length lists of muon objects with pt, eta, etc. fields. That type can be constructed from a string:

import awkward as ak

muons_type = ak.types.from_datashape("""var * {
    pt: float32,
    eta: float32,
    phi: float32,
    mass: float32,
    charge: int32
}""", highlevel=False)

Given a chunk of separated arrays with type var * float32, you can make a single array with type var * {pt: float32, eta: float32, ...} with ak.zip.

muons = ak.zip({
    "pt": chunk["Muon_pt"],
    "eta": chunk["Muon_eta"],
    "phi": chunk["Muon_phi"],
    "mass": chunk["Muon_mass"],
    "charge": chunk["Muon_charge"],
})

(Printing muons.type gives you the type string back.) This is the form you're likely to be using for a data analysis. The assumption was that users would be analyzing data as objects between a read and a write, not reading from one file and writing to another without any modifications.

Here's a reader-writer, using muons_type:

with uproot.recreate("/tmp/output.root") as output_file:
    output_ttree = output_file.mktree("Events", {"Muon": muons_type})

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for chunk in input_ttree.iterate(step_size="100 MB"):
            muons = ak.zip({
                "pt": chunk["Muon_pt"],
                "eta": chunk["Muon_eta"],
                "phi": chunk["Muon_phi"],
                "mass": chunk["Muon_mass"],
                "charge": chunk["Muon_charge"],
            })

            output_ttree.extend({"Muon": muons})

Or you could have done it without explicitly constructing the muons_type by keeping track of the first_chunk again:

with uproot.recreate("/tmp/output.root") as output_file:
    first_chunk = True

    with uproot.open(file_path) as input_file:
        input_ttree = input_file["Events"]

        for chunk in input_ttree.iterate(step_size="100 MB"):
            muons = ak.zip({
                "pt": chunk["Muon_pt"],
                "eta": chunk["Muon_eta"],
                "phi": chunk["Muon_phi"],
                "mass": chunk["Muon_mass"],
                "charge": chunk["Muon_charge"],
            })

            if first_chunk:
                output_file["Events"] = {"Muon": muons}
                first_chunk = False
            else:
                output_file["Events"].extend({"Muon": muons})

It is admittedly complex (because I'm showing many alternatives, with different pros and cons), but that's because copying TTrees without modification wasn't a foreseen use-case for the TTree-writing functions. Since it is an important use-case, a specialized function that hides these details would be a welcome addition.

-- Question -----------------------------------------------

I used to retrieve a pandas DataFrame from a ROOT file using the tree.pandas.df() function in Uproot 4 (2 years ago). However, I got the error below when I ran my code recently. Could anyone tell me what the problem is?

f = uproot.open(inputFile)
treeName = "myTreeName"
tree = f[treeName]
myDf = tree.pandas.df('branchName',entrystop=nEvent, flatten = False)

AttributeError: 'Model_TTree_v19' object has no attribute 'pandas'

-- Answer ------------

In Uproot version 3, a special function named TTree.pandas.df created Pandas DataFrames.

In Uproot version 4 (and above), all of the functions that produce arrays have a library argument that specifies which library to use to represent the arrays. library="pd" makes Pandas DataFrames.

This change is described in the Uproot 3 → 4 cheat-sheet, and the new argument is described in several places in the Getting Started Guide, as well as in all the reference documentation for array-fetching functions, such as TTree.arrays.
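
For example, a minimal sketch of the equivalent call in Uproot 4+ (inputFile, the tree name, the branch name, and nEvent are placeholders taken from your snippet):

import uproot

with uproot.open(inputFile) as f:
    tree = f["myTreeName"]
    myDf = tree.arrays("branchName", entry_stop=nEvent, library="pd")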

-- Question -----------------------------------------------

Say we have a jagged array that looks like this:

arr = ak.Array([[1, 2, 3], [3, 2], [], [5], [6, 9, 6, 9]])

We can see that it has a depth of 2. Is there something like a single or combination of built-in functions that would tell me that?

ak.size requires that I already know the depth, and ak.to_numpy -> np.size would give me an error for incompatibility. I'm looking for something built-in because I need it to be fast.

Thanks!

Edit: I forgot to mention that I would like to solve this in the case where the given array is guaranteed to have a uniform depth and consists entirely of numbers.

-- Answer ------------

Actually, there is a function that just does this, and it's in the public API, but not the "high level" parts intended for data analysts. It's intended for downstream libraries to build on top of Awkward Array (and Awkward Array uses it internally quite a lot).

In an array's (low level) layout, there's a property called minmax_depth.

>>> import awkward as ak
>>> arr = ak.Array([[1, 2, 3], [3, 2], [], [5], [6, 9, 6, 9]])
>>> arr.layout.minmax_depth
(2, 2)

Here, the minimum and maximum are both 2 because this is a relatively simple type. But a heterogeneous union can have a different minimum and maximum:

>>> arr = ak.Array([1, [2, 3, [4, 5, 6]]])
>>> arr.layout.minmax_depth
(1, 3)

and (as a more common case), records can introduce different levels of depth:

>>> arr = ak.Array([{"x": 1, "y": [{"z": [[[2]]]}]}])
>>> arr.layout.minmax_depth
(1, 5)

There are also variants on this, like branch_depth (a bool for "is it branching?" together with the minimum depth) and purelist_depth (the depth of just lists and missing-value nodes, not records or unions).

>>> arr = ak.Array([{"x": 1, "y": [{"z": [[[2]]]}]}])
>>> arr.layout.branch_depth
(True, 1)
>>> arr.layout.purelist_depth
1

The fact that different parts of the array can have different depths (unlike a NumPy array, in which it's always ndim or len(shape)) is important for interpreting the axis parameter. Unlike NumPy, negative axis can mean different levels of depth in different parts of the array (because it's counting up from the deepest, which can be different in different parts).

>>> arr = ak.Array([{"x": [1, 2, 3], "y": [[1, 2, 3], []]}])
>>> ak.sum(arr, axis=-1)
<Record {x: [6], y: [[6, 0]]} type='{"x": var * int64, "y": var * var * int64}'>

Above, the y field is deeper than the x field, but axis=-1 means to sum along the deepest axis, wherever that is.
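
Coming back to the uniform-depth case in your edit: a minimal helper (a sketch, continuing the session above and relying only on the minmax_depth property) could be

>>> def depth(array):
...     mindepth, maxdepth = array.layout.minmax_depth
...     assert mindepth == maxdepth, "array does not have a uniform depth"
...     return mindepth
...
>>> depth(ak.Array([[1, 2, 3], [3, 2], [], [5], [6, 9, 6, 9]]))
2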

-- Question -----------------------------------------------

I am reading data from a file of "events". For each event, there is some number of "tracks". For each track there are a series of "variables". A stripped down version of the code (using awkward0 as awkward) looks like

f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")

afile = awkward.hdf5(f)

pocaz  = np.asarray(afile["poca_z"].astype(dtype_X))

pocaMx = np.asarray(afile["major_axis_x"].astype(dtype_X))
pocaMy = np.asarray(afile["major_axis_y"].astype(dtype_X))
pocaMz = np.asarray(afile["major_axis_z"].astype(dtype_X))

In this snippet of code, "pocaz", "pocaMx", etc. are what I have called variables (a physics label, not a Python data type). On rare occasions, pocaz takes on an extreme value, pocaMx and/or pocaMy take on nan values, and/or pocaMz takes on the value inf. I would like to remove these tracks from the events using some syntactically simple method. I am guessing this functionality exists (perhaps in the current version of awkward but not awkward0), but cannot find it described in a transparent way. Is there a simple example anywhere?

-- Answer ------------

It looks to me, from the fact that you're able to call np.asarray on these arrays without error, that they are one-dimensional arrays of numbers. If so, then Awkward Array isn't doing anything for you here; you should be able to find the one-dimensional NumPy arrays inside

f["poca_z"], f["major_axis_x"], f["major_axis_y"], f["major_axis_z"]

as groups (note that this is f, not afile) and leave Awkward Array entirely out of it.

The reason I say that is because you can use np.isfinite on these NumPy arrays. (There's an equivalent in Awkward v1 and v2, but you're talking about Awkward v0, and I don't remember whether v0 had it.) That will give you an array of booleans with which to slice these arrays.

I don't have the HDF5 file for testing, but I think it would go like this:

f = h5py.File('dataAA/pv_HLT1CPU_MinBiasMagDown_14Nov.h5',mode="r")

pocaz = np.asarray(f["poca_z"]["0"], dtype=dtype_X)

pocaMx = np.asarray(f["major_axis_x"]["0"], dtype=dtype_X)   # the only array
pocaMy = np.asarray(f["major_axis_y"]["0"], dtype=dtype_X)   # in each group
pocaMz = np.asarray(f["major_axis_z"]["0"], dtype=dtype_X)   # is named "0"

good = np.ones(len(pocaz), dtype=bool)
good &= np.isfinite(pocaz)
good &= np.isfinite(pocaMx)
good &= np.isfinite(pocaMy)
good &= np.isfinite(pocaMz)

pocaz, pocaMx, pocaMy, pocaMz = pocaz[good], pocaMx[good], pocaMy[good], pocaMz[good]

If you also need to cut extreme finite values, you can include

good &= (-1000 < pocaz) & (pocaz < 1000)

etc. in the good selection criteria.

(The way you'd do this in Awkward Array is not any different, since Awkward is just generalizing what NumPy does here, but if you don't need it, you might as well leave it out.)
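
For completeness, here's a tiny sketch of the same selection on jagged Awkward (v1/v2) arrays, since np.isfinite dispatches to Awkward through the NumPy ufunc mechanism; the values here are made up:

import numpy as np
import awkward as ak

pocaz = ak.Array([[1.1, np.inf], [2.2], [np.nan, 3.3]])
good = np.isfinite(pocaz)     # same jagged structure, but booleans
pocaz[good]                   # [[1.1], [2.2], [3.3]]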

-- Question -----------------------------------------------

My question is about the Vector module in scikit-hep.

https://vector.readthedocs.io/en/latest/index.html

I have an awkward array of vectors and I'd like to set the mass of all of them to be a common value. For example, I can do this with a single vector object.

x = vector.obj(pt=2, eta=1.5, phi=1, energy=10)
y = x.from_rhophietatau(rho=x.rho, eta=x.eta, phi=x.phi, tau=20)

print(f"{x.mass:6.3f}  {x.pt}  {x.eta}  {x.phi}  {x.energy:6.2f}")
print(f"{y.mass:6.3f}  {y.pt}  {y.eta}  {y.phi}  {y.energy:6.2f}")

Output

 8.824  2  1.5  1   10.00
20.000  2  1.5  1   20.55

But suppose I want to do this with an awkward array of vectors?

Let me start with some starter code from this previous question:

https://stackoverflow.com/questions/72834275/using-awkward-array-with-zip-unzip-with-two-different-physics-objects/72834871

First, I'll get an input file

curl http://opendata.cern.ch/record/12361/files/SMHiggsToZZTo4L.root --output SMHiggsToZZTo4L.root

Then I'll make use of the code from the answer to that question:

import numpy as np
import matplotlib.pylab as plt

import uproot
import awkward as ak

import vector
vector.register_awkward()


infile = uproot.open("/tmp/SMHiggsToZZTo4L.root")

muon_branch_arrays = infile["Events"].arrays(filter_name="Muon_*")
electron_branch_arrays = infile["Events"].arrays(filter_name="Electron_*")

muons = ak.zip({
    "pt": muon_branch_arrays["Muon_pt"],
    "phi": muon_branch_arrays["Muon_phi"],
    "eta": muon_branch_arrays["Muon_eta"],
    "mass": muon_branch_arrays["Muon_mass"],
    "charge": muon_branch_arrays["Muon_charge"],
}, with_name="Momentum4D")

quads = ak.combinations(muons, 4)
mu1, mu2, mu3, mu4 = ak.unzip(quads)

p4 = mu1 + mu2 + mu3 + mu4

The type of p4 is <class 'vector._backends.awkward_.MomentumArray4D'>. Is there a way to set all the masses of the p4 objects to be, for example, 125? While this is not exactly my analysis, I need to do something similar where I will then use p4 to boost the muX objects to the CM frame of p4 and look at some relative angles. But I need to set the mass of p4 to be a constant value.

Is this possible? Thanks!

-- Answer ------------

This is a well written question, thank you for the effort!

The answer here is yes, you can set a new value for the mass! One would do this by updating the mass field using ak.with_field, or by using the subscript (__setitem__) operator, e.g.

p4['mass'] = 125.0

This will internally call ak.with_field, which you could also use e.g.

p4 = ak.with_field(p4, 125.0, "mass")

and broadcasts the 125.0 value against the rest of the array.

It is sometimes more convenient to use the vector Awkward constructors, as it is slightly less typing:

muons = vector.zip({
    'pt': muon_branch_arrays['Muon_pt'],
    'phi': muon_branch_arrays['Muon_phi'],
    'eta': muon_branch_arrays['Muon_eta'],
    'charge': muon_branch_arrays['Muon_charge'],
    'mass': muon_branch_arrays['Muon_mass'],
})

vector determines what kind of array you are building from the field names. This provides a good opportunity to highlight something important: vector supports aliases for fields, e.g. tau is an alias for mass. If you compare the fields of the muons array above with the array you built with ak.zip, you'll notice that my muons array has fields ['rho', 'phi', 'eta', 'tau', 'charge'] whilst your muon array has fields ['pt', 'phi', 'eta', 'mass', 'charge']. What's happening here is that vector is canonicalising the field names. This means that, were you to build an array in this manner, you'd want to use p4['tau'] = 125.0 instead of p4['mass'] = 125.0.

This would also be apparent if you transformed your muons array in any way, e.g. double_muons = muons + muons. You'd find that the result loses the charge field, and has tau instead of mass and rho instead of pt. So, something to be mindful of if you need to set a field.

The reason that p4['mass'] = 125 works, but p4['mass'][:] = 125 does not, is because of how Awkward Array is designed. Whilst Awkward Arrays are immutable, this is only half of the story. You can already see that there is some kind of mutability: we can modify a field in-place. This works because, whilst the underlying "layouts" from which Arrays are built do not allow users to modify their values, the high-level ak.Array can be given a new layout. This is what __setitem__ does under the hood, i.e. https://github.com/scikit-hep/awkward/blob/72c9edd55b9c4611ffc46952cda4cf9920a91315/src/awkward/highlevel.py#L1062-L1063

-- Question -----------------------------------------------

I'm trying to reproduce parts of the Higgs discovery in the Higgs --> 4 leptons channel with open data and making use of awkward. I can do it when the leptons are the same (e.g. 4 muons) with zip/unzip, but is there a way to do it in the 2 muon/2 electron channel? I started with the example in the HSF tutorial

https://hsf-training.github.io/hsf-training-scikit-hep-webpage/04-awkward/index.html

So I now have the following. First get the input file

curl http://opendata.cern.ch/record/12361/files/SMHiggsToZZTo4L.root --output SMHiggsToZZTo4L.root

Then I do the following

import numpy as np
import matplotlib.pylab as plt

import uproot
import awkward as ak

# Helper functions

def energy(m, px, py, pz):
    E = np.sqrt( (m**2) + (px**2 + py**2 + pz**2))
    return E

def invmass(E, px, py, pz):

    m2 = (E**2) - (px**2 + py**2 + pz**2)
    
    if m2 < 0:
        m = -np.sqrt(-m2)
    else:
        m = np.sqrt(m2)
    return m


def convert(pt, eta, phi):
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    return px, py, pz

####

# Read in the file
infile_name = 'SMHiggsToZZTo4L.root'
infile = uproot.open(infile_name)

# Convert to Cartesian
muon_pt = infile['Events/Muon_pt'].array()
muon_eta = infile['Events/Muon_eta'].array()
muon_phi = infile['Events/Muon_phi'].array()
muon_q = infile['Events/Muon_charge'].array()
muon_mass = infile['Events/Muon_mass'].array()

muon_px,muon_py,muon_pz = convert(muon_pt, muon_eta, muon_phi)
muon_energy = energy(muon_mass, muon_px, muon_py, muon_pz)

# Do the magic
nevents = len(infile['Events/Muon_pt'].array())
# nevents = 1000 # For testing

max_entries = nevents
muons = ak.zip({
    "px": muon_px[0:max_entries],
    "py": muon_py[0:max_entries],
    "pz": muon_pz[0:max_entries],
    "e": muon_energy[0:max_entries],
    "q": muon_q[0:max_entries],
})

quads = ak.combinations(muons, 4)

mu1, mu2, mu3, mu4 = ak.unzip(quads)

mass_try = (mu1.e + mu2.e + mu3.e + mu4.e)**2 - ((mu1.px + mu2.px + mu3.px + mu4.px)**2 + (mu1.py + mu2.py + mu3.py + mu4.py)**2 + (mu1.pz + mu2.pz + mu3.pz + mu4.pz)**2)

mass_try = np.sqrt(mass_try)

qtot = mu1.q + mu2.q + mu3.q + mu4.q

plt.hist(ak.flatten(mass_try[qtot==0]), bins=100,range=(0,300));

And the histogram looks good!

So how would I do this for 2-electron + 2-muon combinations? I would guess there's a way to make lepton_xxx arrays? But I'm not sure how to do this elegantly (quickly) such that I could also create a flag to keep track of what the lepton combinations are?

Thanks!

-- Answer ------------

This could be answered in a variety of ways:

  • make a union array (mixed data types) of electrons and muons
  • make an array of electrons and muons that are the same type, but have a flag to indicate flavor (electron vs muon)
  • use ak.combinations with n=2 for the muons, ak.combinations with n=2 again for the electrons, and then combine them with ak.cartesian (and deal with tuples of tuples, rather than one level of tuples, which would mean two calls to ak.unzip); a sketch of this option appears at the end of this answer
  • break the electron and muon collections down into single-charge collections. You'll want exactly 1 positive muon, 1 negative muon, 1 positive electron, and 1 negative electron, so that would be an ak.cartesian of the four collections.

I'll go with the last method because I've decided that it's easiest.

Another thing you probably want to know about is the Vector library. After

import vector
vector.register_awkward()

you don't have to do explicit coordinate transformations or mass calculations. I'll be using that. Here's how I read in the data:

infile = uproot.open("/tmp/SMHiggsToZZTo4L.root")

muon_branch_arrays = infile["Events"].arrays(filter_name="Muon_*")
electron_branch_arrays = infile["Events"].arrays(filter_name="Electron_*")

muons = ak.zip({
    "pt": muon_branch_arrays["Muon_pt"],
    "phi": muon_branch_arrays["Muon_phi"],
    "eta": muon_branch_arrays["Muon_eta"],
    "mass": muon_branch_arrays["Muon_mass"],
    "charge": muon_branch_arrays["Muon_charge"],
}, with_name="Momentum4D")
electrons = ak.zip({
    "pt": electron_branch_arrays["Electron_pt"],
    "phi": electron_branch_arrays["Electron_phi"],
    "eta": electron_branch_arrays["Electron_eta"],
    "mass": electron_branch_arrays["Electron_mass"],
    "charge": electron_branch_arrays["Electron_charge"],
}, with_name="Momentum4D")

And this reproduces your plot:

quads = ak.combinations(muons, 4)
quad_charge = quads["0"].charge + quads["1"].charge + quads["2"].charge + quads["3"].charge
mu1, mu2, mu3, mu4 = ak.unzip(quads[quad_charge == 0])
plt.hist(ak.flatten((mu1 + mu2 + mu3 + mu4).mass), bins=100, range=(0, 200));

The quoted number slices (e.g. "0" and "1") are picking tuple fields, rather than array entries; it's a manual ak.unzip. (The fields could have had real names if I had given a fields argument to ak.combinations.)

For the two muons, two electrons case, let's make four distinct collections.

muons_plus = muons[muons.charge > 0]
muons_minus = muons[muons.charge < 0]
electrons_plus = electrons[electrons.charge > 0]
electrons_minus = electrons[electrons.charge < 0]

The ak.combinations function (with default axis=1) returns the Cartesian product of each list in the array of lists with itself, excluding an item with itself, (x, x) (unless you want that, specify replacement=True), and excluding one of each symmetric pair (x, y)/(y, x).

If you want just a plain Cartesian product of lists from one array with lists from another array, that's ak.cartesian. The four collections muons_plus, muons_minus, electrons_plus, electrons_minus are non-overlapping and we want each four-lepton group to have exactly one from each, so that's a plain Cartesian product.

mu1, mu2, e1, e2 = ak.unzip(ak.cartesian([muons_plus, muons_minus, electrons_plus, electrons_minus]))
plt.hist(ak.flatten((mu1 + mu2 + e1 + e2).mass), bins=100, range=(0, 200));

Separating particles by flavor (electron vs muon) but not by charge is an artifact of the typing constraints imposed by C++. Electrons are measured in different ways from muons, so they have different attributes in the dataset. In a statically typed language like C++, we couldn't put them in the same collection because they differ by type (have different attributes). But charges only differ by value (an integer), so there was no reason they had to be in different collections.

But now, the only thing distinguishing a type is (a) what attributes the objects have and (b) what the objects with that set of attributes are named. Here, I named them "Momentum4D" because that allowed Vector to recognize them as Lorentz vectors and give them Lorentz vector methods. But they could have had other attributes (e.g. e_over_p for electrons, but not for muons). charge is a non-Momentum4D attribute.

So if we're going to the trouble to put different-flavor leptons in different arrays, why not put different-charge leptons in different arrays, too? Physically, flavor and charge are both particle properties at the same level of distinction, so we probably want to do the analysis that way. No reason to let a C++ type distinction get in the way of the analysis logic!

Also, you probably want

nevents = infile["Events"].num_entries

to get the number of entries from a TTree without reading the array data from it.

-- Question -----------------------------------------------

I'm trying to output a TTree with the same general format as an input TTree which has the structure:

ttree.show(datatypes.keys())

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Weight               | float                    | AsDtype('>f4')
E_Beam               | float                    | AsDtype('>f4')
Px_Beam              | float                    | AsDtype('>f4')
Py_Beam              | float                    | AsDtype('>f4')
Pz_Beam              | float                    | AsDtype('>f4')
NumFinalState        | int32_t                  | AsDtype('>i4')
E_FinalState         | float[]                  | AsJagged(AsDtype('>f4'))
Px_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
Py_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
Pz_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))

The NumFinalState branch contains the number of elements in all of the *_FinalState array branches. This will always be the case in my work, so it seems wasteful to do the following:

outfile = uproot.recreate("myData_OUT.root")
datatypes = {"Weight": "float32", "E_Beam": "float32", "Px_Beam": "float32", "Py_Beam": "float32", "Pz_Beam": "float32", "NumFinalState": "int32", "E_FinalState": "var * float32", "Px_FinalState": "var * float32", "Py_FinalState": "var * float32", "Pz_FinalState": "var * float32"}
outfile.mktree("kin", datatypes)
outfile["kin"].show()

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Weight               | float                    | AsDtype('>f4')
E_Beam               | float                    | AsDtype('>f4')
Px_Beam              | float                    | AsDtype('>f4')
Py_Beam              | float                    | AsDtype('>f4')
Pz_Beam              | float                    | AsDtype('>f4')
NumFinalState        | int32_t                  | AsDtype('>i4')
nE_FinalState        | int32_t                  | AsDtype('>i4')
E_FinalState         | float[]                  | AsJagged(AsDtype('>f4'))
nPx_FinalState       | int32_t                  | AsDtype('>i4')
Px_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
nPy_FinalState       | int32_t                  | AsDtype('>i4')
Py_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))
nPz_FinalState       | int32_t                  | AsDtype('>i4')
Pz_FinalState        | float[]                  | AsJagged(AsDtype('>f4'))

In the documentation, it appears I can use the counter_name argument in mktree to give the counter branches custom names, but it seems to run into trouble if I try to give them the same name:

outfile = uproot.recreate("myData_OUT.root")
datatypes = {"Weight": "float32", "E_Beam": "float32", "Px_Beam": "float32", "Py_Beam": "float32", "Pz_Beam": "float32", "NumFinalState": "int32", "E_FinalState": "var * float32", "Px_FinalState": "var * float32", "Py_FinalState": "var * float32", "Pz_FinalState": "var * float32"}
def counter_name(in_str: str) -> str:
    if "FinalState" in in_str:
        return "NumFinalState"
    return f"n{in_str}"
outfile.mktree("kin", datatypes, counter_name=counter_name)

This code throws an error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
/var/folders/td/j379rd296477649qvl1k8n180000gn/T/ipykernel_24226/1447106219.py in <module>
      5         return "NumFinalState"
      6     return f"n{in_str}"
----> 7 outfile.mktree("kin", datatypes, counter_name=counter_name)

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/writable.py in mktree(self, name, branch_types, title, counter_name, field_name, initial_basket_capacity, resize_factor)
   1268             path,
   1269             directory._file,
-> 1270             directory._cascading.add_tree(
   1271                 directory._file.sink,
   1272                 treename,

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/_cascade.py in add_tree(self, sink, name, title, branch_types, counter_name, field_name, initial_basket_capacity, resize_factor)
   1796             resize_factor,
   1797         )
-> 1798         tree.write_anew(sink)
   1799         return tree
   1800 

/opt/homebrew/Caskroom/miniforge/base/lib/python3.9/site-packages/uproot/writing/_cascadetree.py in write_anew(self, sink)
   1114                 # reference to fLeafCount
   1115                 out.append(
-> 1116                     uproot.deserialization._read_object_any_format1.pack(
   1117                         datum["counter"]["tleaf_reference_number"]
   1118                     )

error: required argument is not an integer

and I figure this error is related to uproot trying to make two branches with the same name. Is there any way to get around this? It'll probably be okay to just create a NumFinalState branch manually, since it gets read in by a subsequent program, but just in terms of compactness, it would be nice to not create a bunch of unnecessary branches.

-- Answer ------------

Uproot makes one counter branch for each Awkward Array in the dict it's given. Since your Awkward Arrays are arrays of lists of numbers, they're all presumed to have different counters. There isn't a way to manually force them to share a counter; the way it's supposed to work is to join them all into one Awkward Array, which Uproot will recognize as something that should have one counter.

So suppose you have

>>> import awkward as ak
>>> import uproot
>>> E_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Px_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Py_FinalState = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> Pz_FinalState = ak.Array([[11, 22, 33], [], [44, 55]])

Passing each of these individually into the output TTree will make nE_FinalState, nPx_FinalState, etc., as you've seen. So make them one array with ak.zip:

>>> finalstate = ak.zip({"E": E_FinalState, "px": Px_FinalState, "py": Py_FinalState, "pz": Pz_FinalState})
>>> finalstate
<Array [[{E: 1.1, px: 1.1, ... pz: 55}]] type='3 * var * {"E": float64, "px": fl...'>
>>> print(finalstate.type)
3 * var * {"E": float64, "px": float64, "py": float64, "pz": int64}

The key thing is that the type is now number of entries * var * {record}, rather than number of entries * var * float64, individually for each array. It's in the ak.zip function that you find out whether they really do have the same number of entries. (It's a one-time cross-check; the creation of the finalstate array is itself zero-copy.)

Now you can use this when writing a TTree:

>>> outfile = uproot.recreate("myData_OUT.root")
>>> outfile["kin"] = {"finalstate": finalstate}
>>> outfile["kin"].show()
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
nfinalstate          | int32_t                  | AsDtype('>i4')
finalstate_E         | double[]                 | AsJagged(AsDtype('>f8'))
finalstate_px        | double[]                 | AsJagged(AsDtype('>f8'))
finalstate_py        | double[]                 | AsJagged(AsDtype('>f8'))
finalstate_pz        | int64_t[]                | AsJagged(AsDtype('>i8'))

The counter_name and field_name arguments only control the generation of names. By default, they follow a convention in which the "finalstate" name in the dict gets prepended by "n" for the counter and appended by "_" and the name of each field for the branches (CMS NanoAOD conventions). Those arguments exist so that you can apply a different naming convention, but they don't actually change which counter branches get created. In fact, defining these functions so that they produce the same name might trigger a confusing error message like this—I don't think it's an explicitly checked case.

Oh, and you should also be able to use the uproot.WritableDirectory.mktree constructor (which makes a TTree without data, so that every uproot.WritableTree.extend call can be treated the same way). The dict syntax would be

>>> outfile.mktree("kin", {"finalstate": finalstate.type})
<WritableTree '/kin' at 0x7f48ff139550>
>>> outfile["kin"].extend({"finalstate": finalstate})

i.e. use the finalstate.type, rather than the finalstate array itself.

-- Question -----------------------------------------------

I just started looking into parallelizing the analysis of ROOT files, i.e. trees. I have worked with RDataFrames, where implicit multithreading can be enabled with one line of code (EnableImplicitMT()). This works quite well. Now I want to experiment with explicit multiprocessing and Uproot to see if efficiency can be boosted even further. I just need some guidance on a sensible approach.

Say I have a really large dataset (it cannot be read in at once) stored in a ROOT file with a couple of branches. Nothing too crazy has to be done for the analysis: some calculations, filtering, and then maybe filling some histograms.

The ideas I have:

  1. Trivial parallelizing: somehow splitting the ROOT file into many smaller files and running the same analysis in parallel on all of them, recombining the respective results at the end.

  2. Maybe it is possible to read in the file and analyze it in batches, as described in the Uproot docs, but distribute the batches and the operations on them to different cores? One could use the Python multiprocessing package.

  3. Similar to 2, read in the file in batches, but rather than distributing the batches to the cores, slice up the arrays of one batch and distribute the slices and the operations on them to the cores.

I need some feedback on whether these approaches are worth trying or if there are better ways of handling large files efficiently.

-- Answer ------------

A key thing to keep in mind about Uproot is that it isn't a framework for doing HEP analysis—it only reads ROOT files. The HEP analysis is the next step—code beyond any interactions with Uproot.

For the record, Uproot's file-reading can be parallelized, but that just means that multiple threads can be employed to wait for disk/network and to decompress data; the effect is the same: you wait for all the threads to be done to get the chunk of data, maybe a little faster. That's not what you're asking about.

You want your analysis code to be running in parallel, and that's a generic question about parallel processing in Python, not Uproot. You can break your work up into pieces (explicitly), and have each of those pieces independently use Uproot to read the data. Or you can use a Python library for parallel-processing to do it implicitly, such as Dask, or you can use a HEP-specific library that pulls these parts together, such as Coffea.
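
As a concrete illustration of breaking the work into explicit pieces, here is a minimal sketch using concurrent.futures with Uproot reading entry ranges; the file name, tree name ("Events"), branch ("Muon_pt"), number of pieces, and histogram binning are all placeholders to adapt to your analysis:

import numpy as np
import awkward as ak
import uproot
from concurrent.futures import ProcessPoolExecutor

def analyze_piece(args):
    file_path, start, stop = args
    with uproot.open(file_path) as f:
        arrays = f["Events"].arrays(["Muon_pt"], entry_start=start, entry_stop=stop)
    # per-piece analysis: fill a small histogram and return only its counts
    counts, _ = np.histogram(np.asarray(ak.flatten(arrays["Muon_pt"])), bins=50, range=(0, 100))
    return counts

if __name__ == "__main__":
    file_path = "myfile.root"
    num_entries = uproot.open(file_path)["Events"].num_entries
    edges = np.linspace(0, num_entries, 9, dtype=int)    # split into 8 entry ranges
    pieces = [(file_path, int(a), int(b)) for a, b in zip(edges[:-1], edges[1:])]
    with ProcessPoolExecutor() as pool:
        total_counts = sum(pool.map(analyze_piece, pieces))   # combine partial results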

-- Question -----------------------------------------------

I am trying to optimize the way trees are written in PyROOT and came across Uproot. In the end, my application should write events (consisting of arrays) that are continuously coming in to a tree.

The first approach is the classic way:

event= [1.,2.,3.]
f = ROOT.TFile("my_tree.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")

pt = array.array('f', [0.]*3)


tree.Branch("pt", pt, "pt[3]/F")

#for loop to simulate incoming events
for _ in range(10000):
	for i, element in enumerate(event):
		pt[i] = element

	tree.Fill()

tree.Print()
tree.Write("", ROOT.TObject.kOverwrite);
f.Close()

This gives the following Tree and execution time:

Trying to do it with uproot, my code looks like this:

np_array = np.array([[1,2,3]])
ak_array = ak.from_numpy(np_array)

with uproot.recreate("testing.root", compression=None) as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    
    for _ in range(10000):
        
        fout["tree"].extend({"branch": ak_array})

which gives the following tree:

So the uproot method takes much longer, the file size is much bigger, and each event gets a separate basket. I tried out different compression settings but that did not change anything. Any idea on how to optimize this? Is this even a sensible use case for uproot, and can the process of writing trees be sped up in comparison to the first way of doing it?

-- Answer ------------

The extend method is supposed to write a new TBasket with each invocation. (See the documentation, especially the orange warning box. The purpose of that is so that you can control the TBasket sizes.) If you're calling it 10000 times to write 1 value (the value [1, 2, 3]) each time, that's a maximally inefficient use.

Fundamentally, you're thinking about this problem in an entry-by-entry way, rather than in terms of columns, the way that scientific processing is normally done in Python. What you want to do instead is to collect a large dataset in memory and write it to the file in one chunk. If the data that you'll eventually be addressing is larger than the memory on your computer, you would do it in "large enough" chunks, which is probably on the order of hundreds of megabytes or gigabytes.

For instance, starting with your example,

import time
import uproot
import numpy as np
import awkward as ak

np_array = np.array([[1, 2, 3]])
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("bad.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    for _ in range(10000):
        fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)

The total time (on my computer) is 1.9 seconds and the TTree characteristics are atrocious:

******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    10000 : Total =         1170660 bytes  File  Size =    2970640 *
*        :          : Tree compression factor =   1.00                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=    1170323 bytes  File Size  =     970000 *
*Baskets :    10000 : Basket Size=      32000 bytes  Compression=   1.00     *
*............................................................................*

Instead, we want the data to be in a single array (or some loop that produces ~GB scale arrays):

np_array = np.array([[1, 2, 3]] * 10000)

(This isn't necessarily how you would get np_array, since * 10000 makes a large, intermediate Python list. Suffice to say, you get the data somehow.)

Now we do the write with a single call to extend, which makes a single TBasket:

np_array = np.array([[1, 2, 3]] * 10000)
ak_array = ak.from_numpy(np_array)

starttime = time.time()

with uproot.recreate("good.root") as fout:
    fout.mktree("tree", {"branch": ak_array.type})
    fout["tree"].extend({"branch": ak_array})

print("Total time:", time.time() - starttime)

The total time (on my computer) is 0.0020 seconds and the TTree characteristics are much better:

******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    10000 : Total =          240913 bytes  File  Size =       3069 *
*        :          : Tree compression factor = 107.70                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=     240576 bytes  File Size  =       2229 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression= 107.70     *
*............................................................................*

So, the writing is almost 1000× faster and the compression is 100× better. (With one entry per TBasket in the previous example, there was no compression because any compressed data would be bigger than the original!)

By comparison, if we do entry-by-entry writing with PyROOT,

import time
import array
import ROOT

data = [1, 2, 3]
holder = array.array("q", [0]*3)

file = ROOT.TFile("pyroot.root", "RECREATE")
tree = ROOT.TTree("tree", "An Example Tree")
tree.Branch("branch", holder, "branch[3]/L")

starttime = time.time()
for _ in range(10000):
    for i, x in enumerate(data):
        holder[i] = x

    tree.Fill()

tree.Write("", ROOT.TObject.kOverwrite)
file.Close()

print("Total time:", time.time() - starttime)

The total time (on my computer) is 0.062 seconds and the TTree characteristics are fine:

******************************************************************************
*Tree    :tree      : An Example Tree                                        *
*Entries :    10000 : Total =          241446 bytes  File  Size =       3521 *
*        :          : Tree compression factor =  78.01                       *
******************************************************************************
*Br    0 :branch    : branch[3]/L                                            *
*Entries :    10000 : Total  Size=     241087 bytes  File Size  =       3084 *
*Baskets :        8 : Basket Size=      32000 bytes  Compression=  78.01     *
*............................................................................*

So, PyROOT is 30× slower here, but the compression is almost as good. ROOT decided to make 8 TBaskets, which is configurable with AutoFlush parameters.

Keep in mind, though, that this is a comparison of techniques, not libraries. If you wrap a NumPy array with RDataFrame and write that, then you can skip all of the overhead involved in the Python for loop and you get the advantages of columnar processing.

But columnar processing only matters if you're working with big data. Much like compression, if you apply it to very small datasets (or a very small dataset many times), then it can hurt, rather than help.

-- Question -----------------------------------------------

I am using Uproot to access a Root Tree in Python and I am noticing a significant slowdown when I try to access one particular branch: wf, which contains an array of jagged arrays

I am accessing the branches by using the Lazy/Awkward method and I am using the step_size option.

LazyFileWF = uproot.lazy('../Layers9_Xe_Phantom102_run1.root:dstree;111', filter_name="wf", step_size=100)

I experience a 6 to 10 second slowdown when I want to access an entry in LazyFileWF, but if I move on to the next consecutive entry, it only takes about 14 ms until the end of the step_size. However, my script needs to select entries randomly, not sequentially, which means every entry will take me about 8 seconds to access. I am able to access data from the other branches fairly quickly, with the exception of this one, and I wanted to find out why.

By using uproot.open() and then .show() I noticed that the interpretation of the branch was being labeled as AsObjects(AsVector(True, AsVector(False, dtype('>f4'))))

I did some digging in the documentation and found that it mentions I can use simplify to improve the slow deserialization.

So here's what I would like to know: based on the ROOT tree I have, can I use simplify to reduce the 8 second slowdown when accessing my branch? And if so, how can I implement it? Is there a better way to read this branch?

I tried:

a = uproot.AsObjects.simplify(LazyFileWF.wf)
a

but I got an error telling me

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_147439/260639244.py in <module>
      5 LazyFileWF = uproot.lazy('../Layers9_Xe_Phantom102_run1.root:dstree;111', filter_name= "wf",step_size=100)
      6 events.show(typename_width=35, interpretation_width= 60)
----> 7 a = uproot.AsObjects.simplify(LazyFileWF.wf)
      8 a

~/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/uproot/interpretation/objects.py in simplify(self)
    245         ``self``.
    246         """
--> 247         if self._branch is not None:
    248             try:
    249                 return self._model.strided_interpretation(

~/anaconda3/envs/rapids-21.10/lib/python3.7/site-packages/awkward/highlevel.py in __getattr__(self, where)
   1129                 raise AttributeError(
   1130                     "no field named {0}".format(repr(where))
-> 1131                     + ak._util.exception_suffix(__file__)
   1132                 )
   1133 

AttributeError: no field named '_branch'

(https://github.com/scikit-hep/awkward-1.0/blob/1.7.0/src/awkward/highlevel.py#L1131)

-- Answer ------------

The AsObjects.simplify function is internally applied to produce the default TBranch.interpretation that you're using if you don't override the interpretation when loading a TBranch as an array. The only reason you'd pass a custom interpretation is if the default is wrong—it's a back-door to fixing cases in which Uproot auto-detected the interpretation incorrectly.

If the default TBranch.interpreation is

AsObjects(AsVector(True, AsVector(False, dtype('>f4'))))

then it did try to simplify—i.e. replace the AsObjects with an AsStridedObjects or AsJagged—but couldn't. This must be a C++ std::vector<std::vector<float>>, which has a variable number of bytes per object, so there aren't any simplified interpretations that will work. What's "simplified" about AsStridedObjects and AsJagged is that they have a fixed number of bytes per object and can therefore be interpreted in bulk, without a Python for loop over all the items in the TBasket.

Incidentally, we studied this exact case in https://arxiv.org/abs/2102.13516, and the AwkwardForth solution described in that paper will be adapted into Uproot this summer. Unfortunately, that doesn't help you right now.

The slow-fast pattern you're seeing is because each time you ask for an entry from a different TBasket, Uproot interprets the whole TBasket. If you were running sequentially, you'd see a pause at the beginning of each TBasket. The lazy array is caching interpreted data, so when your random-access comes back to a previously read TBasket, it should be fast again: by only looking at the first few requests, you're getting an impression that each request will be slow, but that's just because early requests are more likely to hit unread TBaskets than late requests.

If you're only looking into this because the process as a whole is too slow (i.e. just letting it run and fill up its cache isn't good enough), then consider reading the whole TBranch into an array and randomly accessing the array. If your random access is in a Python loop (as opposed to Numba), then there's also nothing to be gained and some performance to be lost by calling __getitem__ on an Awkward Array as opposed to a NumPy array, so pass library="np".
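
For example, a minimal sketch of that approach with the names from your question (one slow pass up front, then cheap in-memory random access):

import uproot

with uproot.open("../Layers9_Xe_Phantom102_run1.root") as f:
    wf = f["dstree;111"]["wf"].array(library="np")   # deserializes every TBasket once

event_wf = wf[12345]   # random access is now just an in-memory lookup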

If you don't have enough memory to load the entire TBranch into an array—which could explain why you're using a lazy array—then you're in a difficult position, because the lazy array's caching would work against you: it would evict from cache the TBaskets that haven't been hit in a while, so even a long-running process would end up repeatedly reading/interpreting. This is a fundamental issue in random-access problems of data that are too large for memory: there isn't a good way to cache it because new requests keep pushing old results out of cache. (The same problem applies to disk access, web-cached data, databases, etc.)

Hopefully, the array fits into memory and you can random-access it in memory. Awkward Arrays have slower __getitem__ than NumPy, but they're more compact in memory, so which one will work best for you depends on the details.

I hope these pointers help!

-- Question -----------------------------------------------

I have a dask/boost-histogram question. I have a code structure as follows:

I have a class defined in some script:

class MyHist:
    def __init__(self, ...):
        self.bh = None
    def make_hist(self, ...):
        axis = bh.axis.Regular(...)
    @dask.delayed
    def fill_hist(self, data):
        self.bh.fill(data)

In another script, I want to fill multiple histograms in parallel with dask. The data are awkward arrays that I read from input, and for that I do something like:

from dask.distributed import Client
cl = Client()
histos = [MyHist(..), MyHist(another...)]
fill_results = []
for i, file in enumerate(files):
    data = dask.delayed(open_file(file))
    for myhist in histos:
        if i == 0:
            myhist.make_hist()
        fill_results.append(myhist.fill_hist(data))
dask.compute(*fill_results)

If I then try to call

for j, h in enumerate(histos):
        print(h.bh) 

I get empty histograms. However, if I print the boost histogram inside the fill_hist funciton, the histograms seem to be filled.

Does the dask computation create a deep copy (or something like it) of the MyHist object to perform the computation, and hence fill the bh associated with that copy? Or am I doing something wrong here?

Update:

I see a similar or worse wall time when using dask to read and fill than when using sequential code. This is the case whether I use my code or the suggested answer. For an example that doesn't use an intermediate class, I've written the following code:

from copy import copy

import dask
import awkward as ak
import boost_histogram as bh

files = get_input_file_paths('myprocess')

@dask.delayed
def make_a_var(jet_pt):
    jets_pt = copy(jet_pt)
    jets_pt = ak.mask(jets_pt, ak.count(jets_pt, axis=1)>=1)
    return jets_pt[:, 0]*1e-3

@dask.delayed
def make_and_fill(data, axes):
    h = bh.Histogram(*axes, storage=bh.storage.Weight())
    h.fill(data)
    return h 

batch_size = 4
results = []
for i in range(0, len(files), batch_size):
    batch = []
    for j, file in enumerate(files[i:i+batch_size]):
        data = dask.delayed(read_file(file))
        var = data['jet_pt']
        new_var = make_a_var(var)
        new_var = new_var.to_numpy()  # needed because bh breaks for masked ak arrays
        new_var = new_var.compressed()
        for k in range(10):
            axes = (bh.axis.Regular(25, 0, 250), )
            h = make_and_fill(new_var, axes)
            batch.append(h)
    results.append(batch)
dask.compute(*results)

It takes a similar amount of wall time (~7 s) to run this code sequentially as with dask for k in range(10). For k in range(100), the parallel code takes 15 s and the sequential code takes 21 s, which is not as big an improvement as I would have thought.

-- Answer ------------

I believe Jim's comment is correct w.r.t. the source of the problem; I'll also offer a solution I think may be helpful in solving the problem:

I think the definition of your class makes it difficult to work correctly with dask; that is, you probably will have an easier time if your fill_hist method were actually a free function. And in your loop you are actually calling dask.delayed on an already delayed method (this is likely not what you want to do):

fill_results.append(dask.delayed(myhist.fill_hist(data)))
#                                       ^^^^^^^^^
#                                 already delayed method

My suggestion would be to go with a free function:

@dask.delayed
def fill_hist(data, axes, storage=None):
    storage = storage or bh.storage.Double()
    h = bh.Histogram(*axes, storage=storage)
    h.fill(data)
    return h

@dask.delayed
def open_file(fname):
    data = some_function_to_get_data(fname)
    return data

axes = (bh.axis.Regular(100, -10, 10),)  # tuple with a single axis
tasks = []
for f in files:
    data = open_file(f)
    hist = fill_hist(data=data, axes=axes)
    tasks.append(hist)

results = dask.compute(tasks)

This pattern is very similar to how dask-histogram works on its backend (and dask-histogram has support for dask-awkward!).

-- Question -----------------------------------------------

I wish to sum all the 4-momenta of the constituents in a jet. In uproot3 (+ uproot3-methods) there was the functionality of creating a TLorentzVectorArray and just doing .sum()

So this worked fine:

import uproot3
import uproot3_methods
import awkward0 as ak

input_file = uproot3.open(input_path)
tree = input_file['Jets']
pt = tree.array('Constituent_pt')
phi = tree.array('Constituent_phi')
eta = tree.array('Constituent_eta')
energy = tree.array('Constituent_energy')
mass = tree.array('Constituent_mass')
p4 = uproot3_methods.TLorentzVectorArray.from_ptetaphie(pt, eta, phi, energy)
jet_p4_u3 = p4.sum()
jet_pt_u3 = jet_p4_u3.pt
jet_eta_u3 = jet_p4_u3.eta
jet_phi_u3 = jet_p4_u3.phi
jet_energy_u3 = jet_p4_u3.energy

However, since uproot3 is deprecated, the way to go according to https://stackoverflow.com/questions/66364294/tlorentz-vector-in-uproot-4 seems to be the vector package. What I tried was the following.

import uproot
import awkward
import vector

input_file = uproot.open(input_path)
tree = input_file['Jets']
pt = tree.arrays()['Constituent_pt']
phi = tree.arrays()['Constituent_phi']
eta = tree.arrays()['Constituent_eta']
energy = tree.arrays()['Constituent_energy']
mass = tree.arrays()['Constituent_mass']
p4 = vector.awk({"pt": pt, "phi": phi, "eta": eta, "energy": energy})

The problem now is that the p4.sum() functionality does not seem to exist there. The other possibility that I found was shown in the vector discussion #117. So now I add vector.register_awkward() after the imports and jet_p4_u4 = ak.Array(p4, with_name="Momentum4D") at the end:

import uproot
import awkward as ak
import vector
vector.register_awkward()

input_file = uproot.open(input_path)
tree = input_file['Jets']
pt = tree.arrays()['Constituent_pt']
phi = tree.arrays()['Constituent_phi']
eta = tree.arrays()['Constituent_eta']
energy = tree.arrays()['Constituent_energy']
mass = tree.arrays()['Constituent_mass']
p4 = ak.Array({"pt": pt, "phi": phi, "eta": eta, "energy": energy})
jet_p4_u4 = ak.Array(p4, with_name="Momentum4D")

The question remains, how do I sum the 4-momenta? When doing ak.sum(jet_p4_u4, axis=-1), only pt and energy seem to have the correct values, eta and phi however are completely different from the result from uproot3.

Update: Since the ak.sum function is not able to add the angles together in the wanted way, replacing the summing part with summing x, y, z and energy and constructing the vector from those solves the problem. However, I believe there must be a better way than this. So, the current working version:

import uproot
import awkward as ak
import vector

input_file = uproot.open(input_path)
tree = input_file['Jets']
pt = tree.arrays()['Constituent_pt']
phi = tree.arrays()['Constituent_phi']
eta = tree.arrays()['Constituent_eta']
energy = tree.arrays()['Constituent_energy']
mass = tree.arrays()['Constituent_mass']
p4 = vector.awk({"pt": pt, "phi": phi, "eta": eta, "energy": energy})
p4_lz = vector.awk({"x": p4.x, "y": p4.y, "z": p4.z, "t": energy})
lz_sum = ak.sum(p4_lz, axis=-1)
jet_p4 = vector.awk({
    "x": lz_sum.x,
    "y": lz_sum.y,
    "z": lz_sum.z,
    "t": lz_sum.t
})
jet_energy = jet_p4.t
jet_mass = jet_p4.tau
jet_phi = jet_p4.phi
jet_pt = jet_p4.rho

-- Answer ------------

For a solution that works equally well for flat arrays of Lorentz vectors as for jagged arrays of Lorentz vectors, try this:

import uproot
import awkward as ak
import vector
vector.register_awkward()   # any record named "Momentum4D" will be Lorentz

with uproot.open(input_path) as input_file:
    tree = input_file["Jets"]
    arrays = tree.arrays(filter_name="Constituent_*")
    p4 = ak.zip({
        "pt": arrays.Constituent_pt,
        "phi": arrays.Constituent_phi,
        "eta": arrays.Constituent_eta,
        "energy": arrays.Constituent_energy,
    }, with_name="Momentum4D")
    jet_p4 = ak.zip({
        "px": ak.sum(p4.px, axis=-1),
        "py": ak.sum(p4.py, axis=-1),
        "pz": ak.sum(p4.pz, axis=-1),
        "energy": ak.sum(p4.energy, axis=-1)
    }, with_name="Momentum4D")

Note that the uproot.TTree.arrays function, if given no arguments, will read all TBranches in the TTree. In your code, you read all of the data five times, each time selecting a different column from the data that had been read and throwing the rest out.

Also, I don't like the vector.awk function because it can construct arrays of type:

N * Momentum4D[px: var * float64, py: var * float64, pz: var * float64, E: var * float64]

(in other words, each "px" value is a list of floats), rather than what you want:

N * var * Momentum4D[px: float64, py: float64, pz: float64, E: float64]

ak.zip combines the lists so that the "px" of each Lorentz vector is just a number, but you can have nested lists of Lorentz vectors. This only makes a difference if you have jagged arrays, but I'm pointing it out so that no one falls into this trap.

The with_name="Momentum4D" argument labels the records with that name, and having Lorentz-vector behaviors registered with vector.register_awkward() gives all such records Lorentz vector methods. In this case, we're using it so that p4, defined in terms of pt, phi, eta, energy, has properties px, py, pz—in other words, doing coordinate transformations on demand.

There isn't a Lorentz vector summation method that sums over each item in a jagged array (the uproot-methods one was a hack that only works for jagged arrays of Lorentz vectors, no other structures, like jagged-jagged, etc.), so sum the components with ak.sum in Cartesian coordinates.
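
If you then want the summed jets back in pt/eta/phi form, the "Momentum4D" behavior computes those coordinates on demand, just as it computed px, py, pz above. A minimal sketch, continuing from the jet_p4 defined above:

jet_pt = jet_p4.pt
jet_eta = jet_p4.eta
jet_phi = jet_p4.phi
jet_energy = jet_p4.energy
jet_mass = jet_p4.mass   # invariant mass of the summed constituents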

-- Question -----------------------------------------------

I have a root file open in python as such:

file = uproot.open('C:\\Users\\me\\Documents\\test.root')
print(file.keys())
[b'evts;1', b'miscAccum;1', b'cal;1', b'configstr;1', b'time;1', b'Plugs;1', b'acc;1']
print(file[b'Plugs'].keys())
[b'veto;1', b'miscAccum;1', b'calfine;1']

All good but if I try to explore the time branch with print(file[b'time'].keys()) I get an error saying

'TVectorT_3c_double_3e_' object has no attribute 'keys'

How can I explore the value(s) within this branch?

-- Answer ------------

Directories (uproot.ReadOnlyDirectory) and TTrees (uproot.TTree) are objects that can initiate more reading, so they are navigable as Mappings with keys() and square-bracket syntax (__getitem__).

This object evidently has class type TVectorT_3c_double_3e_ (a Python model, uproot.Model, of C++ class TVectorT<double>). Some objects like this have custom "behaviors," which are user-friendly properties and methods. For example, the model of TParameter<double> has a value property (see uproot.behaviors.TParameter). This one does not.

When Uproot reads an object and does not have custom behaviors, you can still access its C++ private member data through the uproot.Model.member method or the uproot.Model.all_members property (dict). In fact, these are how high-level behaviors get implemented: the reading of C++ private member data is automatic, but humans have to design high-level APIs by writing functions to access them in a more user-friendly way, something that's happened for TParameter<double> but not TVectorT<double> yet.

(To first order, ROOT has infinitely many classes.)

So, to be explicit, do

file["time"].all_members

and see what you get.
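
If the member list turns out to include the vector's data array (for TVectorT<double> it is plausibly called fElements, but check your own output, since this is an assumption), you can pull the numbers out directly. A hypothetical sketch:

import numpy as np

elements = file["time"].member("fElements")
values = np.asarray(elements)   # plain NumPy array of the vector's contents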


One other thing, unrelated to that: I noticed that you're using Uproot 3 because all of the keys are bytes instead of str. Unless you're maintaining a legacy script, you should upgrade to Uproot 4. Since Dec 2020, Uproot 4 is just "uproot":

pip install uproot awkward

(Usually, you want Awkward Array as well, so I included that in the pip-install. You might need to pip uninstall the versions you have now to get a clean slate.)

The documentation links and interface I described above are all for Uproot 4.

-- Question -----------------------------------------------

I have a file in cern root format which contains a number of TH1D and TH2D histograms. I'd like to be able to plot these by using uproot4 to read them, then matplotlib to do the plotting.
I can open the file with uproot.open(path) ok, and print(file.keys()) gives a list of the histogram names, but I can't list the histogram contents or convert them to any other meaningful forms. Can anyone point me to example code?

-- Answer ------------

Following this section of the tutorial, particularly the end, you can do:

import uproot
import matplotlib.pyplot as plt
file = uproot.open("https://scikit-hep.org/uproot3/examples/hepdata-example.root")
file["hpxpy"].to_hist().plot()
plt.show()

The to_hist() part sends the histogram to the hist library (which must be installed). The hist documentation has sections on installation and plotting.
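
If you would rather not install hist, you can also pull the bin contents and edges out as NumPy arrays and plot them yourself. A minimal sketch, assuming the same example file and its 1D histogram "hpx" (plt.stairs needs Matplotlib 3.4 or newer):

import uproot
import matplotlib.pyplot as plt

file = uproot.open("https://scikit-hep.org/uproot3/examples/hepdata-example.root")
values, edges = file["hpx"].to_numpy()   # bin contents and bin edges
plt.stairs(values, edges)
plt.show()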

-- Question -----------------------------------------------

I am working with a root file (array of arrays). When I load the array into python, I get an awkward array since this is an array of arrays of varying sizes. I would like to learn how to convert this to a numpy array of arrays of the same size, by populating empty elements with NaNs. How can I convert an awkward array of varying size to a numpy array?

-- Answer ------------

Suppose that you have an array of variable-length lists a:

>>> import numpy as np
>>> import awkward as ak
>>> a = ak.Array([[0, 1, 2], [], [3, 4], [5], [6, 7, 8, 9]])
>>> a
<Array [[0, 1, 2], [], ... [5], [6, 7, 8, 9]] type='5 * var * int64'>

The function that makes all lists have the same size is ak.pad_none. But first, we need a size to pad it to. We can get the length of each list with ak.num and then take the np.max of that.

>>> ak.num(a)
<Array [3, 0, 2, 1, 4] type='5 * int64'>
>>> desired_length = np.max(ak.num(a))
>>> desired_length
4

Now we can pad it and convert that into a NumPy array (because it now has rectangular shape).

>>> ak.pad_none(a, desired_length)
<Array [[0, 1, 2, None], ... [6, 7, 8, 9]] type='5 * var * ?int64'>
>>> ak.to_numpy(ak.pad_none(a, desired_length))
masked_array(
  data=[[0, 1, 2, --],
        [--, --, --, --],
        [3, 4, --, --],
        [5, --, --, --],
        [6, 7, 8, 9]],
  mask=[[False, False, False,  True],
        [ True,  True,  True,  True],
        [False, False,  True,  True],
        [False,  True,  True,  True],
        [False, False, False, False]],
  fill_value=999999)

The missing values (None) are converted into a NumPy masked array. If you want a plain NumPy array, you can ak.fill_none to give them a replacement value.

>>> ak.to_numpy(ak.fill_none(ak.pad_none(a, desired_length), 999))
array([[  0,   1,   2, 999],
       [999, 999, 999, 999],
       [  3,   4, 999, 999],
       [  5, 999, 999, 999],
       [  6,   7,   8,   9]])

-- Question -----------------------------------------------

I have a multiindex pandas dataframe that looks like this (called p_z):

                     p_z
entry subentry
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
...                  ...
26692 891      -0.459227
      892       0.055993
      893      -0.469857
      894       0.192554
      895       0.155738

[11742280 rows x 1 columns]

I want to be able to select certain rows based on another dataframe (or numpy array) which is multidimensional. It would look like this as a pandas dataframe (called tofpid):

                tofpid
entry subentry
0     0              0
      1              2
      2              4
      3              5
      4              7
...                ...
26692 193          649
      194          670
      195          690
      196          725
      197          737

[2006548 rows x 1 columns]

I also have it as an awkward array, where it's a (26692, ) array (each of the entries has a non-standard number of subentries). This is a selection df/array that tells the p_z df which rows to keep. So in entry 0 of p_z, it should keep subentries 0, 2, 4, 5, 7, etc.

I can't find a way to get this done in pandas. I'm new to pandas, and even newer to multiindex; but I feel there ought to be a way to do this. If it's able to be broadcast even better as I'll be doing this over ~1500 dataframes of similar size. If it helps, these dataframes are from a *.root file imported using uproot (if there's another way to do this without pandas, I'll take it; but I would love to use pandas to keep things organised).

Edit: Here's a reproducible example (courtesy of Jim Pivarski's answer; thanks!).

import awkward as ak
import pandas as pd

>>> p_z = ak.Array([[ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  
                      0.338738, 0.636035],
                    [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, 
                     -0.459227]])
>>> p_z = ak.to_pandas(p_z)
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid = ak.to_pandas(tofpid)

Both of these dataframes are produced natively in uproot, but this will reproduce the same dataframes that uproot would (using the awkward library).

-- Answer ------------

Here's a reproducible example with enough structure to represent the problem (using the awkward library):

>>> import awkward as ak
>>> 
>>> p_z = ak.Array([
...     [ 0.338738, 0.636035, -0.307365, -0.167779, 0.243284,  0.338738, 0.636035],
...     [-0.459227, 0.055993, -0.469857,  0.192554, 0.155738, -0.459227],
... ])
>>> p_z
<Array [[0.339, 0.636, ... 0.156, -0.459]] type='2 * var * float64'>
>>> 
>>> tofpid = ak.Array([[0, 2, 4, 5], [1, 2, 4]])
>>> tofpid
<Array [[0, 2, 4, 5], [1, 2, 4]] type='2 * var * int64'>

In Pandas form, this is:

>>> df_p_z = ak.to_pandas(p_z)
>>> df_p_z
                  values
entry subentry          
0     0         0.338738
      1         0.636035
      2        -0.307365
      3        -0.167779
      4         0.243284
      5         0.338738
      6         0.636035
1     0        -0.459227
      1         0.055993
      2        -0.469857
      3         0.192554
      4         0.155738
      5        -0.459227
>>> df_tofpid = ak.to_pandas(tofpid)
>>> df_tofpid
                values
entry subentry        
0     0              0
      1              2
      2              4
      3              5
1     0              1
      1              2
      2              4

As an Awkward Array, what you want to do is slice the first array by the second. That is, you want p_z[tofpid]:

>>> p_z[tofpid]
<Array [[0.339, -0.307, ... -0.47, 0.156]] type='2 * var * float64'>
>>> p_z[tofpid].tolist()
[[0.338738, -0.307365, 0.243284, 0.338738], [0.055993, -0.469857, 0.155738]]

Using Pandas, I managed to do it with this:

>>> df_p_z.loc[df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()]
                  values
entry subentry          
0     0         0.338738
      2        -0.307365
      4         0.243284
      5         0.338738
1     1         0.055993
      2        -0.469857
      4         0.155738

What's happening here is that df_tofpid.reset_index(level=0) turns the "entry" part of the MultiIndex into a column, then apply executes a Python function on each row (because axis=1), each row being x.values, and tolist() turns the result into a list of tuples like

>>> df_tofpid.reset_index(level=0).apply(lambda x: tuple(x.values), axis=1).tolist()
[(0, 0), (0, 2), (0, 4), (0, 5), (1, 1), (1, 2), (1, 4)]

This is what loc needs to select entry/subentry pairs from its MultiIndex.

My Pandas solution has two disadvantages: it's complicated, and it goes through Python iteration and objects, which doesn't scale as well as arrays. There is a good chance that a Pandas expert would find a better solution than mine. There's a lot I don't know about Pandas.
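
One possibly simpler variant of the same idea (an untested sketch; it assumes pandas 0.24+ for MultiIndex.from_frame) is to build the MultiIndex of entry/subentry pairs directly, instead of going through apply:

import pandas as pd

pairs = pd.MultiIndex.from_frame(df_tofpid.reset_index(level=0))
selected = df_p_z.loc[pairs]

Here, df_tofpid.reset_index(level=0) has an "entry" column followed by the column of selected subentries, so from_frame turns its rows into the same (entry, subentry) tuples that loc needs.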

-- Question -----------------------------------------------

I have a 2D correlation matrix between nuisance parameters that are used in a likelihood fit. I would like to translate this to NumPy to use with other vis libraries. The problem is TH2D.edges returns the bin index and I need the bin label, i.e. which bin is which nuisance parameter. The TH2D object has the following members:

['fName',
 'fTitle',
 'fLineColor',
 'fLineStyle',
 'fLineWidth',
 'fFillColor',
 'fFillStyle',
 'fMarkerColor',
 'fMarkerStyle',
 'fMarkerSize',
 'fNcells',
 'fXaxis',
 'fYaxis',
 'fZaxis',
 'fBarOffset',
 'fBarWidth',
 'fEntries',
 'fTsumw',
 'fTsumw2',
 'fTsumwx',
 'fTsumwx2',
 'fMaximum',
 'fMinimum',
 'fNormFactor',
 'fContour',
 'fSumw2',
 'fOption',
 'fFunctions',
 'fBufferSize',
 'fBuffer',
 'fBinStatErrOpt',
 'fStatOverflows',
 'fScalefactor',
 'fTsumwy',
 'fTsumwy2',
 'fTsumwxy']

Is it possible to extract the bin label for each index?

-- Answer ------------

TH2 has an axes tuple and an axis method for getting each of its two TAxis objects.

From there, TAxis has a labels method to extract the labels.

You didn't see it in the list of TH2D data members because it's in the fXaxis and fYaxis.
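
A hypothetical sketch of what that looks like in Uproot 4 (the file name and histogram key here are made up):

import uproot

file = uproot.open("fit_results.root")
h = file["correlation_matrix"]

x_labels = h.axis(0).labels()   # or equivalently h.axes[0].labels()
y_labels = h.axis(1).labels()
values = h.values()             # bin contents as a 2D NumPy array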

-- Question -----------------------------------------------

I am trying to add a 0 to every row of a jagged array. I want to go from

<JaggedArray [[1 2 3] [1 2]]>

to

<JaggedArray [[1 2 3 0] [1 2 0]]>

so that when I grab the -1th index, I get 0. Currently I'm padding every row to the length of the biggest row + 1, then filling nans with 0, which works, but I am wondering if there's a better way.

I saw that there's a class AppendableArray that has a .append() function, but I'm not sure how to convert between the two. I'm using awkward 0.12.22, and the data is read out of a ROOT file with uproot 3.11.0

-- Answer ------------

Perhaps this is too short to be an answer, but

  1. Upgrade to Awkward 1.x (you can still import awkward0 and use ak.from_awkward0 and ak.to_awkward0 to go back and forth in the same process).
  2. Create an array of single-item lists, perhaps in NumPy (ak.from_numpy), perhaps by slicing a one-dimensional array with np.newaxis.
  3. Concatenate it with your other array using ak.concatenate with axis=1. The first dimension needs to be the same (len of both arrays must be equal), but the second dimensions are unconstrained.
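
Putting those steps together, a minimal sketch in Awkward 1.x (using the example arrays from the question):

import numpy as np
import awkward as ak

arr = ak.Array([[1, 2, 3], [1, 2]])

# step 2: an array of single-item lists, one per row
zeros = ak.Array(np.zeros(len(arr), dtype=np.int64)[:, np.newaxis])

# step 3: concatenate along the inner axis
padded = ak.concatenate([arr, zeros], axis=1)
# padded is [[1, 2, 3, 0], [1, 2, 0]], so padded[:, -1] is 0 for every row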

-- Question -----------------------------------------------

So aim is I'm trying to save some arrays as parquets. I can use a python debugger to reach the point in my code that they are ready for saving. Inside my complicated mess of code they look like;

ipdb> ak.__version__
'1.2.2'
ipdb> array1
<Array [... 0., 1.]], [[50.], [4., 47.]]]] type='1 * 3 * var * var * float64'>
ipdb> array2
<Array [[False, True, True]] type='1 * 3 * bool'>

If I try to save them it doesn't work, the error I get is

ipdb> group = ak.zip({'a': array1, 'b': array2}, depth_limit=1)
ipdb> ak.to_parquet(group, 'test.parquet')
*** ValueError: could not broadcast input array from shape (3) into shape (1)

So I start messing around in the terminal to try and recreate the problem and debug it, but I actually cannot replicate it. Here is what happens;

In [1]: import awkward as ak
In [2]: ak.__version__
'1.2.2'
In [3]: cat = ak.from_iter([[True, False, True]])
In [4]: dog = ak.from_iter([[[], [[50.0], [0.2, 0.1, 0., 0., 0.1]], [[50.0], [21., 0.1, 0.]]]])
In [5]: pets = ak.zip({'dog':dog, 'cat':cat}, depth_limit=1)
In [6]: ak.to_parquet(pets, "test.parquet")
In [7]: # no problems
In [8]: cat
<Array [[False, True, True]] type='1 * var * bool'>

Notice that the dimensions have changed from 1 * 3 * bool to 1 * var * bool. That seems to be the only difference, but I can't seem to work out how to control this.


Having managed to isolate the issue, it wasn't what I thought it was. The problem comes when using np.newaxis to make a new axis in a boolean array, then trying to save it.

dog = ak.from_iter([1, 2, 3])[np.newaxis]
pets = {"dog": dog}
zipped = ak.zip(pets, depth_limit=1)
ak.to_parquet(zipped, "test.parquet")
# works fine

dog = ak.from_iter([True, False, True])[np.newaxis]
pets = {"dog": dog}
zipped = ak.zip(pets, depth_limit=1)
ak.to_parquet(zipped, "test.parquet")

# Gives 
ValueError: could not broadcast input array from shape (3) into shape (1)

I should really know better than to post a question without isolating the problem first. Apologies for wasting your time.

-- Answer ------------

In ak.zip, depth_limit=1 means that the arrays are not deeply matched ("zipped") together: the only constraint is that len(array1) == len(array2). Is this not satisfied?

In your pets example, len(cat) == 1 and len(dog) == 1. Since you're asking for depth_limit=1, it doesn't matter whether len(cat[0]) == len(dog[0]), though in this case they happen to match (they're both 3). Thus, it would be possible to zip these at depth_limit=2, even though that's not what you're asking for.

Since the error message is saying that the mismatching lengths of array1 and array2 are 3 and 1, that should be easy to inspect in the debugger:

array1[0]
array1[1]
array1[2]   # should be the last one

array2[0]   # should be the only one

I hope this sheds some light on your problem!


Looking more closely, I see that you're telling me that you know the lengths of array1 and array2. They're both length 1. There should be no trouble zipping them at depth_limit=1.

You can make your pets example have exactly the right types by calling ak.to_regular on that axis:

>>> cat = ak.to_regular(ak.from_iter([[True, False, True]]), axis=1)
>>> dog = ak.to_regular(ak.from_iter([[[], [[50.0], [0.2, 0.1, 0., 0., 0.1]], [[50.0], [21., 0.1, 0.]]]]), axis=1)
>>> cat
<Array [[True, False, True]] type='1 * 3 * bool'>
>>> dog
<Array [... 0, 0.1]], [[50], [21, 0.1, 0]]]] type='1 * 3 * var * var * float64'>

So the types are exactly 1 * 3 * bool and 1 * 3 * var * var * float64. Zipping works:

>>> pets = ak.zip({'dog':dog, 'cat':cat}, depth_limit=1)
>>> pets
<Array [... 0]]], cat: [True, False, True]}] type='1 * {"dog": var * var * var *...'>
>>> pets.type
1 * {"dog": var * var * var * float64, "cat": var * bool}

Maybe the array1 and array2 you think you're working with are not what you're really working with?

-- Question -----------------------------------------------

So a collection of records has been created using awkward.zip;

import awkward
pineapple = awkward.from_iter([3, 4])
tofu = awkward.from_iter([5, 6])
pizza = awkward.zip({"pineapple": pineapple, "tofu": tofu})

I'd like to remove one of the records/fields from the collection I'm working with. My first thought is to do del pizza['pineapple'], but that doesn't work. I could do;

pizza = awkward.zip({name: pizza[name] for name in awkward.fields(pizza)
                     if name != 'pineapple'})

But I'm wondering if there is something more efficient I could be doing. I'm not trying to delete the object pineapple, just remove the reference to it in the pizza.

-- Answer ------------

There isn't an ak.Array.__del__, but that doesn't mean there couldn't be one. However, it would be implemented in much the same way you have above—it would build a new record with all fields except the one to remove. Any operation that puts arrays together as records or splits arrays out of records is O(1) in the length of the arrays: they only change metadata, like reshaping a NumPy array. We generally only worry about performance for things that scale with the length of an array, as this is the parameter that can grow large (it's assumed that you'd have a lot more array entries than record fields).

You could also do it with a slice instead of an ak.zip:

pizza = pizza[[x for x in ak.fields(pizza) if x != "pineapple"]]

This would work regardless of how deep the records are within nested lists of lists. See nested projection.

-- Question -----------------------------------------------

A follow up from this question; https://stackoverflow.com/questions/66851761/best-way-to-save-a-dict-of-awkward1-arrays/66854439#66854439

To save multiple columns of nested awkward1 arrays (with varying length);

import numpy as np
import awkward1 as ak
dog = ak.from_iter([[1, 2], [5]])
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)

ak.to_parquet(pets, "pets.parquet")

Unfortunately, this doesn't seem to work for flat lists;

import numpy as np
import awkward1 as ak
dog = ak.from_iter([1, 2, 5])
cat = ak.from_iter([4])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)

ak.to_parquet(pets, "pets.parquet")

creates the error;

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-31-7f3a7fefb261> in <module>
      3 cat = ak.from_iter([3])
      4 pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
----> 5 ak.to_parquet(pets, "pets.parquet")

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_parquet(array, where, explode_records, list_to32, string_to32, bytestring_to32, **options)
   2983     layout = to_layout(array, allow_record=False, allow_other=False)
   2984     iterator = batch_iterator(layout)
-> 2985     first = next(iterator)
   2986
   2987     if "schema" not in options:

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in batch_iterator(layout)
   2978                 )
   2979             yield pyarrow.RecordBatch.from_arrays(
-> 2980                 pa_arrays, schema=pyarrow.schema(pa_fields)
   2981             )
   2982

~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays()

TypeError: object of type 'pyarrow.lib.Tensor' has no len()

What is the reason for encountering this error?

-- Answer ------------

What you found is a bug, and now it is fixed: scikit-hep/awkward#799

What's happening here is that pyarrow can't write pyarrow.lib.Tensor (regular-length lists, such as the one you created with np.newaxis) to Parquet files. Parquet files don't have a concept of "regular-length list," so that makes sense. But rather than converting it, pyarrow hits an unhandled case, in which it fails to find the length of that pyarrow.lib.Tensor. (It's a little odd that pyarrow.lib.Tensor doesn't have a __len__ method, but that's another thing.)

Anyway, with version 1.2.0 of Awkward Array, we'll simply convert regular-length lists into (in principle) variable-length lists when writing to Parquet, since the format doesn't have that type. According to the schedule, version 1.2.0 will be released tomorrow. (This bug-fix is likely the last prerelease.)

-- Question -----------------------------------------------

So back in awkward v0 it was possible to do;

import awkward
dog = awkward.fromiter([[1., 2.], [5.]])
cat = awkward.fromiter([[4], [3]])
dict_of_arrays = {'dog': dog, 'cat': cat}
awkward.save("pets.awkd", dict_of_arrays)

Then we could lazy load the array

reloaded_data = awkward.load("pets.awkd")
# no data in ram
double_dog = reloaded_data["dog"]*2
# dog is in ram but not cat

In short, we have a dataset consisting of 'dog' and 'cat' parts. The whole dataset saves to one file on disk. Even if I didn't have any documentation, it would be obvious what data is dog and what is cat. Dog and cat load as awkward arrays. I can load the data and work with just one part without the other part ending up in RAM.

I'm looking for the best way to do this in awkward v1. The requirements I would like to meet are;

  • The data consists of multiple named parts, with irregular shapes.
  • All items in one named part have the same data type, different parts may have different data types.
  • Some sort of lazy loading needs to be possible, working on bits of the data as awkward1 arrays without the whole thing.
  • Ideally, the names of the parts are unambiguously associated with the data for each part. Dict structure is good for this, but other things could work.
  • Ideally, the whole dataset saves and loads from one file without speed penalty.
  • Ideally, when the array is loaded it has the right type, so in the example dog is a float array and cat is an int array.

I had a look at awkward1.to_parquet, and while it looks good, it seems to be just for saving one array. This doesn't fit well with the need to hold multiple data types, and I'm not sure how I'd record the column names. I suppose I could convert back to awkward v0 and save that way, but I'm not sure how that would play with lazy loading. It might be that I need to write a wrapper to do these things, which would be totally fine, but I wanted to check first if there is something built in that I should know about.

Edit; the answer given works great. For completeness I wanted to leave an example of using it;

In [1]: import awkward1 as ak

In [2]: dog = ak.from_iter([[1., 2.], [5.]])
   ...: cat = ak.from_iter([[4], [3]])

In [3]: ak.zip?

In [4]: pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)

In [5]: pets.dog
Out[5]: <Array [[1, 2], [5]] type='2 * var * float64'>

In [6]: pets.cat
Out[6]: <Array [[4], [3]] type='2 * var * int64'>


In [7]: ak.to_parquet(pets, "pets.parquet")


-- Answer ------------

What Awkward v0 did with awkward0.save is entirely equivalent to pickling (in v0 or v1), so the special name "save" has been removed. (It was inspired by NumPy's "save" and "load," but eventually we just made Awkward's __setstate__ and __getstate__ do the same thing.)

But pickling/old-style saving doesn't lazily load. (Edit: actually, I had forgotten that old-style save does lazily load, but only at top-most granularity—the arrays that are separate in the dict become separate "files" within the ZIP file. Parquet lazily loads subfields of nested records.)

You're right that ak.to_parquet/ak.from_parquet is a good option for lazy loading, and this file format has a better compression-to-read-speed trade-off than our pickling format. It's also a standard that many programs recognize. (If you use it, I recommend passing through use_dictionary=False and use_byte_stream_split=True for floating point data; all the options on this page can be supplied to ak.to_parquet as **options. I need to add some documentation explaining how these are good options for floating point.)

It is also true that ak.to_parquet takes only one array argument. But that's fine: make a single array, not a dict. The fact that Awkward Array manipulates data structures helps you here. You can ak.zip all your arrays together into one array, using the same field names that you would have used as dict keys. If they have different internal structures, you can prevent it from trying to align them at all levels with depth_limit=1, and if they even have different lengths, you can nest them each in a length-1 outer structure with

has_one_more_dimension = original_array[np.newaxis]

The names that ak.to_parquet uses for column names come from the records of the Awkward Array itself. Different fields in a record can have different data types. Therefore, the names you zip them with are the column names of the Parquet file, and each column can have a different type.

Parquet files are lazily loaded by column (including fields of nested records) and by row group. If you want to configure the granularity of reading groups of rows, write the file as a partitioned array (ak.partitioned or ak.repartition).
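
For reading it back lazily, a minimal sketch (assuming Awkward 1.x, in which ak.from_parquet accepts lazy=True):

import awkward as ak

pets = ak.from_parquet("pets.parquet", lazy=True)
double_dog = pets.dog * 2   # only the "dog" column is read from disk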

-- Question -----------------------------------------------

I am trying to use awkward in my Windows 10 system. I am using python 3.8.2. After installing the package, when I import it, I am getting this DLL import error.

>>> import awkward as ak
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\MSS\Work\PyWorkspace37\awkward_poc\venv\lib\site-packages\awkward\__init__.py", line 23, in <module>
    import awkward.layout
  File "C:\Users\MSS\Work\PyWorkspace37\awkward_poc\venv\lib\site-packages\awkward\layout.py", line 5, in <module>
    from awkward._ext import Index8
ImportError: DLL load failed while importing _ext: The specified module could not be found.

How to know which DLL is missing and how to mitigate it?

-- Answer ------------

The DLL is Awkward Array's C++ part, "awkward/_ext.pyd", which the installation should have put into place. I see two possibilities: (1) the installation went horribly wrong and should be restarted from a clean slate, or (2) the file is there but can't be loaded. If the directory it was installed to doesn't have executable permissions, that could be it. (Python code can be executed from a disk/directory without executable permissions, but native machine code can't be, which can cause some surprises.)

When it comes to installation issues, I can only make guesses because your system is different from mine. We try to use standard installation procedures, so if you use the standard Python installation tools (pip, conda) then it ought to work without problems, but evidently that didn't happen here.

-- Question -----------------------------------------------

I want to implement a boosted decision tree for my analysis, but the entries in my array are of varying length, so the array is not directly convertible to numpy or pandas.

Is there any way to use existing ML libraries with awkward array?

-- Answer ------------

Your ML library might assume that the arrays are NumPy arrays and not recognize an ak.Array. That problem, in itself, is easily solved: call ak.to_numpy (or equivalently, cast it with np.asarray) to put it in a form the ML library expects. Incidentally, there's also ak.to_pandas to make a DataFrame in which variable-length nested lists are represented by a MultiIndex (with limitations: there has to be only one nested list, since a DataFrame has only one index).

The above is what I'd call a "branding" issue: the ML library just doesn't recognize the ak.Array "brand" of array, so we relabel it. But there's a more fundamental issue: does the ML algorithm in question intrinsically require rectilinear data? For instance, a feedforward neural network maps N-dimensional inputs to M-dimensional outputs; N and M can't be different for each input. This is a problem even if you're not using Awkward Array. In HEP, the old solution was to run variable-length data through a recurrent neural network (thus ignoring the boundaries between lists and imposing an irrelevant order on them) and the new solution seems to be graph neural networks (which is a more theoretically correct thing to do).

I've noticed that some ML libraries are introducing their own "jagged arrays," which are the minimum structure that Awkward Array provides: TensorFlow has RaggedTensors and PyTorch is getting NestedTensors. I don't know to what degree these data types have been integrated into the ML algorithms, though. If they have been, then Awkward Array ought to get an ak.to_tensorflow and ak.to_pytorch to complement ak.to_numpy and ak.to_pandas, as a way to preserve jaggedness when sending data to these libraries. Hopefully, they'll be able to use that jaggedness in their ML algorithms! (Otherwise, what's the point? But I haven't been following these developments closely.)

You're interested in boosted decision trees (BDTs). I can't think of how a decision tree model, boosted or not, could be adapted to different length inputs... Or maybe I can: the nodes of a decision tree choose which subtree to pass the data down to based on the value of one index in the N-dimensional input. That doesn't imply there's a maximum index value N, though a particular tree would have a set of indexes that it splits on, and there would be some maximum of that set (because the tree is finite!). Applying a tree that wants to split on index k to an input with n < k elements would have to have a contingency for how to split anyway, but there are already methods for applying decision trees to datasets with missing values. An input datum with n elements could be treated as an input for which indexes greater than n are considered missing values. To train such a BDT, you'd have to give it inputs with missing values beyond each list's maximum element.

In Awkward Array, the function for that is ak.pad_none. If you know the maximum-length list in your sample (ak.num and ak.max), you can pad the whole array such that all lists have the same length with missing values at the end. If you set clip=True, then the resulting array type is "regular": it no longer considers the possibility that a list can have a length different from the chosen length. If you pass such an array to ak.to_numpy (and not np.asarray), then it becomes a NumPy masked array, which a BDT algorithm that expects missing values should be able to recognize.
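
A minimal sketch of that padding recipe (the array here is made up for illustration):

import numpy as np
import awkward as ak

jagged = ak.Array([[1.1, 2.2, 3.3], [4.4], [5.5, 6.6]])

max_length = ak.max(ak.num(jagged))
regular = ak.pad_none(jagged, max_length, clip=True)    # type is now 3 * 3 * ?float64

as_masked = ak.to_numpy(regular)                        # NumPy masked array
as_plain = ak.to_numpy(ak.fill_none(regular, np.nan))   # plain array with NaN placeholders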

The only problem with this plan is that padding every list to have the same length as the maximum length list uses more memory. If the BDT algorithm were aware of jaggedness (the way that TensorFlow and soon PyTorch is/will be aware of jaggedness), then it should be able to make these trees and apply them to data without the memory-padding step. I don't know if there are any such BDT implementations out there, but if someone wants to write a "BDT with missing values that accepts jagged arrays," I'd be happy to help them get it set up with Awkward Arrays!

-- Question -----------------------------------------------

I am trying to read a branch of TH1D objects in uproot4. A sample rootfile can be created with:

TFile * f = new TFile("new.root","RECREATE");
TTree * t = new TTree("mytree","mytree");

t->SetMakeClass(1); //See note later

TH1D * histo;
t->Branch("myhisto","TH1D",&histo);

for(int i=0;i<100;i++){
  t->GetEntry(i);
  histo = new TH1D(Form("histo_%d",i),Form("histo_%d",i),100,0,100);
  histo->Fill(i);
  t->Fill();
}

t->Print();
t->Write();

In uproot:

Python 3.8.6 (default, Jan 27 2021, 15:42:20) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import uproot

In [2]: uproot.__version__
Out[2]: '4.0.5'

In [3]: uproot.open("new.root:mytree/myhisto")
Out[3]: <TBranchElement 'myhisto' at 0x7f91da583e50>

In [4]: uproot.open("new.root:mytree/myhisto").interpretation
Out[4]: AsObjects(Model_TH1D)

However, when I try to read the array, it fails with a long Traceback. The last calls are:

~/.local/lib/python3.8/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
    798         )
    799 
--> 800         self.read_members(chunk, cursor, context, file)
    801 
    802         self.hook_after_read_members(

~/.local/lib/python3.8/site-packages/uproot/models/TArray.py in read_members(self, chunk, cursor, context, file)
     41             )
     42         self._members["fN"] = cursor.field(chunk, _tarray_format1, context)
---> 43         self._data = cursor.array(chunk, self._members["fN"], self.dtype, context)
     44 
     45     def __array__(self, *args, **kwargs):

~/.local/lib/python3.8/site-packages/uproot/source/cursor.py in array(self, chunk, length, dtype, context, move)
    308         if move:
    309             self._index = stop
--> 310         return numpy.frombuffer(chunk.get(start, stop, self, context), dtype=dtype)
    311 
    312     _u1 = numpy.dtype("u1")

~/.local/lib/python3.8/site-packages/uproot/source/chunk.py in get(self, start, stop, cursor, context)
    366 
    367         else:
--> 368             raise uproot.deserialization.DeserializationError(
    369                 """attempting to get bytes {0}:{1}
    370 outside expected range {2}:{3} for this Chunk""".format(

DeserializationError: while reading

    TH1D version 8 as uproot.dynamic.Model_TH1D_v3 (514 bytes)
        TH1 version 1 as uproot.dynamic.Model_TH1_v8 (18 bytes)
            (base): <TNamed '' at 0x7f91da38d430>
            (base): <TAttLine (version 2) at 0x7f91da38d700>
            (base): <TAttFill (version 2) at 0x7f91da38da30>
            (base): <TAttMarker (version 2) at 0x7f91da38dd90>
            fNcells: 0
            TAxis version 2 as uproot.dynamic.Model_TAxis_v10 (12 bytes)
                (base): <TNamed '' title='\x00\x00' at 0x7f91da398910>
                (base): <TAttAxis (version 4) at 0x7f91da398bb0>
                fNbins: 81920
                fXmin: 8.34406940932277e-309
                fXmax: 2.0000190735445362
                TArrayD version None as uproot.models.TArray.Model_TArrayD (? bytes)
                    fN: 81792
                    TH1D version 8 as uproot.dynamic.Model_TH1D_v3 (514 bytes)
                        TH1 version 1 as uproot.dynamic.Model_TH1_v8 (18 bytes)
                            (base): <TNamed '' at 0x7f91da495850>
                            (base): <TAttLine (version 2) at 0x7f91da398970>
                            (base): <TAttFill (version 2) at 0x7f91da48cdc0>
                            (base): <TAttMarker (version 2) at 0x7f91da3773d0>
                            fNcells: 0
                            TAxis version 2 as uproot.dynamic.Model_TAxis_v10 (12 bytes)
                                (base): <TNamed '' title='\x00\x00' at 0x7f91da3779d0>
                                (base): <TAttAxis (version 4) at 0x7f91da377d30>
                                fNbins: 81920
                                fXmin: 8.34406940932277e-309
                                fXmax: 2.0000190735445362
                                TArrayD version None as uproot.models.TArray.Model_TArrayD (? bytes)
                                    fN: 81792

attempting to get bytes 58:654394
outside expected range 0:542 for this Chunk
in file new.root
in object /mytree;1

If I set SetMakeClass(0); at the creation of the file, the read fails instead with:

~/.local/lib/python3.8/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
    798         )
    799 
--> 800         self.read_members(chunk, cursor, context, file)
    801 
    802         self.hook_after_read_members(

<dynamic> in read_members(self, chunk, cursor, context, file)

NotImplementedError: memberwise serialization of Model_TAxis_v10
in file new.root

Tested with ROOT 6.22/06 and also 5.34/21, uproot 4.0.5 and 4.0.6, using both python 2.7.18 and 3.8.6 interpreters. Am I doing something wrong?

-- Answer ------------

You're not doing something wrong: it's a NotImplementedError because memberwise serialization has not been implemented in Uproot. That's Issue #38, which has been getting a lot of attention recently.

Other people finding this question, years later: check to see if Issue #38 has been resolved.

-- Question -----------------------------------------------

@numba.njit
def make_vid_plot(Photon):
    hoe_arr=[]
    sieie_arr=[]
    Isochg_arr = []

    for eventIdx,pho in enumerate(Photon):
        for phoIdx,_ in enumerate(pho):
            vid = Photon[eventIdx][phoIdx].vidNestedWPBitmap
            vid_cuts1 = PhotonVID(vid,1)
            vid_cuts2 = PhotonVID(vid,2)
            vid_cuts3 = PhotonVID(vid,3)

            if (vid_cuts2 & 0b1110111 == 0b1110111): # without sieie
                
                print(Photon[eventIdx][phoIdx].isScEtaEE) # isScEtaEE is boolean 
                hoe_arr.append(Photon[eventIdx][phoIdx].hoe)
                sieie_arr.append(Photon[eventIdx][phoIdx].sieie)
                Isochg_arr.append(Photon[eventIdx][phoIdx].pfRelIso03_chg)
        
    return hoe_arr,sieie_arr,Isochg_arr

Error message

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Internal error at resolving type of attribute "isScEtaEE" of "$112binary_subscr.7".
module 'numba' has no attribute 'bool'
During: typing of get attribute at <ipython-input-26-0b0635b19185> (19)
Enable logging at debug level for details.
File "<ipython-input-26-0b0635b19185>", line 19:
def make_vid_plot(Photon):
    <source elided>               
print(Photon[eventIdx][phoIdx].isScEtaEE) # isEcEtaEE is boolean 
                ^

I'm using Numba with Coffea, but Numba with booleans does not seem to be working. How do I solve this error?

numba version: 0.51.2

Thank you

-- Answer ------------

In fact, this was a bug in Awkward Array that was fixed in PR #776, which will probably be merged into the main branch by the time you read this.

I recognize that it can be difficult to distinguish user errors from internal bugs from the error messages—the general rule in Awkward Array is that RuntimeError is an internal bug, ValueError is a user error—but any error when Numba is trying to determine a type is presented as TypeError, regardless of what it is. (In this case, it was an AttributeError! It was trying to access numba.bool when the correct spelling is numba.boolean.)

However, if you think that something might be a bug, it would be easier if you report it as a GitHub Issue. It is also helpful to print out the whole error message, like this:

E           numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
E           No implementation of function Function(<built-in function getitem>) found for signature:
E            
E            >>> getitem(ak.ArrayView(ak.VirtualArrayType({"class":"ListOffsetArray64","offsets":"i64","content":{"class":"RecordArray","contents":{"x":{"class":"NumpyArray","inner_shape":[],"itemsize":1,"format":"?","primitive":"bool","has_identities":false,"parameters":{},"form_key":null}},"has_identities":false,"parameters":{},"form_key":null},"has_identities":false,"parameters":{},"form_key":null}, none, {}), None, ()), Literal[int](2))
E            
E           There are 28 candidate implementations:
E             - Of which 26 did not match due to:
E             Overload of function 'getitem': File: <numerous>: Line N/A.
E               With argument(s): '(ak.ArrayView(ak.VirtualArrayType({"class":"ListOffsetArray64","offsets":"i64","content":{"class":"RecordArray","contents":{"x":{"class":"NumpyArray","inner_shape":[],"itemsize":1,"format":"?","primitive":"bool","has_identities":false,"parameters":{},"form_key":null}},"has_identities":false,"parameters":{},"form_key":null},"has_identities":false,"parameters":{},"form_key":null}, none, {}), None, ()), int64)':
E              No match.
E             - Of which 2 did not match due to:
E             Overload in function 'type_getitem.generic': File: ../../../../irishep/awkward-1.0/awkward/_connect/_numba/arrayview.py: Line 575.
E               With argument(s): '(ak.ArrayView(ak.VirtualArrayType({"class":"ListOffsetArray64","offsets":"i64","content":{"class":"RecordArray","contents":{"x":{"class":"NumpyArray","inner_shape":[],"itemsize":1,"format":"?","primitive":"bool","has_identities":false,"parameters":{},"form_key":null}},"has_identities":false,"parameters":{},"form_key":null},"has_identities":false,"parameters":{},"form_key":null}, none, {}), None, ()), int64)':
E              Rejected as the implementation raised a specific error:
E                AttributeError: module 'numba' has no attribute 'bool'
E             raised from /home/jpivarski/irishep/awkward-1.0/awkward/_connect/_numba/layout.py:565
E           
E           During: typing of intrinsic-call at /home/jpivarski/irishep/awkward-1.0/tests/test_0776-numba-booleans-in-array.py (16)
E           During: typing of static-get-item at /home/jpivarski/irishep/awkward-1.0/tests/test_0776-numba-booleans-in-array.py (16)
E           
E           File "tests/test_0776-numba-booleans-in-array.py", line 16:
E               def f1(array):
E                   return array[2][0].x
E                   ^

../../miniconda3/lib/python3.8/site-packages/numba/core/dispatcher.py:357: TypingError

The part of the error message that you quoted, luckily enough, included the part "module 'numba' has no attribute 'bool'", which gave me a place to start. But it was crucial to understanding this bug that Photon is a virtual array—the offending numba.bool happens in virtual array-handling. (Coffea NanoEvents are virtual—they read data from ROOT files on demand, even inside a Numba JIT-compiled function.) The full error message includes the fact that this array has VirtualArrayType.

The best bug reports include some data to reproduce the issue (i.e. how did you get the Photon object?), but the GitHub Issue template prompts you for this.

Thanks for reporting this, though! Any report helps!

-- Question -----------------------------------------------

I am trying to use TLorentzVector in uproot4, but I found that the methods in the "uproot_methods" module do not work with Awkward high-level arrays.

Traceback (most recent call last):
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward/array/base.py", line 389, in _util_toarray
    return cls.numpy.frombuffer(value, dtype=getattr(value, "dtype", defaultdtype)).reshape(getattr(value, "shape", -1))
TypeError: a bytes-like object is required, not 'Array'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "anal.py", line 19, in <module>
    Electron_T2vec   = TVector2Array.from_polar(Electron_pt,Electron_phi)
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward/util.py", line 112, in func_wrapper
    wrap, arrays = unwrap_jagged(cls, awkcls, _normalize_arrays(cls, arrays))
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward/util.py", line 84, in _normalize_arrays
    arrays[i] = cls.awkward.util.toarray(arrays[i], cls.awkward.numpy.float64)
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward/util.py", line 32, in toarray
    return awkward.array.base.AwkwardArray._util_toarray(value, defaultdtype, passthrough=passthrough)
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward/array/base.py", line 394, in _util_toarray
    return cls.numpy.array(value, copy=False)
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward1/highlevel.py", line 1310, in __array__
    return awkward1._connect._numpy.convert_to_array(self._layout, args, kwargs)
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward1/_connect/_numpy.py", line 16, in convert_to_array
    out = awkward1.operations.convert.to_numpy(layout, allow_missing=False)
  File "/home/jwkim/anaconda3/lib/python3.8/site-packages/awkward1/operations/convert.py", line 313, in to_numpy
    return to_numpy(array.toRegularArray(), allow_missing=allow_missing)
ValueError: in ListOffsetArray64, cannot convert to RegularArray because subarray lengths are not regular  
-------------------------------------------------->  ##

It seems that "uproot_method" only support the awkward.array.jagged.JaggedArray. Is there any other way to use the TLorentz vector in uproot4 (awkward high level array)?

I'm trying to convert this uproot3 and awkward0 based code to uproot4 and awkward1 based code. https://github.com/JW-corp/J.W_Analysis/blob/main/Uproot/anal.py

Thank you!

-- Answer ------------

The uproot-methods package only works with Uproot 3.x and therefore Awkward Array 0.x. In the Uproot 4/Awkward 1 world, the methods for ROOT objects that uproot-methods once provided are now handled by Uproot itself (the "community contributed project" didn't work out, so this functionality is moving back into Uproot), except for Lorentz vectors.

The eventual home for Lorentz vector handling is the vector project, which I'll be contributing to in March. Here it is, almost March, and so that will be happening soon. (I have one last task before working on that.)

In the meantime, the excellent Coffea project has a module for Lorentz-vector handling: coffea.nanoevents.methods.vector. To get started with Lorentz vectors now, I recommend that.
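
A hypothetical sketch of what that looks like (the record name "PtEtaPhiELorentzVector" is my assumption about which Coffea behavior to use, and pt, eta, phi, energy stand for whatever jagged arrays you read from the file; check the coffea.nanoevents.methods.vector documentation):

import awkward as ak
from coffea.nanoevents.methods import vector

ak.behavior.update(vector.behavior)

p4 = ak.zip(
    {"pt": pt, "eta": eta, "phi": phi, "energy": energy},
    with_name="PtEtaPhiELorentzVector",
)
# p4 now has Lorentz-vector properties and operators, e.g. p4.mass, p4.px, p4 + p4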

(The standalone vector package would become Uproot's default vector handler, as in, if you read a TLorentzVector from a ROOT file, it will have vector methods automatically—when vector is finished, that is.)

-- Question -----------------------------------------------

Does the awkward library provide a way to slice out all attributes of a given name, regardless of the level? I was thinking something like this:

import awkward as ak

obj = {
    'resource_id': 'abc',
    'events': [
        {'resource_id': '123', 'value': 12, 'picks':
            [{'resource_id': 'asd', 'value': 1},
             {'resource_id': 'dll', 'value': 12}
            ]
         },
         {'resource_id': '456', 'value': 12, 'picks':
            [{'resource_id': 'cvf', 'value': 23},
             {'resource_id': 'ggf', 'value': 34},
             ]
         },
    ]
}


ar = ak.from_iter(obj)

rid = ar[..., 'resource_id']

The value of rid is simply the string 'abc' but I was expecting something more like the following:

[
   ['abc'],
   ['events':[
       [['123'], 'picks':[['asd'], ['dll']]], 
       [['456'], 'picks':[['cvf'], ['ggf']]],
   ]
]       

However, I am still trying to get my head around awkward arrays so I could be completely off here.

-- Answer ------------

It doesn't, and I'm not sure how the output of such an operation should be shaped. For instance, if you pick the outer "resource_id", you get

>>> ar["events", "resource_id"]
<Array ['123', '456'] type='2 * string'>

but if you pick the inner "resource_id", you get

>>> ar["events", "picks", "resource_id"]
<Array [['asd', 'dll'], ['cvf', 'ggf']] type='2 * var * string'>

Note that the ... does have a meaning, but it slices through rows (nested lists), not columns (record field names).

>>> ar["events", "picks", "value"]
<Array [[1, 12], [23, 34]] type='2 * var * int64'>
>>> ar["events", "picks", "value", ..., 0]
<Array [1, 23] type='2 * int64'>

Also, it might help to know that you can project with strings and lists of strings (nested projection):

>>> print(ar["events", "picks", ["resource_id", "value"]])
[[{resource_id: 'asd', value: 1}, ... {resource_id: 'ggf', value: 34}]]

in case that helps with your slicing problem (which will likely be manually picking out "resource_id" at all levels and putting them together in a way that makes sense for your data, but maybe can't be generalized).

-- Question -----------------------------------------------

I have a *.root file I'm trying to read into Python with uproot (uproot4). It looks like this:

>>> data = up.open('file.root')
>>> data.keys()
['ring_sums;1', 'tpc_multiplicity;1', 'impact_parameter;1']
>>> data['ring_sums']
<TMatrixT<double> (version 4) at 0x0205b219d748>

How can I read a TMatrixT type? Using .keys(), .values(), and .items() doesn't work.

>>> data['ring_sums'].values()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Model_TMatrixT_3c_double_3e__v4' object has no attribute 'values'

What's the proper way to read out the data into an awkward array, pandas dataframe, or numpy array?

-- Answer ------------

TMatrixT is a class that Uproot doesn't have any specialized code for (yet?), but it can be read anyway because its "streamers" (instructions for deserialization) are included in the ROOT file. It's therefore presented in a rather generic way.

You can find all of an object's member data in all_members (dict) or extract only one using the member("memberName") method. So, for instance, try

data['ring_sums'].member("fElements")

because TMatrixT::fElements seems to be relevant. These will probably come out as an unshaped NumPy array (or uproot.models.TArray, which is a subclass of NumPy arrays). You may need to reshape it into the right form, and the proper shape might be found in some other member (maybe one belonging to a superclass).
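
For example, a hypothetical sketch (it assumes the row and column counts are stored in members named fNrows and fNcols; check all_members to see what your object actually has):

import numpy as np

obj = data["ring_sums"]
elements = np.asarray(obj.member("fElements"))
matrix = elements.reshape(obj.member("fNrows"), obj.member("fNcols"))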

If you find a good way of using TMatrixT objects and would like it to become automated, perhaps in a values method, add it as a pull request or describe it as a feature request for Uproot (a new "behavior").

-- Question -----------------------------------------------

I have a large nested dictionary with jagged list 'c' :

x = {'first_block': 
     {'unit1': {'a': (3,5,4), 'b': 23, 'c': [10]}, 
      'unit2': {'a': (5,8,7), 'b': 15, 'c': [20,10]}, 
      'unit10k': {'a': (2,4,9), 'b': 10, 'c': [6,10,20,5]}},
     
      'second_block': 
       {'unit1' : {'a': (8,20,14), 'b': 10, 'c': [17,12,9]}, 
        'unit2' : {'a': (9,25,50), 'b': 15, 'c': [17,15,9,4,12]}, 
        'unit12k': {'a': (12,24,9), 'b': 23, 'c': [12,22,15,4]}},
     
      'millionth_block': 
      {'unit1': {'a': (35,64,85), 'b': 64, 'c': [50]}, 
       'unit2': {'a': (56,23,34), 'b': 55, 'c': [89,59,77]},
       'unit5k': {'a': (90,28,12), 'b': 85, 'c': [48,90,27,59]}}}  

The elements of 'c' are point labels.

For every unique point label in 'c' I want to produce a filtered list of the corresponding value in 'b',

so for example 'first_block' has unique elements in 'c' of: 5, 6, 10, 20

and i want to obtain/extract the following lists for each 'block', to list each value of 'b' associated with a specific value in 'c' e.g.

first_block:
5: [10]
6: [10]
10: [10,15,23]
20: [10,15]
second_block:
4: [15,23]
9: [10,15]
12: [10,15,23]
15: [15,23]
17: [10,15]
22: [23]
etc.

Any thoughts on how to create this outcome given that 'c' is jagged?

I have been trying to do this by converting to Awkward Arrays, but the documentation is currently sparse, and I really don't understand how to do this in Awkward.

Also open to pythonic suggestions which don't involve Awkward

-- Answer ------------

I'll attempt to give an Awkward Array solution to this.

First of all, we need to know what in your dataset is really scaling up. I understand that you could have millions of blocks, but that means that you shouldn't represent them with a dict that has unique string-valued keys. The cost of creating strings, comparing strings, and looking things up by strings (though that's helped quite a bit by the dict's hash table) is not a good way to scale, especially if the only information in those strings is an ordering.

It also seems to me that you have a variable number of units (by the fact that the last ones are named "unit10k", "unit12k", and "unit5k").

Finally, it took me several re-reads of your problem statement, but it seems that the field named "a" doesn't enter into the problem. I'll ignore it.

Therefore, I'd say your data structure would be better as:

as_python = [
    # first block
    [
        # unit 1
        {"b": 23, "c": [10]},
        # unit 2
        {"b": 15, "c": [20, 10]},
        # ...
        # unit 10k
        {"b": 10, "c": [6, 10, 20, 5]},
    ],
    # second block
    [
        # unit 1
        {"b": 10, "c": [17, 12, 9]},
        # unit 2
        {"b": 15, "c": [17, 15, 9, 4, 12]},
        # ...
        # unit 12k
        {"b": 23, "c": [12, 22, 15, 4]},
    ],
    # ...
    # millionth block
    [
        # unit 1
        {"b": 64, "c": [50]},
        # unit 2
        {"b": 55, "c": [89, 59, 77]},
        # ...
        # unit 15k
        {"b": 85, "c": [48, 90, 27, 59]},
    ],
]

A Pythonic solution to this would be

results = []
for block in as_python:
    block_result = {}
    for unit in block:
        b = unit["b"]    # only look up b once
        for c in unit["c"]:
            if c not in block_result:
                block_result[c] = []
            block_result[c].append(b)
    results.append(block_result)

which results in

[
    {10: [23, 15, 10], 20: [15, 10], 6: [10], 5: [10]},
    {17: [10, 15], 12: [10, 15, 23], 9: [10, 15], 15: [15, 23], 4: [15, 23], 22: [23]},
    {50: [64], 89: [55], 59: [55, 85], 77: [55], 48: [85], 90: [85], 27: [85]},
]

which should be pretty fast. (Short loops using only builtin Python types are surprisingly fast for an uncompiled, dynamically typed virtual machine. In my experience, using only builtin Python types is key.)

As for Awkward Array, I can only get halfway there using vectorized (i.e. "NumPy-like") operations before invoking Numba (JIT-compiled for loops), which reveals that the library needs a "groupby" function.

Of course, you can do the whole thing in Numba, which tends to be the fastest solution overall, but this problem depends crucially on grouping and uniqueness, which in Python is most naturally done with dicts or sets, but it's hard to get Numba to deduce the type of anything beyond pure numbers and arrays. (In order to JIT compile the code and make it fast, Numba must know their types, but Python type annotations are recent enough that they're still being incorporated into Numba.)

So let's start by making the above data structure an Awkward Array. Depending on your data source, there are different ways to do this (this part of the documentation is complete), but I'll just let ak.Array iterate over the above Python data structure.

>>> import awkward as ak
>>> as_awkward = ak.Array(as_python)

Ordinarily, if we have an "all combinations of X and Y" problem, we want to use ak.cartesian. We can't use it right away here because the "b" data and "c" data have different depths:

>>> as_awkward = ak.Array(as_python)
>>> as_awkward.b
<Array [[23, 15, 10], ... 23], [64, 55, 85]] type='3 * var * int64'>
>>> as_awkward.c
<Array [[[10], [20, 10], ... [48, 90, 27, 59]]] type='3 * var * var * int64'>
>>> as_awkward.type
3 * var * {"b": int64, "c": var * int64}
>>> as_awkward.b.type
3 * var * int64
>>> as_awkward.c.type
3 * var * var * int64

To make them match, we can create a new axis using np.newaxis (a.k.a. None, but I like to be explicit):

>>> import numpy as np
>>> bprime = as_awkward.b[:, :, np.newaxis]
>>> bprime
<Array [[[23], [15], [10, ... 64], [55], [85]]] type='3 * var * 1 * int64'>
>>> bprime.type
3 * var * 1 * int64
>>> as_awkward.c.type
3 * var * var * int64

Now we can take the Cartesian products within each unit to make lists of b-c pairs (axis=2).

>>> combinations = ak.cartesian({"c": as_awkward.c, "b": bprime}, axis=2)
>>> combinations
<Array [[[{c: 10, b: 23}], ... c: 59, b: 85}]]] type='3 * var * var * {"c": int6...'>
>>> combinations.type
3 * var * var * {"c": int64, "b": int64}
>>> combinations.tolist()
[
    [
        [{"c": 10, "b": 23}],
        [{"c": 20, "b": 15}, {"c": 10, "b": 15}],
        [
            {"c": 6, "b": 10},
            {"c": 10, "b": 10},
            {"c": 20, "b": 10},
            {"c": 5, "b": 10},
        ],
    ],
    [
        [
            {"c": 17, "b": 10},
            {"c": 12, "b": 10},
            {"c": 9, "b": 10},
        ],
        [
            {"c": 17, "b": 15},
            {"c": 15, "b": 15},
            {"c": 9, "b": 15},
            {"c": 4, "b": 15},
            {"c": 12, "b": 15},
        ],
        [
            {"c": 12, "b": 23},
            {"c": 22, "b": 23},
            {"c": 15, "b": 23},
            {"c": 4, "b": 23},
        ],
    ],
    [
        [{"c": 50, "b": 64}],
        [
            {"c": 89, "b": 55},
            {"c": 59, "b": 55},
            {"c": 77, "b": 55},
        ],
        [
            {"c": 48, "b": 85},
            {"c": 90, "b": 85},
            {"c": 27, "b": 85},
            {"c": 59, "b": 85},
        ],
    ],
]

Now we have too much structure: as long as the "b" and "c" values are properly connected, you want to mix all units in a block because you're interested in uniqueness of "c" within each block.

>>> flattened = ak.flatten(combinations, axis=-1)
>>> flattened
<Array [[{c: 10, b: 23}, ... c: 59, b: 85}]] type='3 * var * {"c": int64, "b": i...'>
>>> flattened.type
3 * var * {"c": int64, "b": int64}
>>> flattened.tolist()
[
    [
        {"c": 10, "b": 23},
        {"c": 20, "b": 15},
        {"c": 10, "b": 15},
        {"c": 6, "b": 10},
        {"c": 10, "b": 10},
        {"c": 20, "b": 10},
        {"c": 5, "b": 10},
    ],
    [
        {"c": 17, "b": 10},
        {"c": 12, "b": 10},
        {"c": 9, "b": 10},
        {"c": 17, "b": 15},
        {"c": 15, "b": 15},
        {"c": 9, "b": 15},
        {"c": 4, "b": 15},
        {"c": 12, "b": 15},
        {"c": 12, "b": 23},
        {"c": 22, "b": 23},
        {"c": 15, "b": 23},
        {"c": 4, "b": 23},
    ],
    [
        {"c": 50, "b": 64},
        {"c": 89, "b": 55},
        {"c": 59, "b": 55},
        {"c": 77, "b": 55},
        {"c": 48, "b": 85},
        {"c": 90, "b": 85},
        {"c": 27, "b": 85},
        {"c": 59, "b": 85},
    ],
]

Sorting and uniqueness are at the bleeding edge of Awkward Array's capabilities; there are some functions that have been written internally that just haven't been exposed in Python, let alone documented. Fortunately, ak.sort and ak.argsort are documented. We ultimately want to group these b-c pairs by c; lacking a groupby, we can at least sort them:

>>> sorted = flattened[ak.argsort(flattened.c, axis=-1)]
>>> sorted
<Array [[{c: 5, b: 10}, ... {c: 90, b: 85}]] type='3 * var * {"c": int64, "b": i...'>
>>> sorted.type
3 * var * {"c": int64, "b": int64}
>>> sorted.tolist()
[
    [
        {"c": 5, "b": 10},
        {"c": 6, "b": 10},
        {"c": 10, "b": 23},
        {"c": 10, "b": 15},
        {"c": 10, "b": 10},
        {"c": 20, "b": 15},
        {"c": 20, "b": 10},
    ],
    [
        {"c": 4, "b": 15},
        {"c": 4, "b": 23},
        {"c": 9, "b": 10},
        {"c": 9, "b": 15},
        {"c": 12, "b": 10},
        {"c": 12, "b": 15},
        {"c": 12, "b": 23},
        {"c": 15, "b": 15},
        {"c": 15, "b": 23},
        {"c": 17, "b": 10},
        {"c": 17, "b": 15},
        {"c": 22, "b": 23},
    ],
    [
        {"c": 27, "b": 85},
        {"c": 48, "b": 85},
        {"c": 50, "b": 64},
        {"c": 59, "b": 55},
        {"c": 59, "b": 85},
        {"c": 77, "b": 55},
        {"c": 89, "b": 55},
        {"c": 90, "b": 85},
    ],
]

And now we really, really wish we had a "groupby" function, but there just isn't one. It would be a natural addition and perhaps should be a feature request.

So at this point, we switch to Numba. The problem is much simpler than the original problem because the data to group are already sorted: we just need to see when a value changes to insert boundaries. Awkward Arrays can be passed as arguments into Numba, but they're immutable. To make new arrays, use an ak.ArrayBuilder. Note: when developing a function like this, do it in small steps and remove @nb.njit to try it without JIT-compilation (to be sure you're making the right thing before trying to solve type errors).

import numba as nb

@nb.njit
def groupby(input, output):
    for block in input:
        output.begin_list()
        output.begin_list()
        last = block[0].c  # note: assumes that len(block) >= 1
        for unit in block:
            if unit.c != last:
                output.end_list()
                output.begin_list()
            output.append(unit)
            last = unit.c
        output.end_list()
        output.end_list()
    return output

The input is the array we just made (sorted), and the output is an ArrayBuilder that turns into a real array with snapshot (outside of Numba).

>>> grouped = groupby(sorted, ak.ArrayBuilder()).snapshot()
>>> grouped
<Array [[[{c: 5, b: 10}], ... {c: 90, b: 85}]]] type='3 * var * var * {"c": int6...'>
>>> grouped.type
3 * var * var * {"c": int64, "b": int64}
>>> grouped.tolist()
[
    [
        [{"c": 5, "b": 10}],
        [{"c": 6, "b": 10}],
        [{"c": 10, "b": 23}, {"c": 10, "b": 15}, {"c": 10, "b": 10}],
        [{"c": 20, "b": 15}, {"c": 20, "b": 10}],
    ],
    [
        [{"c": 4, "b": 15}, {"c": 4, "b": 23}],
        [{"c": 9, "b": 10}, {"c": 9, "b": 15}],
        [{"c": 12, "b": 10}, {"c": 12, "b": 15}, {"c": 12, "b": 23}],
        [{"c": 15, "b": 15}, {"c": 15, "b": 23}],
        [{"c": 17, "b": 10}, {"c": 17, "b": 15}],
        [{"c": 22, "b": 23}],
    ],
    [
        [{"c": 27, "b": 85}],
        [{"c": 48, "b": 85}],
        [{"c": 50, "b": 64}],
        [{"c": 59, "b": 55}, {"c": 59, "b": 85}],
        [{"c": 77, "b": 55}],
        [{"c": 89, "b": 55}],
        [{"c": 90, "b": 85}],
    ],
]

Then you can fiddle with that output to get a desired structure. If you want to have a scalar "c" for each list "b", you'll want to use a depth_limit to keep ak.zip from broadcasting it down to the deepest level.

>>> ak.zip({"c": grouped.c[:, :, 0], "b": grouped.b}, depth_limit=2).tolist()
[
    [
        {"c": 5, "b": [10]},
        {"c": 6, "b": [10]},
        {"c": 10, "b": [23, 15, 10]},
        {"c": 20, "b": [15, 10]},
    ],
    [
        {"c": 4, "b": [15, 23]},
        {"c": 9, "b": [10, 15]},
        {"c": 12, "b": [10, 15, 23]},
        {"c": 15, "b": [15, 23]},
        {"c": 17, "b": [10, 15]},
        {"c": 22, "b": [23]},
    ],
    [
        {"c": 27, "b": [85]},
        {"c": 48, "b": [85]},
        {"c": 50, "b": [64]},
        {"c": 59, "b": [55, 85]},
        {"c": 77, "b": [55]},
        {"c": 89, "b": [55]},
        {"c": 90, "b": [85]},
    ],
]

If you want to build a report of that type in Numba directly, you can do it. (This is just to illustrate what the "begin_list" and such are doing—it's like "printing out" a structure, imagining output.begin_list() to print a "[" and output.begin_record() to print a "{", etc.)

@nb.njit
def groupby(input, output):
    for block in input:
        output.begin_list()
        output.begin_record()
        output.field("c").integer(block[0].c)
        output.field("b")
        output.begin_list()
        last = block[0].c
        for unit in block:
            if unit.c != last:
                output.end_list()
                output.end_record()
                output.begin_record()
                output.field("c").integer(unit.c)
                output.field("b")
                output.begin_list()
            output.integer(unit.b)
            last = unit.c
        output.end_list()
        output.end_record()
        output.end_list()
    return output

and

>>> grouped = groupby(sorted, ak.ArrayBuilder()).snapshot()
>>> grouped
<Array [[{c: 5, b: [10]}, ... c: 90, b: [85]}]] type='3 * var * {"c": int64, "b"...'>
>>> grouped.type
3 * var * {"c": int64, "b": var * int64}
>>> grouped.tolist()
# it's the same

As I said above, the absolutely fastest solution is likely to be doing everything in Numba. The Cartesian products, flattening, and sorting all create partially new arrays (reusing as much from the inputs as possible, which is why the Awkward Arrays have to be immutable), which involves allocations and multiple passes over the data. But it's hard to express problems involving dicts, lists, and sets in Numba because it needs typed dicts, lists, and sets. I tried using Numba's typed dict, but it complained about the values of the dict being lists, which are not hashable. (Dict values do not need to be hashable, so I wonder what's going on there.) Just as uniqueness and sorting are the bleeding edge of Awkward Array, typing dicts is the bleeding edge of Numba since the concept of typing in Python is itself rather new.

I didn't performance-test any of these proposed solutions.

-- Question -----------------------------------------------

I have a dictionary with integer keys and float values. I also have a 2D awkward array with integer entries (I'm using awkward1). I want to replace these integers with the corresponding float according to the dictionary, keeping the awkward array format.

Assuming the keys run from 0 to 999, my solution so far is something like this:

resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,1000):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)

Is there a faster way to do this?

Update

Minimal reproducible example of my working code:

import awkward as ak # Awkward 1

myArray = ak.from_iter([[0, 1], [2, 1, 0]]) # Creating example array
myDict = {0: 19.5, 1: 34.1, 2: 10.9}

resultArray = ak.where(myArray == 0, myDict.get(0), 0)
for key in range(1,3):
    resultArray = resultArray + ak.where(myArray == key, myDict.get(key), 0)

myArray:

<Array [[0, 1], [2, 1, 0]] type='2 * var * int64'>

resultArray:

<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

-- Answer ------------

When I mentioned in a comment that np.searchsorted is where you should be looking, I hadn't noticed that myDict includes every consecutive integer as a key. Having a dense lookup table like this would allow faster algorithms, which also happen to be simpler in Awkward Array.

So, assuming that there's a key in myDict for each integer from 0 up to some value, you can equally well represent the lookup table as

>>> lookup = ak.Array([myDict[i] for i in range(len(myDict))])
>>> lookup
<Array [19.5, 34.1, 10.9] type='3 * float64'>

The problem of picking values at 0, 1, and 2 becomes just an array-slice. (This array-slice is an O(n) algorithm for array length n, unlike np.searchsorted, which would be O(n log n). That's the cost of having sparse lookup keys.)

The problem, however, is that myArray is nested and lookup is not. We can give lookup the same depth as myArray by slicing it up:

>>> multilookup = lookup[np.newaxis][np.zeros(len(myArray), np.int64)]
>>> multilookup
<Array [[19.5, 34.1, 10.9, ... 34.1, 10.9]] type='2 * 3 * float64'>
>>> multilookup.tolist()
[[19.5, 34.1, 10.9], [19.5, 34.1, 10.9]]

And then multilookup[myArray] is exactly what you want:

>>> multilookup[myArray]
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

The lookup had to be duplicated because each list within myArray uses global indexes in the whole lookup. If the memory involved in creating multilookup is prohibitive, you could instead break myArray down to match it:

>>> flattened, num = ak.flatten(myArray), ak.num(myArray)
>>> flattened
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> num
<Array [2, 3] type='2 * int64'>
>>> lookup[flattened]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(lookup[flattened], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

If your keys are not dense from 0 up to some integer, then you'll have to use np.searchsorted (which requires the keys to be in sorted order):

>>> keys = ak.Array(myDict.keys())
>>> values = ak.Array([myDict[key] for key in keys])
>>> keys
<Array [0, 1, 2] type='3 * int64'>
>>> values
<Array [19.5, 34.1, 10.9] type='3 * float64'>

In this case, the keys are trivial because it is dense. When using np.searchsorted, you have to explicitly cast the flat Awkward Arrays as NumPy (for now; we're looking to fix that).

>>> lookup_index = np.searchsorted(np.asarray(keys), np.asarray(flattened), side="left")
>>> lookup_index
array([0, 1, 2, 1, 0])

Then we pass it through the trivial keys (which doesn't change it, in this case) before passing it to the values.

>>> keys[lookup_index]
<Array [0, 1, 2, 1, 0] type='5 * int64'>
>>> values[keys[lookup_index]]
<Array [19.5, 34.1, 10.9, 34.1, 19.5] type='5 * float64'>
>>> ak.unflatten(values[keys[lookup_index]], num)
<Array [[19.5, 34.1], [10.9, 34.1, 19.5]] type='2 * var * float64'>

But the thing I was waffling about in yesterday's comment was that you have to do this on the flattened form of myArray (flattened) and reintroduce the structure later with ak.unflatten, as above. But perhaps we should wrap np.searchsorted as ak.searchsorted to recognize a fully structured Awkward Array in the second argument, at least. (It has to be unstructured to be in the first argument.)

-- Question -----------------------------------------------

When I access a root file and extract the data I want, like in the following example:

events=uproot.open(filename)["btagana/ttree;6"]    
jet_data=events.arrays(filter_name=["Jet_nFirstTrack","Jet_nLastTrack","Jet_pt","Jet_phi","Jet_eta"],library="ak")

The sorting of the keys of this array doesn't resemble the sorting of the list used to filter the keys. If I now use ak.unzip():

jet_data=ak.unzip(jet_data)

is the sorting reliable and reproducible? If I open different root files, would I be able to achieve the same "sorting"?

-- Answer ------------

This is actually a question about Uproot. In this line:

>>> jet_data=events.arrays(filter_name=["Jet_nFirstTrack","Jet_nLastTrack","Jet_pt","Jet_phi","Jet_eta"],library="ak")

the filter_name is just a filter, accepting or rejecting branches from the ROOT file. Those branches have a natural order in the file, and the output is probably that order (and therefore stable upon repeated attempts, unless a dict is involved at some point and you're using Python <= 3.5).

If you want to enforce an order, pass your list of branch names as expressions, rather than filter_name. That argument has different meaning: expressions can be simple formulas; filter_name can have wildcards—therefore, a character like * has very different meanings in each!
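
For example (a sketch using the branch names from the question), passing the list as the expressions argument keeps the requested order:

jet_data = events.arrays(
    ["Jet_nFirstTrack", "Jet_nLastTrack", "Jet_pt", "Jet_phi", "Jet_eta"],
    library="ak",
)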

Alternatively, you can reorder the fields after reading the array by slicing with a list of strings. There's no performance penalty for doing so—it's just rearranging metadata (time to completion does not scale with the length of the array). This documentation has some examples (including more complex cases where you're selecting fields within fields, but the simple case is enough for your issue).
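
As a sketch of that approach (field names from the question), slicing the record array with a list of strings selects and reorders the fields:

order = ["Jet_nFirstTrack", "Jet_nLastTrack", "Jet_pt", "Jet_phi", "Jet_eta"]
jet_data = jet_data[order]   # metadata-only rearrangement; cost doesn't scale with array length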

Edit: I should add that fields of records in Awkward Arrays have a reproducible order. They're not unstable hashmaps like dicts in Python <= 3.5. They're actually two equal length lists: the ordered fields (which is what ak.unzip returns) and ordered fields names (which ak.fields returns). The names are optional—without field names, records become tuples.

-- Question -----------------------------------------------

I have an awkward array of type float and I need it to be of type int. Something equivalent to the following numpy snippet:

import numpy as np
import awkward as ak

arr = np.array([1., 2., 3.])
arr = arr.astype(int)

arr2 = ak.Array(np.array([1., 2., 3.]))
arr2 = arr2.???

-- Answer ------------

If you have an awkward array, arr, where:

>>> arr.type
20 * float64

You can simply use ak.values_astype:

>>> arr = ak.values_astype(arr, "int64")
>>> arr.type
20 * int64
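
Applied to the arrays from the question, a minimal, self-contained sketch would be (the jagged example at the end is only to show that the same call works on nested arrays):

import numpy as np
import awkward as ak

arr2 = ak.Array(np.array([1., 2., 3.]))
arr2 = ak.values_astype(arr2, "int64")   # <Array [1, 2, 3] type='3 * int64'>

jagged = ak.Array([[1.0, 2.0], [], [3.0]])
ak.values_astype(jagged, "int64")        # <Array [[1, 2], [], [3]] type='3 * var * int64'>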

-- Question -----------------------------------------------

I currently have a list of values and an awkward array of integer values. I want the same dimension awkward array, but where the values are the indices of the "values" arrays corresponding with the integer values of the awkward array. For instance:

values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))

I want something like values[arr], but that gives the following error:

>>> values[arr]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
    return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100

If I run it with a loop, I get back what I want:

>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]

Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.

Thanks!

-- Answer ------------

If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):

>>> values = ak.Array(np.random.rand(100))

but the second line iterates over data in Python (has a slow loop):

>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))

because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
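
If the goal is just to avoid that Python iteration, one loop-free way to build the same kind of jagged array is to concatenate the flat NumPy arrays and tell Awkward the list lengths (a sketch; ak.unflatten does the splitting):

flat = np.concatenate([np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)])
arr = ak.unflatten(flat, [33, 125])   # type: 2 * var * int64, no per-element Python loop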

On your main question, the reason that arr doesn't slice values is because arr is a jagged array and values is not:

>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>

Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].

Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.

>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>

Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.

>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>

If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).

-- Question -----------------------------------------------

In the uproot 3 documentation there is information that uproot can write only branches containing 1 value per entry. On the other hand, I can see some topics on the uproot GitHub regarding writing jagged arrays, etc. So, I would like to make sure: can uproot write TBranches containing arrays to a TTree? If so, is it documented anywhere?

Thanks!

-- Answer ------------

This will be better documented when it's ported to Uproot 4, but the best documentation we have on writing jagged arrays in Uproot 3 right now is the pull request and associated issues (all linked to each other):

scikit-hep/uproot3#477

Here is an example from the tests:

import uproot3
import awkward0
import numpy
import ROOT

filename = "example.root"   # any writable output path

a = awkward0.fromiter([[0],
                       [1, 2],
                       [10, 11, 12]])

with uproot3.recreate(filename, compression=None) as f:
    f["t"] = uproot3.newtree({
        "branch": uproot3.newbranch(numpy.dtype(">i4"), size="n")
    })
    f["t"].extend({"branch": a, "n": [1, 2, 3]})

f = ROOT.TFile.Open(filename)
tree = f.Get("t")
for i, event in enumerate(tree):
    assert numpy.all([x for x in event.branch] == a[i])

-- Question -----------------------------------------------

I have a singe level nested array, and I'd like to calculate the running sum at the deepest level:

<JaggedArray [[0.8143442176354316 0.18565578236456845] [1.0] [0.8029232081440607 0.1970767918559393] ... [0.42036116755776154 0.5796388324422386] [0.18512572262194366 0.31914669745950724 0.13598232751162054 0.3597452524069286] [0.34350475143310905 0.19023361856972956 0.4662616299971615]] at 0x7f8969e32af0>

after doing something like numpy.cumsum(jagged_array) I'd like to have:

[[0.8143442176354316 1.0] [1.0] [0.8029232081440607 1.0] ...

In short - the running sum at the deepest level (which is restarted with each new "event").

I'm using awkward0, and the documentation says that broadcasting is run at the deepest level; however, I get an error when I try just handing a JaggedArray directly to numpy.cumsum: operands could not be broadcast together with shapes (2,) (3,)

The dataset is large - I'd like to stay within the awkward system - so avoid python loops in processing these.

-- Answer ------------

There isn't a "high-level" way to do this, a way that is independent of knowledge of the array's layout, but I can walk you through this.

Awkward 0.x (obsolete)

Assuming that you have a simple jagged array,

>>> import awkward0
>>> import numpy as np
>>> array = awkward0.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array.layout
 layout 
[    ()] JaggedArray(starts=layout[0], stops=layout[1], content=layout[2])
[     0]   ndarray(shape=3, dtype=dtype('int64'))
[     1]   ndarray(shape=3, dtype=dtype('int64'))
[     2]   ndarray(shape=5, dtype=dtype('float64'))

You can apply the cumulative sum to the content:

>>> np.cumsum(array.content)
array([ 1.1,  3.3,  6.6, 11. , 16.5])

and wrap that up as a new jagged array:

>>> scan = awkward0.JaggedArray.fromoffsets(array.offsets, np.cumsum(array.content))
>>> scan
<JaggedArray [[1.1 3.3000000000000003 6.6] [] [11.0 16.5]] at 0x7f0621a826a0>

Awkward 1.x

The offsets and content structure that we directly manipulated in Awkward 0.x are now hidden in a "layout" to distinguish between high-level operations (which don't require knowledge of the exact layout) and low-level operations (which do). This problem doesn't have a high-level solution, and the low-level way is like the above, but it involves extra wrapping and unwrapping.

>>> import awkward as ak
>>> import numpy as np
>>> array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> array
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> layout = array.layout
>>> layout
<ListOffsetArray64>
    <offsets><Index64 i="[0 3 3 5]" offset="0" length="4" at="0x55737ef6f880"/></offsets>
    <content><NumpyArray format="d" shape="5" data="1.1 2.2 3.3 4.4 5.5" at="0x55737ef71890"/></content>
</ListOffsetArray64>

As before, you can do a cumulative sum on the content:

>>> np.cumsum(layout.content)
array([ 1.1,  3.3,  6.6, 11. , 16.5])

Here's the structure of how it gets wrapped up:

>>> scan = ak.Array(
...     ak.layout.ListOffsetArray64(
...         layout.offsets,
...         ak.layout.NumpyArray(
...             np.cumsum(layout.content)
...         )
...     )
... )
...
>>> scan
<Array [[1.1, 3.3, 6.6], [], [11, 16.5]] type='3 * var * float64'>

What if you want the scan per-list?

If you want a solution similar to Frank Yellin's, in which each scan starts new in each list, the fact that we did one np.cumsum on the content is a problem. In concrete terms, we have the third list starting with 11, instead of 4.4.

A vectorized way to do that is to subtract the first scan element of each list from the whole list and add the first array element back in. In both Awkward 0.x and 1.x, this can be done with slices like array[:, 0] and broadcasting, but empty lists (if you have them) are going to be a problem. Awkward 1.x has enough alternatives to work around that:

>>> ak.firsts(scan)
<Array [1.1, None, 11] type='3 * ?float64'>

>>> scan - ak.firsts(scan)
<Array [[0, 2.2, 5.5], None, [0, 5.5]] type='3 * option[var * float64]'>

>>> scan - ak.firsts(scan) + ak.firsts(array)
<Array [[1.1, 3.3, 6.6], None, [4.4, 9.9]] type='3 * option[var * float64]'>

>>> ak.fill_none(scan - ak.firsts(scan) + ak.firsts(array), [])
<Array [[1.1, 3.3, 6.6], [], [4.4, 9.9]] type='3 * var * float64'>

Most of these don't have equivalents in Awkward 0.x.

-- Question -----------------------------------------------

I want to extract data out of a root file and get it into shape to end up with a numpy array/tensor to feed into a neural network. I am already able to get the track data I want into shape with padding to convert it into a numpy array, but I want to extend my array with the data of the jet the tracks originated from. So I have the information of all the tracks, of each jet, and the interval of tracks they correspond to. My first instinct was to construct an array in the shape of the tracks and use something like np.dstack to merge the two.

import uproot4 as uproot
import numpy as np
import awkward1 as ak

def ak_into_np(ak_array):
    data=np.dstack([ak.to_numpy(x) for x in ak_array])
    return data    

def get_data(filename,padding_size):
        f=uproot.open(filename)
        events= f["btagana/ttree;1"]
        track_data=events.arrays(filter_name=["Track_pt","Track_phi","Track_eta","Track_dxy","Track_dz","Track_charge"])
        jet_interval=events.arrays(filter_name=["Jet_nFirstTrack","Jet_nLastTrack"])
        jet_interval=jet_interval["Jet_nLastTrack"]-jet_interval["Jet_nFirstTrack"]
        
        jet_data=events.arrays(filter_name=["Jet_pt","Jet_phi","Jet_eta"])
        arrays_track=ak.unzip(ak.fill_none(ak.pad_none(track_data, padding_size), 0))
        arrays_interval=ak.unzip(ak.fill_none(ak.pad_none(jet_interval,padding_size),0))
        arrays_jet=ak.unzip(ak.fill_none(ak.pad_none(jet_data,padding_size),0))
        track=ak_into_np(arrays_track)
        jet=ak_into_np(arrays_jet)
        interval=ak_into_np(arrays_interval)
        
        return track,jet,interval

This is where I am so far. For efficiency reason I hope to be able to achieve this in awkward before going into numpy. I tried it in numpy with following:

def extend(track, jet, interval):
    events, tracks, varstrack = np.shape(track)
    events, jets, varsjet = np.shape(jet)
    jet_into_track_data = []
    for i in range(events):
        dataloop = []
        for k in range(jets):
            if interval[i][k][0] != 0:
                dataloop.append(np.broadcast_to(jet[i][k], (interval[i][k][0], varsjet)))
        jet_into_track_data.append(dataloop)

    return jet_into_track_data

but it already takes about 3 seconds for only 2000 events, without even achieving my goal. The aim is basically [track_variables] -> [track_variables, jet_variables if track is in interval], and it shall be stored as [(event1)[[track_1], ..., [track_padding_size]], ..., (eventn)[[track_1], ..., [track_padding_size]]]

-- Answer ------------

I don't get to see the structure of your original data and I don't have a clear notion of what the desired final state is, but I can give an example inspired by the above that you can adapt. Also, I'm going to ignore the padding because that only complicates things. You'll probably want to put off padding until you've finished the combinatorics.

The tracks and jets below come from the same set of events (i.e. the arrays have the same lengths and they're jagged, with a different number of tracks in each list and a different number of jets in each list). Since jets are somehow derived from tracks, there are strictly fewer jets than tracks, and I take it that in your problem, the link between them is such that each jet corresponds to a contiguous, non-overlapping set of tracks (the "intervals").

Real tracks and jets would have a lot more properties than these—this is a bare minimum example. I've given the tracks an "id" so we can tell one from another, but these jets only have the inclusive "start" index and exclusive "stop" index.

If your tracks don't come with identifiers, you can add them with ak.local_index.
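
For example (a minimal sketch; the array here is made up just to show the shape of the result):

>>> dummy = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> ak.local_index(dummy, axis=1)
<Array [[0, 1, 2], [], [0, 1]] type='3 * var * int64'>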

>>> import awkward as ak
>>> tracks = ak.Array([[{"id": 0}, {"id": 1}, {"id": 2}],
...                    [],
...                    [{"id": 0}, {"id": 1}]])
>>> jets = ak.Array([[{"start": 0, "stop": 2}, {"start": 2, "stop": 3}],
...                  [],
...                  [{"start": 1, "stop": 2}, {"start": 0, "stop": 1}]])

If you had all combinations between tracks and jets in each event, then you could use a slice to pick the ones that match. This is particularly useful when you have imprecise matches (that you have to match with ΔR or something). The ak.cartesian function produces lists of combinations, and nested=True groups the results by the first argument:

>>> all_combinations = ak.cartesian([tracks, jets], nested=True)
>>> all_combinations.tolist()
[[[({'id': 0}, {'start': 0, 'stop': 2}),
   ({'id': 0}, {'start': 2, 'stop': 3})],
  [({'id': 1}, {'start': 0, 'stop': 2}),
   ({'id': 1}, {'start': 2, 'stop': 3})],
  [({'id': 2}, {'start': 0, 'stop': 2}),
   ({'id': 2}, {'start': 2, 'stop': 3})]],
 [[]],
 [[({'id': 0}, {'start': 1, 'stop': 2}),
   ({'id': 0}, {'start': 0, 'stop': 1})],
  [({'id': 1}, {'start': 1, 'stop': 2}),
   ({'id': 1}, {'start': 0, 'stop': 1})]]]

We can go from there, selecting "id" values that are between the "start" and "stop" values. I started writing up a solution, but the slicing gets kind of complicated, generating all Cartesian combinations is more computationally expensive than is strictly needed for this problem (though nowhere near as expensive as writing a for loop!), and the general view of the Cartesian product is more useful for approximate matches than the exact indexes you have.

Instead, let's write a for loop in Numba, a just-in-time compiler for Python. Numba is limited in the Python that it can compile (all types must be known at compile-time), but it can recognize read-only Awkward Arrays and append-only ak.ArrayBuilder.

Here's a loop that considers only tracks and jets in the same event, loops over tracks, and puts the first jet that matches each track into an output ArrayBuilder.

>>> import numba as nb
>>> @nb.njit
... def match(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for track in tracks:
...             for jet in jets:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(jet)   # at most one
...                     break
...         output.end_list()
...     return output
... 
>>> builder = match(tracks, jets, ak.ArrayBuilder())
>>> builder.snapshot().tolist()
[[{'start': 0, 'stop': 2}, {'start': 0, 'stop': 2}, {'start': 2, 'stop': 3}],
 [],
 [{'start': 2, 'stop': 3}, {'start': 0, 'stop': 2}]]

Notice that these jet objects are duplicated to match the appropriate track. (This "duplication" is actually just a pointer, not really a copy.) To attach this to the tracks, you can assign it:

>>> tracks["jet"] = builder.snapshot()
>>> tracks.tolist()
[[{'id': 0, 'jet': {'start': 0, 'stop': 2}},
  {'id': 1, 'jet': {'start': 0, 'stop': 2}},
  {'id': 2, 'jet': {'start': 2, 'stop': 3}}],
 [],
 [{'id': 0, 'jet': {'start': 2, 'stop': 3}},
  {'id': 1, 'jet': {'start': 0, 'stop': 2}}]]

Here, I've assumed that you want to attach a jet to each track—perhaps you wanted to attach the set of all associated tracks to each jet:

>>> @nb.njit
... def match2(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for jet in jets:
...             output.begin_list()
...             for track in tracks:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(track)    # all tracks
...             output.end_list()
...         output.end_list()
...     return output
... 
>>> jets["tracks"] = match2(tracks, jets, ak.ArrayBuilder()).snapshot()
>>> jets.tolist()
[[{'start': 0, 'stop': 2, 'tracks': [
      {'id': 0, 'jet': {'start': 0, 'stop': 2}},
      {'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
  {'start': 2, 'stop': 3, 'tracks': [
      {'id': 2, 'jet': {'start': 2, 'stop': 3}}]}],
 [],
 [{'start': 1, 'stop': 2, 'tracks': [
      {'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
  {'start': 0, 'stop': 1, 'tracks': [
      {'id': 0, 'jet': {'start': 0, 'stop': 2}}]}]]

(Since I did both, now there are links both ways.) Attaching jets to tracks, rather than tracks to jets, has the added complication that some tracks might not be associated with any jet, in which case you'd have to account for possibly-missing data. (Hint: make them lists of zero or one jet for "no match" and "yes match," then use ak.firsts to convert empty lists to Nones.)
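
Here is a sketch of that hint, reusing the structure of the match function above (untested here, so treat it as a starting point rather than a finished implementation):

@nb.njit
def match_optional(tracks_in_events, jets_in_events, output):
    for tracks, jets in zip(tracks_in_events, jets_in_events):
        output.begin_list()
        for track in tracks:
            output.begin_list()          # zero-or-one jets per track
            for jet in jets:
                if jet.start <= track.id < jet.stop:
                    output.append(jet)
                    break
            output.end_list()
        output.end_list()
    return output

maybe_jets = match_optional(tracks, jets, ak.ArrayBuilder()).snapshot()
# ak.firsts on the innermost lists then turns [] into None and [jet] into jet.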

Or you could make the output of the Numba-compiled function be plain NumPy arrays. Numba knows a lot of NumPy functions. Hint for building up a Numba-compiled function: start with a minimal loop over the data that doesn't do anything, and test-run it as you gradually add output. Numba complains with a lot of output when it can't recognize the types of something—which is allowed in non-compiled Python—so it's good to know which change caused it to complain so much.

Hopefully, these examples can get you started!

-- Question -----------------------------------------------

At the moment, no implementation exists for numpy.linalg.norm with awkward arrays. In awkward1 I can use behavior to overwrite (or add?) implementations, as shown in the tutorial. In my use case I have successfully implemented versions of np.absolute and np.subtract for TVector2 and TVector3 using the tutorial example. However, numpy.linalg.norm fails.

How can I define a behavior for awkward-arrays? E.g. in the example below I tried this with a simple int64 array:

import awkward1 as ak
import numpy as np

def norm(data, *args, **kwargs):
    return np.absolute(data)

ak.behavior[np.linalg.norm, "int64"] = norm

sample = ak.Array(np.arange(10))

print(ak.type(sample))  # 10 * int64
print(np.absolute(sample))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(np.linalg.norm(np.arange(10)))  # 16.881943016134134
print(np.linalg.norm(sample)) # raises TypeError

-- Answer ------------

If you haven't seen https://awkward-array.readthedocs.io/en/latest/ak.behavior.html , it's the place to start.

If np.linalg.norm is a ufunc, the following should work:

ak.behavior[np.linalg.norm, "vector2"] = lambda a: np.sqrt(a.x**2 +a.y**2)

where the data are in records with fields x and y that are named "vector2". You can make such a thing with

ak.zip({"x": x_array, "y": y_array}, with_name="vector2")

(Caveat: I haven't tested this—I wrote it on my phone. I'll test it later and edit this answer, or you might be able to edit it.)

If it's not a ufunc, then there isn't a publicly accessible (i.e. no underscores) way to define it, and perhaps there should be.

Also, be aware of the https://github.com/scikit-hep/vector project, which is defining operations like this.


Edit: Looking at the np.linalg.norm documentation, you're right, it's not a ufunc. (isinstance(np.linalg.norm, np.ufunc) is False.)

Also, this function makes some strong assumptions about its input. Two-dimensional arrays are interpreted as matrices, rather than lists of vectors. In Awkward Array, such distinctions would be labeled by metadata. I don't think we want to implement a function like this, since it could wreak havoc on jagged arrays that are supposed to be lists of different length vectors.

I think what you want is to add vector behaviors, like this:

>>> class Vector2(ak.Record):
...     @property
...     def norm(self):
...         return np.sqrt(self.x**2 + self.y**2)
... 
>>> class ArrayOfVector2(ak.Array):
...     @property
...     def norm(self):
...         return np.sqrt(self.x**2 + self.y**2)
... 
>>> ak.behavior["vector2"] = Vector2
>>> ak.behavior["*", "vector2"] = ArrayOfVector2

This just says that if you have a record that is named "vector2", then it should be presented as Python class Vector2, and if you have an array (any depth) of "vector2" records, they should collectively be presented as an ArrayOfVector2 class.

>>> x_array = ak.Array([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
>>> y_array = ak.Array([[1.0, 2.0, 3.0], [], [4.0, 5.0]])
>>> vectors = ak.zip({"x": x_array, "y": y_array}, with_name="vector2")
>>> vectors[0, 0]
<Vector2 {x: 1.1, y: 1} type='vector2["x": float64, "y": float64]'>
>>> vectors
<ArrayOfVector2 [[{x: 1.1, y: 1}, ... y: 5}]] type='3 * var * vector2["x...'>

That lets you use the methods and properties you've defined:

>>> vectors[0, 0].norm
1.4866068747318506
>>> vectors.norm
<Array [[1.49, 2.97, 4.46], ... [5.95, 7.43]] type='3 * var * float64'>

But also, this is exactly what the https://github.com/scikit-hep/vector project is doing. Keep an eye on it!

-- Question -----------------------------------------------

I have a root file that I open with 2000 entries and a variable amount of subentries, and each column is a different variable. Let's say I am only interested in 5 of those. I want to put them in an array with np.shape(array) = (2000, 250, 5). The 250 is plenty to contain all subentries per entry.

The root file is converted into a dictionary by uproot: DATA = {variablename: [array of entries [array of subentries]]}

I create an array with np.zeros((2000, 250, 5)) and fill it with the data I want, but it takes about 500 ms, and I need a solution that scales, as I aim for 1 million entries later on. I found multiple solutions, but my lowest was about 300 ms:

lim_i=len(N_DATA["nTrack"])
i=0
INPUT_ARRAY=np.zeros((lim_i,500,5))
for l in range(len(INPUT_ARRAY)):
    while i < lim_i:
        EVENT=np.zeros((500,5))
        k=0
        lim_k=len(TRACK_DATA["Track_pt"][i])
        while k<lim_k:
            EVENT[k][0]=TRACK_DATA["Track_pt"][i][k]
            EVENT[k][1]=TRACK_DATA["Track_phi"][i][k]
            EVENT[k][2]=TRACK_DATA["Track_eta"][i][k]
            EVENT[k][3]=TRACK_DATA["Track_dxy"][i][k]
            EVENT[k][4]=TRACK_DATA["Track_charge"][i][k]
            k+=1
        INPUT_ARRAY[i]=EVENT
        i+=1
INPUT_ARRAY

-- Answer ------------

Taking note of Karl Knechtel's second comment, "You should avoid explicitly iterating over Numpy arrays yourself (there is practically guaranteed to be a built-in Numpy thing that just does what you want, and probably much faster than native Python can)," there is a way to do this with array-at-a-time programming, but not in NumPy. The reason Uproot returns Awkward Arrays is because you need a way to deal with variable-length data efficiently.

I don't have your file, but I'll start with a similar one:

>>> import uproot
>>> import skhep_testdata
>>> events = uproot.open(skhep_testdata.data_path("uproot-HZZ.root"))["events"]

The branches that start with "Muon_" in this file have the same variable-length structure as in your tracks. (The C++ typename is a dynamically sized array, interpreted in Python "as jagged.")

>>> events.show(filter_name="Muon_*")
name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Muon_Px              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Py              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Pz              | float[]                  | AsJagged(AsDtype('>f4'))
Muon_E               | float[]                  | AsJagged(AsDtype('>f4'))
Muon_Charge          | int32_t[]                | AsJagged(AsDtype('>i4'))
Muon_Iso             | float[]                  | AsJagged(AsDtype('>f4'))

If you just ask for these arrays, you get them as an Awkward Array.

>>> muons = events.arrays(filter_name="Muon_*")
>>> muons
<Array [{Muon_Px: [-52.9, 37.7, ... 0]}] type='2421 * {"Muon_Px": var * float32,...'>

To put them to better use, let's import Awkward Array and start by asking for its type.

>>> import awkward as ak
>>> ak.type(muons)
2421 * {"Muon_Px": var * float32, "Muon_Py": var * float32, "Muon_Pz": var * float32, "Muon_E": var * float32, "Muon_Charge": var * int32, "Muon_Iso": var * float32}

What does this mean? It means you have 2421 records with fields named "Muon_Px", etc., that each contain variable-length lists of float32 or int32, depending on the field. We can look at one of them by converting it to Python lists and dicts.

>>> muons[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582],
 'Muon_Py': [-11.654671669006348, 0.6934735774993896],
 'Muon_Pz': [-8.16079330444336, -11.307581901550293],
 'Muon_E': [54.77949905395508, 39.401695251464844],
 'Muon_Charge': [1, -1],
 'Muon_Iso': [4.200153350830078, 2.1510612964630127]}

(You could have made these lists of records, rather than records of lists, by passing how="zip" to TTree.arrays or using ak.unzip and ak.zip in Awkward Array, but that's tangential to the padding that you want to do.)

The problem is that the lists have different lengths. NumPy doesn't have any functions that will help us here because it deals entirely in rectilinear arrays. Therefore, we need a function that's specific to Awkward Array, ak.num.

>>> ak.num(muons)
<Array [{Muon_Px: 2, ... Muon_Iso: 1}] type='2421 * {"Muon_Px": int64, "Muon_Py"...'>

This is telling us the number of elements in each list, per field. For clarity, look at the first one:

>>> ak.num(muons)[0].tolist()
{'Muon_Px': 2, 'Muon_Py': 2, 'Muon_Pz': 2, 'Muon_E': 2, 'Muon_Charge': 2, 'Muon_Iso': 2}

You want to turn these irregular lists into regular lists that all have the same size. That's called "padding." Again, there's a function for that, but we first need to get the maximum number of elements, so that we know how much to pad it by.

>>> ak.max(ak.num(muons))
4

So let's make them all length 4.

>>> ak.pad_none(muons, ak.max(ak.num(muons)))
<Array [{Muon_Px: [-52.9, 37.7, ... None]}] type='2421 * {"Muon_Px": var * ?floa...'>

Again, let's look at the first one to understand what we have.

>>> ak.pad_none(muons, ak.max(ak.num(muons)))[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, None, None],
 'Muon_Py': [-11.654671669006348, 0.6934735774993896, None, None],
 'Muon_Pz': [-8.16079330444336, -11.307581901550293, None, None],
 'Muon_E': [54.77949905395508, 39.401695251464844, None, None],
 'Muon_Charge': [1, -1, None, None],
 'Muon_Iso': [4.200153350830078, 2.1510612964630127, None, None]}

You wanted to pad them with zeros, not None, so we convert the missing values into zeros.

>>> ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0)[0].tolist()
{'Muon_Px': [-52.89945602416992, 37.7377815246582, 0.0, 0.0],
 'Muon_Py': [-11.654671669006348, 0.6934735774993896, 0.0, 0.0],
 'Muon_Pz': [-8.16079330444336, -11.307581901550293, 0.0, 0.0],
 'Muon_E': [54.77949905395508, 39.401695251464844, 0.0, 0.0],
 'Muon_Charge': [1, -1, 0, 0],
 'Muon_Iso': [4.200153350830078, 2.1510612964630127, 0.0, 0.0]}

Finally, NumPy doesn't have records (other than the structured array, which also implies that the columns are contiguous in memory; Awkward Array's "records" are abstract). So let's unzip what we have into six separate arrays.

>>> arrays = ak.unzip(ak.fill_none(ak.pad_none(muons, ak.max(ak.num(muons))), 0))
>>> arrays
(<Array [[-52.9, 37.7, 0, 0, ... 23.9, 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[-11.7, 0.693, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[-8.16, -11.3, 0, 0, ... 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[54.8, 39.4, 0, 0], ... 69.6, 0, 0, 0]] type='2421 * var * float64'>,
 <Array [[1, -1, 0, 0], ... [-1, 0, 0, 0]] type='2421 * var * int64'>,
 <Array [[4.2, 2.15, 0, 0], ... [0, 0, 0, 0]] type='2421 * var * float64'>)

Note that this one line does everything from the initial data-pull from Uproot (muons). I'm not going to profile it now, but you'll find that this one line is considerably faster than explicit looping.

Now what we have is semantically equivalent to six NumPy arrays, so we'll just cast them as NumPy. (Attempts to do so with irregular data would fail. You have to explicitly pad the data.)

>>> numpy_arrays = [ak.to_numpy(x) for x in arrays]
>>> numpy_arrays
[array([[-52.89945602,  37.73778152,   0.        ,   0.        ],
        [ -0.81645936,   0.        ,   0.        ,   0.        ],
        [ 48.98783112,   0.82756668,   0.        ,   0.        ],
        ...,
        [-29.75678635,   0.        ,   0.        ,   0.        ],
        [  1.14186978,   0.        ,   0.        ,   0.        ],
        [ 23.9132061 ,   0.        ,   0.        ,   0.        ]]),
 array([[-11.65467167,   0.69347358,   0.        ,   0.        ],
        [-24.40425873,   0.        ,   0.        ,   0.        ],
        [-21.72313881,  29.8005085 ,   0.        ,   0.        ],
        ...,
        [-15.30385876,   0.        ,   0.        ,   0.        ],
        [ 63.60956955,   0.        ,   0.        ,   0.        ],
        [-35.66507721,   0.        ,   0.        ,   0.        ]]),
 array([[ -8.1607933 , -11.3075819 ,   0.        ,   0.        ],
        [ 20.19996834,   0.        ,   0.        ,   0.        ],
        [ 11.16828537,  36.96519089,   0.        ,   0.        ],
        ...,
        [-52.66374969,   0.        ,   0.        ,   0.        ],
        [162.17631531,   0.        ,   0.        ,   0.        ],
        [ 54.71943665,   0.        ,   0.        ,   0.        ]]),
 array([[ 54.77949905,  39.40169525,   0.        ,   0.        ],
        [ 31.69044495,   0.        ,   0.        ,   0.        ],
        [ 54.73978806,  47.48885727,   0.        ,   0.        ],
        ...,
        [ 62.39516068,   0.        ,   0.        ,   0.        ],
        [174.20863342,   0.        ,   0.        ,   0.        ],
        [ 69.55621338,   0.        ,   0.        ,   0.        ]]),
 array([[ 1, -1,  0,  0],
        [ 1,  0,  0,  0],
        [ 1, -1,  0,  0],
        ...,
        [-1,  0,  0,  0],
        [-1,  0,  0,  0],
        [-1,  0,  0,  0]]),
 array([[4.20015335, 2.1510613 , 0.        , 0.        ],
        [2.18804741, 0.        , 0.        , 0.        ],
        [1.41282165, 3.38350415, 0.        , 0.        ],
        ...,
        [3.76294518, 0.        , 0.        , 0.        ],
        [0.55081069, 0.        , 0.        , 0.        ],
        [0.        , 0.        , 0.        , 0.        ]])]

And now NumPy's dstack is appropriate. (This is making them contiguous in memory, so you could use NumPy's structured arrays if you want to. I would find that easier for keeping track of which index means which variable, but that's up to you. Actually, Xarray is particularly good at tracking metadata of rectilinear arrays.)

>>> import numpy as np
>>> np.dstack(numpy_arrays)
array([[[-52.89945602, -11.65467167,  -8.1607933 ,  54.77949905,
           1.        ,   4.20015335],
        [ 37.73778152,   0.69347358, -11.3075819 ,  39.40169525,
          -1.        ,   2.1510613 ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[ -0.81645936, -24.40425873,  20.19996834,  31.69044495,
           1.        ,   2.18804741],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[ 48.98783112, -21.72313881,  11.16828537,  54.73978806,
           1.        ,   1.41282165],
        [  0.82756668,  29.8005085 ,  36.96519089,  47.48885727,
          -1.        ,   3.38350415],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       ...,

       [[-29.75678635, -15.30385876, -52.66374969,  62.39516068,
          -1.        ,   3.76294518],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[  1.14186978,  63.60956955, 162.17631531, 174.20863342,
          -1.        ,   0.55081069],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]],

       [[ 23.9132061 , -35.66507721,  54.71943665,  69.55621338,
          -1.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ],
        [  0.        ,   0.        ,   0.        ,   0.        ,
           0.        ,   0.        ]]])

-- Question -----------------------------------------------

Do we already have a function similar to np.add in awkward arrays?

I am in a situation where I need to add them, and the "+" operator works fine for simple arrays but not for nested arrays.

e.g.

>>> ak.to_list(c1)
[[], [], [], [], [0.944607075944902]]

>>> ak.to_list(c2)
[[0.9800207661211596], [], [], [], []]

>>> c1+c2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/afs/cern.ch/work/k/khurana/EXOANALYSIS/CMSSW_11_0_2/src/bbDMNanoAOD/analyzer/dependencies/lib/python3.6/site-packages/numpy/lib/mixins.py", line 21, in func
    return ufunc(self, other)
  File "/afs/cern.ch/work/k/khurana/EXOANALYSIS/CMSSW_11_0_2/src/bbDMNanoAOD/analyzer/dependencies/lib/python3.6/site-packages/awkward1/highlevel.py", line 1380, in __array_ufunc__
    return awkward1._connect._numpy.array_ufunc(ufunc, method, inputs, kwargs)
  File "/afs/cern.ch/work/k/khurana/EXOANALYSIS/CMSSW_11_0_2/src/bbDMNanoAOD/analyzer/dependencies/lib/python3.6/site-packages/awkward1/_connect/_numpy.py", line 107, in array_ufunc
    out = awkward1._util.broadcast_and_apply(inputs, getfunction, behavior)
  File "/afs/cern.ch/work/k/khurana/EXOANALYSIS/CMSSW_11_0_2/src/bbDMNanoAOD/analyzer/dependencies/lib/python3.6/site-packages/awkward1/_util.py", line 972, in broadcast_and_apply
    out = apply(broadcast_pack(inputs, isscalar), 0)
  File "/afs/cern.ch/work/k/khurana/EXOANALYSIS/CMSSW_11_0_2/src/bbDMNanoAOD/analyzer/dependencies/lib/python3.6/site-packages/awkward1/_util.py", line 745, in apply
    outcontent = apply(nextinputs, depth + 1)
  File "/afs/cern.ch/work/k/khurana/EXOANALYSIS/CMSSW_11_0_2/src/bbDMNanoAOD/analyzer/dependencies/lib/python3.6/site-packages/awkward1/_util.py", line 786, in apply
    nextinputs.append(x.broadcast_tooffsets64(offsets).content)
ValueError: in ListOffsetArray64, cannot broadcast nested list

(https://github.com/scikit-hep/awkward-1.0/blob/0.3.1/src/cpu-kernels/operations.cpp#L778)

The only way I can add them is by using firsts and then replacing the None values with 0.

>>> z1=ak.fill_none(ak.firsts(c1),0.)
>>> z2=ak.fill_none(ak.firsts(c2),0.)

>>> z1
<Array [0, 0, 0, 0, 0.945] type='5 * float64'>

>>> z2
<Array [0.98, 0, 0, 0, 0] type='5 * float64'>

>>> z1+z2
<Array [0.98, 0, 0, 0, 0.945] type='5 * float64'>

Can something similar to np.add be devised for ak, even if with limited scope/functionality? By limited scope I mean that even if it only works on ak arrays of the same dimension, it would serve my present purpose.

Thanks.

-- Answer ------------

The exception that you saw for

>>> ak.to_list(c1)
[[], [], [], [], [0.944607075944902]]

>>> ak.to_list(c2)
[[0.9800207661211596], [], [], [], []]

>>> c1+c2

is correct: you can't add these two arrays. It's not because Awkward lacks an ak.add function. Such a thing would be identical to np.add:

>>> c1 + c1          # this actually calls np.add
<Array [[], [], [], [], [1.89]] type='5 * var * float64'>
>>> np.add(c1, c1)
<Array [[], [], [], [], [1.89]] type='5 * var * float64'>

It doesn't work because the arrays have a different number of elements at each position. It's like trying to add two NumPy arrays with different shapes. (You can add NumPy arrays with certain different shapes, just as you can add Awkward arrays with certain different shapes, if they broadcast. These don't.)

If you want an empty list to behave like a list with a zero in it, then you did the right thing: ak.firsts and ak.singletons convert between two ways of representing missing data:

  • as None vs another value
  • as empty lists vs the value in a length-1 list.

In some languages, a missing or potentially missing value is treated as a length-0 or length-1 list, such as Scala's Option type. Thus,

>>> ak.firsts(c1)
<Array [None, None, None, None, 0.945] type='5 * ?float64'>

presumes that you were starting from empty-or-singleton (appears to be true in your examples) and converts it to an option-type array with one level less depth. Then doing an ak.fill_none means that you wanted these missing values (which came from empty lists) to act like zeros for addition, and you got what you wanted.

>>> ak.fill_none(ak.firsts(c1), 0) + ak.fill_none(ak.firsts(c2), 0)
<Array [0.98, 0, 0, 0, 0.945] type='5 * float64'>
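
For reference, ak.singletons goes the other way, turning an option-type array such as the output of ak.firsts back into empty-or-singleton lists:

>>> ak.singletons(ak.firsts(c1))
<Array [[], [], [], [], [0.945]] type='5 * var * float64'>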

One thing that's not clear from your data is whether you always expect the lists to have at most one item—ak.firsts will only pull the first item out of each list. If you had

>>> c1 = ak.Array([[], [], [], [], [0.999, 0.123]])
>>> c2 = ak.Array([[0.98], [], [], [], []])

then

>>> ak.fill_none(ak.firsts(c1), 0) + ak.fill_none(ak.firsts(c2), 0)
<Array [0.98, 0, 0, 0, 0.999] type='5 * float64'>

might not be what you want, since it drops the 0.123. You might actually want to ak.pad_none each list to have at least one element, like this:

>>> ak.pad_none(c1, 1)
<Array [[None], [None], ... [0.999, 0.123]] type='5 * var * ?float64'>
>>> ak.fill_none(ak.pad_none(c1, 1), 0)
<Array [[0], [0], [0], [0], [0.999, 0.123]] type='5 * var * float64'>

This maintains the structure, distinguishing between list lengths for all lengths except for 0 and 1, because empty lists have been converted into [0]. You can't use this for adding unless these longer lists match lengths (back to your original problem), but you can arrange for that, too.

>>> ak.fill_none(ak.pad_none(c1, 2), 0) + ak.fill_none(ak.pad_none(c2, 2), 0)
<Array [[0.98, 0], [0, ... 0], [0.999, 0.123]] type='5 * var * float64'>

It all depends on what structures you have and what structures you want. It wouldn't be a good idea to create a new function that does one of the two things above, especially if it has a name that's dangerously close to a NumPy function's, like np.add, because it works in a different way that would have to be explained for anyone to safely use it. If you want to do a specialized thing, it's safer to have you build it out of simpler primitives (even if you wrap it up as a convenience function in your own work), because then you know what rules it follows.

-- Question -----------------------------------------------

Depending on your view of my approach, this is either a question about using np.unique() on awkward1 arrays or a call for a better approach:

Let a and b be two awkward1 arrays of the same outer length (number of events) but different inner lengths. For example:

a = [[1, 2], [3]   , [] , [4, 5, 6]]
b = [[7]   , [3, 5], [6], [8, 9]]

Let f: (x, y) -> z be a function that acts on two numbers x and y and results in the number z. For example:

f(x, y):= y - x

The idea is to compare every element in a with every element in b via f for each event and filter out the matches of a and b pairs that survive some cut applied to f. For example:

f(x, y) < 4

My approach for this is:

a = ak.from_iter(a)
b = ak.from_iter(b)

c = ak.cartesian({'x':a, 'y':b})
#c= [[{'x': 1, 'y': 7}, {'x': 2, 'y': 7}], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 4, 'y': 8}, {'x': 4, 'y': 9}, {'x': 5, 'y': 8}, {'x': 5, 'y': 9}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]

i = ak.argcartesian({'x':a, 'y':b})
#i= [[{'x': 0, 'y': 0}, {'x': 1, 'y': 0}], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}], [], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}, {'x': 1, 'y': 0}, {'x': 1, 'y': 1}, {'x': 2, 'y': 0}, {'x': 2, 'y': 1}]]

diff = c['y'] - c['x']
#diff= [[6, 5], [0, 2], [], [4, 5, 3, 4, 2, 3]]

cut = diff < 4
#cut= [[False, False], [True, True], [], [False, False, True, False, True, True]]

new = c[cut]
#new= [[], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 5, 'y': 8}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]

new_i = i[cut]
#new_i= [[], [{'x': 0, 'y': 0}, {'x': 0, 'y': 1}], [], [{'x': 1, 'y': 0}, {'x': 2, 'y': 0}, {'x': 2, 'y': 1}]]

It is possible that pairs with the same element from a but different elements from b survive the cut. (e.g. {'x': 3, 'y': 3} and {'x': 3, 'y': 5})

My goal is to group those pairs with the same element from a together and therefore reshape the new array into:

new = [[], [{'x': 3, 'y': [3, 5]}], [], [{'x': 5, 'y': 8}, {'x': 6, 'y': [8, 9]}]]

My only idea how to achieve this is to create a list of the indexes from a that are still present after the cut by using new_i:

i = new_i['x']
#i= [[], [0, 0], [], [1, 2, 2]]

However, I need a unique version of this list to make every index appear only once. This could be achieved with np.unique() in NumPy. But doesn't work in awkward1:

np.unique(i)

<__array_function__ internals> in unique(*args, **kwargs)

TypeError: no implementation found for 'numpy.unique' on types that implement __array_function__: [<class 'awkward1.highlevel.Array'>]

My question:

Is there a np.unique() equivalent in awkward1, and/or would you recommend a different approach to my problem?

-- Answer ------------

In your approach you used the following code to pair up both arrays:

c = ak.cartesian({'x':a, 'y':b})
#c= [[{'x': 1, 'y': 7}, {'x': 2, 'y': 7}], [{'x': 3, 'y': 3}, {'x': 3, 'y': 5}], [], [{'x': 4, 'y': 8}, {'x': 4, 'y': 9}, {'x': 5, 'y': 8}, {'x': 5, 'y': 9}, {'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]

However, with the nested = True parameter from ak.cartesian() you get a list grouped by the elements of a:

c = ak.cartesian({'x':a, 'y':b}, axis = 1, nested = True)
#c= [[[{'x': 1, 'y': 7}], [{'x': 2, 'y': 7}]], [[{'x': 3, 'y': 3}, {'x': 3, 'y': 5}]], [], [[{'x': 4, 'y': 8}, {'x': 4, 'y': 9}], [{'x': 5, 'y': 8}, {'x': 5, 'y': 9}], [{'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]]

After the cut you end up with:

new = c[cut]
#new= [[[], []], [[{'x': 3, 'y': 3}, {'x': 3, 'y': 5}]], [], [[], [{'x': 5, 'y': 8}], [{'x': 6, 'y': 8}, {'x': 6, 'y': 9}]]]

Extract the y values and reduce the innermost layer of the nested lists of new to only one element:

y = new['y']
#y= [[[], []], [[3, 5]], [], [[], [8], [8, 9]]]

new = ak.firsts(new, axis = 2)
#new= [[None, None], [{'x': 3, 'y': 3}], [], [None, {'x': 5, 'y': 8}, {'x': 6, 'y': 8}]]

Now every innermost entry in new belongs to exactly one element from a. By replacing the current y of new with the previously extracted y, you end up with the desired result:

new['y'] = y
#new= [[None, None], [{'x': 3, 'y': [3, 5]}], [], [None, {'x': 5, 'y': [8]}, {'x': 6, 'y': [8, 9]}]]

-- Question -----------------------------------------------

Summary: TTree branches seem to go missing when running the uproot tutorial.

I have a root file that contains a TTree called 'prod' which has a complicated set of jagged leaves and branches which I can see in the TBrowser in ROOT. I started the uproot tutorial using this root file as the input and receive the following error at the beginning of an interactive session:

>>>import uproot as up
>>> file = up.open('small.root')
>>> file
<ROOTDirectory b'small.root' at 0x025b5e3477c0>
>>> file.keys()
[b'prod;1']
>>> file.classnames()
Traceback (most recent call last):
  File "<pyshell#13>", line 1, in <module>
    file.classnames()
AttributeError: 'ROOTDirectory' object has no attribute 'classnames'
>>> file['prod']
<TTree b'prod' at 0x025b5dd15d00>

Why am I getting an error when trying to get the class names?

Ignoring this and moving on. The next problem is when I try to see what is in the TTree prod

>>> tree = file['prod']
>>> tree.keys()
[b'COSMIC', b'COSMICRES', b'COTNBC', b'COTTIME', b'GLB', b'LUM', b'MET', b'MU', b'PHOTON', b'RESIDUALS', b'TRACK', b'TRKDET', b'VERTEX', b'ZVTX', b'MOM_ntk', b'MOM_pt', b'MOM_px', b'MOM_py', b'MOM_pz']
>>> branches = tree.arrays(namedecode='utf-8')
>>> branches.keys()
dict_keys(['MET', 'MOM_ntk', 'MOM_pt', 'MOM_px', 'MOM_py', 'MOM_pz'])

The uproot tutorial implies that I should get all the branches, but clearly I am missing quite a few of them. In particular, the only branches that were transferred over were the ones that were what I would call 'simple' in the sense that they only had numerical data members.

The other branches contain further items within them. For example, the 'MU' branch has properties of every muon in the event: first the number of such muons, and then a further branch for each of those muons' attributes, like its quality and the track number associated with it. 'MET', 'MOM_ntk', 'MOM_pt', 'MOM_px', 'MOM_py', and 'MOM_pz' all contain only lists of floating point numbers. MOM_ntk has only one number per event (call it 'alpha'), and the other MOM branches will have 'alpha' numbers in each of them.

The file only has 1000 events and is only about 5MB in total size.

I am wondering where all the other Branches have gone! Where are my friends 'COTNBC' or 'GLB' (that one should have all the run and event numbers in it).

Any advice or help would be appreciated!


Windows 10 PC with 32GB RAM Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz x64 processor

-- Answer ------------

Based on what you're trying to do and the outputs you're getting, it looks like you're expecting Uproot 4, but you have Uproot 3.

One of those differences is that Uproot 3's keys() methods did not recurse through subbranches of branches, but Uproot 4's does. (There's a recursive=True/False parameter in both, but the default changed for exactly the reason you're seeing.)
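
A quick way to see the missing names with the version you have installed (a sketch, reusing the tree from your session; Uproot 4 does this by default):

>>> tree = file['prod']
>>> tree.keys(recursive=True)   # now includes the subbranches nested under b'MU', b'GLB', etc.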

-- Question -----------------------------------------------

I have an awkward (1.x) array which I obtained from post-processing.

An array look like:

>>> ak.Array([96., 99., 67., 13.,  3.,  None, 1.,  1.,None])

I want to remove the None elements from this array. I could remove them using a loop, but I want to avoid that to save some computing time. Or is writing a function and compiling it with Numba the only option?

Thanks.

-- Answer ------------

ak.is_none is the function you want to use:

a[~ak.is_none(a)]
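
For example, with the array from your question:

>>> import awkward as ak
>>> a = ak.Array([96., 99., 67., 13., 3., None, 1., 1., None])
>>> a[~ak.is_none(a)]   # keeps only the seven non-missing values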

-- Question -----------------------------------------------

I have a function which I would like to convert so that I can use it with Awkward Array 1.

The following function used to work for floats but not for awkward arrays, for the known reasons.

def Phi_mpi_pi(x):
    kPI=(3.14159265)
    kTWOPI = 2 * kPI

    #while ((x.any() >= kPI).any()): x = x - kTWOPI;
    #while ((x.any() < -kPI).any()): x = x + kTWOPI;
    while ((x >= kPI)): x = x - kTWOPI;
    while ((x < -kPI)): x = x + kTWOPI;
    return x;

I tried to convert it into numpy/awkward compatible form and new function look like

def Phi_mpi_pi(x):
    kPI=numpy.array(3.14159265)
    kPI = kPI.repeat(len(x))
    kTWOPI = 2 * kPI
    while ((x >= kPI)): x = x - kTWOPI;
    while ((x < -kPI)): x = x + kTWOPI;
    return x;

This function remains stuck in the while loop forever, and I couldn't find a way to debug it.

The task of the function is to keep the values in an awkward array between ±kPI, but this logic does not give the desired results.

e.g.

x=ak.Array([[0.7999999999999998, 1.0, -1.3], [], [-1.4], [-1.8000000000000003, -6.1000000000000005, -1.6000000000000005], [-4.6]])

However, the comparison (x <= -kPI) by itself gives the desired output:

>>> ak.to_list(x <= -kPI)
    [[False, False, False], [], [False], [False, True, False], [True]]

but the function does not.

The desired output should be between ±kPI, based on the logic of the while loop. Is there something straightforward, or any suggestion, that can be used?

-- Answer ------------

You want to adjust every single scalar value in x (which are all angles) to be between -π and π.

You can do it like this:

def Phi_mpi_pi(x):
    y = np.add(x, np.pi)
    y = np.mod(y, 2 * np.pi)
    y = np.subtract(y, np.pi)
    return y

Or, more terse and much less readable:

def Phi_mpi_pi(x):
    return np.subtract(np.mod(np.add(x, np.pi), 2*np.pi), np.pi)

What it does is this:

  1. Add π to all angles (so they point in the opposite direction).
  2. Take modulo 2π on all angles so they are all from 0 to 2π (2π not included).
  3. Subtract π from all the angles again (so they point in the right direction again). Now they are all from -π to +π (+π not included).

Test:

x = ak.Array([[0.3, 3.1, numpy.pi, -numpy.pi, -4 * numpy.pi,
               200 * numpy.pi, 2 * numpy.pi, -400 * numpy.pi], 
              [], [-1.4], [-1.8, -6, -1.6], [-4.6]])
y = Phi_mpi_pi(x)
print("Type of result:", type(y))
print("Result =", y)

# Check that the resulting range of each value is correct.
range_is_correct = (ak.all(y >= -numpy.pi) and ak.all(y < numpy.pi))

# Calculate the factors of the 2π adjustments.
factors = (x - y) / (2 * numpy.pi)
print("2π factors=", factors)

# Test that all factors of the 2π adjustments are approximate integers.
adjustments_are_correct = ak.all(numpy.abs(numpy.mod(factors, 1)) < 1e-15)

# Test that all values are correct, given the test input.
print("Result is correct:", range_is_correct and adjustments_are_correct)

gives this output:

Type of result: <class 'awkward1.highlevel.Array'>
Result = [[0.3, 3.1, -3.14, -3.14, 0, 1.78e-14, 0, 4.62e-14, ... [-1.8, 0.283, -1.6], [1.68]]
2π factors= [[2.65e-17, 7.07e-17, 1, 0, -2, 100, 1, -200], [], [0], [0, -1, 0], [-1]]
Result is correct: True

which proves that the operation was correctly performed with the specifically used test data.

-- Question -----------------------------------------------

I have a problem when trying to plot a 2D histogram or graph from jagged arrays with different lengths.

Here is a simple example. Suppose there are 7 events of gen-level pT and corresponding Et.

pT = [ [46.8], [31.7], [21], [29.9], [13.9], [41.2], [15.7] ]
Et = [ [41.4], [25.5, 20], [19.6], [27.4], [12, 3.47], [37.8], [10] ]

Here, some events (2nd, 5th) have two y values corresponding to one x value. I want to make a graph or 2D histogram with x = pT and y = Et, and keep both points, i.e. (31.7, 25.5) and (31.7, 20).

How can I align these values for plotting?

-- Answer ------------

What you want to do is "broadcast" the two arrays:

Awkward broadcasting is a generalization of NumPy broadcasting to include variable-length lists.

Broadcasting usually happens automatically when you're performing a mathematical calculation:

>>> import awkward as ak
>>> ak.Array([[1, 2, 3], [], [4, 5]]) + ak.Array([100, 200, 300])
<Array [[101, 102, 103], [], [304, 305]] type='3 * var * int64'>

but you can also do it manually:

>>> ak.broadcast_arrays(ak.Array([[1, 2, 3], [], [4, 5]]),
...                     ak.Array([100, 200, 300]))
[<Array [[1, 2, 3], [], [4, 5]] type='3 * var * int64'>,
 <Array [[100, 100, 100], [], [300, 300]] type='3 * var * int64'>]

When the two arrays have different depths (different "dimensions" in NumPy terminology), scalars from one are replicated to align with all elements of lists in the other.

You have two lists of the same depth:

>>> pT = ak.Array([ [46.8], [31.7], [21], [29.9], [13.9], [41.2], [15.7] ])
>>> Et = ak.Array([ [41.4], [25.5, 20], [19.6], [27.4], [12, 3.47], [37.8], [10] ])

To manually broadcast them, you could reduce the depth of pT by taking the first element from each list.

>>> pT[:, 0]
<Array [46.8, 31.7, 21, ... 13.9, 41.2, 15.7] type='7 * float64'>

Then you can broadcast each scalar of pT into each list of Et.

>>> ak.broadcast_arrays(pT[:, 0], Et)
[<Array [[46.8], [31.7, 31.7, ... 41.2], [15.7]] type='7 * var * float64'>,
 <Array [[41.4], [25.5, 20], ... [37.8], [10]] type='7 * var * float64'>]

This will be more clear if I print them in their entirety by turning them into Python lists:

>>> pT_broadcasted, Et = ak.broadcast_arrays(pT[:, 0], Et)
>>> pT_broadcasted.tolist()
[[46.8], [31.7, 31.7], [21.0], [29.9], [13.9, 13.9], [41.2], [15.7]]
>>> Et.tolist()
[[41.4], [25.5, 20.0], [19.6], [27.4], [12.0, 3.47], [37.8], [10.0]]

Now you see that the 31.7 has been duplicated to align with each value in [25.5, 20.0].

In NumPy, you'll often see examples of broadcasting a dimension of length 1, rather than creating a dimension, like this:

>>> import numpy as np
>>> np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) + np.array([[100], [200], [300]])
array([[101, 102, 103],
       [204, 205, 206],
       [307, 308, 309]])

Awkward Array follows this rule, but only if the dimension has length "exactly 1," not "a bunch of variable-length lists that happen to each have length 1." The way I've written pT, it has the latter:

>>> ak.type(pT)     # 7 lists with variable length
7 * var * float64
>>> ak.num(pT)      # they happen to each have length 1... this time...
<Array [1, 1, 1, 1, 1, 1, 1] type='7 * int64'>

Since these lists are in-principle variable, they don't broadcast the way that length-1 NumPy arrays would.

>>> ak.broadcast_arrays(pT, Et)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward1/operations/structure.py", line 699, in broadcast_arrays
    out = awkward1._util.broadcast_and_apply(inputs, getfunction, behavior)
  File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 972, in broadcast_and_apply
    out = apply(broadcast_pack(inputs, isscalar), 0)
  File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 745, in apply
    outcontent = apply(nextinputs, depth + 1)
  File "/home/jpivarski/irishep/awkward-1.0/awkward1/_util.py", line 786, in apply
    nextinputs.append(x.broadcast_tooffsets64(offsets).content)
ValueError: in ListOffsetArray64, cannot broadcast nested list

(https://github.com/scikit-hep/awkward-1.0/blob/0.3.2/src/cpu-kernels/operations.cpp#L778)

If you explicitly cast the array as NumPy, it will have regular types. (Note to self: it would be nice to have a way to turn a variable-length dimension regular or vice-versa without converting the whole array to NumPy.)

>>> ak.type(pT)
7 * var * float64
>>> ak.type(ak.to_numpy(pT))
7 * 1 * float64

So an alternate way to get the same broadcasting is to convert pT to NumPy, instead of picking out the first element of each list with pT[:, 0].

>>> ak.broadcast_arrays(ak.to_numpy(pT), Et)
[<Array [[46.8], [31.7, 31.7, ... 41.2], [15.7]] type='7 * var * float64'>,
 <Array [[41.4], [25.5, 20], ... [37.8], [10]] type='7 * var * float64'>]

Either way, an assumption is being made that pT consists of lists of length 1. The pT[:, 0] expression assumes this because it requires something to have index 0 in each list (so the length is at least 1) and it ignores whatever else might be there. The ak.to_numpy(pT) expression will raise an exception if the pT array doesn't happen to be regular, a shape that can be expressed in NumPy.

Now that you have pT_broadcasted and Et aligned with the same structure, you'll have to flatten them both to pass them to a plotting routine (which expects non-jagged data).

>>> ak.flatten(pT_broadcasted), ak.flatten(Et)
(<Array [46.8, 31.7, 31.7, ... 13.9, 41.2, 15.7] type='9 * float64'>,
 <Array [41.4, 25.5, 20, ... 3.47, 37.8, 10] type='9 * float64'>)

The plotting routine will probably try np.asarray on each of these, which is identical to ak.to_numpy, which will work because these flattened arrays are regular. If you have doubly jagged data or something more complex, you'd have to flatten more.
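
For example, with Matplotlib (a sketch; hist2d and the bin count are just illustrative choices):

import matplotlib.pyplot as plt

x = ak.to_numpy(ak.flatten(pT_broadcasted))
y = ak.to_numpy(ak.flatten(Et))

plt.hist2d(x, y, bins=20)   # 2D histogram; plt.scatter(x, y) would give a graph instead
plt.xlabel("pT")
plt.ylabel("Et")
plt.show()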

-- Question -----------------------------------------------

I'm running into some trouble when using awkward array 1.0 to make combinations of particles with ak.combinations. I see the error when trying to do ak.unzip on the combinations I have made (following the tutorial here: https://mybinder.org/v2/gh/jpivarski/2020-07-13-pyhep2020-tutorial.git/1.1?urlpath=lab/tree/tutorial.ipynb).

I have put an example which reproduces the error here:

https://github.com/donalrinho/awkward1_combinatorics_test

and would be very grateful for any help to understand what I'm doing wrong. The error I see is:

Traceback (most recent call last):
  File "python/simple_combiner.py", line 41, in <module>
    pi1, pi2 = ak.unzip(pi_pairs)
ValueError: too many values to unpack (expected 2)

The error doesn't occur when I restrict the input data to a specific variable, so for instance:

pi_pairs = ak.combinations(pi["pt"], 2)
pi1, pi2 = ak.unzip(pi_pairs)

where I pass in only the pt of the pi array.

-- Answer ------------

The ValueError: too many values to unpack (expected 2) is a Python (not Awkward) error message saying that the tuple on the right hand side of the assignment has a length that is not 2. We can get the same error message like this:

>>> pi1, pi2 = ("a", "b", "c")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: too many values to unpack (expected 2)

So the ak.unzip(pi_pairs) must have a length that is not 2. All that ak.unzip does is to find record fields in the array that it's given and make a tuple where each of these is projected out. All it would take to get the above error message is for the array to have more or less than two fields. Here's an example with three:

>>> ak.unzip(ak.Array([{"x": 1, "y": 2, "z": 3}, {"x": 10, "y": 20, "z": 30}]))
(<Array [1, 10] type='2 * int64'>,
 <Array [2, 20] type='2 * int64'>,
 <Array [3, 30] type='2 * int64'>)

The ak.combinations function should create an array of records with n fields, where n is the argument you passed. That seems to be 2 in your code example, but before we can tell what's going on, you need to print out pi_pairs, the output of ak.combinations. Is ak.unzip returning a tuple with zero items? One? Three? More?

It looks like you're developing this as a script. I'd recommend developing interactively, on a Python console, iPython, Jupyter, etc. At least when I am building up an analysis based on NumPy-like indexing, I get the indexing wrong more than half the time, but with the immediate feedback, such mistakes can be fixed in seconds. ("Oh yeah, I forgot to include a...") It's likely that the pi you're passing into the ak.combinations(pi, 2) is not what you think it is. I'd recommend breaking it down and examining small examples (slice the first event or two and call ak.to_list and ak.type on it), working backward until you get to a structure that is shaped as you expect it to be.
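
For reference, here is a minimal sketch of what a working round trip looks like (the field names pt and eta are made up; the point is that ak.combinations produces records with exactly two fields, "0" and "1", so unpacking into two names succeeds):

>>> import awkward as ak
>>> pi = ak.Array([[{"pt": 1.1, "eta": 0.5}, {"pt": 2.2, "eta": -0.3}, {"pt": 3.3, "eta": 1.0}], []])
>>> pi_pairs = ak.combinations(pi, 2)
>>> ak.fields(pi_pairs)
['0', '1']
>>> pi1, pi2 = ak.unzip(pi_pairs)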

-- Question -----------------------------------------------

I am very new to python and uproot. Previously, I have been using ROOT in a C++ environment. Following the uproot tutorial, I can read my TH2D histograms from a .root file.

I now want to recreate and replot the existing histogram with matplotlib or seaborn, but I don't understand the structure of the imported TH2. myTH2D._members() outputs correctly:

['fName',
 'fTitle',
 'fLineColor',
 'fLineStyle',
 'fLineWidth',
 'fFillColor',
 'fFillStyle',
 'fMarkerColor',
 'fMarkerStyle',
 'fMarkerSize',
 'fNcells',
 'fXaxis',
 'fYaxis',
 'fZaxis',
 'fBarOffset',
 'fBarWidth',
 'fEntries',
 'fTsumw',
 'fTsumw2',
 'fTsumwx',
 'fTsumwx2',
 'fMaximum',
 'fMinimum',
 'fNormFactor',
 'fContour',
 'fSumw2',
 'fOption',
 'fFunctions',
 'fBufferSize',
 'fBuffer',
 'fBinStatErrOpt',
 'fScalefactor',
 'fTsumwy',
 'fTsumwy2',
 'fTsumwxy']

myTH2D.edges outputs the right axes, and myTH2D.values outputs the right counts (confirmed with a rough plt.imshow(myTH2D.values)). The problems start when I call myTH2D.pandas():

count	variance
tof1 [ns]	tof2 [ns]		
[-inf, 4500.0)	[-inf, 4500.0)	0.0	0.0
[4500.0, 4507.142857142857)	0.0	0.0
[4507.142857142857, 4514.285714285715)	0.0	0.0
[4514.285714285715, 4521.428571428572)	0.0	0.0
[4521.428571428572, 4528.571428571428)	0.0	0.0
...	...	...	...
[7500.0, inf)	[6971.428571428572, 6978.571428571429)	0.0	0.0
[6978.571428571429, 6985.714285714286)	0.0	0.0
[6985.714285714286, 6992.857142857143)	0.0	0.0
[6992.857142857143, 7000.0)	0.0	0.0
[7000.0, inf)	0.0	0.0
123904 rows × 2 columns

and the tuple that is created with myTH2D.numpy() is nested in a way that I don't understand:

(array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]),
 [(array([4500.        , 4508.57142857, 4517.14285714, 4525.71428571,
          4534.28571429, 4542.85714286, 4551.42857143, 4560.        ,
          ...,
          7414.28571429, 7422.85714286, 7431.42857143, 7440.        ,
          7448.57142857, 7457.14285714, 7465.71428571, 7474.28571429,
          7482.85714286, 7491.42857143, 7500.        ]),
   array([4500.        , 4507.14285714, 4514.28571429, 4521.42857143,
          4528.57142857, 4535.71428571, 4542.85714286, 4550.        ,
          ...,
          6957.14285714, 6964.28571429, 6971.42857143, 6978.57142857,
          6985.71428571, 6992.85714286, 7000.        ]))])

Do you have any suggestions on how to handle this tuple?

Thank you!

EDIT:

with the following syntax, I can almost achieve the right plot. It is flipped compared to the original:

    plt.pcolormesh(myTH2D[1][0][0],myTH2D[1][0][1],myTH2D[0])

Nevertheless, my problem is still there: I'd like to have the data processed through pandas so that I have the labels; right now I don't know which is the x-axis and which is the y-axis. Any ideas?

-- Answer ------------

From the arrays of edges and bin counts (myTH2D.numpy()), you could use any of these techniques to plot it in Matplotlib:

https://stackoverflow.com/questions/27156381/python-creating-a-2d-histogram-from-a-numpy-matrix

You mentioned Seaborn, but I'm less familiar with that. Surely it has similar functions.
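
For example, starting from the (values, [(xedges, yedges)]) tuple you showed (a sketch; whether you need the transpose depends on which axis ROOT filled first, and the axis labels are taken from your pandas output):

import matplotlib.pyplot as plt

values, ((xedges, yedges),) = myTH2D.numpy()

plt.pcolormesh(xedges, yedges, values.T)   # transposed because pcolormesh indexes the counts as [y, x]
plt.xlabel("tof1 [ns]")
plt.ylabel("tof2 [ns]")
plt.colorbar(label="count")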

You could instead install uproot>=4 and hist>=2, and then just

myTH2D.to_hist().plot()

The hist library aims to be a one-stop-shop for histogramming, and it's close to its first non-prerelease version.

The Uproot 4 codebase is almost ready to replace the current Uproot; it needs documentation and file-writing capabilities. The interface is slightly different to address issues with Uproot 3's interface (e.g. strings vs bytestrings), so that's why this is being handled as a gradual transition with a temporarily different library name, rather than changing all at once. But if you're just starting out, you might want to start with the new library, so that you don't have to get used to a change in the near future (this fall).

-- Question -----------------------------------------------

I have a root file from which I would like to extract a certain candidate per event. On the other hand, I have a numpy array containing the index of the candidate I want to extract.

Let's say that my root file has the following branch:

branch = [[8.956237 9.643666] [5.823581] [3.77208 5.6549993] [5.91686] [13.819047 14.108783]]

And I want the first candidate of the first 4 events and the second for the last, therefore, I have the following numpy array:

npMask = array([[0],[0],[0],[0],[1]])

When I apply the npMask to the branch, the result is not what I expected:

branch[npMask]
[[[8.956237 9.643666]] [[8.956237 9.643666]] [[8.956237 9.643666]] [[8.956237 9.643666]] [[5.823581]]]

However, if I cast the numpy array into a jagged array, it works just fine:

awkMask = awk.fromiter(npMask)

branch[awkMask]
[[8.956237] [5.823581] [3.77208] [5.91686] [14.108783]]

The problem here is that the casting takes too much time: I am using the iterate method, and with 10k entrysteps the casting takes around 65% of the time per iteration.

So, my question here is: Is there a correct way to use a numpy array as a mask for a jagged array?



Note

I create my numpy array by comparing three different branches and selecting the candidate with the highest value from those three branches, e.g.

compare1 = [[0 -0.1] [0] [0.65 0.55] [0.5] [0.6 0.9]]

compare2 = [[0.99 -0.1] [0.9] [0.45 0.2] [0.5] [0.66 0.99]]

compare3 = [[0.91 0.3] [0.77] [0.5 -0.2] [0.5] [0.87 0.59]]

-- Answer ------------

ak.from_iter is the one function that is allowed to be written with Python for loops, and hence it is, by design, slow. The function you want for turning regular NumPy arrays into Awkward Arrays that happen to have uniform counts is ak.from_numpy. That ought to be considerably faster.

Meanwhile, your original issue is an example of an inconsistency in Awkward 0.x. In Awkward 1.x, the behavior of Awkward Arrays that happen to be regular and NumPy arrays with the same logical meaning are identical.
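
A minimal sketch of the suggested replacement (Awkward 1.x naming, with the mask from your question):

import numpy as np
import awkward as ak

npMask = np.array([[0], [0], [0], [0], [1]])
awkMask = ak.from_numpy(npMask)   # a regular 5 x 1 Awkward array, built without a Python loop
# branch[awkMask] then selects one candidate per event, as in the question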

-- Question -----------------------------------------------

I need to read a root tree that contains a 2D array stored within a struct and would like to use uproot for this.

For example: the following code-snippet creates a tree with both a 2D array and a 2D array within a struct. Uproot has no problem reading the 2D array by itself, but does not know how to parse it within the struct.

Is there a way to tell uproot how to parse this struct?

Float_t x2[15][2]={{0}};
struct POINT{
  Float_t x[15][2]={{0}};
  Float_t y[15][2]={{0}};
};
POINT point;
TTree tree("T","ROOT tree with 2D array and 2D array in struct");
tree.Branch("point",&point,"x[15][2]:y[15][2]");
tree.Branch("x2",x2,"x2[15][2]/F");

-- Answer ------------

Although I can't test this without the file, the following should allow you to read that data structure:

import numpy as np
import uproot

tree = uproot.open("your_file.root")["T"]   # placeholder file name; "T" is the tree name from the snippet above

dtype = np.dtype([("x", ">f4", (15, 2)), ("y", ">f4", (15, 2))])
interpretation = uproot.AsDtype(dtype)
points_array = tree["point"].array(interpretation)

It is a problem that Uproot is not recognizing this structure, though I can imagine why not: it's a special case (fixed-size dimensions) of a special case (leaf-list). If you could post a small (< 1 MB) example file as an Uproot GitHub issue, I'll look into the problem of automatically identifying this interpretation.

-- Question -----------------------------------------------

I'm wondering about the recommended protocol for converting TLorentzVector information from .root files into Pandas DataFrames. So far, my strategy has been to save pT, eta, and phi information for each particle I care about. I then write my own functions (based on TLorentzVector definitions) to compute any other quantities I might occasionally need, like DeltaR, mT, etc.

I then wondered if I could just save only the TLorentzVector to my DataFrame and use uproot to get quantities like pT, eta, phi, etc. on-the-fly using something like this (which works when I'm running over a DataFrame that I have just converted from a .root file):

for row in df.index:
	print(df.at[row,"leptons_p4_0"].pt)

I quickly realized, though, that Pandas alone doesn't understand what a TLorentzVector is, so this doesn't work when I reload the file later on using pd.read_csv.

My question, then, is how do others recommend I save TLorentzVector information in a DataFrame that I'll open later in pandas, not uproot? It seems like my options are either to save (pT, eta, phi) columns for each particle and then write my own functions, or to save the TLorentzVector components (E, px, py, pz) and use uproot_methods to convert those components back into a TLorentzVector each time I re-load the DataFrame. Or, hopefully, there's another, easier solution I haven't come across yet!

Thanks very much for any suggestions.

-- Answer ------------

Since Pandas does not have any facilities for dealing with Lorentz vectors, expressing them in terms of their components (pT, eta, phi, mass) and writing your own functions for transforming them is the only way to go, especially if you want to save to and from CSV.

That said, it is possible to create Lorentz vector objects that retain their "Lorentziness" inside of Pandas, but there are limitations. You can create structured data as Awkward Arrays:

>>> import awkward as ak
>>> import pandas as pd
>>> import numpy as np
>>> class Lorentz:
...     @property
...     def p(self):
...         return self.pt * np.cosh(self.eta)
... 
>>> class LorentzRecord(Lorentz, ak.Record): pass
... 
>>> class LorentzArray(Lorentz, ak.Array): pass
... 
>>> ak.behavior["lorentz"] = LorentzRecord
>>> ak.behavior["*", "lorentz"] = LorentzArray
>>> array = ak.Array([{"pt": 1.1, "eta": 2.2},
...                   {"pt": 3.3, "eta": 4.4},
...                   {"pt": 5.5, "eta": -2.2}],
...                  with_name="lorentz")
>>> array
<LorentzArray [{pt: 1.1, eta: 2.2}, ... eta: -2.2}] type='3 * lorentz["pt": floa...'>

The above defined an array of records with fields pt and eta and gave both the single-record and the array-of-records views a new property p, which is derived from pt and eta.

>>> # Each record has a pt, eta, and p.
>>> array[0].pt
1.1
>>> array[0].eta
2.2
>>> array[0].p
5.024699161788051
>>> # The whole array has a pt, eta, and p (columns).
>>> array.pt
<Array [1.1, 3.3, 5.5] type='3 * float64'>
>>> array.eta
<Array [2.2, 4.4, -2.2] type='3 * float64'>
>>> array.p
<Array [5.02, 134, 25.1] type='3 * float64'>

You can put an array of Lorentz records into a Pandas DataFrame:

>>> df = pd.DataFrame({"column": array})
>>> df
                 column
0   {pt: 1.1, eta: 2.2}
1   {pt: 3.3, eta: 4.4}
2  {pt: 5.5, eta: -2.2}

and do the same things with it:

>>> df.column.values.pt
<Array [1.1, 3.3, 5.5] type='3 * float64'>
>>> df.column.values.eta
<Array [2.2, 4.4, -2.2] type='3 * float64'>
>>> df.column.values.p
<Array [5.02, 134, 25.1] type='3 * float64'>

but that's because we're pulling the Awkward Array back out to apply these operations.

>>> df.column.values
<LorentzArray [{pt: 1.1, eta: 2.2}, ... eta: -2.2}] type='3 * lorentz["pt": floa...'>

Any NumPy functions applied to the DataFrame, such as negation (implicitly calls np.negative), get passed through to the Awkward Array without having to unpack it.

>>> -df
                  column
0  {pt: -1.1, eta: -2.2}
1  {pt: -3.3, eta: -4.4}
2   {pt: -5.5, eta: 2.2}

but at present, it's the wrong operation: it shouldn't negate the pt. It's possible to further overload that:

>>> def negative_Lorentz(x):
...     return ak.zip({"pt": x.pt, "eta": -x.eta})
... 
>>> ak.behavior[np.negative, "lorentz"] = negative_Lorentz
>>> -df
                 column
0  {pt: 1.1, eta: -2.2}
1  {pt: 3.3, eta: -4.4}
2   {pt: 5.5, eta: 2.2}

We're still building up a suite of functions for Lorentz arrays, but now they work in the array-at-a-time mode that Pandas operates in. There's a project called vector to define all of these functions for 2D, 3D, and Lorentz vectors, but it's in early stages of development.

Getting back to the issue of saving—all of the above doesn't help you with that because Pandas "saves" these data by printing them out:

>>> df.to_csv("whatever.csv")

writes

,column
0,"{pt: 1.1, eta: 2.2}"
1,"{pt: 3.3, eta: 4.4}"
2,"{pt: 5.5, eta: -2.2}"

which is not something that can be read back. We can try,

>>> df2 = pd.read_csv("whatever.csv")
>>> df2
   Unnamed: 0                column
0           0   {pt: 1.1, eta: 2.2}
1           1   {pt: 3.3, eta: 4.4}
2           2  {pt: 5.5, eta: -2.2}

and so far, it looks good, but it isn't good:

>>> df2.column.values
array(['{pt: 1.1, eta: 2.2}', '{pt: 3.3, eta: 4.4}',
       '{pt: 5.5, eta: -2.2}'], dtype=object)

They're strings. They are no longer computable. So if you want to save to a file, break it down into components.
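
A minimal sketch of that round trip, reusing the array and behavior defined above (the file name is arbitrary):

>>> pd.DataFrame({"pt": ak.to_numpy(array.pt), "eta": ak.to_numpy(array.eta)}).to_csv("components.csv", index=False)
>>> df3 = pd.read_csv("components.csv")
>>> array2 = ak.zip({"pt": df3["pt"].values, "eta": df3["eta"].values}, with_name="lorentz")
>>> array2.p   # computable again
<Array [5.02, 134, 25.1] type='3 * float64'>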

Maybe all of this can be pulled together into a usable system, but some aspects, like saving these arrays with their "Lorentziness" intact, are not ready yet.

-- Question -----------------------------------------------

Is there a way to open a ROOT file using s3?

ROOT supports reading (but not writing) with TS3WebFile.

-- Answer ------------

Since uproot supports HTTP, it is possible to convert an S3 resource to HTTP by signing the URL. It can be done with boto3:

import boto3
import uproot

s3_client = boto3.client('s3', endpoint_url=my_endpoint)
url = s3_client.generate_presigned_url('get_object', Params={'Bucket': my_bucket, 'Key': my_key}, ExpiresIn=10000)

f = uproot.open(url)

There are various authentication methods when using boto, e.g. via the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

The same procedure (thanks Chris) can be done without any special library, by implementing the signing algorithm documented by S3 (https://docs.aws.amazon.com/general/latest/gr/sigv4-signed-request-examples.html, http://awsdocs.s3.amazonaws.com/S3/latest/s3-qrc.pdf).

-- Question -----------------------------------------------

I'm new to uproot and I am trying to achieve a fairly simply task, but I'm not sure how to do this. Essentially, I have a root file that contains a bunch of histograms and one TTree that is made up of 8 branches for roughly 4 million entries.

What I need to do is make a new root file and copy 80% of the TTree from the original file into a TTree (called training) and the remaining 20% into a second TTree in the same new file (called test).

What I have tried is making a directory in python into which I read all of the data from the original file branch by branch. I then used this directory to write the data into the two new TTrees.

This is kind of working: I am getting a file with the structure that I wanted, but I'm not entirely satisfied, for two reasons:

  • Surely there has to be a more direct way? First reading the data into python and then writing it into a file seems extremely cumbersome and memory intensive.
  • I am honestly not very experienced with root, but from the way I understand it, in my original file, I have a tree that contains my 4 million events. Each event has a value for each branch, so when I say, 'get me entry 555!', I get 8 values (1 for each branch). If I just copy the branches the way I am doing, do I lose this structure, or does the index of all the arrays in my directory replace the entry number? So, grabbing the values from all arrays at index 555 would be the same as returning entry 555 before?

Any help would be welcome. Thanks!

-- Answer ------------

This task would always involve reading into memory and writing back out, whether those arrays are in the user's control or hidden.

There's one possible exception, if you want to read TBaskets from one file and write them to another without decompressing them—then they're still in memory but not decompressed and that can be a performance boost. ROOT can do this as a "fast copy," but Uproot doesn't have an equivalent. Such a copy would require that you don't want to modify the data in the TBaskets in any way, including slicing at arbitrary event boundaries, which can be an issue if the TBaskets for the 8 TBranches you're interested in don't line up at common event boundaries. (Such a feature could be added to Uproot—there's no technical limitation, but this feature is only useful in certain cases.)

So the process of reading arrays out of one file and writing them into another is about as good as it gets, with the above caveat.

I am unsure what you mean by a "directory in Python."

To answer your second question, the arrays that are read out of a TTree are aligned in the sense that entry 555 of one TBranch belongs to the same event as entry 555 of another TBranch. This is a common way of working with sets of arrays in NumPy, though it's an uncommon way of working with ROOT data; in ROOT, an event is an object or at least you don't see more than one object at a time.

If you have memory issues (probably not with 8 TBranches × 4 million events, not jagged, = 244 MB of RAM if double precision), then you can consider iterating:

numtraining = int(0.8 * ttree.num_entries)
numtest = ttree.num_entries - numtraining

for chunk in ttree.iterate("*", step_size="1 GB", entry_stop=numtraining):
    training.extend(chunk)

for chunk in ttree.iterate("*", step_size="1 GB", entry_start=numtraining):
    test.extend(chunk)

This gives you control over the size of your output TBaskets, since each TBranch gets one TBasket per call to extend. The above example ensures that a set of TBranches that all have to be used together collectively consist of at most 1 GB.

Unlike a "fast copy" (see above), what you're doing is not just copying, it's also repartitioning the data, which can improve performance when you read those output files. Generally, larger chunks (larger TBaskets) are faster to read, but too large and they can require too much memory.

-- Question -----------------------------------------------

It seems uproot doesn't recognise "open". I'm using code from the documentation:

import uproot 
import skhep_testdata

file = uproot.open(skhep_testdata.data_path("uproot-nesteddirs.root"))

which returns AttributeError: 'module' object has no attribute 'open'

I'm using Python 2.7 and uproot 3.10.11, but I've also installed a virtual environment and tried other python and uproot versions. I've tried to reinstall uproot and I've tried to open other files, all returns the same error.

Any ideas?

-- Answer ------------

This sounds more like a bug report than a usage question, but even as a bug report, I don't see what could be going wrong from the description. It seems like something went bizarrely wrong during the installation, such that you get a module named "uproot" without any of its contents. The "open" function is the very first thing that gets imported into the "uproot" module.

You could try printing dir(uproot) to see what's in it, though I suspect there will be nothing in it. I don't know how you ended up with a module named uproot without (all of?) its contents. You say you've tried different installation methods, but somehow they're all reproducing the same installation glitch. On my side, I can't reproduce it—I can't make it happen (including Python 2.7, which gets less attention these days).

On a new Docker image with no uproot installed, I did a pip install uproot and

>>> import uproot
>>> dir(uproot)
['AsArray', 'AsDouble32', 'AsDtype', 'AsDtypeInPlace', 'AsDynamic', 'AsFloat16', 'AsGrouped', 'AsJagged', 'AsMap', 'AsObjects', 'AsPointer', 'AsRVec', 'AsSTLBits', 'AsSet', 'AsStridedObjects', 'AsString', 'AsStrings', 'AsVector', 'Cursor', 'DeserializationError', 'FSSpecSource', 'HTTPSource', 'ImplementsFormMapping', 'ImplementsFormMappingInfo', 'KeyInFileError', 'LRUArrayCache', 'LRUCache', 'LZ4', 'LZMA', 'MemmapSource', 'Model', 'MultithreadedFileSource', 'MultithreadedHTTPSource', 'MultithreadedXRootDSource', 'ObjectSource', 'ReadOnlyDirectory', 'ReadOnlyFile', 'S3Source', 'STLMap', 'STLSet', 'STLVector', 'TBranch', 'TTree', 'ThreadPoolExecutor', 'TrivialExecutor', 'WritableBranch', 'WritableDirectory', 'WritableFile', 'WritableTree', 'XRootDSource', 'ZLIB', 'ZSTD', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_awkwardforth', '_dask', '_util', 'as_TGraph', 'behavior', 'behavior_of', 'behaviors', 'cache', 'class_named', 'classes', 'classname_decode', 'classname_encode', 'compression', 'concatenate', 'const', 'containers', 'create', 'dask', 'dask_write', 'default_library', 'deserialization', 'dynamic', 'exceptions', 'extras', 'from_pyroot', 'has_class_named', 'interpretation', 'iterate', 'language', 'model', 'models', 'no_filter', 'num_entries', 'open', 'pyroot', 'reading', 'recreate', 'reset_classes', 'serialization', 'sink', 'source', 'streamers', 'to_pyroot', 'to_writable', 'unknown_classes', 'update', 'uproot', 'version', 'writing']

Are you able to install other Python packages or is it just uproot?

-- Question -----------------------------------------------

I know that there is a solution for the similar question: https://stackoverflow.com/questions/58670742/how-to-get-uproot-iterate-output-like-the-root-numpy-root2array-output-fast But as I understand it, that is suitable for flat ROOT TTrees only. I want to have a generalized solution for:

  1. fixed-size dimension but nested data like particle momentum (px, py, pz) which are represented in the ROOT TTree as the vector<double>
  2. arbitrary-size dimension data

All my attempts to apply asjagged to it failed. Is it possible to avoid JaggedArray for case (1)?

-- Answer ------------

If the data are fixed-size but stored as vector<double>, then they're treated as though they were not fixed-size. Uproot would always read them as jagged arrays, and therefore the asarray method described in the other question is unavailable.

That said, if you have more knowledge than your file's metadata and are willing to try an unsupported hack, you can force the interpretation of your vector<double> to always have three elements. Here's an example—first, we need a suitable ROOT file:

TFile *f = new TFile("tmp.root", "recreate");
TTree *t = new TTree("t", "t");
std::vector<double> x;
t->Branch("x", &x);
x = {101, 102, 103};
t->Fill();
x = {201, 202, 203};
t->Fill();
x = {301, 302, 303};
t->Fill();
x = {401, 402, 403};
t->Fill();
x = {501, 502, 503};
t->Fill();
x = {601, 602, 603};
t->Fill();
t->Write();
f->Close();

Now, the normal way to read this in Uproot is with an Awkward Array:

>>> import uproot
>>> branch = uproot.open("tmp.root")["t"]["x"]
>>> branch.array()
<Array [[101.0 102.0 103.0] [201.0 202.0 203.0] [301.0 302.0 303.0] [401.0 402.0 403.0] [501.0 502.0 503.0] [601.0 602.0 603.0]]>
>>> branch.interpretation
AsJagged(AsDtype('>f8'), 10)

But this allocates new arrays every time you read it and does some manipulation to find the borders, which you know are regular. (And Uproot does not know that.)

The header for an STL vector happens to be 10 bytes long. After that, its contents are serialized in order, from first to last, before moving on to the next STL vector. For three 8-byte floating point numbers, that's 10 + 8 + 8 + 8 = 34 bytes, always the same size. The fact that it always happens to be the same size is critical for the following.

A NumPy structured array can represent arrays of fixed-size structs as a dtype:

>>> array = branch.array(uproot.AsDtype([("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")]))
>>> array
array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
      dtype=[('header', 'S10'), ('x', '<f8'), ('y', '<f8'), ('z', '<f8')])

We don't like looking at the header, so we can strip it out using ordinary NumPy slicing:

>>> array[["x", "y", "z"]]
array([(101., 102., 103.), (201., 202., 203.), (301., 302., 303.),
       (401., 402., 403.), (501., 502., 503.), (601., 602., 603.)],
      dtype={'names':['x','y','z'], 'formats':['<f8','<f8','<f8'], 'offsets':[10,18,26], 'itemsize':34})

(or just ask for array["x"] to get a non-structured array). Since we can do that with a plain uproot.AsDtype, we can do it with a preallocated uproot.AsDtypeInPlace.

>>> import numpy as np
>>> big = np.empty(1000, dtype=[("header", "S10"), ("x", ">f8"), ("y", ">f8"), ("z", ">f8")])
>>> branch.array(uproot.AsDtypeInPlace(big, big.from_dtype))
array([(b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 101., 102., 103.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 201., 202., 203.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 301., 302., 303.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 401., 402., 403.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 501., 502., 503.),
       (b'@\x00\x00\x1e\x00\t\x00\x00\x00\x03', 601., 602., 603.)],
      dtype=[('header', 'S10'), ('x', '>f8'), ('y', '>f8'), ('z', '>f8')])

Above, big has to be at least as large as the dataset you want to read and reading it returns a trimmed view of that array. It does not allocate a new array—the data are stored in big:

>>> big[["x", "y", "z"]][:10]
array([( 1.01000000e+002,  1.02000000e+002,  1.03000000e+002),
       ( 2.01000000e+002,  2.02000000e+002,  2.03000000e+002),
       ( 3.01000000e+002,  3.02000000e+002,  3.03000000e+002),
       ( 4.01000000e+002,  4.02000000e+002,  4.03000000e+002),
       ( 5.01000000e+002,  5.02000000e+002,  5.03000000e+002),
       ( 6.01000000e+002,  6.02000000e+002,  6.03000000e+002),
       ( 1.22164945e-309,  5.26335088e-310,  1.22167067e-309),
       (-5.70498984e+158,  5.97958175e+158, -5.97958175e+158),
       (-4.92033505e+032, -4.92033505e+032, -4.92033505e+032),
       ( 3.77957352e+011,  3.77957221e+011,  3.77957320e+011)],
      dtype={'names':['x','y','z'], 'formats':['>f8','>f8','>f8'], 'offsets':[10,18,26], 'itemsize':34})

Everything beyond the 6 events that were read is uninitialized junk, so I'd recommend using the trimmed output from the branch.array function; this is just to show that big is being filled—you're not getting a new array.

As per your question (2), if the data are not regular (in number of bytes per entry), then there's no way you can do this. Also, remember that the above technique is not officially supported: you have to know that your data are regular and you have to know that STL vectors have a 10-byte header, which are not things we expect most users to know.

-- Question -----------------------------------------------

When reading data with uproot from a tree compressed with zlib, I find there are some compression errors from zlib, such as: Error -3 while decompressing data: incorrect data check or Error -5 while decompressing data: incomplete or truncated stream. When I open the file in ROOT, I get a similar error from zlib:

R__unzip: error -3 in inflate (zlib)
Error in <TBasket::ReadBasketBuffers>: fNbytes = 20102, fKeylen = 199, fObjlen = 28540, noutot = 0, nout=0, nin=19903, nbuf=28540
Error in <TBranchElement::GetBasket>: File: Stage_1_files/AnalysisResults.31.root at byte:51212830, branch:data.fJetConstituents.fPt, entry:133851, badread=1, nerrors=1, basketnumber=189
...

However, ROOT skips over the problematic entry (or entries) and continues try to read the file. In uproot, the zlib exception is passed up. I catch it, but I'm unable to continue processing the file. There are clearly underlying issues with the file (seems to be from issues with ROOT merging which are out of my control), but is there a way to have uproot identify and skip problematic entries and continue with the rest of the data? I could imagine restricting the entries when reading, but how would I identify them with uproot without trial and error? I could only identify the problematic branch by reading each branch one-by-one in uproot, and that still doesn't identify which entries are the issue (or by checking with ROOT).

Thanks!

-- Answer ------------

Data in TTrees are compressed by baskets, so if a basket's compression is corrupted, no data can be read from that basket but all the other baskets are potentially fine.

Uproot's array-reading functions give up if any baskets raise an error, but you can use the more low-level TBranch.basket method to read baskets one by one, catching any exceptions along the way. Get a TBranch object from the TTree with dict-like access (e.g. mytree["branch_name"]) and call basket(i, ...) with the same arguments you'd pass to TTree.array but additionally with the basket number i. (They start at 0 and go up to but not including TBranch.numbaskets.)

There's one more issue: you might need to correlate data from different branches, and their baskets might not begin and end at the same entry numbers. If you ask for TTree.clusters(branches_list) with the branches you're interested in, it will give you entry start and stop numbers at basket boundaries that are common to the set of branches you provide. Using these entry numbers as entry_start and entry_stop in the normal uproot.TTree.arrays method would read only the requested baskets, and you can put try-catch logic around that.
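
A rough sketch of that pattern, using the method names described above (the file and branch names come from your error message, the tree name is a placeholder, and the exact attribute names may differ between Uproot versions):

import uproot

tree = uproot.open("Stage_1_files/AnalysisResults.31.root")["mytree"]
branch = tree["data.fJetConstituents.fPt"]

good_pieces = []
for i in range(branch.numbaskets):
    try:
        good_pieces.append(branch.basket(i))
    except Exception as err:   # e.g. zlib "incorrect data check" from a corrupted basket
        print("skipping basket", i, "->", err)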

-- Question -----------------------------------------------

Is there an equivalent of TTree::AddFriend with uproot? I have 2 parallel trees in 2 different files which I'd need to read with uproot.iterate, using interpretations (setting the 'branches' option of uproot.iterate).

Maybe I can do that by manually obtaining several iterators from iterate() calls on the files, and then calling next() on each iterator... but maybe there's a simpler way akin to AddFriend?

Thanks for any hint !

edit: I'm not sure I've been clear, so here's a bit more detail. My question is not about the usage of arrays, but about how to read them from different files. Here's a mockup of what I'm doing:

# I will fill this array and give it as input to my DNN
# it's very big so I will fill it in place

bigarray = ndarray( (2,numentries),...)

# get a handle on a tree, just to be able to build interpretations :
t0 = .. first tree in input_files
interpretations = dict(
    a=t0['a'].interpretation.toarray(bigarray[0]),
    b=t0['b'].interpretation.toarray(bigarray[1]),
    )
# iterate with :
uproot.iterate( input_files, treename,
                branches = interpretations )

So what if a and b belong to 2 trees in 2 different files?

-- Answer ------------

In array-based programming, friends are implicit: you can JOIN any two columns after the fact—you don't have to declare them as friends ahead of time.

In the simplest case, if your arrays a and b have the same length and the same order, you can just use them together, like a + b. It doesn't matter whether a and b came from the same file or not. Even if one of these is jagged (like jets.phi) and the other is not (like met.phi), you're still fine, because the non-jagged array will be broadcasted to match the jagged one.
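
For example (a sketch with made-up file, tree, and branch names):

import uproot

a = uproot.open("file_A.root")["Events"]["a"].array()
b = uproot.open("file_B.root")["Events"]["b"].array()

total = a + b   # an implicit "friend": works as long as both arrays are event-aligned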

If the two arrays are not in the same order, possibly because each writer was individually parallelized, then you'll need some column to act as the key associating rows of one array with different rows of the other. This is a classic database-style JOIN and although Uproot and Awkward don't provide routines for it, Pandas does. (Look up "merging, joining, and concatenating" in the Pandas documentation—there's a lot!) You can maintain an array's jaggedness in Pandas by preparing the column with the awkward.topandas function.

The following issue talks about a lot of these things, though the users in the issue below had to join sets of files, rather than just a single tree. (In principle, a process would have to look ahead to all the files to see which contain which keys: a distributed database problem.) Even if that's not your case, you might find more hints there to see how to get started.

-- Question -----------------------------------------------

Is there a better way to do this logic? I want to propagate upwards a selection derived from a lower-level selection that is available only on a subset of inner elements.

Specifically, I am looking to have an event level cut for oppositely charged muon-electron pair.

req_mu = (events.Muon.counts >= 1)
req_ele = (events.Electron.counts >= 1)
req = req_ele & req_mu

def propagate_up(subset, selection):
    '''
    subset: bool array slice on upper level
    '''
    dummy = np.zeros_like(subset)
    dummy[subset] = selection
    return dummy
    
req_opposite_charge = propagate_up(req, events[req].Muon[:, 0].charge * events[req].Electron[:, 0].charge == -1)

-- Answer ------------

The easiest way to select a pair from separate collections is with ak.cartesian, e.g.

good_el = electrons[electrons.pt > 10]
good_mu = muons[muons.pt > 10]
pairs = ak.cartesian([good_el, good_mu])
# filter our pairs to have opposite charge (jagged mask)
pairs = pairs[pairs["0"].charge == -1 * pairs["1"].charge]
# filter events to have exactly one good pair
pairs = pairs[pairs.counts == 1]
# get the two leptons as a now flat array
el, mu = pairs.i0[:, 0], pairs.i1[:, 0]

-- Question -----------------------------------------------

So I'm given a TFile that contains two TTree objects, which contain track/tower pT, eta and phi, divided by events. My goal is to extract each track and tower in the event and then cluster the whole event using the FastJet package. Now, if I do this task using pure ROOT, my analysis takes 30 minutes at most (with a ~100 GB TFile). Meanwhile, uproot will process only 10,000 events in that time...

It is apparent that I'm doing something wrong, so I wanted to ask: what would be the proper way to access track-by-track information to get the same speed as in ROOT?

-- Answer ------------

Uproot gets its efficiency from operating on many events per Python function call. The fastjet library has an array-oriented interface to get a batch of events in one Python function call, as Awkward Arrays.
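
A minimal sketch of that interface (assuming the scikit-hep fastjet and vector packages; the particle contents and jet radius are made up):

import awkward as ak
import fastjet
import vector

vector.register_awkward()   # gives the records Momentum4D behavior, which fastjet's Awkward interface uses

# one batch of events, each a variable-length list of constituents
particles = ak.Array([
    [{"px": 1.2, "py": 3.2, "pz": 5.4, "E": 6.5},
     {"px": 32.2, "py": 64.2, "pz": 543.3, "E": 547.2}],
    [{"px": 2.9, "py": -1.1, "pz": 10.0, "E": 10.5}],
], with_name="Momentum4D")

jetdef = fastjet.JetDefinition(fastjet.antikt_algorithm, 0.4)
cluster = fastjet.ClusterSequence(particles, jetdef)   # all events are clustered in one Python call
jets = cluster.inclusive_jets()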

-- Question -----------------------------------------------

I'm trying to access a ROOT TGraph with uproot. In the uproot tutorial (https://github.com/scikit-hep/uproot#histograms-tprofiles-tgraphs-and-others) TGraphs are mentioned, but nowhere is it written how to load them into Python. Could somebody show me a code snippet of how to do that? Thank you already!

-- Answer ------------

Following the example given for a histogram, it's tfile["tgraph_name"], where tfile is the open file object and "tgraph_name" is the name of the TGraph. The object this returns has all of the data that a TGraph contains, but only has the methods defined here.
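
For instance, with Uproot 4 (a sketch; the file and graph names are placeholders), the raw data members can be pulled out as NumPy arrays:

import uproot

tfile = uproot.open("file.root")
tgraph = tfile["tgraph_name"]

x = tgraph.member("fX")   # NumPy array of the x values
y = tgraph.member("fY")   # NumPy array of the y values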

-- Question -----------------------------------------------

I have a .root file containing a tree named FlatSubstructureJetTreeD:

file = uproot.open("/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/user.cdelitzs.JZ2W.mini.root")["FlatSubstructureJetTreeD"]

It has the following branches

file.keys()
['fjet_pt',
 'fjet_clus_P',
 'fjet_clus_px',
 'fjet_clus_py',
 'fjet_clus_pz',
 'EventInfo_mcEventWeight',
 'fjet_xsec_filteff_numevents']

fjet_clus_P,fjet_clus_px,fjet_clus_py,fjet_clus_pz are jagged array (different entries in different events)

I need to make a zero-padded dataset in the form of an .h5 file so that each row has entries in the format [fjet_clus_P1, fjet_clus_px1, fjet_clus_py1, fjet_clus_pz1, fjet_clus_P2, fjet_clus_px2, fjet_clus_py2, fjet_clus_pz2, ..., fjet_clus_Pn, fjet_clus_pxn, fjet_clus_pyn, fjet_clus_pzn]. Could you suggest what would be the smartest and most memory-efficient way to do so in uproot?

-- Answer ------------

Assuming that you have read out all of the arrays as an Awkward record array named arrays (the default return type of uproot's arrays method),

import awkward as ak
import uproot
file = uproot.open("/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/"
            "miniTrees/user.cdelitzs.JZ2W.mini.root"
           )["FlatSubstructureJetTreeD"]
arrays = file.arrays(["fjet_clus_P", "fjet_clus_px", "fjet_clus_py", "fjet_clus_pz"])

You can None-pad each array with ak.pad_none and then turn the None values into zeros with ak.fill_none. In ak.pad_none, you have to specify a length; let's take the maximum length per event. After this operation, the arrays happen to have equal lengths in the second dimension, so ak.to_numpy can turn them into regular NumPy arrays.

for name in ak.fields(arrays):
    longest = ak.max(ak.num(arrays[name]))
    arrays[name] = ak.to_numpy(ak.fill_none(ak.pad_none(arrays[name], longest)))

Now that they're (two-dimensional) NumPy arrays, h5py will recognize them and you can write them to the HDF5 file the normal way.
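
A sketch of that last step with h5py (the output file name is just an example), continuing from the loop above:

import h5py

with h5py.File("padded.h5", "w") as out:
    for name in ak.fields(arrays):
        out.create_dataset(name, data=ak.to_numpy(arrays[name]))  # 2D, zero-padded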

-- Question -----------------------------------------------

Working with simulations in Geant4 that outputs .root files, I was happy to discover the uproot package.

Believing that dataframes are the best choice for my specific analysis task, I'm using uproot.pandas.df() to read contents from a TTree into such a dataframe.

Unfortunately, this turned out to present a bottleneck. While the code deals well with all numeric input, handling strings seems to pose a serious problem. The file is quite big, with the resulting frame having 2406703 rows.

While this code (Egamma and z_eu both numeric):

df = uproot.open('rootFile.root')['seco_tuple;1'].pandas.df( ['Egamma','z_eu'])

needs on average 430 ms, adding one column with strings:

df = uproot.open('rootFile.root')['seco_tuple;1'].pandas.df( ['Name','Egamma','z_eu'])

increases the time to almost 3.5 s. Having a second column with strings doubles the time. What I also tried was reading the data into a dictionary and then passing it to a dataframe. Reading the data was fairly quick, but passing it into a dataframe was again quite slow.

Since the strings obviously cause the code to use far more resources, I was wondering if strings in general are a problem for dataframes, or if the specific type of string here might make a difference?

I hope to get some further insight here and can try to provide the .root file as well as a MWE, in case this is required.

Thanks in advance!

-- Answer ------------

Part of the problem is that uproot, written in pure Python, is sharply divided in performance between numeric work, which NumPy handles very quickly, and anything involving Python objects, such as strings. In some cases, like this one, we have the option of treating strings as a numeric array, and that can help performance a lot.

The other part is Pandas itself. NumPy's strings are inefficient in the sense that they have to be padded to a common length—the length of the longest string in the dataset—because NumPy can only deal with rectilinear arrays. So Pandas opts for a different inefficiency: it takes strings as Python objects, via a NumPy array whose dtype is object (i.e. pointers to Python objects, not raw numerical data).

>>> import pandas, numpy
>>> df = pandas.DataFrame({"column": ["one", "two", "three", "four", "five"]})
>>> df
  column
0    one
1    two
2  three
3   four
4   five
>>> df.values
array([['one'],
       ['two'],
       ['three'],
       ['four'],
       ['five']], dtype=object)

When strings are just labels, Pandas has a "categorical" dtype (not part of NumPy!) that replaces each distinct string with an integer so that it's really an integer array with metadata.

>>> df["column"].astype("category")
0      one
1      two
2    three
3     four
4     five
Name: column, dtype: category
Categories (5, object): [five, four, one, three, two]

If you have a small number of distinct strings compared to the total number of strings (not the case above), then this is faster and uses less memory. If every string is unique, this only bloats the data.
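
As a sketch (assuming the branches are flat, one entry per row, and using the names from the question), you could read the numeric branches and the string branch separately and store the strings as a categorical column:

import uproot
import pandas

tree = uproot.open("rootFile.root")["seco_tuple"]
df = tree.arrays(["Egamma", "z_eu"], library="pd")   # numeric columns: fast
names = tree["Name"].array(library="np")             # strings as Python objects
df["Name"] = pandas.Categorical(names)               # integers plus a lookup table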

-- Question -----------------------------------------------

I need a memory efficient way to zero pad uneven array size per event according to largest array in a tree. Is there any way available to do that in uproot?

-- Answer ------------

ak.fill_none(ak.pad_none(jagged_array, max_length), 0)

If you need to find the maximum, you can

ak.max(ak.num(jagged_array))

If you then need to turn it into pure NumPy, use ak.to_numpy.

-- Question -----------------------------------------------

I have a tree with one branch storing a string. When I read using uproot.open() and then the method arrays() I get the following:

>>> array_train['backtracked_end_process']
<ObjectArray [b'FastScintillation' b'FastScintillation' b'FastScintillation' ... b'FastScintillation' b'FastScintillation' b'FastScintillation'] at 0x7f48936e6c90>

I would like to use this branch to create masks, by doing things like array_train['backtracked_end_process'] != b'FastScintillation' but unfortunately this produces an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-97-a28f3706c5b5> in <module>
----> 1 array_train['backtracked_end_process'] == b'FastScintillation'

~/.local/lib/python3.7/site-packages/numpy/lib/mixins.py in func(self, other)
     23         if _disables_array_ufunc(other):
     24             return NotImplemented
---> 25         return ufunc(self, other)
     26     func.__name__ = '__{}__'.format(name)
     27     return func

~/.local/lib/python3.7/site-packages/awkward/array/objects.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    216                 contents.append(x)
    217 
--> 218         result = getattr(ufunc, method)(*contents, **kwargs)
    219 
    220         if self._util_iscomparison(ufunc):

~/.local/lib/python3.7/site-packages/awkward/array/jagged.py in __array_ufunc__(self, ufunc, method, *inputs, **kwargs)
    987                 data = self._util_toarray(inputs[i], inputs[i].dtype)
    988                 if starts.shape != data.shape:
--> 989                     raise ValueError("cannot broadcast JaggedArray of shape {0} with array of shape {1}".format(starts.shape, data.shape))
    990 
    991                 if parents is None:

ValueError: cannot broadcast JaggedArray of shape (24035,) with array of shape ()

Does anyone have any suggestion on how to proceed? Being able to transform it to a numpy.chararray would already solve the problem, but I don't know how to do that.

-- Answer ------------

Here's how you can do it:

>>> import numpy as np
>>> import awkward as ak
>>> import skhep_testdata
>>> import uproot

>>> f = uproot.open(skhep_testdata.data_path("uproot-sample-6.10.05-zlib.root"))
>>> t = f["sample"]

>>> t["str"].array()
<Array ['hey-0', 'hey-1', 'hey-2', ..., 'hey-28', 'hey-29'] type='30 * string'>

>>> t["str"].array() == "hey-0"
<Array [True, False, False, False, ..., False, False, False] type='30 * bool'>

-- Question -----------------------------------------------

I'm setting up a machine learning project with scikit learn. The input data are flat ROOT NTuples.

In the past I have been using root_numpy to convert the NTuples to pandas.DataFrame saved in an h5 file.

I was wondering if I could use uproot to

  1. skip the h5 conversion altogether?
  2. not use as much memory as loading in the DataFrame from h5?

My naive first try looks something like this:

'''
Runs preselection, keeps only desired variables in DataFrame
'''
def dropAndKeep(df, dropVariables=None, keepVariables=None, presel=None, inplace=True):

    if ((presel is not None) and (not callable(presel))):
        print("Please either provide a function to 'presel' or leave blank")
        raise ValueError

    if callable(presel):
        if not(inplace):
            df = df.drop(df[~presel(df)].index, inplace=False)
        else:
            df.drop(df[~presel(df)].index, inplace=True)

    if keepVariables is not None:
        dropThese = list( set(df.columns) - set(keepVariables) )
        return df.drop(columns=dropThese, inplace=inplace)
    
    if dropVariables is not None:
        return df.drop(columns=dropVariables, inplace=inplace)
    
'''
Loads a TTree from ROOT file into a DataFrame 

'''
def load_root(inFile, key, dropVariables=None, keepVariables=None, presel=None):
    df = uproot.open(inFile)[key].pandas.df()
    dropAndKeep(df, dropVariables, keepVariables, presel=presel, inplace=True)
    return df


inFile = "path/to/file.root"
key = "ntuple"
df = load_root(inFile, key)

This takes a really long time. Is there a better way of doing this?

-- Answer ------------

Be aware that each call to uproot.open(...) and file[key] loads TFile and TTree metadata using pure Python—the slowest part of uproot. If you call this more than once, try keeping the TFile and/or TTree objects around and re-using them.

Also, it looks like your dropAndKeep function is only dropping rows (events), but if I'm reading it wrong and it's doing columns (branches), then use the branches argument of uproot's array-reading functions to read only the branches you want. Since the data in a ROOT file are arranged in columns, you can't avoid reading unwanted events—you have to cut them after the fact (in any framework).

Next, note that Pandas is considerably slower than NumPy for simple operations like filtering events. If you want to speed that up, get the arrays with TTree.arrays, rather than TTree.pandas.df, construct one NumPy array of booleans for your selection, and apply it to each array in the dict that TTree.arrays returns. Then you can put all of those into a DataFrame with Pandas's DataFrame constructor (if you really need Pandas at all).
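
A sketch of that approach (the branch names and cuts here are made up):

import numpy as np
import pandas as pd
import uproot

tree = uproot.open("path/to/file.root")["ntuple"]
arrays = tree.arrays(["pt", "eta", "phi"], library="np")     # dict of NumPy arrays
mask = (arrays["pt"] > 20) & (np.abs(arrays["eta"]) < 2.5)   # one boolean selection
df = pd.DataFrame({name: array[mask] for name, array in arrays.items()})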

It's true that you don't need to go through HDF5, and you don't need to go through Pandas, either. Your machine learning framework (TensorFlow? Torch?) almost certainly has an interface that accepts NumPy arrays with zero-copy (or one-copy to the GPU). Tutorials stressing HDF5 or Pandas do so because for the majority of users (non-HEP), these are the most convenient interfaces. Their data are likely already in HDF5 or Pandas; our data are likely in ROOT.

If your machine learning will be on the GPU, perhaps you want to do your event selection on the GPU as well. CuPy is a NumPy clone that allocates and operates entirely on the GPU, and your TensorFlow/Torch tensors might have a zero-copy interface to CuPy arrays. In principle, uproot should be able to write directly from ROOT files into CuPy arrays if a CuPy array is used as the destination of the asarray interpretation. I haven't tried it, though.

If you have control over the ROOT files to process, try to make their baskets large (increase the flush size) and their data structure simple (e.g. pure numbers or arrays/vectors of numbers, nothing deeper). Perhaps most importantly, use a lightweight compression like lz4, rather than a heavyweight one like lzma.

Uproot can read baskets in parallel, but this has only proven useful when it has a lot of non-Python computation to do, such as decompressing lzma.

If you're going to be reading these arrays over and over, you might want to write intermediate files with numpy.save, which is essentially just raw bytes on disk. That means there's no deserialization when reading it back, as opposed to the work necessary to decode a ROOT or HDF5 file. Because it's such a simple format, you can even read it back with numpy.memmap, which peeks into the OS's page cache as it lazily loads the data from disk, removing even the explicit copy of bytes.
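
For instance (the array and file names are placeholders):

import numpy as np

np.save("jet_pt.npy", selected_jet_pt)            # raw bytes on disk, no decoding needed later
jet_pt = np.load("jet_pt.npy", mmap_mode="r")     # lazily paged in through the OS page cache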

Not all of these tricks will be equally helpful. I tried to put the most important ones first, but experiment before committing to a big code rewrite that might not make much difference. Some tricks can't be combined with others, such as CuPy and memmap (memmap always lazily loads into main memory, never GPU memory). But some combinations may be fruitful.

Good luck!

-- Question -----------------------------------------------

Can you please tell me why uproot doesn't interpret the trees: Evt, AAObject, TObject and t? I am probably doing something wrong here because I am not familiar with root files.

My goal: open data in my root file into pandas DataFrame.

When I try to loop over data in ['Evt'] tree, I get the following error for the following branches: AAObject, TObject and t

ValueError: cannot interpret branch b'AAObject' as a Python type
in file: /myfile.root


ValueError: cannot interpret branch b't' as a Python type
in file: /myfile.root

ValueError: cannot interpret branch b'TObject' as a Python type
in file: /myfile.root

This is what I type to explore my file

data = uproot.open("myfile.root")["E"]
data.show() 
data_branch_Evt['Evt']['AAObject'].basket(0)
data_branch_Evt['t'].basket(0)

Here is the result of data.show()

Evt                        TStreamerInfo              None
AAObject                   TStreamerInfo              None
TObject                    TStreamerInfo              None
fUniqueID                  TStreamerBasicType         asdtype('>u4')
fBits                      TStreamerBasicType         asdtype('>u4')

usr                        TStreamerSTL  asjagged(asdtype('>f8'), 10)
usr_names                  TStreamerSTL asgenobj(STLVector(STLString()))

id                         TStreamerBasicType         asdtype('>i4')
det_id                     TStreamerBasicType         asdtype('>i4')
mc_id                      TStreamerBasicType         asdtype('>i4')
run_id                     TStreamerBasicType         asdtype('>i4')
mc_run_id                  TStreamerBasicType         asdtype('>i4')
frame_index                TStreamerBasicType         asdtype('>i4')
trigger_mask               TStreamerBasicType         asdtype('>u8')
trigger_counter            TStreamerBasicType         asdtype('>u8')
overlays                   TStreamerBasicType         asdtype('>u4')
t                          TStreamerObjectAny         None
t.fSec                     TStreamerBasicType         asdtype('>i4')
t.fNanoSec                 TStreamerBasicType         asdtype('>i4')

hits                       TStreamerSTL               asdtype('>i4')
....etc...

Thank you!

-- Answer ------------

Not all ROOT types are recognized—there's something in the definition of the class that hasn't been handled yet. Without seeing the file (as a GitHub Issue), I can't diagnose it. We started with a very minimal set of types and added more as needed.

It's not a matter of handling every class on a class-by-class basis, it's about handling class features. For instance, the most recent addition was classes containing vectors of numbers. If your classes have very conservative contents, such as only numeric fields, then they should be automatically interpreted.

What we have to handle as case-by-case items are not types (such as classes), but technically "kinds," or types of types.

-- Question -----------------------------------------------

I have trouble loading y-values from a TProfile object from my .root file.

It seems using file.pandas() loads only the x-values, the counts and variances but not the specific y-values.

I have also tried file.values, which returns counts but not y-values.

-- Answer ------------

Objects read from ROOT files have a Pythonic interpretation only if "methods" have been included in uproot-methods. TProfile has not been handled in this way—the only reason anything works for you is because TProfile is a subclass of TH1, and TH1 has a .pandas() method. Since TH1 is for one-dimensional histograms, it doesn't include code that would handle a second dimension.

uproot-methods is user contributed. If you want TProfile methods, you can add one here and submit a pull request:

https://github.com/scikit-hep/uproot-methods/tree/master/uproot_methods/classes

If you add a TProfile module containing a Methods class here, uproot will automatically load it and apply it to the TProfile objects it loads from ROOT files.

For any class, even those we haven't heard of, uproot loads all of the private member data into a Python object. You'll find all of your TProfile data in attributes that start with _f, such as _fSumw2 and _fBinEntries. The methods just use these "internal" attributes to present a more user-friendly picture of the data, such as Pandas.

Here's an example of getting data from an uninterpreted TProfile:

>>> import numpy as np
>>> import uproot
>>> import skhep_testdata
>>> file = uproot.open(skhep_testdata.data_path("uproot-hepdata-example.root"))
>>> hprof = file["hprof"]
>>> np.sqrt(np.array(hprof.member("fSumw2")) / np.array(hprof.member("fBinEntries")))

will print

<stdin>:1: RuntimeWarning: invalid value encountered in divide
array([18.00160401, 17.08522023, 16.98982401, 15.18948269, 13.73788834,
       13.37976188, 13.55597897, 12.66036748, 12.68400763, 11.86598111,
       11.69668773, 11.64276616, 10.08540753, 10.15367219,  9.76541727,
        8.84047502,  8.73065944,  8.24346727,  7.46907584,  7.6620407 ,
        7.09749961,  6.62763951,  6.42199904,  5.98234406,  5.78193984,
        5.14476043,  5.15432392,  4.715674  ,  4.50246509,  4.36212988,
        3.86741244,  3.80635409,  3.60625837,  3.29332404,  3.04594764,
        2.98292569,  2.74283755,  2.50385849,  2.50978034,  2.30446027,
        2.29145554,  2.18459822,  1.99270465,  1.92651016,  1.86253428,
        1.79821825,  1.807856  ,  1.80803939,  1.75249321,  1.81588997,
        1.8226282 ,  1.74867267,  1.79842962,  1.75790778,  1.70895758,
        1.859834  ,  1.82793384,  1.96163584,  1.81903189,  2.0223054 ,
        2.14521968,  2.24565426,  2.24184203,  2.48460961,  2.63208939,
        2.69851588,  2.98989799,  3.1141892 ,  3.25304525,  3.46517157,
        3.53320406,  3.98586013,  4.25062225,  4.57520817,  4.75338005,
        5.18494724,  5.387066  ,  5.6277353 ,  5.8802666 ,  6.34427442,
        6.51966721,  7.24680462,  7.33759813,  7.63198011,  8.34535604,
        9.30064575,  8.82698396,  9.4099613 ,  9.60905376, 10.31570735,
       11.17540473, 11.13947421, 12.78232904, 12.1993165 , 12.39763587,
       16.68535354, 13.30451531, 14.67711301, 14.96741772,         nan,
       18.32199478, 17.84275258])

which are the y-values for this TProfile.

-- Question -----------------------------------------------

I am using Python 3.6.5 (or higher) and I have successfully installed 'numpy', 'uproot', and 'awkward'. I have a previously made *.root file with a jagged NTuple which contains quite a large number of branches. This is particle physics data, so one can think of the "rows" as individual collisions or "events", while the columns hold the data structures. (And some of the columns might have a third dimension or more...I'll explain in a bit.)

In this case I have events where there are many "Jets" in the events and each "jet" has a lot of information about it.

jet_E, jet_pT, jet_eta, jet_phi, Numb (number of b tags), NLayer0 etc.

Each "event" could have any number of jets, but it is impossible for it to have zero jets in this case. Each of these jets will have this information stored, but all information from one "event" must be kept uncorrelated with any other "event". (If you already know Particle physics this part is probably already understood.)

I've been reading the uproot documentation and examples and I cannot easily see how one would, using only pythonic code like this, histogram the jet_pT but only for jets within events where some OTHER jet variable is being cut upon (eta, for example).

How do I extract only the information from the *.root file about all the jet_pT for those jets with jet_eta > -1.0 and jet_eta < 1.0? And suppose I only wanted to look at the first 3 jets in any event and ignore the rest: how would I put the described cut in place and only histogram the first 3 jets in any event that passed it?

The uproot documentation doesn't really make this very clear. Thanks!

-- Answer ------------

This is actually an awkward-array question, so I have tagged it as such. That also makes it easier to write an answer, because I can create simple artificial cases without having to put them in a ROOT file and read them back with uproot.

Here are some artificial jetpt and jeteta jagged arrays:

>>> jetpt = ak.Array([[0.0, 1.1, 2.2], [3.3], [4.4, 5.5], [6.6, 7.7, 8.8, 9.9]])
>>> jeteta = ak.Array([[0.1, -1.2, 0.8], [1.2], [-0.8, 0.8], [0.2, -0.3, 0.9, 0.0]])
>>> jetpt
<Array [[0, 1.1, 2.2], ..., [6.6, 7.7, 8.8, 9.9]] type='4 * var * float64'>
>>> jeteta
<Array [[0.1, -1.2, 0.8], ..., [0.2, -0.3, ..., 0]] type='4 * var * float64'>

The key thing about these two jagged arrays is that they have the same number of inner array elements per inner array:

>>> ak.num(jetpt)
<Array [3, 1, 2, 4] type='4 * int64'>
>>> ak.num(jeteta)
<Array [3, 1, 2, 4] type='4 * int64'>

This is still true when we perform mathematical operations on the arrays, such as inequality comparisons. (Note that we have to use bitwise operators for and/or/not because these are the only ones Numpy can overload. You also need parentheses because of the order of operations with bitwise operations.)

>>> (-1.0 < jeteta) & (jeteta < 1.0)
<Array [[True, False, True], ..., [True, ..., True]] type='4 * var * bool'>
>>> ak.num((-1.0 < jeteta) & (jeteta < 1.0))
<Array [3, 1, 2, 4] type='4 * int64'>

Since these boolean arrays have the same number of counts as jetpt, you can use them as an index on jetpt or any other jet variable. You can't use them as an index on muonpt, etc., because in general, there's a different number of muons per event than jets per event. (There's a lot more about this in this tutorial.)

>>> jetpt[(-1.0 < jeteta) & (jeteta < 1.0)]
<Array [[0, 2.2], [], ..., [6.6, 7.7, 8.8, 9.9]] type='4 * var * float64'>

Your second question was about truncating to at most three jets. This is a simple slice, but one that applies to the second dimension. I'll apply it to the example above—it needs to be a separate square bracket, and often you'd assign the above to a variable if you think that makes it easier to read.

>>> jetpt[(-1.0 < jeteta) & (jeteta < 1.0)][:, :3]
<Array [[0, 2.2], [], [...], [6.6, 7.7, 8.8]] type='4 * var * float64'>
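
To histogram the result, flatten away the event structure first (np.histogram here is just one choice; any histogramming library accepts the flattened NumPy array):

>>> import numpy as np
>>> selected = jetpt[(-1.0 < jeteta) & (jeteta < 1.0)][:, :3]
>>> np.histogram(ak.to_numpy(ak.flatten(selected)), bins=5, range=(0, 10))
(array([1, 1, 2, 2, 1]), array([ 0.,  2.,  4.,  6.,  8., 10.]))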

See awkward-array.org for a lot more!

-- Question -----------------------------------------------

>>> so_jaggered = ak.Array([[[0, 1, 2]], [[0, 1], [2, 3]], [[0, 1, 2], [3, 4]]])
>>> ak.num(so_jaggered)
<Array [1, 2, 2] type='3 * int64'>

However, I want to count only the innermost part, which can be achieved by following code:

>>> count_so_jaggered = np.array([[len(x) for x in trks] for trks in so_jaggered], dtype=object)
>>> count_so_jaggered
array([list([3]), list([2, 2]), list([3, 2])], dtype=object)

But it has at least two drawbacks: it is slow and it has dtype=object. Any plans to support such a feature?

-- Answer ------------

You can do this:

>>> ak.num(so_jaggered, axis=2)
<Array [[3], [2, 2], [3, 2]] type='3 * var * int64'>
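
If you'd rather not count the nesting depth explicitly, a negative axis counts from the innermost dimension:

>>> ak.num(so_jaggered, axis=-1)
<Array [[3], [2, 2], [3, 2]] type='3 * var * int64'>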

-- Question -----------------------------------------------

I have a ragged array that contains indices pointing to positions in another ragged array. Both arrays have the same length, but each of the lists they contain can have a different length. I would like to reorder the second array using the indices from the first array, at the same time dropping the elements of the second array that are not indexed by the first. The first array can additionally contain values of -1 (which could be replaced by None if needed, but that is currently not the case), meaning there is no match in the second array; in that case, the corresponding position in the result should be set to a default value (e.g. 0).

Here's a practical example and how I solve this at the moment:

import numpy as np
import awkward as ak

def good_index(my_indices, my_values):
    my_list = []
    for index in my_indices:
        if index > -1:
            my_list.append(my_values[index])
        else:
            my_list.append(0)
    return my_list

indices = ak.Array([[0, -1], [3,1,-1], [-1,0,-1]])
values = ak.Array([[1.1, 1.2, 1.3], [2.1,2.2,2.3,2.4], [3.1]])

new_map = ak.Array(map(good_index, indices, values))

The resulting new_map is: [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]].

Is there a more efficient/faster way achieving this? I was thinking that one could use numpy functionality such as numpy.where, but due to the different lengths of the ndarrays this fails at least for the ways that I tried.

-- Answer ------------

If all of the subarrays in values are guaranteed to be non-empty (so that indexing with -1 returns the last subelement, not an error), then you can do this:

>>> almost = values[indices]       # almost what you want; uses -1 as a real index
>>> ak.fill_none(almost.mask[indices >= 0], 0.0)
<Array [[1.1, 0], [2.4, ..., 0], [0, 3.1, 0]] type='3 * var * float64'>

The ak.fill_none is optional because without it, the missing elements are None, rather than 0.0.
