This document is still a working draft.
We recently refactored DataBroker, a project maintained by the DOE Light Sources as part of the Bluesky Collaboration, to leverage intake and to join the nascent community around it. We adopted several good ideas from intake's design, and we benefit from intake's support for remote access through an HTTP service. We contributed significant changes to intake, in particular scaling to large catalogs with ~10^5 entries. Looking back on the pain points we experienced during adoption, we have some suggestions for future development.
DataBroker, a project started at Brookhaven National Lab in 2013, is a Python project with similar goals to intake:
- Abstract over file formats, handing users useful PyData/SciPy data structures alongside dictionaries of metadata.
- Provide a catalog of datasets with support for search.
- Support a variety of storage backends. At the time, the "metadata" could be stored in MongoDB or sqlite. (Now we support MongoDB, JSONL, or msgpack.) Then and now, the large detector array data was stored in whatever file formats the detector happened to write, handled by a registry of readers that returned numpy arrays or lazy arrays (dask arrays or `pims.FramesSequence` objects).
In November 2018, we undertook to refactor DataBroker to leverage intake as much as possible. Our motivations were:
- Put our resources behind intake's vision of abstracting over files, which we share.
- Benefit from a larger community of reviewers, collaborators, and users.
- Avoid making databroker an island separated from the larger SciPy/PyData ecosystem.
The result of this work, databroker 1.0, is essentially a distribution of intake drivers that encode knowledge of our storage formats and our data model. It also bundles a backward-compatible shim layer that supports our original API on top of intake-based internals, providing a gentle migration path for users.
- DataBroker had an object similar to intake's Catalog, but it lacked a good story for nesting them. Intake's concept of "a catalog of catalogs (of catalogs...)" was the simple solution we were looking for.
- Intake Catalogs support progressive search. That is, `catalog.search(...)` returns another `Catalog`, which can in turn be searched. DataBroker's search method did not do this.

  ```python
  # Before:
  results = db(detector='fccd')
  # inspect results
  results = db(detector='fccd', plan_name='scan')
  # inspect results
  results = db(detector='fccd', plan_name='scan', num_points=50)

  # After:
  results = catalog.search(detector='fccd')
  # inspect results
  results = results.search(plan_name='scan')
  # inspect results
  results = results.search(num_points=50)
  ```
- DataBroker did not yet have a usable service layer. Intake has a functional Tornado-based HTTP service. It needed (and still needs) some additional development to make it a properly multi-tenant server, but it's better than starting from nothing. It is very appealing that the Python API for remote catalogs and datasources feels the same to the user as for local ones.
- Intake-xarray has a clever mechanism for constructing "remote xarrays" on the client side that pull chunks from a counterpart on the server side using dask.
- When we started, intake was optimized for the use case of a modest number of catalog entries (~100s) typically enumerated in a YAML file. At first, we got the incorrect impression that YAML was deeply baked into intake's design, which led us to believe it wouldn't work for us. Fortunately, it was not, and only a handful of changes were needed to make catalogs scale to ~100000 entries. These changes were promptly reviewed and accepted.
- Intake's mechanism for driver discovery was based on scraping packages whose names began with `intake`. This was limiting: we wanted intake to discover drivers in our existing package, already named `databroker`. After much iteration in a very constructive and beneficial review process, we landed on using entrypoints instead. (See the packaging sketch after this list.)
- We made several other contributions, submitting about 30 pull requests (between @danielballan and @gwbischof).
- Writing Catalogs was easy. It's a pretty simple abstraction: a dict-like object with a `search` method on it that returns another dict with a subset of the contents. (See the toy sketch after this list.)
- Writing UIs on top of Catalogs was easy. We wrote one for the Xi-cam framework in collaboration with colleagues at ALS.
- Getting the server up and running was easy. Once we made the changes mentioned above to make `RemoteCatalog` scale well and support `search`, the remote aspect was painless. (We are not yet using the server in production because of performance questions, but we intend to.)
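As a concrete illustration of the entrypoints-based discovery described above, a package can advertise its drivers in its packaging metadata so that intake finds them regardless of the package's name. The package name, driver name, and module path below are hypothetical placeholders, not databroker's actual entries; this is a sketch of the pattern rather than an excerpt from a real setup file.

```python
# setup.py (sketch): advertise intake drivers via entrypoints so that intake
# can discover them without scraping package names.
from setuptools import setup, find_packages

setup(
    name="my-intake-drivers",  # hypothetical package name
    packages=find_packages(),
    entry_points={
        "intake.drivers": [
            # "<driver name> = <module>:<DataSource class>"
            "my_format = my_package.drivers:MyFormatSource",
        ],
    },
)
```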
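To show how small the Catalog abstraction is, here is a toy sketch of a dict-like, in-memory catalog whose `search` method returns another catalog holding a subset of the entries. It illustrates the shape of the API only; it is not intake's actual implementation, and the class and metadata fields are invented for the example.

```python
class ToyCatalog:
    """A toy, dict-like catalog whose search() returns another catalog."""

    def __init__(self, entries):
        # entries: mapping of entry name -> dict of metadata about a dataset
        self._entries = dict(entries)

    def __getitem__(self, name):
        return self._entries[name]

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

    def search(self, **query):
        # Keep only the entries whose metadata matches every key/value in the query.
        matches = {
            name: meta
            for name, meta in self._entries.items()
            if all(meta.get(key) == value for key, value in query.items())
        }
        return ToyCatalog(matches)


# Progressive search, as in the example above:
catalog = ToyCatalog({
    "scan_1": {"detector": "fccd", "plan_name": "scan", "num_points": 50},
    "scan_2": {"detector": "fccd", "plan_name": "count", "num_points": 10},
})
results = catalog.search(detector="fccd")
results = results.search(plan_name="scan")
```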
- It took significant effort and trial and error to work with the lifecycle of a DataSource, sorting out the roles of each of these methods in particular:

  - `_get_schema`
  - `_load_metadata`
  - `read`
  - `read_partition`
  - `read_chunked`
  - `_get_partition`

  identifying:

  - which ones we were required to implement to get the functionality of interest to us
  - what intake (especially the server) expects from them in terms of signatures and return values
  - if and when they call one another internally in the base classes
  This motivates our interest in an alternative to DataSource with a smaller API, provisionally dubbed Reader. Perhaps a Reader's lifecycle could be reduced to:

  - `__init__(...)` --- inexpensive setup
  - `read(delayed=<bool>)` --- construct and return a SciPy/PyData data structure, perhaps deferring I/O using dask
  - `close()`

  Such an API would rhyme nicely with the syntax for opening files in Python:

  ```python
  file = open(...)
  file.read()
  file.close()
  ```

  "Using an intake Reader is just like opening a file in Python, but when you read it you get an (optionally lazy) SciPy/PyData data structure," is pleasantly easy to explain.
  The chunking and partitioning logic, which are crucial for the server in particular, could be handled outside of and downstream from the Reader. For example, the server could call `reader.read(delayed=True)`, inspect the return type (dask bag, array, dataframe, or xarray-of-dask-arrays), and manage the rest of its job by handling that object directly rather than specifically requesting individual partitions/chunks from the Reader. In this way, the Reader's responsibility would be reduced to, "Construct a dask-backed data structure." (See the Reader sketch after this list.)
- In order to scale to ~100000 entries, our Catalogs implement lazy `__getitem__` and `__iter__`. With these in place, the additional laziness provided by the `Entry` layer becomes redundant. We found in user testing that there was confusion around whether the object returned by some expression was an `Entry` or the contents of that `Entry` (a `Catalog` or `DataSource`, as the case may be), especially because of the "automatic Entry instantiation on `__getattr__`" feature. (See bluesky/databroker#457.)

  We have experimentally merged the API provided by `Entry` with the API provided by its contents. It may be worth considering retiring the `Entry` class and going all-in on lazy Catalog access methods. If it can be done while still retaining the current functionality, both the usage and the implementation would be simplified.
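To make the proposed Reader lifecycle concrete, here is a hedged sketch of what such a class might look like for a single `.npy` file. The class name, the choice of numpy/dask for the lazy path, and the usage shown in the comments are assumptions made for illustration; this is not an agreed-upon intake API.

```python
import numpy as np
import dask.array as da


class NpyReader:
    """Sketch of the proposed Reader lifecycle: __init__, read(delayed=...), close()."""

    def __init__(self, filepath):
        # Inexpensive setup: remember where the data lives, but do no I/O yet.
        self._filepath = filepath

    def read(self, delayed=False):
        # Construct and return a SciPy/PyData data structure. With delayed=True,
        # return a dask array backed by a memory-mapped file so that I/O is
        # deferred until the array is actually computed.
        if delayed:
            memmapped = np.load(self._filepath, mmap_mode="r")
            return da.from_array(memmapped, chunks="auto")
        return np.load(self._filepath)

    def close(self):
        # Nothing to release in this toy example; a real Reader would close
        # file handles or connections here.
        pass


# Usage mirrors opening a file in Python:
# reader = NpyReader("detector_frames.npy")
# lazy = reader.read(delayed=True)   # dask array; no bulk I/O yet
# data = lazy.compute()              # I/O happens here
# reader.close()
```

A server built on such a Reader could call `read(delayed=True)` and then handle chunking itself by inspecting the dask-backed object it gets back, rather than asking the Reader for individual partitions.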
You should also link to, or include part of, the example you sent me of bluesky/databroker usage before and after intake