This document is still a working draft.
We recently refactored DataBroker, a project maintained by the DOE Light Sources as part of the Bluesky Collaboration, to leverage intake and to join the nascent community around it. We adopted several good ideas from intake's design, and we benefit from intake's support for remote access through an HTTP service. We contributed significant changes to intake, in particular scaling to large catalogs with ~10^5 entries. Looking back on the pain points we experienced during adoption, we have some suggestions for future development.
DataBroker, a project started at Brookhaven National Lab in 2013, is a Python project with similar goals to intake:
- Abstract over file formats, handing users useful PyData/SciPy data structures alongside dictionaries of metadata.
- Provide a catalog of datasets with support for search.
- Support a variety of storage backends. At the time, the "metadata" could be stored in MongoDB or sqlite. (Now we support MongoDB, JSONL, or msgpack.) Then and now, the large detector array data was stored in whatever file formats the detector happened to write, handled by a registry of readers that returned numpy arrays or lazy arrays (dask arrays or `pims.FramesSequence` objects).
In November 2018, we undertook to refactor DataBroker to leverage intake as much as possible. Our motivations were:
- Put our resources behind intake's vision of abstracting over files, which we share.
- Benefit from a larger community of reviewers, collaborators, and users.
- Avoid making databroker an island separated from the larger SciPy/PyData ecosystem.
The result of this work, databroker 1.0, is essentially a distribution of intake drivers that encode knowledge of our storage formats and our data model. It also bundles a backward-compatible shim layer that supports our original API on top of intake-based internals, providing a gentle migration path for users.
- DataBroker had an object similar to intake's Catalog, but it lacked a good story for nesting them. Intake's concept of "a catalog of catalogs (of catalogs...)" was the simple solution we were looking for.
- Intake Catalogs support progressive search. That is, `catalog.search(...)` returns another `Catalog`, which can in turn be searched. DataBroker's search method did not do this.

  ```python
  # Before:
  results = db(detector='fccd')
  # inspect results
  results = db(detector='fccd', plan_name='scan')
  # inspect results
  results = db(detector='fccd', plan_name='scan', num_points=50)

  # After:
  results = catalog.search(detector='fccd')
  # inspect results
  results = results.search(plan_name='scan')
  # inspect results
  results = results.search(num_points=50)
  ```
- DataBroker did not yet have a usable service layer. Intake has a functional Tornado-based HTTP service. It needed (and still needs) some additional development to make it a properly multi-tenant server, but it's better than starting from nothing. It is very appealing that the Python API for remote catalogs and datasources feels the same to the user as for local ones.
- Intake-xarray has a clever mechanism for constructing "remote xarrays" on the client side that pull chunks from a counterpart on the server side using dask.
- When we started, intake was optimized for the use case of a modest number of catalog entries (~100s) typically enumerated in a YAML file. At first, we got the incorrect impression that YAML was deeply baked into intake's design, which led us to believe it wouldn't work for us. Fortunately, it was not, and only a handful of changes were needed to make catalogs scale to ~100000 entries. These changes were promptly reviewed and accepted.
- Intake's mechanism for driver discovery was based on scraping packages whose names began with `intake`. This was limiting: we wanted intake to discover drivers in our existing package, already named `databroker`. After much iteration in a very constructive and beneficial review process, we landed on using entrypoints instead. (See the packaging sketch after this list.)
- We made several other contributions, submitting about 30 pull requests (between @danielballan and @gwbischof).
- Writing Catalogs was easy. It's a pretty simple abstraction: a dict-like object with a `search` method on it that returns another dict with a subset of the contents. (See the toy sketch after this list.)
- Writing UIs on top of Catalogs was easy. We wrote one for the Xi-cam framework in collaboration with colleagues at ALS.
- Getting the server up and running was easy. Once we made the changes mentioned above to make `RemoteCatalog` scale well and support `search`, the remote aspect was painless. (We are not yet using the server in production because of performance questions, but we intend to.)
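As a concrete illustration of the entrypoints-based discovery described above, a package can advertise its drivers in its packaging metadata so that intake finds them regardless of the package's name. The package name, driver name, and module path below are hypothetical placeholders, not databroker's actual entries; this is a sketch of the pattern rather than an excerpt from a real setup file.

```python
# setup.py (sketch): advertise intake drivers via entrypoints so that intake
# can discover them without scraping package names.
from setuptools import setup, find_packages

setup(
    name="my-intake-drivers",  # hypothetical package name
    packages=find_packages(),
    entry_points={
        "intake.drivers": [
            # "<driver name> = <module>:<DataSource class>"
            "my_format = my_package.drivers:MyFormatSource",
        ],
    },
)
```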
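To show how small the Catalog abstraction is, here is a toy sketch of a dict-like, in-memory catalog whose `search` method returns another catalog holding a subset of the entries. It illustrates the shape of the API only; it is not intake's actual implementation, and the class and metadata fields are invented for the example.

```python
class ToyCatalog:
    """A toy, dict-like catalog whose search() returns another catalog."""

    def __init__(self, entries):
        # entries: mapping of entry name -> dict of metadata about a dataset
        self._entries = dict(entries)

    def __getitem__(self, name):
        return self._entries[name]

    def __iter__(self):
        return iter(self._entries)

    def __len__(self):
        return len(self._entries)

    def search(self, **query):
        # Keep only the entries whose metadata matches every key/value in the query.
        matches = {
            name: meta
            for name, meta in self._entries.items()
            if all(meta.get(key) == value for key, value in query.items())
        }
        return ToyCatalog(matches)


# Progressive search, as in the example above:
catalog = ToyCatalog({
    "scan_1": {"detector": "fccd", "plan_name": "scan", "num_points": 50},
    "scan_2": {"detector": "fccd", "plan_name": "count", "num_points": 10},
})
results = catalog.search(detector="fccd")
results = results.search(plan_name="scan")
```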
- It took significant effort and trial and error to work with the lifecycle of a DataSource, sorting out the roles of each of these methods in particular:

  - `_get_schema`
  - `_load_metadata`
  - `read`
  - `read_partition`
  - `read_chunked`
  - `_get_partition`

  identifying:

  - which ones we were required to implement to get the functionality of interest to us
  - what intake (especially the server) expects from them in terms of signatures and return values
  - if and when they call one another internally in the base classes
  This motivates our interest in an alternative to DataSource with a smaller API, provisionally dubbed Reader. Perhaps a Reader's lifecycle could be reduced to:

  - `__init__(...)` --- inexpensive setup
  - `read(delayed=<bool>)` --- construct and return a SciPy/PyData data structure, perhaps deferring I/O using dask
  - `close()`

  Such an API would rhyme nicely with the syntax for opening files in Python:

  ```python
  file = open(...)
  file.read()
  file.close()
  ```

  "Using an intake Reader is just like opening a file in Python, but when you read it you get an (optionally lazy) SciPy/PyData data structure," is pleasantly easy to explain.
  The chunking and partitioning logic, which are crucial for the server in particular, could be handled outside of and downstream from the Reader. For example, the server could call `reader.read(delayed=True)`, inspect the return type (dask bag, array, dataframe, or xarray-of-dask-arrays), and manage the rest of its job by handling that object directly rather than specifically requesting individual partitions/chunks from the Reader. In this way, the Reader's responsibility would be reduced to, "Construct a dask-backed data structure." (See the Reader sketch after this list.)
- In order to scale to ~100000 entries, our Catalogs implement lazy `__getitem__` and `__iter__`. With these in place, the additional laziness provided by the `Entry` layer becomes redundant. We found in user testing that there was confusion around whether the object returned by some expression was an `Entry` or the contents of that `Entry` (a `Catalog` or `DataSource`, as the case may be), especially because of the "automatic Entry instantiation on `__getattr__`" feature. (See bluesky/databroker#457.)

  We have experimentally merged the API provided by `Entry` with the API provided by its contents. It may be worth considering retiring the `Entry` class and going all-in on lazy Catalog access methods. If it can be done while still retaining the current functionality, both the usage and the implementation would be simplified.
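To make the proposed Reader lifecycle concrete, here is a hedged sketch of what such a class might look like for a single `.npy` file. The class name, the choice of numpy/dask for the lazy path, and the usage shown in the comments are assumptions made for illustration; this is not an agreed-upon intake API.

```python
import numpy as np
import dask.array as da


class NpyReader:
    """Sketch of the proposed Reader lifecycle: __init__, read(delayed=...), close()."""

    def __init__(self, filepath):
        # Inexpensive setup: remember where the data lives, but do no I/O yet.
        self._filepath = filepath

    def read(self, delayed=False):
        # Construct and return a SciPy/PyData data structure. With delayed=True,
        # return a dask array backed by a memory-mapped file so that I/O is
        # deferred until the array is actually computed.
        if delayed:
            memmapped = np.load(self._filepath, mmap_mode="r")
            return da.from_array(memmapped, chunks="auto")
        return np.load(self._filepath)

    def close(self):
        # Nothing to release in this toy example; a real Reader would close
        # file handles or connections here.
        pass


# Usage mirrors opening a file in Python:
# reader = NpyReader("detector_frames.npy")
# lazy = reader.read(delayed=True)   # dask array; no bulk I/O yet
# data = lazy.compute()              # I/O happens here
# reader.close()
```

A server built on such a Reader could call `read(delayed=True)` and then handle chunking itself by inspecting the dask-backed object it gets back, rather than asking the Reader for individual partitions.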
You should also link to, or include part of, the example you sent me of bluesky/databroker usage before and after intake