{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How useful would a zarr backend for NWB be?\n",
"\n",
"The goal of this notebook is to compare some simple dask processing between data stored in the (currently-available) HDF5 backend of NWB and a (currently in [two](https://github.com/hdmf-dev/hdmf/pull/98) [PRs](https://github.com/NeurodataWithoutBorders/pynwb/pull/1018)) zarr backend. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The starting point of this is the example provided [here](https://neurodatawithoutborders.github.io/nwb_hackathons/HCK07_2020_Seattle/projects/zarrio/). We download some ICE data: "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !curl -O http://download.alleninstitute.org/informatics-archive/prerelease/H19.28.012.11.05-2.nwb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then convert from the HDF5 backend to the ZARR backend:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pynwb import NWBHDF5IO, NWBZarrIO\n",
"import os\n",
"infile = \"H19.28.012.11.05-2.nwb\"\n",
"outfile = \"test_zarr_\" + os.path.basename(infile)\n",
"# h5r = NWBHDF5IO(infile , 'r', load_namespaces=False)\n",
"# f = h5r.read()\n",
"# zw = NWBZarrIO(outfile,\n",
"# mode='w',\n",
"# manager=h5r.manager,\n",
"# chunking=True)\n",
"# zw.write(f, cache_spec=True)\n",
"# zw.close()\n",
"# h5r.close()\n",
"# zr = NWBZarrIO(outfile, 'r')\n",
"# zf = zr.read()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using dask, we're going to delay reading from file and we're going to use `dask.array` for array operations"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from dask import delayed\n",
"import dask.array as da"
]
},
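{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of the pattern used below (this toy example is not part of the original benchmark): `delayed` wraps a function call so that it builds a task graph instead of executing, and `da.from_delayed` turns the promised result into a `dask.array` with a declared shape and dtype. Nothing runs until `.compute()` is called."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def slow_load():\n",
"    # Stand-in for an expensive read from disk.\n",
"    return np.arange(10, dtype='<f4')\n",
"\n",
"lazy = delayed(slow_load)()  # builds a task; does not call slow_load yet\n",
"toy = da.from_delayed(lazy, shape=(10,), dtype='<f4')\n",
"toy.sum().compute()  # only now is slow_load actually executed"
]
},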
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with the currently-available HDF5 backend. I am creating a delayed read function, to allow parallel data access:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"h5r = NWBHDF5IO(infile, 'r', load_namespaces=False)\n",
"delayed_h5_read = delayed(NWBHDF5IO.read)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the purpose of benchmarking, I am going to focus on index 003 thru 013, which contain `CurrentClampSeries` objects that all have the same number of samples. The processing done here is taking zscoring each channel and returning an array with the z-scored data:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"arrays = []\n",
"for index in [f'index_{i:03d}' for i in range(3, 14)]:\n",
" dd = delayed_h5_read(h5r)\n",
" h5_arr = dd.acquisition[index].data[:]\n",
" arrays.append(da.from_delayed(h5_arr, shape=(300000, ), dtype='<f4'))\n",
"\n",
"arr = da.stack(arrays, -1)\n",
"zscored = (arr - arr.mean(axis=0)) / arr.std(axis=0)"
]
},
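{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything above is lazy: `zscored` is a task graph, not an array. As a sketch (assuming the full float32 result, roughly 13 MB, fits in memory), the computation can be forced with `.compute()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical follow-up, not an original benchmark cell: .compute()\n",
"# walks the graph, performing the delayed HDF5 reads and the z-scoring.\n",
"zscored_np = zscored.compute()\n",
"zscored_np.shape  # expected: (300000, 11)"
]
},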
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {