{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How useful would a zarr backend for NWB be?\n",
    "\n",
    "The goal of this notebook is to compare some simple dask processing on data stored with the (currently available) HDF5 backend of NWB and with a zarr backend that is currently proposed in [two](https://github.com/hdmf-dev/hdmf/pull/98) [PRs](https://github.com/NeurodataWithoutBorders/pynwb/pull/1018).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The starting point for this is the example provided [here](https://neurodatawithoutborders.github.io/nwb_hackathons/HCK07_2020_Seattle/projects/zarrio/). We download some intracellular electrophysiology (ICEphys) data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !curl -O http://download.alleninstitute.org/informatics-archive/prerelease/H19.28.012.11.05-2.nwb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And then convert from the HDF5 backend to the zarr backend:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# NWBZarrIO is only available on the zarr-backend PR branches of hdmf/pynwb\n",
    "from pynwb import NWBHDF5IO, NWBZarrIO\n",
    "import os\n",
    "infile = \"H19.28.012.11.05-2.nwb\"\n",
    "outfile = \"test_zarr_\" + os.path.basename(infile)\n",
    "# The HDF5-to-zarr conversion only needs to run once, so it is commented out here:\n",
    "# h5r = NWBHDF5IO(infile, 'r', load_namespaces=False)\n",
    "# f = h5r.read()\n",
    "# zw = NWBZarrIO(outfile,\n",
    "#                mode='w',\n",
    "#                manager=h5r.manager,\n",
    "#                chunking=True)\n",
    "# zw.write(f, cache_spec=True)\n",
    "# zw.close()\n",
    "# h5r.close()\n",
    "# zr = NWBZarrIO(outfile, 'r')\n",
    "# zf = zr.read()"
   ]
  },
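  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, and assuming the zarr backend writes a standard zarr directory store at `outfile`, the converted store should be inspectable directly with the `zarr` library, without going through pynwb:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: assumes the conversion above has been run and that the zarr\n",
    "# backend writes a standard zarr group at `outfile`.\n",
    "import zarr\n",
    "zgroup = zarr.open_group(outfile, mode='r')\n",
    "list(zgroup.keys())  # expect top-level NWB groups such as 'acquisition'"
   ]
  },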
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using dask, we're going to delay reading from the file, and we're going to use `dask.array` for the array operations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dask import delayed\n",
    "import dask.array as da"
   ]
  },
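  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a toy illustration of the pattern: wrapping a function in `delayed` defers its execution, and `da.from_delayed` turns the promised result into a dask array, given a known shape and dtype. Nothing is actually read until `compute()` is called:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Toy illustration of the delayed + from_delayed pattern used below.\n",
    "import numpy as np\n",
    "\n",
    "def load_chunk(i):\n",
    "    # Stand-in for an expensive read; returns a 1-D float32 array.\n",
    "    return np.full(5, i, dtype='<f4')\n",
    "\n",
    "lazy_reads = [delayed(load_chunk)(i) for i in range(3)]\n",
    "toy = da.stack([da.from_delayed(r, shape=(5,), dtype='<f4') for r in lazy_reads], -1)\n",
    "toy.compute()  # the reads only happen here"
   ]
  },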
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start with the currently-available HDF5 backend. I am creating a delayed read function to allow parallel data access:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "h5r = NWBHDF5IO(infile, 'r', load_namespaces=False)\n",
    "delayed_h5_read = delayed(NWBHDF5IO.read)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the purpose of benchmarking, I am going to focus on `index_003` through `index_013`, which contain `CurrentClampSeries` objects that all have the same number of samples. The processing done here z-scores each channel and returns an array with the z-scored data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "arrays = []\n",
    "for index in [f'index_{i:03d}' for i in range(3, 14)]:\n",
    "    # Each delayed read produces an NWBFile; indexing into it stays lazy.\n",
    "    dd = delayed_h5_read(h5r)\n",
    "    h5_arr = dd.acquisition[index].data[:]\n",
    "    arrays.append(da.from_delayed(h5_arr, shape=(300000,), dtype='<f4'))\n",
    "\n",
    "arr = da.stack(arrays, -1)\n",
    "zscored = (arr - arr.mean(axis=0)) / arr.std(axis=0)"
   ]
  },
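  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For comparison, the same lazy pipeline could in principle be written against the converted zarr store, reading the arrays directly with `da.from_zarr` (which maps zarr chunks to dask chunks). The internal path `acquisition/index_XXX/data` is an assumption that the zarr backend mirrors the HDF5 layout; adjust it to the actual structure of the store:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch only: assumes the converted store exists at `outfile` and that the zarr\n",
    "# backend mirrors the HDF5 layout ('acquisition/<name>/data'); adjust as needed.\n",
    "import zarr\n",
    "\n",
    "zgroup = zarr.open_group(outfile, mode='r')\n",
    "zarr_arrays = [da.from_zarr(zgroup[f'acquisition/index_{i:03d}/data'])\n",
    "               for i in range(3, 14)]\n",
    "zarr_arr = da.stack(zarr_arrays, -1)\n",
    "zarr_zscored = (zarr_arr - zarr_arr.mean(axis=0)) / zarr_arr.std(axis=0)"
   ]
  },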
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {