{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How useful would a zarr backend for NWB be?\n",
"\n",
"The goal of this notebook is to compare some simple dask processing between data stored in the (currently-available) HDF5 backend of NWB and a (currently in [two](https://github.com/hdmf-dev/hdmf/pull/98) [PRs](https://github.com/NeurodataWithoutBorders/pynwb/pull/1018)) zarr backend. \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The starting point of this is the example provided [here](https://neurodatawithoutborders.github.io/nwb_hackathons/HCK07_2020_Seattle/projects/zarrio/). We download some ICE data: "
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# !curl -O http://download.alleninstitute.org/informatics-archive/prerelease/H19.28.012.11.05-2.nwb"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And then convert from the HDF5 backend to the ZARR backend:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from pynwb import NWBHDF5IO, NWBZarrIO\n",
"import os\n",
"infile = \"H19.28.012.11.05-2.nwb\"\n",
"outfile = \"test_zarr_\" + os.path.basename(infile)\n",
"# h5r = NWBHDF5IO(infile , 'r', load_namespaces=False)\n",
"# f = h5r.read()\n",
"# zw = NWBZarrIO(outfile,\n",
"# mode='w',\n",
"# manager=h5r.manager,\n",
"# chunking=True)\n",
"# zw.write(f, cache_spec=True)\n",
"# zw.close()\n",
"# h5r.close()\n",
"# zr = NWBZarrIO(outfile, 'r')\n",
"# zf = zr.read()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using dask, we're going to delay reading from file and we're going to use `dask.array` for array operations"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from dask import delayed\n",
"import dask.array as da"
]
},
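{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal illustration of the pattern used below (this toy example is not part of the original benchmark): `delayed` wraps a function call so that it builds a task graph instead of executing, and `da.from_delayed` turns the promised result into a `dask.array` with a declared shape and dtype. Nothing runs until `.compute()` is called."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def slow_load():\n",
"    # Stand-in for an expensive read from disk.\n",
"    return np.arange(10, dtype='<f4')\n",
"\n",
"lazy = delayed(slow_load)()  # builds a task; does not call slow_load yet\n",
"toy = da.from_delayed(lazy, shape=(10,), dtype='<f4')\n",
"toy.sum().compute()  # only now is slow_load actually executed"
]
},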
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start with the currently-available HDF5 backend. I am creating a delayed read function, to allow parallel data access:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"h5r = NWBHDF5IO(infile, 'r', load_namespaces=False)\n",
"delayed_h5_read = delayed(NWBHDF5IO.read)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the purpose of benchmarking, I am going to focus on index 003 thru 013, which contain `CurrentClampSeries` objects that all have the same number of samples. The processing done here is taking zscoring each channel and returning an array with the z-scored data:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"arrays = []\n",
"for index in [f'index_{i:03d}' for i in range(3, 14)]:\n",
" dd = delayed_h5_read(h5r)\n",
" h5_arr = dd.acquisition[index].data[:]\n",
" arrays.append(da.from_delayed(h5_arr, shape=(300000, ), dtype='<f4'))\n",
"\n",
"arr = da.stack(arrays, -1)\n",
"zscored = (arr - arr.mean(axis=0)) / arr.std(axis=0)"
]
},
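{
"cell_type": "markdown",
"metadata": {},
"source": [
"Everything above is lazy: `zscored` is a task graph, not an array. As a sketch (assuming the full float32 result, roughly 13 MB, fits in memory), the computation can be forced with `.compute()`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical follow-up, not an original benchmark cell: .compute()\n",
"# walks the graph, performing the delayed HDF5 reads and the z-scoring.\n",
"zscored_np = zscored.compute()\n",
"zscored_np.shape  # expected: (300000, 11)"
]
},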
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {