@mdsumner
Last active July 1, 2025 05:22

This works to load, but we can't isel() it sensibly. Any ideas?

import virtualizarr
# virtualizarr.__version__
# '1.3.3.dev81+ga5d04d7'

from obstore.store import HTTPStore
from virtualizarr.parsers import HDFParser

parser = HDFParser()
store = HTTPStore(url="https://thredds.nci.org.au")

nc = [
    "https://thredds.nci.org.au/thredds/fileServer/gb6/BRAN/BRAN2020/month/ocean_temp_mth_2019_05.nc",
    "https://thredds.nci.org.au/thredds/fileServer/gb6/BRAN/BRAN2020/month/ocean_temp_mth_2019_06.nc",
]
ds = virtualizarr.open_virtual_mfdataset(
    nc,
    object_store=store,
    parser=parser,
    drop_variables=["average_DT", "Time_bounds", "average_T1", "average_T2", "st_edges_ocean", "nv"],
)

ds.isel(yt_ocean=slice(0, 2))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workenv/lib/python3.12/site-packages/xarray/core/dataset.py", line 2778, in isel
    var = var.isel(var_indexers)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/workenv/lib/python3.12/site-packages/xarray/core/variable.py", line 1045, in isel
    return self[key]
           ~~~~^^^^^
  File "/workenv/lib/python3.12/site-packages/xarray/core/variable.py", line 791, in __getitem__
    data = indexing.apply_indexer(indexable, indexer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workenv/lib/python3.12/site-packages/xarray/core/indexing.py", line 1038, in apply_indexer
    return indexable[indexer]
           ~~~~~~~~~^^^^^^^^^
  File "/workenv/lib/python3.12/site-packages/xarray/core/indexing.py", line 1564, in __getitem__
    return array[key]
           ~~~~~^^^^^
  File "/workenv/lib/python3.12/site-packages/virtualizarr/manifests/array.py", line 261, in __getitem__
    indexer = _possibly_expand_trailing_ellipsis(indexer, self.ndim)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workenv/lib/python3.12/site-packages/virtualizarr/manifests/array.py", line 366, in _possibly_expand_trailing_ellipsis
    raise ValueError(
ValueError: Invalid indexer for array. Indexer length must be less than or equal to the number of dimensions in the array, but indexer=(slice(None, None, None), slice(None, None, None), slice(0, 2, None), slice(None, None, None), Ellipsis) has length 5 and array has 4 dimensions.
If concatenating using xarray, ensure all non-coordinate data variables to be concatenated include the concatenation dimension, or consider passing `data_vars='minimal'` and `coords='minimal'` to the xarray combining function.
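The error hints at the cause: with xarray's default data_vars="all", a data variable that lacks the concatenation dimension has that dimension broadcast in during concat, so the virtual dataset's dimensions no longer line up with the underlying 4-D manifest arrays. A minimal in-memory xarray sketch of what data_vars="minimal" changes (no virtualizarr involved; the variable names here are made up):

```python
import numpy as np
import xarray as xr

def make(t):
    # One time step; "static" has no Time dimension, mimicking
    # coordinate-like variables such as st_edges_ocean.
    return xr.Dataset(
        {
            "temp": (("Time", "yt_ocean"), np.zeros((1, 3))),
            "static": (("yt_ocean",), np.arange(3.0)),
        },
        coords={"Time": [t]},
    )

# Default data_vars="all": "static" gains a Time dimension on concat.
all_ds = xr.concat([make(0), make(1)], dim="Time")
print(all_ds["static"].dims)  # ('Time', 'yt_ocean')

# data_vars="minimal" (with coords="minimal", compat="override"):
# "static" keeps its original dimensions, so later indexing does not
# build an indexer with more entries than the array has dimensions.
min_ds = xr.concat(
    [make(0), make(1)],
    dim="Time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
)
print(min_ds["static"].dims)  # ('yt_ocean',)
```

This is the same data_vars="minimal" / coords="minimal" combination used in the working snippet below.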

mdsumner commented Jul 1, 2025

import virtualizarr  # git tag v1.3.2
import xarray as xr

nc = [
    "https://thredds.nci.org.au/thredds/fileServer/gb6/BRAN/BRAN2020/month/ocean_temp_mth_2023_05.nc",
    "https://thredds.nci.org.au/thredds/fileServer/gb6/BRAN/BRAN2020/month/ocean_temp_mth_2023_06.nc",
]
lds = [
    virtualizarr.open_virtual_dataset(
        xnc,
        drop_variables=["average_DT", "Time_bounds", "average_T1", "average_T2", "st_edges_ocean", "nv"],
    )
    for xnc in nc
]

# Concatenate only along Time; don't broadcast the concat dim into
# variables that lack it.
xr_concat_kwargs = {
    "coords": "minimal",
    "compat": "override",
    "data_vars": "minimal",
}

ds = xr.concat(lds, dim="Time", **xr_concat_kwargs)
ds.virtualize.to_kerchunk("ocean_temp.parquet", format="parquet")
xr.open_dataset("ocean_temp.parquet").isel(xt_ocean=slice(0, 3))

<xarray.Dataset> Size: 2MB
Dimensions:    (Time: 2, nv: 2, st_ocean: 51, yt_ocean: 1500, xt_ocean: 3)
Coordinates:
  * Time       (Time) datetime64[ns] 16B 2023-05-16T12:00:00 2023-06-16
  * st_ocean   (st_ocean) float64 408B 2.5 7.5 12.5 ... 3.603e+03 4.509e+03
  * xt_ocean   (xt_ocean) float64 24B 0.05 0.15 0.25
  * yt_ocean   (yt_ocean) float64 12kB -74.95 -74.85 -74.75 ... 74.85 74.95
Dimensions without coordinates: nv
Data variables:
    Time_bnds  (Time, nv) datetime64[ns] 32B ...
    temp       (Time, st_ocean, yt_ocean, xt_ocean) float32 2MB ...
Attributes:
    filename:           TMP/ocean_temp_2023_05_01.nc.0000
    NumFilesInSet:      20
    grid_type:          regular
    history:            Tue Feb 20 13:55:18 2024: ncrcat -4 --dfl_lvl 1 --cnk...
    NCO:                netCDF Operators version 5.0.5 (Homepage = http://nco...
    title:              BRAN2020
    catalogue_doi_url:  http://dx.doi.org/10.25914/6009627c7af03
    acknowledgement:    BRAN is made freely available by CSIRO Bluelink and i...


mdsumner commented Jul 1, 2025

Or, parallelising the per-file opens with dask:

import dask  # reuses `nc` and the imports from the snippet above

dask.config.set(num_workers=24, scheduler="processes")
lds = [
    dask.delayed(virtualizarr.open_virtual_dataset)(
        xnc,
        drop_variables=["average_DT", "Time_bounds", "average_T1", "average_T2", "st_edges_ocean", "nv"],
    )
    for xnc in nc
]
xr_concat_kwargs = {
    "coords": "minimal",
    "compat": "override",
    "data_vars": "minimal",
}

# dask.compute returns a tuple (one entry per argument), hence vd[0].
vd = dask.compute(lds)

ds = xr.concat(vd[0], dim="Time", **xr_concat_kwargs)
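The vd[0] above is because dask.compute returns a tuple with one entry per positional argument, so passing a single list of delayed opens yields a 1-tuple containing the list of results. A self-contained sketch of that pattern, with a toy function standing in for virtualizarr.open_virtual_dataset:

```python
import dask

@dask.delayed
def open_one(path):
    # Hypothetical stand-in for virtualizarr.open_virtual_dataset.
    return f"dataset:{path}"

tasks = [open_one(p) for p in ["a.nc", "b.nc"]]

# One positional argument (a list) -> a 1-tuple containing the list.
vd = dask.compute(tasks)
print(vd)     # (['dataset:a.nc', 'dataset:b.nc'],)
print(vd[0])  # ['dataset:a.nc', 'dataset:b.nc']
```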
