Like HDF5, the Zarr format stores the array data in physical chunks:
- Chunks are typically small rectangular regions of the array (e.g. 100x100).
- They tile the full array with no overlaps.
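
To make the tiling concrete, here is a minimal sketch in plain Python. It does not use the Zarr API: the function name `chunk_slices` and the example shape are assumptions chosen for illustration. It enumerates the logical region covered by each chunk and checks that the chunks cover every element exactly once.

```python
import itertools
import math


def chunk_slices(shape, chunks):
    """Yield one tuple of slices per chunk for an array of the given shape.

    Hypothetical helper for illustration only; not part of Zarr.
    """
    # Number of chunks along each axis; the last chunk on an axis may be partial.
    grid = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    for index in itertools.product(*(range(g) for g in grid)):
        yield tuple(
            slice(i * c, min((i + 1) * c, n))
            for i, c, n in zip(index, chunks, shape)
        )


# Example: a 250x300 array split into 100x100 chunks (assumed sizes).
shape, chunks = (250, 300), (100, 100)
tiles = list(chunk_slices(shape, chunks))

# Sum of chunk areas; equal to the array size when chunks tile without overlap.
covered = sum(math.prod(s.stop - s.start for s in tile) for tile in tiles)
```

Because the chunk regions are disjoint by construction, `covered` matching the total number of array elements confirms that they also span the whole array.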
The FWR4 region of a germline J gene allele is expected to start with the amino acid motif WGXG on the heavy chain and FGXG on the light chain, where X represents any amino acid.
However, we've found that some of the germline J gene alleles provided by IMGT or AIRR-community/OGRDB do not conform to this.
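
A check for this expectation could be sketched as follows. The function name `fwr4_starts_with_motif` and the chain labels `"heavy"` and `"light"` are assumptions made for this example, and the input is assumed to be the FWR4 amino acid sequence already extracted from the allele.

```python
import re

# X in the motif stands for any amino acid, so it becomes [A-Z] in the pattern.
FWR4_MOTIFS = {
    "heavy": re.compile(r"WG[A-Z]G"),
    "light": re.compile(r"FG[A-Z]G"),
}


def fwr4_starts_with_motif(fwr4_aa, chain):
    """Return True if the FWR4 amino acid sequence starts with the expected motif.

    Hypothetical helper: `chain` must be "heavy" or "light".
    """
    # re.match anchors at the start of the string, which is what we want here.
    return FWR4_MOTIFS[chain].match(fwr4_aa.upper()) is not None


# Illustrative inputs only; substitute the FWR4 sequences of the alleles to check.
heavy_ok = fwr4_starts_with_motif("WGQGTLVTVSS", "heavy")
light_ok = fwr4_starts_with_motif("FGQGTKVEIK", "light")
nonconforming = fwr4_starts_with_motif("CGQGTLVTVSS", "heavy")
```

Running such a check over every J allele in a germline set is one way to list the alleles that do not conform.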
Goal: can we efficiently perform a single-cell analysis fully in memory, even for a dataset as large as the 1.3M cells dataset, provided the machine has enough memory?
Here is some low-hanging fruit that has the potential to significantly improve the current situation.