@mtekman
Last active October 25, 2020 22:24
How to Debug a Red Error Dataset

How to Debug a Failing Dataset

This gist has migrated to GitLab

https://gitlab.com/mtekman/galaxy-snippets

(The copy here is archived. Please see the above link for the latest developments)

Here we see that dataset 126 failed, and that it was fed dataset 115 (expanded below) as input.

https://gist.githubusercontent.com/mtekman/31393c5fe9d32bfb992b001927d16013/raw/a336ff91d4ba1cbd67488458e81d29423d9eb9e1/img1.png

Why did 126 fail?

Let’s expand the dataset and look at the stderr (Click on the (i) symbol).

https://gist.githubusercontent.com/mtekman/31393c5fe9d32bfb992b001927d16013/raw/a336ff91d4ba1cbd67488458e81d29423d9eb9e1/img2.png

If we click on the stderr link we see the following info:

Traceback (most recent call last):
  File "/data/dnb03/galaxy_db/job_working_directory/011/697/11697717/configs/tmpy_50aq_y", line 11, in <module>
    adata = ad.read('/data/dnb03/galaxy_db/files/9/9/5/dataset_995e0841-fac8-4422-96ab-d1d5445e76fb.dat')
  File "/usr/local/tools/_conda/envs/mulled-v1-2a63a50b8ad4044410393b113d4c7e7c6ceece72f1cb585e4c16192d3eb9ebbd/lib/python3.6/site-packages/anndata/readwrite/read.py", line 447, in read_h5ad
    constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
  File "/usr/local/tools/_conda/envs/mulled-v1-2a63a50b8ad4044410393b113d4c7e7c6ceece72f1cb585e4c16192d3eb9ebbd/lib/python3.6/site-packages/anndata/readwrite/read.py", line 502, in _read_args_from_h5ad
    return AnnData._args_from_dict(d)
  File "/usr/local/tools/_conda/envs/mulled-v1-2a63a50b8ad4044410393b113d4c7e7c6ceece72f1cb585e4c16192d3eb9ebbd/lib/python3.6/site-packages/anndata/core/anndata.py", line 2157, in _args_from_dict
    if key in d_true_keys[true_key].dtype.names:
AttributeError: 'dict' object has no attribute 'dtype'

What this tells us is that it tried to execute a standard anndata.read command, and failed because the code expected an object with a dtype attribute (such as a NumPy structured array, given the .dtype.names access) but found a plain Python dict instead.
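The error message itself can be reproduced in plain Python, without anndata. This is only an illustrative sketch: the `stored` dictionary and its keys are made-up stand-ins, not anndata's real internal structure.

```python
# Hypothetical stand-in for what was loaded from the file: the value is a
# plain dict where the calling code expected an array-like object that
# exposes a ``dtype`` attribute.
stored = {"uns": {"louvain": {}}}

try:
    stored["uns"].dtype  # same kind of attribute access as in the traceback
except AttributeError as err:
    message = str(err)

print(message)
```

The point is that the exception comes from attribute access on the wrong type of object, not from a missing dictionary key (which would raise a KeyError).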

Whilst this is good to note, it doesn’t really tell us anything interesting.

Can we replicate this same message outside of Galaxy?

Download required inputs

Let’s download dataset 115 (the green one), which was fed as input to dataset 126 (our errored dataset). Galaxy loves to give datasets really long names, so rename the download to something simple like testthis.h5ad and save it to ~/Downloads/

Find the XML file

This will help us learn what libraries are being used.

So the tool name is “Inspect AnnData”, and by looking at the above image I can see that the tool ID is anndata_inspect, which means the file containing the tool is likely called anndata_inspect.xml or if it’s in a folder maybe it’s just called inspect.xml.

Tools are developed in the tools-iuc repository, which I have cloned on my system at the path ~/repos/_work/_galaxy/tools-iuc/.

Open up a bash terminal.

I can find the exact location of anndata_inspect.xml by doing:

  • find ~/repos/_work/_galaxy/tools-iuc/ -name anndata_inspect.xml (no result)
  • find ~/repos/_work/_galaxy/tools-iuc/ -name 'anndata*.xml' (no result; the pattern is quoted so that the shell does not expand the * itself)
  • find ~/repos/_work/_galaxy/tools-iuc/ -name inspect.xml | grep anndata
    /home/tetris/repos/_work/_galaxy/tools-iuc/tools/anndata/inspect.xml

    Found it! (Without the grep anndata filter at the end, it would have listed every tool that has an inspect.xml.)
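If you prefer to do the same search from Python, a minimal sketch could look like the following. The function name `find_tool_xml` and its parameters are my own invention for illustration; it simply mirrors the find-then-grep idea above.

```python
from pathlib import Path


def find_tool_xml(root, filename, folder_hint):
    """Return files called ``filename`` below ``root`` whose relative path
    mentions ``folder_hint`` (the equivalent of ``find ... | grep hint``)."""
    root = Path(root).expanduser()
    return sorted(
        path
        for path in root.rglob(filename)
        if folder_hint in str(path.relative_to(root))
    )
```

Calling find_tool_xml("~/repos/_work/_galaxy/tools-iuc", "inspect.xml", "anndata") would be the counterpart of the last find command, assuming the repository is cloned at that path.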

Find the required libraries

If we peek at the inspect.xml file using an editor (here I will just do less -S ~/repos/_work/_galaxy/tools-iuc/tools/anndata/inspect.xml) we see:

<tool id="anndata_inspect" name="Inspect AnnData" version="@VERSION@+@GALAXY_VERSION@">
    <description>object</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements"/>
    <expand macro="version_command"/>
    <command detect_errors="exit_code"><![CDATA[
@CMD@

The macro <expand macro="requirements"/> seems like it would have what we want, but we need to find the actual definition of this macro. If we go up a few lines we see that all our macros in this file are imported from macros.xml.

So let’s exit this file (press q if you’re using less), and look up the macros file (less -S ~/repos/_work/_galaxy/tools-iuc/tools/anndata/macros.xml), where we now see:

<macros>
    <token name="@VERSION@">0.6.22.post1</token>
    <token name="@GALAXY_VERSION@">galaxy4</token>
    <xml name="requirements">
        <requirements>
            <requirement type="package" version="@VERSION@">anndata</requirement>
            <requirement type="package" version="2.0.17">loompy</requirement>
            <requirement type="package" version="2.9.0">h5py</requirement>
            <yield />
        </requirements>
    </xml>
    <xml name="citations">
        <citations>

Ah, so it’s using h5py==2.9.0, loompy==2.0.17, and anndata==@VERSION@, where we see that @VERSION@ is defined as 0.6.22.post1
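Reading the pins off by eye works fine for three packages, but the token substitution can also be sketched in Python. The helper name `pinned_requirements` is hypothetical, and the XML string is a trimmed, closed-off copy of the macros.xml excerpt above rather than the full file.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the macros.xml excerpt, closed off so that it parses.
MACROS_XML = """
<macros>
    <token name="@VERSION@">0.6.22.post1</token>
    <token name="@GALAXY_VERSION@">galaxy4</token>
    <xml name="requirements">
        <requirements>
            <requirement type="package" version="@VERSION@">anndata</requirement>
            <requirement type="package" version="2.0.17">loompy</requirement>
            <requirement type="package" version="2.9.0">h5py</requirement>
            <yield />
        </requirements>
    </xml>
</macros>
"""


def pinned_requirements(macros_xml):
    """Return pip-style pins with @TOKEN@ placeholders substituted."""
    root = ET.fromstring(macros_xml.strip())
    tokens = {tok.get("name"): tok.text for tok in root.iter("token")}
    pins = []
    for req in root.iter("requirement"):
        version = req.get("version")
        for name, value in tokens.items():
            version = version.replace(name, value)
        pins.append(f"{req.text}=={version}")
    return pins


print(" ".join(pinned_requirements(MACROS_XML)))
```

The printed line is what gets handed to pip in the next step.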

Install the required libraries

NOTE: It is a really good habit to create small virtual environments when testing Python stuff. It is essentially what Galaxy does when installing a tool via Conda, which likewise gives each tool its own isolated environment.

Create virtual env

Let’s create a small test environment:

virtualenv .sameasgalaxy

(ignore messages, and then)

source .sameasgalaxy/bin/activate

Your shell prompt should now have (.sameasgalaxy) prepended to it.

Install specific libraries

pip3 install h5py==2.9.0 loompy==2.0.17 anndata==0.6.22.post1

This will install these libraries not to your system, but just to the little virtualenv folder we created (./.sameasgalaxy)
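If you are ever unsure whether the environment is really active, you can ask the interpreter itself. This is a small sketch (the function name `in_virtualenv` is my own) that covers both older virtualenv releases and newer ones.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("virtualenv active:", in_virtualenv(), "-", sys.prefix)
```

Inside the activated environment the printed prefix should point at the .sameasgalaxy folder.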

Get the same python script that Galaxy calls

So we are lucky in this example because the script that is generated is actually output into the stdout of dataset 126, so we can click on that, copy it and save it to a file somewhere for us to modify.

import anndata as ad
    
    
import pandas as pd
from scipy import io

pd.options.display.precision = 15

adata = ad.read('/data/dnb03/galaxy_db/files/9/9/5/dataset_995e0841-fac8-4422-96ab-d1d5445e76fb.dat')

with open('/data/dnb03/galaxy_db/job_working_directory/011/697/11697717/outputs/galaxy_dataset_f2af22dd-91be-47e9-8d67-414b3037daf7.dat', 'w', encoding="utf-8") as f:
    print(adata, file=f)

We don’t have these long Galaxy paths (/data/blah/blah/blah/) on our own machine, so we need to change them to something that will work on our system. We know that our downloaded input file is at ~/Downloads/testthis.h5ad, so we use that in place of the first /data path, and for the output file we just pick a throwaway file which I will call junk.txt.

import anndata as ad
    
    
import pandas as pd
from scipy import io

pd.options.display.precision = 15

adata = ad.read('~/Downloads/testthis.h5ad')

with open('junk.txt', 'w', encoding="utf-8") as f:
    print(adata, file=f)

Save this to somewhere you can find it like ~/Downloads/testanndata.py
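One caveat about the ~ in the input path: it is the shell that expands ~, whereas Python passes the string on unchanged. Depending on the library, the read may then fail with a file-not-found error rather than the traceback we are trying to reproduce. If that happens, either write out the full path or expand it first, as in this sketch (the helper name `resolve_input` is hypothetical).

```python
import os


def resolve_input(path):
    """Expand a leading ~ to the current user's home directory."""
    return os.path.expanduser(path)


input_file = resolve_input("~/Downloads/testthis.h5ad")
print(input_file)
```

The expanded string can then be passed to ad.read in place of the literal '~/Downloads/testthis.h5ad'.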

Test the modified script

Go back to your terminal and make sure your virtualenv is activated (your shell prompt should have (.sameasgalaxy) prepended to each line, otherwise you need to reactivate it with source .sameasgalaxy/bin/activate)

Now we run the script: python ~/Downloads/testanndata.py and we see:

Traceback (most recent call last):
  File "~/Downloads/testanndata.py", line 11, in <module>
    adata = ad.read('~/Downloads/testthis.h5ad')
  File ".sameasgalaxy/lib/python3.6/site-packages/anndata/readwrite/read.py", line 447, in read_h5ad
    constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
  File ".sameasgalaxy/lib/python3.6/site-packages/anndata/readwrite/read.py", line 502, in _read_args_from_h5ad
    return AnnData._args_from_dict(d)
  File ".sameasgalaxy/lib/python3.6/site-packages/anndata/core/anndata.py", line 2157, in _args_from_dict
    if key in d_true_keys[true_key].dtype.names:
AttributeError: 'dict' object has no attribute 'dtype'

Which is the same error as Galaxy! Great, so what can we do to fix it?

Update the libraries

Maybe the script will work with a newer library? What’s the latest anndata we can use?

pip3 install anndata==

(we use == with no version after it so that pip errors out and gives us a list of versions)

Defaulting to user installation because normal site-packages is not writeable
WARNING: Cache entry deserialization failed, entry ignored
ERROR: Could not find a version that satisfies the requirement anndata== (from versions: 0.1, 0.2, 0.2.1, 0.3, 0.3.0.1, 0.3.0.2, 0.3.0.3, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.4, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 0.5, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.8.post1, 0.5.9, 0.5.10, 0.5.10.post1, 0.5.10.post2, 0.6, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.6.8, 0.6.9, 0.6.10, 0.6.11, 0.6.12, 0.6.13, 0.6.14, 0.6.15, 0.6.16, 0.6.17, 0.6.18, 0.6.19, 0.6.20, 0.6.21, 0.6.22rc1, 0.6.22, 0.6.22.post1, 0.7rc1, 0.7rc2, 0.7, 0.7.1, 0.7.2a1, 0.7.2, 0.7.3, 0.7.4)
ERROR: No matching distribution found for anndata==

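It looks like the latest is 0.7.4. Galaxy installs tool dependencies through conda, and conda packages for Python libraries generally track the releases on PyPI (where pip looks), so this version should be available to Galaxy too, although it is worth confirming that it exists on the conda channels before relying on it.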
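Picking the newest release out of that long line by eye is error-prone, so here is a rough sketch that does it for you. It assumes that pip lists the versions in ascending order (as in the output above) and treats anything with an a, b or rc suffix as a pre-release; the function name `latest_stable` is my own, and the sample string is a shortened copy of the pip error.

```python
import re


def latest_stable(pip_error):
    """Pick the last non-pre-release version from pip's 'from versions:' list."""
    match = re.search(r"\(from versions: ([^)]*)\)", pip_error)
    if match is None:
        return None
    versions = [v.strip() for v in match.group(1).split(",") if v.strip()]
    # Skip alphas, betas and release candidates such as 0.7.2a1 or 0.7rc2,
    # but keep post-releases such as 0.6.22.post1.
    stable = [v for v in versions if not re.search(r"\d(a|b|rc)\d", v)]
    return stable[-1] if stable else None


sample = (
    "ERROR: Could not find a version that satisfies the requirement anndata== "
    "(from versions: 0.6.22, 0.6.22.post1, 0.7rc1, 0.7rc2, 0.7, 0.7.1, "
    "0.7.2a1, 0.7.2, 0.7.3, 0.7.4)"
)
print(latest_stable(sample))
```

For anything more serious than a quick check, a proper version parser would be a safer choice than this regular expression.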

Test the new update

Create virtual env

Let’s create a new small test environment:

virtualenv .newtest
source .newtest/bin/activate

Your shell prompt should now have (.newtest) prepended to it.

Install updated libraries

Here we change anndata to the latest available version

pip3 install h5py==2.9.0 loompy==2.0.17 anndata==0.7.4

Test the script again

python ~/Downloads/testanndata.py

(NOTE: We don’t need to edit anything here, the calling script does not change, only the libraries in the environment do)

Running this appears to produce no output, so let’s check the junk.txt output file:

AnnData object with n_obs × n_vars = 1022 × 2000
    obs: 'Barcode', 'batch', 'emptyFDR', 'emptyLimited', 'emptyLogProb', 'emptyPValue', 'emptyTotal', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'log1p_total_counts_mito', 'n_counts', 'n_genes', 'n_genes_by_counts', 'pct_counts_in_top_50_genes', 'pct_counts_mito', 'total_counts', 'total_counts_mito', 'sex', 'health', 'louvain'
    var: 'ID', 'Symbol', 'NA.', 'mito', 'n_cells_by_counts-0', 'mean_counts-0', 'log1p_mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'log1p_total_counts-0', 'n_cells-0', 'n_counts-0', 'n_cells_by_counts-1', 'mean_counts-1', 'log1p_mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'log1p_total_counts-1', 'n_cells-1', 'n_counts-1', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'hvg', 'louvain', 'neighbors', 'pca', 'umap'
    obsm: 'X_pca', 'X_tsne', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'

Woo, just the output we want.

Okay cool, now let’s tell Galaxy which libraries to use to get this same output.

Patching Galaxy

Let’s navigate to our tools-iuc directory.

Create new patch branch

cd ~/repos/_work/_galaxy/tools-iuc

and take a look at what branch we’re on

git status

which gives:

On branch somethingnew
Your branch is up to date with 'origin/somethingnew'.

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    tools/umi_tools/test-data/SA_E725_FACS_1_R1.fastq.gz
    tools/umi_tools/test-data/SA_E725_FACS_1_R2.fastq.gz
    tools/umi_tools/umi-tools_count.xml~

nothing added to commit but untracked files present (use "git add" to track)

Not the branch I want to be on, so let’s switch to master

git checkout master
git status

giving:

Switched to branch 'master'
Your branch is up to date with 'origin/master'

Check remotes

First let us look at our remotes (with git remote -v):

origin	git@github.com:mtekman/tools-iuc.git (fetch)
origin	git@github.com:mtekman/tools-iuc.git (push)
upstream	git@github.com:galaxyproject/tools-iuc.git (fetch)
upstream	git@github.com:galaxyproject/tools-iuc.git (push)

So I have two remotes, origin (pointing at my fork) and upstream (pointing at the galaxyproject repository). My fork always lags behind upstream, so I should first update it with the latest upstream changes.

If you don’t have these remotes you can add them:

git remote add upstream git@github.com:galaxyproject/tools-iuc.git
git remote add origin git@github.com:<your-github-id>/tools-iuc.git

Update origin with upstream

On branch master:

# pull latest stuff from remote galaxyproject into my local files
git pull upstream master  
# push my (now updated local files) to my remote origin
git push origin master

Done. Now let’s create a new branch with knowledge that we are up to date with the latest stuff.

Create new branch

git checkout -b my-anndata
Switched to a new branch 'my-anndata'

Good. Now all file changes we make will be to this new branch.

Change the XML

So let’s open ~/repos/_work/_galaxy/tools-iuc/tools/anndata/macros.xml and edit the @VERSION@ from 0.6.22.post1 to 0.7.4.

Let’s also change the @GALAXY_VERSION@ from galaxy4 to galaxy1, because @GALAXY_VERSION@ counts tool revisions for a given package @VERSION@ and so restarts whenever the package version changes.

<macros>
    <token name="@VERSION@">0.7.4</token>
    <token name="@GALAXY_VERSION@">galaxy1</token>
    <xml name="requirements">
        <requirements>
            <requirement type="package" version="@VERSION@">anndata</requirement>
            <requirement type="package" version="2.0.17">loompy</requirement>
            <requirement type="package" version="2.9.0">h5py</requirement>
            <yield />
        </requirements>
    </xml>
    <xml name="citations">
        <citations>
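The same edit can also be scripted, which is handy if you bump versions often. This is a minimal sketch that works on the text of the file so that the rest of the formatting is left untouched; `bump_token` is a hypothetical helper, and the `macros` string is a cut-down stand-in for the real macros.xml.

```python
import re


def bump_token(xml_text, name, new_value):
    """Replace the text of <token name="NAME">...</token>, leaving the rest as is."""
    pattern = re.compile(
        r'(<token name="' + re.escape(name) + r'">)[^<]*(</token>)'
    )
    updated, count = pattern.subn(
        lambda m: m.group(1) + new_value + m.group(2), xml_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one {name} token, found {count}")
    return updated


# Cut-down stand-in for the contents of macros.xml.
macros = (
    '<macros>\n'
    '    <token name="@VERSION@">0.6.22.post1</token>\n'
    '    <token name="@GALAXY_VERSION@">galaxy4</token>\n'
    '</macros>\n'
)
macros = bump_token(macros, "@VERSION@", "0.7.4")
macros = bump_token(macros, "@GALAXY_VERSION@", "galaxy1")
print(macros)
```

Raising an error when the token is not found exactly once guards against silently editing nothing, or editing the wrong place.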

Test the XML

Here we use planemo, and if you haven’t installed planemo yet you will need to install it (by creating a virtual env)

Install planemo (skip if already installed)

virtualenv ~/.myplanemo;
source ~/.myplanemo/bin/activate;
pip3 install planemo

Lint your changes

source ~/.myplanemo/bin/activate;
planemo lint ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml

Here we should see green outputs only, if not, check the files that you edited for any obvious XML errors (e.g. missing </tags>)
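For the "obvious XML errors" case, a quick well-formedness check with Python's standard library can point at the offending line before you re-run the linter. This sketch only checks that the XML parses; it knows nothing about Galaxy's own rules, which is what planemo lint is for. The function name `xml_problem` is my own.

```python
import xml.etree.ElementTree as ET


def xml_problem(xml_text):
    """Return None if the text is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


good = '<macros><token name="@VERSION@">0.7.4</token></macros>'
bad = '<macros><token name="@VERSION@">0.7.4</macros>'  # </token> is missing
print(xml_problem(good))
print(xml_problem(bad))
```

The message for the broken string includes the line and column where the parser gave up.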

Test your changes

planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml

These should all be green too, but if not check the error messages. You might see something like “History item different” which means you did not get the same output as the one that is currently saved in the ~/repos/_work/_galaxy/tools-iuc/tools/anndata/test-data folder.

To re-run just the failed datasets, do:

planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml --failed

To update the outputs of just the failed datasets, do:

planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml --failed --update_test_data

(this run will still be reported as failed, but if you then re-run planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml --failed it should be green)

Commit Your Changes

git add ~/repos/_work/_galaxy/tools-iuc/tools/anndata/macros.xml
git commit -m "bumped anndata to 0.7.4"
git push -u origin my-anndata

Here we push our changes to the origin remote (i.e. my stuff, your stuff, not galaxyproject).

To contribute your change to galaxyproject, visit https://github.com/galaxyproject/tools-iuc/ and you should see a yellow dialog at the top prompting you to open a Pull Request from your new branch against galaxyproject/tools-iuc.
