https://gitlab.com/mtekman/galaxy-snippets
Here we see that dataset 126 failed, and that it was given dataset 115 (expanded below) as input.
Let’s expand the failed dataset and look at its stderr (click on the (i) symbol).
If we click on the stderr link we see the following info:
Traceback (most recent call last):
File "/data/dnb03/galaxy_db/job_working_directory/011/697/11697717/configs/tmpy_50aq_y", line 11, in <module>
adata = ad.read('/data/dnb03/galaxy_db/files/9/9/5/dataset_995e0841-fac8-4422-96ab-d1d5445e76fb.dat')
File "/usr/local/tools/_conda/envs/mulled-v1-2a63a50b8ad4044410393b113d4c7e7c6ceece72f1cb585e4c16192d3eb9ebbd/lib/python3.6/site-packages/anndata/readwrite/read.py", line 447, in read_h5ad
constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
File "/usr/local/tools/_conda/envs/mulled-v1-2a63a50b8ad4044410393b113d4c7e7c6ceece72f1cb585e4c16192d3eb9ebbd/lib/python3.6/site-packages/anndata/readwrite/read.py", line 502, in _read_args_from_h5ad
return AnnData._args_from_dict(d)
File "/usr/local/tools/_conda/envs/mulled-v1-2a63a50b8ad4044410393b113d4c7e7c6ceece72f1cb585e4c16192d3eb9ebbd/lib/python3.6/site-packages/anndata/core/anndata.py", line 2157, in _args_from_dict
if key in d_true_keys[true_key].dtype.names:
AttributeError: 'dict' object has no attribute 'dtype'
What this tells us is that it tried to execute a standard anndata.read command, and failed because the code expected an object with a dtype attribute (such as a NumPy array) and was handed a plain dictionary instead.
Whilst this is good to note, it doesn’t really tell us anything interesting.
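To make the wording of that last traceback line concrete, here is a minimal sketch in plain Python. It is not anndata's code, and the helper name attribute_error_message is made up for illustration. It shows that asking a dict for an attribute it does not have raises AttributeError, whereas a missing key would raise KeyError instead:

```python
def attribute_error_message(obj, name):
    """Return the AttributeError text raised by accessing obj.<name>, or None."""
    try:
        getattr(obj, name)
    except AttributeError as err:
        return str(err)
    return None


d = {"X": [1, 2, 3]}

# Dicts have no 'dtype' attribute, so attribute access fails.
# This is the kind of failure the traceback reports.
print(attribute_error_message(d, "dtype"))

# A missing *key* is a different failure: it raises KeyError.
try:
    d["dtype"]
except KeyError as err:
    print("KeyError:", err)
```

So the traceback is complaining about the type of the object (a dict where something array-like was expected), not about a missing dictionary entry.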
Let’s download dataset 115 (the green one) which was fed as input to dataset 126 (our errored dataset), and let’s rename it from the really long dataset names that Galaxy loves to give, to something simple like: testthis.h5ad
and save it to ~/Downloads/
Next, let’s look at the source of the tool itself. This will help us learn what libraries are being used.
So the tool name is “Inspect AnnData”, and by looking at the above image I can see that the tool ID is anndata_inspect
, which means the file containing the tool is likely called anndata_inspect.xml
or if it’s in a folder maybe it’s just called inspect.xml
.
Tools are developed in the tools-iuc
repository, which I have cloned on my system at the path ~/repos/_work/_galaxy/tools-iuc/
.
Open up a bash terminal.
I can find the exact location of anndata_inspect.xml
by doing:
find ~/repos/_work/_galaxy/tools-iuc/ -name anndata_inspect.xml
(no result)
find ~/repos/_work/_galaxy/tools-iuc/ -name 'anndata*.xml'
(no result)
find ~/repos/_work/_galaxy/tools-iuc/ -name inspect.xml | grep anndata
/home/tetris/repos/_work/_galaxy/tools-iuc/tools/anndata/inspect.xml
Found it! (Without the grep anndata command at the end for filtering lines, it would have shown me every tool that has a file called inspect.xml.)
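Guessing file names works, but the tool ID is stored inside the XML itself, so we can also search by ID directly. The following is a rough sketch using only the Python standard library; the function name find_tool_xml is my own, and it assumes every tool wrapper has a <tool id="..."> root element like the one shown further down:

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_tool_xml(repo_root, tool_id):
    """Return paths of XML files whose root element is <tool id="tool_id">."""
    matches = []
    for path in sorted(Path(repo_root).expanduser().rglob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        if root.tag == "tool" and root.get("id") == tool_id:
            matches.append(path)
    return matches


# Example call (path taken from this walkthrough, adjust to your clone):
# find_tool_xml("~/repos/_work/_galaxy/tools-iuc/", "anndata_inspect")
```

This is slower than find because it parses every XML file, but it does not depend on how the file happens to be named.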
If we peek at the inspect.xml
file using an editor (here I will just do less -S ~/repos/_work/_galaxy/tools-iuc/tools/anndata/inspect.xml
) we see:
<tool id="anndata_inspect" name="Inspect AnnData" version="@VERSION@+@GALAXY_VERSION@">
<description>object</description>
<macros>
<import>macros.xml</import>
</macros>
<expand macro="requirements"/>
<expand macro="version_command"/>
<command detect_errors="exit_code"><![CDATA[
@CMD@
The macro <expand macro="requirements"/>
seems like it would have what we want, but we need to find the actual definition of this macro. If we go up a few lines we see that all our macros in this file are imported from macros.xml
.
So let’s exit this file (press q
if you’re using less
), and look up the macros file (less -S ~/repos/_work/_galaxy/tools-iuc/tools/anndata/macros.xml
), where we now see:
<macros>
<token name="@VERSION@">0.6.22.post1</token>
<token name="@GALAXY_VERSION@">galaxy4</token>
<xml name="requirements">
<requirements>
<requirement type="package" version="@VERSION@">anndata</requirement>
<requirement type="package" version="2.0.17">loompy</requirement>
<requirement type="package" version="2.9.0">h5py</requirement>
<yield />
</requirements>
</xml>
<xml name="citations">
<citations>
Ah, so it’s using h5py==2.9.0, loompy==2.0.17, and anndata==@VERSION@, where we see that @VERSION@ is defined as 0.6.22.post1.
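Reading the pins off by eye is fine for three packages, but the substitution of @VERSION@ can also be done mechanically. Here is a small sketch, assuming the macros file has the shape shown above (the embedded MACROS_XML is a trimmed copy of it, and pinned_requirements is a name I made up):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the macros.xml excerpt shown above.
MACROS_XML = """
<macros>
    <token name="@VERSION@">0.6.22.post1</token>
    <token name="@GALAXY_VERSION@">galaxy4</token>
    <xml name="requirements">
        <requirements>
            <requirement type="package" version="@VERSION@">anndata</requirement>
            <requirement type="package" version="2.0.17">loompy</requirement>
            <requirement type="package" version="2.9.0">h5py</requirement>
        </requirements>
    </xml>
</macros>
"""


def pinned_requirements(macros_text):
    """Return pip-style pins with @TOKEN@ placeholders filled in."""
    root = ET.fromstring(macros_text)
    tokens = {t.get("name"): (t.text or "") for t in root.iter("token")}
    pins = []
    for req in root.iter("requirement"):
        version = req.get("version", "")
        for name, value in tokens.items():
            version = version.replace(name, value)
        pins.append(f"{req.text.strip()}=={version}")
    return pins


print(" ".join(pinned_requirements(MACROS_XML)))
# prints: anndata==0.6.22.post1 loompy==2.0.17 h5py==2.9.0
```

The printed line can be pasted straight after pip3 install.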
NOTE: It is a really good habit to create small virtual environments when testing Python stuff. Galaxy does essentially the same thing when installing a tool via Conda, which creates an isolated environment for the tool (using its own mechanism rather than virtualenv).
Let’s create a small test environment:
virtualenv .sameasgalaxy
(ignore messages, and then)
source .sameasgalaxy/bin/activate
Your shell prompt should now have (.sameasgalaxy) prepended to it.
pip3 install h5py==2.9.0 loompy==2.0.17 anndata==0.6.22.post1
This will install these libraries not to your system, but just to the little virtualenv folder we created (./.sameasgalaxy
)
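Before going further it is worth confirming that the environment really holds the versions Galaxy uses. A possible check is sketched below; it assumes Python 3.8 or newer (importlib.metadata does not exist in older interpreters, where pip3 freeze gives the same information), and mismatched_pins is a hypothetical helper, not part of any library:

```python
from importlib import metadata  # available from Python 3.8


def mismatched_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    problems = {}
    for package, wanted in pins.items():
        try:
            found = get_version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != wanted:
            problems[package] = (wanted, found)
    return problems


# The versions read from macros.xml above.
GALAXY_PINS = {"h5py": "2.9.0", "loompy": "2.0.17", "anndata": "0.6.22.post1"}

if __name__ == "__main__":
    print(mismatched_pins(GALAXY_PINS) or "environment matches the Galaxy pins")
```

Run it with the virtualenv activated; an empty result means the three libraries match what the tool wrapper asks for.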
So we are lucky in this example because the script that is generated is actually output into the stdout
of dataset 126, so we can click on that, copy it and save it to a file somewhere for us to modify.
import anndata as ad
import pandas as pd
from scipy import io
pd.options.display.precision = 15
adata = ad.read('/data/dnb03/galaxy_db/files/9/9/5/dataset_995e0841-fac8-4422-96ab-d1d5445e76fb.dat')
with open('/data/dnb03/galaxy_db/job_working_directory/011/697/11697717/outputs/galaxy_dataset_f2af22dd-91be-47e9-8d67-414b3037daf7.dat', 'w', encoding="utf-8") as f:
print(adata, file=f)
Now we don’t have these long Galaxy paths (/data/blah/blah/blah/) on our own machine, so we need to change them to something that will work on our system. We know that our downloaded input file is in ~/Downloads/testthis.h5ad, so we can use that in place of the first /data path, and for the output file we will just use a throwaway file I will call junk.txt.
import anndata as ad
import pandas as pd
from scipy import io
pd.options.display.precision = 15
adata = ad.read('~/Downloads/testthis.h5ad')
with open('junk.txt', 'w', encoding="utf-8") as f:
print(adata, file=f)
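Save this to somewhere you can find it like ~/Downloads/testanndata.py. Note that Python itself does not expand ~ in file paths, so if the read fails with a file-not-found error, write out your home directory in full (e.g. /home/<your-username>/Downloads/testthis.h5ad).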
Go back to your terminal and make sure your virtualenv is activated (your shell prompt should have (.sameasgalaxy) prepended to each line, otherwise you need to reactivate it with source .sameasgalaxy/bin/activate)
Now we run the script: python ~/Downloads/testanndata.py
and we see:
Traceback (most recent call last):
File "~/Downloads/testanndata.py", line 11, in <module>
adata = ad.read('~/Downloads/testthis.h5ad')
File ".sameasgalaxy/lib/python3.6/site-packages/anndata/readwrite/read.py", line 447, in read_h5ad
constructor_args = _read_args_from_h5ad(filename=filename, chunk_size=chunk_size)
File ".sameasgalaxy/lib/python3.6/site-packages/anndata/readwrite/read.py", line 502, in _read_args_from_h5ad
return AnnData._args_from_dict(d)
File ".sameasgalaxy/lib/python3.6/site-packages/anndata/core/anndata.py", line 2157, in _args_from_dict
if key in d_true_keys[true_key].dtype.names:
AttributeError: 'dict' object has no attribute 'dtype'
Which is the same error as Galaxy! Great, so what can we do to fix it?
Maybe the script will work with a newer library? What’s the latest anndata
we can use?
pip3 install anndata==
(we leave the version after == empty so that pip errors out and gives us a list of versions)
Defaulting to user installation because normal site-packages is not writeable
WARNING: Cache entry deserialization failed, entry ignored
ERROR: Could not find a version that satisfies the requirement anndata== (from versions: 0.1, 0.2, 0.2.1, 0.3, 0.3.0.1, 0.3.0.2, 0.3.0.3, 0.3.1, 0.3.2, 0.3.3, 0.3.4, 0.4, 0.4.1, 0.4.2, 0.4.3, 0.4.4, 0.5, 0.5.1, 0.5.2, 0.5.3, 0.5.4, 0.5.5, 0.5.6, 0.5.7, 0.5.8, 0.5.8.post1, 0.5.9, 0.5.10, 0.5.10.post1, 0.5.10.post2, 0.6, 0.6.1, 0.6.2, 0.6.3, 0.6.4, 0.6.5, 0.6.6, 0.6.7, 0.6.8, 0.6.9, 0.6.10, 0.6.11, 0.6.12, 0.6.13, 0.6.14, 0.6.15, 0.6.16, 0.6.17, 0.6.18, 0.6.19, 0.6.20, 0.6.21, 0.6.22rc1, 0.6.22, 0.6.22.post1, 0.7rc1, 0.7rc2, 0.7, 0.7.1, 0.7.2a1, 0.7.2, 0.7.3, 0.7.4)
ERROR: No matching distribution found for anndata==
It looks like the latest is 0.7.4, and since Conda packages for Python libraries generally follow the releases on PyPI (which is what pip lists here), this version should also be available to Galaxy, which installs its tool dependencies through Conda.
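Picking the newest release out of that long line by eye is error-prone, so here is a small sketch that does it for us. It assumes pip prints the versions oldest-first, as in the output above, and treats anything ending in a, b or rc plus a number as a pre-release; both function names are my own:

```python
import re


def versions_from_pip_error(message):
    """Extract the list inside '(from versions: ...)' from pip's error text."""
    match = re.search(r"\(from versions: ([^)]*)\)", message)
    if not match:
        return []
    return [v.strip() for v in match.group(1).split(",") if v.strip()]


def latest_stable(versions):
    """Return the last entry that is not a pre-release (no a/b/rc suffix)."""
    stable = [v for v in versions if not re.search(r"(a|b|rc)\d+$", v)]
    return stable[-1] if stable else None


# Shortened copy of the error message shown above.
message = (
    "ERROR: Could not find a version that satisfies the requirement anndata== "
    "(from versions: 0.6.22, 0.6.22.post1, 0.7rc1, 0.7, 0.7.2a1, 0.7.2, 0.7.3, 0.7.4)"
)
print(latest_stable(versions_from_pip_error(message)))
```

Skipping pre-releases matters here because the list contains entries such as 0.7rc1 and 0.7.2a1 that we would not want to pin a tool to.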
Let’s create a new small test environment:
virtualenv .newtest
source .newtest/bin/activate
Your shell prompt should now have (.newtest) prepended to it.
Here we change the anndata to the highest supported version
pip3 install h5py==2.9.0 loompy==2.0.17 anndata==0.7.4
python ~/Downloads/testanndata.py
(NOTE: We don’t need to edit anything here, the calling script does not change, only the libraries in the environment do)
Running this appears to produce no output, so let’s check the output junk.txt file:
AnnData object with n_obs × n_vars = 1022 × 2000
obs: 'Barcode', 'batch', 'emptyFDR', 'emptyLimited', 'emptyLogProb', 'emptyPValue', 'emptyTotal', 'log1p_n_genes_by_counts', 'log1p_total_counts', 'log1p_total_counts_mito', 'n_counts', 'n_genes', 'n_genes_by_counts', 'pct_counts_in_top_50_genes', 'pct_counts_mito', 'total_counts', 'total_counts_mito', 'sex', 'health', 'louvain'
var: 'ID', 'Symbol', 'NA.', 'mito', 'n_cells_by_counts-0', 'mean_counts-0', 'log1p_mean_counts-0', 'pct_dropout_by_counts-0', 'total_counts-0', 'log1p_total_counts-0', 'n_cells-0', 'n_counts-0', 'n_cells_by_counts-1', 'mean_counts-1', 'log1p_mean_counts-1', 'pct_dropout_by_counts-1', 'total_counts-1', 'log1p_total_counts-1', 'n_cells-1', 'n_counts-1', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
uns: 'hvg', 'louvain', 'neighbors', 'pca', 'umap'
obsm: 'X_pca', 'X_tsne', 'X_umap'
varm: 'PCs'
obsp: 'connectivities', 'distances'
Woo, just the output we want.
Okay cool, now let’s tell Galaxy which libraries to use to get this same output.
Let’s navigate to our tools-iuc
directory.
cd ~/repos/_work/_galaxy/tools-iuc
and take a look at what branch we’re on
git status
which gives:
On branch somethingnew
Your branch is up to date with 'origin/somethingnew'.
Untracked files:
(use "git add <file>..." to include in what will be committed)
tools/umi_tools/test-data/SA_E725_FACS_1_R1.fastq.gz
tools/umi_tools/test-data/SA_E725_FACS_1_R2.fastq.gz
tools/umi_tools/umi-tools_count.xml~
nothing added to commit but untracked files present (use "git add" to track)
Not the branch I want to be on, so let’s switch to master
git checkout master
git status
giving:
Switched to branch 'master'
Your branch is up to date with 'origin/master'
First let us look at our remotes with git remote -v:
origin git@github.com:mtekman/tools-iuc.git (fetch)
origin git@github.com:mtekman/tools-iuc.git (push)
upstream git@github.com:galaxyproject/tools-iuc.git (fetch)
upstream git@github.com:galaxyproject/tools-iuc.git (push)
So I have two remotes, origin
(pointing at my stuff) and upstream
(pointing at the galaxy stuff). My stuff always lags behind the latest stuff, so I should update my stuff with the latest stuff.
If you don’t have these remotes you can add them:
git remote add upstream git@github.com:galaxyproject/tools-iuc.git
git remote add origin git@github.com:<your-github-id>/tools-iuc.git
on branch master:
# pull latest stuff from remote galaxyproject into my local files
git pull upstream master
# push my (now updated local files) to my remote origin
git push origin master
Done. Now let’s create a new branch with knowledge that we are up to date with the latest stuff.
git checkout -b my-anndata
Switched to a new branch 'my-anndata'
Good. Now all file changes we make will be to this new branch.
So let’s open ~/repos/_work/_galaxy/tools-iuc/tools/anndata/macros.xml
and edit the @VERSION@
from 0.6.22.post1
to 0.7.4
.
Let’s also change the @GALAXY_VERSION@ from galaxy4 to galaxy1 (because @GALAXY_VERSION@ counts wrapper revisions for a given package @VERSION@, so it starts over when the package version changes).
<macros>
<token name="@VERSION@">0.7.4</token>
<token name="@GALAXY_VERSION@">galaxy1</token>
<xml name="requirements">
<requirements>
<requirement type="package" version="@VERSION@">anndata</requirement>
<requirement type="package" version="2.0.17">loompy</requirement>
<requirement type="package" version="2.9.0">h5py</requirement>
<yield />
</requirements>
</xml>
<xml name="citations">
<citations>
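If you prefer not to edit the file by hand, the same change can be scripted. This is only a sketch: set_token is a hypothetical helper, and it assumes each token sits on one line as <token name="...">value</token>, as in the listing above. It uses a regular expression rather than an XML library so that the rest of the file's formatting is left untouched:

```python
import re


def set_token(macros_text, name, value):
    """Replace the text of <token name="NAME">...</token> and return the new text."""
    pattern = re.compile(
        r'(<token name="' + re.escape(name) + r'">)[^<]*(</token>)'
    )
    new_text, count = pattern.subn(
        lambda m: m.group(1) + value + m.group(2), macros_text
    )
    if count != 1:
        raise ValueError(f"expected exactly one {name} token, found {count}")
    return new_text


before = (
    '<macros>\n'
    '    <token name="@VERSION@">0.6.22.post1</token>\n'
    '    <token name="@GALAXY_VERSION@">galaxy4</token>\n'
    '</macros>\n'
)
after = set_token(before, "@VERSION@", "0.7.4")
after = set_token(after, "@GALAXY_VERSION@", "galaxy1")
print(after)
```

To apply it to the real file you would read macros.xml into a string, call set_token twice as above, and write the result back.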
Here we use planemo, and if you haven’t installed planemo yet you will need to install it (by creating a virtual env)
virtualenv ~/.myplanemo;
source ~/.myplanemo/bin/activate;
pip3 install planemo
source ~/.myplanemo/bin/activate;
planemo lint ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml
Here we should see green outputs only, if not, check the files that you edited for any obvious XML errors (e.g. missing </tags>)
planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml
These should all be green too, but if not check the error messages. You might see something like “History item different” which means you did not get the same output as the one that is currently saved in the ~/repos/_work/_galaxy/tools-iuc/tools/anndata/test-data
folder.
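When a test reports a difference, it helps to see exactly which lines changed between the saved expected output and what the new library version produces. The sketch below is not how planemo compares files internally; it is just a quick manual check with Python's difflib, and the two example strings are invented for illustration:

```python
import difflib


def output_diff(expected_text, actual_text):
    """Return unified-diff lines between the saved test output and a new output."""
    return list(
        difflib.unified_diff(
            expected_text.splitlines(),
            actual_text.splitlines(),
            fromfile="test-data (expected)",
            tofile="new output",
            lineterm="",
        )
    )


# Invented example: the newer library prints one extra line.
expected = "obsm: 'X_pca'\nvarm: 'PCs'\n"
actual = "obsm: 'X_pca'\nvarm: 'PCs'\nobsp: 'connectivities'\n"

for line in output_diff(expected, actual):
    print(line)
```

If the differences are only the kind you expect from the version bump, updating the saved test data is reasonable; anything else deserves a closer look first.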
To re-run just the failed datasets, do:
planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml --failed
To update the outputs of just the failed datasets, do:
planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml --failed --update_test_data
(this run should itself still report failures, but if you then run planemo test ~/repos/_work/_galaxy/tools-iuc/tools/anndata/*.xml --failed
again it should be green)
git add ~/repos/_work/_galaxy/tools-iuc/tools/anndata/macros.xml
git commit -m "bumped anndata to 0.7.4"
git push -u origin my-anndata
Here we push our changes to the origin
remote (i.e. my stuff, your stuff, not galaxyproject).
To contribute your change to galaxyproject, visit https://github.com/galaxyproject/tools-iuc/ and you should see a yellow dialog at the top prompting you to make a Pull Request of your new branch against galaxyproject.