Skip to content

Instantly share code, notes, and snippets.

@mtekman
Last active October 25, 2020 22:22
Show Gist options
  • Save mtekman/149a7c52fd73e5d8ebe49f5a27b0743d to your computer and use it in GitHub Desktop.
Save mtekman/149a7c52fd73e5d8ebe49f5a27b0743d to your computer and use it in GitHub Desktop.
Subsetting STARsolo data for Galaxy testing

STARSolo Galaxy wrapper

This repo has migrated to GitLab

https://gitlab.com/mtekman/galaxy-snippets

(The copy here is archived. Please see the above link for the latest developments)

We need to generate small test data for this to pass in Galaxy For now we will try to get this to work with large datasets and work our way down

The datasets we will be using will be those from the 1K PBMCs v2 chemistry and we will therefore be using hg38 from ensembl, but only chr21 fasta and the gtf

Note

STARsolo seems to be unable to run without a GTF file, so trying to run with a transcriptome fails.

Downloading all the datasets

mkdir -p $dname
cd $dname
wget ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
wget https://zenodo.org/record/3457880/files/737K-august-2016.txt?download=1

Making the STAR environment

conda create -n starnew star==2.7.5b
conda activate starnew

(You will need to Ctrl-G here to type Y in the prompt)

Creating a STAR index

(needs to be plaintext)

gunzip Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
gunzip Homo_sapiens.GRCh38.100.gtf.gz
STAR --runMode genomeGenerate \
     --genomeDir 'tempstargenomedir' \
     --readFilesCommand zcat \
     --genomeFastaFiles Homo_sapiens.GRCh38.dna.chromosome.21.fa \
     --sjdbOverhang 100 \
     --sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
     --genomeSAindexNbases 4 \
     --runThreadN 5

We should now see a geneInfo.tab file in the tempstargenomedir that we can use

wc -l tempstargenomedir/{exonInfo,geneInfo}.tab

876 genes and 1378276 exons. Good.

Let’s see how much of our 1K PBMC data can be mapped

Prepare the 1K data

mv ~/pbmc_1k_v2_fastqs.tar ./
tar xvf pbmc_1k_v2_fastqs.tar
ls -lh pbmc_1k_v2_fastqs/

Run STARsolo on the index

Let’s just use one of the lanes L001

STAR  --runThreadN 4 \
      --genomeLoad NoSharedMemory \
      --genomeDir tempstargenomedir \
      --readFilesCommand zcat \
      --readFilesIn pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
      --soloType Droplet \
      --soloCBwhitelist 737K-august-2016.txt \
      --soloBarcodeReadLength 1  \
      --soloCBstart 1 \
      --soloCBlen 16 \
      --soloUMIstart 17 \
      --soloUMIlen 10 \
      --soloStrand 'Forward' \
      --soloFeatures 'Gene' \
      --soloUMIdedup '1MM_All'

Attempt the previous version of STAR

We will use a instead of b

conda create -n starnewa star==2.7.5a
conda activate starnewa
STAR --runMode genomeGenerate \
     --genomeDir 'tempstargenomedira' \
     --readFilesCommand zcat \
     --genomeFastaFiles Homo_sapiens.GRCh38.dna.chromosome.21.fa \
     --sjdbOverhang 100 \
     --sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
     --genomeSAindexNbases 4 \
     --runThreadN 5

STAR  --runThreadN 4 \
      --genomeLoad NoSharedMemory \
      --genomeDir 'tempstargenomedira' \
      --readFilesCommand zcat \
      --readFilesIn pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
      --soloType Droplet \
      --soloCBwhitelist 737K-august-2016.txt \
      --soloBarcodeReadLength 1  \
      --soloCBstart 1 \
      --soloCBlen 16 \
      --soloUMIstart 17 \
      --soloUMIlen 10 \
      --soloStrand 'Forward' \
      --soloFeatures 'Gene' \
      --soloUMIdedup '1MM_All'

Same problem.

Could it be that the issue is that the FASTA is requires more than one chromosome?

Run STAR 2.7.5b on full FASTA sequence

Let’s go for the primary assembly.

wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
STAR --runMode genomeGenerate \
     --genomeDir 'tempstargenomedirprimary' \
     --readFilesCommand zcat \
     --genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
     --sjdbOverhang 100 \
     --sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
     --genomeSAindexNbases 4 \
     --runThreadN 5

Memory usage was too high, killed.

Run with a truncated GTF file

The idea is that maybe the GTF file should be constrained only to chromosome 21

cat Homo_sapiens.GRCh38.100.gtf | grep "^#!"  >  Homo_sapiens.GRCh38.100.chr21.gtf
cat Homo_sapiens.GRCh38.100.gtf | grep "^21"  >> Homo_sapiens.GRCh38.100.chr21.gtf
wc -l Homo_sapiens.GRCh38.100.chr21.gtf

Generate Index

STAR --runMode genomeGenerate \
     --genomeDir $gdir \
     --readFilesCommand zcat \
     --genomeFastaFiles $fasta \
     --sjdbOverhang 100 \
     --sjdbGTFfile $gtf \
     --genomeSAindexNbases 4 \
     --runThreadN 5

STARSolo with chr21 GTF and FASTA, full PBMC 1k

STAR  --runThreadN 4 \
      --genomeLoad NoSharedMemory \
      --genomeDir 'tempstargenomedir21' \
      --readFilesCommand zcat \
      --readFilesIn pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
      --soloType Droplet \
      --soloCBwhitelist 737K-august-2016.txt \
      --soloBarcodeReadLength 1  \
      --soloCBstart 1 \
      --soloCBlen 16 \
      --soloUMIstart 17 \
      --soloUMIlen 12 \
      --soloStrand 'Forward' \
      --soloFeatures 'Gene' \
      --soloUMIdedup '1MM_All'

This appears to run, but for a long time, so I will try truncating the input files slightly to only 100,000 reads

STARSolo with chr21 GTF and FASTA, truncated PBMC

zcat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
    | awk 'NR%10000==1 {print $0; getline; print $0; getline; print $0; getline; print $0}' \
    | gzip - > pbmc_1k_v2_L001.R1.10k.fastq.gz
zcat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz \
    | awk 'NR%10000==1 {print $0; getline; print $0; getline; print $0; getline; print $0}' \
    | gzip - > pbmc_1k_v2_L001.R2.10k.fastq.gz

Check uniqueness

zcat pbmc_1k_v2_L001.R1.10k.fastq.gz | awk 'NR%4==1 {print $1}' > heads1.txt
zcat pbmc_1k_v2_L001.R2.10k.fastq.gz | awk 'NR%4==1 {print $1}' > heads2.txt
cat heads1.txt heads2.txt | sort | uniq -c | sort -nk1 > dupes.txt
cat dupes.txt | grep -v  "^\s*2\s"

only duplicate read headers, good.

Let’s run these again, but using only the subsetted data

STAR  --runThreadN 4 \
      --genomeLoad NoSharedMemory \
      --genomeDir 'tempstargenomedir21' \
      --readFilesCommand zcat \
      --readFilesIn pbmc_1k_v2_L001.R2.10k.fastq.gz pbmc_1k_v2_L001.R1.10k.fastq.gz \
      --soloType Droplet \
      --soloCBwhitelist 737K-august-2016.txt \
      --soloBarcodeReadLength 1  \
      --soloCBstart 1 \
      --soloCBlen 16 \
      --soloUMIstart 17 \
      --soloUMIlen 12 \
      --soloStrand 'Forward' \
      --soloFeatures 'Gene' \
      --soloUMIdedup '1MM_All'

Success! Let’s see which barcodes actually mapped, and whether we can truncated the barcodes file based on it

wc -l Solo.out/Gene/filtered/barcodes.tsv

394 cell barcodes were detected after filtering. Let’s see if we can use just these barcodes as input to our whitelist.

STARSolo with chr21 GTF and FASTA, truncated PBMC, truncated barcodes

cp Solo.out/Gene/filtered/barcodes.tsv ./filtered.barcodes.txt
STAR  --runThreadN 4 \
      --genomeLoad NoSharedMemory \
      --genomeDir $gdir \
      --readFilesCommand zcat \
      --readFilesIn pbmc_1k_v2_L001.R2.10k.fastq.gz pbmc_1k_v2_L001.R1.10k.fastq.gz \
      --soloType Droplet \
      --soloCBwhitelist filtered.barcodes.txt \
      --soloBarcodeReadLength 1  \
      --soloCBstart 1 \
      --soloCBlen 16 \
      --soloUMIstart 17 \
      --soloUMIlen 12 \
      --soloStrand 'Forward' \
      --soloFeatures 'Gene' \
      --soloUMIdedup '1MM_All'

Looks good!

head Solo.out/Gene/filtered/matrix.mtx

394 cells detected again, with 538 total counts

Subsetting the GTF

So right now we have subsetted input FASTQ and Barcodes data, each less than 1MB

ls -lh filtered.barcodes.txt pbmc_1k_v2_L001.R2.10k.fastq.gz pbmc_1k_v2_L001.R1.10k.fastq.gz

What we need to now is see how small we can make STAR index for testing purposes. Right now the big files are

ls -lh Homo_sapiens.GRCh38.dna.chromosome.21.fa Homo_sapiens.GRCh38.100.chr21.gtf

Which are big for testing purposes.

Let’s extract the detected features from filtered matrices, and filter it from the GTF

cut -f 1 Solo.out/Gene/filtered/features.tsv | sort | uniq > filtered.features.txt
wc -l filtered.features.txt

875 unique features in the final matrix. Let’s filter for these

barcodes = {}
with open('teststar/filtered.features.txt', 'r') as ff:
    bars = [x.strip() for x in ff.read().splitlines()]
    if subset != -1:
       bars = bars[:subset]
    barcodes = {x:True for x in bars}

with open('teststar/filtered.Homo_sapiens.GRCh38.100.chr21.gtf', 'w') as out:
    count = 0
    with open('teststar/Homo_sapiens.GRCh38.100.chr21.gtf', 'r') as gtf:
        for line in gtf:
            if line[0:2] == "#!":
                print(line, end="", file=out)
                count += 1
                continue

            info = [x.split() for x in line.split('\t')[8].split(";")]
            gene = None
            for k_v in info:
                if len(k_v) == 2:
                    if k_v[0]=="gene_id":
                        gene = k_v[1].replace('"',"")
                        break
            if gene is not None:
                if gene in barcodes:
                    count += 1
                    print(line, end="", file=out)

    print("Lines kept:", count)
wc -l Homo_sapiens.GRCh38.100.chr21.gtf filtered.Homo_sapiens.GRCh38.100.chr21.gtf

Wow, so apparently all genes in the GTF file make it to the final filtered matrix

Let’s try instead to just pick 100 or so genes

We reduce the GTF file by 1/2 if we keep only the first 100 genes. Let’s try 50

ls -lh Homo_sapiens.GRCh38.100.chr21.gtf filtered.Homo_sapiens.GRCh38.100.chr21.gtf

with 50 we keep 7310 lines and reduce the size to 3M. Still too big. Let’s try 10

828K, that’s fine. Let’s try generating an index with this GTF file

STARSolo with chr21 GTF and FASTA, truncated PBMC, truncated barcodes, truncated GTF

And now let’s try with STARsolo

The output is a tiny matrix of 4 cells. Hopefully we can work with this. We just need to truncate the FASTA now.

Subsetting the FASTA

cat Solo.out/Gene/filtered/features.tsv

These are the 10 genes we selected and are detected in the filtered matrices. We just need to target these in the FASTA

Let’s find the largest position in the filtered GTF file and cut the FASTA after it.

cut -f 4,5 filtered.Homo_sapiens.GRCh38.100.chr21.gtf \
    | grep -v "#!" \
    | sed 's|\s|\n|' \
    | sort -nk1 | uniq > positions.in.filtered
head -2 positions.in.filtered
tail -2 positions.in.filtered

Here we have first 2 positions and the last 2 positions in the filtered GTF: 31.6-46.1Mbp

How much can we shave off the FASTA?

head -1 Homo_sapiens.GRCh38.dna.chromosome.21.fa

We can see that the original FASTA goes from 1-46.7Mbp, so we can actually shave off a lot from the beginning!

We need to be careful when doing this!

with open('teststar/' + prefix+ '.Homo_sapiens.GRCh38.dna.chromosome.21.fa', 'w') as out:
    with open('teststar/Homo_sapiens.GRCh38.dna.chromosome.21.fa', 'r') as fasta:
        head = fasta.readline().split(':')
        start = int(head[4])
        end = int(head[5]);
        #
        row_start = int(startbp / 60)
        row_end = int(endbp / 60)
        #
        newstart = (row_start * 60) + 1
        newend = (row_end * 60)
        #
        head[4] = str(newstart)
        head[5] = str(newend)
        header = ':'.join(head)
        print(header, end="", file=out)
        row_current = 0
        for line in fasta:
            line = line.splitlines()[0]
            # each line is 60bp
            if len(line) != 60:
               print("Problem line", line)
               exit(-1)
            row_current +=1
            if row_start <= row_current <= row_end:
               print(line, file=out)

I’ve got it down to 14M from 40M, that’s something but lets see if it maps

We reclaim the same matrix, that’s good

Improvements

We could select genes that are closer together to minimise the size of the FASTA. Let’s select the first 10 genes in the GTF file.

with open('teststar/filtered2.Homo_sapiens.GRCh38.100.chr21.gtf', 'w') as out:
    count = 0
    genes_found = []
    with open('teststar/Homo_sapiens.GRCh38.100.chr21.gtf', 'r') as gtf:
        for line in gtf:
            if line[0:2] == "#!":
                print(line, end="", file=out)
                count += 1
                continue

            info = [x.split() for x in line.split('\t')[8].split(";")]
            gene = None
            for k_v in info:
                if len(k_v) == 2:
                    if k_v[0]=="gene_id":
                        gene = k_v[1].replace('"',"")
                        break

            if len(genes_found)==0:
                genes_found.append(gene)
            if gene != genes_found[-1]:
                genes_found.append(gene)
            if len(genes_found) == 15:
                break
            if gene is not None:
                count += 1
                print(line, end="", file=out)

    print("Lines kept:", count)

15 genes in 462 lines. What range is that?

cut -f 4,5 filtered2.Homo_sapiens.GRCh38.100.chr21.gtf \
    | grep -v "#!" \
    | sed 's|\s|\n|' \
    | sort -nk1 | uniq > positions.in.filtered
head -2 positions.in.filtered
tail -2 positions.in.filtered

Much smaller range, let’s see how small this makes the FASTA

Down to 693K, nice. Let’s re-run the indexing

It hangs forever. Weird bug where it takes the position of the exon end and uses that to assume the size of the exon. Bad. Bad. Bad.

So the smallest FASTA we can make has to be larger than biggest exon position. (internal screaming)

How to resolve this

Assuming we don’t change the information in the FASTA and tweak only the positions, then surely… if we decrease the start and end positions of the FASTA header by 1, that should also correspond to a shift in all start and end positions in the GTF… right?

So let’s subtract 5,000,000 from all positions and see what the heck happens. \+ 11,799

cat ~/repos/_mtekman/myorg/notebooks/teststar/filtered2.Homo_sapiens.GRCh38.100.chr21.gtf \
  | awk -F $'\t' 'BEGIN {OFS = FS} /#!/ {print;} /21/ {$4=($4-5010799);$5=($5-5010799); print}' \
  > ~/repos/_mtekman/myorg/notebooks/teststar/filtered3.Homo_sapiens.GRCh38.100.chr21.gtf

and now let’s subtract the same from the header of the FASTA

cat ~/repos/_mtekman/myorg/notebooks/teststar/filtered2.Homo_sapiens.GRCh38.dna.chromosome.21.fa \
  | awk -F ':' 'BEGIN {OFS = FS} NR==1 {$5=($5-5010799);$6=($6-5010799);print} NR>1 {print;}' \
  > ~/repos/_mtekman/myorg/notebooks/teststar/filtered3.Homo_sapiens.GRCh38.dna.chromosome.21.fa

and now to see if that generates a valid index…

Yes! Okay, let’s see if STARsolo works..

Looks like it, and now for the final matrix…

It looks like we have a matrix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment