https://gitlab.com/mtekman/galaxy-snippets
We need to generate small test data for this to pass in Galaxy For now we will try to get this to work with large datasets and work our way down
The datasets we will be using will be those from the 1K PBMCs v2 chemistry and we will therefore be using hg38 from ensembl, but only chr21 fasta and the gtf
STARsolo seems to be unable to run without a GTF file, so trying to run with a transcriptome fails.
mkdir -p $dname
cd $dname
wget ftp://ftp.ensembl.org/pub/release-100/gtf/homo_sapiens/Homo_sapiens.GRCh38.100.gtf.gz
wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
wget https://zenodo.org/record/3457880/files/737K-august-2016.txt?download=1
conda create -n starnew star==2.7.5b
conda activate starnew
(You will need to Ctrl-G here to type Y
in the prompt)
(needs to be plaintext)
gunzip Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz
gunzip Homo_sapiens.GRCh38.100.gtf.gz
STAR --runMode genomeGenerate \
--genomeDir 'tempstargenomedir' \
--readFilesCommand zcat \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.chromosome.21.fa \
--sjdbOverhang 100 \
--sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
--genomeSAindexNbases 4 \
--runThreadN 5
We should now see a geneInfo.tab file in the tempstargenomedir that we can use
wc -l tempstargenomedir/{exonInfo,geneInfo}.tab
876 genes and 1378276 exons. Good.
Let’s see how much of our 1K PBMC data can be mapped
mv ~/pbmc_1k_v2_fastqs.tar ./
tar xvf pbmc_1k_v2_fastqs.tar
ls -lh pbmc_1k_v2_fastqs/
Let’s just use one of the lanes L001
STAR --runThreadN 4 \
--genomeLoad NoSharedMemory \
--genomeDir tempstargenomedir \
--readFilesCommand zcat \
--readFilesIn pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
--soloType Droplet \
--soloCBwhitelist 737K-august-2016.txt \
--soloBarcodeReadLength 1 \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 10 \
--soloStrand 'Forward' \
--soloFeatures 'Gene' \
--soloUMIdedup '1MM_All'
We will use a instead of b
conda create -n starnewa star==2.7.5a
conda activate starnewa
STAR --runMode genomeGenerate \
--genomeDir 'tempstargenomedira' \
--readFilesCommand zcat \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.chromosome.21.fa \
--sjdbOverhang 100 \
--sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
--genomeSAindexNbases 4 \
--runThreadN 5
STAR --runThreadN 4 \
--genomeLoad NoSharedMemory \
--genomeDir 'tempstargenomedira' \
--readFilesCommand zcat \
--readFilesIn pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
--soloType Droplet \
--soloCBwhitelist 737K-august-2016.txt \
--soloBarcodeReadLength 1 \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 10 \
--soloStrand 'Forward' \
--soloFeatures 'Gene' \
--soloUMIdedup '1MM_All'
Same problem.
Could it be that the issue is that the FASTA is requires more than one chromosome?
Let’s go for the primary assembly.
wget ftp://ftp.ensembl.org/pub/release-100/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
STAR --runMode genomeGenerate \
--genomeDir 'tempstargenomedirprimary' \
--readFilesCommand zcat \
--genomeFastaFiles Homo_sapiens.GRCh38.dna.primary_assembly.fa \
--sjdbOverhang 100 \
--sjdbGTFfile Homo_sapiens.GRCh38.100.gtf \
--genomeSAindexNbases 4 \
--runThreadN 5
Memory usage was too high, killed.
The idea is that maybe the GTF file should be constrained only to chromosome 21
cat Homo_sapiens.GRCh38.100.gtf | grep "^#!" > Homo_sapiens.GRCh38.100.chr21.gtf
cat Homo_sapiens.GRCh38.100.gtf | grep "^21" >> Homo_sapiens.GRCh38.100.chr21.gtf
wc -l Homo_sapiens.GRCh38.100.chr21.gtf
STAR --runMode genomeGenerate \
--genomeDir $gdir \
--readFilesCommand zcat \
--genomeFastaFiles $fasta \
--sjdbOverhang 100 \
--sjdbGTFfile $gtf \
--genomeSAindexNbases 4 \
--runThreadN 5
STAR --runThreadN 4 \
--genomeLoad NoSharedMemory \
--genomeDir 'tempstargenomedir21' \
--readFilesCommand zcat \
--readFilesIn pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
--soloType Droplet \
--soloCBwhitelist 737K-august-2016.txt \
--soloBarcodeReadLength 1 \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 12 \
--soloStrand 'Forward' \
--soloFeatures 'Gene' \
--soloUMIdedup '1MM_All'
This appears to run, but for a long time, so I will try truncating the input files slightly to only 100,000 reads
zcat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R1_001.fastq.gz \
| awk 'NR%10000==1 {print $0; getline; print $0; getline; print $0; getline; print $0}' \
| gzip - > pbmc_1k_v2_L001.R1.10k.fastq.gz
zcat pbmc_1k_v2_fastqs/pbmc_1k_v2_S1_L001_R2_001.fastq.gz \
| awk 'NR%10000==1 {print $0; getline; print $0; getline; print $0; getline; print $0}' \
| gzip - > pbmc_1k_v2_L001.R2.10k.fastq.gz
Check uniqueness
zcat pbmc_1k_v2_L001.R1.10k.fastq.gz | awk 'NR%4==1 {print $1}' > heads1.txt
zcat pbmc_1k_v2_L001.R2.10k.fastq.gz | awk 'NR%4==1 {print $1}' > heads2.txt
cat heads1.txt heads2.txt | sort | uniq -c | sort -nk1 > dupes.txt
cat dupes.txt | grep -v "^\s*2\s"
only duplicate read headers, good.
Let’s run these again, but using only the subsetted data
STAR --runThreadN 4 \
--genomeLoad NoSharedMemory \
--genomeDir 'tempstargenomedir21' \
--readFilesCommand zcat \
--readFilesIn pbmc_1k_v2_L001.R2.10k.fastq.gz pbmc_1k_v2_L001.R1.10k.fastq.gz \
--soloType Droplet \
--soloCBwhitelist 737K-august-2016.txt \
--soloBarcodeReadLength 1 \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 12 \
--soloStrand 'Forward' \
--soloFeatures 'Gene' \
--soloUMIdedup '1MM_All'
Success! Let’s see which barcodes actually mapped, and whether we can truncated the barcodes file based on it
wc -l Solo.out/Gene/filtered/barcodes.tsv
394 cell barcodes were detected after filtering. Let’s see if we can use just these barcodes as input to our whitelist.
cp Solo.out/Gene/filtered/barcodes.tsv ./filtered.barcodes.txt
STAR --runThreadN 4 \
--genomeLoad NoSharedMemory \
--genomeDir $gdir \
--readFilesCommand zcat \
--readFilesIn pbmc_1k_v2_L001.R2.10k.fastq.gz pbmc_1k_v2_L001.R1.10k.fastq.gz \
--soloType Droplet \
--soloCBwhitelist filtered.barcodes.txt \
--soloBarcodeReadLength 1 \
--soloCBstart 1 \
--soloCBlen 16 \
--soloUMIstart 17 \
--soloUMIlen 12 \
--soloStrand 'Forward' \
--soloFeatures 'Gene' \
--soloUMIdedup '1MM_All'
Looks good!
head Solo.out/Gene/filtered/matrix.mtx
394 cells detected again, with 538 total counts
So right now we have subsetted input FASTQ and Barcodes data, each less than 1MB
ls -lh filtered.barcodes.txt pbmc_1k_v2_L001.R2.10k.fastq.gz pbmc_1k_v2_L001.R1.10k.fastq.gz
What we need to now is see how small we can make STAR index for testing purposes. Right now the big files are
ls -lh Homo_sapiens.GRCh38.dna.chromosome.21.fa Homo_sapiens.GRCh38.100.chr21.gtf
Which are big for testing purposes.
Let’s extract the detected features from filtered matrices, and filter it from the GTF
cut -f 1 Solo.out/Gene/filtered/features.tsv | sort | uniq > filtered.features.txt
wc -l filtered.features.txt
875 unique features in the final matrix. Let’s filter for these
barcodes = {}
with open('teststar/filtered.features.txt', 'r') as ff:
bars = [x.strip() for x in ff.read().splitlines()]
if subset != -1:
bars = bars[:subset]
barcodes = {x:True for x in bars}
with open('teststar/filtered.Homo_sapiens.GRCh38.100.chr21.gtf', 'w') as out:
count = 0
with open('teststar/Homo_sapiens.GRCh38.100.chr21.gtf', 'r') as gtf:
for line in gtf:
if line[0:2] == "#!":
print(line, end="", file=out)
count += 1
continue
info = [x.split() for x in line.split('\t')[8].split(";")]
gene = None
for k_v in info:
if len(k_v) == 2:
if k_v[0]=="gene_id":
gene = k_v[1].replace('"',"")
break
if gene is not None:
if gene in barcodes:
count += 1
print(line, end="", file=out)
print("Lines kept:", count)
wc -l Homo_sapiens.GRCh38.100.chr21.gtf filtered.Homo_sapiens.GRCh38.100.chr21.gtf
Wow, so apparently all genes in the GTF file make it to the final filtered matrix
Let’s try instead to just pick 100 or so genes
We reduce the GTF file by 1/2 if we keep only the first 100 genes. Let’s try 50
ls -lh Homo_sapiens.GRCh38.100.chr21.gtf filtered.Homo_sapiens.GRCh38.100.chr21.gtf
with 50 we keep 7310 lines and reduce the size to 3M. Still too big. Let’s try 10
828K, that’s fine. Let’s try generating an index with this GTF file
And now let’s try with STARsolo
The output is a tiny matrix of 4 cells. Hopefully we can work with this. We just need to truncate the FASTA now.
cat Solo.out/Gene/filtered/features.tsv
These are the 10 genes we selected and are detected in the filtered matrices. We just need to target these in the FASTA
Let’s find the largest position in the filtered GTF file and cut the FASTA after it.
cut -f 4,5 filtered.Homo_sapiens.GRCh38.100.chr21.gtf \
| grep -v "#!" \
| sed 's|\s|\n|' \
| sort -nk1 | uniq > positions.in.filtered
head -2 positions.in.filtered
tail -2 positions.in.filtered
Here we have first 2 positions and the last 2 positions in the filtered GTF: 31.6-46.1Mbp
How much can we shave off the FASTA?
head -1 Homo_sapiens.GRCh38.dna.chromosome.21.fa
We can see that the original FASTA goes from 1-46.7Mbp, so we can actually shave off a lot from the beginning!
We need to be careful when doing this!
with open('teststar/' + prefix+ '.Homo_sapiens.GRCh38.dna.chromosome.21.fa', 'w') as out:
with open('teststar/Homo_sapiens.GRCh38.dna.chromosome.21.fa', 'r') as fasta:
head = fasta.readline().split(':')
start = int(head[4])
end = int(head[5]);
#
row_start = int(startbp / 60)
row_end = int(endbp / 60)
#
newstart = (row_start * 60) + 1
newend = (row_end * 60)
#
head[4] = str(newstart)
head[5] = str(newend)
header = ':'.join(head)
print(header, end="", file=out)
row_current = 0
for line in fasta:
line = line.splitlines()[0]
# each line is 60bp
if len(line) != 60:
print("Problem line", line)
exit(-1)
row_current +=1
if row_start <= row_current <= row_end:
print(line, file=out)
I’ve got it down to 14M from 40M, that’s something but lets see if it maps
We reclaim the same matrix, that’s good
We could select genes that are closer together to minimise the size of the FASTA. Let’s select the first 10 genes in the GTF file.
with open('teststar/filtered2.Homo_sapiens.GRCh38.100.chr21.gtf', 'w') as out:
count = 0
genes_found = []
with open('teststar/Homo_sapiens.GRCh38.100.chr21.gtf', 'r') as gtf:
for line in gtf:
if line[0:2] == "#!":
print(line, end="", file=out)
count += 1
continue
info = [x.split() for x in line.split('\t')[8].split(";")]
gene = None
for k_v in info:
if len(k_v) == 2:
if k_v[0]=="gene_id":
gene = k_v[1].replace('"',"")
break
if len(genes_found)==0:
genes_found.append(gene)
if gene != genes_found[-1]:
genes_found.append(gene)
if len(genes_found) == 15:
break
if gene is not None:
count += 1
print(line, end="", file=out)
print("Lines kept:", count)
15 genes in 462 lines. What range is that?
cut -f 4,5 filtered2.Homo_sapiens.GRCh38.100.chr21.gtf \
| grep -v "#!" \
| sed 's|\s|\n|' \
| sort -nk1 | uniq > positions.in.filtered
head -2 positions.in.filtered
tail -2 positions.in.filtered
Much smaller range, let’s see how small this makes the FASTA
Down to 693K, nice. Let’s re-run the indexing
It hangs forever. Weird bug where it takes the position of the exon end and uses that to assume the size of the exon. Bad. Bad. Bad.
So the smallest FASTA we can make has to be larger than biggest exon position. (internal screaming)
Assuming we don’t change the information in the FASTA and tweak only the positions, then surely… if we decrease the start and end positions of the FASTA header by 1, that should also correspond to a shift in all start and end positions in the GTF… right?
So let’s subtract 5,000,000 from all positions and see what the heck happens. \+ 11,799
cat ~/repos/_mtekman/myorg/notebooks/teststar/filtered2.Homo_sapiens.GRCh38.100.chr21.gtf \
| awk -F $'\t' 'BEGIN {OFS = FS} /#!/ {print;} /21/ {$4=($4-5010799);$5=($5-5010799); print}' \
> ~/repos/_mtekman/myorg/notebooks/teststar/filtered3.Homo_sapiens.GRCh38.100.chr21.gtf
and now let’s subtract the same from the header of the FASTA
cat ~/repos/_mtekman/myorg/notebooks/teststar/filtered2.Homo_sapiens.GRCh38.dna.chromosome.21.fa \
| awk -F ':' 'BEGIN {OFS = FS} NR==1 {$5=($5-5010799);$6=($6-5010799);print} NR>1 {print;}' \
> ~/repos/_mtekman/myorg/notebooks/teststar/filtered3.Homo_sapiens.GRCh38.dna.chromosome.21.fa
and now to see if that generates a valid index…
Yes! Okay, let’s see if STARsolo works..
Looks like it, and now for the final matrix…
It looks like we have a matrix!