Please see the most up-to-date version of this protocol on my blog at https://darencard.net/blog/.
When annotating genomes it is often desireable to know the overall structure of genes, including information like exon and intron lengths among other metrics. Here is a program genestats
that will calculate such measures for a user.
#!/usr/bin/env bash
usage()
{
cat << EOF
genestats
Version 1.0 (10 January, 2018)
License: GNU GPLv2
To report bugs or errors, please contact Daren Card ([email protected]).
This script is provided as-is, with no support and no guarantee of proper or
desirable functioning.
This script calculates several gene structure measures based on a GFF annotation file.
The bgzip and tabix programs (http://www.htslib.org/doc/tabix.html) and bedtools
(https://bedtools.readthedocs.io/en/latest/) must be installed and in the \$PATH.
The user supplies the GFF3 file as a single argument to the command and the output is a
12 column tab-delimited text written to STDOUT with the following columns:
1. transcript ID
2. transcript sequence length
3. number of exons
4. total exon sequence length
5. number of introns
6. total intron sequence length
7. number of CDS chunks
8. total CDS sequence length
9. number of 5' UTR sequences
10. total 5' UTR sequence length
11. number of 3' UTR sequences
12. total 3' UTR sequence length
To work optimally and produce reliable results, the following tags in column 3 must be
included within the GFF file:
mRNA = transcript
exon = exons
CDS = coding sequence
five_prime_UTR = 5' UTR sequence
three_prime_UTR = 3' UTR sequence
The first three tags should be present in any GFF file but the latter two may not be
depending on the source of the file. In those cases, estimates of UTR features will be
erroneous, but the other features should be reported correctly. Note that temporary files
are created in the working directory as the program runs.
USAGE:
genestats <file.gff>
EOF
}
if [[ -z $1 ]] || [[ $1 == "-h" ]] || [[ $1 == "-help" ]]
then
usage
exit 1
fi
# this script basically pastes together 6 queries of the GFF file for each mRNA annotation
# for the mRNA, exons, introns, CDS, and UTRs (x2)
# for each, tabix is used to rapidly pull out the feature from the GFF file and
# the appropriate feature lines are then used to produce counts and sequence lengths
# for the introns, bedtools is used to essentially find the features not matching exons
# (i.e. inverse match)
cat $1 | sort -k1,1 -k4,4n -k5,5n | bgzip -c > $1.gz && \
tabix -p gff $1.gz && \
cat $1 | \
awk '{ if ($3 == "mRNA") print $0 }' | \
awk -F "ID=|;" '{ print $1, $2 }' | \
while read line; do \
parent=`echo $line | awk '{ print $9 }'`; \
chrom=`echo $line | awk '{ print $1 }'`; \
start=`echo $line | awk '{ print $4 }'`; \
end=`echo $line | awk '{ print $5 }'`; \
paste <(echo $parent) \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "mRNA") print $0 }' | \
awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "exon") print $0 }' | \
awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "mRNA") print $0 }' > mRNA.txt && \
tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "exon") print $0 }' > exons.txt && \
bedtools subtract -a mRNA.txt -b exons.txt | \
awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }' && \
rm mRNA.txt exons.txt) \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "CDS") print $0 }' | \
awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "five_prime_UTR") print $0 }' | \
awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }') \
<(tabix $1.gz $chrom:$start-$end | grep "$parent" | awk '{ if ($3 == "three_prime_UTR") print $0 }' | \
awk -v sum=0 -v count=0 -v OFS="\t" '{ count += 1; sum += ($5 - $4 + 1) } END { print count, sum }'); \
done
Let's run this on a test GFF file to demonstrate it. Here is the GFF file we will use, called test.gff
.
##gff-version 3
scaffold-162 . contig 1 2399525 . . . ID=scaffold-162;Name=scaffold-162
scaffold-162 maker gene 1481900 1501151 . + . ID=BoaCon13386;Name=BoaCon13386;Alias=maker-scaffold-162-augustus-gene-5.4;
scaffold-162 maker mRNA 1481900 1501151 . + . ID=BoaCon13386-RA;Parent=BoaCon13386;Name=BoaCon13386-RA;Alias=maker-scaffold-162-augustus-gene-5.4-mRNA-1;_AED=0.05;_QI=0|0|0|0.8|0.75|0.6|5|0|297;_eAED=0.25;
scaffold-162 maker exon 1481900 1481924 . + . ID=BoaCon13386-RA:exon:218;Parent=BoaCon13386-RA;
scaffold-162 maker exon 1485901 1486158 . + . ID=BoaCon13386-RA:exon:219;Parent=BoaCon13386-RA;
scaffold-162 maker exon 1490203 1490387 . + . ID=BoaCon13386-RA:exon:220;Parent=BoaCon13386-RA;
scaffold-162 maker exon 1495786 1495929 . + . ID=BoaCon13386-RA:exon:221;Parent=BoaCon13386-RA;
scaffold-162 maker exon 1500870 1501151 . + . ID=BoaCon13386-RA:exon:222;Parent=BoaCon13386-RA;
scaffold-162 maker CDS 1481900 1481924 . + 0 ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162 maker CDS 1485901 1486158 . + 2 ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162 maker CDS 1490203 1490387 . + 2 ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162 maker CDS 1495786 1495929 . + 0 ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162 maker CDS 1500870 1501151 . + 0 ID=BoaCon13386-RA:cds;Parent=BoaCon13386-RA;
scaffold-162 maker gene 1641672 1650129 . - . ID=BoaCon13387;Name=BoaCon13387;Alias=maker-scaffold-162-augustus-gene-5.5;
scaffold-162 maker mRNA 1641672 1650129 . - . ID=BoaCon13387-RA;Parent=BoaCon13387;Name=BoaCon13387-RA;Alias=maker-scaffold-162-augustus-gene-5.5-mRNA-1;_AED=0.21;_QI=0|0|0|0.5|0.66|0.75|4|0|130;_eAED=0.06;
scaffold-162 maker exon 1650055 1650129 . - . ID=BoaCon13387-RA:exon:226;Parent=BoaCon13387-RA;
scaffold-162 maker exon 1647269 1647309 . - . ID=BoaCon13387-RA:exon:225;Parent=BoaCon13387-RA;
scaffold-162 maker exon 1644412 1644586 . - . ID=BoaCon13387-RA:exon:224;Parent=BoaCon13387-RA;
scaffold-162 maker exon 1641672 1641773 . - . ID=BoaCon13387-RA:exon:223;Parent=BoaCon13387-RA;
scaffold-162 maker CDS 1650055 1650129 . - 0 ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162 maker CDS 1647269 1647309 . - 0 ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162 maker CDS 1644412 1644586 . - 1 ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162 maker CDS 1641672 1641773 . - 0 ID=BoaCon13387-RA:cds;Parent=BoaCon13387-RA;
scaffold-162 maker gene 2299416 2309756 . + . ID=BoaCon13393;Name=BoaCon13393;Alias=maker-scaffold-162-augustus-gene-6.11;
scaffold-162 maker mRNA 2299416 2309756 . + . ID=BoaCon13393-RA;Parent=BoaCon13393;Name=BoaCon13393-RA;Alias=maker-scaffold-162-augustus-gene-6.11-mRNA-1;_AED=0.02;_QI=171|1|1|1|1|1|2|4907|158;_eAED=0.02;
scaffold-162 maker exon 2299416 2299884 . + . ID=BoaCon13393-RA:exon:227;Parent=BoaCon13393-RA;
scaffold-162 maker exon 2304671 2309756 . + . ID=BoaCon13393-RA:exon:228;Parent=BoaCon13393-RA;
scaffold-162 maker five_prime_UTR 2299416 2299586 . + . ID=BoaCon13393-RA:five_prime_utr;Parent=BoaCon13393-RA;
scaffold-162 maker CDS 2299587 2299884 . + 0 ID=BoaCon13393-RA:cds;Parent=BoaCon13393-RA;
scaffold-162 maker CDS 2304671 2304849 . + 2 ID=BoaCon13393-RA:cds;Parent=BoaCon13393-RA;
scaffold-162 maker three_prime_UTR 2304850 2309756 . + . ID=BoaCon13393-RA:three_prime_utr;Parent=BoaCon13393-RA;
scaffold-162 maker gene 2319095 2328392 . + . ID=BoaCon13395;Name=BoaCon13395;Alias=maker-scaffold-162-augustus-gene-6.12;
scaffold-162 maker mRNA 2319095 2328392 . + . ID=BoaCon13395-RA;Parent=BoaCon13395;Name=BoaCon13395-RA;Alias=maker-scaffold-162-augustus-gene-6.12-mRNA-1;_AED=0.08;_QI=0|1|0.8|1|1|1|5|789|478;_eAED=0.08;
scaffold-162 maker exon 2319095 2319800 . + . ID=BoaCon13395-RA:exon:229;Parent=BoaCon13395-RA;
scaffold-162 maker exon 2322272 2322404 . + . ID=BoaCon13395-RA:exon:230;Parent=BoaCon13395-RA;
scaffold-162 maker exon 2323841 2323973 . + . ID=BoaCon13395-RA:exon:231;Parent=BoaCon13395-RA;
scaffold-162 maker exon 2325881 2326087 . + . ID=BoaCon13395-RA:exon:232;Parent=BoaCon13395-RA;
scaffold-162 maker exon 2327346 2328392 . + . ID=BoaCon13395-RA:exon:233;Parent=BoaCon13395-RA;
scaffold-162 maker CDS 2319095 2319800 . + 0 ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162 maker CDS 2322272 2322404 . + 2 ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162 maker CDS 2323841 2323973 . + 1 ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162 maker CDS 2325881 2326087 . + 0 ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162 maker CDS 2327346 2327603 . + 0 ID=BoaCon13395-RA:cds;Parent=BoaCon13395-RA;
scaffold-162 maker three_prime_UTR 2327604 2328392 . + . ID=BoaCon13395-RA:three_prime_utr;Parent=BoaCon13395-RA;
scaffold-162 maker gene 2049228 2051664 . + . ID=BoaCon13389;Name=BoaCon13389;Alias=augustus_masked-scaffold-162-processed-gene-6.2;
scaffold-162 maker mRNA 2049228 2051664 . + . ID=BoaCon13389-RA;Parent=BoaCon13389;Name=BoaCon13389-RA;Alias=augustus_masked-scaffold-162-processed-gene-6.2-mRNA-1;_AED=0.17;_QI=0|0|0|0.33|1|1|3|0|99;_eAED=0.16;
scaffold-162 maker exon 2049228 2049258 . + . ID=BoaCon13389-RA:exon:234;Parent=BoaCon13389-RA;
scaffold-162 maker exon 2049470 2049538 . + . ID=BoaCon13389-RA:exon:235;Parent=BoaCon13389-RA;
scaffold-162 maker exon 2051465 2051664 . + . ID=BoaCon13389-RA:exon:236;Parent=BoaCon13389-RA;
scaffold-162 maker CDS 2049228 2049258 . + 0 ID=BoaCon13389-RA:cds;Parent=BoaCon13389-RA;
scaffold-162 maker CDS 2049470 2049538 . + 2 ID=BoaCon13389-RA:cds;Parent=BoaCon13389-RA;
scaffold-162 maker CDS 2051465 2051664 . + 2 ID=BoaCon13389-RA:cds;Parent=BoaCon13389-RA;
Now it is very easy to run genestats
.
genestats test.gff
Which will write the following output to STDOUT.
BoaCon13386-RA 19251 5 889 4 18354 5 889 0 0 0 0
BoaCon13387-RA 8457 4 389 3 8062 4 389 0 0 0 0
BoaCon13393-RA 10340 2 5553 1 4785 2 475 1 170 1 4906
BoaCon13395-RA 9297 5 2221 4 7068 5 1432 0 0 1 788
BoaCon13389-RA 2436 3 297 2 2135 3 297 0 0 0 0
It is possible to redirect this to an output file, but the user may also wish to calculate some basic statistics like mean exon and intron lengths, like this.
genestats test.gff | \
awk -v OFS="\t" '{ exon_sum += $3; exon_len += $4; intron_sum += $5; intron_len += $6 } END { print exon_sum, exon_len / exon_sum, intron_sum, intron_len / intron_sum }'
Which provides the number of exons, mean exon length, the number of introns, and the mean intron length.
Hi @Carol-Symbiomics,
No need to cite in any formal way, as this isn't in any paper. Can just provide the URL for the record.
Daren