@zhoujj2013
Created July 1, 2014 15:56
Standardize GenBank FASTA Format
#!/usr/bin/perl -w
use strict;
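my $f = shift; # input FASTA file downloaded from Batch Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez)
# use low-precedence "or": with "||" the die attaches to the file name and never fires
open IN, "<", $f or die "Cannot open $f: $!";
# skip everything up to and including the first ">"
$/ = ">"; <IN>; $/ = "\n";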
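while(<IN>){
    # with $/ = "\n" this reads one header line (the leading ">" is already consumed)
    chomp;
    my $id = $_;
    # replace "gi|634859302|gb|KJ524455.1|" with "gi_634859302|gb|KJ524455.1|"
    $id =~ s/^gi\|/gi_/;
    # replace "gi_634859302|gb|KJ524455.1|" with "gi_634859302 KJ524455.1|"
    # add a line here for each further database tag you need
    $id =~ s/\|gb\|/ /g;
    $id =~ s/\|ref\|/ /g;
    $id =~ s/\|dbj\|/ /g;
    # replace "gi_634859302 KJ524455.1| description" with "gi_634859302 KJ524455.1 description"
    $id =~ s/\| / /g;
    # read the sequence up to the next ">" and join it into a single line
    $/ = ">";
    my $seq = <IN>;
    $seq = "" unless defined $seq; # header with no sequence at end of file
    chomp($seq);
    $seq =~ s/\n//g;
    $/ = "\n";
    # print the reformatted record to stdout
    print ">$id\n$seq\n";
}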
close IN;
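
The header rewriting can be tried on its own, without an input file. The sketch below is illustrative only: `standardize_header` is a hypothetical helper name that does not appear in the script, and the description text in the sample header is a placeholder. It applies the same substitutions as the loop above to a single header line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): applies the
# script's header substitutions to one FASTA header, without the ">".
sub standardize_header {
    my ($id) = @_;
    $id =~ s/^gi\|/gi_/;     # "gi|634859302" becomes "gi_634859302"
    $id =~ s/\|gb\|/ /g;     # "|gb|" becomes a space
    $id =~ s/\|ref\|/ /g;    # "|ref|" becomes a space
    $id =~ s/\|dbj\|/ /g;    # "|dbj|" becomes a space
    $id =~ s/\| / /g;        # drop the "|" in front of the description
    return $id;
}

# Sample header; "example description" is a placeholder, not real data.
my $header = 'gi|634859302|gb|KJ524455.1| example description';
print standardize_header($header), "\n";
# prints: gi_634859302 KJ524455.1 example description
```

Note that a header ending in "|" with no description keeps its trailing pipe, because the last substitution only matches a pipe followed by a space.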