Last active: February 5, 2019, 14:07
Split a FASTQ (or pair of FASTQs) into fixed-size chunks using GNU split and pigz (as written, 4,000,000 lines, i.e. 1,000,000 reads, per chunk). Modified from an original script by @ekg.
#!/usr/bin/env bash
# Split a gzipped FASTQ (or pair of FASTQs) into fixed-size gzipped chunks
# using GNU split, with pigz for parallel (de)compression.
first_reads=$1
second_reads=$2
ddir=$(dirname "${first_reads}")
obase_first=$(basename "${first_reads}" .fastq.gz)
obase_second=$(basename "${second_reads}" .fastq.gz)
# 4,000,000 lines per chunk = 1,000,000 reads (a FASTQ record is 4 lines).
splitsz=4000000
if [ -n "${first_reads}" ] && [ -e "${first_reads}" ]
then
    # Decompress, split into numbered chunks (part000000, part000001, ...),
    # and recompress each chunk on the fly via split's --filter.
    time pigz -p4 -cd "${first_reads}" | \
        split -d -a 6 -l ${splitsz} --filter='pigz -p4 > $FILE.gz' - "${ddir}/${obase_first}.fastq.part"
else
    echo "ERROR: no file ${first_reads} found."
    exit 1
fi
# The second (mate) file is optional; split it the same way if it was given and exists.
if [ -n "${second_reads}" ] && [ -e "${second_reads}" ]
then
    time pigz -p4 -cd "${second_reads}" | \
        split -d -a 6 -l ${splitsz} --filter='pigz -p4 > $FILE.gz' - "${ddir}/${obase_second}.fastq.part"
fi
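
For reference, a minimal usage sketch. The gist does not name the script file, so split_fastq.sh and the sample file names below are assumptions:

# Split a gzipped read pair; chunks appear next to the inputs as
# sample_R1.fastq.part000000.gz, sample_R1.fastq.part000001.gz, and so on.
bash split_fastq.sh sample_R1.fastq.gz sample_R2.fastq.gz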
If you want more parallelism (and you have the disk I/O to support it), you can combine this script with GNU parallel or LaunChair to parallelize across FASTQ files as well.
GNU parallel over all FASTQs in a directory, running four jobs at a time (e.g. for a 16-core system), as sketched below:
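
A minimal sketch, assuming the script above is saved as split_fastq.sh and the FASTQs live in /path/to/fastqs (both names are assumptions, not part of the original):

# One split job per gzipped FASTQ, at most four jobs at a time; each job
# already runs 4 pigz threads, so this keeps roughly 16 cores busy.
ls /path/to/fastqs/*.fastq.gz | parallel -j 4 bash split_fastq.sh {}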
LaunChair example on a 16-core system:
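
LaunChair consumes a plain file of shell commands, one per line, and runs them across the requested number of cores. A hedged sketch of building that command file (again assuming split_fastq.sh and /path/to/fastqs; the exact LaunChair invocation is not reproduced here since its flags are not given in the original):

# Write one split command per gzipped FASTQ into a command file that the
# launcher can then execute, e.g. a few jobs at a time on a 16-core machine.
for fq in /path/to/fastqs/*.fastq.gz
do
    echo "bash split_fastq.sh ${fq}"
done > split_commands.txt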