The tools fastq-dump
and fasterq-dump
are used to extract reads from the Sequence Read Archive and export them to (for example) fastq format. There is a hidden gotcha that you should be aware of using fastq-dump
to extract data.
tl;dr make sure you add --split-3
to your fastq-dump
command or you will get spurious results.
I was comparing some sequences to the human genome for partie, and had two different sets of results that I could not explain:
For sample ERR1051620 I had downloaded a data set and compared it to the human genome and got these results:
137083 reads; of these: 137083 (100.00%) were unpaired; of these: 136946 (99.90%) aligned 0 times 127 (0.09%) aligned exactly 1 time 10 (0.01%) aligned >1 times 0.10% overall alignment rate
But then in other code I downloaded the same sample, but got these results:
274166 reads; of these: 274166 (100.00%) were unpaired; of these: 13394 (4.89%) aligned 0 times 213200 (77.76%) aligned exactly 1 time 47572 (17.35%) aligned >1 times 95.11% overall alignment rate
Notice the difference?
For the same sample, in the first case, 0.09% of my reads mapped to the human genome exactly one time. In the second case, 77.6% of my reads mapped to the human genome exactly one time.
It turns out, the biosample is from a human, and so one would expect most of the reads to map to the human genome!
The difference between the two:
In the first case I used this command to download the data:
fastq-dump --outdir fastq --skip-technical --readids --read-filter pass --dumpbase --clip $SRR
While in the second case, I used this command to download the data:
fastq-dump --outdir fastq --skip-technical --readids --read-filter pass --dumpbase --clip --split-3 $SRR
Make sure you include the --split-3
in a fastq-dump
command!
Note that the newer fasterq-dump
command splits by default, and so although you don’t need to add the --split-3
there, you should still do so!