The perils of fast(er)q-dump

The tools fastq-dump and fasterq-dump are used to extract reads from the Sequence Read Archive and export them to (for example) fastq format. There is a hidden gotcha that you should be aware of using fastq-dump to extract data.

tl;dr make sure you add --split-3 to your fastq-dump command or you will get spurious results.

I was comparing some sequences to the human genome for partie, and had two different sets of results that I could not explain:

For sample ERR1051620 I had downloaded a data set and compared it to the human genome and got these results:

137083 reads; of these:
   137083 (100.00%) were unpaired; of these:
     136946 (99.90%) aligned 0 times
     127 (0.09%) aligned exactly 1 time
     10 (0.01%) aligned >1 times
 0.10% overall alignment rate

But then in other code I downloaded the same sample, but got these results:

274166 reads; of these:
   274166 (100.00%) were unpaired; of these:
     13394 (4.89%) aligned 0 times
     213200 (77.76%) aligned exactly 1 time
     47572 (17.35%) aligned >1 times
 95.11% overall alignment rate

Notice the difference?

For the same sample, in the first case, 0.09% of my reads mapped to the human genome exactly one time. In the second case, 77.6% of my reads mapped to the human genome exactly one time.

It turns out, the biosample is from a human, and so one would expect most of the reads to map to the human genome!

The difference between the two:

In the first case I used this command to download the data:

fastq-dump --outdir fastq --skip-technical  --readids --read-filter pass --dumpbase --clip $SRR

While in the second case, I used this command to download the data:

fastq-dump --outdir fastq --skip-technical  --readids --read-filter pass --dumpbase --clip --split-3 $SRR

Make sure you include the --split-3 in a fastq-dump command!

Note that the newer fasterq-dump command splits by default, and so although you don’t need to add the --split-3 there, you should still do so!