NCBI’s fastq-dump has to be one of the worst-documented programs available online. The default parameters for fastq-dump are also ridiculous and certainly not what you want to use. They also have absolutely required parameters mixed in with totally optional parameters, and so you have no idea what is required and what is optional. Here, we take a look at some of the options and hopefully help you decide which parameters to run.
If you are working with SRA files you will need, at some point, to use fastq-dump. Unfortunately, it is not very well explained. In fact, the official NCBI documentation doesn’t even include all the options available to fastq-dump!. This is what we have learned from using it, and also what we use to extract sequences.
We usually use this command (changing SRR_ID to whatever the ID is).
fastq-dump --outdir fastq --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip $SRR
If you are using fasterq-dump we usually use this command:
fasterq-dump $SRA --outdir fasta --fasta-unsorted
Note that (a) this outputs fasta
and not fastq
(why waste space with all those pesky quality scores you won’t use anyway), and (b) the --fasta-unsorted
makes it fasta because it can stream the data using multiple threads and write when a record is received rather than trying to keep things organised. Note: fasterq-dump
uses --split-3
by default, but this output uses --split-spot
. For more information, see the fasterq-dump wiki.
Note: if you use this, you will get interleaved fasta
files. Here is some simple, but fast, C code to split interleaved fasta files into R1
and R2
files: [fasta_split.c](https://github.com/linsalrob/EdwardsLab/blob/master/bin/fasta_split.c) (You can compile it with the Makefile in the same directory!
After that, and depending on your downstream analyses, you may need to reorganize the fastq files so that the sequences in each file match and that you get file(s) of singletons. I suggest that you try fastq_pair to do that. Update: if you use –split-3 you will get three files, a left file, a right file, and a singletons file. Then you may not need to reorder your fastq files.
If you are using fasterq-dump we recommend these options
fasterq-dump --outdir fastq --mem 1G --split-3 --threads 8 --skip-technical --print-read-nr $SRR
Installing fastq-dump and fasterq-dump
Both of these tools are available in the sra toolkit, and you should install that using conda
conda create -n sra -c bioconda sra-tools conda activate sra
Note: you should really use mamba, not conda.
SRA Documentation
The official documentation has pretty minimal descriptions of what the outputs are and how you should use them. Here is a more detailed description, along with noting some options that you absolutely must use when you extract sequences from SRA (noted [required])
Skipping the general options (help and version), you get to the “Data formatting” options:
Data formatting
Splitting reads [required]
We start with two related parameters (these are the definitions from the NCBI website)
--split-spot Split spots into individual reads.
--split-files Dump each read into separate file. Files will receive suffix corresponding to read number.
--split-3 Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq If only one biological read is present it is placed in *.fastq Biological reads and above are ignored.
Rationale: Some of the reads in SRA are paired-end reads where they sequenced (e.g.) from the left and right end of the sequence and have an estimated gap size between the ends (i.e. the average length of the fragments they are sequencing). It is important that you know if the sequences are paired-end for your downstream analysis, and most programs take the pairs into consideration.
–split-spot separates the read into left and right ends, and puts the reads in the same file. This creates a single file.
–split-files separates the read into left and right ends, and puts the forward and reverse reads in two separate files. This creates two files.
–split-3 separates the read into left and right ends, but if there is a left end without a matching right end, or a right end without a matching left end, they will be put in a single file.
You need to pass one of these two parameters to your dump-fastq command (but that is not all, you need –readids, see below). If you don’t, the paired end reads are concatenated together. This makes absolutely no sense and is a stupid default to have reads put out concatenated.
Fasta output
A simple one:
-fasta <[line width]> FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
Rationale: This outputs the sequences in fasta format with the specified line width. e.g. -fasta 80 will use 80 characters per line.
As the help says, if you use -fasta 0 you don’t get any wrapping.
A lot of downstream software requires fasta sequences only, and this is a good way to get those sequences. However, there are quick ways to convert fastq to fasta, and so if you think the fastq format may be useful (e.g. if you are going to do assembly), use that for the initial extraction.
Readids [required]
-I | --readids Append read id after spot id as 'accession.spot.readid' on defline.
Rationale: This is another one in the “what do you mean this is not default?” set of options. If you split your sequences into one (using –split-spot) or two (using –split-files) files, by default the sequences get the same ID. Huh? That makes no sense and will break your downstream processes because you have duplicate sequence IDs in the same file.
If you add the -I / –readids flag, one sequence gets appended the ID .1 and the other .2. This is pretty much what every program that accepts paired-end reads expects for input, so if you use –split-spot or –split-files you should use –readids. You should.
However, it was pointed out to me (thanks, Dave!) that the –readids option breaks BWA in downstream applications. You may need to omit this if you are using BWA. Using fastq-pair may solve this issue (but I haven’t tested it yet).
Original format
-F | --origfmt Defline contains only original sequence name.
Rationale: The SRA archive has rewritten the definition line in the sequence so that it contains the SRA ID and the length of the sequence. Therefore, a header that starts out:
@GFOWKFF06HMZ8O
becomes
@DRR000979.1 GFOWKFF06HMZ8O length=104
(this sequence is in SRA entry DRR000979).
This is a matter of personal preference, but I use the modified IDs so that all the sequences have the ID of the SRA entry that they come from so you can back-track them if needed.
SOLiD data
-C | --dumpcs <[cskey]> Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
Rationale: If your data comes from a SOLiD machine, you may want to enable this option so you can use color space.
Sequence data
-B | --dumpbase Formats sequence using base space (default for other than SOLiD).
Rationale: This will ensure that your output has A, T, G, and C instead of being put into color space. We include this by default because we don’t want colorspace, even by accident.
Quality Scores
-Q | –offset <integer> Offset to use for ASCII quality scores. The default is 33 (“!”).
Rationale: In the old days, Illumina used slightly different ways to calculate the quality scores and a slightly different offset for the quality scores. You should leave this as the default value for almost all applications.
Filtering
Spot IDs
-N | --minSpotId <rowid> Minimum spot id to be dumped. Use with "X" to dump a range. -X | --maxSpotId <rowid> Maximum spot id to be dumped. Use with "N" to dump a range.
Rationale: If you want to output one or a few spots from the SRA file, you can use -N and -X. Note that if you use this, you should probably not start at position 0, but start a few thousand reads into the file. For reasons that we are not sure about (but we have some conspiracy theories about), many fastq files have “unusual” sequences at the start of the file. We usually use -N 10000 -X 110000 to get 100,000 reads.
Read length
-M | --minReadLen <len> Filter by sequence length >= <len>
Rationale: Read length is a good proxy for sequence quality, and so you may want to filter reads below a certain length so they are ignored. Again, we post-filter the sequences, so we don’t use this one.
Technical sequences [required]
--skip-technical Dump only biological reads.
Rationale: If the sequencing was done with the “Illumina multiplex library construction protocol” the SRA entry ends up with application reads and technical reads like this:
Application Read Forward -> Technical Read Forward <- Application Read Reverse - Technical Read Reverse.
The spots look like this:
You don’t want the technical reads – you only want the biological reads – so include –skip-technical to remove those technical parts. If you omit this option and include the –split-files you actually end up with three or four files per SRA archive!
Clipping [required]
-W|–clip Apply left and right clips
Rationale: Some of the sequences in the SRA contain tags that are used e.g. for whole genome amplification and need to be removed. This will remove those sequences for you. You should enable this flag as it will trim off those sequences for you.
Aligned and unaligned sequences
--aligned Dump only aligned sequences. Aligned datasets only; see sra-stat. --unaligned Dump only unaligned sequences. Will dump all for unaligned datasets
Rationale: If you are looking for reads that map to the human genome, for example, you may want only the aligned or unaligned part. This is optional and up to you.
Read filtering [required]
--read-filter pass
Rationale: You want to filter out reads that are all N’s or otherwise completely useless. This filter can be set to pass|reject|criteria|redacted but you want only those reads that pass filtering. Otherwise you just get a bunch of Ns.
Workflow and piping
These options control the output of fastq-dump and are straightforward but worth noting.
Output to a specific directory
-O | --outdir <path> Output directory, default is current working directory ('.').
Rationale: The default directory to write to is your current working directory. This will output the fastq file to an alternate directory.
Output to standard out.
-Z | --stdout Output to stdout, all split data become joined into single stream
Rationale: By default the output is written to one or more files. This will output the data to standard out, so you can (for example) pipe it into another command. You probably don’t want to do this.
Compression
--gzip Compress output using gzip. --bzip2 Compress output using bzip2.
You can compress the sequences files using one of two standard compression algorithms, gzip or bzip2. Gzip is probably more widely supported (but only just) and several common downstream programs like bowtie2 can use both gzip and bzip2 directly.
Note: I used to recommend the --gzip
option, but according to NCBI this is much slower than downloading the fastq file and then gzipping it or streaming the output to gzip. Your mileage will vary, but I often still use it in pipelines when I am downloading a bunch of data, but will omit it for a single download.
Extracting sequences
The command that we use to extract sequences from SRA archives is:
fastq-dump --gzip --skip-technical --readids --dumpbase --split-files --clip sra_filename
Other options
There are other options available in fastq-dump, but you probably can leave those as defaults (caveat emptor!) You can get this list of options by running
fastq-dump -h
fasterq-dump
fasterq-dump
introduces a lot of simplified defaults that do make things easier, however there are some options that you should consider.
First, both --split-3
and --skip-technical
are now the defaults (thankfully), so you can omit those options if you want (although, tbh, I typically leave them in).
Increasing memory and threads
Your download is almost certainly going to be network limited (especially if you live in Australia!), but I always increase the memory to 1G and the threads to 16:
--mem 1G --threads 16
In my very limited testing, adding more threads seemed to speed
Adding read numbers
For reasons that I don’t understand, the --print-read-nr
option is overloaded and does two different things. It both prints how many spots were read/written (as fastq-dump does by default) and adds the read number (/1
or /2
) to the output sequence IDs.
Therefore, I recommend adding that to the download option.
Missing options
There are two options that I use a lot that are currently missing from fasterq-dump, the ability to download parts of files rather than the whole thing, and the ability for on-the-fly gzipping, although see the note above about that!
Thanks
Most of this was figured out by trial and error, although we thank the anonymous reviewers of our partie paper that pointed out a couple of options we didn’t regularly include but should have!