The tools fastq-dump and fasterq-dump extract reads from the Sequence Read Archive and export them to (for example) fastq format. There is a hidden gotcha that you should be aware of when using fastq-dump to extract data.

Sometimes when you look at a record in RefSeq/GenBank it is a virtual record that is really a pointer to a set of records. For example, the entry for Callorhinchus milii isolate IMCB2004 points you to the WGS records AAVX02000001-AAVX02067420. Here we show how to get these records.
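As a rough illustration of what that range covers, here is a minimal Python sketch that expands it into individual accessions. It assumes the first six characters (four letters plus a two-digit version) are a fixed prefix and the rest is a zero-padded serial number; the function name `expand_wgs_range` is ours, not part of any NCBI tool.

```python
def expand_wgs_range(first, last, prefix_len=6):
    """Expand a WGS accession range into a list of accessions.

    Assumption: the first `prefix_len` characters are a shared prefix
    and the remainder is a zero-padded serial number.
    """
    prefix, start = first[:prefix_len], first[prefix_len:]
    last_prefix, end = last[:prefix_len], last[prefix_len:]
    if prefix != last_prefix or len(start) != len(end):
        raise ValueError("accessions do not belong to the same range")
    width = len(start)
    # Keep the zero padding so every accession has the same length.
    return [f"{prefix}{n:0{width}d}" for n in range(int(start), int(end) + 1)]


accessions = expand_wgs_range("AAVX02000001", "AAVX02067420")
print(len(accessions), accessions[0], accessions[-1])
```

The resulting list can then be passed, in batches, to whatever download tool you prefer.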
As part of the STRIDES initiative, the NIH has moved the SRA to the cloud. This includes the metadata, and the whole SRA archive. Here, I show how to set up a new instance to access the sequence read archive in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.
We have been working on updating access to the SRA metadata. In previous posts, we used the SQLite database provided by the Meltzer lab, but for a variety of reasons, we are now using the XML provided by the NCBI.
Our new service,
The NCBI SRA contains a lot of data – about 10^16 bp at the moment! However, searching it has always been problematic. We’re happy to unveil a new search SRA service that allows you to search the random metagenomes in the SRA using either DNA or protein queries.
There is lots of crAssphage in the world, and there are lots of metagenomes in the sequence read archive. Can we find those metagenomes that do, or do not, have crAssphage in them in the SRA? Let’s try…
We were curious about how many bp of metagenomes are in the SRA. This was partly inspired by our grant writing, and partly by this question on Twitter from Tom Delmont:
great Rob! Do you have the ratio in term of ‘file volumes’ between WGS and 16S amplicons? Just curious to know if WGS wins on this front 🙂
— tom delmont (@tomodelmont) March 30, 2017
This is how to answer the question!
The metadata in the SRA is not all the data you can get about a run. Here is how to get more data about a run from the SRA without going to the SRA website.
Recall that in the SRA, a project (SRP) has one or more samples, a sample (SRS) has one or more experiments (SRX), and an experiment has one or more runs (SRR). [source: davetang.org]
How many experiments only have one run, and how many experiments have lots of runs?
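One way to think about this question is as a two-step count: first count runs per experiment, then count how many experiments share each run count. The sketch below does that in Python on a handful of made-up (SRX, SRR) pairs. The accessions are placeholders, and in practice the pairs would come from the SRA metadata.

```python
from collections import Counter


def runs_per_experiment(pairs):
    """Count how many runs (SRR) belong to each experiment (SRX)."""
    return Counter(srx for srx, _srr in pairs)


def experiments_by_run_count(pairs):
    """Map a run count to the number of experiments with that many runs."""
    return Counter(runs_per_experiment(pairs).values())


# Placeholder accessions for illustration only, not real SRA records.
example_pairs = [
    ("SRX000001", "SRR000001"),
    ("SRX000001", "SRR000002"),
    ("SRX000001", "SRR000003"),
    ("SRX000002", "SRR000004"),
    ("SRX000003", "SRR000005"),
]

histogram = experiments_by_run_count(example_pairs)
for n_runs, n_experiments in sorted(histogram.items()):
    print(f"{n_experiments} experiment(s) with {n_runs} run(s)")
```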