The metadata in the SRA is not all the data you can get about a run. Here is how to get more data about a run from the SRA without going to the SRA website.

Recall that in the SRA a project (SRP) has one or more samples, a sample (SRS) has one or more experiments (SRX), and an experiment has one or more runs (SRR). [source: davetang.org]
How many experiments only have one run, and how many experiments have lots of runs?
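One way to approach that question is to count runs per experiment and then count how many experiments share each total. The sketch below is illustrative only: it builds a tiny in-memory SQLite table with made-up accessions, and the table name `run` and the columns `run_accession` and `experiment_accession` are assumptions about how the metadata is laid out, so adjust them to match your own copy of the database.

```python
import sqlite3

# Toy stand-in for the SRA metadata: table and column names are assumptions,
# and every accession below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run (run_accession TEXT, experiment_accession TEXT)")
conn.executemany(
    "INSERT INTO run VALUES (?, ?)",
    [
        ("SRR000001", "SRX000001"),
        ("SRR000002", "SRX000002"),
        ("SRR000003", "SRX000002"),
        ("SRR000004", "SRX000002"),
        ("SRR000005", "SRX000003"),
    ],
)

# Inner query: number of runs in each experiment.
# Outer query: number of experiments that have that many runs.
rows = conn.execute(
    """
    SELECT runs_per_experiment, COUNT(*) AS n_experiments
    FROM (
        SELECT experiment_accession, COUNT(*) AS runs_per_experiment
        FROM run
        GROUP BY experiment_accession
    )
    GROUP BY runs_per_experiment
    ORDER BY runs_per_experiment
    """
).fetchall()

histogram = dict(rows)  # maps runs-per-experiment to number of experiments
for n_runs, n_experiments in rows:
    print(f"{n_experiments} experiment(s) with {n_runs} run(s)")
```

Against a real metadata database the same query should work unchanged, provided the table and column names match.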
While answering some reviewers' comments, I pulled out this data about the instruments used to submit data to the SRA. Clearly the HiSeq and MiSeq are dominating the number of runs that people are submitting.
I love standards; there are always so many to choose from. The Sequence Read Archive strives hard to capture appropriate information about the sequences that people deposit, but in the end scientists are people too, and they are never uniform and standard. This means there are a lot of ways to describe metagenomes. To get your data used by other people (and your papers cited), make sure you tag it so we can find it!
There is a lot of metagenomics data in the SRA, but it is not very well organized. To get it all, you need some wicked SQL-FU … or you can copy these recipes!
These are all the attributes in the SRA files
Getting data from the NCBI Sequence Read Archive is not easy. Here we combine a few of our posts to go step by step through getting the data.
NCBI’s fastq-dump has to be one of the worst-documented programs available online. The default parameters for fastq-dump are also ridiculous and certainly not what you want to use. They also have absolutely required parameters mixed in with totally optional parameters, and so you have no idea what is required and what is optional. Here, we take a look at some of the options and hopefully help you decide which parameters to run.
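As a rough starting point, here is a small sketch that assembles a fastq-dump command line without running it. The helper name `build_fastq_dump_command`, the output directory `fastq`, and the accession `SRR000001` are placeholders invented for this example, and the particular set of flags is one possible choice rather than a recommendation taken from the post: `--outdir` sets the output directory, `--gzip` compresses the output, `--skip-technical` drops technical reads, and `--split-3` writes paired reads to separate files.

```python
import shlex


def build_fastq_dump_command(run_accession, outdir="fastq"):
    """Return an argument list for fastq-dump (hypothetical helper).

    The flag selection here is an assumption; check `fastq-dump --help`
    for the version you have installed before relying on it.
    """
    return [
        "fastq-dump",
        "--outdir", outdir,    # where the FASTQ files are written
        "--gzip",              # compress the output files
        "--skip-technical",    # keep only biological reads
        "--split-3",           # paired reads go to separate files
        run_accession,         # the run to dump; placeholder accession below
    ]


cmd = build_fastq_dump_command("SRR000001")
print(shlex.join(cmd))
```

Keeping the arguments as a list means the command can later be handed to `subprocess.run(cmd, check=True)` without any shell quoting problems.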
Not all the options available to fastq-dump are listed on the NCBI website. It is not a very well documented program! Here is the current list.
The Sequence Read Archive (SRA, also known as the Short Read Archive) metadata is complex! This is a brief guide to help you navigate it.
One key thing to remember is that:
A project (SRP) has one or more samples. However, projects are in the table called study.
A sample (SRS) has one or more experiments (SRX).
An experiment has one or more runs (SRR).
[source: davetang.org]
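The hierarchy above can be walked with a couple of joins. The following is a minimal sketch against a toy in-memory SQLite database: the table names `study`, `experiment`, and `run`, the accession columns, and all of the accessions themselves are assumptions made for illustration, so check them against the schema of the metadata database you are actually using.

```python
import sqlite3

# Toy schema mirroring the project -> sample -> experiment -> run hierarchy.
# Table and column names are assumed; the accessions are made up.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE study (study_accession TEXT);
    CREATE TABLE experiment (
        experiment_accession TEXT,
        study_accession TEXT,
        sample_accession TEXT
    );
    CREATE TABLE run (run_accession TEXT, experiment_accession TEXT);

    INSERT INTO study VALUES ('SRP000001'), ('SRP000002');
    INSERT INTO experiment VALUES
        ('SRX000001', 'SRP000001', 'SRS000001'),
        ('SRX000002', 'SRP000001', 'SRS000002'),
        ('SRX000003', 'SRP000002', 'SRS000003');
    INSERT INTO run VALUES
        ('SRR000001', 'SRX000001'),
        ('SRR000002', 'SRX000001'),
        ('SRR000003', 'SRX000002'),
        ('SRR000004', 'SRX000003');
    """
)


def runs_for_project(conn, project_accession):
    """Return (sample, experiment, run) tuples for one project (SRP)."""
    query = """
        SELECT e.sample_accession, e.experiment_accession, r.run_accession
        FROM study AS s
        JOIN experiment AS e ON e.study_accession = s.study_accession
        JOIN run AS r ON r.experiment_accession = e.experiment_accession
        WHERE s.study_accession = ?
        ORDER BY r.run_accession
    """
    return conn.execute(query, (project_accession,)).fetchall()


project_runs = runs_for_project(conn, "SRP000001")
for sample, experiment, run in project_runs:
    print(sample, experiment, run)
```

Note that the project is looked up in the `study` table, in line with the point above that projects live in the table called study.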
What you really want are the runs, and this is how you can get them!