SRA Metadata

The sequence read archive (aka short read archive) SRA metadata is complex! This is a brief guide to help you navigate it.

One key thing to remember is that:

A project (SRP) has one or more samples. However, projects are in the table called study.
A sample (SRS) has one or more experiments (SRX).
An experiment has one or more runs (SRR).

[source: davetang.org]

What you really want are the runs, and this is how you can get them!

We are interested in parsing and downloading the SRA metadata, which is large, and complex. However, we can get a head start by leveraging the work of the bioconductor folks!

The SRA Metadata is available as a series of XML files but this is really not an easy way to ingest or process the data.

The SRAdb bioconductor R package includes the ability to download an SQLlite database of the SRA metadata that was created by the Meltzer Lab. However, we don’t use R that often, but we can still take advantage of their hard work (thank you!!!). If you download that package and look inside the getSRAdbFile.R file, you will see three different URLs for the database. I am not going to list them here so that you have to download the SRAdb bioconductor package and look inside. That helps their download stats! (also, don’t forget to cite them).

Once you have downloaded the SQLlite database you can access it with your favorite programming tools. Almost all languages have an SQLlite connector. But you need to understand the relationships first, and this is not clear!

The experiments table has both study_accession and sample_accession attributes. There is a many-to-many relationship between both experiments and studies and experiments and samples.

The run table has the experiment_accession attribute. Therefore, you can connect from study or sample to run via experiments.

We have a spreadsheet that has the first ~100 entries from the SRA metadata each for a bunch of tables. This shows you the column organization – each tab has the name of the SQL table that you will access. Download the spreadsheet for excel or open office.

Here is a list of all the attributes in the SRA data set.

You can also use the SQLlite command line interface to run simple SQL queries. This is a trivial way to retrieve data that you can process later. The general syntax for SQLlite is

sqllite3  

so to run the first query below, we would use:

sqlite3 SRAmetadb.sqlite "select run_accession from run where submission_accession='SRA012183'"

Here are some example SQL queries you can use. Note that we have found using subselects is a lot faster than using joins with sqlite3, and so where we offer both approaches, we recommend a subselect approach. Try them and see.

1. Retrieve all the run IDs associated with an SRA ID so you can download them (e.g. with prefetch or fastq-dump). Remember, that downloading requires runs, not submission accession numbers.

select run_accession from run where \
submission_accession='SRA012183';

2. Count the study types in the database. This is a good one to see if the database loaded!

select study_type, count(1) from \
study group by study_type;

3. Here is a count of all the amplicon sequencing projects:

select study_type, count(1) from study where study_accession in \
(select study_accession from experiment \
where library_strategy = "AMPLICON") group by study_type;

or with a LEFT OUTER JOIN

select study_type from study left join experiment on \
study.study_accession = experiment.study_accession where \
experiment.library_strategy = 'FL-cDNA';

4. Find all the run ids from all the metagenomics studies that are not from amplicon sequencing.

select run_accession from run where experiment_accession in (select experiment_accession from study left join experiment on study.study_accession = experiment.study_accession where experiment.library_strategy != "AMPLICON" and study_type = "Metagenomics");

Or we can do the same thing with two subselects:

select run_accession from run where experiment_accession in \
(select experiment_accession from experiment where experiment.library_strategy \
!= "AMPLICON" and experiment.study_accession in \
(select study_accession from study where study_type = "Metagenomics"));

We can also test for all the metagenomes, not just amplicons:

select run_accession from run where experiment_accession in (select experiment_accession from experiment where experiment.study_accession in (select study_accession from study where study_type = "Metagenomics"));

5. Find all the metadata about an experiment associated with a run. This is multiple joins, and you can either do them one at a time, replacing the IDs as you go, or as a single large join:

One at a time:

select experiment_accession from run where run_accession='ERR184835'; select sample_accession from experiment where experiment_accession='ERX160387'; select sample_attribute from sample where sample_accession='ERS183971';

This is quick, but you need to paste the IDs each time (or write a script).

Using left joins:

select sample_attribute from sample left join experiment on sample.sample_accession = experiment.sample_accession left join run on run.experiment_accession = experiment.experiment_accession where run_accession='ERR184835';

Using subselects

select sample_attribute from sample where sample_accession in (select sample_accession from experiment where experiment_accession in (select experiment_accession from run where run_accession='ERR184835'))

Using SQLlite, the subselect approach is a lot faster than the left joins, especially because in these cases we are only looking for one (or a few) entries.

6. Get the accession number, study title, and study abstract associated with a run accession number

select study.study_accession, study.study_title, study.study_abstract from study where study.study_accession in (select study_accession from experiment where experiment.experiment_accession in (select experiment_accession from run where run_accession = 'ERR187444'));

This will give you information about why they did what they did (possibly!)

7. Get all the run accessions from studys that have the words feces or fecal in the title or abstract. This is really a reorientation of the previous query:

sqlite3 ~/SRA/SRA/SRAdb/SRAmetadb.sqlite "select run_accession from run\
 where run.experiment_accession in (select experiment.experiment_accession\
 from experiment where experiment.study_accession in (select study.study_accession\
 from study where study.study_title like '%feces%' or study.study_title like\
 '%fecal%' or study.study_abstract like '%feces%' or study.study_abstract\ 
 like '%fecal%'))"

(Answer: as of Dec 4th, 2015 there were 41,803 runs from poo!

8. List all the run accessions and the experiment libraries for those runs (e.g. amplicon, wgs, etc)

sqlite3 ~/SRA/SRA/SRAdb/SRAmetadb.sqlite "select run.run_accession,\
 experiment.library_strategy from run left join experiment on\
 run.experiment_accession = experiment.experiment_accession"

9. Find all the runs from human microbiome studies. This includes reference genomes, 16S libraries, and WGS samples.

sqlite3 ~/SRA/SRA/SRAdb/SRAmetadb.sqlite "select run_accession from run\
 where run.experiment_accession in (select experiment.experiment_accession\
 from experiment where experiment.study_accession in (select\
 study.study_accession from study where study.study_title like\
 'Human Microbiome Project%'))"

10. Get all runs associated with a sample submission. We do this by joining via the experiment accession.

This is an approach using subselects:

sqlite3 SRAmetadb.sqlite "select run_accession from run where\
 experiment_accession in (select experiment_accession from experiment\
 where sample_accession='SRS014689')"

Or you can do the same thing with joins:

sqlite3 SRAmetadb.sqlite "select run_accession from run\
 left join experiment on run.experiment_accession=experiment.experiment_accession\
 where experiment.sample_accession='SRS014689'"

The opposite of that command is:

sqlite3 SRAmetadb.sqlite "select sample_accession from experiment where experiment_accession in (select experiment_accession from run where run_accession='SRR062394')"

11. Select the run id, the spot length, and the instrument model from runs and experiments:

sqlite3 SRAmetadb.sqlite 'select run.run_accession, experiment.spot_length, experiment.instrument_model \
from run left join experiment on run.experiment_accession=experiment.experiment_accession'