There is a lot of metagenomics data in the SRA, but it is not very well organized. To get it all, you need some wicked SQL-FU … or you can copy these recipes!
There are many ways to describe sequences in the SRA, and even with a controlled vocabulary people use different terms to mean different things. This makes it hard to accurately identify the random (non-amplicon) metagenomes in the SRA dataset.
Assuming you have downloaded the SQLite database version of the SRA metadata, you should be able to use this combination of commands and routines to get a list of all the metagenomes.
- Get a list of all IDs that are amplicon sequences. We will use this to filter out things later. Note that there are two ways of doing this, either marking the library strategy as amplicon or the library selection as PCR. Here we use both.
sqlite3 SRAmetadb.sqlite "select run_accession from run where experiment_accession in (select experiment_accession from experiment where (experiment.library_strategy = 'AMPLICON' or experiment.library_selection = 'PCR'))" > amplicons.ids
At the time of writing, there are 502,657 runs from amplicons.
- Get a list of all IDs where the library source is METAGENOMIC (see the note below).
sqlite3 SRAmetadb.sqlite "select run_accession from run where experiment_accession in (select experiment_accession from experiment where experiment.library_source = 'METAGENOMIC')" > source_metagenomic.ids
At the time of writing there are 336,994 runs where the source is metagenomic.
- Get a list of all IDs where the study type is Metagenomics.
sqlite3 SRAmetadb.sqlite "select run_accession from run where experiment_accession in (select experiment_accession from experiment where experiment.study_accession in (select study_accession from study where study_type = 'Metagenomics'));" > study_metagenomics.ids
At the time of writing there are 308,113 runs where the study is metagenomics.
- Get a list of all IDs where the organism contains the word metagenome. This is the sample.scientific_name attribute of the data. At the time of writing there are 3,699 different scientific names that include “metagenom” (to allow metagenome/metagenomic). There is currently only one that includes “microbiom”, the human gut microbiome samples.
sqlite3 SRAmetadb.sqlite "select run_accession from run where experiment_accession in (select experiment_accession from experiment where experiment.sample_accession in (select sample.sample_accession from sample where (sample.scientific_name like '%microbiom%' OR sample.scientific_name like '%metagenom%')))" > sci_name_metagenome.ids
At the time of writing there are 387,769 runs where the organism contains the word metagenome.
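The run counts quoted for each list can be checked by counting the lines in the corresponding file. This is a minimal sketch, assuming the four .ids files from the steps above are in the current directory; the function name count_ids is our own and not part of any tool.

```shell
# count_ids is a hypothetical helper: print "<file> <number of lines>" per file.
count_ids() {
    for f in "$@"; do
        # Arithmetic expansion strips the padding that some wc versions add.
        printf '%s %d\n' "$f" "$(( $(wc -l < "$f") ))"
    done
}

# Only report files that exist, so the sketch is safe to run anywhere.
for f in amplicons.ids source_metagenomic.ids study_metagenomics.ids sci_name_metagenome.ids; do
    if [ -f "$f" ]; then
        count_ids "$f"
    fi
done
```

Your numbers will differ from the ones quoted here if your copy of the database is newer.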
We can use a little grep and sort to choose the lines that are in the last three files that are not in amplicons and get a unique list. Here is how to do it in three separate commands:
grep -F -x -v -f amplicons.ids source_metagenomic.ids > source_metagenomic.notamplicons.ids
grep -F -x -v -f amplicons.ids study_metagenomics.ids > study_metagenomics.notamplicons.ids
grep -F -x -v -f amplicons.ids sci_name_metagenome.ids > sci_name_metagenome.notamplicons.ids
(In this grep, -F treats each pattern as a fixed string rather than a regular expression, -x matches whole lines only, and -v inverts the match. -f reads the patterns from the amplicons.ids file.)
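To see why the whole-line option matters, here is a toy run of the same grep on a few made-up accessions (the IDs and the temporary directory are purely illustrative). SRR0012 contains SRR001 as a substring, so it would be wrongly discarded without -x.

```shell
# Toy demonstration of the filtering step, using made-up accessions.
workdir=$(mktemp -d)
printf '%s\n' SRR001 SRR002 > "$workdir/amplicons.ids"
printf '%s\n' SRR001 SRR0012 SRR003 > "$workdir/candidates.ids"

# Keep candidate lines that do not exactly match any line in amplicons.ids.
filtered=$(grep -F -x -v -f "$workdir/amplicons.ids" "$workdir/candidates.ids")
echo "$filtered"
```

Here only SRR001 is removed, because it is the only candidate that matches an amplicon ID over the whole line.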
At the time of writing (July 2016), after filtering to remove amplicons, we get 75,652 runs with a scientific name that includes metagenome; 52,611 runs where the source is metagenomic; and 63,844 runs where the study type is metagenomics.
Now, we can just use sort to get a unique list of IDs:
sort -u sci_name_metagenome.notamplicons.ids source_metagenomic.notamplicons.ids study_metagenomics.notamplicons.ids > SRA-metagenomes.txt
This file has 93,658 IDs – i.e. there is a lot of redundancy in the three methods for getting the IDs (but that is OK!).
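To see how much the three methods overlap, one option is to count, for each run, how many of the three filtered lists it appears in. This is a sketch only; count_membership is a hypothetical helper name, and the file names are the ones created above.

```shell
# count_membership is a hypothetical helper. For each input file it removes
# duplicate lines, then reports how many IDs occur in exactly 1, 2, ... files.
count_membership() {
    for f in "$@"; do
        sort -u "$f"
    done | sort | uniq -c | awk '{ n[$1]++ } END { for (k in n) print k, n[k] }' | sort -n
}

# Run on the real lists only if all three are present.
if [ -f sci_name_metagenome.notamplicons.ids ] && [ -f source_metagenomic.notamplicons.ids ] && [ -f study_metagenomics.notamplicons.ids ]; then
    count_membership sci_name_metagenome.notamplicons.ids source_metagenomic.notamplicons.ids study_metagenomics.notamplicons.ids
fi
```

Each output line gives a number of lists followed by the number of IDs found in exactly that many lists. The second column should add up to the number of lines in SRA-metagenomes.txt.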
Figuring out the library information.
You can get a list of all the library sources using:
sqlite3 SRAmetadb.sqlite "select distinct experiment.library_source from experiment"
The current list is:
- GENOMIC
- METAGENOMIC
- METATRANSCRIPTOMIC
- OTHER
- SYNTHETIC
- TRANSCRIPTOMIC
- VIRAL RNA
You can get a list of the study types using:
sqlite3 SRAmetadb.sqlite "select distinct study_type from study"
The current list is:
- Cancer Genomics
- Epigenetics
- Exome Sequencing
- Metagenomics
- Other
- Pooled Clone Sequencing
- Population Genomics
- Synthetic Genomics
- Transcriptome Analysis
- Transcriptome Sequencing
- Whole Genome Sequencing