Describing metagenomes in the SRA

I love standards; there are always so many to choose from. The Sequence Read Archive (SRA) tries hard to capture appropriate information about the sequences that people deposit, but in the end scientists are people too, and people are never uniform and standard. This means there are a lot of ways to describe metagenomes. To get your data used by other people (and your papers cited), make sure you tag it so we can find it!

The examples below were chosen at random from lists of >10,000 samples that I am screening to understand the SRA annotations; they are not meant to call out any one group.

The point of labeling samples is to let us figure out computationally what they are. There are lots of ways to do this, and most of them are confusing!

This example from the USDA gets it right. Notice that the library source is metagenomic, the library selection is random, and the library strategy is whole genome sequencing. This is probably the easiest way to identify your sample as a random metagenome.
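
Here is a minimal sketch of how that check might look in Python. It assumes each record has already been parsed into a dictionary; the key names (`library_source`, `library_selection`, `library_strategy`) and the function name are my own choices for illustration, not a fixed SRA API. The values `METAGENOMIC`, `RANDOM`, and `WGS` are the terms the SRA uses for these three fields.

```python
# Hypothetical helper: the dictionary keys are assumptions about how the
# SRA metadata has been parsed, not names from any particular library.
EXPECTED_RANDOM_METAGENOME = {
    "library_source": "METAGENOMIC",
    "library_selection": "RANDOM",
    "library_strategy": "WGS",
}


def is_random_metagenome(record: dict) -> bool:
    """Return True only when all three library fields agree."""
    return all(
        # Normalise case and whitespace so "wgs " still matches "WGS".
        str(record.get(field, "")).strip().upper() == value
        for field, value in EXPECTED_RANDOM_METAGENOME.items()
    )


well_labeled = {
    "library_source": "METAGENOMIC",
    "library_selection": "RANDOM",
    "library_strategy": "WGS",
}
print(is_random_metagenome(well_labeled))
```

A record that gets any one of the three fields wrong fails the check, which is exactly why consistent tagging matters.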


This example from Cornell University gets pretty much everything wrong. According to the abstract, it is a 16S library, but the strategy is listed as whole genome sequencing and the source as genomic (not even metagenomic). The only real clue that this might be a metagenome is that the organism is called “human gut metagenome”.


This example from the Center for Comparative Genomics and Bioinformatics labels the study type as a metagenome, but not the source. The study type is not shown on that summary page; you need to go to the study overview to find that information. I don’t understand why the sample was labeled as genomic and not metagenomic.


This is a typical example from the UCSD Microbiome Initiative (they have many mislabeled samples). The abstract says that this is a 16S sample, and the source says that it is METAGENOMIC, but the strategy is listed as OTHER. The saving grace here is that the library selection is listed as PCR.
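
Because the labels are so inconsistent, any automated screen ends up combining several weak clues. The sketch below is one way that might be done, not the method I actually use: the function name, the dictionary keys, and the category strings it returns are all invented for illustration, and the order of the rules is a judgment call.

```python
# Hypothetical heuristic; keys and return values are illustrative assumptions.
def _norm(record: dict, key: str) -> str:
    """Upper-case, whitespace-stripped field value ('' when missing)."""
    return str(record.get(key, "")).strip().upper()


def guess_sample_type(record: dict) -> str:
    source = _norm(record, "library_source")
    strategy = _norm(record, "library_strategy")
    selection = _norm(record, "library_selection")
    organism = _norm(record, "organism")

    # PCR selection or an AMPLICON strategy suggests 16S-style data,
    # even when the other fields say something else.
    if strategy == "AMPLICON" or selection == "PCR":
        return "amplicon"
    # All three fields agree: the easy case.
    if (source, strategy, selection) == ("METAGENOMIC", "WGS", "RANDOM"):
        return "shotgun metagenome"
    # Only a partial clue, such as an organism named "... metagenome".
    if source == "METAGENOMIC" or "METAGENOME" in organism:
        return "possible metagenome"
    return "unknown"


ucsd_like = {
    "library_source": "METAGENOMIC",
    "library_strategy": "OTHER",
    "library_selection": "PCR",
}
print(guess_sample_type(ucsd_like))
```

A record with strategy WGS, source GENOMIC, and organism “human gut metagenome” falls through to "possible metagenome", which means someone still has to read the abstract to learn that it is really 16S.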