How many bp of metagenomes are there in the SRA?

We were curious about how many bp of metagenomes there are in the SRA. This was partly inspired by our grant writing, and partly by a question on Twitter from Tom Delmont.

This is how to answer the question!

First, we use partie to get all the metagenomes (including amplicon, WGS, and other) into a single file:

cut -f 1 partie/SRA_Metagenome_Types.txt > metagenome_ids.txt

Next, we extract the sizes of all of those runs using edirect from NCBI. This is a recipe that I have added as a pull request to NCBI’s account. We have to do this in blocks (in this case of 250 entries) so we don’t crash NCBI’s servers. (Note that there were 296678 lines in metagenome_ids.txt when I did this):

IDX=1; while [ $IDX -le 296678 ]; do echo $IDX; sed -n "${IDX},$((IDX+249))p" metagenome_ids.txt > temp; epost -db sra -input temp -format acc | esummary -format runinfo -mode xml | xtract -pattern Row -element Run,bases >> metagenome_sizes.txt; IDX=$((IDX+250)); done
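The batching idea can also be written without hard-coding the number of lines. The following is a minimal sketch, not the recipe submitted to NCBI: `batch_ids` and `process_batch` are hypothetical helper names, and `process_batch` only reports the size of each batch so that the sketch runs offline. For a real run, its body would be replaced with the epost | esummary | xtract pipeline from the loop above.

```shell
#!/usr/bin/env bash
# Sketch: walk an ID list in fixed-size batches without hard-coding its length.

# process_batch is a hypothetical hook. This default only prints how many IDs
# the batch holds; swap in the epost | esummary | xtract pipeline for real use.
process_batch() {
    awk 'END { print NR }' "$1"
}

# batch_ids FILE SIZE: split FILE into SIZE-line chunks and process each in order.
batch_ids() {
    local ids_file=$1 batch_size=$2 tmpdir chunk
    tmpdir=$(mktemp -d) || return 1
    # -a 4 allows far more chunks than the default two-letter suffix.
    split -a 4 -l "$batch_size" "$ids_file" "$tmpdir/batch_"
    for chunk in "$tmpdir"/batch_*; do
        [ -e "$chunk" ] || continue   # nothing to do for an empty input file
        process_batch "$chunk"
    done
    rm -rf "$tmpdir"
}

# Demo on 600 made-up IDs: two full batches and one partial batch.
demo_ids=$(mktemp)
seq 1 600 | sed 's/^/FAKE/' > "$demo_ids"
batch_ids "$demo_ids" 250   # prints 250, 250, 100 on separate lines
rm -f "$demo_ids"
```

Because `split` handles the final short chunk itself, the last few IDs are never skipped or fetched twice.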

(As an aside, if you time this, NCBI delivers approximately 20 responses per second, so 296,678 responses will take slightly over 4 hours to complete.)
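The estimate is just the number of IDs divided by the response rate. As a quick check, with `estimate_hours` being a hypothetical helper name:

```shell
# Estimated run time = number of IDs / responses per second / seconds per hour.
estimate_hours() {
    awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.2f hours\n", n / rate / 3600 }'
}

estimate_hours 296678 20   # prints: 4.12 hours
```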

Finally, we use awk to sum up the bases for each type of run, starting with WGS:

grep WGS ~/GitHubs/partie/SRA_Metagenome_Types.txt | cut -f 1 | grep -wFf - metagenome_sizes.txt | cut -f 2 | awk '{s+=$1}END{print "Total: ",s/1e+12, " Tbp"}'

Changing WGS to AMPLICON or OTHER gives the totals for the other two types. These sizes are in bases, so they describe the uncompressed data. Of course, you can compress it to reduce the size somewhat!
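All three totals can also be computed in one pass rather than by rerunning the command for each type. This is a sketch under an assumption that is not stated in this post: that SRA_Metagenome_Types.txt is tab-separated with the run ID in column 1 and the type in column 2. `sum_by_type` is a hypothetical helper name, and the demo uses made-up IDs and sizes.

```shell
#!/usr/bin/env bash
# sum_by_type TYPES_FILE SIZES_FILE
# ASSUMPTION: TYPES_FILE is "run_id<TAB>type"; SIZES_FILE is "run_id<TAB>bases".
sum_by_type() {
    awk -F'\t' '
        NR == FNR  { kind[$1] = $2; next }        # first file: run ID -> type
        $1 in kind { total[kind[$1]] += $2 }      # second file: add bases per type
        END { for (k in total) printf "%s\t%.4f\n", k, total[k] / 1e12 }
    ' "$1" "$2" | sort
}

# Demo with made-up data (not real SRA runs).
types=$(mktemp); sizes=$(mktemp)
printf 'FAKE1\tWGS\nFAKE2\tAMPLICON\nFAKE3\tWGS\nFAKE10\tOTHER\n' > "$types"
printf 'FAKE1\t1000000000000\nFAKE2\t500000000000\nFAKE3\t2000000000000\nFAKE10\t250000000000\n' > "$sizes"
sum_by_type "$types" "$sizes"
# AMPLICON  0.5000
# OTHER     0.2500
# WGS       3.0000
rm -f "$types" "$sizes"
```

Looking up each run ID as an exact array key also avoids partial matches, for example an ID such as FAKE1 matching the line for FAKE10.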

Dataset    Size (Tbp)
WGS        247.129
AMPLICON   29.8637
OTHER      26.0274
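As a quick check on these numbers, the ratio of WGS to AMPLICON and the overall total follow directly from the three values in the table (`summarize_table` is a hypothetical helper name):

```shell
# Ratio and grand total from the three values in the table above.
summarize_table() {
    awk 'BEGIN {
        wgs = 247.129; amplicon = 29.8637; other = 26.0274
        printf "WGS:AMPLICON ratio: %.1f\n", wgs / amplicon
        printf "Total: %.1f Tbp\n", wgs + amplicon + other
    }'
}

summarize_table
# WGS:AMPLICON ratio: 8.3
# Total: 303.0 Tbp
```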

So, as @chris_osulliva noted, it's roughly 10x WGS to AMPLICON sequencing (about 8x by the numbers above)!