We were curious about how many bp of metagenomes are in the SRA. This was partly inspired by our grant writing, and partly by this question on Twitter from Tom Delmont:
great Rob! Do you have the ratio in term of ‘file volumes’ between WGS and 16S amplicons? Just curious to know if WGS wins on this front 🙂
— tom delmont (@tomodelmont) March 30, 2017
This is how to answer the question!
First, we use partie to get all the metagenomes (including amplicon, WGS, and other) into a single file:
cut -f 1 partie/SRA_Metagenome_Types.txt > metagenome_ids.txt
Next, we extract the sizes of all of those runs using edirect from NCBI. This is a recipe that I have added as a pull request to NCBI’s account. We have to do this in blocks (in this case of 250 entries) so we don’t crash NCBI’s servers. (Note that there were 296678 lines in metagenome_ids.txt when I did this):
IDX=1; while [ $IDX -le 296678 ]; do echo $IDX; sed -n "${IDX},$((IDX+249))p" metagenome_ids.txt > temp; epost -db sra -input temp -format acc | esummary -format runinfo -mode xml | xtract -pattern Row -element Run,bases >> metagenome_sizes.txt; IDX=$((IDX+250)); done
(As an aside, if you time this, NCBI delivers approximately 20 responses per second, so 296,678 responses will take slightly over 4 hours to complete.)
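As a rough sketch of the same batching idea, `split` can cut the ID list into fixed-size blocks up front, which avoids the index arithmetic. The seven run IDs, the block size of 3, and the temporary directory below are made up for the demo; in a real run the edirect pipeline would read each block where the comment indicates. The last line repeats the time estimate using the observed rate of about 20 responses per second.

```shell
# Hypothetical demo input: seven made-up IDs standing in for metagenome_ids.txt
workdir=$(mktemp -d)
printf 'RUN%d\n' 1 2 3 4 5 6 7 > "$workdir/ids.txt"

# split writes consecutive chunks of at most 3 lines: block_aa, block_ab, ...
split -l 3 "$workdir/ids.txt" "$workdir/block_"

n_blocks=0
n_ids=0
for block in "$workdir"/block_*; do
    # In a real run, the epost | esummary | xtract pipeline would read "$block" here.
    n_blocks=$((n_blocks + 1))
    n_ids=$((n_ids + $(wc -l < "$block")))
done
echo "$n_blocks blocks, $n_ids ids"

# Time estimate: 296678 runs at roughly 20 responses per second, in hours
est_hours=$(awk 'BEGIN { printf "%.1f", 296678 / 20 / 3600 }')
echo "about $est_hours hours"
```

Because the final block simply holds whatever lines remain, no IDs are skipped or fetched twice at the end of the file.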
Finally, we use awk to sum up the bases in the runs that are either WGS or Amplicon:
grep WGS ~/GitHubs/partie/SRA_Metagenome_Types.txt | cut -f 1 | grep -Fwf - metagenome_sizes.txt | cut -f 2 | awk '{s+=$1}END{print "Total: ",s/1e+12, " Tbp"}'
Changing WGS to AMPLICON or OTHER gives the totals for the other two types. These sizes are in bases, so they correspond to uncompressed data. Of course, you can compress the data to reduce its size somewhat!
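If rerunning the pipeline once per type gets tedious, a single awk pass can total all the types at once. This is only a sketch on made-up data: it assumes the type sits in the second tab-separated column of the types file, which may not match the real layout of SRA_Metagenome_Types.txt, and the file names and run IDs are hypothetical.

```shell
workdir=$(mktemp -d)
# Made-up demo data; assumed layout: run ID <TAB> type, and run ID <TAB> bases
printf 'RUN1\tWGS\nRUN2\tAMPLICON\nRUN3\tWGS\nRUN4\tOTHER\n' > "$workdir/types.txt"
printf 'RUN1\t1000\nRUN2\t200\nRUN3\t3000\nRUN4\t50\n' > "$workdir/sizes.txt"

totals=$(awk -F'\t' '
    NR == FNR { type[$1] = $2; next }        # first file: remember the type of each run
    ($1 in type) { sum[type[$1]] += $2 }     # second file: add the bases to that type
    END { for (t in sum) print t "\t" sum[t] }
' "$workdir/types.txt" "$workdir/sizes.txt" | sort)
echo "$totals"
```

Looking up each run ID in an array matches the whole ID exactly, so one accession cannot accidentally match another that merely contains it.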
Dataset | Size (Tbp) |
---|---|
WGS | 247.129 |
AMPLICON | 29.8637 |
OTHER | 26.0274 |
So, as @chris_osulliva noted, there is roughly 10x more WGS than AMPLICON sequence (about 8.3x by the totals above)!
its 10:1 WGS:amplicon measured as bytes or bases
— AchillVirusSon (@chris_osulliva) April 10, 2017
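For anyone who wants to check the ratio, it can be computed directly from the two table totals. This is just the arithmetic, with the values copied from the table:

```shell
# WGS and AMPLICON totals in Tbp, copied from the table
wgs=247.129
amplicon=29.8637
ratio=$(awk -v w="$wgs" -v a="$amplicon" 'BEGIN { printf "%.1f", w / a }')
echo "WGS:AMPLICON = $ratio:1"
```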