There is lots of crAssphage in the world, and there are lots of metagenomes in the sequence read archive. Can we find those metagenomes that do, or do not, have crAssphage in them in the SRA? Lets try…
First, we are going to need a list of all the fecal metagenomes in the SRA. We can create that list using one of the many SQL queries we use to demonstrate searching the SRA. Specifically, we use this query that gives us a list of all the Run IDs where the word fecal or feces appears in the title or abstract (note we probably miss the UK faecal samples, but they should learn to spell!)
sqlite3 SRAmetadb.sqlite "select run_accession from run where run.experiment_accession in (select experiment.experiment_accession from experiment where experiment.study_accession in (select study.study_accession from study where (study.study_title like '%feces%' or study.study_title like '%fecal%' or study.study_abstract like '%feces%' or study.study_abstract like '%fecal%') AND (study.study_title like '%human%' or study.study_abstract like '%human%')))" > human_feces.txt
This gave me a list of 48,944 SRA runs that were probably from human feces. Some of these were probably metagenomes, and so I filtered those using PARTIE for just the WGS metagenomes:
grep WGS partie/SRA_Metagenome_Types.txt | cut -f 1 | grep -Fxf - human_feces.txt > human_feces_wgs.txt
Note in this command the SRA_Metagenome_Types.txt is from PARTIE and only has two columns. The first is the ID, and the second is the SRR id. We only want the WGS metagenomes. The grep is looking for exact matches without a regular expression (makes it a lot faster), and is reading the patterns, one per line, from STDIN.
This narrows my list to 3,718 WGS metagenomes. You can download the complete list here.
Next, we need a list of SRA Runs that have crAssphage. Luckily, we have created that as part of our global crAssphage survey. This file has additional columns, and we only want the first column which is the run ID. There are 10,260 metagenomes here, as we have other metagenomes including sewage, skin, and oral metagenomes that have crAssphage in them.
Lets use our same grep to find the IDs from human fecal metagenomes that have crAssphage:
cut -f 1 ~/GitHubs/crAssphage/Metagenomes/Metadata/sample_ids.txt | grep -Fxf - human_feces_wgs.txt > human_feces_wgs_crAssphage.txt
This gives us the 720 human fecal WGS samples that have crAssphage in them. We also need the opposite list, those that are not found. Simply adding a -v to the grep inverts the match:
cut -f 1 ~/GitHubs/crAssphage/Metagenomes/Metadata/sample_ids.txt | grep -Fvxf - human_feces_wgs.txt > human_feces_wgs_no_crAssphage.txt
Now we have two files:
- human_feces_wgs_crAssphage_20170831.txt This file has 720 SRA run IDs
- human_feces_wgs_no_crAssphage_20170831.txt This file has 2,998 SRA run IDs.
I’ve also created a single file (human_feces_wgs_sizes_20170831.txt) that has the sizes of all of these metagenomes (in bp). This has all the metagenomes, and I leave it as an exercise to split this file based on whether they have crAssphage or not, and to sort the file by size. The average size of the metagenomes is 1,448,230,000 bp
The next step is to download some of the runs and analyze them. You can download them using fastq-dump as described here.