crAssphage poops up again!

Yet again, analysis of a metagenomic sample shows that crAssphage is the most abundant phage anywhere. It also shows what a dis-service NCBI did to science by deleting the crAssphage record. We used meta-spades to reconstruct the entire crAssphage genome from someone else’s data set, but in their paper, the largest contig was ~3 kb. This analysis suggests that crAssphage is present in ulcerative colitis samples but the abundance goes down after treatment!

Mapping the raw reads to crAssphage

The Bushman lab at UPenn (where I used to work!) recently published a terrific paper on measuring the virome after human fecal transplants in patients used to treat ulcerative colitis. (Chehoud C, Dryga A, Hwang Y, Nagy-Szakal D, Hollister EB, Luna RA, Versalovic J, Kellermayer R, Bushman FD. 2016. Transfer of Viral Communities between Human Individuals during Fecal Microbiota Transplantation. mBio 7:e00322–16.)

We were curious about whether they had found any crAssphage in their studies – we’d expect they would, but they didn’t report any hits to crAssphage in their paper. Of course, since NCBI unilaterally deleted the crAssphage record it is likely that they never even had it in their database. [Note that you can get the crAssphage record from the NCBI if you specifically look for it, from phantome, or from our global crAssphage survey.]

Thanks to folks in the Bushman lab, we got the SRA records associated with this study. [As an aside, you’ll note the complexity of the SRA metadata: these are listed as amplicon metagenome sequences when they are viromes. We will have a post on that soon]. There are 18 samples in this data set, so we grabbed them all and used bowtie to map those reads to crAssphage using the steps we outlined elsewhere.


SRA run Number of reads Percent of reads that map to crAssphage
SRR3403834 4892517 35.81%
SRR3403848 4367579 0.08%
SRR3403845 1572115 0.01%
SRR3403844 773975 0.00%
SRR3403847 631600 0.00%
SRR3403846 737878 0.00%
SRR3403841 708 0.00%
SRR3403840 1001070 0.00%
SRR3403850 767738 0.00%
SRR3403842 20160 0.00%
SRR3403835 646661 1.70%
SRR3403836 746002 0.27%
SRR3403837 479366 0.00%
SRR3403843 756227 0.00%
SRR3403832 1022414 0.00%
SRR3403833 596273 0.00%
SRR3403838 664633 0.00%
SRR3403849 680576 0.00%

crAssphage is everywhere – in one run, 35% of the reads map to crAssphage.

How do these map to the patients?

We filtered for only the runs that had matches to crAssphage, and mapped the sequences that have crAssphage to the labels in the SRA data set:

Sample ID SRR ID Percent of reads that map to crAssphage Number of reads that map to crAssphage
P1_t0_pre_r2 SRR3403845 0.01%  267
P1_t1_post_r2 SRR3403846 0.00%  2
P1_t2_post_r2 SRR3403847 0.00%  2
P2_t0_pre_r2 SRR3403848 0.08%  4,410
P3_t0_pre_r2 SRR3403834 35.81%  1,816,814
P3_t1_post_r2 SRR3403835 1.70%  112,18
P3_t2_post_r2 SRR3403836 0.27%  2,119
S_1014_r2 SRR3403837 0.00%  9


Is it really crAssphage?

We were curious to see how the reads aligned to the crAssphage genome, so a few minutes with the awesome anvi’o shows us coverage of crAssphage.


We have the eight samples that have hits (as shown in the table above), and plotted those reads around the genome. This looks like uniform coverage to me!

We took the sample with the most reads (SRR3403834) and pulled all the pairs of sequences that mapped to crAssphage (we pulled any sequence where either pair of the reads mapped to crAssphage), and assembled them independently. Initially we used SPAdes 3.5, but the results were not that impressive. Considering only contigs >1kb

SPAdes 3.5

number of seqs 662
total 902,781
average 1363.72
N50 1,325
shortest >NODE_662_length_1001_cov_2136.3_ID_1323 (1,001 bp)
Longest >NODE_1_length_2670_cov_1441.69_ID_1 (2,670 bp)
But then we switched to the new SPAdes 3.7 with the –meta flag, AKA metaspades. Again only showing contigs > 1kb.

metaspades 3.7

number of seqs 6
total 97,158
average 16,193
N50 24,723
shortest >NODE_6_length_1684_cov_1958.39_ID_29459159 (1,684 bp)
Longest >NODE_1_length_51019_cov_5883.5_ID_29459149 (51,019 bp)

Note that crAssphage is only 97,092 bp and using metaspades we recovered only 97,158 bp in contigs > 1kb. i.e. it seems like we pretty much covered the whole thing! Note that this was a de novo assembly just based on the reads, and was not a scaffolded assembly!

We used scaffold_builder to orient the contigs based on the crAssphage genome, kablammo to visualize the alignment:


We have the whole genome, except for a couple of gaps, it is contiguous with crAssphage, and yet again we have recovered the whole genome.

So what about the biology?


Take a look back at the second table (under “how do these map to the patients”): Notice how all three of the patients have high crAssphage levels before treatment but it goes down after.

I don’t believe the libraries with <200 reads are significant matches. For example,  sample S_1014_r2 is described in the SRA as “Each amplicon library was constructed by amplifying the 16S rRNA gene using the BSF8 and and the BSR357 primer pair. Primers contained DNA barcode sequences such as those described by Hamady et al. 2008, and the recommended 454 adapter sequences. Amplification conditions are described in McKenna et al 2007, with exception that the polymerase used was AccuPrime (Invitrogen, Carlsbad, CA, USA).” Only 9 reads  from this library map to the phage, but it suggests that it was sequenced after one of the other samples, perhaps SRR3403834.

I am not sure about P1_t0_pre_r2. It only has 267 reads, but the anvi’o figure clearly shows that those reads are around the genome. I think that sample should be checked using our PCR primers.

P2_t0_pre_r2 also clearly has crAssphage in the sample.

However it is clear that patient 3 had crAssphage in both pre- and post-transplant samples, although the abundance is going down over time.

crAssphage is present in all the UC samples before treatment. We wonder whether crAssphage is affecting ulcerative colitis in these patients!