CAT/BAT by Bastiaan von Meijenfeldt and Bas Dutilh is a terrific tool for assigning taxonomy to contigs or metagenome bins. However, phages, and especially prophages, cause some problems because their taxonomy clashes with the bacterial taxonomy. Here is how to update the taxonomic profiles to handle prophages somewhat better.
A note of caution
This is still a work in progress, and I would appreciate feedback on it!
Identifying prophages
We used PhiSpy to identify prophages in 666,608 bacterial genomes available in the GenBank assembly summary. We found 89,693,516 phage proteins in 511,721 bacterial genomes (76.8% of genomes). We will have more to say about that effort separately, but we summarize our findings in this two-column table of protein ID (locus tag) and protein function.
Using that file, we updated the original 2020-07-19.nr.fastaid2LCAtaxid from CAT to indicate proteins that came from phages. For this iteration we have used the taxonomy ID 10239, which is the generic virus taxonomy ID. We’re discussing whether to be more specific in this assignment.
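To make the idea concrete, here is a minimal sketch of how such a relabelling could be done. It is not the script we used. It assumes that both files are tab-separated, that the first column of each holds the same kind of protein identifier, and that the second column of the fastaid2LCAtaxid file is the taxonomy ID. The function name relabel_phage_proteins is made up for this illustration.

```shell
# Hypothetical helper: print a copy of an id-to-taxid file in which every
# protein listed in the phage table gets the given taxonomy ID.
# Assumption: both inputs are tab-separated and keyed on column 1.
relabel_phage_proteins() {
    local phage_table="$1" id2taxid="$2" taxid="${3:-10239}"
    awk -F'\t' -v OFS='\t' -v t="$taxid" '
        NR == FNR { phage[$1] = 1; next }  # first file: remember phage protein IDs
        ($1 in phage) { $2 = t }           # second file: overwrite the taxid
        { print }                          # print every row, changed or not
    ' "$phage_table" "$id2taxid"
}
```

If the identifiers in the two files differ (for example locus tags in one and accessions in the other), an extra mapping step would be needed before this works.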
If you have installed (and preferably run) CAT, here are the steps to update the assignments.
- I recommend copying the entire CAT database directory, so that you always have the original to go back to, e.g.
  cp -r 2020-06-18_CAT_database 2020-06-18_CAT_database.backup
  If you still have the tarball you downloaded from CAT/BAT, you may skip this step.
- Rename the existing fastaid2LCAtaxid file so that you can still use it as desired.
  mv 2020-06-18.nr.fastaid2LCAtaxid 2020-06-18.nr.fastaid2LCAtaxid.original
- Download and uncompress the revised fastaid2LCAtaxid file. The current version is 2020-07-19.nr.phage.fastaid2LCAtaxid
- Create a symbolic link to that file with the original file name.
  ln -s 2020-07-19.nr.phage.fastaid2LCAtaxid 2020-06-18.nr.fastaid2LCAtaxid
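A quick sanity check after swapping the files is to count how many proteins carry the virus taxonomy ID before and after. The sketch below assumes the fastaid2LCAtaxid file is tab-separated with the taxonomy ID in the second column; count_taxid is a made-up helper name, not part of CAT.

```shell
# Hypothetical helper: count the rows of a tab-separated id-to-taxid file
# whose second column equals the given taxonomy ID.
count_taxid() {
    local file="$1" taxid="$2"
    # n + 0 makes awk print 0 rather than an empty line when nothing matches
    awk -F'\t' -v t="$taxid" '$2 == t { n++ } END { print n + 0 }' "$file"
}

# Usage, with the file names from the steps above:
#   count_taxid 2020-06-18.nr.fastaid2LCAtaxid.original 10239
#   count_taxid 2020-06-18.nr.fastaid2LCAtaxid 10239   # follows the symlink
```

If the link is in place, the second count should be larger than the first.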
You are now set to rerun CAT with the new prophage-enhanced data set. If you have already run CAT, I suggest these steps:
- Make a new directory for the outputs, e.g.
  mkdir CAT_phage && cd CAT_phage
- Link the outputs from your previous run into this directory. I use symbolic links here, but you could also use hard links.
  ln -s ../contigs.fasta
  ln -s ../out.CAT.alignment.diamond
  ln -s ../out.CAT.predicted_proteins.faa
  ln -s ../out.CAT.predicted_proteins.gff
- Run CAT again, this time with the database pointing at the phage data above.
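The rerun in the last step might look like the sketch below. It uses the CAT contigs options for supplying existing predicted proteins (-p) and an existing DIAMOND alignment (-a), so that only the classification is repeated. The function name rerun_cat, the output prefix, and the directory paths in the usage line are assumptions; adjust them to your own layout.

```shell
# Hypothetical wrapper: rerun CAT on the linked files from the previous run.
# $1 = CAT database directory (the one holding the relinked fastaid2LCAtaxid)
# $2 = CAT taxonomy directory
rerun_cat() {
    local db_dir="$1" tax_dir="$2"
    CAT contigs \
        -c contigs.fasta \
        -d "$db_dir" \
        -t "$tax_dir" \
        -p out.CAT.predicted_proteins.faa \
        -a out.CAT.alignment.diamond \
        -o out.CAT
}

# Usage from inside CAT_phage (paths are assumptions):
#   rerun_cat ../2020-06-18_CAT_database ../2020-06-18_taxonomy
```

Because the proteins and the alignment are reused, this should be much quicker than the first run.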
Comparing the outputs
Of course you should spend some time looking at the outputs, but a trivial way to compare them is to count the superkingdoms. Once you have assigned the names, you can do it like this:
cut -f 4 names.txt | sort | uniq -c
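To put the two runs side by side, the same count can be wrapped in a small function that skips header lines and sorts by abundance. This is only a sketch: tally_column is a made-up name, the default column of 4 simply follows the command above, and the path to the names file from the original run is an assumption. Check the header line of your own file to confirm which column holds the superkingdom.

```shell
# Hypothetical helper: tally the values in one column of a tab-separated
# names file, ignoring lines that start with '#', most frequent first.
tally_column() {
    local file="$1" col="${2:-4}"
    awk -F'\t' -v c="$col" '
        !/^#/ { n[$c]++ }
        END { for (k in n) printf "%d\t%s\n", n[k], k }
    ' "$file" | sort -k1,1nr -k2
}

# Usage (the first path is an assumption about where the original run lives):
#   tally_column ../names.txt   # original database
#   tally_column names.txt      # prophage-aware database
```

Contigs that move from Bacteria to Viruses between the two tallies are the ones worth inspecting first.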