How to create a phage proteomic tree

It has been a while since the original phage proteomic tree paper came out (twelve years!), and we still don’t have a web-based method for building one.

However, here are the steps and code that we use to make the phage proteomic tree.

  1. We start with three files (example contents are shown below the list):
  • phage_proteins.fasta. This is just a fasta file of all phage proteins
  • phage_protein_genome_length.txt. This is a tab-separated text file with three columns: [the protein identifier (in phage_proteins.fasta); the genome ID that protein comes from; and the length (in amino acids) of the protein]
  • phage_names.txt. Again, a tab separated file with three columns: [the genome id (from the previous file); the full name of the phage; and the abbreviation we will use in the tree].
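
For example (all identifiers, names, and lengths below are made up for illustration), phage_protein_genome_length.txt might look like:

protein_00001	genome_0001	251
protein_00002	genome_0001	98
protein_00003	genome_0002	402

and phage_names.txt might look like:

genome_0001	Example phage phi-1	Phi1
genome_0002	Example phage phi-2	Phi2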

For users with a copy of the SEED, there is a simple set of commands you can run to make these files:

$HOME/bioinformatics/phage_tree/make_fasta.pl > phage_proteins.fasta
$HOME/bioinformatics/phage_tree/seed_prots2genome_tuple.pl phage_proteins.fasta > phage_protein_genome_length.txt
$HOME/bioinformatics/phage_tree/genome_names.pl > phage_names.txt

Otherwise you are on your own, but they are not that hard to make!
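
For example, if your fasta headers happen to look like ">proteinID genomeID" (an assumption; adjust the parsing to whatever your headers actually contain), a one-liner along these lines should build phage_protein_genome_length.txt:

# sketch: print the protein id, genome id, and protein length for each sequence
awk '/^>/ {if (id) print id "\t" g "\t" len; id=substr($1,2); g=$2; len=0; next} {len+=length($0)} END {if (id) print id "\t" g "\t" len}' phage_proteins.fasta > phage_protein_genome_length.txt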

  2. BLAST those proteins against each other.

On our cluster we use the script split_blast_queries, which just distributes the blast jobs around the nodes:

formatdb -pT -i phage_proteins.fasta
~/bioinformatics/cluster/split_blast_queries_edwards.pl -f phage_proteins.fasta -n 60 -p blastp -db phage_proteins.fasta -d phage_proteins_blast -m 8 -e 0.1
cat phage_proteins_blast > phage_proteins.blastp

But on a single computer you can run:

formatdb -pT -i phage_proteins.fasta
blastall -i phage_proteins.fasta -d phage_proteins.fasta -p blastp -m 8 -e 0.1 -o phage_proteins.blastp
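
The -m 8 flag writes tab-separated tabular output, one hit per line: query id, subject id, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, and bit score. A (made-up) line looks something like this, and it is these query/subject pairs that get grouped in the next step:

protein_00001	protein_00042	38.46	104	60	2	5	108	11	110	3e-12	65.9
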
  3. Now we have to make groups of related proteins. These are just any pairs of proteins that appear in the blast output. At this stage we are greedy, since we cull things later on. The code to make these groups (ppt.jar) is memory-intensive, so we run it on a node with a lot of memory.

mkdir fastafiles
echo -e '#!/bin/bash\njava -jar ~/bioinformatics/phage_tree/ppt.jar phage_proteins.blastp 0.1 phage_proteins.fasta fastafiles' > groups.sh
qsub -cwd -S /bin/bash -l h=node1 -pe make 8 ./groups.sh

However, you can also run that on a single machine just using the java command. Remember, it takes a bunch of memory, so don’t run it on your laptop!
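
For example, on a single large-memory machine something like this should work (the -Xmx value is just a guess; give java as much heap as the machine can spare, and the arguments are the same as in groups.sh above):

java -Xmx32g -jar ~/bioinformatics/phage_tree/ppt.jar phage_proteins.blastp 0.1 phage_proteins.fasta fastafiles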

This program writes out the fasta files in a directory (fastafiles) and also a file called id.map that has the original protein ids and the new protein ids (which are just integers). Some of the software we use later on is limited in the number of characters protein names can have, so this is a good point to rename them.

  4. Use make_id_maps.pl to make two files, genome_id.map and protein_genome_lengths.txt, that you need for later steps.

~/bioinformatics/phage_tree/make_id_maps.pl id.map phage_protein_genome_length.txt phage_names.txt

  5. Run clustalw and protdist.

This is the computationally intensive part of the process, and we use our cluster to make it go quickly. With 600 nodes it doesn’t take long, and in fact most of the time is spent on data transfer, moving files across the network to be computed on. Go InfiniBand!

We need to find out how many files we have and make some directories:

NUMFILES=$(ls fastafiles/ | sed -e 's/fasta.//' | sort -nr | head -n 1); echo "There are $NUMFILES files to process"
mkdir sge_output aligned dnd protdist

Then we create two shell scripts that we use to process the data. Note that we use SGE as our scheduler, and we rely on its array-job capabilities, which (I think) are not available on PBS.

clustalw.sh has:

#!/bin/bash
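# one alignment per array task: align fastafiles/fasta.N with clustalw, writing a phylip alignment and its guide tree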
 /home/redwards/bin/clustalw -infile=fastafiles/fasta.$SGE_TASK_ID -outfile=aligned/clustalw.$SGE_TASK_ID -output=phylip -newtree=dnd/tree.$SGE_TASK_ID.dnd -align

queue_protdist.sh has:

#!/bin/bash
 touch outfile; # this makes sure that the outfile exists, so protdist asks for the correct name
# echo what we need to protdist since it is supposed to be an interactive program
 /bin/echo -e "aligned/clustalw.$SGE_TASK_ID\nF\nprotdist/protdist.$SGE_TASK_ID\ny" | /home/redwards/bin/protdist

and make them both executable:

chmod +x clustalw.sh queue_protdist.sh

submit the clustal job – this will give you a new job number

qsub -S /bin/bash -cwd -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 ./clustalw.sh

(e.g. the output will be: “Your job-array 246290.1-57191:1 (“clustalw.sh”) has been submitted” and the new job number is 246290)

and submit the protdist job so that it waits until the previous job has finished processing all of its data

qsub -S /bin/bash -cwd -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 -hold_jid <previous job number> ./queue_protdist.sh

(e.g. qsub -S /bin/bash -cwd -o sge_output/ -e sge_output/ -t 1-$NUMFILES:1 -hold_jid 246290 ./queue_protdist.sh)

Note: SGE has a limit of 75000 elements in an array job, and so nowadays we have to break this up into two or more separate submissions (e.g. 1-75000; 75001-$NUMFILES)
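
For example, for the clustalw step that split might look like this (illustrative; the same applies to the protdist submission):

qsub -S /bin/bash -cwd -o sge_output/ -e sge_output/ -t 1-75000:1 ./clustalw.sh
qsub -S /bin/bash -cwd -o sge_output/ -e sge_output/ -t 75001-$NUMFILES:1 ./clustalw.sh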

  6. The heavy lifting is done; now we need to clean up the output. First, remove empty protdist files, and then reformat them so that they have one result per line using rewrite_protdists.pl:

rm -f `find protdist -size 0`
~/bioinformatics/phage_tree/rewrite_protdists.pl protdist protdist.fixed
  7. Now we need to make the script to combine the matrix files, so we make a shell script called matrix.sh.

This script, combineprotdists.by.length.nomysql.noolds.pl, is one of the originals, and it is very slow. We have tackled rewriting it a couple of times but never quite finished.

echo -e '#!/bin/bash\n~/bioinformatics/phage_tree/combineprotdists.by.length.nomysql.noolds.pl protdist.fixed matrix 655 -d protein_genome_lengths.txt -p 10 -l -lp > out 2> err' > matrix.sh

and submit that to the cluster:

chmod a+x matrix.sh; qsub -q important -S /bin/bash -cwd -o sge_output -e sge_output ./matrix.sh

This takes a while to process, and so you can monitor the progress with tail -f sge_output/matrix.sh.eXXXXXX (where XXXXXX is your job number).

The output from this is matrix and matrix.nosubreplicates. You will most likely want to use the latter, which is just the matrix of distances between genomes.

Optional: To delete a genome from the matrix, use

perl ~/bioinformatics/phage_tree/delete_a_genome_from_matrix.pl -m matrix.nosubreplicates -n 115 -o matrix.nosubreplicates.115

This will make a new version of the matrix with the column and row of the offending genome deleted. It will tell you which genome was deleted, along with the score before it, the genome's own score, and the next score.

  8. Run neighbor and build a tree:
mkdir neighbor/
 cp matrix.nosubreplicates neighbor/infile
 cd neighbor/
 echo -e "j\n133\ny\n" | neighbor
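 # j randomizes the input order of the species, 133 is the (odd) random number seed, and y accepts the settings and runs the analysis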
 (accept defaults)
 cp outtree ../raw.tree
 cd ..
  9. Rename the leaves on the tree. This uses the naming information from step 1 (via genome_id.map, created in step 4) and a renaming script, rename_tree_leaves.pl.
 ~/bioinformatics/phage_tree/rename_tree_leaves.pl genome_id.map raw.tree > renamed_full.tree

Optional: Prepare an XML version of the tree using newick2phyloxml.pl from BioPerl:

perl ~/bioinformatics/phage_tree/newick2phyloxml.pl renamed_full.tree renamed_full.xml