Percent similarity at different taxonomic levels

We are interested in knowing when two things are in the same, or different, taxonomic groups, and so we were curious about when two things would be considered the same based on protein similarity. We used a couple of Python scripts and a couple of Perl scripts to generate all the pairwise alignments and calculate the average protein similarity at different taxonomic levels.

Edit: This post has been updated. The original results are at the bottom. I redid the calculations without preclustering at 90%, as suggested by @pangenomics on Twitter while the computation was running.

Edit 2: As suggested by Lewis_X, I have included violin plots below (because I don’t know how to jitter my outliers, but it does sound fun!)

As part of GenomePeek we have curated databases of different proteins. For this test we chose just the RecA protein, a well-known protein involved in recombination. All of the data and code are available in our GitHub repository. The RecA data set has 6,633 proteins in it, and we computed pairwise alignments for all n-choose-2 combinations of them using the ultra-fast Needleman-Wunsch alignment code that we forked from Isaac Turner. We ran this on ~22 million pairs of proteins, just printing out the two input files and the percent identity between them. This took a long time on one of our compute nodes, and it is limited by file I/O rather than compute cycles (i.e., we can compute the alignments as fast as the disk can read the input and write the output!).
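To make the pair count and the percent identity calculation concrete, here is a minimal Python sketch. It is not the forked Needleman-Wunsch code we actually ran: the scoring values (match 1, mismatch -1, gap -1), the definition of percent identity (identical columns divided by alignment length), and the function names are all assumptions made for illustration.

```python
from itertools import combinations
from math import comb


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b with simple linear gap costs (assumed scores)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Identical columns / alignment length * 100 (one of several definitions)."""
    aln_a, aln_b = needleman_wunsch(a, b)
    if not aln_a:
        return 0.0
    same = sum(1 for x, y in zip(aln_a, aln_b) if x == y)
    return 100.0 * same / len(aln_a)


def pairwise_identities(seqs):
    """Yield (name_a, name_b, percent identity) for every unordered pair."""
    for (name_a, seq_a), (name_b, seq_b) in combinations(seqs.items(), 2):
        yield name_a, name_b, percent_identity(seq_a, seq_b)


# 6,633 proteins give this many unordered pairs.
print(comb(6633, 2))
```

The pure-Python alignment is quadratic in sequence length and far too slow for 22 million pairs; it is only meant to show what is being measured.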

For FOCUS2, we also developed a normalized taxonomy file that has all of the NCBI taxonomy, but only for selected taxonomic levels (namely: kingdom, phylum, class, order, family, genus, species, and strain). Armed with these two pieces of information, let's calculate the average percent identity for proteins whose source organisms belong to the same group at each taxonomic level. We munged the taxonomy table a little bit to match up the names of the organisms with those in our RecA file, and ended up with a file that has the name from the RecA file followed by its taxonomy.

First, we use pairwise_percent_ids.py to calculate a table of all IDs and the pairwise percent identity between them [figshare]. Then we use a Perl script [average_pairwise.pl] to join the data from that pairwise percent identity file to the taxonomy file, and to test, at each taxonomic level, whether the two proteins in a pair belong to the same group. This generates a JSON file that has each taxonomic level and a list of the percent identities for the pairs that match at that level.
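The join step might look roughly like the following Python sketch. This is not average_pairwise.pl: the in-memory data layout, the function name, and the rule that two proteins "match" at a level when both have the same non-empty name there are assumptions, and the IDs and identity values in the toy example are invented.

```python
import json
from collections import defaultdict

LEVELS = ["kingdom", "phylum", "class", "order",
          "family", "genus", "species", "strain"]


def group_by_shared_level(pairs, taxonomy):
    """Collect percent identities under every level at which a pair agrees.

    pairs    -- iterable of (id_a, id_b, percent_identity)
    taxonomy -- dict mapping id -> {level: name}
    """
    shared = defaultdict(list)
    for id_a, id_b, pid in pairs:
        tax_a = taxonomy.get(id_a)
        tax_b = taxonomy.get(id_b)
        if tax_a is None or tax_b is None:
            continue  # no taxonomy for one of the proteins
        for level in LEVELS:
            name_a = tax_a.get(level)
            # A missing or empty name never counts as a match (assumption).
            if name_a and name_a == tax_b.get(level):
                shared[level].append(pid)
    return {level: shared[level] for level in LEVELS if level in shared}


# Toy input with invented identity values, only to show the output shape.
toy_taxonomy = {
    "recA_a": {"kingdom": "Bacteria", "phylum": "Proteobacteria", "genus": "Escherichia"},
    "recA_b": {"kingdom": "Bacteria", "phylum": "Proteobacteria", "genus": "Salmonella"},
    "recA_c": {"kingdom": "Bacteria", "phylum": "Firmicutes", "genus": "Bacillus"},
}
toy_pairs = [("recA_a", "recA_b", 90.0),
             ("recA_a", "recA_c", 60.0),
             ("recA_b", "recA_c", 61.0)]

print(json.dumps(group_by_shared_level(toy_pairs, toy_taxonomy), indent=2))
```

Under this rule a pair that matches at a low level is also counted at every higher level it shares, which is consistent with the pair counts shrinking from kingdom down to strain.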

Finally, we use a Python script [plot_pairwise_percents.py] to plot that data as box plots.

Here are the results:

For all 21,995,028 pairs of (non-identical) RecA proteins, the average similarity is 58.81% and the median similarity is 59.89%.

Here is the breakdown by phylogenetic level:

| Phylogenetic level | Number of similarities | Mean percent identity | Median percent identity |
|---|---|---|---|
| kingdom | 19,059,690 | 61.39 | 60.30 |
| phylum | 6,331,565 | 69.30 | 66.84 |
| class | 3,167,385 | 75.39 | 72.21 |
| order | 1,289,635 | 83.40 | 88.17 |
| family | 849,970 | 90.95 | 94.83 |
| genus | 418,628 | 93.07 | 99.3 |
| species | 241,150 | 98.62 | 100.0 |
| strain | 7,084 | 96.68 | 100.0 |
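The count, mean, and median columns can be derived from a per-level JSON file with a few lines of Python. This is a sketch only: it assumes the JSON maps each level name to a flat list of percent identities, and the function name is made up for this example.

```python
import json
from statistics import mean, median


def summarize(levels_json):
    """Return (level, n, mean, median) rows from a JSON string of lists."""
    data = json.loads(levels_json)
    rows = []
    for level, values in data.items():
        if not values:
            continue  # nothing to average at this level
        rows.append((level, len(values),
                     round(mean(values), 2), round(median(values), 2)))
    return rows


# Invented values, chosen so that one low identity pulls the mean down.
for row in summarize('{"genus": [90.0, 100.0, 98.0]}'):
    print(*row, sep="\t")
```

A mean that sits below the median, as in several rows of the table, is what a tail of low-identity pairs would produce.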

And here is the graphical summary of the similarity at different phylogenetic levels.

[Figure: percent_identity, box plots of percent identity at each phylogenetic level]

Here are the violin plots. The red lines are the mean of the data.

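[Figure: ViolinPlot, violin plots of percent identity at each phylogenetic level, with the mean marked in red]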
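For anyone who wants to draw something similar, here is a minimal matplotlib sketch of violin plots with the mean marked in red. It is not plot_pairwise_percents.py; the function name, figure size, and input layout (a dict of level name to list of percent identities) are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt


def plot_violins(identities_by_level, out_path=None):
    """Draw one violin per level and color the mean markers red."""
    levels = list(identities_by_level)
    data = [identities_by_level[level] for level in levels]
    fig, ax = plt.subplots(figsize=(8, 4))
    parts = ax.violinplot(data, showmeans=True, showextrema=False)
    parts["cmeans"].set_color("red")
    # Violins are drawn at x = 1..N by default, so label those positions.
    ax.set_xticks(range(1, len(levels) + 1))
    ax.set_xticklabels(levels)
    ax.set_ylabel("Percent identity")
    if out_path:
        fig.savefig(out_path)
    return fig, ax
```

Calling `plot_violins(data, "violins.png")` with the per-level lists writes the figure to a file; the file name here is just an example.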

 


Here is the original data with 90% pre-clustering with CD-HIT for comparison to the full data set above.

The original post used a data set with 1,883 proteins that had been reduced from the data set above using CD-HIT with a 90% cutoff.

For all 3,543,806 pairs of (non-identical) RecA proteins, the average similarity is 53.95% and the median similarity is 58.64%.

Here is the breakdown by phylogenetic level:

| Phylogenetic level | Number of similarities | Mean percent identity | Median percent identity |
|---|---|---|---|
| kingdom | 2,559,426 | 58.75 | 59.78 |
| phylum | 590,066 | 64.44 | 65.05 |
| class | 206,872 | 68.19 | 69.9 |
| order | 71,824 | 70.17 | 70.46 |
| family | 35,730 | 69.26 | 71.51 |
| genus | 9,266 | 76.90 | 79.81 |
| species | 410 | 74.49 | 79.79 |
| strain | 272 | 73.26 | 78.09 |

And here is the graphical summary of the similarity at different phylogenetic levels.

[Figure: PercentSimilarity]

These pre-clustered data suggest that on average proteins are ~75% identical at the genus/species/strain level, with very little (and not statistically significant) difference between those levels. The average similarity drops to the high 60s/low 70s for family, class, and order, and kingdom and phylum fall in the 50% to mid-60% range.