Pangenomes

A project that began with the question “how many microbial genes are there in the world?” has grown into one that may answer both that question and broader ones about the microbial universe. First, known taxa (E. coli) were organized into matrices, with strains as rows and proteins as columns. Hamming distances between the rows define a metric for organizing strains into phylogenetic trees. The phylogenetic distance measures the importance of the split between strains, referred to as the alpha score in the d-splits literature. This approach became our main focus when we tried the same heuristic on viral data, with surprisingly strong results. At present, we are taking “pie slices” of the phage proteomic tree and seeing to what extent we can recreate the internal structure observed there, as a “proof of concept” for viral applicability. Reading and work on SplitsTree, d-splits, and the consecutive ones property will drive the next developments. In addition, on August 18th of this coming week, our group will attend a lecture on whole genome taxonomy, which should help drive further progress on the project.
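The matrix and distance step can be illustrated with a minimal Python sketch. It assumes the matrix entries are binary presence/absence values, which the summary does not state outright, and every strain name, protein label, and function name below is hypothetical toy material rather than the group's actual data or code.

```python
from itertools import combinations

# Hypothetical toy data: each strain maps to the set of proteins it carries.
strain_proteins = {
    "strain_A": {"p1", "p2", "p3"},
    "strain_B": {"p1", "p2", "p4"},
    "strain_C": {"p1", "p5"},
}


def presence_matrix(strain_proteins):
    """Build a strains-by-proteins matrix (assumed binary presence/absence).

    Returns (strains, proteins, rows), where rows[i][j] is 1 if
    strain i carries protein j and 0 otherwise.
    """
    strains = sorted(strain_proteins)
    proteins = sorted(set().union(*strain_proteins.values()))
    rows = [
        [1 if p in strain_proteins[s] else 0 for p in proteins]
        for s in strains
    ]
    return strains, proteins, rows


def hamming(row_a, row_b):
    """Count the columns (proteins) at which two strain rows differ."""
    return sum(a != b for a, b in zip(row_a, row_b))


def pairwise_hamming(strains, rows):
    """Map each unordered pair of strains to its Hamming distance."""
    return {
        (strains[i], strains[j]): hamming(rows[i], rows[j])
        for i, j in combinations(range(len(strains)), 2)
    }


strains, proteins, rows = presence_matrix(strain_proteins)
distances = pairwise_hamming(strains, rows)
for pair, d in distances.items():
    print(pair, d)
```

In this toy example strain_A and strain_B differ at two proteins, while strain_C differs from each of them at three, so a tree built from these distances would group strain_A with strain_B first. The resulting distance table is the kind of input that tree-building or split-decomposition methods take.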