Research

Connecting Genotype and Phenotype

The Edwards and Dinsdale labs at SDSU are collaborating on a project to connect the genotypes and phenotypes of microbes. Specifically, we are sequencing the genomes of microbes and assaying their growth in 96-well microtiter plates. By combining these data with state-of-the-art machine learning tools, we will be able to identify the function of unknown genes, identify gaps in our knowledge of microbial metabolism, and build more robust metabolic models.

The lack of high-throughput phenotypic data is hindering our ability to describe the metabolic potential of microbes based on their genome annotation, and our ability to predict gene complements from phenotypic analyses. To effectively understand microbes and their interactions with their environment, we need to be able to predict their metabolic capabilities. Conversely, if we detect a phenotypic change in microbes, such as increased copper tolerance, we need to be able to predict the genes that are responsible. In this grant we will transform genotypic to phenotypic analysis in microbes, create high-throughput analyses of phenotypes, and create a rapid metabolic modeling process for the microbial tree of life.

We will provide the capability to reliably predict phenotype from genotype for microbial life across the bacterial kingdom. This work will result in genome sequences, annotations, and metabolic models for ~100 microorganisms, with potential for propagation to all sequenced microbial genomes. We will produce a publicly available database of growth of these microbes on over 150 different media, and we will develop new methods for the analysis and utilization of that growth data. We will generate new draft genome sequences and transcriptomes, and create new web resources for querying, comparing, sharing, and analyzing growth phenotype data for microbial genomes.

Broader Impacts of this work include Teaching: We developed two capstone courses in biology that have increased our students’ ability to conduct science and have been recognized as transformative by the NSF. This proposal will directly contribute to the education of minority students: SDSU is a Hispanic-Serving Institution and we promote diversity in our labs, having taught classes with 30% Hispanic, 10% Filipino, and 10% Asian students. Our teaching includes general education classes and lower-division and upper-division major classes, reaching more than 400 students per year. We guest lecture in summer REU and Bridges classes. Release of data: We have a long history of releasing our data in both raw and analyzed form, and this will continue with this grant, including innovative websites and APIs for data exchange. International activities: Students have the opportunity to work and interact with scientists from Brazil, Chile, Mexico, and Australia. Outreach: To inform the general community, we have given lectures at public forums and contributed to the popular press, such as National Geographic and “Good Morning Brazil” on Rede TV. We talk at schools in San Diego, Australia, and Brazil about environmental microbes, oceanography, conservation, and college education. We support the SDSU Coastal Waters Institute Open Day, which over 300 people from the local community attend.

Dark Matter

The viral dark matter is all the sequences that we find in metagenomes that we don’t know what to do with. In a project funded by the National Science Foundation, together with Dr. Forest Rohwer and Dr. Anca Segall in the SDSU Biology Department, and Dr. Alex Burgin, we will tackle some of this dark matter. We’re going to combine metagenomics, metaproteomics, metabolomics, and structural biology to unearth the functions of sets of genes whose roles are currently unknown.

Halophile Genome Sequencing

Together with the Eisen and Facciotti labs at UC Davis, we sequenced and annotated the genomes of eight halophilic archaea. These are bugs that love to grow in environments with very high salt, and are often found in solar salterns, the crystallizer ponds where salt is dried out of seawater. The challenge for this project was that all of the sequence annotation and analysis was completed at the American Society for Microbiology’s 2008 annual meeting in Philadelphia, PA. We love these kinds of crazy challenges, and were able to annotate the data, make new discoveries, and present them to the audience within hours. For more information, visit the halophile website.

Coral Reef Informatics

Coral reefs are among the most diverse environments in the world, and we are helping biologists, ecologists, geologists, and engineers to solve some of the informatics problems associated with coral reef research. Our work covers all aspects of informatics support to enable work on coral reefs. We have developed:

High-performance computing solutions to calculate fractal dimensions of coral reefs.
Image analysis tools for images of the reef floor taken from moving boats.
A genomic database of corals, sponges, bacteria, viruses, and metagenomes.
Hardware and software tools for detecting movement in coral reefs.
Software to automatically detect corals in images.

Identifying Prophages in Bacterial Genomes

Finding prophages in microbial genomes remains a problem with no definitive answer. The majority of existing tools rely on detecting genomic regions enriched in proteins with known phage homologs, which hinders the de novo discovery of phage regions. In this study, a weighted phage detection algorithm, Phage_detector, was developed based on seven distinctive characteristics of prophages: protein length, transcription strand directionality, customized AT skew, customized GC skew, the abundance of unique phage words, phage insertion points, and the similarity of phage proteins. The first five characteristics can identify prophages without any sequence similarity to known phage genes. Phage_detector locates prophages by ranking genomic regions enriched in distinctive phage traits. In 95 complete bacterial genomes this led to the successful prediction of 92% of prophages (including 33 previously unidentified prophages), with an 8% false-negative rate and an 18% false-positive rate.
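As a rough illustration of one of the sequence-only characteristics, the sketch below computes GC and AT skew in sliding windows along a genome. It uses the textbook definitions, (G - C)/(G + C) and (A - T)/(A + T), which are an assumption here: the customized skew measures used by Phage_detector are not described above. The function names, window size, and step size are hypothetical.

```python
def skew(seq, a, b):
    """Return (count(a) - count(b)) / (count(a) + count(b)), or 0.0 if neither base occurs."""
    seq = seq.upper()
    na, nb = seq.count(a), seq.count(b)
    total = na + nb
    return (na - nb) / total if total else 0.0


def windowed_skews(genome, window=1000, step=500):
    """Return a list of (start, gc_skew, at_skew) tuples for sliding windows.

    Standard skew definitions are assumed; window and step are illustrative values.
    """
    rows = []
    last_start = max(len(genome) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = genome[start:start + window]
        rows.append((start, skew(chunk, "G", "C"), skew(chunk, "A", "T")))
    return rows


if __name__ == "__main__":
    # Toy sequence: a G-rich half followed by a C-rich half.
    for row in windowed_skews("GGGGCCCC", window=4, step=4):
        print(row)
```

A sharp change in skew between neighbouring windows is the kind of signal that could be combined with the other characteristics when ranking candidate regions.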

Marine Sciences

The US-Brazilian Consortium for Marine Sciences is funded by the Department of Education through its Fund for the Improvement of Postsecondary Education (FIPSE), and by the Fundação Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) of the Brazilian Ministry of Education. We’ve assembled a team of marine sciences researchers from San Diego State University and Scripps Institution of Oceanography, along with a team from the Federal University of Rio de Janeiro (UFRJ), the Universidade Federal de Pernambuco, the Universidade Federal da Paraíba, FIOCRUZ, and the Rio de Janeiro Botanical Gardens. Together, we will develop a completely new marine sciences course to be held in Brazil in 2011 and 2012, and exchange students between San Diego, Rio de Janeiro, Pernambuco, and Paraíba.

Metagenome Sequence Matcher

Metagenome analysis spans a large range of methods and tools in the bioinformatics community. These tools provide scientists with the biological information present in a sequenced environmental sample, more specifically the genetic functions encoded in the DNA of the sampled metagenome. Most of these tools have been developed to compare a single metagenome file against databases of sequences and annotation data.

This project aims to perform a comparative analysis between multiple metagenomic FASTA files. By importing n-length pieces of the sequences from one file into a hash table, sequences from other metagenome files can be compared against it quickly and precisely. Finding similar sequences and structures between numerous metagenomes can give insight into what biological functions are shared between related and unrelated organisms.
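The hash-table idea can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names, the choice to count distinct shared n-length pieces, and the toy sequences are all assumptions.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(parts)
            header, parts = line[1:], []
        elif header is not None:
            parts.append(line.upper())
    if header is not None:
        yield header, "".join(parts)


def build_index(records, n):
    """Map every n-length piece of the reference sequences to the headers containing it."""
    index = defaultdict(set)
    for header, seq in records:
        for i in range(len(seq) - n + 1):
            index[seq[i:i + n]].add(header)
    return index


def shared_pieces(index, records, n):
    """For each query sequence, count the distinct n-length pieces it shares with each reference."""
    hits = {}
    for header, seq in records:
        counts = defaultdict(int)
        pieces = {seq[i:i + n] for i in range(len(seq) - n + 1)}
        for piece in pieces:
            for ref in index.get(piece, ()):
                counts[ref] += 1
        hits[header] = dict(counts)
    return hits


if __name__ == "__main__":
    # Two toy "files" given as lists of lines; real input would be opened FASTA files.
    file_a = [">a1", "ACGTAC", ">a2", "TTTTTT"]
    file_b = [">b1", "CGTACG", ">b2", "GGGGGG"]
    index = build_index(read_fasta(file_a), n=4)
    print(shared_pieces(index, read_fasta(file_b), n=4))
```

Each lookup in the index takes roughly constant time, so comparing a second file costs time proportional to its length rather than to the product of the two file sizes.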

Pangenomes

A project that started with the question, “how many microbial genes are there in the world?” has grown to potentially lead to answers to this and broader questions about the microbial universe. First, the strains of a known taxon (E. coli) were organized into matrices, with strains as rows and proteins as columns. Hamming distances between rows define a metric for organizing strains into phylogenetic trees. The phylogenetic distance is the importance of the split between the strains, or the alpha score, as it is referred to in the d-splits literature. This approach became our main focus when we attempted the same heuristic with viral data, with surprisingly strong results. At present, we are taking “pie slices” of the phage proteomic tree and seeing to what extent we can recreate the observed internal structure, as a “proof of concept” for viral applicability. Reading and work on splitstrees, d-splits, and the consecutive ones property will drive the next developments. In addition, this coming week, on August 18th, our group will be attending a lecture on whole genome taxonomy, which should help drive further progress on our project.
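A small sketch of the distance step, assuming each matrix entry is a binary presence/absence value (1 if the strain carries the protein, 0 otherwise). That encoding, the strain names, and the function names are hypothetical; the tree-building and d-splits steps are not shown.

```python
from itertools import combinations


def hamming(row_a, row_b):
    """Count the columns (proteins) at which two strain rows differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must cover the same set of proteins")
    return sum(x != y for x, y in zip(row_a, row_b))


def pairwise_distances(matrix):
    """Return {(strain_a, strain_b): Hamming distance} for every pair of strains."""
    return {
        (a, b): hamming(matrix[a], matrix[b])
        for a, b in combinations(sorted(matrix), 2)
    }


if __name__ == "__main__":
    # Rows are strains, columns are proteins (assumed presence/absence encoding).
    matrix = {
        "strain_A": [1, 1, 0, 1],
        "strain_B": [1, 0, 0, 1],
        "strain_C": [0, 0, 1, 1],
    }
    for pair, distance in pairwise_distances(matrix).items():
        print(pair, distance)
```

The resulting distances could then be passed to a tree-building or split-decomposition method.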

PHACTS: Phage Classification Tool Set

There are two distinct phage lifestyles: lytic and lysogenic. The lysogenic lifestyle has many implications for phage therapy, genomics, and microbiology; however, it is often very difficult to determine from the genome alone whether a newly sequenced phage isolate grows lytically or lysogenically. Using the ~200 known phage genomes, a supervised random forest classifier was built to determine which phage proteins are important for determining lytic and lysogenic traits. A similarity vector is created for each phage by comparing each protein in a random sample of all known phage proteins to that phage’s genome. Each value in the similarity vector is the highest similarity score between one sampled protein and that phage genome. These vectors are used to train a random forest to classify phages according to their lifestyle. To test the classifier, each phage is removed from the data set one at a time and treated as a single unknown. The classifier correctly classified 188 of the 196 phages for which the lifestyle is known, giving the algorithm an estimated 4% error rate. The classifier also identifies the most important genes for determining lifestyle; in addition to integrases, which were expected to be important, the structural proteins of the phage (capsid and tail) also help determine the lifestyle. A large number of hypothetical proteins are also involved in determining whether a phage is lytic or lysogenic.
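The leave-one-out test described above can be sketched in a few lines. To keep the example self-contained, a simple nearest-neighbour rule stands in for the random forest, so this is not the PHACTS classifier itself. The function names and the toy similarity vectors are hypothetical.

```python
def nearest_neighbor_label(train, query):
    """Stand-in classifier (NOT a random forest): label of the closest training vector."""
    def squared_distance(vector):
        return sum((a - b) ** 2 for a, b in zip(vector, query))
    return min(train, key=lambda item: squared_distance(item[0]))[1]


def leave_one_out_error(samples, classify=nearest_neighbor_label):
    """Hold out each (vector, label) sample once and return the fraction misclassified."""
    errors = 0
    for i, (vector, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # every sample except the held-out one
        if classify(train, vector) != label:
            errors += 1
    return errors / len(samples)


if __name__ == "__main__":
    # Toy similarity vectors: one best-hit score per sampled protein (made-up values).
    samples = [
        ([0.9, 0.1], "lysogenic"),
        ([0.8, 0.2], "lysogenic"),
        ([0.1, 0.9], "lytic"),
        ([0.2, 0.8], "lytic"),
    ]
    print(leave_one_out_error(samples))
```

With the numbers reported above, 196 - 188 = 8 phages were misclassified, and 8/196 is about 4.1%, which matches the estimated 4% error rate.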

PhAnToMe

The lab’s spearhead PhAnToMe project is funded by the National Science Foundation to understand viral life. We are researching the genomics of viruses that infect bacteria (phages) with Dr. Mya Breitbart (University of South Florida), Dr. Matt Sullivan (U. Arizona), and Dr. Jeff Elhai (Virginia Commonwealth University). These viruses are the most abundant biological entities on the planet, and are responsible for many of the evolutionary changes that bacteria undergo. Phages carry virulence genes that allow bacteria to cause disease, they carry photosynthetic genes that allow bacteria to grow in the oceans, and they carry many genes whose functions we don’t yet know. Our project will unearth the role of some of those genes and proteins, and help biologists get to grips with the most diverse part of microbiology: the phages. At the PhAnToMe website you can browse complete genomes, and download phage genomes and associated data.

EdwardsLab

Delivering the best in bioinformatics…