Monthly Archives: October 2014

FOCUS: analyzing metagenomic (big) data in seconds using k-mers and a optimization method

Hi, I’m Genivaldo, but you can call me Geni. I’m a PhD student in the Computational Science Research Center (CSRC) in the joint program between San Diego State University (SDSU) and Claremont Graduate University (CGU), working with Dr. Rob Edwards. My research focus is developing agile methods to analyze metagenomics data. This blog post describes my experience creating models/tools to profile metagenomes, and in particular, how and why I created FOCUS (Silva et al., 2014).
Ever since I started my college studies I don’t like taking classes; mainly because they were never applied to my research or because my professors would rather give the students a test than a project. As part of the CSRC program, I have to take an extensive list of math classes, which were tough to me because I come from a computer science undergrad background. Lately I have tried to apply what I have learned at these math classes to computational biology to speed up the analyses of metagenomics data. It makes the classes more fun and lets me apply my new ideas and skills.
In the fall 2013, I was taking a mathematical modeling class where I learned a modeling method called Monte Carlo Simulations which led me to and spent part of my research time in 2013 trying to develop a model using what I learned in the class to try understanding which organisms were present in a given metagenomic sample. PHACCS (Angly et al., 2005) showed that metagenomes normally follow the power law distribution, so basically I was trying to re-sample a number of k-mer compositions from a random number of organisms, apply the power law distribution to it, and compute the distance between the random k-mer frequencies. Rob had talked a lot about this with Chris Quince, and they thought that it would be a good way to analyze metagenomes. If we repeat it thousands of time, the program predicts the organisms that are most probably in the metagenomic sample, and their abundances of those organisms. The program was tested with simulated data, and it worked. However, it was slow and did not really work for real data; it would only predict the organisms that were most abundant (which is nice for some applications, but we knew we would be criticized by the reviewers), and it ended up as a pre-print (Silva, Dutilh & Edwards, 2014). It is always interesting to create a new method, but  better is to develop a method that is fast; you know, we have enough tools to analyze metagenomes which are slow. Rob and Chris’ idea wasn’t so great.
OK! I got an A in the math modeling class, I applied it to my research, but I felt like I developed something which worked, but was not really useful in the real world. In the same semester, I was taking a class which focused on Computational Optimization methods, and I started to be interested in the topics which lead me to learn about non negative least squares (NNLS). Everybody knows about Least Squares (LS), right? A method which minimizes the function f(x) =||Ax-b||; however in the LS is an unconstrained method where the vector x may have negative values. The NNLS approach fits in metagenomic analysis because organisms only positive abundance; I haven’t heard someone saying “My sample is represented by -3% of Salmonella“. The idea is simple: for the function f(x) =||Ax-b||; A is going to be the k-mer frequencies from all known complete genomes, b is going to be the k-mer frequency from the user input data (their metagenome), and x is the optimal set of abundances for each organism in the database present in the sample constrained only to values >=0.
We named the tool by FOCUS which means Find Organisms by Composition USage, and it was my project for the optimization class which guaranteed me an A again. FOCUS is fast! It takes about 40 seconds to analyze huge metagenomics dataset, and the good news are that the results are similar to great tools such as MetaPhlAn (Segata et al., 2012). In the discussion part of the paper, we compared almost half (256 GB of data) of the Human Microbiome Project (HMP) (Consortium, 2012), and the tool took ~ 2 hours to profile all the data. FOCUS was submitted to PeerJ, and after a few rounds of revisions, it was accepted and published. You might be asking yourself “why is FOCUS so fast?” Well… calculating 7-mers (k-mer of frequency 7) can be done super fast by Jellyfish (Marçais & Kingsford, 2011), Turtle (Roy, Bhattacharya & Schliep, 2014) or Khmer (Zhang et al., 2014). Moreover, scipy has a fortan wrapper for the NNLS algorithm.
Now I am taking two other math classes at CGU (Advanced Numerical Analysis and Statistical Linear Models), and definitely gaining some new ideas and will come out from these classes and apply them in my research. Furthermore, I am working on a novel method to profile “what they are doing?” where I am using the SEED database to show the subsystems presents in a given input. The new tool is named SUPER-FOCUS, and that is all I can tell you for now, but I promise that it is going to be way faster than MG-RAST (Meyer et al., 2008), and MEGAN (Huson et al., 2007) (even when MEGAN  uses PAUDA (Huson & Xie, 2013)).

For more information about FOCUS read the paper.

References

Angly F, Rodriguez-Brito B, Bangor D, McNairnie P, Breitbart M, Salamon P, Felts B, Nulton J, Mahaffy J, Rohwer F. 2005. PHACCS, an online tool for estimating the structure and diversity of uncultured viral communities using metagenomic information. BMC Bioinformatics 6:41.
Consortium THMP. 2012. Structure, function and diversity of the healthy human microbiome. Nature 486:207–214.
Huson DH, Auch AF, Qi J, Schuster SC. 2007. MEGAN analysis of metagenomic data. Genome Research 17:377–386.
Huson DH, Xie C. 2013. A poor man’s BLASTX – high-throughput metagenomic protein database search using PAUDA. Bioinformatics:btt254.
Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764–770.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A et al. 2008. The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9:386.
Roy RS, Bhattacharya D, Schliep A. 2014. Turtle: Identifying frequent k-mers with cache-efficient algorithms. Bioinformatics:btu132.
Segata N, Waldron L, Ballarini A, Narasimhan V, Jousson O, Huttenhower C. 2012. Metagenomic microbial community profiling using unique clade-specific marker genes. Nature Methods 9:811–814.
Silva GGZ, Cuevas DA, Dutilh BE, Edwards RA. 2014. FOCUS: an alignment-free model to identify organisms in metagenomes using non-negative least squares. PeerJ 2:e425.

Silva GGZ, Dutilh BE, Edwards RA. 2014. FORMAL: A model to identify organisms present in metagenomes using Monte Carlo Simulation. bioRxiv:010801.
Zhang Q, Pell J, Canino-Koning R, Howe AC, Brown CT. 2014. These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure. PLoS ONE 9:e101271.

 

Calculating Chi-squared with perl

There are two Perl repositories available on CPAN that deal with Chi-squared analysis(Statistics::ChiSquare and Statistics::Distributions).  However neither one outputs the Chi-squared value for the analysis of two binary populations.

We can use the formula below to calculate the Chi-squared value with one degree of freedom.

χ2 = [n(ad – bc)2] / [(a + b) (c + d) (a + c) (b + d)]

n = a + b + c + d

Where:

variable population 1 population 2
+ a b
c d

Example:
Suppose we wish to determine the relationship between disease in two species. Both disease and the species are binary variables, so the Chi-squared test is applied:

Diseased species 1 species 2
No 57 36
Yes 63 88

n = (57 + 36 + 63 + 88) = 244

χ2 = [244*(57*88 – 36*63)2] / [(57 + 36) (63 + 88) (57 + 63) (36 + 88)]

χ2 = 8.81

The critical Chi-squared distribution P-values at 1 degree of freedom are:

D.F. 0.1 0.05 0.025 0.01 0.005
1 2.71 3.84 5.02 6.63 7.88

The χ2 value (8.82) is below the P-value 0.005.

Since the corresponding P-value is less than 0.05 (P<0.05), the data suggest that the prevalence of disease is significantly higher in species 2. Therefore we reject the null hypothesis.

Below is a Perl subroutine to automatically calculate Chi-squared.

sub chi_squared {
     my ($a,$b,$c,$d) = @_;
     return 0 if($b+$d == 0);
     my $n= $a + $b + $c + $d;
     return (($n*($a*$d - $b*$c)**2) / (($a + $b)*($c + $d)*($a + $c)*($b + $d)));
}
print &chi_squared(57,36,63,88); 

Output:

8.81780430153469