Two Perl modules on CPAN deal with Chi-squared analysis: Statistics::ChiSquare and Statistics::Distributions. However, neither one outputs the Chi-squared value for a comparison of two binary populations (a 2×2 table).
We can use the formula below to calculate the Chi-squared value with one degree of freedom.
χ² = [n(ad - bc)²] / [(a + b)(c + d)(a + c)(b + d)]
n = a + b + c + d
Where:
variable | population 1 | population 2 |
---|---|---|
+ | a | b |
– | c | d |
Example:
Suppose we wish to determine whether disease status is related to species. Both disease status and species are binary variables, so the Chi-squared test can be applied:
Diseased | species 1 | species 2 |
---|---|---|
No | 57 | 36 |
Yes | 63 | 88 |
n = 57 + 36 + 63 + 88 = 244
χ² = [244 × (57 × 88 - 36 × 63)²] / [(57 + 36)(63 + 88)(57 + 63)(36 + 88)]
χ² = (244 × 2748²) / (93 × 151 × 120 × 124)
χ² = 8.82
The critical values of the Chi-squared distribution at 1 degree of freedom, for several P-values, are:
D.F. | 0.1 | 0.05 | 0.025 | 0.01 | 0.005 |
---|---|---|---|---|---|
1 | 2.71 | 3.84 | 5.02 | 6.63 | 7.88 |
The χ² value (8.82) exceeds 7.88, the critical value for P = 0.005, so P < 0.005.
Since the P-value is below 0.05, we reject the null hypothesis that disease status is independent of species. The data suggest that the prevalence of disease is significantly higher in species 2 (88 of 124, or 71%) than in species 1 (63 of 120, or 52.5%).
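The table lookup described above can be sketched in Perl. This is only an illustration and not part of either CPAN module: the helper name `smallest_tabulated_p` is hypothetical, and the critical values are hard-coded from the table above, so it applies to 1 degree of freedom only.

```perl
use strict;
use warnings;

# Critical values for 1 degree of freedom, copied from the table above.
# Ordered from the largest P-value to the smallest.
my @critical = (
    [0.1,   2.71],
    [0.05,  3.84],
    [0.025, 5.02],
    [0.01,  6.63],
    [0.005, 7.88],
);

# Hypothetical helper: returns the smallest tabulated P-value whose
# critical value the statistic exceeds, or undef if it exceeds none.
sub smallest_tabulated_p {
    my ($chi2) = @_;
    my $p;
    for my $row (@critical) {
        my ($level, $cutoff) = @$row;
        $p = $level if $chi2 > $cutoff;
    }
    return $p;
}

my $p = smallest_tabulated_p(8.82);
print defined $p ? "P < $p\n" : "P > 0.1\n";
```

For the example value of 8.82 this prints `P < 0.005`, matching the manual lookup.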
Below is a Perl subroutine to automatically calculate Chi-squared.
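use strict;
use warnings;

# Chi-squared (1 d.f.) for a 2x2 table: a, b = first row; c, d = second row.
sub chi_squared {
    my ($a, $b, $c, $d) = @_;
    my $n = $a + $b + $c + $d;
    my $denom = ($a + $b) * ($c + $d) * ($a + $c) * ($b + $d);
    # The statistic is undefined when any row or column total is zero.
    return 0 if $denom == 0;
    return $n * ($a*$d - $b*$c)**2 / $denom;
}
print chi_squared(57, 36, 63, 88), "\n";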
Output:
8.81780430153469
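If an approximate P-value is wanted instead of a table lookup, the one-degree-of-freedom case has a closed form: P = erfc(sqrt(χ²/2)). The sketch below assumes Perl 5.22 or later, where the core POSIX module provides `erfc`. The helper name `chi_squared_p_1df` is hypothetical, and the formula does not apply to other degrees of freedom.

```perl
use strict;
use warnings;
use POSIX qw(erfc);   # erfc is available from POSIX in Perl 5.22 and later

# Hypothetical helper: upper-tail P-value of a Chi-squared statistic
# with exactly one degree of freedom.
sub chi_squared_p_1df {
    my ($chi2) = @_;
    return erfc(sqrt($chi2 / 2));
}

printf "%.4f\n", chi_squared_p_1df(8.81780430153469);
```

For the statistic above, the result should be roughly 0.003, consistent with P < 0.005 from the table.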