Python and Locales

If you are using utf-8 documents in Python, you may occasionally run into this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 124106: ordinal not in range(128)

The fix is trivial!
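As a general illustration (not necessarily the fix described in the full post), one common way to sidestep this class of error is to name the encoding explicitly when opening a file, rather than relying on the locale default. The file name and contents below are hypothetical.

```python
import io
import os
import tempfile

# Hypothetical file name and contents, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "example_utf8.txt")
text = u"temperature: 25 \u00b0C"  # the degree sign is bytes 0xc2 0xb0 in UTF-8

# Write and read with an explicit encoding so the locale default is never consulted.
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)
with io.open(path, "r", encoding="utf-8") as handle:
    recovered = handle.read()

# Decoding the same bytes as ASCII reproduces the class of error shown above.
raw = text.encode("utf-8")
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

print(recovered == text, ascii_failed)
```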

How to find the lengths of all the proteins in GenBank

How can we generate a list of the lengths of all the proteins [in a specific group] in GenBank? It's easy with FTP!
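Once protein sequences are on disk, the length-counting step itself is short. Here is a minimal sketch, assuming the proteins are in FASTA format; the function name and the example records are made up for illustration, and the download step is not shown.

```python
def fasta_lengths(lines):
    """Return a dict mapping each sequence ID to its length in residues."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the sequence ID.
            current = line[1:].split()[0]
            lengths[current] = 0
        elif current is not None:
            # Sequence lines may be wrapped, so add up every line.
            lengths[current] += len(line)
    return lengths

# Hypothetical records, for illustration only.
example = [
    ">prot1 example protein",
    "MKTAYIAKQR",
    "QISFVK",
    ">prot2 another example",
    "MSDNE",
]
lengths = fasta_lengths(example)
for seq_id in sorted(lengths):
    print("%s\t%d" % (seq_id, lengths[seq_id]))
```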

C++11 on CentOS 6

CentOS is great because it is secure, but not so great because it doesn’t have the latest software. Here is how to install a C++11 compiler on CentOS 6 or CentOS 7, and temporarily activate it in a shell. This does not change the default compiler and should cause fewer problems with your system (but that is not a money-back guarantee … you are on your own if it does!)

Italicizing scientific names

When writing scientific names: italicize family, genus, species, and variety or subspecies. Begin family and genus with a capital letter. Kingdom, phylum, class, order, and suborder begin with a capital letter but are not italicized.

Here are the main taxonomic ranks, from broadest to most specific:

Domain
Kingdom
Phylum
Class
Order
Family
Genus
Species

How to create a genome scale metabolic model

If you have a genome annotated in RAST and you want to create a genome scale metabolic model, here is one way to do it using PyFBA and the SEED.

2015 SDSU Metagenomics Workshop

The 2015 SDSU Metagenomics Workshop is designed to combine lectures, discussions, and practical hands-on experience to bring people up to date on data analysis for metagenomics.

The workshop is being held in Adams Humanities Room 2108 from 10 am – 6 pm every day from June 22nd – 26th, 2015.

Registration is closed.

The agenda is online here, and will be updated as we progress.

We will use a VirtualBox virtual machine during the class. More information about the image and how to download it is here. (Please note, the image is still subject to change, so please don’t download it yet!)

fastq to fasta

We often have people ask us how to convert fastq files to fasta format. We have a variety of code on this website, but sometimes that is not easy enough.

Here are a few ways to do it: using a Perl script written by Bas, directly on the command line, or using prinseq-lite. There is also a C++ version that you can compile (e.g. with c++ -o fastq2fasta fastq2fasta.cpp) and run on your machine.

We also have a simple form that converts fastq files to fasta files (DNA only … it does not give you the quality scores).
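For readers who prefer Python, the conversion can be sketched in a few lines. This is a minimal illustration, not one of the tools listed above. It assumes every fastq record is exactly four lines (header, sequence, separator, quality), and the function name and example reads are made up.

```python
import io

def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy each 4-line fastq record to fasta, dropping the quality scores."""
    count = 0
    while True:
        header = fastq_handle.readline().rstrip("\r\n")
        if not header:
            break  # end of file (a blank line also stops the conversion)
        sequence = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of record, got: %r" % header)
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count

# Hypothetical reads, for illustration only.
example = io.StringIO(u"@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!!!\n")
output = io.StringIO()
n_records = fastq_to_fasta(example, output)
print(output.getvalue())
```

The same function works on real files by passing open file handles in place of the in-memory examples.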

Anthill Training Notes

We successfully completed a one-day training course for ~40 people on how to use anthill, and everyone is now an expert, right?

The latest version of the anthill training notes is now available at this link: AnthillTrainingNotes

Testing the EdwardsLab ROV

Here we are testing the ROV in a pool.

Video: https://edwards.sdsu.edu/wordpress/wp-content/uploads/2014/11/OpenROV.mp4

Calculating Chi-squared with Perl

There are two Perl modules available on CPAN that deal with Chi-squared analysis (Statistics::ChiSquare and Statistics::Distributions). However, neither one outputs the Chi-squared value for the analysis of two binary populations.

We can use the formula below to calculate the Chi-squared value with one degree of freedom.

χ² = [n(ad – bc)²] / [(a + b) (c + d) (a + c) (b + d)]

n = a + b + c + d

Where:

variable	population 1	population 2
+	a	b
–	c	d

Example:
Suppose we wish to determine the relationship between disease in two species. Both disease and the species are binary variables, so the Chi-squared test is applied:

Diseased	species 1	species 2
No	57	36
Yes	63	88

n = (57 + 36 + 63 + 88) = 244

χ² = [244*(57*88 – 36*63)²] / [(57 + 36) (63 + 88) (57 + 63) (36 + 88)]

χ² = 8.82

The critical Chi-squared values at 1 degree of freedom, for a range of P-values, are:

D.F.	0.1	0.05	0.025	0.01	0.005
1	2.71	3.84	5.02	6.63	7.88

The χ² value (8.82) is greater than the critical value at P = 0.005 (7.88), so the P-value is below 0.005.

Since the corresponding P-value is less than 0.05 (P<0.05), we reject the null hypothesis that disease is independent of species. The data suggest that the prevalence of disease is significantly higher in species 2 (88 of 124 diseased) than in species 1 (63 of 120 diseased).

Below is a Perl subroutine to automatically calculate Chi-squared.

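use strict;
use warnings;

# Chi-squared (1 degree of freedom) for a 2x2 table: row 1 = (a, b), row 2 = (c, d)
sub chi_squared {
     my ($a, $b, $c, $d) = @_;
     my $n = $a + $b + $c + $d;
     my $denominator = ($a + $b) * ($c + $d) * ($a + $c) * ($b + $d);
     # An empty row or column makes the statistic undefined; return 0 instead of dividing by zero
     return 0 if $denominator == 0;
     return $n * ($a*$d - $b*$c)**2 / $denominator;
}
print chi_squared(57, 36, 63, 88), "\n";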

Output:

8.81780430153469
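
The comparison against the critical values can be automated too. Here is a minimal Python sketch of the same calculation, followed by a lookup against the critical values listed in the table above. The function names are made up for illustration, and the lookup only covers the five tabulated P-values rather than computing an exact P-value.

```python
# Critical chi-squared values at 1 degree of freedom, copied from the table above.
CRITICAL_VALUES = [(0.1, 2.71), (0.05, 3.84), (0.025, 5.02), (0.01, 6.63), (0.005, 7.88)]

def chi_squared(a, b, c, d):
    """Chi-squared for a 2x2 table with rows (a, b) and (c, d)."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # an empty row or column: statistic undefined, report 0
    return float(n) * (a * d - b * c) ** 2 / denominator

def smallest_p_exceeded(chi2):
    """Return the smallest tabulated P-value whose critical value chi2 exceeds, or None."""
    exceeded = [p for p, critical in CRITICAL_VALUES if chi2 > critical]
    return min(exceeded) if exceeded else None

chi2 = chi_squared(57, 36, 63, 88)
p_bound = smallest_p_exceeded(chi2)
print("chi-squared = %.2f" % chi2)  # chi-squared = 8.82
print("P < %s" % p_bound)           # P < 0.005
```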