Monthly Archives: July 2014

Phage Identification

We are interested in phages — viruses that infect bacteria. For years the Edwards’ lab has been looking at new, undiscovered phages.

Recently we identified crAssphage, a new type of virus that had never been seen before. By looking at the sequences in metagenomes we identified a set of contigs that were common to many different metagenomes. When we assembled them, the result looked like a phage, and we could compare it to the known phages in our database of sequences.

Working with folks in the biology department, we used PCR to show that the virus has a circular genome. However, we have so far been unable to culture the virus. We're working on it, and hopefully others are too, but until then we have neither an image of the virus nor an idea of what it does.

What is the relative biomass of crAssphage in the intestine?

Following up on the crAssphage press coverage and comments, Dan asked me the following question:

It was interesting to hear that there are 10 times as many viruses as bacteria in the body. If you have time to answer a question, I’ve always wondered about the relative biomass of bacteria compared to human cells, and now the relative biomass of viruses compared to human cells.

Inspired by XKCD's What If?, we can use some Fermi estimation to answer this. A typical virus weighs about 10⁻¹⁹ kg (e.g. Adenovirus, whose genome is about 50 kb, weighs about 2.5 × 10⁻¹⁹ kg [1]). A typical bacterium, like E. coli, weighs about 10⁻¹⁵ kg, and a typical human cell about 10⁻¹² kg.

Scientists like to say that we have ~10× more bacteria than human cells and ~10× more viruses than bacteria. The human body contains about 37 trillion cells [2] (37 × 10¹², but since we are estimating we'll round that up to 10¹⁴). Based on these estimates, the average human carries about 100 kg of human cells (10¹⁴ cells × 10⁻¹² kg), about 1 kg of bacteria (10¹⁵ cells × 10⁻¹⁵ kg), and about 0.001 kg of viruses (10¹⁶ viruses × 10⁻¹⁹ kg).
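
As a sanity check, here is the same arithmetic as a short Python sketch (the counts and masses are the order-of-magnitude estimates above, not measurements):

# Fermi estimate of biomass in the human body, using the rough values above
human_cells = 1e14      # ~37 trillion cells, rounded up for estimation
bacteria    = 1e15      # ~10x more bacteria than human cells
viruses     = 1e16      # ~10x more viruses than bacteria

human_cell_mass = 1e-12 # kg per typical human cell
bacterium_mass  = 1e-15 # kg per typical bacterium (e.g. E. coli)
virus_mass      = 1e-19 # kg per typical virus

print 'Human cells: {} kg'.format(human_cells * human_cell_mass)  # 100.0 kg
print 'Bacteria:    {} kg'.format(bacteria * bacterium_mass)      # 1.0 kg
print 'Viruses:     {} kg'.format(viruses * virus_mass)           # 0.001 kg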

Is there a difference between a ‘query to db’ BLAST and a ‘db to query’ BLAST?

The question often comes up: is there a difference between using a file as the BLAST query and using that same file as the database? There is, in fact, a difference, and it lies in the e-value. Consider the following database file:

>one
acgtacgtagCtagctagctagctgactagc
>two
acgtacgtagAtagctagctagctgactagc
>three
acgtacgtagTtagctagctagctgactagc
>four
acgtacgtagGtagctagctagctgactagc

And the following query file:

>qs1
acgtacgtagCtagctagctagctgactagc

When you search against a BLAST database, the e-value is calculated from the bit score and takes the size of the database into account: it estimates how many matches of that quality you would expect to find by chance in a database of that size. In the example above, the four database sequences differ from each other by only one base. When you BLAST the query sequence qs1 against this database of four sequences, the search space is four times as large, so your e-values will be higher (worse).

$ blastn -db db.fna -query query.fna -outfmt '6'
qs1 one 100.00  31  0   0   1   31  1   31  7e-15   58.4
qs1 four    96.77   31  1   0   1   31  1   31  3e-13   52.8
qs1 three   96.77   31  1   0   1   31  1   31  3e-13   52.8
qs1 two 96.77   31  1   0   1   31  1   31  3e-13   52.8

However, if you reverse the database and query, the database now contains only the single sequence qs1, so the search space is smaller and your e-values will be lower (better).

$ blastn -db query.fna -query db.fna -outfmt '6'
one qs1 100.00  31  0   0   1   31  1   31  2e-15   58.4
two qs1 96.77   31  1   0   1   31  1   31  9e-14   52.8
three   qs1 96.77   31  1   0   1   31  1   31  9e-14   52.8
four    qs1 96.77   31  1   0   1   31  1   31  9e-14   52.8

The two e-values for the full-length qs1/one match (7e-15 in the first output, 2e-15 in the second) describe the same alignment, yet they are not identical.

While the skew is only marginal here, consider what happens when your database contains millions of bases.
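
To see why, recall that for a normalized bit score S, the expected number of chance hits is roughly E = m × n / 2^S, where m is the query length and n is the total database length, so a database four times as large gives e-values roughly four times as high. A quick sketch (this ignores BLAST's effective-length corrections, so it only matches the output above to within an order of magnitude):

# E-value from bit score: E = m * n / 2**S (no effective-length corrections)
def evalue(bitscore, query_len, db_len):
    return query_len * db_len / 2.0 ** bitscore

print evalue(58.4, 31, 4 * 31)  # qs1 vs the four-sequence db: ~1e-14
print evalue(58.4, 31, 31)      # one vs the single-sequence db: ~2.5e-15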

SGE Array Jobs

How to create an array job for the cluster using SGE


Often you have a lot of jobs that are all the same. For example, you may want to BLAST a series of files against the same database. Here is how to make an array job.

First, you need to know about environment variables. In an array job, the environment variable $SGE_TASK_ID is set to a number from a range that you define, incremented by a step size that you also define.

To submit an array job, we use the -t flag in our qsub command:

This will submit an array job where $SGE_TASK_ID takes every value from 1 to 100, in steps of 1:

qsub -t 1-100:1

This will submit an array job where $SGE_TASK_ID takes values from 1 to 1000 in steps of 10 (1, 11, 21, and so on):

qsub -t 1-1000:10

The range can be any set of numbers you define. There is an upper limit of 75,000 tasks in a single array job, but you can submit a second array job starting at 75001.

Now all you need is a script that processes your files. There are several ways to do this. One approach is to number all of your input files; then, in your script, $SGE_TASK_ID supplies the number:

#!/bin/bash
blastn -query $SGE_TASK_ID.fasta -db nr -out $SGE_TASK_ID.blast

You can also list all the files that you want to process in a file, and use head and tail to pull out one filename per task:

#!/bin/bash
input=$(head -n $SGE_TASK_ID file_of_files | tail -n 1)
blastn -query $input -db nr -out $input.blast

Another way to do it is to have a file with all the commands and use head and tail to get a specific command:

#!/bin/bash
cmd=$(head -n $SGE_TASK_ID file_of_commands | tail -n 1)
eval "$cmd"

NOTE: All of these examples use bash, so be sure to include -S /bin/bash in your qsub command to make sure they run under the bash shell.
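
Putting it all together, submitting the first script above as an array job over 100 numbered input files might look like this (run_blast.sh is a hypothetical name for your job script):

qsub -S /bin/bash -t 1-100:1 run_blast.sh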

Atlas Scientific Raspberry Pis

We use Atlas Scientific probes to measure all kinds of things. One of our setups uses a Raspberry Pi and Plotly, and is based on this Instructables post. Below I have distilled the essential steps.


Boot up and log into your Pi.

Start by disabling the getty on the serial line by commenting out the T0:23 line at the end of /etc/inittab:

vi /etc/inittab
#Spawn a getty on Raspberry Pi serial line
#T0:23:respawn:/sbin/getty -L ttyAMA0 115200 vt100

Install some core modules:

sudo apt-get install python-serial git-core python-pip
sudo pip install rpi.gpio plotly

Then reboot the machine.

Grab the Atlas Scientific Python library (git clone https://github.com/plotly/atlas-scientific.git) and edit atlas-pi.py to include your API key, streaming token, and username from Plotly.

Wire up the Pi to the Atlas circuit:

  1. Connect Ground on the Atlas stamp to Ground on the Pi Cobbler
  2. Connect VCC on the Atlas stamp to 5V on the Pi Cobbler
  3. Connect RX on the Atlas circuit to TX on the Pi Cobbler
  4. Connect TX on the Atlas circuit to RX on the Pi Cobbler

You should be good to go now.
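
Before wiring in Plotly, you can sanity-check the connection by talking to the stamp directly with python-serial. This is a minimal sketch; the baud rate (38400 here) and the single-reading command ('R') vary by stamp model and firmware, so check your Atlas datasheet:

import serial

# the serial line we freed up by editing /etc/inittab
ser = serial.Serial('/dev/ttyAMA0', 38400, timeout=2)
ser.write('R\r')            # request a single reading (command varies by stamp)
reading = ser.readline()    # Atlas responses end in a carriage return
print 'Probe says: {}'.format(reading.strip())
ser.close()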

Note that Rob’s image, RaspberryPiAtlasScientific20140531.img, has all of this already done!

The dilemma of open access

Open access publications are a Good Thing. People can read and cite your work free of charge, at any time, from anywhere. However, the open access movement has ignored a big opportunity cost associated with open access publication: not all of us can afford the fees. In the Edwards’ lab we’re going to try to level the playing field. Either we publish open access and cite only open access, freely available articles, or we publish closed access and can then cite every article. The choice is up to the authors, the editors, and the journals. If the authors want to cite closed-access articles, if the editors require us to cite closed-access articles, or if journal policy requires it, then we cannot and will not pay open access fees.

In 2013 the Edwards’ lab contributed to over a dozen papers. The open access fees for those papers ranged from $1,500 to $5,000 each. For example, Source Code for Biology and Medicine charged $2,030 for the scaffold_builder paper. The paper we published in ISME J cost $3,300 in open access fees, plus an additional $2,696 for the color figures ($5,996 for one paper!). Even a cursory calculation tells you that this is not sustainable. My grants typically include on the order of $5,000 for publications over the lifetime of the grant. Any additional money I spend on publications means less money for something else.

However, we measure a publication’s effectiveness by its citation count: how many times the paper is cited by others. Therefore, whenever we cite a paper, we contribute to the perceived importance of that paper and that author’s work. There is a debate about whether open access increases citation rate [e.g. Gargouri et al.; Moed; and others], and that would be a clear advantage of paying the fees, but is it worth spending a postdoc’s salary on?

So we are left with a dilemma: I can spend the money on open access publications, or I can spend it on people and supplies. I fully support the open access movement, and almost all (if not all?) of our papers are published open access. However, it is not a level playing field if my competitors publish in the same journals without paying the open access fee and we are then expected to cite them. Remember, ethically you should not cite a paper that you have not read (reading the abstract does not count!), and we cannot access most closed access papers without paying for them individually (so I am doubly penalized: once for paying open access fees, and again for paying for someone else’s publication costs). SDSU does not have the funds to subscribe to large numbers of journals (California has had a hard few years with its budget!).

We are going to try an experiment: we will only cite papers that we can access without restriction. This means that if your paper is published closed access, we will not cite it in our work, regardless of how good or important it is. The metric is simple: we will only cite papers whose full text we can access from SDSU without additional payment. If authors, reviewers, editors, or journal policy require that we cite a paper we cannot access without paying for it, then I will not pay open access fees. The playing field has to be level.

Hypothetical proteins regular expressions in PERL

When working with genome annotations we often need to determine whether a protein is hypothetical (putative) or not. People use a lot of variant annotations that all basically mean the same thing: we don’t know what this protein does. It is hypothetical. These Perl regular expressions will let you test whether a protein annotation is hypothetical or not. [Note: we also have a python version that does the same thing.]

This code came from the SEED source files, and we have collected and collated it over the years.

sub hypo {
    # Return 1 if a functional annotation looks hypothetical, 0 otherwise.
    # Handles being called with either one or two arguments.
    my $x = (@_ == 1) ? $_[0] : $_[1];

    if (! $x) { return 1 }
    if ($x =~ /lmo\d+ protein/i) { return 1 }
    if ($x =~ /hypoth/i) { return 1 }
    if ($x =~ /conserved protein/i) { return 1 }
    if ($x =~ /gene product/i) { return 1 }
    if ($x =~ /interpro/i) { return 1 }
    if ($x =~ /B[sl][lr]\d/i) { return 1 }
    if ($x =~ /^U\d/) { return 1 }
    if ($x =~ /^orf[^_]/i) { return 1 }
    if ($x =~ /uncharacterized/i) { return 1 }
    if ($x =~ /pseudogene/i) { return 1 }
    if ($x =~ /^predicted/i) { return 1 }
    if ($x =~ /AGR_/) { return 1 }
    if ($x =~ /similar to/i) { return 1 }
    if ($x =~ /similarity/i) { return 1 }
    if ($x =~ /glimmer/i) { return 1 }
    if ($x =~ /unknown/i) { return 1 }
    if (($x =~ /domain/i) ||
        ($x =~ /^y[a-z]{2,4}\b/i) ||
        ($x =~ /complete/i) ||
        ($x =~ /ensang/i) ||
        ($x =~ /unnamed/i) ||
        ($x =~ /EG:/) ||
        ($x =~ /orf\d+/i) ||
        ($x =~ /RIKEN/) ||
        ($x =~ /Expressed/i) ||
        ($x =~ /[a-zA-Z]{2,3}\|/) ||
        ($x =~ /predicted by Psort/) ||
        ($x =~ /^bh\d+/i) ||
        ($x =~ /cds_/i) ||
        ($x =~ /^[a-z]{2,3}\d+[^:+\-0-9]/i) ||
        ($x =~ /similar to/i) ||
        ($x =~ / identi/i) ||
        ($x =~ /ortholog of/i) ||
        (index($x, "Phage protein") == 0) ||
        ($x =~ /structural feature/i)) { return 1 }
    return 0;
}
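
And since the note above mentions a python version: here is a minimal Python sketch of the same idea. This is an illustrative subset of the patterns, not the lab’s full Python port:

import re

# A subset of the patterns from the Perl version above
HYPO_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r'hypoth', r'conserved protein', r'gene product', r'interpro',
    r'uncharacterized', r'pseudogene', r'^predicted', r'unknown',
    r'similar to', r'similarity', r'domain', r'unnamed', r'orf\d+',
)]

def is_hypothetical(function):
    """Return True if a functional annotation looks hypothetical."""
    if not function:
        return True
    if function.startswith('Phage protein'):
        return True
    return any(p.search(function) for p in HYPO_PATTERNS)

print is_hypothetical('hypothetical protein')  # True
print is_hypothetical('DNA polymerase III')    # False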

argparse: Python’s command line parsing module

Although GUIs and web pages are great ways for users to interact with our tools and software, the command line is still a prevalent medium for running scripts in bioinformatics. One way to make command line scripts more flexible is to support options, flags, and arguments in our code. These let users change the behavior of a script, e.g., input values and formats, output file names and formats, algorithm values and thresholds, status updates, and more. Before I really dove into Python, C-style argument parsing was what I was most familiar with, via the getopt Python module or the Getopt Perl module, but that style does not follow the object-oriented design that languages like Python are known for. I usually spent two or three dozen lines of code implementing the parsing and writing out a usage/help message. I recently came across the argparse module and found it was exactly what I was looking for: it takes away much of the manual programming and simplifies the process. Here I’ll walk through a few simple cases showing how to use argparse and the benefits I found from using it.

USING REQUIRED ARGUMENTS

A simple first example is a program that takes in a file and prints out its name.

import argparse

parser = argparse.ArgumentParser()  # Initiate argument parser object
# Add input file argument with help message
parser.add_argument('infile', help='Input file to print out')
args = parser.parse_args()
print 'The filename is {}'.format(args.infile)

When we run the script without giving a filename, the following message appears:

$ python sample.py
usage: sample.py [-h] infile
sample.py: error: too few arguments

We can then run the script with the -h flag to get a full help message:

$ python sample.py -h
usage: sample.py [-h] infile

positional arguments:
  infile       Input file to print out

optional arguments:
  -h, --help   show this help message and exit

Here we can see the positional (required) argument listed along with the help message that we wrote. What is great is that we did not need to code the help message formatting ourselves: the argparse object formats and prints the help message whenever there is a problem during argument parsing.

We can successfully run the code like so:

$ python sample.py test_file.fasta
The filename is test_file.fasta


USING OPTIONAL ARGUMENTS

Optional arguments, like flags, are also essential in many programs and the argparse module supports these.

Here, we’ll add the option to print out the number of lines in the file:

import argparse
parser = argparse.ArgumentParser()  # Initiate argument parser object

# Add input file argument with help message
parser.add_argument('infile', help='Input file to print out')

# Add line count optional argument
parser.add_argument('--linecount', help='Printout number of lines in file',
                    action='store_true')

args = parser.parse_args()  # Call command line parser method
print 'The filename is {}'.format(args.infile)

# Check if the linecount flag was raised
if args.linecount:
    with open(args.infile) as f:
        numLines = len(f.readlines())
    print 'Number of lines: {}'.format(numLines)

To explain what I’ve added: the new argument name starts with a double hyphen ('--'). This lets the parser know that it is an optional argument, not a required or positional one. I also added action='store_true' to the call. This tells the parser to store True in args.linecount when the flag is present and False when it is not. The default behavior for action is to expect an argument value after the flag.

We can run the script with the help flag to get new information:

$ python sample.py -h
usage: sample.py [-h] [--linecount] infile

positional arguments:
  infile       Input file to print out

optional arguments:
  -h, --help   show this help message and exit
  --linecount  Printout number of lines in file
$ python sample.py test_file.fasta
The filename is test_file.fasta
$
$ python sample.py test_file.fasta --linecount
The filename is test_file.fasta
Number of lines: 4
$
$ python sample.py --linecount test_file.fasta
The filename is test_file.fasta
Number of lines: 4

We can see here that the new help message includes the --linecount flag and its help text. I then ran the script without the flag and it completed successfully. Finally, I included the flag in the command, once before the filename and once after it, to show that the position of optional arguments does not matter.

We can also add short arguments, because some users prefer them over long arguments. Changing that one line of code gives us:

import argparse
parser = argparse.ArgumentParser()  # Initiate argument parser object

# Add input file argument with help message
parser.add_argument('infile', help='Input file to print out')

# Add line count optional argument
parser.add_argument('-c', '--linecount',
                    help='Printout number of lines in file',
                    action='store_true')

args = parser.parse_args()  # Call command line parser method
print 'The filename is {}'.format(args.infile)

# Check if the linecount flag was raised
if args.linecount:
    with open(args.infile) as f:
        numLines = len(f.readlines())
    print 'Number of lines: {}'.format(numLines)

The new short argument is prepended with a single hyphen. Running the script gives us the output:

$ python sample.py -h
usage: sample.py [-h] [-c] infile

positional arguments:
  infile           Input file to print out

optional arguments:
  -h, --help       show this help message and exit
  -c, --linecount  Printout number of lines in file
$ python sample.py test_file.fasta -c
The filename is test_file.fasta
Number of lines: 4

One thing to notice: -c is shown in the usage line at the top of the help message because we passed it as the first argument to parser.add_argument(). If we had put -c after --linecount, the long name would have appeared in the usage line instead, and the order under the optional arguments section would also have been flipped.


To conclude, the argparse module handles much of the work of parsing command line arguments and formatting help and usage messages. argparse also provides features that I did not cover here, such as type checking, limited choices for arguments, and argument counting; a couple of these are sketched below, and the tutorial linked at the end covers all of this and more.
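
Here is a minimal sketch of those extras (the option names are hypothetical, but type=, choices=, default=, and action='count' are standard argparse parameters):

import argparse
parser = argparse.ArgumentParser()
# type checking: argparse converts the string and errors out if it is not a float
parser.add_argument('-e', '--evalue', type=float, default=1e-5,
                    help='E-value cutoff (default: 1e-5)')
# limited choices: any other value triggers a usage error
parser.add_argument('-f', '--format', choices=['fasta', 'fastq'],
                    default='fasta', help='Input sequence format')
# argument counting: -v, -vv, -vvv give increasing verbosity
parser.add_argument('-v', '--verbose', action='count', default=0)
args = parser.parse_args()
print 'evalue={}, format={}, verbosity={}'.format(
    args.evalue, args.format, args.verbose)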

More in depth tutorial @ http://goo.gl/Y4CsIH

Making Live USBs

I’ve had some adventures in making live USBs recently. Here are some lessons learned and some tips and tricks to extend the standard USB.

I started with an Ubuntu installation image. I wanted to make a live USB with a bunch of bioinformatics software on it for a class I was teaching. I grabbed the then-current version (13.10, 64-bit), but you should use whatever is available.

I also have a set of 16 GB USB drives. I like the ADATA USB 3 drives, but you should go with what you have. I wanted to partition my drives so that I have one partition for the Ubuntu installation and one partition for other stuff (documents, photos, etc.). I used gparted to create the partitions, made them both FAT32, and gave each a label. Make the first partition 6 GB; this is where you will install the operating system and a 4 GB persistence file for your software.

Installing Linux on the first partition was a breeze using the Startup Disk Creator in Ubuntu. However, here’s where things get interesting.

First, I had a list of ~200 pieces of software that I wanted to install from the Debian repository (you can see the full list here). I tried installing all of them at once, but that didn’t work: my 4 GB persistence file filled up pretty quickly, and worse, the computer slowed down because of memory constraints (I think). So I installed them manually, choosing first the packages that (a) had a lot of dependencies (e.g. bioperl) and (b) other things depended on (e.g. bioperl). By limiting the number of packages I installed at a time, by judicious use of apt-get, and by frequently removing files in /var/cache/apt/archives, I successfully downloaded and installed everything.

Copying the installed system

If you want to duplicate the USB disk, there are two ways to do that:

  1. Use dd: insert the disk and unmount its partitions, then use dd if=/dev/sdx of=bioubuntu.img to make an image of the whole disk. You will need to change /dev/sdx to the appropriate device (e.g. /dev/sdb or /dev/sdc).
  2. Copy the casper-rw file: the file system is stored in a persistence file called casper-rw. The reason the operating system is limited to 4 GB is that FAT32 (which your thumb drive almost certainly uses) can only handle files up to 4 GB in size.