How to mess with BLAST results

or why not to trust the top BLAST hit …

or why is my BLAST wrong …

This little treat tricked us today, and it happens more than you would think. In our case we were looking at mitochondria, but this example also works with phages.

Suppose you take a “circular” genome and recircularize it at a different point. That shouldn’t change the BLAST results, right? You haven’t changed the sequence at all, so why would the results differ?

Let’s try it with phage ΦX174. The sequence is available from GenBank as NC_001422.1. You can also download the original sequence here. When we BLAST this against the nt database at NCBI, ΦX174 is the top hit, as expected. That makes sense, since it is the same genome:

Now, watch carefully as there is some sleight of hand here…

Let’s take the ΦX174 genome and rotate it by moving the first 2,000 nucleotides to the end. There is no real rationale for the genome being split where it is. In fact, the split in the deposited sequence appears to interrupt three genes. So here is the same genome, but rotated by 2,000 bp. When we run the BLAST now, we should still expect to see ΦX174 as the top hit:
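If you want to try the rotation yourself, here is a minimal Python sketch of one way to do it. It assumes a single-record FASTA file; the `rotate` and `rotate_fasta` helpers and any file names you pass in are made up for illustration, and this is not necessarily how we built the rotated file.

```python
def rotate(seq: str, offset: int) -> str:
    """Move the first `offset` bases of a circular sequence to the end."""
    if not seq:
        return seq
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def rotate_fasta(in_path: str, out_path: str, offset: int, width: int = 70) -> None:
    """Rotate a single-record FASTA file (assumption: exactly one sequence)."""
    with open(in_path) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    header, seq = lines[0], "".join(lines[1:])
    rotated = rotate(seq, offset)
    with open(out_path, "w") as out:
        # Note the rotation in the header so the two files are not confused.
        out.write(f"{header} rotated_by={offset}\n")
        for i in range(0, len(rotated), width):
            out.write(rotated[i:i + width] + "\n")


# Toy example: the same bases, just a different starting point.
toy = "AAAACCCCGGGGTTTT"
print(rotate(toy, 4))  # CCCCGGGGTTTTAAAA
```

The rotated sequence contains exactly the same bases in the same circular order; only the position where the circle is cut changes.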

Now all of a sudden we have Anderseniella sp. Alg231_50 genome assembly, chromosome: VII and Escherichia coli DH5alpha plasmid p301-4 as the top hits! In fact, our real genome is now the 12th in the list by score!

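Of course, what has happened is that BLAST finds local alignments, so the rotated query now matches the original ΦX174 record as two separate pieces instead of one end-to-end alignment. The total score of the two pieces added together is the same as before, but the maximum score for any single local match is much lower for ΦX174 than it is for the other genomes, and that is enough to push it down the list.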
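To see where the two pieces come from, here is a toy Python sketch. It uses a random sequence of 5,386 bases (the length of the NC_001422.1 record) as a stand-in for the genome and treats the number of matched bases as a crude "score". That is a simplification of ours and not how BLAST computes bit scores; `local_pieces` is a hypothetical helper.

```python
import random


def rotate(seq, offset):
    """Move the first `offset` bases of a circular sequence to the end."""
    offset %= len(seq)
    return seq[offset:] + seq[:offset]


def local_pieces(subject, query_offset):
    """Lengths of the exact-match pieces between a rotated copy and the original."""
    n = len(subject)
    k = query_offset % n
    query = rotate(subject, k)
    # Piece 1: the start of the query matches the subject from position k onward.
    assert query[:n - k] == subject[k:]
    # Piece 2: the end of the query matches the start of the subject.
    assert query[n - k:] == subject[:k]
    return [length for length in (n - k, k) if length > 0]


random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(5386))  # stand-in sequence
pieces = local_pieces(genome, 2000)
print("pieces:", pieces)                              # pieces: [3386, 2000]
print("max:", max(pieces), "total:", sum(pieces))     # max: 3386 total: 5386
```

The total is unchanged, but the best single piece covers only 3,386 of the 5,386 bases, so any ranking based on the best single local match treats the rotated genome as a weaker hit.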

Just remember, genome orientation (where the circle is cut to make a linear sequence) is important when you are comparing whole short genomes like mitochondria or phage!