Category Archives: Lab blog

Fast correlations with turbocor

We often want to calculate Pearson correlations between different datasets; for example, we have used them to identify the hosts of different phages. Often we want to calculate Pearson correlations on really large matrices, and our usual solution is to use crappy code and be patient!

However, Daniel Jones recently released turbocor, a fast, Rust-based implementation of pairwise Pearson correlations, and we are intrigued to work with it. Here is a brief guide to calculating correlations with turbocor.
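For reference, this is roughly what the "be patient" approach looks like. It is a minimal pure-Python sketch of pairwise Pearson correlation, not turbocor and not its interface; the function names and the toy profiles are made up for illustration.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Deviations of each value from its own mean
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    covariance = sum(a * b for a, b in zip(dx, dy))
    return covariance / sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def pairwise_pearson(rows):
    """rows maps a name to a list of values; returns {(name1, name2): r}."""
    return {(a, b): pearson(rows[a], rows[b]) for a, b in combinations(rows, 2)}

# Hypothetical abundance profiles across four samples
profiles = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # rises with "a"
    "c": [4.0, 3.0, 2.0, 1.0],  # falls as "a" rises
}
result = pairwise_pearson(profiles)
for pair, r in result.items():
    print(pair, round(r, 3))
```

The number of pairs grows with the square of the number of rows, which is why this gets painful on really large matrices.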


minimap2 hints

Here are some tips and tricks for minimap2 that I keep forgetting!

--split-prefix

If you have a large (>4 GB) multisequence index file, there are two options.

The first is to increase the value of -I when you build the index (preferred) so that the whole index is kept in memory. Note: this must be done when you build the index; you can’t build the index and then change -I at runtime.
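As a rough sketch of that first option, here is a small Python helper that assembles the indexing command. The file names and the 16G value are placeholders that I made up; -I and -d are the minimap2 options for the batch size and for writing the index.

```python
import shlex

def index_command(fasta, index, batch_size="16G"):
    """Build the minimap2 command that writes an index with a larger -I."""
    # -I: number of reference bases loaded per batch
    # -d: write the index to this file
    return ["minimap2", "-I", batch_size, "-d", index, fasta]

# Placeholder file names, swap in your own reference
cmd = index_command("reference.fasta", "reference.mmi")
print(shlex.join(cmd))
# To actually build the index: subprocess.run(cmd, check=True)
```

Pick a batch size comfortably larger than the total number of bases in your reference, and make sure the machine has the memory to hold it.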

The second is to use --split-prefix with a string. For snakemake, there are two options:

1. You can use "{sample}" as your prefix like so:

params:
    prfx = "{sample}"
...
shell:
    """
         minimap2 --split-prefix {params.prfx} ...
    """

2. You can use a random 6 character string like so:

import random, string

params:
    # six random uppercase letters and digits
    prfx = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))
...
shell:
    """
         minimap2 --split-prefix {params.prfx} ...
    """

The catch is that things will probably break if your index file is small. If you see the error [W::sam_hdr_create] Duplicated sequence, it is probably because you have split a small index and the sequence IDs are being duplicated. Remove the --split-prefix option and you should be good.
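One way to avoid guessing is to count the bases in the reference before deciding whether to add --split-prefix. This is a minimal sketch under a few assumptions: the reference is an uncompressed FASTA, the function names are my own, and the 4,000,000,000 default simply mirrors the ">4 GB" rule of thumb above, so set it to whatever -I value you actually use.

```python
def count_bases(lines):
    """Count sequence characters in FASTA lines, skipping header lines."""
    total = 0
    for line in lines:
        if not line.startswith(">"):
            total += len(line.strip())
    return total

def needs_split_prefix(lines, batch_bases=4_000_000_000):
    """True if the reference has more bases than fit in one -I batch."""
    return count_bases(lines) > batch_bases

# Toy FASTA with 12 bases in total
toy = [">seq1\n", "ACGTACGT\n", ">seq2\n", "ACGT\n"]
print(count_bases(toy))                         # 12
print(needs_split_prefix(toy))                  # False
print(needs_split_prefix(toy, batch_bases=10))  # True
```

Both functions accept any iterable of lines, so an open file handle works too, for example `with open("reference.fasta") as fh: needs_split_prefix(fh)`.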

Primer Trimming Challenge

In DNA sequencing, we add primers and adapters to the ends of sequences. These are short (typically <50 bp), known sequences that we use to identify different kinds of sequences. You can find out more about the adapters in this YouTube video.

This challenge is to write software to efficiently detect and remove the primers and adapters from a fastq format file.
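To make the task concrete, here is a deliberately naive baseline in Python. It only removes a primer that matches exactly at the start of a read, and it assumes simple four-line fastq records. The function names, the toy reads and the "ACGT" primer are all made up, and a real solution would also need to handle mismatches, adapters at the other end of the read, and speed.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from four-line fastq records."""
    records = iter(lines)
    for header in records:
        seq = next(records).rstrip("\n")
        next(records)  # skip the '+' separator line
        qual = next(records).rstrip("\n")
        yield header.rstrip("\n"), seq, qual

def trim_primer(seq, qual, primer):
    """Remove an exact-match primer from the start of a read."""
    if seq.startswith(primer):
        # Trim the quality string by the same amount to keep them aligned
        return seq[len(primer):], qual[len(primer):]
    return seq, qual

toy_fastq = [
    "@read1\n", "ACGTGGGTTT\n", "+\n", "IIIIHHHHGG\n",
    "@read2\n", "TTTTGGGAAA\n", "+\n", "IIIIHHHHGG\n",
]
trimmed = [
    (header, *trim_primer(seq, qual, primer="ACGT"))
    for header, seq, qual in read_fastq(toy_fastq)
]
for record in trimmed:
    print(*record)
```

Only read1 starts with the primer, so it is the only read that gets shortened.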

Global Distribution of Crassphage Map

How to make beautiful maps

Making maps is hard. Even though we’ve been making maps for hundreds of years, it is still hard. Making good-looking maps is really hard. We published a map that is both beautiful and tells a story, and this is the story of how we made that map.

But a figure like this does not appear immediately. It takes work to get something to look this good, and needless to say it wasn’t me that made it look so great!


Publishing a Django Website behind a proxy server

We use proxy servers all the time: we have a main server (e.g. http://edwards.sdsu.edu/) that serves applications (e.g. http://edwards.sdsu.edu/GenomePeek), but the application itself runs on different hardware than the web server.

Here, we show how to host a Django project behind a proxy server using the Apache web server and make it accessible.
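On the Django side, a setup like this usually comes down to a few settings. The fragment below is only a sketch of the kind of thing involved, not necessarily the configuration from the full post: the /myapp prefix is a placeholder, while ALLOWED_HOSTS, USE_X_FORWARDED_HOST, FORCE_SCRIPT_NAME and STATIC_URL are standard Django settings.

```python
# settings.py fragment (hypothetical values, adjust for your own deployment)

# Hostname of the public-facing server that forwards requests to Django
ALLOWED_HOSTS = ["edwards.sdsu.edu"]

# Trust the X-Forwarded-Host header that the proxy sets
USE_X_FORWARDED_HOST = True

# URL prefix under which the proxy exposes the application (placeholder)
FORCE_SCRIPT_NAME = "/myapp"

# Serve static files under the same prefix so links resolve through the proxy
STATIC_URL = FORCE_SCRIPT_NAME + "/static/"
```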
