Python has a versioning module that will automatically get its version from your latest git tag, and put that version in a pip repository.
Briefly, here are the components
Here are some (probably simple) mmseqs2 tips that you probably don’t remember how to do. If you need explanations or details, read the mmseqs2 wiki. Otherwise, good luck!
We often want to calculate Pearson correlation between different datasets; for example, we have used it to identify the hosts of different phages. Often, we want to calculate Pearson correlations on really large matrices, and so our usual solution is to use crappy code and be patient!
However, Daniel Jones recently released turbocor, a fast, Rust-based implementation of pairwise Pearson correlations, and so we are intrigued to work with it. Here is a brief guide to making correlations using turbocor.
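For reference, here is a minimal pure-Python sketch of what "pairwise Pearson correlation" computes. This is not turbocor and not our production code; the function names pearson and pairwise_pearson are made up for illustration, and it will be slow on the large matrices mentioned above.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations from the mean (unnormalised covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        # A constant row has no defined correlation
        return float("nan")
    return cov / (dev_x * dev_y)


def pairwise_pearson(rows):
    """Correlate every pair of rows; keys are (i, j) row indices with i < j."""
    return {
        (i, j): pearson(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }
```

The number of pairs grows quadratically with the number of rows, which is why a fast implementation matters for big matrices.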
Here are some tips and tricks for minimap2 that I keep forgetting!
If you have a large (>4 GB) multisequence index file, there are two options.
The first is to increase the value of -I when you build the index (preferred) so that the whole index is kept in memory. Note: This must be done when you build the index; you can’t build the index and then change -I at runtime.
The second is to use --split-prefix with a string. For snakemake, there are two options:
1. You can use "{sample}" as your prefix like so:
params:
    prfx = "{sample}"
...
shell:
    """
    minimap2 --split-prefix {params.prfx} ...
    """
2. You can use a random 6-character string like so:
import random, string
params:
    prfx = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))
...
shell:
    """
    minimap2 --split-prefix {params.prfx} ...
    """
The trick is that things will probably break if your index file is small. If you see the error: [W::sam_hdr_create] Duplicated sequence, it is probably because you have split a small index sequence, and the sequence IDs are being duplicated. Remove the --split-prefix option and you should be good.
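If you are calling minimap2 from plain Python rather than snakemake, the same idea can be sketched as below. The helper names random_prefix and minimap2_command and the file names in the usage note are hypothetical; the only minimap2 options used are -a (SAM output) and --split-prefix. The prefix is optional here so that it is easy to leave it off for small indexes.

```python
import random
import string


def random_prefix(k=6):
    """Return a random k-character string of uppercase letters and digits."""
    return ''.join(random.choices(string.ascii_uppercase + string.digits, k=k))


def minimap2_command(index, reads, split_prefix=None):
    """Build a minimap2 argument list, adding --split-prefix only if requested."""
    cmd = ["minimap2", "-a"]
    if split_prefix:
        # Only needed for large, multi-part indexes; omit it for small ones
        cmd += ["--split-prefix", split_prefix]
    cmd += [index, reads]
    return cmd
```

For example, minimap2_command("big.mmi", "reads.fastq", split_prefix=random_prefix()) gives a list you could hand to subprocess.run, and leaving out split_prefix gives the plain command.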
A few tips and tricks for working with slurm (i.e. submitting jobs using sbatch) that I frequently forget!
In DNA sequencing, we add primers and adapters to the ends of sequences. These are short (typically <50 bp) known sequences that we use to identify different kinds of sequences. You can find out more about the adapters in this YouTube video.
This challenge is to write software to efficiently detect and remove the primers and adapters from a fastq format file.
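To give a feel for the problem, here is a deliberately naive sketch, not a solution to the challenge. It assumes four-line fastq records, a single adapter at the 3' end, and exact matches only (no mismatches or indels); the function names and the min_overlap parameter are invented for illustration.

```python
def trim_3prime_adapter(seq, qual, adapter, min_overlap=3):
    """Remove an exact-match adapter, or a prefix of it at the end of the read."""
    pos = seq.find(adapter)
    if pos == -1:
        # The read may end part-way through the adapter, so look for the
        # longest adapter prefix (at least min_overlap bases) at the read end
        for overlap in range(min(len(adapter) - 1, len(seq)), min_overlap - 1, -1):
            if seq.endswith(adapter[:overlap]):
                pos = len(seq) - overlap
                break
    if pos == -1:
        return seq, qual
    # Trim the quality string to the same length as the sequence
    return seq[:pos], qual[:pos]


def trim_fastq(lines, adapter):
    """Yield trimmed (header, seq, plus, qual) tuples from four-line fastq records."""
    stripped = (line.rstrip("\n") for line in lines)
    # Passing the same generator four times groups the lines into records
    for header, seq, plus, qual in zip(*[stripped] * 4):
        seq, qual = trim_3prime_adapter(seq, qual, adapter)
        yield header, seq, plus, qual
```

An efficient answer would need to handle sequencing errors, primers at both ends, and very large files, none of which this sketch attempts.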
Making maps is hard. Even though we’ve been making maps for hundreds of years, it is still hard. Making good-looking maps is really hard. We published a map that is both beautiful and tells a story, and this is the story of how we made that map.
But a figure like this does not appear immediately; it takes work to get something to look this good, and needless to say, it wasn’t me that made it look so great!
We use Google Scholar a lot to search the literature, but also to track our own works. Several grant administration systems use ORCID to retrieve citations. Here is how you can “easily” update ORCID with all your citations from Google Scholar.
CAT/BAT by Bastiaan von Meijenfeldt and Bas Dutilh is a terrific tool for assigning taxonomy to contigs or metagenome bins. However, phages, and especially prophages, cause some problems because their taxonomy clashes with the bacterial taxonomy. Here is how to update the taxonomic profiles to handle prophages somewhat better.
We use proxy servers all the time: we have a main server (e.g. http://edwards.sdsu.edu/) that serves applications (e.g. http://edwards.sdsu.edu/GenomePeek), but the application itself runs on different hardware than the webserver.
Here, we show how to host a Django project on a proxy server using the apache web server and make it accessible.