Here are some (probably simple) mmseqs2 tips that you probably don’t remember how to do. If you need explanations or details, read the mmseqs2 wiki. Otherwise, good luck!
We often want to calculate Pearson correlation between different datasets; for example, we have used it to identify the hosts of different phages. Often we want to calculate Pearson correlations on really large matrices, and so our usual solution is to use crappy code and be patient!
However, Daniel Jones recently released turbocor, a fast, Rust-based implementation of pairwise Pearson correlation, and so we were intrigued to try it. Here is a brief guide to computing correlations with turbocor.
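For a sense of what turbocor computes, here is a minimal pure-Python sketch of pairwise Pearson correlation (the function names here are our own, and turbocor itself is far faster since it uses SIMD and parallelism):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def pairwise_pearson(rows):
    """All pairwise correlations between the rows of a matrix,
    keyed by (row_i, row_j) with i < j."""
    return {(i, j): pearson(rows[i], rows[j])
            for i in range(len(rows)) for j in range(i + 1, len(rows))}
```

On a matrix with thousands of rows this quadratic loop is exactly the "crappy code and be patient" approach; turbocor does the same computation, just much faster.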
We love bioconda! Our preferred way to add a Python package to bioconda is via PyPI, as it is the lazy solution (of course, Perl programmers are notoriously lazy, but Python programmers can be too!). Once you have the bioconda recipe working, the bots will take care of the updates for you!
Snakemake checkpoints are a little complex to get your head around, so here are two examples that will hopefully clarify some use cases.
Before we begin, we're going to design a rule that makes an unknown number of randomly named files (but ten files at most!). Here is a simple snakemake rule that will do this and put them in a directory called first_directory.
OUTDIR = "first_directory"

rule all:
    input:
        OUTDIR

rule make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """
There are a couple of things you need to know about this block:
- The bash variable $RANDOM issues a random number whose maximum value will depend on your particular shell (0 to 32767 in bash).
- The bash command seq will make the sequence up to that number (e.g. seq 10 counts from 1 to 10, so the loop runs ten times).
- We don't know in advance what files will be there!
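If it helps to see the same idea outside of bash, here is a hypothetical Python equivalent of what the shell block does: pick N between 1 and 10, then create N files with random numeric names (name collisions can leave you with fewer than N files, just as with $RANDOM in the shell):

```python
import os
import random
import tempfile

def make_some_files(outdir):
    """Create between 1 and 10 files with random numeric names,
    mimicking the shell block in the rule above."""
    os.makedirs(outdir, exist_ok=True)
    n = random.randint(1, 10)                     # N=$(((RANDOM%10)+1))
    for _ in range(n):
        name = f"{random.randint(0, 32767)}.txt"  # $RANDOM is 0..32767 in bash
        open(os.path.join(outdir, name), "w").close()
    return os.listdir(outdir)

files = make_some_files(os.path.join(tempfile.mkdtemp(), "first_directory"))
```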
So we convert this rule to a checkpoint, by changing the first word in that definition from rule to checkpoint.
OUTDIR = "first_directory"

rule all:
    input:
        OUTDIR

checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """
Note: the only change is rule -> checkpoint. Now we can do something with that checkpoint.
We can use that checkpoint as an input to another rule. Here, we use it as the input to rule all.
We need to get the directory that the checkpoint makes (OUTDIR) and then the files that are in that directory.
import os

OUTDIR = "first_directory"

def get_file_names(wildcards):
    # note 1: ck_output is the same as OUTDIR, but this requires
    # the checkpoint to complete before we can figure out what it is!
    # note 2: checkpoints will have attributes for each of the checkpoint
    # rules, accessible by name. Here we use make_some_files
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

rule all:
    input:
        get_file_names

checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """
Note: See how we used a checkpoint called make_some_files and then asked for checkpoints.make_some_files.get(**wildcards).output[0] in the function? That's the connection! At that point, snakemake knows not to evaluate the function until the checkpoint has completed.
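Under the hood, glob_wildcards is just pattern matching against the files on disk. A rough pure-Python imitation (not Snakemake's actual implementation, and handling only this single-wildcard case) shows where SMP comes from:

```python
import os
import tempfile

def glob_txt_samples(directory):
    """Collect the {sample} part of every {sample}.txt file in a directory,
    roughly what glob_wildcards(os.path.join(dir, "{sample}.txt")) yields."""
    return sorted(f[:-len(".txt")] for f in os.listdir(directory)
                  if f.endswith(".txt"))

# simulate a checkpoint output directory with three randomly named files
d = tempfile.mkdtemp()
for name in ("17.txt", "20553.txt", "9.txt"):
    open(os.path.join(d, name), "w").close()

samples = glob_txt_samples(d)
```

Once you have that list of sample names, expand just substitutes each one back into the output pattern.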
We can also use the output as wildcards in another rule.
import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"

def get_file_names(wildcards):
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

def dup_file_names(wildcards):
    du_output = checkpoints.make_some_files.get(**wildcards).output[0]
    SMPLS, = glob_wildcards(os.path.join(du_output, "{smpl}.txt"))
    return expand(os.path.join(SNDDIR, "{SM}.tsv"), SM=SMPLS)

rule all:
    input:
        get_file_names,
        dup_file_names

checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

rule duplicate:
    input:
        get_file_names
    output:
        os.path.join(SNDDIR, "{SAMPLE}.tsv")
    shell:
        """
        touch {output}
        """
Here, I am going to use a global variable to store the wildcards, which allows us to read and use them in different places.
import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"
SMP = None

def get_file_names(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    global SMP
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

def get_second_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    SMP2, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(SNDDIR, "{SM}.tsv"), SM=SMP2)

rule all:
    input:
        "list_of_files.txt",
        get_second_files

checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        get_file_names
    output:
        os.path.join(SNDDIR, "{SAMPLE}.tsv")
    shell:
        """
        touch {output}
        """

rule list_all_files:
    input:
        get_file_names,
        expand(os.path.join(SNDDIR, "{s}.tsv"), s=SMP)
    output:
        "list_of_files.txt"
    shell:
        """
        echo {input} > {output}
        """
Two checkpoints? No problem! We can use two checkpoints and then combine them in a single expand statement.
For example, let's make two random sets of files, and then a third set of files that combines all their filenames.
Note: here I only make up to five random files each time, since with ten apiece we could end up with 100 combined files, but the answer is the same.
import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"
THRDIR = "third_directory"

def combine(wildcards):
    # read the first set of outputs
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    FIRSTS, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    # read the second set of outputs
    sn_output = checkpoints.make_more_files.get(**wildcards).output[0]
    SECONDS, = glob_wildcards(os.path.join(sn_output, "{smpl}.txt"))
    return expand(os.path.join(THRDIR, "{first}.{second}.tsv"), first=FIRSTS, second=SECONDS)

rule all:
    input:
        combine

checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%5)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

checkpoint make_more_files:
    output:
        directory(SNDDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%5)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

rule make_third_files:
    input:
        # the directory() flag is only valid for outputs, so plain paths here
        OUTDIR,
        SNDDIR,
    output:
        os.path.join(THRDIR, "{first}.{second}.tsv")
    shell:
        """
        touch {output}
        """
Hopefully this will get you started, but let us know if not!
I can never remember the order of the columns in a SAM file, so here they are:
1. QNAME - query (read) name
2. FLAG - bitwise flag
3. RNAME - reference sequence name
4. POS - 1-based leftmost mapping position
5. MAPQ - mapping quality
6. CIGAR - CIGAR string
7. RNEXT - reference name of the mate/next read
8. PNEXT - position of the mate/next read
9. TLEN - observed template length
10. SEQ - read sequence
11. QUAL - base qualities
There are lots more details in the samtools manual. You should also use the samtools flag explainer to understand column 2.
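Column 2 is a bitwise OR of the SAM flag bits, so you can also decode it yourself in a few lines. This little helper (a sketch of ours, not part of samtools) lists the set bits:

```python
# SAM flag bits, as defined in the SAM format specification
SAM_FLAGS = [
    (0x1, "paired"), (0x2, "proper pair"), (0x4, "unmapped"),
    (0x8, "mate unmapped"), (0x10, "reverse strand"),
    (0x20, "mate reverse strand"), (0x40, "first in pair"),
    (0x80, "second in pair"), (0x100, "secondary alignment"),
    (0x200, "QC fail"), (0x400, "duplicate"), (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the names of the bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]
```

For example, decode_flag(99) gives paired, proper pair, mate reverse strand, first in pair: the classic first read of a properly mapped pair.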
This is an Australian thing! If you are not from Australia, you probably don’t need to do this, but in case you do, here it is!
The ARC’s Research Management System has an outputs section that is, of course, mandatory, and is, equally of course, different from everyone else’s system.
I try to curate my references in Google Scholar, ORCID, and my resume, and I don't want to manually enter them into RMS, especially when a grant deadline is close. So here is one way to avoid that.
Here are some tips and tricks for minimap2 that I keep forgetting!
If you have a large (>4 GB) multisequence index file, there are two options.
The first is to increase the value of -I when you build the index (preferred) so that the whole index is kept in memory. Note: this must be done when you build the index; you can't build the index and then change -I at runtime.
The second is to use --split-prefix with a string. For snakemake, there are two options:
"{sample}"
as your prefix like so:params:
prfx = "{sample}"
...
shell:
"""
minimap2 --split-prefix {params.prfx} ...
"""
2. You can use a random six-character string, like so:

import random, string

params:
    prfx = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))
...
shell:
    """
    minimap2 --split-prefix {params.prfx} ...
    """
The trick here is that things will probably break if your index file is small. If you see the error [W::sam_hdr_create] Duplicated sequence, it is probably because you have split a small index, and the sequence IDs are being duplicated. Remove the --split-prefix option and you should be good.
A few tips and tricks for working with slurm (i.e. submitting jobs using sbatch) that I frequently forget!
In DNA sequencing, we add primers and adapters to the ends of sequences. These are short (typically <50 bp) known sequences that we use so we can identify different kinds of sequences. You can find out more about the adapters in this YouTube video.
This challenge is to write software to efficiently detect and remove the primers and adapters from a fastq format file.
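As a baseline for the challenge, here is a naive trimmer for a single read (exact match only; real tools allow mismatches and partial adapters at the 3' end, so treat this as a hypothetical starting point, not a solution):

```python
def trim_adapter(seq, qual, adapter):
    """Remove everything from the first exact adapter occurrence onward.
    Returns the (possibly shortened) sequence and the matching qualities."""
    i = seq.find(adapter)
    if i == -1:
        return seq, qual          # adapter not found: leave the read alone
    return seq[:i], qual[:i]      # keep only the bases before the adapter

# example read ending in an Illumina-style adapter sequence
seq = "ACGTACGTAGATCGGAAGAGCTTTT"
qual = "I" * len(seq)
trimmed_seq, trimmed_qual = trim_adapter(seq, qual, "AGATCGGAAGAGC")
```

A real solution also has to keep sequence and quality lines in sync across the whole fastq file, and do it fast.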
Making maps is hard. Even though we've been making maps for hundreds of years, it is still hard. Making good-looking maps is really hard. We published a map that is both beautiful and tells a story, and this is the story of how we made that map.
But a figure like this does not appear immediately; it takes work to get something to look this good, and needless to say it wasn't me that made it look so great!