Snakemake is a versatile pipeline manager for doing a lot of bioinformatics analysis, but handling wildcards in snakemake is not transparent, and here are some tips and tricks that we have gathered to help you process lots of files easily.
Note: most of these are contrived examples since I fixed the problems and don’t want to go back and recreate it to get the exact instance where they occurred! Hopefully some of these tips will help keep you sane as you work out your workflow!
Use constraints on wildcards
One of the problems I run into is that snakemake
says it can’t figure out how to create a file that stems from a wildcard. For example, if you start with a file called seqs.fastq
and then make a file called seqs.clean.fastq
and then seqs.clean.really.fastq
using wildcards, you might use something like this:
rule clean:
input:
"{seqfile}.fastq"
output:
"{seqfile}.clean.fastq"
rule reallyclean:
input:
"{seqfile}.clean.fastq"
output:
"{seqfile}.clean.really.fastq"
Now, however snakemake
will try and match {seqfile}.fastq
to both seqs.fastq
and seqs.clean.fastq
giving you wildcards of seqs
and seqs.clean
You can overcome this problem by constraining your wildcards. There are three ways to do it, but typically I find the global solution to be the cleanest:
wildcard_constraints:
seqfile = '\w+'
With this constraint, the wilcard can only contain letters, numbers, or underscores (not strictly true but close enough). This means that the wildcard can only match seqs
and not seqs.clean
(because the .
is not a letter, number, or underscore). This constrains your wildcard matches and ensures that you can have multiple files with the same extension. Of course, with this rule, your wildcards need to only match the regular expression (\w+
)!
You can use wildcards in input
, output
, and params
but not shell
If you have a wildcard like {sample}
you can not use that directly in the shell
rule. Instead, you can set it as a param.
Example. This will not work:
rule get:
output:
"{sample}.fasta"
shell:
"curl -Lo {output} http://ftp://ftp.patricbrc.org/genomes/{sample}/{sample}.fna"
You will get an error: The name 'sample' is unknown in this context.
Instead, you need to make {sample}
into a parameter:
rule get:
output:
"{sample}.fasta"
params:
gid = "{sample}"
shell:
"curl -Lo {output} http://ftp://ftp.patricbrc.org/genomes/{params.gid}/{params.gid}.fna"
Use python magic to make life easier
Python has a lot of built-in operations that you can rely on to easily identify sets of files. For example, regular expressions, filters, and maps, are essential for snakemake
processing.
Example: find all fastq
files in a directory
Often you want to process all the fastq
files in a given directory, but none of the other files. You can do this with wildcard globbing, but using a little Python code can also make life easier.
In this scenario, I have two directories: READDIR
has a list of directories, one per barcode for my sequences. Each barcode directory has some fastq
files, but also has some reports generated by the sequencer that I want to ignore. Using map
and filter
I can easily make ` list of all the fastq files that I can then use as an input:
import re
fastqre = re.compile(r'\.fastq$')
# note, in reality this is one line!
fastq_files = list(
map(
lambda x: os.path.join(READDIR, wildcards.barcode, x),
filter(
fastqre.search,
os.listdir(os.path.join(READDIR, wildcards.barcode)
))))
the os.listdir(ospath.join(READDIR, wildcards.barcode))
reads all the files in each barcode directory (both fastq files and other files), and then the filter
applies the fastqre
regular expression to only get the fastq files. The map
then merges the filename with the rest of the path, so that the final list has complete paths to the fastq files.
Note that at the end I turn this into a list
. Both map
and filter
return iterable objects, but Snakemake
doesn’t iterate them by default. Therefore you need to make them a list so that snakemake
will process them.
Be careful where you use expand!
The expand()
function of snakemake
by default creates a cartesian product of your list(s) and processes them. However, it makes a big difference where you actually put that expand call!
expand()
in rule all
This is probably where you want to use expand (or meant to!). This will make a cartesian product and process all the other rules, setting the wildcard as appropriate. This means that each rule will be run with (typically) one input, once per expanded item.`
expand()
in other rule’s input:
calls
This will run that rule with the cartesian product of all the files. In this case, {input}
will be a long list of all your filenames. This is probably not where you want to expand()
your wildcards unless you want to, for example, concatenate the output of a lot of files into one.
expand()
is not sorted()
but can be
By default expand()
is not sorted but it is easy to do so:
sorted(expand(os.path.join(PNGDIR, "{bc1}.png"), bc1 = FA))
This list is now sorted alphabetically.
Global variables can be accessed in rules
If you have a global variable and want to access it later, use a lambda function if it is a dictionary.
For example, lets set up a method that reads a set of fasta
files and remembers the sequence length in each
genome_len = {}
def fasta_len(faf):
""" calculate the length of sequence in the file"""
seqlen = 0
global genome_len
with open(faf, 'r') as f:
for l in f:
if not l.startswith('>'):
seqlen += len(l.strip())
genome_len[faf] = seqlen
return seqlen
Later, in a rule, we can access the dictionary genome_len
but we need to use a lambda function:
rule blah:
params:
gl = lambda wildcards: genome_len[wildcards.fastafiles],
This is particularly useful if you need to pass the genome length (gl
) as a parameter to the next command (e.g. mummerplot
requires the genome length as a parameter)
Ignore a specific exit code
If your code exits with non-zero, snakemake
will assume that is an error and stop computing. However, sometimes you want to keep going. Here is code you can use to skip these errors.
In this case, mummerplot
exists with code 25 if one of the fasta files is empty. This will touch the output file ({output.fp}
) included in the output rule, and create an empty .png
file that mummerplot
might have made but didn’t.
If the exit code is not 25, we continue as normal.
shell:
"""
set +e
mummerplot -x '[0,{params.g1}]' -y '[0,{params.g2}]' --png -p {params.outbase} {input.mums}
exitcode=$?
if [ $exitcode == 25 ]
then
touch {output.fp}
convert -label {params.label} -size 100x100 xc:white {output.png}
exit 0
else
convert -label {params.label} {params.outbase}.png {output.png}
exit $exitcode
fi
"""