Wildcards in Snakemake

Snakemake is a versatile pipeline manager for doing a lot of bioinformatics analysis, but handling wildcards in snakemake is not transparent, and here are some tips and tricks that we have gathered to help you process lots of files easily.

Note: most of these are contrived examples since I fixed the problems and don’t want to go back and recreate it to get the exact instance where they occurred! Hopefully some of these tips will help keep you sane as you work out your workflow!

Use constraints on wildcards

One of the problems I run into is that snakemake says it can’t figure out how to create a file that stems from a wildcard. For example, if you start with a file called seqs.fastq and then make a file called seqs.clean.fastq and then seqs.clean.really.fastq using wildcards, you might use something like this:

rule clean:
    input:
        "{seqfile}.fastq"
    output:
        "{seqfile}.clean.fastq"

rule reallyclean:
    input:
        "{seqfile}.clean.fastq"
    output:
        "{seqfile}.clean.really.fastq"

Now, however snakemake will try and match {seqfile}.fastq to both seqs.fastq and seqs.clean.fastq giving you wildcards of seqs and seqs.clean

You can overcome this problem by constraining your wildcards. There are three ways to do it, but typically I find the global solution to be the cleanest:

wildcard_constraints:
   seqfile = '\w+'

With this constraint, the wilcard can only contain letters, numbers, or underscores (not strictly true but close enough). This means that the wildcard can only match seqs and not seqs.clean (because the . is not a letter, number, or underscore). This constrains your wildcard matches and ensures that you can have multiple files with the same extension. Of course, with this rule, your wildcards need to only match the regular expression (\w+)!

You can use wildcards in input, output, and params but not shell

If you have a wildcard like {sample} you can not use that directly in the shell rule. Instead, you can set it as a param.

Example. This will not work:

rule get:
    output:
       "{sample}.fasta"
    shell:
       "curl -Lo {output} http://ftp://ftp.patricbrc.org/genomes/{sample}/{sample}.fna"

You will get an error: The name 'sample' is unknown in this context.

Instead, you need to make {sample} into a parameter:

rule get:
    output:
       "{sample}.fasta"
    params:
         gid = "{sample}"
    shell:
       "curl -Lo {output} http://ftp://ftp.patricbrc.org/genomes/{params.gid}/{params.gid}.fna"

Use python magic to make life easier

Python has a lot of built-in operations that you can rely on to easily identify sets of files. For example, regular expressions, filters, and maps, are essential for snakemake processing.

Example: find all fastq files in a directory

Often you want to process all the fastq files in a given directory, but none of the other files. You can do this with wildcard globbing, but using a little Python code can also make life easier.

In this scenario, I have two directories: READDIR has a list of directories, one per barcode for my sequences. Each barcode directory has some fastq files, but also has some reports generated by the sequencer that I want to ignore. Using map and filter I can easily make ` list of all the fastq files that I can then use as an input:

import re
fastqre   = re.compile(r'\.fastq$')

# note, in reality this is one line!
fastq_files = list(
                  map(
                     lambda x: os.path.join(READDIR, wildcards.barcode, x), 
                      filter(
                           fastqre.search, 
                             os.listdir(os.path.join(READDIR, wildcards.barcode)
))))

the os.listdir(ospath.join(READDIR, wildcards.barcode)) reads all the files in each barcode directory (both fastq files and other files), and then the filter applies the fastqre regular expression to only get the fastq files. The map then merges the filename with the rest of the path, so that the final list has complete paths to the fastq files.

Note that at the end I turn this into a list. Both map and filter return iterable objects, but Snakemake doesn’t iterate them by default. Therefore you need to make them a list so that snakemake will process them.

Be careful where you use expand!

The expand() function of snakemake by default creates a cartesian product of your list(s) and processes them. However, it makes a big difference where you actually put that expand call!

expand() in rule all

This is probably where you want to use expand (or meant to!). This will make a cartesian product and process all the other rules, setting the wildcard as appropriate. This means that each rule will be run with (typically) one input, once per expanded item.`

expand() in other rule’s input: calls

This will run that rule with the cartesian product of all the files. In this case, {input} will be a long list of all your filenames. This is probably not where you want to expand() your wildcards unless you want to, for example, concatenate the output of a lot of files into one.

expand() is not sorted() but can be

By default expand() is not sorted but it is easy to do so:

sorted(expand(os.path.join(PNGDIR, "{bc1}.png"), bc1 = FA))

This list is now sorted alphabetically.

Global variables can be accessed in rules

If you have a global variable and want to access it later, use a lambda function if it is a dictionary.

For example, lets set up a method that reads a set of fasta files and remembers the sequence length in each

genome_len = {}
def fasta_len(faf):
    """ calculate the length of sequence in the file"""
    seqlen = 0
    global genome_len
    with open(faf, 'r') as f:
        for l in f:
            if not l.startswith('>'):
                seqlen += len(l.strip())
    genome_len[faf] = seqlen
    return seqlen

Later, in a rule, we can access the dictionary genome_len but we need to use a lambda function:

rule blah:
    params:
        gl = lambda wildcards: genome_len[wildcards.fastafiles],

This is particularly useful if you need to pass the genome length (gl) as a parameter to the next command (e.g. mummerplot requires the genome length as a parameter)

Ignore a specific exit code

If your code exits with non-zero, snakemakewill assume that is an error and stop computing. However, sometimes you want to keep going. Here is code you can use to skip these errors.

In this case, mummerplotexists with code 25 if one of the fasta files is empty. This will touch the output file ({output.fp}) included in the output rule, and create an empty .png file that mummerplot might have made but didn’t.

If the exit code is not 25, we continue as normal.

shell:        
    """
    set +e
    mummerplot -x '[0,{params.g1}]' -y '[0,{params.g2}]' --png -p {params.outbase} {input.mums}
    exitcode=$?
    if [ $exitcode == 25 ]
    then
        touch {output.fp}
        convert -label {params.label} -size 100x100 xc:white {output.png}
        exit 0
    else
        convert -label {params.label} {params.outbase}.png {output.png}
        exit $exitcode
    fi
"""