snakemake tutorial

There are a lot of snakemake tutorials out there to get you started:

This tutorial is not one of those! It is quick, hands-on, and we’ll jump right in with little explanation (OK, there will be some). But I encourage you to read those tutorials, especially Titus’s tutorials as you will learn a lot from them.

This tutorial is example-based. We are going to work through a series of problems and use snakemake to solve them. Along the way, we will learn something about snakemake.

Problem 1: Running phispy on all genomes in PATRIC

The goal is to run PhiSpy on all genomes in PATRIC to identify prophages. Before we start, you will need PhiSpy installed, and you will need the PATRIC CLI installed. If you are using our systems you should find them already installed.

You can test PhiSpy by running the -h command and you can test the PATRIC CLI by running p3-whoami. If you are not logged in, you should use p3-login.

First, we are going to get some PATRIC genomes using the p3 commands. On the command line we might do something like:

p3-all-genomes > genomes.txt
mkdir gto genbank
for G in $(sort -R genomes.txt | head -n 10); do 
      p3-gto $G;
      mv $G.gto gto/ 
      rast-export-genome -i $G.gto -o genbank/$G.gbk genbank; -o phispy/$G $G.gbk

(It is an exercise for the reader how to figure out where those commands came from).

In snakemake we would run these as a series of rules.

This rule will download the genome-typed-object file from PATRIC and move it to the gto directory. Note that this rule does not have any inputs!

rule download_gto:
         gid = "{genomeid}"
         "p3-gto {params.gid} && mv {params.gid}.gto {output}"

This rule will export the genome-typed-object file to genbank format:

rule export_gto:
         "rast-export-genome -i {input} -o {output} genbank" 

Now we are going to make a rule to run phispy:

rule run_phispy:
         outdir = "phispy/{genomeid}"
         " -o {params.outdir} {input}"

There are two things we have left to do:

  1. Figure out a list of genomeids to pass around
  2. Figure out what our all rule should be

There are several ways to figure out our genomeids list, but the easiest is perhaps to just read in the file we downloaded above. Of course we could make that a rule, and have it download for us, but since we have it, how about this as python code:

from random import shuffle
 with open("genomes.txt", 'r') as f:
     for l in f:
 #choose 10 ids at random
 IDS = IDS[0:10]

Now we have a list of ten random IDs, we can run the same code on all of them:

rule all:
         expand("phispy/{genomeid}/prophage.tbl"), genomeid=IDS)

Remember, our all rule is a fake rule that doesn’t make an output. This is the first rule in the file.

When we run snakemake with this file we can tell it how many threads to use, in this case, we will use 10 threads to run everything, one thread per genome.

snakemake -j 10 -s phispy.snakefile

You can see a slightly more complex version of this on GitHub

Problem 2: Assemble Nanopore Data

In our second example, we are going to build a Nanopore assembly (well, polishing) pipeline based on this paper

Wick RR, Holt KE. 2019. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Res 8:2138.

The steps that we will accomplish are:

  • start with some fastq files. These are unpaired sequences (Nanopore is not normally paired end)
  • remove the host genome using minimap2 (from Heng Li)
  • use filtlong to select long reads (from Ryan Wick)
  • use minimap2 again to find overlaps between reads (from Heng Li)
  • use miniasm to assemble the overlaps (from Heng Li)
  • use minipolish to polish the overlaps between reads (from Ryan Wick)
  • convert the final gfa file to fasta with awk

The script to handle this is available in GitHub.

Here are some of the new things we have included in this script:

We use glob_wildcards to create a list of files in the READDIR directory that end .fastq.gz. In this case, the variable FASTQ becomes a list of the filenames without .fastq.gz on the end. Note: glob_wildcardsreturns a tuple and so the comma after FASTQ is critical!

FASTQ, = glob_wildcards(os.path.join(READDIR, '{fastq}.fastq.gz'))

Our output file is a single file:

os.path.join(STATS, "all_statistics.tsv")

but we also include an expand rule in the inputs to all. This ensures that all the combinations are run:

expand(os.path.join(OUTDIR, "{sample}_6.fasta"), sample=FASTQ),

This is a little “belt and braces”, but it makes sure that we run all the fastq files and create all the output files, and then create the final statistics file.

Also, note that I append _1, _2, _3, etc to the filenames. This ensures that later I can sort the file names in the statistics output and have them in the correct order.

Problem 3: Processing metagenomes

In this new pipeline we assemble a bunch of reads into contigs using megahit, and then map the reads back to those contigs. Next, we identify unmapped reads and run another round of assembly with megahit.

This snakefile introduces a couple of concepts. Note at the start, we have a bunch of python code that figures out our hostname and chooses the executables based on the hostname (they are the same overall executables, but I am defining their absolute paths in the config file).

Next, we have this common idiom for finding R1 and R2 paired-end reads from Illumina sequencing

SAMPLES, = glob_wildcards(os.path.join(READDIR, '{sample}_R1_001.fastq.gz'))
PATTERN_R1 = '{sample}_R1_001.fastq.gz'
PATTERN_R2 = '{sample}_R2_001.fastq.gz'

Note we use glob_wildcards again, but this time we limit it to the bit before the R1 and then we can define what R1 or R2 would look like.

Later, for our assembly, we use this idiom:

rule megahit_assemble:
        r1 = os.path.join(READDIR, PATTERN_R1),
        r2 = os.path.join(READDIR, PATTERN_R2)
        directory(os.path.join(ASSDIR, '{sample}')),
        '{config[executables][assembler]} -1 {input.r1} -2 {input.r2} -o {output}'

Now we have both R1 and R2 and they match (hopefully).

Our shell: commands can get quite complex. Here we are using awk to parse the samtools output and choose the sequences we want based on the flags in the sam output.

        samtools view {input} | \
                awk 'BEGIN {{FS="\t"; OFS="\t"}} \
                {{if (/^@/ && substr($2, 3, 1)==":") {{print}} \
                else if (and($2, 0x1) && and($2, 0x40) && \
                (and($2, 0x4) || and($2, 0x8))) {{print}}}}' \
                | samtools bam2fq > {output}

Note that awk also uses { and } for control characters, just like snakemake does, and so in this instance, we need to double quote our brackets. For example a normal awk command is something like awk '/abc/ {print}' but in snakemake it would become awk '/abc/ {{print}}'

Wildcards in snakemake

Wildcards are their own problem, and we have seen some examples using them here. I have also written separately about some tips and tricks with using wildcards in snakemake that will hopefully help you out.