There are a lot of snakemake tutorials out there to get you started:
- Intro to workflows for efficient automated data analysis, using snakemake. This is a great introductory tutorial by Titus Brown at UC Davis
- GGG 298, Jan 2020 – Week 4 – snakemake for running workflows! is also by Titus and is a terrific intro too.
- Creating workflows with snakemake and conda on biostars is a great introduction
- A lot of anvi’o is built on snakemake, but this tutorial doesn’t really focus on the snakemake part of things.
- If you are still lost, check out twitter for a bunch of really great snakemake tutorials about all kinds of data (scRNA, nanopore, illumina, pacbio etc etc etc)
This tutorial is not one of those! It is quick, hands-on, and we’ll jump right in with little explanation (OK, there will be some). But I encourage you to read those tutorials, especially Titus’s tutorials, as you will learn a lot from them.
This tutorial is example-based. We are going to work through a series of problems and use snakemake to solve them. Along the way, we will learn something about snakemake.
Problem 1: Running PhiSpy on all genomes in PATRIC
The goal is to run PhiSpy on all genomes in PATRIC to identify prophages. Before we start, you will need PhiSpy installed, and you will need the PATRIC CLI installed. If you are using our systems you should find them already installed.
You can test PhiSpy by running the PhiSpy.py -h command, and you can test the PATRIC CLI by running p3-whoami. If you are not logged in, you should use p3-login.
First, we are going to get some PATRIC genomes using the p3 commands. On the command line we might do something like:
# get a list of all genome IDs in PATRIC
p3-all-genomes > genomes.txt
mkdir gto genbank phispy
# pick ten genomes at random and run PhiSpy on each of them
for G in $(sort -R genomes.txt | head -n 10); do
    p3-gto $G;
    mv $G.gto gto/
    rast-export-genome -i gto/$G.gto -o genbank/$G.gbk genbank;
    PhiSpy.py -o phispy/$G genbank/$G.gbk
done
(It is left as an exercise for the reader to figure out where those commands came from.)
In snakemake we would run these as a series of rules.
This rule will download the genome-typed-object file from PATRIC and move it to the gto directory. Note that this rule does not have any inputs!
rule download_gto:
    output:
        "gto/{genomeid}.gto"
    params:
        gid = "{genomeid}"
    shell:
        "p3-gto {params.gid} && mv {params.gid}.gto {output}"
This rule will export the genome-typed-object file to genbank format:
rule export_gto:
    input:
        "gto/{genomeid}.gto"
    output:
        "genbank/{genomeid}.gbk"
    shell:
        "rast-export-genome -i {input} -o {output} genbank"
Now we are going to make a rule to run PhiSpy:
rule run_phispy:
    input:
        "genbank/{genomeid}.gbk"
    output:
        "phispy/{genomeid}/prophage_coordinates.tsv",
        "phispy/{genomeid}/prophage.gff3",
        "phispy/{genomeid}/prophage.tbl",
        "phispy/{genomeid}/prophage_tbl.tsv"
    params:
        outdir = "phispy/{genomeid}"
    shell:
        "PhiSpy.py -o {params.outdir} {input}"
There are two things we have left to do:
- Figure out a list of genomeids to pass around
- Figure out what our all rule should be
There are several ways to figure out our genomeids list, but the easiest is perhaps to just read in the file we downloaded above. Of course we could make that a rule, and have it download for us (there is a sketch of that below), but since we have it, how about this as python code:
from random import shuffle

IDS = []
with open("genomes.txt", 'r') as f:
    for l in f:
        IDS.append(l.strip())

# choose 10 ids at random
shuffle(IDS)
IDS = IDS[0:10]
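As an aside, if we did want snakemake to fetch the genome list for us, a minimal sketch of that rule (using the same p3-all-genomes command as before) might be:

rule download_genome_list:
    output:
        "genomes.txt"
    shell:
        "p3-all-genomes > {output}"

Note, though, that the python code above reads genomes.txt when the snakefile is parsed, before any rules run, so in practice it is easier to download the file once by hand as we did here.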
Now that we have a list of ten random IDs, we can run the same rules on all of them:
rule all:
    input:
        expand("phispy/{genomeid}/prophage.tbl", genomeid=IDS)
Remember, our all rule is a fake rule that doesn’t make an output. This should be the first rule in the file, because snakemake builds the targets of the first rule by default.
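Putting it together, the rough layout of phispy.snakefile is (a sketch of the ordering only, not the complete file):

# phispy.snakefile -- ordering sketch
# 1. the python code that builds the IDS list
# 2. rule all          <- the first rule, so it is the default target
# 3. rule download_gto
# 4. rule export_gto
# 5. rule run_phispy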
When we run snakemake with this file we can tell it how many jobs to run in parallel; in this case, we will use 10 so that all ten genomes are processed at once, one job per genome.
snakemake -j 10 -s phispy.snakefile
You can see a slightly more complex version of this on GitHub.
Problem 2: Assemble Nanopore Data
In our second example, we are going to build a Nanopore assembly (well, polishing) pipeline based on this paper:
Wick RR, Holt KE. 2019. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Res 8:2138.
The steps that we will accomplish are:
- start with some fastq files. These are unpaired sequences (Nanopore is not normally paired end)
- remove the host genome using minimap2 (from Heng Li)
- use filtlong to select long reads (from Ryan Wick)
- use minimap2 again to find overlaps between reads (from Heng Li)
- use miniasm to assemble the overlaps (from Heng Li)
- use minipolish to polish the assembly (from Ryan Wick)
- convert the final gfa file to fasta with awk (there is a sketch of this step below)
The script to handle this is available on GitHub.
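As a taste of what is in that script, the last step (the gfa to fasta conversion) could be written as a rule something like this. This is only a sketch: the _5.gfa and _6.fasta file names and the rule name are my assumptions here, and OUTDIR is the output directory variable used in the script, so check the GitHub version for the real details.

rule gfa_to_fasta:
    input:
        # assumed name for the minipolish output; the real snakefile may differ
        os.path.join(OUTDIR, "{sample}_5.gfa")
    output:
        os.path.join(OUTDIR, "{sample}_6.fasta")
    shell:
        # the S lines of a gfa file hold the sequences: print a fasta header, then the sequence.
        # the braces are doubled so snakemake does not treat them as wildcards (see Problem 3)
        r"""awk '/^S/ {{print ">"$2"\n"$3}}' {input} > {output}"""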
Here are some of the new things we have included in this script:
We use glob_wildcards to create a list of files in the READDIR directory that end in .fastq.gz. In this case, the variable FASTQ becomes a list of the filenames without .fastq.gz on the end. Note: glob_wildcards returns a tuple and so the comma after FASTQ is critical!
FASTQ, = glob_wildcards(os.path.join(READDIR, '{fastq}.fastq.gz'))
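For example, with purely hypothetical file names:

# if READDIR contains barcode01.fastq.gz and barcode02.fastq.gz,
# the line above leaves us with
assert sorted(FASTQ) == ['barcode01', 'barcode02']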
Our output file is a single file:
os.path.join(STATS, "all_statistics.tsv")
but we also include an expand() statement in the inputs to all. This ensures that all the combinations are run:
expand(os.path.join(OUTDIR, "{sample}_6.fasta"), sample=FASTQ),
This is a little “belt and braces”, but it makes sure that we run all the fastq files and create all the output files, and then create the final statistics file.
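Putting those two pieces together, the all rule for this pipeline looks something like this (a sketch only; OUTDIR, STATS, and FASTQ are defined earlier in the snakefile, and the version on GitHub may differ slightly):

rule all:
    input:
        expand(os.path.join(OUTDIR, "{sample}_6.fasta"), sample=FASTQ),
        os.path.join(STATS, "all_statistics.tsv")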
Also, note that I append _1, _2, _3, etc. to the filenames. This ensures that later I can sort the file names in the statistics output and have them in the correct order.
Problem 3: Processing metagenomes
In this new pipeline we assemble a bunch of reads into contigs using megahit, and then map the reads back to those contigs. Next, we identify unmapped reads and run another round of assembly with megahit.
This snakefile introduces a couple of concepts. Note that at the start we have a bunch of python code that figures out our hostname and chooses the executables based on the hostname (they are the same overall executables, but I am defining their absolute paths in the config file).
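A minimal sketch of that idea (this is not the author’s exact code; it assumes a hypothetical config layout with one block of absolute paths per host, and that the config has already been loaded with the configfile: directive):

import socket
import sys

# pick the right absolute paths for the machine we are running on
hostname = socket.gethostname().split('.')[0]
if hostname in config['paths']:
    # rules can then use, e.g., {config[executables][assembler]} in their shell commands
    config['executables'] = config['paths'][hostname]
else:
    sys.exit(f"Please add executable paths for {hostname} to the config file")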
Next, we have this common idiom for finding R1 and R2 paired-end reads from Illumina sequencing:
SAMPLES, = glob_wildcards(os.path.join(READDIR, '{sample}_R1_001.fastq.gz'))
PATTERN_R1 = '{sample}_R1_001.fastq.gz'
PATTERN_R2 = '{sample}_R2_001.fastq.gz'
Note we use glob_wildcards again, but this time we limit it to the bit before the R1, and then we can define what R1 or R2 would look like.
Later, for our assembly, we use this idiom:
rule megahit_assemble:
    input:
        r1 = os.path.join(READDIR, PATTERN_R1),
        r2 = os.path.join(READDIR, PATTERN_R2)
    output:
        directory(os.path.join(ASSDIR, '{sample}'))
    shell:
        '{config[executables][assembler]} -1 {input.r1} -2 {input.r2} -o {output}'
Now we have both R1 and R2 and they match (hopefully).
Our shell: commands can get quite complex. Here we are using awk to parse the samtools output and choose the sequences we want based on the flags in the sam output.
samtools view {input} | \
    awk 'BEGIN {{FS="\t"; OFS="\t"}} \
    {{if (/^@/ && substr($2, 3, 1)==":") {{print}} \
    else if (and($2, 0x1) && and($2, 0x40) && \
    (and($2, 0x4) || and($2, 0x8))) {{print}}}}' \
    | samtools bam2fq > {output}
Note that awk also uses { and } to delimit its blocks, just like snakemake does, and so in this instance we need to double up our curly braces so that snakemake does not treat them as wildcards. For example, a normal awk command is something like awk '/abc/ {print}' but in snakemake it would become awk '/abc/ {{print}}'
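Here is a toy rule (not from any of the snakefiles above) showing the two kinds of braces side by side:

rule first_ten_lines:
    input:
        "{sample}.txt"
    output:
        "{sample}.head.txt"
    shell:
        # {input} and {output} are filled in by snakemake; {{print}} reaches awk as {print}
        "awk 'NR <= 10 {{print}}' {input} > {output}"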
Wildcards in snakemake
Wildcards are their own problem, and we have seen some examples using them here. I have also written separately about some tips and tricks with using wildcards in snakemake that will hopefully help you out.