AKA: how to remove contamination from your metagenome! We use sharks genomes, but it works with humans, corals, and other things too!
A while ago we wrote
This is trivial to do with modern sequence analysis tools, and so we provide recipes here for filtering your reads based on matches to a reference genome. Read more to find out how!
We also provide a snakefile that does all these steps. Set up the bowtie2 index, a directory of reads, and and output location, and it will generate mapped and unmapped reads for you!
First, you need your sequence files (presumably as fastq files) and your reference genome that you are going to map to. For this example, we will use bowtie2, because it is still one of the most up-to-date mapping tools, but you can replace it with your mapper of choice.
Lets assume your sequences are in three files
left.fastq
– the left reads in afastq fileright.fastq
– the right reads in a filesingle.fastq
– the singleton reads in afastq file
And also your reference genome is called sharks.fasta
, because it contains sequences of sharks (why not?).
We’re going to use bowtie2 to index our sequences, search against them, and then use samtools to create a binary bam file of the alignment.
# build the index
bowtie2-build sharks.fasta sharks
# run the mapping using 16 processes
bowtie2 -p 16 -x sharks -1 left.fastq -2 right.fastq -U single.fastq | samtools sort > seqs.sharks.bam
Now we have our
Note that for all these steps, we are making use of the samtools flags filters, using either -f or -F (match a flag or do not match a flag) or -G (exclude reads that match). You can learn more about the samtools flags and explore the options with this samtools flags calculator.
Sequences that map to the reference
We can extract all the sequences that map to the shark reference genomes into three files: the left reads, the right reads, and the unmapped reads.
# extract the left reads that match the reference genome
samtools fastq -G 12 -f 65 seqs.sharks.bam > left.shark.fastq
# extract the right reads that match the reference genome
samtools fastq -G 12 -f 129 seqs.sharks.bam > right.shark.fastq
# extract the single reads that match the reference genome
samtools fastq -F 5 seqs.sharks.bam > single.shark.fastq
In the first two commands, we -G 12
-f 65
) or it is the second read in the pair (-f 129
). This combination gives us either the left or right mapped reads.
Next, we look for reads that are not pairs (that would normally be -F 1
) and are mapped (that would normally be -F 4
), hence we use -F 5
to find the singletons that are mapped to the reference, and are hence shark sequences.
Sequences that do not map to the reference
We apply a similar logic to find the sequences that do not map to the reference genome. Again, we’ll end up with three files, a left reads, right reads, and unmapped reads.
# extract the left reads that do NOT match the reference genome
samtools fastq -f 77 seqs.sharks.bam > left.notshark.fastq
# extract the right reads that do NOT match the reference genome
samtools fastq -f 141 seqs.sharks.bam > right.notshark.fastq
# extract the single reads that do NOT match the reference genome
samtools fastq -f 4 -F 1 seqs.sharks.bam > single.notshark.fastq
The logic here is very similar. A -f 77
means the read is -f 141
means the same except it is the second mate in the pair.
Finally, we use two flags: -f 4
(the read is unmapped)
and -F 1
(the read is not paired) to find the single reads that are not mapped.
Here are a couple of other options that you might consider too. These will give you all the left/right reads even if only one of them is unmapped.
69
: the read is paired, the reads is unmapped, it is the first read73
: the read is paired, the mate is unmapped, it is the first read
133
: the read is paired, the reads is unmapped, it is the second read137
: the read is paired, the mate is unmapped, it is the second read
We can’t use samtools to combine these (to my knowledge – let me know if I am wrong), but we can combine them with awk (prompted by this answer):
samtools view alignments.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {if (/^@/ && substr($2, 3, 1)==":") {print} else if (and($2, 0x1) && and($2, 0x40) && (and($2, 0x4) || and($2, 0x8))) {print}}'
The awk code BEGIN {FS="\t"; OFS="\t"}
uses tabs (FS=field separator; OFS=output field separator), if (/^@/ && substr($2, 3, 1)==":") {print}
prints the bam file header lines, and if (and($2, 0x1) && and($2, 0x40) && (and($2, 0x4) || and($2, 0x8))) {print}
prints those lines that match flag 0x1 (1; read paired), flag 0x40 (64, first in pair), and either 0x4 (4; read is unmapped) or 0x8 (8, mate is unmapped).
Substitute 0x80
for 0x40
in that command to get the second in the pair.
In the file I am testing, when I try this, I get samtools flags 69, 73, 77, and 89. Flag 77 occurred 85% of the time, 73/89 were 5% each, and 69,101 were about 2.5% each.
By combining the flags in the samtools files we can easily separate our sequences into those reads that match the reference genome, and those that do not. The next question is, which ones are you interested in!