command line deconseq

AKA: how to remove contamination from your metagenome! We use sharks genomes, but it works with humans, corals, and other things too!

A while ago we wrote deconseq to allow you to remove contamination from your sequence libraries. We used an HTS-mapper to map the reads in your sequences to your reference genome, and then filtered the sequences after mapping.

This is trivial to do with modern sequence analysis tools, and so we provide recipes here for filtering your reads based on matches to a reference genome. Read more to find out how!

We also provide a snakefile that does all these steps. Set up the bowtie2 index, a directory of reads, and and output location, and it will generate mapped and unmapped reads for you!

First, you need your sequence files (presumably as fastq files) and your reference genome that you are going to map to. For this example, we will use bowtie2, because it is still one of the most up-to-date mapping tools, but you can replace it with your mapper of choice.

Lets assume your sequences are in three files

  • left.fastq – the left reads in a fastq file
  • right.fastq – the right reads in a file
  • single.fastq – the singleton reads in a fastq file

And also your reference genome is called sharks.fasta, because it contains sequences of sharks (why not?).

We’re going to use bowtie2 to index our sequences, search against them, and then use samtools to create a binary bam file of the alignment.

# build the index
bowtie2-build sharks.fasta sharks
# run the mapping using 16 processes
bowtie2 -p 16 -x sharks -1 left.fastq -2 right.fastq -U single.fastq | samtools sort > seqs.sharks.bam

Now we have our bamfile we can use samtools to separate the sequences into those sequences that map to the sharks and those sequences that do not map to the sharks. We’ll do this in two separate steps.

Note that for all these steps, we are making use of the samtools flags filters, using either -f or -F (match a flag or do not match a flag) or -G (exclude reads that match). You can learn more about the samtools flags and explore the options with this samtools flags calculator.

Sequences that map to the reference

We can extract all the sequences that map to the shark reference genomes into three files: the left reads, the right reads, and the unmapped reads.

# extract the left reads that match the reference genome
samtools fastq -G 12 -f 65 seqs.sharks.bam > left.shark.fastq
# extract the right reads that match the reference genome
samtools fastq -G 12 -f 129 seqs.sharks.bam > right.shark.fastq
# extract the single reads that match the reference genome
samtools fastq -F 5 seqs.sharks.bam > single.shark.fastq

In the first two commands, we use -G 12 which excludes reads where the read is unmapped and the mate is unmapped, and then we look for reads that are paired and it is the first read in the pair (-f 65) or it is the second read in the pair (-f 129). This combination gives us either the left or right mapped reads.

Next, we look for reads that are not pairs (that would normally be -F 1) and are mapped (that would normally be -F 4), hence we use -F 5 to find the singletons that are mapped to the reference, and are hence shark sequences.

Sequences that do not map to the reference

We apply a similar logic to find the sequences that do not map to the reference genome. Again, we’ll end up with three files, a left reads, right reads, and unmapped reads.

# extract the left reads that do NOT match the reference genome
samtools fastq -f 77 seqs.sharks.bam > left.notshark.fastq
# extract the right reads that do NOT match the reference genome
samtools fastq -f 141 seqs.sharks.bam > right.notshark.fastq
# extract the single reads that do NOT match the reference genome
samtools fastq -f 4 -F 1 seqs.sharks.bam > single.notshark.fastq

The logic here is very similar. A samtools flag of -f 77 means the read is paird, neither the read nor mate is mapped, and the it is the first read in the pair, while a flag of -f 141 means the same except it is the second mate in the pair.

Finally, we use two flags: -f 4 (the read is unmapped) and -F 1 (the read is not paired) to find the single reads that are not mapped.

Here are a couple of other options that you might consider too. These will give you all the left/right reads even if only one of them is unmapped.

  • 69 : the read is paired, the reads is unmapped, it is the first read
  • 73 : the read is paired, the mate is unmapped, it is the first read
  • 133 : the read is paired, the reads is unmapped, it is the second read
  • 137 : the read is paired, the mate is unmapped, it is the second read

We can’t use samtools to combine these (to my knowledge – let me know if I am wrong), but we can combine them with awk (prompted by this answer):

samtools view alignments.bam | awk 'BEGIN {FS="\t"; OFS="\t"} {if (/^@/ && substr($2, 3, 1)==":") {print} else if (and($2, 0x1) && and($2, 0x40) && (and($2, 0x4) || and($2, 0x8))) {print}}' 

The awk code BEGIN {FS="\t"; OFS="\t"} uses tabs (FS=field separator; OFS=output field separator), if (/^@/ && substr($2, 3, 1)==":") {print} prints the bam file header lines, and if (and($2, 0x1) && and($2, 0x40) && (and($2, 0x4) || and($2, 0x8))) {print} prints those lines that match flag 0x1 (1; read paired), flag 0x40 (64, first in pair), and either 0x4 (4; read is unmapped) or 0x8 (8, mate is unmapped).

Substitute 0x80 for 0x40 in that command to get the second in the pair.

In the file I am testing, when I try this, I get samtools flags 69, 73, 77, and 89. Flag 77 occurred 85% of the time, 73/89 were 5% each, and 69,101 were about 2.5% each.

By combining the flags in the samtools files we can easily separate our sequences into those reads that match the reference genome, and those that do not. The next question is, which ones are you interested in!