Using qudaich to compare metagenomes

We recently released a new version of our qudaich software, designed to compare short read sequence data sets to each other. Qudaich is built around a suffix trie and provides a rapid way to compare short read data sets at the DNA or protein level. Here is how to use qudaich to compare a set of metagenomes to find out how similar they are.

Background

For this example, we are going to use human gut viromes from Rick Bushman’s lab. These are the metagenomes that appeared in the crAssphage poops up again post, so we know that there are some similar sequences in them.

The data comes from this paper: Chehoud C, Dryga A, Hwang Y, Nagy-Szakal D, Hollister EB, Luna RA, Versalovic J, Kellermayer R, Bushman FD. 2016.[Transfer of Viral Communities between Human Individuals during Fecal Microbiota Transplantation, and you can download the data from the sequence read archive.

We focused on four datasets from this study:

  • SRR3403834 4,892,517 reads totaling 1,218,689,925 bp
  • SRR3403835 646,661 reads totaling 154,849,037 bp
  • SRR3403836 746,002 reads totaling 179,374,069 bp
  • SRR3403848 4,367,579 reads totaling 1,093,194,181 bp

We chose those four because they have the most crAssphage like sequences in them! For this demonstration, we only used one end of the reads (the _1.fastq.gz files for each), although you should probably pair your data and/or compare both ends.

First, we used our new ceeqlib library (not published and really only in pre-alpha release) to convert the fastq files to fasta format. As you can see, two of the libraries have >4 million sequences in them, and so we also randomly subsampled (without replacement) those two libraries to get down to 750,000 sequences each. That way all four data sets have similar numbers of sequences.

Using qudaich

The first step in qudaich is to build a suffix array for each pair of comparisons. qudaich treats the databases and queries differently in two ways: First, only the database is reverse complemented – the query is compared to both the forward and reverse complement of the database – and second, qudaich reports the best hit from the database for each query sequence.

Therefore in this analysis, we compared each sequence to each other sequence in both directions.

By default, qudaich only reports the sequences where the number of matches is above the average for all matches (i.e. it only reports strong hits). That option was used in the table below.

 

Query Database Query sequences

that have a match

Unique database sequences

that matched

SRR3403834 SRR3403835 421,742 15,375
SRR3403835 SRR3403834 302,198 16,588
SRR3403834 SRR3403836 391,983 18,225
SRR3403836 SRR3403834 286,167 20,607
SRR3403834 SRR3403848 392,922 2,234
SRR3403848 SRR3403834 367,471 2,326
SRR3403835 SRR3403836 412,646 98,875
SRR3403836 SRR3403835 408,972 130,336
SRR3403835 SRR3403848 341,979 1,174
SRR3403848 SRR3403835 167,025 3,375
SRR3403836 SRR3403848 287,202 1,663
SRR3403848 SRR3403836 163,779 4,245

 

As an alternative, you can specify generating an alignment for every query sequence by providing the -a all command line option. That will generate the best possible alignment for every sequence.