Similarity Searching the SRA

The NCBI SRA contains a lot of data: about 10¹⁶ bp at the moment! However, searching it has always been problematic. We’re happy to unveil a new search SRA service that allows you to search the random metagenomes in the SRA using either DNA or protein queries.

The SRA is the common repository for all sequence data, much of it deposited as raw data. We’ve been figuring out ways to explore this data and are pleased to provide the new, experimental, search SRA service.

TL;DR

Go to the search SRA website and create an account
Upload a DNA or protein sequence
Choose a set or subset of metagenomes to search
Get your results back as either indexed BAM files (for a DNA query) or tab-separated alignment files (for a protein query).

The data you can search

We developed the partition engine, partie, to identify all the random metagenomes in the Sequence Read Archive. We estimate there are currently about 120,000 WGS metagenomes in the SRA. Combined, this is about 4 x 10¹⁴ bp and 2 x 10¹² reads. We have provisioned that (or some of it, at least) so that you can search through it for your gene, genome, or protein of interest.

We have currently provided three subsets of this data:

The human microbiome project (HMP) data
The TARA Oceans data
The entire SRA metagenome dataset

If you are a power user, though, you can upload your own set of sequence IDs and search a subset of the data we have provisioned for you. Talk to Rob if you are interested in doing this.

How we do the search and what you get back

If you give us a DNA sequence, we will do the search using bowtie2. We index your sequence as the database, and then stream the sampled SRA data against it. We return a BAM file that has your alignments in it, and you can use standard software to process it. Note that we do not store the reads that did not match, so you only get the reads that match your query. We also have some downstream processing workflows that we are happy to share with you.
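To make that concrete, here is a minimal Python sketch that builds the kind of command lines a search like this involves: index the query with bowtie2-build, align the reads with bowtie2 while dropping unaligned reads (--no-unal), then sort and index the output with samtools. It is an illustration only. The file names, the thread count, and the exact options are assumptions, not the settings the search SRA service actually uses.

```python
import shlex


def bowtie2_search_commands(query_fasta, reads_fastq, out_prefix, threads=4):
    """Return the command lines for one query-vs-reads search.

    All paths are hypothetical placeholders; nothing is executed here.
    """
    index_base = f"{out_prefix}.index"
    sam_path = f"{out_prefix}.sam"
    bam_path = f"{out_prefix}.bam"
    return [
        # Index the query sequence so it acts as the "database".
        ["bowtie2-build", query_fasta, index_base],
        # Align unpaired reads; --no-unal keeps only reads that aligned.
        ["bowtie2", "-p", str(threads), "--no-unal",
         "-x", index_base, "-U", reads_fastq, "-S", sam_path],
        # Sort the alignments and write them out as BAM.
        ["samtools", "sort", "-O", "bam", "-o", bam_path, sam_path],
        # Build the .bai index next to the BAM file.
        ["samtools", "index", bam_path],
    ]


commands = bowtie2_search_commands("query.fasta", "sample_reads.fastq", "result")
for command in commands:
    print(shlex.join(command))
```

To actually run the steps you could pass each list to subprocess.run(command, check=True), provided bowtie2 and samtools are installed and on your PATH.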

If you give us a protein sequence, we use Diamond to perform the searches, and provide you with the alignment output files that Diamond produces.

In both cases, we provide you with a zip archive of the results, one file per metagenome. Yes, that does mean you can end up with hundreds of files.
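With hundreds of files in one archive, a small script helps to see which metagenomes actually matched. The sketch below counts hits per file for a protein search. It assumes that each file in the zip is in Diamond's default 12-column tabular format (the BLAST "outfmt 6" layout, with the e-value in the 11th column); the archive name, the file extension, and the e-value cutoff are hypothetical, so check them against the files you actually receive.

```python
import csv
import io
import zipfile


def hits_per_file(archive, max_evalue=1e-5):
    """Count alignments with e-value <= max_evalue in each file of a zip.

    `archive` may be a path or a binary file-like object. Each member is
    assumed to be a 12-column, tab-separated alignment table.
    """
    counts = {}
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with zf.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8")
                rows = csv.reader(text, delimiter="\t")
                counts[name] = sum(
                    1
                    for row in rows
                    # column 11 (index 10) holds the e-value
                    if len(row) >= 12 and float(row[10]) <= max_evalue
                )
    return counts


# Example usage with a hypothetical archive name:
# counts = hits_per_file("results.zip")
# for name, n in sorted(counts.items(), key=lambda item: -item[1])[:20]:
#     print(f"{n}\t{name}")
```

Sorting by hit count, as in the commented usage, gives a quick shortlist of the metagenomes worth a closer look.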

Data Sampling

In order to complete the computes in a reasonable time, we cannot search all 4 x 10¹⁴ bp of sequence data. Instead, we choose 100,000 reads from each metagenome (if the metagenome has fewer than 100,000 reads, we use all of them), and search against that sample. What does that mean for you? This ends up being about 1% of the entire data set, so we are still searching about 10¹² bp of sequence data!
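As a back-of-the-envelope check on those numbers, the sketch below combines the totals quoted earlier (about 120,000 metagenomes, 4 x 10¹⁴ bp, 2 x 10¹² reads) with the 100,000-read cap. It assumes that sampled reads have the same average length as the data set as a whole, and it gives an upper bound, because metagenomes with fewer than 100,000 reads contribute fewer reads than the cap.

```python
def sampling_estimate(n_metagenomes=120_000, total_bp=4e14,
                      total_reads=2e12, reads_per_metagenome=100_000):
    """Rough upper bound on how much data a capped sample covers."""
    mean_read_length = total_bp / total_reads            # bp per read
    max_sampled_reads = n_metagenomes * reads_per_metagenome
    return {
        "mean_read_length": mean_read_length,
        "max_sampled_reads": max_sampled_reads,
        "read_fraction": max_sampled_reads / total_reads,
        "sampled_bp": max_sampled_reads * mean_read_length,
    }


estimate = sampling_estimate()
print(f"mean read length: {estimate['mean_read_length']:.0f} bp")
print(f"sampled reads:    {estimate['max_sampled_reads']:.2e}")
print(f"fraction sampled: {estimate['read_fraction']:.2%}")
print(f"sampled bases:    {estimate['sampled_bp']:.2e} bp")
```

Under these assumptions the sample comes out at a little under 1% of the reads and a few times 10¹² bp, which is the same order of magnitude as the figures quoted above.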

It does mean there are some false negatives: there are samples that we omit from your results that would have had significant similarity to your query. However, in our experience, these samples have very few hits (less than 1% of the sequence data), so there is limited utility in providing those results. In contrast, the datasets that we do return are guaranteed to have robust similarities to your sequences.
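To get a feel for when sampling misses a metagenome altogether, here is a small sketch of the underlying probability. It assumes the sampled reads are drawn independently and uniformly at random, which is a simplification and not a statement about how the service picks its reads. If a fraction f of a metagenome's reads match your query and n reads are sampled, the chance that none of the sampled reads match is (1 - f)ⁿ.

```python
import math


def miss_probability(matching_fraction, n_sampled=100_000):
    """Probability that a sample of n reads contains no matching read.

    Assumes independent, uniform sampling (a simplifying assumption).
    """
    if not 0.0 <= matching_fraction <= 1.0:
        raise ValueError("matching_fraction must be between 0 and 1")
    if matching_fraction == 0.0:
        return 1.0  # nothing to find, so the sample always misses
    if matching_fraction == 1.0:
        return 0.0  # every read matches, so the sample never misses
    # exp(n * log(1 - f)) is numerically safer than (1 - f) ** n for tiny f
    return math.exp(n_sampled * math.log1p(-matching_fraction))


for fraction in (1e-6, 1e-5, 1e-4, 1e-3):
    print(f"f = {fraction:.0e}: P(miss) = {miss_probability(fraction):.3g}")
```

Under this model, a metagenome where one read in 100,000 matches is missed a bit more than a third of the time, while one where one read in 1,000 matches is essentially never missed.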

Once you have your results in hand, you can always download the entirety of those few metagenomes that match your sequence and run more extensive searches.

How we handle the computes

The computation is the biggest challenge in searching the SRA. There are several components that we had to bring together to make this work, all of which rely on the amazing compute resources of XSEDE.

Huge storage: We have access to the massive Wrangler storage system to stage the data for you.
Huge computes: We have an allocation on Jetstream, the NSF-funded cloud computing resource. This allows us to complete your computes in the cloud!
Data sampling: We take advantage of sampling the data to perform the compute more quickly.

It’s still experimental!

We’re actively working on this project and it is still experimental, so be prepared to share some feedback about it with us!

Thanks

The search SRA website would absolutely not have been possible without the help and support of XSEDE, the ECCS team, and in particular Mats Rynge and Eroma Abeysinghe. I cannot say enough good things about how amazing the whole development experience has been with them.

EdwardsLab