Category Archives: SRA

Identifying Metagenomes from the SRA in the Cloud

PARTIE

We have several projects that look for all the metagenomes in the cloud, and we have several ways of searching the SRA. Here, we’ll search for all the WGS metagenomes in the SRA using a Google Big Table query.

Log into Google Console

You’ll need to log into Google console and access a project or create a new one.

Use SQL to find the metagenome/microbiome/metatranscriptome results

We use temporary tables to store the two main searches: what are amplicon projects and what are metagenome/microbiome/metatranscriptome projects, and then we find the projects that are metagenomes:

create temp table AMPLICON(acc STRING) as select acc as amplicon from `nih-sra-datastore.sra.metadata` where assay_type = 'AMPLICON' or libraryselection = 'PCR';
create temp table METAGENOMES(acc STRING) as select acc from `nih-sra-datastore.sra.metadata` where librarysource = "METAGENOMIC" or librarysource = 'METATRANSCRIPTOMIC' or organism like "%microbiom%" OR organism like "%metagenom%"  or organism like '%metatran%';
select acc from METAGENOMES where acc not in (select acc from AMPLICON);

Then save that as a JSON file to Google Drive.

Use `jq` to parse the JSON file

This is probably overkill because we only have one attribute in our data.

jq -r '.acc' bq-results-20221006-054328-1665035790273.json > SRA-metagenomes.txt

Find all the information about all the sequences

We can edit the above SQL to get all the information about all the metagenomes. Basically, we just change the second select statement.

create temp table AMPLICON(acc STRING) as select acc as amplicon from `nih-sra-datastore.sra.metadata` where assay_type = 'AMPLICON' or libraryselection = 'PCR';
select * from `nih-sra-datastore.sra.metadata` where acc not in (select acc from AMPLICON) and (librarysource = "METAGENOMIC" or librarysource = 'METATRANSCRIPTOMIC' or organism like "%microbiom%" OR organism like "%metagenom%");

Note: In this query, the parenthesis are important to make sure we do the and and or in the right place.

Then you can export the data as a JSON Newline file to Google Drive.

Current results

At the moment, this returns 642,842 runs from the SRA

Some things we can’t find

The old study_type field that we searched (using study_type = "Metagenomics") does not appear to have mapped to bigtable.
THe old scientific name that we searched (using sample.scientific_name like "%microbiom%" OR sample.scientific_name like "%metagenom%") does not appear to have mapped to bigtable.

Search SRA Metadata in the Cloud

This is another in our series of posts on searching things in the SRA. As we noted previously, NCBI has moved most of the SRA into the clouds, which makes searching more convenient.

Let’s search through the SRA Metadata attributes (fields)

We can log into the Google Cloud (but you can also use AWS/Azure as you wish), and run a search for a big query.

Note that the Big Query searches are all using standard SQL and here are some videos explaining SQL

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc = "DRR071086";

Suppose we have a whole list of IDs to search?

We can extend this approach using WHERE IN like this:

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc in ("ERR10082948", "ERR10082949", "ERR10082950", "ERR10082951", "ERR10082952", "ERR10082953", "ERR10082954", "ERR10082955", "ERR10082956", "ERR10082957", "ERR10082959", "ERR10082960", "ERR10082961", "ERR10082963", "ERR10082964", "ERR10082965", "ERR10082966", "ERR10082967", "ERR10082968", "ERR10082970", "ERR10082971", "ERR10082972", "ERR10082973", "ERR10082974", "ERR10082975", "ERR10082976", "ERR10082977", "ERR10082978", "ERR10082979", "ERR10082980", "ERR10082981", "ERR10082982", "ERR10082983", "ERR10082984", "ERR10082985", "ERR10082986", "ERR10082987", "ERR10082989", "ERR10082990", "ERR10082991", "ERR10082992", "ERR10082993", "ERR10082994", "ERR10082995", "ERR10082996", "ERR10082997", "ERR10083000", "ERR10083002", "ERR10083003", "ERR10083004", "ERR10083005", "ERR10083006", "ERR10083008", "ERR10083009", "ERR10083010", "ERR10083011", "ERR10083013", "ERR10083015", "ERR10083016", "ERR10083017", "ERR10083018", "ERR10083020", "ERR10083021", "ERR10083022", "ERR10083023", "ERR10083024", "ERR10083025", "ERR10083026", "ERR10083027", "ERR10083028", "ERR10083029", "ERR10083030", "ERR10083031", "ERR10083033", "ERR10083034", "ERR10083035", "ERR10083036", "ERR10083037", "ERR10083038", "ERR10083039", "ERR10083043", "ERR10083044", "ERR10083046", "ERR10083047", "ERR10083048", "ERR10083049", "ERR10083050", "ERR10083051", "ERR10083054", "ERR10083055", "ERR10083056", "ERR10083057", "ERR10083058", "ERR10083059", "ERR10083060", "ERR10083061", "ERR10083062", "ERR10083063", "ERR10083064", "ERR10083065", "SRR21081047", "SRR21081048", "SRR21081049", "SRR21081050", "SRR21081051", "SRR21081052", "SRR21081053", "SRR21081054", "SRR21081055", "SRR21081056", "SRR21081057", "SRR21081058", "SRR21081059");

We can search the NCBI k-mer based taxonomy profiles too:

SELECT * FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` WHERE acc = 'SRR21081434' order by ileft, ilevel

And then finally search for a record by the presence of crAssphage (taxonomy ID: 1211417)

SELECT m.acc, m.sample_acc, m.biosample, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc=tax.acc and tax_id=1211417
ORDER BY m.bioproject, m.sra_study, m.biosample, m.sample_acc
LIMIT 100

(Note without the limit, this will return 140,864 records!)

But I have thousands of Accessions! What do I do?

OK, let’s take it to the next level!

We are going to create a new table in our Big Query database.

Click on your project name and choose CREATE DATA SET:

This opens a side bar where you can give your data set a name! I called my data sra_searches and because I am using the NCBI SRA data, I want to search in us (multiple regions) (although the SRA data is probably all in Iowa).

Now, expand on your data set and click on the three dots and choose Create Table

Now we need to fill in four things:

Choose Upload in the Create Table From menu
Choose your file. Use csv even if you have a single list of accessions!
Give your table a name. e.g. accessions_20220903 (because later I will have other accessions to search!)
Check the Auto-detect table format (it works well, I don’t know why it is not the default!)

This will create your table, and open it in the browser!

Now click on the data field that has the accessions and the query box will open to allow you to query them:

Run that search and make sure it works!

Subselects to the rescue

Now we can just use a sub-select to get all the data.

I recommend

I recommend doing this on a part of your data first to make sure that it works, rather than doing it on all the data!!

Finally, I recommend that you export this to Google Drive using JSON:

The perils of fast(er)q-dump

The tools fastq-dump and fasterq-dump are used to extract reads from the Sequence Read Archive and export them to (for example) fastq format. There is a hidden gotcha that you should be aware of using fastq-dump to extract data.

Continue reading →

Getting WGS data from GenBank via the SRA

Sometimes when you look at a record in RefSeq/GenBank it is a virtual record that is really a pointer to a set of records. For example, the entry for Callorhinchus milii isolate IMCB2004 points you to the WGS records AAVX02000001-AAVX02067420. Here we show how to get these records.

Continue reading →

Accessing SRA in the Cloud

As part of the STRIDES initiative, the NIH has moved the SRA to the cloud. This includes the metadata, and the whole SRA archive. Here, I show how to set up a new instance to access the sequence read archive in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.

Continue reading →

SRA Metadata with JSON

We have been working on updating access to the SRA metadata. In previous posts, we used the SQLite database provided by the Meltzer lab, but for a variety of reasons, we are now using the XML provided by the NCBI.

Continue reading →

Search SRA Data

Our new service, searchSRA provides you with hundreds of bam-format files. However, running the search is only part of the battle, and the next daunting step is the downstream processing. Here are some hints about processing the data.

Continue reading →

Similarity Searching the SRA

The NCBI SRA contains a lot of data – about 10¹⁶ bp at the moment! However, searching that has always been problematic. We’re happy to unveil a new search SRA service that allows you to search the random metagenomes in the SRA using either DNA or protein queries.

Continue reading →

Metagenomes with and without crAssphage

There is lots of crAssphage in the world, and there are lots of metagenomes in the sequence read archive. Can we find those metagenomes that do, or do not, have crAssphage in them in the SRA? Lets try…

Continue reading →

How many bp of metagenomes are there in the SRA?

We were curious about how many bp of metagenomes in the SRA. This was partly inspired by our grant writing, and partly by this question on twitter from Tom Delmont:

great Rob! Do you have the ratio in term of ‘file volumes’ between WGS and 16S amplicons? Just curious to know if WGS wins on this front 🙂

— tom delmont (@tomodelmont) March 30, 2017

This is how to answer the question!

Continue reading →

EdwardsLab

Delivering the best in bioinformatics…

Category Archives: SRA

Identifying Metagenomes from the SRA in the Cloud

PARTIE

Log into Google Console

Use SQL to find the metagenome/microbiome/metatranscriptome results

Use `jq` to parse the JSON file

Find all the information about all the sequences

Current results

Some things we can’t find

Search SRA Metadata in the Cloud

But I have thousands of Accessions! What do I do?

Subselects to the rescue

I recommend

The perils of fast(er)q-dump

Getting WGS data from GenBank via the SRA

Accessing SRA in the Cloud

SRA Metadata with JSON

Search SRA Data

Similarity Searching the SRA

Metagenomes with and without crAssphage

How many bp of metagenomes are there in the SRA?

PARTIE

Log into Google Console

Use SQL to find the metagenome/microbiome/metatranscriptome results

Use jq to parse the JSON file

Find all the information about all the sequences

Current results

Some things we can’t find

But I have thousands of Accessions! What do I do?

Subselects to the rescue

I recommend

Use `jq` to parse the JSON file