We have been working on updating access to the SRA metadata. In previous posts, we used the SQLite database provided by the Meltzer lab, but for a variety of reasons, we are now using the XML provided by the NCBI.
Please note that we are currently updating this with new code and versions, and should have an update on the next SRA Metadata Release soon!
The code to convert the XML and more tips on getting the data are all available on our SRA Metadata GitHub repo.
We download the XML from the NCBI FTP site. There are two files that you should get. The first is SRA_Accessions.tab, which has all the accession information. The second is the actual metadata; its name changes each time it is updated, as NCBI adds a timestamp. The current version of the file is NCBI_SRA_Metadata_Full_20190810.tar.gz. This file (which is approximately 1GB) contains about a million XML files (currently 1,354,364 files), one for each of the submissions to the SRA. (As a side note, a million submissions over 12 years is about 200 submissions per day. However, remember that the frequency of data submissions to the SRA has not been linear over the 12-year life of the archive!)
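At the time of writing, both files live in the Metadata reports directory on the NCBI FTP site, so the downloads look something like this (the timestamped filename will change with each release):
wget https://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/SRA_Accessions.tab
wget https://ftp.ncbi.nlm.nih.gov/sra/reports/Metadata/NCBI_SRA_Metadata_Full_20190810.tar.gz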
We convert those to JSON format so we can parse them easily using grep and jq. The script to convert them and the command to do so are provided at our SRA Metadata GitHub repo. You may not need to convert them to JSON if you are familiar with XML, but we find it helps. It also allows us to use the terrific jq parser as shown below. You can download the NCBI metadata in JSON format from our website.
Directory Organization
There are about 1.5 million files in the SRA JSON archive, so rather than keeping them all in one colossal directory, we have broken them into subdirectories. The basic directory structure is:
json/SRA/SRA000/SRA000000.json
Where SRA might be one of SRA, DRA, or ERA depending on where the data originated, then the prefix plus the first three digits of the accession (e.g. SRA000), and then the full accession with the .json extension.
If you have an SRA ID, it is quite easy to get to the JSON file using bash substring expansion:
SRA=SRA551586
ls json/${SRA:0:3}/${SRA:0:6}/$SRA.json
There is one more caveat! The files are named by SRA submission accession, but most often you want to access them via an SRR run ID. We therefore also create a file called srr_sra_ids.tsv that has a mapping of SRA ID -> SRR ID. You can quickly grep for the submission accession for any given run using:
grep -m 1 SRR5421081 srr_sra_ids.tsv
The -m 1 flag reports only a single match (there should only be one), so grep quits as soon as it finds the match rather than scanning the rest of the file.
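If you just want the submission accession, which is the first column of the file, so that you can plug it straight into the directory path shown above, add a cut:
grep -m 1 SRR5421081 srr_sra_ids.tsv | cut -f 1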
Using JQ
So how do we use grep and jq to query the SRA metadata? Let's start with some simple queries and build up.
Finding all metagenomes and microbiomes in the SRA
First, we use egrep (extended grep) to find all the submissions that mention metagenome or microbiome. We remove the path and extension to end up with just the SRA accession numbers.
egrep -rli 'metagenome|microbiome' json | perl -pe 's#json/##; s#.json##' > metagenomes.txt
This resulted in 78,437 submissions that are metagenomes or microbiomes. Note that this file has the submission accessions, not the run IDs. To get the actual runs, we need to parse the JSON data like so:
cat metagenomes.txt | xargs -i jq -r "try .RUN[].IDENTIFIERS.PRIMARY_ID" json/{}.json > metagenome_runs.txt
The two key pieces here are jq's -r option, which gives raw output (rather than JSON-formatted output), and the try filter, which skips a file without warning if there is an error in the JSON. You might want to add a catch to the try, but I don't!
In jq, we are extracting the PRIMARY_ID field from the IDENTIFIERS field, and doing that for every RUN in the file.
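For reference, a catch looks something like this; instead of skipping silently, it prints a placeholder string when the filter fails (a sketch using the same query as above; the PARSE_ERROR label is just an arbitrary marker):
jq -r 'try .RUN[].IDENTIFIERS.PRIMARY_ID catch "PARSE_ERROR"' json/SRA/SRA551/SRA551586.json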
This results in a list of SRR run IDs (really [EDS]RR run IDs) that you can continue processing. For example, we have passed all these through partie.
Extract Submissions, Titles, and Abstracts for a Set of Runs
Given a file called srr.ids, we want to group all the runs by their submission accession, and then extract the titles and abstracts for those accessions. We use a combination of bash, perl, grep, and jq to accomplish this.
grep -f srr.ids NCBI_SRA_Metadata_20190914_JSON/srr_sra_ids.tsv | perl -F"\t" -lane 'push(@{$d->{$F[0]}}, $F[1]); END {map {print "$_\t", join(",", @{$d->{$_}})} keys %$d}' | while read SRA VALS; do TITLE=""; ABST=""; TITLE=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_TITLE' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g'); ABST=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_ABSTRACT' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g'); echo -e "$SRA\t$TITLE\t$ABST\t$VALS"; done > abstracts.tsv
Here is the same command broken down line by line to make it clearer. Note that we actually parse the JSON file twice: once for the title and once for the abstract. Querying them separately ensures that we still get one if the other (usually the abstract) is missing.
grep -f srr.ids NCBI_SRA_Metadata_20190914_JSON/srr_sra_ids.tsv |\
perl -F"\t" -lane 'push(@{$d->{$F[0]}}, $F[1]); END {map {print "$_\t", join(",", @{$d->{$_}})} keys %$d}' |\
while read SRA VALS; do
    TITLE=""; ABST="";
    TITLE=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_TITLE' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g');
    ABST=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_ABSTRACT' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g');
    echo -e "$SRA\t$TITLE\t$ABST\t$VALS";
done > abstracts.tsv
Extract Titles And Abstracts for Submissions
We can extract the title and abstract from a single submission using code like this. Here we extract each STUDY entry, from that only the DESCRIPTOR field, and from those we get the TITLE and ABSTRACT.
jq '.STUDY[].DESCRIPTOR | {"title": .STUDY_TITLE, "abstract": .STUDY_ABSTRACT}' json/SRA/SRA216/SRA216625.json
Which will give us nicely formatted JSON:
{
"title": "16S rRNA Environmental Study of Soil Samples From Various Locations in New Zealand",
"abstract": "Raw sequence reads taken from soil samples, taken from a number of different locations throughout New Zealand, with the intention of exploring microbiomes"
}
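If you would rather have tab-separated text than JSON (for example, to paste into a spreadsheet), combining jq's -r option with its @tsv filter works; a sketch using the same file:
jq -r '.STUDY[].DESCRIPTOR | [.STUDY_TITLE, .STUDY_ABSTRACT] | @tsv' json/SRA/SRA216/SRA216625.json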
We can combine these approaches, using grep and jq to find the titles for some runs.
Identify all strategies, sources, and selections
In this example, we iterate through each experiment. We only need the information from the LIBRARY_DESCRIPTOR field, so we get each of those and then extract the strategy, source, and selection fields:
jq 'try .EXPERIMENT[].DESIGN.LIBRARY_DESCRIPTOR | {"strategy" : .LIBRARY_STRATEGY, "source" : .LIBRARY_SOURCE, "selection": .LIBRARY_SELECTION }' json/SRA/SRA551/SRA551586.json
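To tally how common each combination is across a whole set of submissions (for example, the metagenomes we found earlier), you can run the same filter over every file and count the unique lines. A sketch, assuming the three library fields are plain strings as they are above:
cat metagenomes.txt | xargs -i jq -r 'try (.EXPERIMENT[].DESIGN.LIBRARY_DESCRIPTOR | [.LIBRARY_STRATEGY, .LIBRARY_SOURCE, .LIBRARY_SELECTION] | @tsv)' json/{}.json | sort | uniq -c | sort -rn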