SRA Metadata with JSON

We have been working on updating access to the SRA metadata. In previous posts, we used the SQLite database provided by the Meltzer lab, but for a variety of reasons, we are now using the XML provided by the NCBI.

Please note that we are currently updating this with new code and versions, and should have an update on the next SRA Metadata Release soon!

The code to convert the XML and more tips on getting the data are all available on our SRA Metadata GitHub repo.

We download the XML from the NCBI FTP site. There are two files that you should get. The first is SRA_Accessions.tab, which holds all the accession information. The second file is the actual metadata; its name changes each time it is updated, because NCBI adds a timestamp. The current version of the file is NCBI_SRA_Metadata_Full_20190810.tar.gz. This file (which is approximately 1 GB) contains about a million XML files (currently 1,354,364 files), one for each of the submissions to the SRA. (As a side note, a million submissions over 12 years is a little over 200 submissions per day. However, remember that the frequency of data submissions to the SRA has not been linear over the 12-year life of the archive!)

We convert those to JSON format so we can parse them easily using grep and jq. The script to convert them and the command to run it are provided at our SRA Metadata GitHub repo. You may not need to convert them to JSON if you are familiar with XML, but we find it helps. It also allows us to use the terrific jq parser as shown below. You can download the NCBI Metadata in JSON format from our website.

Directory Organization

There are about 1.5 million files in the SRA JSON archive, and so rather than having each of them as separate files in one colossal directory, we have broken them into subdirectories. The basic directory structure is:

json/SRA/SRA000/SRA000000.json

Here the first SRA might be one of SRA, DRA or ERA depending on where the data originated. Next comes a directory named after the first six characters of the accession, that is the prefix plus the first three digits (e.g. SRA000), and finally the full accession with the .json extension.

If you have an SRA ID, it is quite easy to get to the json file:

SRA=SRA551586
ls json/${SRA:0:3}/${SRA:0:6}/$SRA.json
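The same substring logic can be wrapped in a small helper. This is only a sketch: the function name sra_json_path and the optional root-directory argument are our own illustrative choices and are not part of the repo. It assumes the directory layout described above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the JSON path for a submission accession.
# Assumes the json/<prefix>/<first six characters>/<accession>.json layout.
sra_json_path() {
    local acc=$1
    local root=${2:-json}   # top-level directory; "json" unless told otherwise
    # Submission accessions start with SRA, ERA or DRA followed by digits.
    if [[ ! $acc =~ ^[SED]RA[0-9]+$ ]]; then
        echo "not a submission accession: $acc" >&2
        return 1
    fi
    printf '%s/%s/%s/%s.json\n' "$root" "${acc:0:3}" "${acc:0:6}" "$acc"
}

sra_json_path SRA551586
```

The check on the accession is there because passing a run ID (SRR...) by mistake would otherwise produce a plausible-looking path that does not exist.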

There is one more caveat! The files are listed by SRA submission accession, and most often you want to access them via SRR Run ID. We also create a file called srr_sra_ids.tsv that has a mapping of SRA ID -> SRR ID. You can quickly grep for the submission accession for any given run using

grep -m 1 SRR5421081 srr_sra_ids.tsv

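The -m 1 makes grep stop at the first match and not keep looking, which is fine because each run should appear only once. Note that a plain grep also matches longer IDs that contain the one you asked for; adding -w restricts it to whole-word matches.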
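If you want only the submission accession back, rather than the whole line, a small function along these lines may help. It is a sketch under one assumption that you should check against your copy of the file: that srr_sra_ids.tsv has two tab-separated columns, with the submission accession first and the run accession second. The function name submission_for_run is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the submission accession for one run ID.
# Assumes a two-column, tab-separated mapping file with the submission
# accession in column 1 and the run accession in column 2.
submission_for_run() {
    local run=$1
    local map=${2:-srr_sra_ids.tsv}
    # Compare the whole field so a shorter ID cannot match a longer one,
    # and stop at the first hit, as grep -m 1 does.
    awk -F'\t' -v run="$run" '
        $2 == run { print $1; found = 1; exit }
        END       { exit !found }
    ' "$map"
}

# Usage (needs the real mapping file):
#   SRA=$(submission_for_run SRR5421081) && ls json/${SRA:0:3}/${SRA:0:6}/$SRA.json
```

The function returns a non-zero exit status when the run is not in the mapping, so it can be chained with && as in the usage comment.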

Using JQ

So how do we use grep and jq to query the SRA metadata? Let’s start with some simple queries and build up.

Finding all metagenomes and microbiomes in the SRA

First, we use egrep (extended grep) to find all the submissions that mention metagenome or microbiome. We strip the leading json/ directory and the .json extension, which leaves one relative path per submission ending in its accession number (for example SRA/SRA000/SRA000000).

egrep -rli 'metagenome|microbiome' json | perl -pe 's#^json/##; s#\.json$##' > metagenomes.txt

This resulted in 78,437 submissions that are metagenomes or microbiomes. Note that this file has the submission accession, not the run ids. To get the actual runs, we need to parse the JSON data like so

cat metagenomes.txt | xargs -I {} jq -r "try .RUN[].IDENTIFIERS.PRIMARY_ID" json/{}.json > metagenome_runs.txt

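There are two things to note in the jq call. The -r option gives raw output (rather than JSON formatted output). The try keyword is part of the jq filter itself, not an option, and it skips without warning if the expression raises an error, for example when a file has no RUN entries. You might want to add a catch to the try but I don’t!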
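For readers who do want a catch, here is a minimal sketch. In jq the catch body receives the error message as its input, so it can be printed instead of being dropped. The function name runs_or_marker and the NO_RUNS label are arbitrary choices of ours.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list run IDs for one JSON file, printing a marker
# line instead of silently skipping when the RUN section cannot be read.
runs_or_marker() {
    jq -r 'try .RUN[].IDENTIFIERS.PRIMARY_ID catch "NO_RUNS: \(.)"' "$1"
}

# Usage (needs a converted JSON file):
#   runs_or_marker json/SRA/SRA551/SRA551586.json
```

Because the marker goes to standard output along with the run IDs, you would filter the NO_RUNS lines out again (for example with grep -v) before using the list.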

In jq, we are extracting the PRIMARY_ID field from the IDENTIFIERS field, and doing that for every RUN in the file.

This results in a list of SRR run IDs (really [EDS]RR run IDs) that you can continue processing. For example, we have passed all these through partie.

Extract Submissions, Titles, and Abstracts for a Set of Runs

Given a file called srr.ids, we want to combine all runs based on their submission accession, and then extract the titles and abstracts for those accessions. We use a combination of bash, perl, grep, and jq to accomplish this.

grep -f srr.ids NCBI_SRA_Metadata_20190914_JSON/srr_sra_ids.tsv | perl -F"\t" -lane 'push(@{$d->{$F[0]}}, $F[1]); END {map {print "$_\t", join(",", @{$d->{$_}})} keys %$d}' | while read SRA VALS; do TITLE=""; ABST=""; TITLE=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_TITLE' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g'); ABST=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_ABSTRACT' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g'); echo -e "$SRA\t$TITLE\t$ABST\t$VALS"; done > abstracts.tsv

Here is the same command broken down line by line to make it clearer. Note that we actually parse the JSON file twice, the first time for the title and the second time for the abstract. Parsing them separately means that if one of them (usually the abstract) is missing, the other is still reported.

grep -f srr.ids NCBI_SRA_Metadata_20190914_JSON/srr_sra_ids.tsv |\
perl -F"\t" -lane 'push(@{$d->{$F[0]}}, $F[1]); END {map {print "$_\t", join(",", @{$d->{$_}})} keys %$d}' |\
while read SRA VALS;
    do
        TITLE=""; ABST="";
        TITLE=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_TITLE' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g');
        ABST=$(jq 'try .STUDY[].DESCRIPTOR | .STUDY_ABSTRACT' NCBI_SRA_Metadata_20190914_JSON/json/${SRA:0:3}/${SRA:0:6}/${SRA}.json | tr '\n' '\t' | sed -e 's/"//g');
        echo -e "$SRA\t$TITLE\t$ABST\t$VALS";
done > abstracts.tsv

Extract Titles And Abstracts for Submissions

We can extract the title and abstract from a single submission using code like this. Here we take each STUDY entry, keep only its DESCRIPTOR field, and from that pull out STUDY_TITLE and STUDY_ABSTRACT.

jq '.STUDY[].DESCRIPTOR | {"title": .STUDY_TITLE, "abstract": .STUDY_ABSTRACT}' json/SRA/SRA216/SRA216625.json

Which will give us nicely formatted JSON:

{
  "title": "16S rRNA Environmental Study of Soil Samples From Various Locations in New Zealand",
  "abstract": "Raw sequence reads taken from soil samples, taken from a number of different locations throughout New Zealand, with the intention of exploring microbiomes"
}

We can combine these approaches, using grep and jq to find the titles for some runs.
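One way that combination might look is sketched below: look up each run in the mapping file, build the path to the submission's JSON file, and pull out the study title. The function name titles_for_runs is hypothetical, and the sketch assumes that srr_sra_ids.tsv holds the submission accession in the first column and the run accession in the second, and that the JSON files sit in the directory layout described earlier.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "run <TAB> submission <TAB> title" for each
# run ID read from standard input.
# Assumes the mapping file holds submission <TAB> run, and that the JSON
# files live under <root>/<prefix>/<first six characters>/<accession>.json.
titles_for_runs() {
    local map=${1:-srr_sra_ids.tsv}
    local root=${2:-json}
    local run sra file title
    while read -r run; do
        [ -n "$run" ] || continue
        # Exact match on the run column; stop at the first hit.
        sra=$(awk -F'\t' -v run="$run" '$2 == run { print $1; exit }' "$map")
        if [ -z "$sra" ]; then
            echo "no submission found for $run" >&2
            continue
        fi
        file="$root/${sra:0:3}/${sra:0:6}/$sra.json"
        # -r prints the bare string; join copes with several STUDY entries.
        title=$(jq -r 'try ([.STUDY[].DESCRIPTOR.STUDY_TITLE // empty] | join("; ")) catch ""' "$file")
        printf '%s\t%s\t%s\n' "$run" "$sra" "$title"
    done
}

# Usage (needs the real mapping file and JSON directory):
#   titles_for_runs srr_sra_ids.tsv json < srr.ids > titles.tsv
```

Running awk once per run is slow for long lists; for thousands of runs the grep -f approach shown earlier is the better starting point.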

Identify all strategies, sources, and selections

In this example, we iterate through each experiment. We only need the information in the LIBRARY_DESCRIPTOR field, so we take that from each experiment and extract the strategy, source, and selection fields.

jq 'try .EXPERIMENT[].DESIGN.LIBRARY_DESCRIPTOR | {"strategy" : .LIBRARY_STRATEGY, "source" : .LIBRARY_SOURCE, "selection":  .LIBRARY_SELECTION }' json/SRA/SRA551/SRA551586.json
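To see every strategy used across the archive, rather than in a single submission, the same filter can be run over all the files and the results tallied. The following is a sketch with a hypothetical function name, tally_strategies. Swapping LIBRARY_STRATEGY for LIBRARY_SOURCE or LIBRARY_SELECTION should give the other two lists.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count how often each LIBRARY_STRATEGY value occurs
# in every JSON file below a directory (default: json).
tally_strategies() {
    local root=${1:-json}
    # -print0 / -0 keep odd file names intact; xargs batches the files so
    # jq is started a handful of times rather than once per file.
    find "$root" -name '*.json' -print0 |
        xargs -0 jq -r 'try (.EXPERIMENT[].DESIGN.LIBRARY_DESCRIPTOR.LIBRARY_STRATEGY | select(. != null))' |
        sort | uniq -c | sort -rn
}

# Usage (needs the converted JSON directory):
#   tally_strategies json > strategies.txt
```

The output is one line per distinct value, most frequent first, with the count in the first column.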