Hidden SRA metadata

The metadata in the SRA is not all the data you can get about a run. Here is how to get more data about a run from the SRA without going to the SRA website.

The SRA metadata has lots of information about the sequences, but unfortunately it does not have information about the runs themselves, such as how many reads there are, what the average read length is, or the total size of the data set. This makes it tricky to plan ahead. Luckily there is an API that allows you to access this data for all the runs.

If you have the SRA Toolkit installed (and you should!), then the vdb-dump command is your friend. For example, let's look at SRR3280498.

$ vdb-dump --info SRR3280498
acc : SRR3280498
path : http://sra-download.ncbi.nlm.nih.gov/srapub/SRR3280498
size : 269,650,849
cache : /home/redwards/ncbi/public/sra/SRR3280498.sra
percent: 0.097182
bytes : 262,144
type : Database
platf : SRA_PLATFORM_PACBIO_SMRT
SEQ : 76,757
SCHEMA : NCBI:align:db:alignment_sorted#1.3
TIME : 0x0000000056f1cf52 (03/22/2016 16:03)
FMT : FASTQ
FMTVER : 2.5.7
LDR : latf-load.2.5.7
LDRVER : 2.5.7
LDRDATE: Dec 21 2015 (12/21/2015 0:0)

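This is a PacBio run that has about 270 MB of data (the size field, in bytes) in 76,757 reads (the SEQ field). Dividing the size by the number of reads, 269,650,849 / 76,757, gives an average of about 3,513 bp per read. Because size counts the bytes stored for the run rather than the bases themselves, treat this figure as a rough estimate of the read length.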
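
If you want to do this for more than one run, the same arithmetic is easy to script. The following is a minimal Python sketch, assuming the output keeps the "key : value" layout shown above. The function names (parse_vdb_info, to_int, size_per_read, fetch_info) are made up for this example and are not part of the SRA Toolkit; fetch_info simply shells out to the same vdb-dump --info command and needs vdb-dump on your PATH.

```python
import subprocess

# Output copied from the vdb-dump --info example above, so the sketch
# can be tried without the SRA Toolkit installed.
SAMPLE = """\
acc : SRR3280498
path : http://sra-download.ncbi.nlm.nih.gov/srapub/SRR3280498
size : 269,650,849
type : Database
platf : SRA_PLATFORM_PACBIO_SMRT
SEQ : 76,757
TIME : 0x0000000056f1cf52 (03/22/2016 16:03)
"""


def parse_vdb_info(text):
    """Turn 'key : value' lines into a dict of strings (hypothetical helper)."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: the path and TIME values contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def to_int(value):
    """Convert a number printed with thousands separators, e.g. '76,757'."""
    return int(value.replace(",", ""))


def size_per_read(info):
    """Size in bytes divided by the number of reads: a rough read-length estimate."""
    return to_int(info["size"]) / to_int(info["SEQ"])


def fetch_info(accession):
    """Run vdb-dump --info for one accession; assumes vdb-dump is on PATH."""
    result = subprocess.run(
        ["vdb-dump", "--info", accession],
        capture_output=True, text=True, check=True,
    )
    return parse_vdb_info(result.stdout)


if __name__ == "__main__":
    info = parse_vdb_info(SAMPLE)
    print(info["acc"], info["platf"], to_int(info["SEQ"]), round(size_per_read(info)))
```

To query a live run instead of the pasted sample, call fetch_info("SRR3280498") and pass the result to size_per_read. Fields that a particular run does not report will simply be missing from the dict, so check for them before doing the division.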