Sometimes when you look at a record in RefSeq/GenBank it is a virtual record that is really a pointer to a set of records. For example, the entry for Callorhinchus milii isolate IMCB2004 points you to the WGS records AAVX02000001-AAVX02067420. Here we show how to get these records.
We have a script (of course) that uses standard Perl libraries to fetch the records through NCBI’s E-utilities. You can run that script like this:
perl get_wgs_eutils.pl AAVX02000000 100 AAVX02000000.out
In this case, AAVX02000000 is the base name of the record, 100 is the number of records to request at once (so we don’t run afoul of NCBI’s rules), and AAVX02000000.out is the output file.
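To make the batching idea concrete, here is a minimal Python sketch of how a range of WGS accessions could be split into groups of 100. This is not the Perl script's actual code. The function name wgs_batches, the six-digit zero padding, and the way the prefix AAVX02 is taken from the base name are all assumptions based on the example accessions above.

```python
def wgs_batches(prefix, last, batch_size=100, digits=6):
    """Yield lists of accessions prefix+000001 .. prefix+last, batch_size at a time.

    prefix and digits are assumptions drawn from the AAVX02000001 example;
    other WGS projects may use a different number of digits.
    """
    for start in range(1, last + 1, batch_size):
        stop = min(start + batch_size - 1, last)
        yield [f"{prefix}{n:0{digits}d}" for n in range(start, stop + 1)]


# The example record points at AAVX02000001-AAVX02067420.
batches = list(wgs_batches("AAVX02", 67420, batch_size=100))
print(len(batches))                     # how many requests would be needed
print(batches[0][0], batches[-1][-1])   # first and last accession
```

Each batch could then be sent as a single request, which keeps the number of calls to NCBI low.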
Here is another way to do that using NCBI’s tools.
First, you need to locate the record:
curl -s "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve?acc=AAVX02000000"
The -s option puts curl in silent mode so it doesn’t print progress. Change the part after acc= to the accession number you are interested in.
This will give you a block of JSON like so:
{
  "version": "2",
  "result": [
    {
      "bundle": "AAVX02000000",
      "status": 200,
      "msg": "ok",
      "files": [
        {
          "object": "wgs|AAVX02000000",
          "name": "AAVX02.5",
          "locations": [
            {
              "link": "https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/WGS/AA/VX/AAVX02.5",
              "service": "ncbi"
            }
          ]
        }
      ]
    }
  ]
}
The key item here is the link entry:
https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/WGS/AA/VX/AAVX02.5
This tells you that the current name of the record is AAVX02.5, and now you can use vdb-dump from the SRA Toolkit to download the sequences:
vdb-dump AAVX02.5 > AAVX02.5.out
Note that the standard vdb-dump output is not that useful for bioinformatics, so you probably want to use:
vdb-dump -f fasta AAVX02.5 > AAVX02.5.fna
Note you could also output fastq or other formats. Use vdb-dump -h for more choices.
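If you would rather script the lookup step than read the JSON by eye, something like the following Python sketch could pull out the name and link. The helper names locate_url and extract_links are hypothetical, and the code assumes the response always has the result / files / locations layout shown in the example above. The sample response is embedded so the sketch runs without a network connection.

```python
import json
from urllib.parse import urlencode

LOCATE_URL = "https://locate.ncbi.nlm.nih.gov/sdl/2/retrieve"


def locate_url(accession):
    """Build the same URL that the curl command above requests."""
    return f"{LOCATE_URL}?{urlencode({'acc': accession})}"


def extract_links(response_text):
    """Return (name, link) pairs from a locate response.

    Assumes the result -> files -> locations layout of the example response.
    """
    data = json.loads(response_text)
    found = []
    for result in data.get("result", []):
        for entry in result.get("files", []):
            for location in entry.get("locations", []):
                found.append((entry["name"], location["link"]))
    return found


# The example response from above, so this runs offline.
SAMPLE = """
{ "version": "2", "result": [ { "bundle": "AAVX02000000", "status": 200,
  "msg": "ok", "files": [ { "object": "wgs|AAVX02000000", "name": "AAVX02.5",
  "locations": [ { "link":
  "https://sra-download.ncbi.nlm.nih.gov/traces/wgs03/WGS/AA/VX/AAVX02.5",
  "service": "ncbi" } ] } ] } ] }
"""

print(locate_url("AAVX02000000"))
for name, link in extract_links(SAMPLE):
    print(name, link)
```

The name that comes back (AAVX02.5 in the example) is what you would then hand to vdb-dump.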