As part of the STRIDES initiative, the NIH has moved the SRA to the cloud. This includes the metadata, and the whole SRA archive. Here, I show how to set up a new instance to access the sequence read archive in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.
NOTE: There is a gotcha with this, which I keep forgetting! For AWS you must be in the US East (N. Virginia) us-east-1 region. For GCP you must be in the US East (S. Carolina) us-east1 region.
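Since I keep forgetting the region, here is a minimal sketch for checking it from inside a running GCP instance. The helper names `zone_to_region` and `check_sra_region` are hypothetical (they are not part of the SRA Toolkit), and the commented-out call assumes the standard Compute Engine metadata endpoint is reachable from the instance.

```shell
# Hypothetical helper: turn a zone path such as
# "projects/123456/zones/us-east1-b" into its region ("us-east1").
zone_to_region() {
    zone="${1##*/}"              # drop everything up to the last slash
    printf '%s\n' "${zone%-*}"   # drop the trailing zone letter, e.g. "-b"
}

# Hypothetical helper: warn if the region is not the GCP region named in the note above.
check_sra_region() {
    if [ "$1" = "us-east1" ]; then
        echo "OK: $1"
    else
        echo "WARNING: $1 is not us-east1"
    fi
}

# On a GCP instance (needs the metadata server, so it is left commented out):
# zone=$(curl -s -H "Metadata-Flavor: Google" \
#     http://metadata.google.internal/computeMetadata/v1/instance/zone)
# check_sra_region "$(zone_to_region "$zone")"
```

On AWS the same idea would apply with us-east-1, but the way to look up the region differs and is not shown here.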
For this example, I am using Google Cloud Platform (GCP). The data is also currently available on AWS, and setting it up is essentially the same (except that you use the AWS tab in vdb-config below).
First, log into the Google Cloud Console and launch a new instance. I am using a Debian version 10 instance, but the approach is the same if you use a different variant.
Once the machine starts, access it via ssh as you normally do. I am not giving you instructions here on how to ssh to your Google Cloud instance; refer to the regular documentation for that.
Head to the SRA Toolkit GitHub download page and copy the link for the latest version of the toolkit. I am using the version called "Ubuntu Linux 64 bit architecture - non-sudo tar archive".
At the time of writing it is 2.10.2, but change the version below because I am sure there will be a new version by the time you are reading this! Use curl to download the tarball:
curl -LO http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.2/sratoolkit.2.10.2-ubuntu64.tar.gz
(update this as appropriate!)
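To make the version easier to update, the URL can be built from a single variable. This is only a sketch: `sratoolkit_url` and `SRA_VERSION` are hypothetical names, and it assumes newer releases follow the same URL layout as the 2.10.2 link above, which you should confirm against the download page.

```shell
# Hypothetical helper: build the download URL for a given toolkit version.
# The pattern is copied from the 2.10.2 link; other releases may differ.
sratoolkit_url() {
    version="$1"
    printf 'http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/%s/sratoolkit.%s-ubuntu64.tar.gz\n' \
        "$version" "$version"
}

SRA_VERSION=2.10.2   # change this to the release listed on the download page
echo "Would download: $(sratoolkit_url "$SRA_VERSION")"
# curl -LO "$(sratoolkit_url "$SRA_VERSION")"    # uncomment to actually download
```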
Extract the archive:
tar xf sratoolkit.2.10.2-ubuntu64.tar.gz
and add the full path to the executables to your PATH (a relative path would only work while you stay in this directory):
export PATH=$PATH:$PWD/sratoolkit.2.10.2-ubuntu64/bin/
Note that you probably want to make this permanent so that the next time you log in to this instance, you are ready to go. The simplest way to do this is to add that line, with the absolute path to the bin directory, to your .bashrc file, but the instructions for that are outside the scope of this blog.
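For the impatient, here is one possible way to do it, as a sketch rather than a recommendation. `append_once` is a hypothetical helper that adds a line to a file only if it is not already present, so running it twice does not duplicate the entry. The example call assumes you extracted the toolkit in your home directory.

```shell
# Hypothetical helper: append a line to a file only if it is not already there.
append_once() {
    line="$1"
    file="$2"
    touch "$file"    # make sure the file exists before searching it
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example (assumes the toolkit was extracted in your home directory):
# append_once 'export PATH=$PATH:$HOME/sratoolkit.2.10.2-ubuntu64/bin' "$HOME/.bashrc"
```

The single quotes matter: they keep `$PATH` and `$HOME` unexpanded so that they are evaluated each time you log in.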
If we try to use the SRA Toolkit now, we get an error:
$ srapath SRR000001
This sra toolkit installation has not been configured.
Before continuing, please run: vdb-config --interactive
For more information, see https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/
So we need to run vdb-config in interactive mode:
vdb-config -i
There are a couple of settings that you should change:
- Disable the cache:
  - Press C to choose the Cache tab
  - Press i to uncheck the box "enable local file-caching"
- Report cloud instance:
  - Press G to choose the GCP tab
  - Press r to enable "report cloud instance identity"
- Press s to save the changes
- Press x to exit vdb-config
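If you set up instances often, the same two settings can probably be applied without the interactive screen. The sketch below rests on assumptions: I believe `vdb-config --set /repository/user/cache-disabled=true` and `vdb-config --report-cloud-identity yes` correspond to the two checkboxes above, but please confirm against `vdb-config --help` for your toolkit version. `run` and `configure_sra_cloud` are hypothetical helper names.

```shell
# run: execute a command, or just print it when DRY_RUN=1 (handy for checking first).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

configure_sra_cloud() {
    # Assumed equivalent of unchecking "enable local file-caching"
    run vdb-config --set /repository/user/cache-disabled=true
    # Assumed equivalent of enabling "report cloud instance identity"
    run vdb-config --report-cloud-identity yes
}

# Print the commands without running them:
# (DRY_RUN=1; configure_sra_cloud)
```

Running it in dry-run mode first lets you compare the commands with the help output before anything is changed.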
Now we are going to check and see where the SRA Toolkit is pulling the data from:
$ srapath SRR000001
https://locate.ncbi.nlm.nih.gov/sdlr/sdlr.fcgi?jwt=eyJhbGciOiJSUzI1NiIsImtpZCI6InNkbGtpZDEiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE1ODE4ODA0NzEsImlhdCI6MTU3OTI4ODQ3MSwibGluayI6Imh0dHBzOi8vc3RvcmFnZS5nb29nbGXXaGlzLmNvbS9zcmEtcHViLXJ1bi03L1NSUjAwMDAwMS9TUlIwMDAwMDEuNCIsInJlZ2lvbiI6InVzIiwic2Vy2dmljZSI6ImdzIiwic2lnbmluZ0FjY291bnQiOiJzcmFfZ3MiLCJ0aW1lb3V0Ijo2MDAwfQ.jC1xVd60uevm_g-TjynZt_66X2-JpnBorRTGlRInlPrNFk7Zw27H5lpAjtBwOhvRaqC4payupnz6ymFw6TS5H1TJ8LAHAZbNg-qoSDqnPiict1qDswlr2tTwT3xctoUn2y2SjVbAlChJTprXVdXE17Fnptwy-OlT0I9sPXByvA_4OWggpD3EcrQSwuNwAOBSuyYX35n-Xnthl_Y-DdhFIu3Zmw8bMSHBfkCpR5QVU0_TazIvfWFaVorxq--E0Rvi9kCx7URTOS85DVHle2oYoi_pCONJT2DRmeL5nSTiQwLZvOfoK2tieoihYOpi_1TwEjI5bKzqL5lW9r2qA&ncbi_phid=939B877E5B0A11B500005DB9CB44A770.1.1
If you see a long and complex URL like this one then you are accessing SRA in the Cloud. Congratulations!
If, however, you see a short URL like this one:
$ srapath SRR000001
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR000001/SRR000001.4
then, unfortunately, you are accessing the SRA directly from NCBI rather than from the cloud, and you perhaps missed one of the steps in vdb-config.
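If you want to script this check, a rough heuristic is to look at the host name in the URL. The sketch below is based only on the two example URLs in this post, so treat the patterns as assumptions that may not cover every case. `classify_srapath` is a hypothetical helper name.

```shell
# Hypothetical helper: guess where srapath is pointing, based on the URL shape.
classify_srapath() {
    case "$1" in
        *locate.ncbi.nlm.nih.gov/sdlr/*)   echo "cloud" ;;    # long signed URL
        *sra-download*.ncbi.nlm.nih.gov/*) echo "ncbi" ;;     # short direct URL
        *)                                 echo "unknown" ;;  # anything else
    esac
}

# Example (requires a configured SRA Toolkit on your PATH):
# classify_srapath "$(srapath SRR000001)"
```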
Note: Unselecting the local file caching is my personal preference and is not required for accessing the SRA in the cloud. However, the local file cache will very quickly fill up your local hard drive. I recommend not keeping a local file cache!