As part of the STRIDES initiative, the NIH has moved the SRA to the cloud. This includes both the metadata and the whole archive. Here, I show how to set up a new instance to access the Sequence Read Archive in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.
NOTE: There is a gotcha with this, which I keep forgetting! For AWS you must be in the US East (N. Virginia) us-east-1 region. For GCP you can be in any US region and any zone within the US region.
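Since this is so easy to forget, a quick sanity check from inside a running instance may help. The sketch below assumes a GCP instance, where the metadata server reports the zone as a path such as projects/<number>/zones/us-central1-a. The function name is_us_zone is made up for illustration and is not part of any toolkit.

```shell
# Return success if a GCP zone string (for example "us-central1-a", or the
# full "projects/<number>/zones/us-central1-a" form) is in a US region.
# is_us_zone is a hypothetical helper name.
is_us_zone() {
    zone="${1##*/}"          # strip any leading "projects/.../zones/" path
    case "$zone" in
        us-*) return 0 ;;    # US regions are named us-<area><n>
        *)    return 1 ;;
    esac
}

# Usage on a GCP instance (queries the instance metadata server):
#   zone=$(curl -s -H "Metadata-Flavor: Google" \
#       http://metadata.google.internal/computeMetadata/v1/instance/zone)
#   is_us_zone "$zone" && echo "OK: $zone" || echo "Not a US zone: $zone"
```

The check is a plain prefix match, so it only covers the GCP case described in the note; on AWS you would compare the region against us-east-1 instead.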
For this example, I am using Google Cloud Platform (GCP). The data is also currently available on AWS, and setting it up is essentially the same (except using the AWS tab below).
First, log into the Google Cloud Console and launch a new instance. I am using a Debian version 10 instance, but the approach is the same if you use a different variant.
Once the machine starts, you should access it via ssh as you normally do. Note that I am not giving you instructions here on how to ssh to your Google Cloud instance; refer to the regular documentation for that.
Head to the SRA Toolkit GitHub download page and copy the link for the latest version of the toolkit. I am using the version called "Ubuntu Linux 64 bit architecture - non-sudo tar archive". At the time of writing it is 2.10.2, but change the version below because I am sure there will be a new version by the time you are reading this! Use curl to download the tarball (update this as appropriate!):
curl -LO http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.2/sratoolkit.2.10.2-ubuntu64.tar.gz
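Because the version number appears in both the directory and the file name, it can help to keep it in one variable. This is only a sketch: it assumes newer releases follow the same URL pattern as the 2.10.2 link (check the download page to confirm), and sra_toolkit_url is a made-up helper name.

```shell
# Build the download URL for a given toolkit version, following the pattern
# of the 2.10.2 link. Assumption: other releases use the same layout.
# sra_toolkit_url is a hypothetical helper name.
sra_toolkit_url() {
    version="$1"
    echo "http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${version}/sratoolkit.${version}-ubuntu64.tar.gz"
}

# Usage: set the version once and reuse it for the download and the archive.
#   VERSION=2.10.2
#   curl -LO "$(sra_toolkit_url "$VERSION")"
#   tar xf "sratoolkit.${VERSION}-ubuntu64.tar.gz"
```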
Extract the archive:
tar xf sratoolkit.2.10.2-ubuntu64.tar.gz
Then add the directory containing the executables to your PATH:
export PATH=$PATH:$PWD/sratoolkit.2.10.2-ubuntu64/bin
Note that you probably want to make this permanent so that the next time you log in to this instance, you are ready to go. The trivial way to do this is to edit your .bashrc file and add that line, but the instructions for that are outside the scope of this blog.
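For completeness, here is one way to make the change permanent. It is a minimal sketch that assumes the archive was unpacked in your home directory and that your login shell reads ~/.bashrc. The helper add_to_bashrc is made up for illustration, and it only appends the line if an identical line is not already there.

```shell
# Append a line to a startup file unless an identical line already exists.
# add_to_bashrc is a hypothetical helper; the second argument defaults to
# ~/.bashrc.
add_to_bashrc() {
    line="$1"
    file="${2:-$HOME/.bashrc}"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (assumes the toolkit was unpacked in $HOME; the single quotes keep
# $PATH and $HOME unexpanded so they are evaluated at each login):
#   add_to_bashrc 'export PATH=$PATH:$HOME/sratoolkit.2.10.2-ubuntu64/bin'
```

Running it twice leaves a single copy of the line, so it is safe to re-run after rebuilding an instance.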
If we try to use the SRA Toolkit, we get an error:
$ srapath SRR000001
This sra toolkit installation has not been configured.
Before continuing, please run: vdb-config --interactive
For more information, see https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/
And so we need to configure the toolkit by running vdb-config --interactive, as the message says.
There are a couple of settings that you should change:
- Disable the cache: press C to choose the Cache tab, then press i to uncheck the box "enable local file-caching".
- Report cloud instance: press G to choose the GCP tab, then check the box "report cloud instance identity".
- Press s to save the changes.
Now we are going to check and see where the SRA Toolkit is pulling the data from:
$ srapath SRR000001
If you see a long and complex URL like this one then you are accessing SRA in the Cloud. Congratulations!
If, however, you see a short URL like this one:
$ srapath SRR000001
https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR000001/SRR000001.4
then, unfortunately, you are accessing SRA from NCBI, and you perhaps missed one of the configuration steps above.
Note: Unselecting the local file caching is my personal preference and is not required for accessing the SRA in the cloud. However, the local file cache will very quickly fill up your local hard drive. I recommend not keeping a local file cache!
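The srapath check described above can also be scripted. This sketch only looks at the host name of the returned URL and treats anything under ncbi.nlm.nih.gov as coming from NCBI. That rule is an assumption based on the short URL shown earlier, not something the toolkit documents, and sra_source is a made-up helper name.

```shell
# Classify a URL printed by srapath as "ncbi" or "cloud" from its host name.
# sra_source is a hypothetical helper; the host-name rule is an assumption.
sra_source() {
    host="${1#*://}"     # drop the scheme, e.g. "https://"
    host="${host%%/*}"   # drop everything after the host name
    case "$host" in
        "") echo "unknown" ;;                               # no URL given
        *.ncbi.nlm.nih.gov|ncbi.nlm.nih.gov) echo "ncbi" ;; # served by NCBI
        *) echo "cloud" ;;                                  # anything else
    esac
}

# Usage:
#   sra_source "$(srapath SRR000001)"
```

If this reports ncbi on a cloud instance, the region and the vdb-config settings are the first things to re-check.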