Search SRA Data

Our new service, searchSRA, provides you with hundreds of BAM-format files. However, running the search is only part of the battle; the next daunting step is the downstream processing. Here are some hints for processing the data.

All of the hints here assume a Linux-type operating system. If you are using a Mac, some of these commands will work in the terminal. If you are using Windows, you should probably see if you can get access to a Linux machine!

First, you need to download the data. We recommend cURL for that! Copy the URL of the results zip archive from the email that you get, or from the results.txt file, and then use cURL to download the results (note that the URL below is only an example and is not a valid link!):

curl -Lo results.zip http://141.18.18.16/results/0e1c6dd49b2/results.zip

After uncompressing the zip file you will most likely have many numbered directories, each of which contains some results. We combine those into a single directory. The number of directories depends on the number of SRA runs you search against: the maximum is about 45, but you may have far fewer. Adjust the number here as appropriate.

mkdir bamfiles
for i in $(seq 1 45); do mv $i/* bamfiles/; rmdir $i; done
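If you would rather not count the directories by hand, here is a minimal sketch that merges every directory whose name is purely numeric. The function name merge_result_dirs is made up for this example and is not part of searchSRA, and the sketch assumes the results directories only contain regular files.

```shell
# Sketch: merge every numbered results directory into one target directory.
# merge_result_dirs is a hypothetical helper name, not part of searchSRA.
merge_result_dirs() {
    local target="${1:-bamfiles}"
    local dir
    mkdir -p "$target"
    for dir in */; do
        dir="${dir%/}"
        # Only touch directories whose names are purely numeric (1, 2, 3, ...)
        case "$dir" in
            ''|*[!0-9]*) continue ;;
        esac
        # Move regular files only; an empty directory is not an error here
        find "$dir" -maxdepth 1 -type f -exec mv {} "$target"/ \;
        # rmdir refuses to remove a directory that still has something in it
        rmdir "$dir"
    done
}
```

Run it from the directory where you unzipped the results, for example as merge_result_dirs bamfiles. Directories with non-numeric names, including bamfiles itself, are left alone.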

This creates a single directory with all the results. At the moment, we don’t remove BAM files with no hits. Yes, I know we should, but we don’t. Sorry. To do that yourself, find every file smaller than 421 bytes and remove it:

find . -size -421c | xargs rm -f
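Since deleting files is hard to undo, it can help to look at what will be removed first. The sketch below splits the step into a listing function and a deleting function. Both names (list_small_files and remove_small_files) are invented for this example, and the 421-byte default is simply the threshold used above.

```shell
# Sketch: list, then delete, regular files smaller than a byte threshold.
# list_small_files and remove_small_files are hypothetical helper names.
# $1 = directory to search (default: current directory)
# $2 = size threshold in bytes (default: 421, the value used in the text)
list_small_files() {
    # -type f restricts the match to regular files, never directories
    find "${1:-.}" -type f -size "-${2:-421}c"
}

remove_small_files() {
    # Same test as above, but each matching file is passed to rm
    find "${1:-.}" -type f -size "-${2:-421}c" -exec rm -f {} +
}
```

For example, list_small_files bamfiles | wc -l tells you how many files would go, and remove_small_files bamfiles then deletes them. Restricting the match to regular files is a small safety net, because on some filesystems a directory can itself report a very small size.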

Because this also deletes some of the index files, we delete all of the indexes and remake them (run these commands from inside the bamfiles directory). We could be more precise about this, but this step takes a couple of seconds so we don’t care!

rm -f *.bai
for bamfile in *.bam; do samtools index "$bamfile"; done

Note: if the rm -f *.bai command fails (for example, because there are too many files for the shell to expand), you can try this one instead:

find . -name \*bai | xargs rm -f
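The delete-and-reindex step can also be wrapped up so that you hear about any BAM file that samtools cannot index. This is only a sketch: index_bams is a name made up for this example, and it assumes samtools is on your PATH and writes its index next to each file as file.bam.bai.

```shell
# Sketch: rebuild the index for every BAM file and report any that fail.
# index_bams is a hypothetical helper name; it assumes samtools is on the PATH.
# $1 = directory holding the BAM files (default: current directory)
index_bams() {
    local dir="${1:-.}"
    local bam
    local failed=0
    for bam in "$dir"/*.bam; do
        # Skip the unexpanded pattern when the directory holds no BAM files
        [ -e "$bam" ] || continue
        # Drop any stale index before rebuilding it
        rm -f "$bam.bai"
        if ! samtools index "$bam"; then
            echo "indexing failed: $bam" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every file was indexed
    [ "$failed" -eq 0 ]
}
```

Call it as index_bams bamfiles. Because the loop handles one file at a time, it does not depend on the shell expanding *.bai into one long argument list.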

EdwardsLab

Delivering the best in bioinformatics…
