Getting data from the NCBI Sequence Read Archive is not easy. Here we combine a few of our posts to go step by step through getting the data.

Identifying the data you need.

The first step is identifying the data that you actually want to get. The SRA publishes XML files each month that contain all the data about the reads in the SRA, but luckily the Meltzer lab converts that to SQLlite databases. Here is a description of how to download those databases and query them using SQLlite3. They are updated every month, so bookmark the URL!

Here is an example spreadsheet with multiple tabs that has some of the SRA data so you can browse it. Note that there are multiple tabs, one each for experiment, study, run, sample, and submission.

For example, here is the SQL command to find all runs from the human microbiome dataset

sqlite3 ~/SRA/SRA/SRAdb/SRAmetadb.sqlite "select run_accession from run where run.experiment_accession in (select experiment.experiment_accession from experiment where experiment.study_accession in (select study.study_accession from study where study.study_title like 'Human Microbiome Project%'))"

Downloading the data

Now you have identified some runs you want from the SRA, you need to download them. Install fastq-dump to download the data. You can either use

prefetch SRRxxxxxx

to prefetch the data from SRA and then use fastq-dump to convert them to fastq, or you can just use fastq-dump which will download the data and convert it to fastq in one step. If you want to ignore the prefetch line, just go ahead to the next command!

Extracting the data

We have explained the options to fastq-dump (and here are a list of those options), as they are not well explained. The approach that we use is:

/usr/local/sratoolkit/bin/fastq-dump --outdir fastq --gzip --skip-technical --readids --dumpbase --split-files --clip SRRxxxxxx

where SRRxxxxxx is the ID of the run you want from the SRA.

This gives you a directory called fastq that (probably) has two files in it for each run. There may be only one file if it is not a paired end run. You can use those with spades, convert them to fasta, and so on, however you can’t use them in bowtie.

Convert to matched paired end files

If you want to use them in bowtie, then you need to match the paired ends and make sure you have matched ends in both files and a file with whatever single reads you have left over.

We use the command:

/usr/bin/python2.7 ~/EdwardsLab/bam/fastq_pairs.py > fastq_paired_files.txt 2> pairing.err

Note that this uses the fastq_pairs.py script available on github. This is probably not the most fastest code, but it works in reasonable time with most files.

Now you are ready to use bowtie to map the reads to your target sequence. We use fastq2crassphage.sh to map the reads to crAssphage. This maps all the reads and only stores the reads that map to the alignment.

The next step is to pull out all the reads that mapped to a sequence so you can reassemble them. We use a command like this:

python2.7 ~/EdwardsLab/bam/bam2reads.py -b crassphage/SRR3403834.bam -f fastq/SRR3403834_1.fastq.gz -f fastq/SRR3403834_2.fastq.gz > crassphage/SRR3403834.interleaved.fastq

The code ~/EdwardsLab/bam/bam2reads.py takes a BAM file (created by the fastq2crassphage.sh above) and one or more fastq files and extracts all the aligned reads into an interlaced fastq file that you can, for example, assemble using spades.

EdwardsLab

Delivering the best in bioinformatics…

Getting data from the SRA

Identifying the data you need.

Downloading the data

Extracting the data

Convert to matched paired end files

Related