Monthly Archives: July 2017

Splitting and pairing fastq files

A lot of software benefits from paired fastq files that contain mate pair information, and usually you get these from your sequence provider. However, sometimes (e.g. when you download them from the SRA) you get sequences that are not appropriately paired.

There are lots of solutions (e.g. this thread suggests using Trimmomatic and this thread has an awk solution) but none split the sequences and order the sequences. Until now.

We’ve developed a bunch of different solutions to this problem in python (including fastq_pairs.pypair_fastq_fast.pypair_fastq_files.py, and pair_fastq_lowmem.py).

Recently, however, we’ve been handling very large files and the performance of these programs, (yes, even the lowmem version) is hindering our ability to process these files.

Therefore, we introduce fastq_pair, a C-implementation for pairing fastq files and sorting out which reads have matches in both files and which are singletons. This code starts with two fastq files and creates four output files. It is quick, and efficient, especially if you manipulate the size of the hash table (which you can do with a command line option).

It takes advantage of the random access ability to read files. We open a file and make an index of the ids in the file and the positions those indices occur in the file. Then, we read the second file, and if the IDs match, we scoot to the start of the appropriate line and write out those two sequences to the “pairs” files. We also set a flag in our data structure so we know that we’ve printed that sequence out. If the IDs don’t match, we write them to the “singles” file, and atthe end of all the processing we go through the IDs in our data structure and make print out those sequences we haven’t printed yet.

Take a look and give it a try!

Installing PyFBA (and necessary modules) without admin permissions

As easy as it is to install PyFBA using the pip command, it can be quite cumbersome to do so when you are working on a system without granted administrative or sudo permissions. Here is a quick guide that has worked for me when installing PyFBA on a CentOS 6.3 system running a SunGrid Engine cluster system. If you are working on a Linux system and you do have admin and sudo permissions, please follow the install guide here. Continue reading

How many bp of metagenomes are there in the SRA?

We were curious about how many bp of metagenomes in the SRA. This was partly inspired by our grant writing, and partly by this question on twitter from Tom Delmont:

 

 

This is how to answer the question!

Continue reading