Category Archives: bamfiles
command line deconseq
AKA: how to remove contamination from your metagenome! We use shark genomes, but it works with humans, corals, and other things too!
A while ago we wrote
This is trivial to do with modern sequence analysis tools, and so we provide recipes here for filtering your reads based on matches to a reference genome. Read more to find out how!
We also provide a snakefile that does all these steps. Set up the bowtie2 index, a directory of reads, and an output location, and it will generate mapped and unmapped reads for you!
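As a rough illustration of the filtering idea (this is not the snakefile itself), the sketch below separates reads by the SAM FLAG field: once reads have been aligned to the contaminant reference, bit 0x4 of the flag marks a read as unmapped. The function name `split_by_mapping` and the example records, including the reference name `shark_chr1`, are made up for illustration.

```python
def split_by_mapping(sam_lines):
    """Separate SAM alignment records into mapped and unmapped read names."""
    UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"
    mapped, unmapped = [], []
    for line in sam_lines:
        # Skip blank lines and header lines (headers start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            unmapped.append(name)
        else:
            mapped.append(name)
    return mapped, unmapped


# Hypothetical records: read1 hits the reference, read2 does not.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tshark_chr1\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
]
mapped, unmapped = split_by_mapping(example)
print("matched the reference:", mapped)
print("did not match:", unmapped)
```

Reads that match the reference are the likely contamination, and the unmapped reads are the ones you would usually keep for the metagenome.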
Similarity Searching the SRA
The NCBI SRA contains a lot of data, about 10^16 bp at the moment! However, searching that data has always been problematic. We’re happy to unveil a new search SRA service that allows you to search the random metagenomes in the SRA using either DNA or protein queries.
Splitting and pairing fastq files
A lot of software benefits from paired fastq files that contain mate pair information, and usually you get these from your sequence provider. However, sometimes (e.g. when you download them from the SRA) you get sequences that are not appropriately paired.
There are lots of solutions (e.g. this thread suggests using Trimmomatic and this thread has an awk solution) but none both split and order the sequences. Until now.
We’ve developed a bunch of different solutions to this problem in python (including fastq_pairs.py, pair_fastq_fast.py, pair_fastq_files.py, and pair_fastq_lowmem.py).
Recently, however, we’ve been handling very large files, and the performance of these programs (yes, even the lowmem version) is hindering our ability to process them.
Therefore, we introduce fastq_pair, a C implementation for pairing fastq files and sorting out which reads have matches in both files and which are singletons. This code starts with two fastq files and creates four output files. It is quick and efficient, especially if you manipulate the size of the hash table (which you can do with a command line option).
It takes advantage of random access when reading files. We open the first file and build an index of the IDs in the file and the positions at which those IDs occur. Then we read the second file and, if the IDs match, we scoot to the start of the appropriate line in the first file and write both sequences to the “pairs” files. We also set a flag in our data structure so we know that we’ve printed that sequence. If the IDs don’t match, we write the sequence to the “singles” file, and at the end of all the processing we go through the IDs in our data structure and print out those sequences we haven’t printed yet.
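The index-and-seek idea described above can be sketched in a few lines of Python. This is only an illustration, not the fastq_pair C code: a Python dict stands in for the hash table, a set stands in for the "printed" flag, records are assumed to be exactly four lines long, and the function names and output file names are invented for the example.

```python
import os
import tempfile


def read_id(header):
    """Drop the leading '@' and any /1 or /2 mate suffix from a header line."""
    name = header.strip().split()[0][1:]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def read_record(fh):
    """Read one four-line fastq record; returns b'' at end of file."""
    return b"".join(fh.readline() for _ in range(4))


def index_fastq(path):
    """Map each read ID to the byte offset where its record starts."""
    index = {}
    with open(path, "rb") as fh:
        while True:
            offset = fh.tell()
            record = read_record(fh)
            if not record:
                break
            index[read_id(record.split(b"\n", 1)[0].decode())] = offset
    return index


def pair_fastq(left, right, outdir):
    """Write paired and singleton reads from two fastq files to four files."""
    index = index_fastq(left)
    printed = set()  # IDs from the left file that were written as pairs
    names = ["left.paired.fq", "right.paired.fq",
             "left.single.fq", "right.single.fq"]
    paths = [os.path.join(outdir, n) for n in names]
    with open(left, "rb") as lfh, open(right, "rb") as rfh, \
            open(paths[0], "wb") as lp, open(paths[1], "wb") as rp, \
            open(paths[2], "wb") as ls, open(paths[3], "wb") as rs:
        while True:
            record = read_record(rfh)
            if not record:
                break
            rid = read_id(record.split(b"\n", 1)[0].decode())
            if rid in index:
                lfh.seek(index[rid])  # jump straight to the mate
                lp.write(read_record(lfh))
                rp.write(record)
                printed.add(rid)
            else:
                rs.write(record)
        # Anything in the index that was never printed is a left singleton.
        for rid, offset in index.items():
            if rid not in printed:
                lfh.seek(offset)
                ls.write(read_record(lfh))
    return paths


# Tiny made-up example: r1 is paired, r2 and r3 are singletons.
results = {}
with tempfile.TemporaryDirectory() as tmp:
    left = os.path.join(tmp, "left.fq")
    right = os.path.join(tmp, "right.fq")
    with open(left, "wb") as fh:
        fh.write(b"@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n")
    with open(right, "wb") as fh:
        fh.write(b"@r3/2\nTTAA\n+\nIIII\n@r1/2\nTGCA\n+\nIIII\n")
    for path in pair_fastq(left, right, tmp):
        with open(path, "rb") as fh:
            results[os.path.basename(path)] = fh.read()
```

Only the IDs and offsets are held in memory, not the sequences, which is why this approach copes with files that are too large to load whole.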
Take a look and give it a try!
SRA attributes
These are all the attributes in the SRA files
Getting data from the SRA
Getting data from the NCBI Sequence Read Archive is not easy. Here we combine a few of our posts to go step by step through getting the data.
Creating indexed bam files from bowtie alignments
Searching with bowtie
Creating indexed bam files from sam files is something we need to do all the time. This is just a handful of simple commands to remind you how to do that.
We also use the PySAM module to extract some information about the alignments.
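For reference, a typical sam-to-indexed-bam conversion is three samtools calls: convert, sort, then index. The sketch below builds those commands in Python and only runs them if samtools is installed and the input exists. The file names (`reads.sam`, `reads.bam`, `reads.sorted.bam`) and the function name are placeholders, and the exact commands in the post itself may differ.

```python
import os
import shutil
import subprocess


def sam_to_indexed_bam_commands(sam_path, prefix):
    """Return samtools commands that make a sorted, indexed BAM from a SAM."""
    unsorted_bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        # Convert SAM text to binary BAM.
        ["samtools", "view", "-b", "-o", unsorted_bam, sam_path],
        # Sort by coordinate, which indexing requires.
        ["samtools", "sort", "-o", sorted_bam, unsorted_bam],
        # Write the index next to the sorted BAM.
        ["samtools", "index", sorted_bam],
    ]


commands = sam_to_indexed_bam_commands("reads.sam", "reads")
can_run = shutil.which("samtools") is not None and os.path.exists("reads.sam")
for cmd in commands:
    print(" ".join(cmd))
    if can_run:
        subprocess.run(cmd, check=True)
```

The sorted, indexed file is what tools that need random access to alignments by position expect.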