In DNA sequencing, we add primers and adapters to the ends of sequences. These are short (typically <50bp) known sequences, that we use so we can identify different kinds of sequences. You can find out more about the adapters in this YouTube video.
This challenge is to write software to efficiently detect and remove the primers and adapters from a fastq format file.
One of the first steps in processing sequence data is to remove the primer and adapter sequences. Typically we will have two files: the raw sequence files that we have just generated that will be in fastq format, and the primer/adapter sequence file that is known and is typically in fasta format.
The challenge is relatively straightforward: read all the primers and adapters in the file first, and then read the DNA sequences one at a time and check the left part of the sequences, and then check the right part of the sequences.
Pseudocode might be something like
# read the primers
for primer in primer_file:
store primer in primer_data_structure
# read the sequences
for sequence in sequence_file:
for primer in primer_data_structure:
if index(sequence, primer) < max_left_offset:
remove primer from sequence at left end
if index(sequence, primer) > max_right_offset:
remove primer from sequence at right end
write modified sequence
Note that we include a max_left_offset
and a max_right_offset
because there is a chance that the primer could occur in the middle of the sequence. We typically ignore those occurrences, but we could also consider either splitting the sequence to remove them, or masking them out somehow.
Thought question: If you have a 10 bp primer, how often would you find that sequence just by chance? Would you expect to find that sequence in the human genome and if so, how many times?
Example Data
Here is a file of primer sequences in fasta format. This is a good example set that you can use, although in real world usage you will most likely end up using a different set of primers, so you should not hard code these!
Here are is a small fastq file with only 100 or so sequences that you should find a few matches in the primer sequences.
Suggestions
Use a compiled language: I strongly encourage plain ANSI C for this problem, you don’t need the additional libraries that C++ offers, though that choice is up to you!
If you are using C, use Heng Li’sawesome kseq library for reading the sequence files. It is described in this blog post a bit, but there are some example use cases here. Kate McNair has also put together a very simple template for using kseq.
If you are using (almost) any other language you should consider using readfq (also from Heng Li) that has the fastest ways to parse fasta and/or fastq files in many languages.
Evaluation
Once you have code that works you should try it on a few more data sets. If you need other examples, contact Rob. However, eventually you may want to compare it to the best code so far usingthe adapter benchmark suite.