Counting K-mers

One of the most essential tools in bioinformatics is counting k-mers. These are short, identical strings, of length k. Here, we will look at counting all the k-mers in string.

DNA has a four letter alphabet, [A, G, C, T] that corresponds to the four chemicals that make up DNA (Adenine, Guanine, Cytosine, and Thymine). That means for any string of letters of length k there are 4k possible strings that you can make.

For example, here are all possible 4-mers of a DNA sequence:

AAAA AAAC AAAG AAAT AACA AACC AACG AACT AAGA AAGC AAGG AAGT AATA AATC AATG AATT
ACAA ACAC ACAG ACAT ACCA ACCC ACCG ACCT ACGA ACGC ACGG ACGT ACTA ACTC ACTG ACTT
AGAA AGAC AGAG AGAT AGCA AGCC AGCG AGCT AGGA AGGC AGGG AGGT AGTA AGTC AGTG AGTT
ATAA ATAC ATAG ATAT ATCA ATCC ATCG ATCT ATGA ATGC ATGG ATGT ATTA ATTC ATTG ATTT
CAAA CAAC CAAG CAAT CACA CACC CACG CACT CAGA CAGC CAGG CAGT CATA CATC CATG CATT
CCAA CCAC CCAG CCAT CCCA CCCC CCCG CCCT CCGA CCGC CCGG CCGT CCTA CCTC CCTG CCTT
CGAA CGAC CGAG CGAT CGCA CGCC CGCG CGCT CGGA CGGC CGGG CGGT CGTA CGTC CGTG CGTT
CTAA CTAC CTAG CTAT CTCA CTCC CTCG CTCT CTGA CTGC CTGG CTGT CTTA CTTC CTTG CTTT
GAAA GAAC GAAG GAAT GACA GACC GACG GACT GAGA GAGC GAGG GAGT GATA GATC GATG GATT
GCAA GCAC GCAG GCAT GCCA GCCC GCCG GCCT GCGA GCGC GCGG GCGT GCTA GCTC GCTG GCTT
GGAA GGAC GGAG GGAT GGCA GGCC GGCG GGCT GGGA GGGC GGGG GGGT GGTA GGTC GGTG GGTT
GTAA GTAC GTAG GTAT GTCA GTCC GTCG GTCT GTGA GTGC GTGG GTGT GTTA GTTC GTTG GTTT
TAAA TAAC TAAG TAAT TACA TACC TACG TACT TAGA TAGC TAGG TAGT TATA TATC TATG TATT
TCAA TCAC TCAG TCAT TCCA TCCC TCCG TCCT TCGA TCGC TCGG TCGT TCTA TCTC TCTG TCTT
TGAA TGAC TGAG TGAT TGCA TGCC TGCG TGCT TGGA TGGC TGGG TGGT TGTA TGTC TGTG TGTT
TTAA TTAC TTAG TTAT TTCA TTCC TTCG TTCT TTGA TTGC TTGG TTGT TTTA TTTC TTTG TTTT

The challenge is to read a DNA sequence file, and count the k-mers in that file, efficiently (i.e. using as little memory as possible), and quickly (i.e. using the least amount of time).

There are some publicly available solutions for this problem. One that we use a lot in the lab is Jellyfish which makes some assumptions to speed up the counting and uses a compare-and-swap approach to make the computation faster.

If you are thinking about writing code to count k-mers, you should start with a simple case! What is it that you need to accomplish? The most trivial approach is to start with a DNA sequence, and create a table with the k-mer counts for each sequence.

Here are a set of DNA sequences for crAssphage (note that you can download these sequences compressed with gzip or zip for faster transfer). These come from the paper by Emma Guerin et al. (published in Cell Host & Microbe), and there are 249 sequences in here. The summary statistics are:

  • Number of sequences: 249
  • Total length: 23,311,060
  • Shortest sequence: 70,305
  • Longest sequence: 106,584

Note that each sequence starts with a line that begins with a > and has the sequence ID, and then the rest of the sequence follows. This format is a very common DNA sequence format called fasta format.

You should start by defining what your data structures will be. How are you going to read that file into memory, and access the contents. Then you should think about the algorithm that you will need to count the k-mers. Start with a simple algorithm – identifying each k-mer one at a time, and counting it. Once you have a simple algorithm that works, can you improve upon that algorithm. Can you make it run faster, use less memory, or use a better data structure?