How to create a database for BWA and BWA-SW

The following example shows how to create a database from the human reference genome for the use with BWA, BWA-SW and DeconSeq. The commands used below assume a Unix-based operating system.

Download sequence data
There are several places where one can retrieve the sequence data. The human reference genome build 37 can be downloaded from the National Center for Biotechnology Information (NCBI) FTP server. If you get an error, check if the files are still available at the specified loacation and if the file names are still the same.

$ for i in {1..22} X Y MT; do wget ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/Assembled_chromosomes/seq/hs_ref_GRCh37.p2_chr$i.fa.gz; done

 

Extracting and joining data

The sequence data is compressed with the GZIP algorithm and in 25 different files. The next command will extract the data and write them to a single file. At the same time, the compressed files will be deleted (rm command). The -v in the gzip command provides an easy option to see the progress.

$ for i in {1..22} X Y MT; do gzip -dvc hs_ref_GRCh37.p2_chr$i.fa.gz >>hs_ref_GRCh37_p2.fa; rm hs_ref_GRCh37.p2_chr$i.fa.gz; done

 

Splitting sequences by long repeats of ambiguous base N

BWA and BWA-SW replace ambiguous bases (everything that is not an ACGT) with a random base (an ACGT). This can, for example, result in false positive hits in long stretches of Ns that were randomly converted into a sequence similar to a query sequence. The reason for the replacement of ambiguous bases is the 2 bit representation of sequence data that only allows to store/represent four different bases (as 00, 01, 10 and 11). Therefore, it is a good idea to remove long strechtes of ambiguous bases before creating the database (especially considering the stretches of 50,000 Ns in the human reference genome). The following command will additionally remove Ns at the beginning and end of each sequence (e.g. the 10,000 Ns at the beginning of chromosome 1).

$ cat hs_ref_GRCh37_p2.fa | perl -p -e 's/Nn/N/' | perl -p -e 's/^N+//;s/N+$//;s/N{200,}/n>splitn/' >hs_ref_GRCh37_p2_split.fa; rm hs_ref_GRCh37_p2.fa

 

Filtering sequences

BWA-SW uses the concatenation of all input sequences when generating the database. After splitting the genomic sequences, it is possible that a query sequence will be longer than a sequence in the database. This can cause problems when a query sequence aligns over multiple (concatenated) sequences in the database. In the current version, BWA-SW is not able to resolve alignments spanning more than two database sequences. By removing shorter sequences, a query sequence should not be able to align to more than two sequences in the database.
In addition to short sequences, sequence duplicates should be removed as well. Duplicates do not add new information, but add to the size of the database. In order to reduce the amount of false positive alignments (see above), sequences with too many ambiguous bases should be removed.
The sequences can be easily filtered with programs such as PRINSEQ (http://prinseq.sourceforge.net). The following command will additionally rename the sequence identifiers to ensure unique identifiers in the whole data set and delete the input file (rm command).

$ perl prinseq-lite.pl -log -verbose -fasta hs_ref_GRCh37_p2_split.fa -min_len 200 -ns_max_p 10 -derep 12345 -out_good hs_ref_GRCh37_p2_split_prinseq -seq_id hs_ref_GRCh37_p2_ -rm_header -out_bad null; rm hs_ref_GRCh37_p2_split.fa

 

Creating the database

The BWA program provides an option to create databases from sequence data in FASTA format. For bigger data sets, the bwtsw algorithm should be used. The maximum size of a database is 4GB. If you want to generate a database from larger data sets, you have to split the data into chunks of less than 4GB. The indexing process can take a while and therefore, we will use the “&” at the end of the command to run the process in the background. In order to know what BWA is doing, we will write its output to the file bwa.log. You can use “$ tail -f bwa.log” to check for the current status.

$ bwa index -p hs_ref_GRCh37_p2 -a bwtsw hs_ref_GRCh37_p2_split_prinseq.fasta >bwa.log 2>&1 &

When the BWA program has finished, you should have eight files with the name specified by the -p parameter.