mmseqs2 tips

Here are some simple mmseqs2 recipes that are easy to forget. If you need explanations or details, read the mmseqs2 wiki. Otherwise, good luck!

1. Add a taxonomy to a set of proteins (e.g. for LCA work). This works best with PATRIC IDs, as they contain the taxonomy ID in the sequence ID!

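# Create a taxidmapping file that maps sequence ID -> taxonomy ID
# (expects tab-separated input with the fig|... ID in the second column;
# lines without a fig| ID are skipped, and taxon 6666666 is rewritten to 12908, NCBI "unclassified sequences")
perl -F"\t" -lane 'next unless $F[1] =~ m#fig\|(\d+)\.#; print "$F[1]\t", ($1 == 6666666 ? 12908 : $1);' protein_ids > taxidmapping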
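
If you want to check what the mapping should look like before running it on real data, here is a minimal sketch of roughly the same logic in awk. The file names protein_ids_example and taxidmapping_example and the three input lines are made up for illustration, and the layout (ID in the second tab-separated column) is an assumption taken from the one-liner above. This version skips lines that have no fig| ID.

```shell
# Hypothetical input: column 2 holds the PATRIC-style ID (assumed layout).
printf 'p1\tfig|83333.1.peg.1\np2\tfig|6666666.42.peg.7\np3\tno_fig_id\n' > protein_ids_example

# Pull the digits between "fig|" and the first "." out of column 2.
awk -F'\t' '
  match($2, /fig\|[0-9]+\./) {
    taxid = substr($2, RSTART + 4, RLENGTH - 5)   # drop leading "fig|" and trailing "."
    if (taxid == "6666666") taxid = "12908"       # same substitution as the Perl one-liner
    print $2 "\t" taxid
  }
' protein_ids_example > taxidmapping_example

cat taxidmapping_example
```

The result should have one line per sequence with a fig| ID: the sequence ID, a tab, and the taxonomy ID.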

# Create a taxonomy database (formattedDB is a sequence database already made with mmseqs createdb)
mmseqs createtaxdb formattedDB $TMPDIR/createtaxdb --ncbi-tax-dump $HOME/ncbi/taxonomy/current --tax-mapping-file taxidmapping --threads 64

# Use the taxonomy database in a taxonomic mapping
mmseqs easy-taxonomy input_R1.fq.gz formattedDB output_R1.taxonomy $TMPDIR/mmseqs --threads 64

2. Format some sequences for searching (e.g. a blastx-style search). This assumes a protein database and nucleotide queries.

# format the database
mmseqs createdb protein.faa proteinDB

# search the database and create an m8 format file
# you will get an error if TMPDIR is not defined. You can set it to /tmp if you are not sure!
mmseqs easy-search query.fastq.gz proteinDB query_proteinDB.m8 "$(mktemp -d -p "$TMPDIR")" --threads 64
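
One way to avoid the TMPDIR error is to set a fallback before calling mmseqs. This is only a sketch: the variable name scratch is made up here, and the cleanup trap is a convenience, not something mmseqs needs. The mmseqs line is commented out so the snippet runs without mmseqs installed.

```shell
# Fall back to /tmp when TMPDIR is unset or empty.
TMPDIR="${TMPDIR:-/tmp}"
export TMPDIR

# Create a private scratch directory and remove it when the shell exits.
scratch="$(mktemp -d -p "$TMPDIR")"
trap 'rm -rf "$scratch"' EXIT

echo "scratch directory: $scratch"
# mmseqs easy-search query.fastq.gz proteinDB query_proteinDB.m8 "$scratch" --threads 64
```

Quoting "$scratch" keeps the command working if the temporary path contains spaces.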

EdwardsLab
