Here are some (probably simple) MMseqs2 tips that are easy to forget. If you need explanations or details, read the MMseqs2 wiki. Otherwise, good luck!
1. Add taxonomy to a set of proteins (e.g. for LCA work). This works best with PATRIC IDs, as they contain the taxonomy ID in the sequence ID!
# Create a taxidmapping file that maps sequence ID -> taxonomy ID
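# The PATRIC ID (fig|<taxid>.<n>...) is read from the second tab-separated column of protein_ids.
# 6666666 is the placeholder used when the taxonomy is unknown, so it is mapped to 12908 (NCBI "unclassified sequences").
perl -F"\t" -lane '$F[1] =~ m#fig\|(\d+)\.#; print "$F[1]\t", $1 == 6666666 ? 12908 : $1;' protein_ids > taxidmapping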
# Create a taxonomy database (formattedDB must already exist, e.g. made with mmseqs createdb)
mmseqs createtaxdb formattedDB $TMPDIR/createtaxdb --ncbi-tax-dump $HOME/ncbi/taxonomy/current --tax-mapping-file taxidmapping --threads 64
# Use the above in a taxonomic mapping
mmseqs easy-taxonomy input_R1.fq.gz formattedDB output_R1.taxonomy $TMPDIR/mmseqs --threads 64
2. Simply format some sequences for searching (e.g. a blastx-style search). This assumes a protein database and nucleotide queries.
# format the database
mmseqs createdb protein.faa proteinDB
# search the database and create an m8 format file
# you will get an error if TMPDIR is not defined. If you are not sure, set it to /tmp (export TMPDIR=/tmp)!
mmseqs easy-search query.fastq.gz proteinDB query_proteinDB.m8 $(mktemp -d -p $TMPDIR) --threads 64
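If you would rather not think about TMPDIR at all, one possible wrapper is sketched below. It falls back to /tmp when TMPDIR is unset, creates a private scratch directory, and removes it on exit. The `scratch` variable name and the guard around the search are my own additions, and the file names are the ones from the tip above, so adjust them to your data.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fall back to /tmp when TMPDIR is unset or empty.
export TMPDIR="${TMPDIR:-/tmp}"

# Private scratch directory for this run, cleaned up when the script exits.
scratch=$(mktemp -d -p "$TMPDIR")
trap 'rm -rf "$scratch"' EXIT
echo "scratch directory: $scratch"

# Only run the search when mmseqs is on the PATH and the inputs exist;
# otherwise report what is missing instead of failing halfway through.
if command -v mmseqs >/dev/null 2>&1 && [ -e query.fastq.gz ] && [ -e proteinDB ]; then
    mmseqs easy-search query.fastq.gz proteinDB query_proteinDB.m8 "$scratch" --threads 64
else
    echo "mmseqs or the input files were not found; skipping the search" >&2
fi
```

The trap means an interrupted run does not leave large temporary files behind, which matters on shared scratch space.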