Running Autocycler on Pawsey

Autocycler is undoubtedly the best assembler, especially for Oxford Nanopore sequencing of bacterial genomes. You often get complete circular genomes.

We are going to use autocycler on Pawsey to assembled a genome!

We are really following the instructions that Ryan Wick provided, and they are excellent

Before you begin, please install conda and create an autocycler environment.

Once you have installed those commands, please run this command to update plassembler:

mamba activate  /scratch/$PAWSEY_PROJECT/$USER/software/miniconda3/autocycler
plassembler download -d "$CONDA_PREFIX"/plassembler_db

Note, we also have an autocycler installer slurm script that does everything!

To run autocycler you need:

  1. A fastq file with your sequences that you want to assemble.
  2. The autocycler run assmebly slurm script

You should be able to run this with the command:

sbatch autocycler_run.slurm reads.fastq assembly

The assembly process will take a while, and you can see the outputs in the temporary directory. Check the error log for the location of those files.

Once complete, your output should be in assembly

Using Conda (or Mamba) on Pawsey

Note:
This is part of our series on Pawsey that are written by users – not by Pawsey staff. There are certainly other, and probably better, ways to do this, but this is what we are currently doing!
You should also read the Pawsey Help Documentation

Pawsey storage locations (disks)

There are three main storage locations that you can access:

  1. /home (where you log in to) has a limit of 10,000 files and 1Gb of storage, so you will quickly fill that up.
  2. /software has a limit of 16,384G abd 250k files, so you can put more things there, but not everything
  3. /scratch has 9.8P of storage, but everything is deleted after 21 days, so this is not a brilliant location either.
  4. acacia is for longer term storage but you can’t access that directly, so you can’t install software there.

A basic conda set up.

I use /software for some basic conda environments that I am going to use regularly. For example, I have an rclone environment that only has rclone and I use to move data on and off of setonix or acacia. My other environments are a bioinformatics environment which has a few common tools I use day-to-day like samtools and minimap and a git-lfs environment I also use regularly, that only has git-lfs installed. (If you don’t know what git-lfs is for, you probably don’t need it!)

Everything else, I put in a temporary directory in /scratch and then I recreate them as I need it.

There are two different solutions to this problem, and I use both depending on how I feel.

Disposable /scratch conda environments.

I make a temporary environment on /scratch with a directory name that is a meaningless random set of characters. I install what I need, use it as I need it, and then later, when I remember, I delete the environment.

The advantage of this approach, is you leave it if something is broken and start again, and you make a new directory for each thing you are doing.

Rememberable, but disposable, /scratch conda environments.

The alternative is to use a name that you will remember, but then you also need to remember that things are probably broken after 21 days and you need to reinstall everything.

Let’s walk through setting up your conda, installing some software, remembering how to do it, and deleting the environment.

For this example, I’m going to use autocycler as my software to install, and I’ll also install minimap2 and samtools.

Install conda/mamba

Start with installing conda/mamba from miniforge.

Go to the instructions for installing miniforge on a Unix-like platform and use either wget or curl to download the installer. It doesn’t matter which one, so start with curl (because that is first on the list), and if that doesn’t work use wget.

Set up your .condarc file.

Use nano ~/.condarc and copy the block below and paste into the file.

channels:
  - conda-forge
  - bioconda
envs_dirs:
  - /software/projects/$PAWSEY_PROJECT/$USER/miniconda3/envs_dirs
pkgs_dirs:
  - /scratch/$PAWSEY_PROJECT/$USER/software/miniconda3/pkg_dirs
env_prompt: "({name}) "
channel_priority: strict

This block adds conda-forge and bioconda so you can easily install software, sets the default environment directory to /software and the location where the files are downloaded to /scratch.

Create a environment file to install the software

If you use environment files, you can install the software directly from the file, and then if you need to reinstall things (e.g. because the file has been deleted, you just need one command!).

Use nano to create a file called environment.yml and paste this information:

name: autocycler
channels:
   - conda-forge
   - bioconda
dependencies:
  - autocycler>=0.5.0            # https://github.com/rrwick/Autocycler
  - canu>=2.3                    # https://github.com/marbl/canu
  - flye>=2.9.6                  # https://github.com/mikolmogorov/Flye
#  - lja>=0.2                     # https://github.com/AntonBankevich/LJA
  - metamdbg>=1.0                # https://github.com/GaetanBenoitDev/metaMDBG
  - miniasm>=0.3                 # https://github.com/lh3/miniasm
  - minimap2>=2.28               # https://github.com/lh3/minimap2
  - minipolish>=0.2.0            # https://github.com/rrwick/Minipolish
  - myloasm>=0.1.0               # https://github.com/bluenote-1577/myloasm
  - necat>=0.0.1_update20200803  # https://github.com/xiaochuanle/NECAT
  - nextdenovo>=2.5.2            # https://github.com/Nextomics/NextDenovo
  - nextpolish>=1.4.1            # https://github.com/Nextomics/NextPolish
  - plassembler>=1.8.0           # https://github.com/gbouras13/plassembler
  - racon>=1.5.0                 # https://github.com/lbcb-sci/racon
  - raven-assembler>=1.8.3       # https://github.com/lbcb-sci/raven
  - wtdbg>=2.5                   # https://github.com/ruanjue/wtdbg2

If you install this with:

mamba env create -f environment.yml

it will download the packages and install them into a mamba environment called autocycler located on /software.

Once the install is complete, you can list the environments with:

mamba info --envs

This is now consuming a part of your quota on /software and if you install too many packages here, it will get full!

Create a disposable environment

Now that we have an environment file, we don’t need to install it on /software every time.

Here, we make a random 12 character long string, and make the environment with that name.

TMP=$(for i in {1..12}; do printf "%x" $((RANDOM % 16)); done)
mamba env create --yes --prefix /scratch/$PAWSEY_PROJECT/$USER/software/miniconda3/$TMP --file environment.yml
mamba activate /scratch/$PAWSEY_PROJECT/$USER//software/miniconda3/$TMP

Note that when the installation is complete it tells you how to activate the environment. When I did this, mine was called e95467637aed.

You can also see that environment listed using mamba info --envs

Create a memorable, disposable, environment

You can do the same thing, but give the environment a name you remmeber. For example:

mamba env create --yes --prefix /scratch/$PAWSEY_PROJECT/$USER/software/miniconda3/autocycler --file environment.yml
mamba activate /scratch/$PAWSEY_PROJECT/$USER//software/miniconda3/autocycler

[NOTE!]
The environment created without using a --prefix command is called autocycler and is on /software. The environment created with the --prefix command is on /scratch and is a different environment. Since this is exceptionally confusing, do one or the other, but NOT both!

Deleting environments

Pawsey will automatically delete any files that are older than 21 days, so you don’t need to worry about old environments, however it gets very confusing, so you should delete them.

Start by doing mamba info --envs to get a list of your environments, and then choose the path of the one you want to remove.

Delete the environment, and any files left in it, using

mamba env remove --prefix  /software/projects/$PAWSEY_PROJECT/$USER/miniconda3/envs_dirs/autocycler

Clean up your downloaded packages

Sometimes when you are installing software you will get random errors about packages being incomplete or not able to be installed. Usually, the problem is that the packages on /scratch have been deleted, so clean them out and try again, which will force them to be re-downloaded.

mamba clean -af

UniRef50 or UniRef100 for metagenomics? Sometimes faster is not just faster

In metagenomics, we often face a familiar trade-off: do we use a smaller, more clustered reference database that runs quickly, or a larger, more detailed database that may give finer resolution but requires substantially more compute?

One common example is the choice between the clustered databases from UniRef. We use both UniRef50 and UniRef100 for protein-based taxonomic or functional classification. Both databases derive from UniProt, but they differ in clustering level. UniRef50 clusters sequences at 50% identity, producing a smaller, less redundant database. UniRef100 retains much more sequence-level detail and is therefore larger, more comprehensive, and more computationally demanding. At the moment, the mmseqs2 version of UniRef100 is 275G while the mmseqs2 version of UniRef50 is 29G.

The obvious assumption is that UniRef100 should be “better” because it contains more information. But in metagenomics, the more useful question is often: does UniRef100 change the biological interpretation enough to justify the additional runtime?

In one recent comparison using the same metagenomic dataset, the UniRef50 search took 503 seconds using 64 threads. The equivalent UniRef100 search took 9,846 seconds using the same 64 threads.

That is approximately:

  • 8 minutes 23 seconds for UniRef50
  • 2 hours 44 minutes 6 seconds for UniRef100
  • an almost 20-fold increase in runtime for UniRef100 (even though its ~10x the size)

That is not a minor computational penalty. On a single dataset, it may be acceptable. Across hundreds or thousands of metagenomes, it becomes a major consideration for throughput, queue time, storage, and reproducibility.

What changed in the results?

At both genus and family taxonomic levels, the UniRef50 and UniRef100-derived profiles were broadly concordant, especially after log transformation. This is important because metagenomic abundance tables are typically sparse and highly skewed: a small number of taxa or functions can dominate the raw counts, while most features are rare or absent in most samples.

At the family level, the agreement was especially strong. Comparing family-by-sample abundance values, the raw Pearson correlation was modest, but this was expected because raw count data are sparse and unevenly distributed. After applying a log1p transformation (y=log(1 + x)), the correlation was much stronger. Family-level total abundances across the dataset were also well correlated.

The sample-level totals were almost perfectly correlated between the two database choices: the two approaches largely agreed on which samples had higher or lower overall assignment levels.

However, the two approaches were not identical. UniRef100 assigned more total counts and, in some cases, appeared to resolve reads that were left at broader or more ambiguous taxonomic levels in the UniRef50-derived result.

The practical interpretation

The key point is not that UniRef50 and UniRef100 are interchangeable. They are not. UniRef100 can produce more assignments and may give more specific taxonomic placements in some parts of the tree.

But for many metagenomic questions, especially when working at higher taxonomic ranks such as family, the broad biological signal may be very similar between UniRef50 and UniRef100. If the main goal is to compare samples, identify large-scale community shifts, or screen many datasets efficiently, UniRef50 may be more than adequate.

If the goal is to chase fine-scale taxonomic differences, resolve difficult clades, or maximise the number of assigned reads, UniRef100 may be worth the cost.

A useful rule of thumb

For large-scale metagenomic projects, I would treat UniRef50 as the sensible default for exploratory analysis, cohort-scale comparisons, and routine pipelines. It is faster, cheaper, and often preserves the major biological patterns.

UniRef100 is better reserved for cases where the additional resolution is likely to matter: detailed reanalysis of selected samples, validation of specific signals, fine-grained taxonomic interpretation, or situations where ambiguous assignments from UniRef50 need to be resolved.

A practical workflow might be:

  1. Run the full dataset against UniRef50.
  2. Identify the major biological patterns, outliers, and features of interest.
  3. Re-run selected samples or specific analyses against UniRef100.
  4. Ask whether UniRef100 changes the conclusion, not just whether it changes the counts.

This gives the best of both worlds: speed and scalability from UniRef50, with targeted use of UniRef100 where the extra resolution may be informative.

Bigger databases are not automatically better

Metagenomics already has enough computational bottlenecks. Bigger databases increase runtime, memory pressure, disk usage, and downstream complexity. They can also increase the number of plausible matches, which does not always make interpretation easier.

The important question is not simply, “Which database is larger?” or “Which database is more complete?”

The better question is:

Does the larger database change the biological conclusion enough to justify the additional computational cost?

In my comparison, UniRef100 required nearly 20 times longer than UniRef50. The resulting profiles were not identical, but the major sample-level and family-level abundance patterns were strongly concordant.

Jupyter

Pip installing for jupyter!

If you have a jupyter environment running, how do you know where to pip install. The solution is quite easy!

Install using jupyter! You can just install directly from the notebook.

!pip install <pypi package>

Inspect the kernelspec to confirm which conda environment you are using

I rename my jupyter environments, and that is _supposed_ to make things easier, but never does! Here’s a four step process:

  1. Check the kernelspec from a command line: jupyter kernelspec list
  2. From your environment, cat the kernel.json file, and this will show you which conda environment that kernel is actually using
  3. Activate that environment (e.g. with mamba)
  4. Install the package
GPU

Creating a conda environment for GPU programming with pytorch and tensorflow

After a few mis-steps, here is how I set up a conda environment to use in Jupyter with tensorflow, pytorch, and using the GPU.

As a note, I do this on the node with the GPU, so that things (hopefully) compile correctly!

1. Create an environment

First, create a new environment for tensorflow and friends, and activate it.

mamba create -n gpu_notebook cudatoolkit tensorflow nvidia::cuda pytorch torchvision torchaudio pytorch-cuda -c pytorch -c nvidia
mamba activate gpu_notebook

2. Install the python libraries

Install the usual suspect python packages that you will probably want to use. For convenience, I usually put these in a file in my
Git repo called requirements.txt.

$ cat requirements.txt 
jupyter
matplotlib
natsort
numpy
pandas
scipy
scikit-learn
seaborn
statsmodels
pip install -r requirements.txt

3. Reame your jupyter kernel

When you open jupyter there is a list of kernels that you can connect to. (If you have a window open that list will be on the top right.) If you rename your jupyter kernel it makes it
much easier to find the kernel associated with this conda environment. The default name is something like Python 3 which is not helpful if you have lots of them!

a. Find where your kernel is installed

This command shows your jupyter kernels

jupyter kernelspec list

You’ll see your kernel(s) and the locations of them. In the location listed there is a file called kernel.json.

b. Edit that file:

vi $HOME/miniconda3/envs/gpu_notebook/share/jupyter/kernels/python3/kernel.json

c. Change the name to be meaningful

Change the value associated with the display_name key. Set it to something meaningful so you can find it in your browser!

4. Set up the XLA_FLAGS environment variable.

This was essential for me to get tensorflow working. There is a directory somewhere in your conda environment with the libdevice library that is needed. For my installation that was in nvvm/libdevice/libdevice.10.bc. Of course you can find yours with:

find ~/miniconda3/ -name libdevice

You want to set the XLA_FLAGS variable to point to the base of the nvvm folder. This command sets it inside the conda environment so it is always set when the conda environment is activated, and unset when it is deactivated.

conda env config vars set XLA_FLAGS=--xla_gpu_cuda_data_dir=$HOME/miniconda3/envs/gpu_notebook

5. Activate the environment

Don’t forget to submit this to a node with GPU capabilities!

statsmodels.mixedlm Singular Matrix error

When building linear mixed models with Python’s statsmodules module, I repeatedly, and often incoherently, ran into np.linalg.LinAlgError errors that are Singular matrix errors.

There are a couple of things to check for with these errors:

First, drop any rows where there are NaN values for the predictors:

e.g. if your predictors are in a list called predictors, try this

df= df.dropna(subset=predictors)

Second, remove any columns whose sum is zero:

to_drop = list(df.loc[:,df.sum(axis=0) <1].columns)
df.drop(columns=to_drop)

Third, now that you have dropped columns, make sure they are still in your predictors. Something like

updated_predictors = list(set(predictors).intersection(set(df.columns)))

Finally, when all that doesn’t work, you should try different methods to fit the model. These are the methods I currently use, and I try them in this order and save the results for the first one that completes.

results = None
for meth in 'bfgs', 'lbfgs', 'cg', 'powell', 'nm':
    try:
        result = model.fit(method=meth)
        print(f"Method {meth} PASSED", file=sys.stderr)
        break
    except np.linalg.LinAlgError as e:
        print(f"Method {meth} failed", file=sys.stderr)
if results:
    print(results.summary)
node.js logo

Installing node.js and nvm on WSL

I needed an updated node.js on my WSL … for some reason ubuntu 20 has quite an old version.

First, as sudo, I removed all the old versions to avoid any conflicts:

apt purge nodejs
apt auto-remove

The purge command completely removes it, while the autoremove removes any dependencies. That should be optional, because those should work once we install the new node.js versions, but YMMV.

Next, install the node.js installer, called nvm for node version manager

To begin, set up your install location. For a single install it should be in ~/.nvm but for a site-wide install, you might want to install nvm in /usr/local/nvm.

Put these lines in your ~/.bashrc

export NVM_ROOT_DIR=$HOME/.nvm
# export NVM_ROOT_DIR=/usr/local/nvm ## << use this version if you are doing a site wide install
export NVM_DIR=$HOME/.nvm
[ -s "$NVM_ROOT_DIR/nvm.sh" ] && . "$NVM_ROOT_DIR/nvm.sh"
[ -s "$NVM_ROOT_DIR/bash_completion" ] && . "$NVM_ROOT_DIR/bash_completion"
NVM_BIN=$HOME/.nvm/versions/node/v23.3.0/bin ## << you may need to change this line
NVM_INC=$HOME/.nvm/versions/node/v23.3.0/include/node ## << you may need to change this line
export PATH=$NVM_DIR:$NVM_BIN:$PATH

Download nvm and get the files you want:

git clone https://github.com/nvm-sh/nvm.git
mkdir -p $NVM_ROOT_DIR
cp nvm/nvm.sh nvm/bash_completion $NVM_ROOT_DIR

Now, reload your bash settings

source ~/.bashrc

You should be able to run the nvm commands:

which nvm
nvm ls

To install node.js use nvm install

nvm install node

If you see this error:

wsl -bash:  node: cannot execute binary file: Exec format error

then try installing a different version

node install --lts 

For me, using the default node install node didn’t work because of the binary file Exec format error, while using node install --lts worked like a charm!

Installing snakemake error

If you are trying to install snakemake from conda like this:

 mamba create -c conda-forge -c bioconda -n snakemake'snakemake

and you get an error:

 Encountered problems while solving:
 - nothing provides snakemake-minimal 8.11.2.* needed by snakemake-8.11.2-hdfd78af_0

Then you need to switch the order of your conda channels.

Try this quickfix:

mamba create -c conda-forge -c bioconda -n snakemake 'snakemake>8'

Migrating from snakemake 7 to snakemake 8

I finally bit the bullet and migrated from snakemake 7 to snakemake 8. The short answer its not too hard!

First, install the executor plugin for snakemake using mamba:

mamba install snakemake-executor-plugin-cluster-generic

Next, edit your config file and change the following:

If you have a cluster: section, change that to cluster-generic-submit-cmd:

Add the line:

executor: cluster-generic

You need to remove these lines if you have them:

cluster-status
use-conda
conda-frontend

Here is my current snakemake config file

# non-slurm settings

conda-prefix: ~/.config/snakemake/conda/

# slurm settings

jobs: 600

executor: cluster-generic
cluster-generic-submit-cmd:
  mkdir -p logs_slurm/{rule} &amp;&amp;
  sbatch
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --output=logs_slurm/{rule}/{jobid}.out
    --error=logs_slurm/{rule}/{jobid}.err
    --job-name=smk-{rule}
    --time={resources.runtime}
    --parsable

default-resources:
  - mem_mb=2000
  - runtime=7200
  - load_superfocus=0
  - load_kraken=0
  - load_onehundred=0

resources: [load_superfocus=100, load_kraken=100, load_onehundred=100]
local-cores: 32
latency-wait: 60
shadow-prefix: /scratch/user/edwa0468
keep-going: False
max-jobs-per-second: 20
max-status-checks-per-second: 10
scheduler: greedy

restart-times: 1

Global Phage Host Prediction Consortium

White Paper

Robert Edwards, Flinders University

September 2023

For comment

Please contact Rob Edwards (robert.edwards@flinders.edu.au)

Executive Summary

Infectious diseases are already one of the leading causes of mortality, and the rapid spread of antibiotic resistant bacteria is worsening this threat. Phage therapy is the most promising treatment to tackle the global threat of antimicrobial-resistant bacteria. However, numerous hurdles prevent its widespread implementation. The major benefit of phages is that they are highly specific, infecting only one or a few kinds of bacteria. However, this specificity defines one of the major hurdles: identifying, and accessing, the phages that might treat a patient. Currently, we have to ship the patient’s bacteria around the world to identify phages that might treat it, and then ship candidate phages back to where the treatment is needed. However, we are on the cusp of being able to choose appropriate therapeutic phages based on genome sequencing alone. This will open a new area for phage therapy: phages on demand. As Jean-Paul Pirnay describes it, PhageBeam to send phages via the internet. We should be able to sequence a patient’s bacterial isolate and use machine learning to identify the phage sequences that are most likely to infect that bacterium. Immediately, this would remove one step: the candidate phages could be retrieved and tested. In the long term, those phages could be synthesised locally and tested on the patient’s infection much faster than is currently possible. This project will enable scientists in less developed countries, will discover fundamentals of phage biology, and will save lives by reducing the threat of AMR.

Current State of the Art

Phages are being used worldwide to treat antimicrobial-resistant bacterial infections. However, successfully using a phage to kill bacteria depends on numerous factors: recognition of capsular polysaccharides (CPS) and lipopolysaccharides (LPS) on the cell surface; recognition of the bacterial receptor by the phage receptor binding protein; DNA translocation from the phage into the cell, and evasion of CRISPR/Cas and restriction/modification systems; eluding the bacterial defence mechanisms; replication; packing; and cell lysis. In the millions of years of evolution between bacteria and phages, they have both developed an arsenal to stop the other, and weapons to stop that arsenal. We know many of the actors that drive this warfare: there are approximately 200 bacterial defence mechanisms described so far, and almost as many anti-defence mechanisms. All of this makes it impossible, a priori, to predict which phage will infect which bacteria, based on the genome sequences alone.

What the field needs

That prediction is not impossible! Given enough sequenced phages and enough sequenced bacteria, we can build machine learning algorithms to predict which phages successfully kill which bacteria, and importantly, those algorithms will teach us why phages can, or cannot, kill a bacterium. From this, we can learn new bacterial defence mechanisms (e.g., CRISPR/Cas), new phage attack mechanisms, the role of prophages in regulating these interactions, the nature of the bacterial-CPS and bacterial-LPS interactions, and more about fundamental phage biology. We can use this information to select a suitable phage to treat a patient’s bacterial infection, identify where that phage is in the world, and speed up finding phages to cure infections.

The unknowns, which we can’t answer right now, are how may phages is enough, and how many bacteria is enough? We can’t answer these questions because we don’t have the underlying datasets needed to tackle this important challenge. This project will deliver those datasets.

How can we deliver enough data to create accurate machine learning algorithms?

We need to generate a massive dataset that covers the genomes of phages, the genomes of the bacteria, and the efficiency of plating of each phage on each bacterium (i.e., how many plaques does it form?). Although this sounds like an insurmountable challenge, the hardest part of this has already been done by labs all over the world: they have identified thousands of phages and compared them to thousands of bacteria because this is a cheap and easy experiment that only requires a few laboratory supplies. Almost every phage lab in the world does this experiment: They isolate a new phage from the environment, test it against all the bacteria in their collections and generate a matrix where the rows are bacterial species, the columns are phage species, and the cell contents are a number representing the efficiency of plating. If we can sequence the bacteria and phages that constitute those tables, and capture the sequence data, EOP data, and other data that is generated we can build machine learning algorithms that will predict phage infectivity.

We propose to distribute Oxford Nanopore Mk1C DNA sequencers to phage labs all over the world and have them sequence the bacteria and the phages that they are isolating. We will ask that they provide that data to our central database, and we will encourage them to publish their sequences and analyses of their results and make their data publicly available (e.g., through the NCBI, ENA, and DDBJ). We specifically propose the Mk1C because it will enable researchers in less developed countries to become engaged in this research experience. The Mk1C is a standalone unit that does not require internet access which has been a limiting factor in many countries (e.g., India, Africa) to using the Mk1B MinION

We estimate that one Nanopore Mk1C sequencing run can be used to sequence between 15 and 46 bacterial genomes together with 50 phage genomes.

We will build the computational infrastructure to allow the collaborators to upload their phage and bacterial genomes, the efficiency of plating metrics, and other data they measure (e.g., one-step growth curves, synograms, etc.) and to integrate them into a publicly available resource. All sequences will be annotated and shared using the RAST database at Argonne National Labs/ the University of Chicago, to ensure consistent and accurate annotations. RAST is the most widely used phage and bacterial genome annotation resource. We will complement those analyses with local analyses using new computational tools we will develop for this process.

Estimated Budget

For each sample, we estimate that the budget will be (prices are current in AUD):

  • $845 – 50 Bacterial DNA extraction
  • $1,289 – 50 Phage DNA extractions
  • $699 – V14 barcoding kit for 96 samples
  • $900 – for one V14 flow cell
  • $218 – 100 assays Qubit

Total: $3,951 (approximately US$2,500 // €2,400)
Therefore, with ~€100,000 we will provide approximately 40 sequencing kits globally, and sequence approximately 2,000 bacterial and phage genomes.