Global Phage Host Prediction Consortium

White Paper

Robert Edwards, Flinders University

September 2023

For comment

Please contact Rob Edwards (robert.edwards@flinders.edu.au)

Executive Summary

Infectious diseases are already one of the leading causes of mortality, and the rapid spread of antibiotic resistant bacteria is worsening this threat. Phage therapy is the most promising treatment to tackle the global threat of antimicrobial-resistant bacteria. However, numerous hurdles prevent its widespread implementation. The major benefit of phages is that they are highly specific, infecting only one or a few kinds of bacteria. However, this specificity defines one of the major hurdles: identifying, and accessing, the phages that might treat a patient. Currently, we have to ship the patient’s bacteria around the world to identify phages that might treat it, and then ship candidate phages back to where the treatment is needed. However, we are on the cusp of being able to choose appropriate therapeutic phages based on genome sequencing alone. This will open a new area for phage therapy: phages on demand. As Jean-Paul Pirnay describes it, PhageBeam to send phages via the internet. We should be able to sequence a patient’s bacterial isolate and use machine learning to identify the phage sequences that are most likely to infect that bacterium. Immediately, this would remove one step: the candidate phages could be retrieved and tested. In the long term, those phages could be synthesised locally and tested on the patient’s infection much faster than is currently possible. This project will enable scientists in less developed countries, will discover fundamentals of phage biology, and will save lives by reducing the threat of AMR.

Current State of the Art

Phages are being used worldwide to treat antimicrobial-resistant bacterial infections. However, successfully using a phage to kill bacteria depends on numerous factors: recognition of capsular polysaccharides (CPS) and lipopolysaccharides (LPS) on the cell surface; recognition of the bacterial receptor by the phage receptor binding protein; DNA translocation from the phage into the cell, and evasion of CRISPR/Cas and restriction/modification systems; eluding the bacterial defence mechanisms; replication; packing; and cell lysis. In the millions of years of evolution between bacteria and phages, they have both developed an arsenal to stop the other, and weapons to stop that arsenal. We know many of the actors that drive this warfare: there are approximately 200 bacterial defence mechanisms described so far, and almost as many anti-defence mechanisms. All of this makes it impossible, a priori, to predict which phage will infect which bacteria, based on the genome sequences alone.

What the field needs

That prediction is not impossible! Given enough sequenced phages and enough sequenced bacteria, we can build machine learning algorithms to predict which phages successfully kill which bacteria, and importantly, those algorithms will teach us why phages can, or cannot, kill a bacterium. From this, we can learn new bacterial defence mechanisms (e.g., CRISPR/Cas), new phage attack mechanisms, the role of prophages in regulating these interactions, the nature of the bacterial-CPS and bacterial-LPS interactions, and more about fundamental phage biology. We can use this information to select a suitable phage to treat a patient’s bacterial infection, identify where that phage is in the world, and speed up finding phages to cure infections.

The unknowns, which we can’t answer right now, are how may phages is enough, and how many bacteria is enough? We can’t answer these questions because we don’t have the underlying datasets needed to tackle this important challenge. This project will deliver those datasets.

How can we deliver enough data to create accurate machine learning algorithms?

We need to generate a massive dataset that covers the genomes of phages, the genomes of the bacteria, and the efficiency of plating of each phage on each bacterium (i.e., how many plaques does it form?). Although this sounds like an insurmountable challenge, the hardest part of this has already been done by labs all over the world: they have identified thousands of phages and compared them to thousands of bacteria because this is a cheap and easy experiment that only requires a few laboratory supplies. Almost every phage lab in the world does this experiment: They isolate a new phage from the environment, test it against all the bacteria in their collections and generate a matrix where the rows are bacterial species, the columns are phage species, and the cell contents are a number representing the efficiency of plating. If we can sequence the bacteria and phages that constitute those tables, and capture the sequence data, EOP data, and other data that is generated we can build machine learning algorithms that will predict phage infectivity.

We propose to distribute Oxford Nanopore Mk1C DNA sequencers to phage labs all over the world and have them sequence the bacteria and the phages that they are isolating. We will ask that they provide that data to our central database, and we will encourage them to publish their sequences and analyses of their results and make their data publicly available (e.g., through the NCBI, ENA, and DDBJ). We specifically propose the Mk1C because it will enable researchers in less developed countries to become engaged in this research experience. The Mk1C is a standalone unit that does not require internet access which has been a limiting factor in many countries (e.g., India, Africa) to using the Mk1B MinION

We estimate that one Nanopore Mk1C sequencing run can be used to sequence between 15 and 46 bacterial genomes together with 50 phage genomes.

We will build the computational infrastructure to allow the collaborators to upload their phage and bacterial genomes, the efficiency of plating metrics, and other data they measure (e.g., one-step growth curves, synograms, etc.) and to integrate them into a publicly available resource. All sequences will be annotated and shared using the RAST database at Argonne National Labs/ the University of Chicago, to ensure consistent and accurate annotations. RAST is the most widely used phage and bacterial genome annotation resource. We will complement those analyses with local analyses using new computational tools we will develop for this process.

Estimated Budget

For each sample, we estimate that the budget will be (prices are current in AUD):

  • $845 – 50 Bacterial DNA extraction
  • $1,289 – 50 Phage DNA extractions
  • $699 – V14 barcoding kit for 96 samples
  • $900 – for one V14 flow cell
  • $218 – 100 assays Qubit

Total: $3,951 (approximately US$2,500 // €2,400)
Therefore, with ~€100,000 we will provide approximately 40 sequencing kits globally, and sequence approximately 2,000 bacterial and phage genomes.

bash

Human Readable Numbers

There is an easy convenience to using human readable numbers. Instead of a number like 1,099,511,627,776 you can use 1T. Instead of 1,073,741,824 you can use 1G and instead of 1,024 you can use 1K (that 1K = 1,024 is why the other numbers don’t end with multiple zeros).

But how do you do some (simple) math with human readable numbers, like adding up a list?

This is where numfmt comes to your aid.

For example, lets make a list of numbers:

259G
1.1G
692G
5.5G
5.3G
140M
30G
302G
222G
281M
1.9G
60G
2.2T

If we put those in a file called sizes.txt we can sum them with a simple command like this:

cat sizes.txt | numfmt --from=iec | awk 's+=$1 {} END {print s}'  | numfmt --to=iec

The --from=iec converts the numbers from human readable format to numbers, the awk adds the numbers, and then the second numfmt converts the sum back to a human readable number,

phyloseq logo

Converting phyloseq objects to read in other languages

Phyloseq is an R package for microbiome analysis that incorporates several data types.

Occassionally our colleagues share a phyloseq object with as an .rds file (R Data Serialization format). It is quite simple to convert that for use in other languages (e.g. python or even Excel!)

Converting the data to .tsv format

This approach requires an R installation somewhere, but we don’t need many commands, so you can probably use a remote R installation on a server!

If you have not yet installed phyloseq, you can do so with bioconductor:

if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")

Next, we load the phyloseq package and read the .RDS file:

library("phyloseq");
packageVersion("phyloseq"); # check the version we are using
# you may need to use setwd("C:/Users/username/Downloads") to move to whereever you downloaded the file!
p <- readRDS("phyloseq.rds"); # change the filename here! 
print(p)

This will print typical output from a phyloseq object like:

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3210 taxa and 11 samples ]
sample_data() Sample Data:       [ 11 samples by 12 sample variables ]
tax_table()   Taxonomy Table:    [ 3210 taxa by 7 taxonomic ranks ]

These are our base phyloseq objects, and we can explore them:

print(otu_table(p))
print(sample_data(p))

And we can also write them to tab separated text in a .tsv file:

write.table(otu_table(p), "p_otu.tsv", sep="\t")
write.table(sample_data(p), "p_sample.tsv", sep="\t")
write.table(tax_table(p), "p_tax.tsv", sep="\t")

Read those files into Python

You can now use pandas to read those files into Python:

import pandas as pd

otu = pd.read_csv("p_otu.tsv", sep="\t")
otu

# sometimes the sample metadata has characters that can't be read using `utf-8` so we have to use `latin-1`
samples = pd.read_csv("p_sample.tsv", sep="\t", encoding='latin-1')
samples

tax = pd.read_csv("p_tax.tsv", sep="\t")
tax
git logo

Git change remote from https to ssh

Here’s how to change your remote from https to ssh.

Start by checking your current remote:

git remote -v

and this will list your remotes for both push and fetch.

To set a remote, you can use

git remote set-url origin git@github.com:username/project.git

and then use remote again to confirm:

git remote -v

now your git pull and git push will work as you expect!

bash

Use column to display tsv files in columns

If you have a tab separated file and view it in a pager like less or more, the columns never line up. Here is a simple way to make those columns appear correct.

column -t file.tsv

For example, here is a file with three columns of words, displayed with cat

If we pass that to column with the -t option to detect the columns, we get nicely organised columns:

However, note that this is not exactly correct, notice that “Paradigm shift” has been split into two columns because the -t option uses whitespace by default, so to display the columns using tabs, we need to add a -s option:

column -t -s$'\t'

bash

awk use tab as input separator

By default, awk uses any whitespace to separate the input text. You can use the same construct as perl to change the input Field separator:

cat file.tsv | awk -F"\t" '!s[$1]'

The above example will split a file on tabs and then print the unique entries in the first column.

SQLite banner image

Automatically create a SQLlite database from a gzip file

This is a neat trick in a couple of steps to create a SQLite database from a gzip file.

First, make sure your text file is tab-separated, and has a header row that you would like to use as the table columns.

First, we make a FIFO object and stream the gzip file to that. Note that we background the zcat command here.

mkfifo tempfile
zcat sprot.uniprot.proc.gz > tempfile &

This doesn’t actually uncompress your file, it just sets a 0 byte file object that will read the uncompressed file.

Now, we open SQLIte3 with the name of our (new or existing) database.

sqlite3 db.sqlite

And then set tabs mode for the import and read the tempfile filehandle into a new table.

.mode tabs
.import tempfile newtable

SQLite3 will automatically create the table with column names the same as the headers, and the data will be loaded.

You can check by looking at the .schema

Adapter trimming

“plus ça change, plus c’est la même chose “

Jean-Baptiste Alphonse Karr

So its 2023 and the new [sequencing] kid on the block, MGI, hasn’t figured out adapter trimming.

Here is a quick primer [pun intended] on the easiest way to remove primers and filter reads.

Use fastp and give it this file of Illumina adapters to trim against.

I use this command to filter and remove adapters from our sequences

mkdir output
fastp -n 1 -l 100 -i fastq/$R1 -I fastq/$R2 -o output/$R1 -O output/$R2 --adapter_fasta IlluminaAdapters.fa

If you have a lot of files, you can easily wrap this in a for loop to process all the files in a directory

mkdir fastq_fastp
for R1 in $(find fastq -name \*R1\* -printf "%f\n"); do
    R2=${R1/R1/R2}
    fastp -n 1 -l 100 -i fastq/$R1 -I fastq/$R2 -o fastq_fastp/$R1 -O fastq_fastp/$R2 --adapter_fasta IlluminaAdapters.fa 
done

In addition to trimming all adapter sequences, this will remove any sequence with an N (yes, one or more), and remove any sequence shorter than 100 bp. fastp will also make sure that your reads are paired as well!

git logo

Checking out a remote git branch

If you are working on git, sometimes you create a branch on one machine, and then want to get that on your remote machine.

Step 1. Create a branch on the remote machine called rob

git checkout -b rob

Do some work, make some changes, commit some code.

Now I am working on a remote machine in the same repo, but I want my rob branch. Easy!

Git fetch just makes sure we are upto date.

Git branch -v -a lists all remote branches, and provides more information about each. You should see the branch listed as remotes/origin/rob but you don’t need the remotes/origin part.

Git switch allows you to move into the existing remote branch

git fetch
git branch -v -a
git switch rob