
Search SRA Metadata in the Cloud

This is another in our series of posts on searching things in the SRA. As we noted previously, NCBI has moved most of the SRA into the clouds, which makes searching more convenient.

Let’s search through the SRA Metadata attributes (fields)

We can log into Google Cloud (but you can also use AWS or Azure if you wish) and run a search with BigQuery.

Note that the BigQuery searches all use standard SQL, and there are some videos online explaining SQL if you need a refresher. For example, this query pulls all the metadata for a single run accession:

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc = "DRR071086";

Suppose we have a whole list of IDs to search for?

We can extend this approach using WHERE IN like this:

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc in ("ERR10082948", "ERR10082949", "ERR10082950", "ERR10082951", "ERR10082952", "ERR10082953", "ERR10082954", "ERR10082955", "ERR10082956", "ERR10082957", "ERR10082959", "ERR10082960", "ERR10082961", "ERR10082963", "ERR10082964", "ERR10082965", "ERR10082966", "ERR10082967", "ERR10082968", "ERR10082970", "ERR10082971", "ERR10082972", "ERR10082973", "ERR10082974", "ERR10082975", "ERR10082976", "ERR10082977", "ERR10082978", "ERR10082979", "ERR10082980", "ERR10082981", "ERR10082982", "ERR10082983", "ERR10082984", "ERR10082985", "ERR10082986", "ERR10082987", "ERR10082989", "ERR10082990", "ERR10082991", "ERR10082992", "ERR10082993", "ERR10082994", "ERR10082995", "ERR10082996", "ERR10082997", "ERR10083000", "ERR10083002", "ERR10083003", "ERR10083004", "ERR10083005", "ERR10083006", "ERR10083008", "ERR10083009", "ERR10083010", "ERR10083011", "ERR10083013", "ERR10083015", "ERR10083016", "ERR10083017", "ERR10083018", "ERR10083020", "ERR10083021", "ERR10083022", "ERR10083023", "ERR10083024", "ERR10083025", "ERR10083026", "ERR10083027", "ERR10083028", "ERR10083029", "ERR10083030", "ERR10083031", "ERR10083033", "ERR10083034", "ERR10083035", "ERR10083036", "ERR10083037", "ERR10083038", "ERR10083039", "ERR10083043", "ERR10083044", "ERR10083046", "ERR10083047", "ERR10083048", "ERR10083049", "ERR10083050", "ERR10083051", "ERR10083054", "ERR10083055", "ERR10083056", "ERR10083057", "ERR10083058", "ERR10083059", "ERR10083060", "ERR10083061", "ERR10083062", "ERR10083063", "ERR10083064", "ERR10083065", "SRR21081047", "SRR21081048", "SRR21081049", "SRR21081050", "SRR21081051", "SRR21081052", "SRR21081053", "SRR21081054", "SRR21081055", "SRR21081056", "SRR21081057", "SRR21081058", "SRR21081059");

We can search the NCBI k-mer based taxonomy profiles too:

SELECT *
FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis`
WHERE acc = 'SRR21081434'
ORDER BY ileft, ilevel;

And finally, we can search for records by the presence of crAssphage (taxonomy ID: 1211417):

SELECT m.acc, m.sample_acc, m.biosample, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc = tax.acc AND tax_id = 1211417
ORDER BY m.bioproject, m.sra_study, m.biosample, m.sample_acc
LIMIT 100

(Note that without the LIMIT, this will return 140,864 records!)
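If you just want that count rather than the records themselves, you can run a COUNT over the same join (a minimal variant of the query above):

SELECT COUNT(*) AS n
FROM `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc = tax.acc AND tax_id = 1211417;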

But I have thousands of Accessions! What do I do?

OK, let’s take it to the next level!

We are going to create a new table in our BigQuery project.

Click on your project name and choose CREATE DATA SET.

This opens a side bar where you can give your data set a name! I called my data set sra_searches, and because I am using the NCBI SRA data, I set the location to us (multiple regions) (although the SRA data is probably all in Iowa).

Now expand your data set, click on the three dots, and choose Create Table.

Now we need to fill in four things:

  1. Choose Upload in the Create Table From menu.
  2. Choose your file. Use CSV even if you have a single-column list of accessions!
  3. Give your table a name, e.g. accessions_20220903 (because later I will have other accessions to search!).
  4. Check the Auto-detect table schema option (it works well; I don't know why it is not the default!).

This will create your table, and open it in the browser!

Now click on the data field that has the accessions, and the query box will open to allow you to query them:
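You should end up with something like this. Note that my-project is a placeholder for your own project ID, and string_field_0 is the column name BigQuery auto-detects for a headerless single-column CSV; substitute whatever your table actually shows:

SELECT string_field_0
FROM `my-project.sra_searches.accessions_20220903`
LIMIT 10;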

Run that search and make sure it works!
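As an aside, if your accession list is already sitting in a Google Cloud Storage bucket, you can skip the upload dialog and load it with SQL using BigQuery's LOAD DATA statement. This is just a sketch: the bucket, file, and table names are placeholders.

LOAD DATA INTO `my-project.sra_searches.accessions_20220903`
FROM FILES (
    format = 'CSV',
    uris = ['gs://my-bucket/accessions.csv']
);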

Subselects to the rescue

Now we can just use a sub-select to get all the data.
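For example (again, my-project and string_field_0 are placeholders for your own project ID and auto-detected column name):

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc IN (
    SELECT string_field_0
    FROM `my-project.sra_searches.accessions_20220903`
);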

I recommend

I recommend doing this on a part of your data first to make sure that it works, rather than doing it on all the data!!
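One easy way to do that is to put a LIMIT inside the sub-select so you only pull a handful of accessions:

SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc IN (
    SELECT string_field_0
    FROM `my-project.sra_searches.accessions_20220903`
    LIMIT 10
);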

Finally, I recommend that you export the results to Google Drive as JSON.

Phage genome annotation for therapeutic phages

Annotating phage for therapeutics

It is always therapeutic to annotate phages, but in this instance, we are specifically thinking about how we annotate phages so that we can use them for therapeutic purposes.

Rob is giving a talk at ESCMID entitled “The annotation of therapeutic phages” where he discusses some of the issues that come up. This blog post accompanies that talk and provides links to some of the papers that he discusses.

Host lifestyle prediction

It is generally accepted that lysogeny is bad for therapeutic phages: it can lead to superinfection exclusion, recombination with other phages, and the development of phage resistance. But there are ways around it.

Here are some tools that can be used to predict whether phages are lytic or lysogenic

We can overcome lysogeny, either through engineering phages or through Gibson Assembly based on prophage sequences. These two papers suggest some cutting-edge approaches to making that happen!

Toxins

Phages encode a lot of toxins! They help the bacteria replicate and escape a nasty death, and provide a mechanism for the spread of the phage.

In Streptococcus, the presence of toxins helps the bacteria spread, and we know phages control bacterial virulence.

Antibiotic resistance

Obviously it would be bad if the phage encoded an antibiotic resistance cassette, and there is some evidence that they do occasionally.

Many phages, especially lytic ones, may not care about antibiotic resistance since they are going to kill the host anyway, so the jury is still out on how important antibiotic resistance genes in phages really are.

Databases

Nonetheless, because of the overall importance of antibiotic resistance in bacterial genomes (which, after all, is the reason we are here), there are lots of databases that you can use to search for different antibiotic resistance genes.

Ensemble approaches for therapeutic phages

New ways of identifying phages that have the potential for therapy are starting to emerge, and these are some of the ensemble tools that are trying to integrate multiple lines of evidence and provide support for phages for therapy.

AlphaFold of all Phage Lambda Proteins

DeepMind’s AlphaFold is winning at predicting tertiary structures from primary amino acid sequences. We thought it would be fun to investigate how it performed on phage Lambda.

We took the NCBI version of λ and extracted all the proteins, and then ran them through AlphaFold. It was able to make a prediction for every protein except three: NP_040594.1 (144 amino acids), NP_040597.1 (232 amino acids), and NP_040645.1 (158 amino acids).


Many of the structures are just predicted to be long alpha helices with little order, but some are complex and more closely resemble the expected structures.

There are, of course, a heap of caveats to this analysis, including the fact that we did not (at this time) filter out any of the existing phage λ structures, so one would hope that those predictions are really good!

You can download all the best-ranked structures for phage Lambda so you can view them in your favorite structure viewer.

Fast correlations with turbocor

We often want to calculate Pearson correlations between different datasets; for example, we have used them to identify the hosts of different phages. Often we want to calculate correlations on really large matrices, and so our usual solution is to use crappy code and be patient!
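For reference, the Pearson correlation between two vectors x and y is

r_{xy} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}

and a pairwise analysis computes this for every pair of vectors in the matrix, which is why the naive approach gets slow so quickly as matrices grow.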

However, Daniel Jones recently released turbocor, a fast, Rust-based implementation of pairwise Pearson correlation, and so we were intrigued to work with it. Here is a brief guide to making correlations using turbocor.


minimap2 hints

Here are some tips and tricks for minimap2 that I keep forgetting!

--split-prefix

If you have a large (>4 GB) multisequence index file, there are two options.

The first is to increase the value of -I when you build the index (preferred) so that the whole index is kept in memory. Note: this must be done when you build the index; you can't build the index and then change -I at runtime.

The second is to use --split-prefix with a string. For snakemake, there are two options:

  1. You can use "{sample}" as your prefix like so:
params:
    prfx = "{sample}"
...
shell:
    """
         minimap2 --split-prefix {params.prfx} ...
    """

  2. You can use a random 6-character string like so:

import random, string

params:
    # a random 6-character alphanumeric prefix (evaluated once, when the Snakefile is parsed)
    pfx = ''.join(random.choices(string.ascii_uppercase + string.digits, k=6))
...
shell:
    """
    minimap2 --split-prefix {params.pfx} ...
    """

The trick here is that things will probably break if your index file is small. If you see the error [W::sam_hdr_create] Duplicated sequence, it is probably because you used --split-prefix on a small index, and the sequence IDs are being duplicated. Remove the --split-prefix option and you should be good.