Just because I can never remember where this list is, here are the key bash file tests:
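For reference, these are the standard bash test operators I reach for most often; this is a quick sketch (the file names are just placeholders), not an exhaustive list:
if [[ -e some_file ]]; then echo "exists"; fi          # file exists
if [[ -f some_file ]]; then echo "regular file"; fi    # regular file (not a directory or device)
if [[ -d some_dir ]]; then echo "directory"; fi        # directory
if [[ -s some_file ]]; then echo "not empty"; fi       # exists and has a size greater than zero
if [[ -r some_file ]]; then echo "readable"; fi        # readable
if [[ -w some_file ]]; then echo "writable"; fi        # writable
if [[ -x some_file ]]; then echo "executable"; fi      # executable
if [[ -L some_link ]]; then echo "symlink"; fi         # symbolic link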

This code checks the number of arguments supplied to a bash script:
if [[ $# -eq 0 ]]; then echo "No arguments supplied"; exit; fi
If there are no arguments, the script probably has nothing to do, so we just exit.
Remember the arguments are:
$0 - the name of the script
$1 - the first argument
$2 - the second argument
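Putting the argument check and a file test together, here is a minimal sketch of the pattern I usually use (the script and file names are just placeholders):
#!/bin/bash
# exit with a usage message if no arguments were supplied
if [[ $# -eq 0 ]]; then echo "Usage: $0 <input file>"; exit 1; fi
# complain if the first argument is not a file that exists
if [[ ! -e "$1" ]]; then echo "$1 does not exist" >&2; exit 1; fi
echo "Processing $1"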
We have a computational problem searching the Non-Redundant database, but we can solve that!
If you use MMSeqs2 to search the NR database, it needs about 1.75 TB of RAM (that is approximate!). Our Flinders deepthought cluster has several nodes with 2.5 TB RAM, so at most you can run three concurrent jobs searching the database. However, database searches are completely independent: the scores you get for one pair of sequences are independent of all the other sequences in the database.
Note: That is true, but there is a caveat. The E-value of the search is dependent on the length of the query and the size of the database.
Therefore, the E-value that is reported is only approximate. There is a GitHub issue open with MMSeqs2 to resolve this.
Our solution is to split the database into smaller pieces, and then we can use those across more nodes of the cluster.
Step 1, download the preformatted NR database using mmseqs2
mkdir --parents NR
mmseqs databases --threads 8 NR NR/NR $(mktemp -d)
This will download the non-redundant database into the directory NR, and the database will be called NR.
Let’s split that database into many smaller chunks. From my tests, it appears that 50 chunks requires about 50-60 GB RAM per search job, and 100 chunks requires about 25 GB RAM per search job.
CHUNKS=100
mkdir NR.split.$CHUNKS
cd NR.split.$CHUNKS
# get the ID of every sequence in the database
cut -f 1 ../NR/NR.index > ids.txt
# split the IDs into $CHUNKS roughly equal files called x001, x002, ..., x100
split -n l/$CHUNKS --numeric-suffixes=1 --suffix-length=3 ids.txt
for N in $(seq 1 $CHUNKS); do
    echo $N;
    # convert the chunk number into the zero-padded file name made by split (e.g. 1 -> x001)
    X="000$N";
    X="x${X:(-3)}";
    mkdir NR.$N;
    # make a sub-database (sequences and headers) for this chunk
    mmseqs createsubdb $X ../NR/NR NR.$N/NR;
    mmseqs createsubdb $X ../NR/NR_h NR.$N/NR_h;
done
# clean up the split ID files
rm -f x???
(Hint: this works quite well as a slurm script!)
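As a quick sanity check (just a sketch; the .index files are the easiest thing to count), you can confirm that every chunk was created and that the per-chunk entry counts add up to the original database:
# one index file per chunk, and the counts should sum to the number of entries in the original NR.index
wc -l NR.*/NR.index
cat NR.*/NR.index | wc -l
wc -l ../NR/NR.index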
There is now one directory called NR.split.100, and within that there are 100 directories called NR.1, NR.2, NR.3, … NR.100.
We can now use a slurm array job to search against all of those:
#!/bin/bash
###############################################################
#                                                             #
# Search the NR using an array job.                           #
#                                                             #
# submit with this command:                                   #
#                                                             #
# sbatch --array=1-100:1 ./nr_chunks.slurm                    #
#                                                             #
# Written by Rob Edwards. Good luck!                          #
#                                                             #
###############################################################
#SBATCH --job-name=NRmmseqs
#SBATCH --time=5-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=24
#SBATCH --mem=25G
#SBATCH -o mmseqsNRA_%a.out
#SBATCH -e mmseqsNRA_%a.err
## Memory requirement: for 100 chunks, 25 GB (average from 72 searches: 16.39 GB, range: 12-20 GB)
set -euo pipefail
# change this to point to the output from `mmseqs createdb` on your data:
QUERY='mmseqs_formatted_data.db'
NR=$SLURM_ARRAY_TASK_ID
CHUNKS=100
mmseqs search --threads 24 $QUERY NR.split.$CHUNKS/NR.$NR/NR $QUERY.nr.$CHUNKS.$NR.db $(mktemp -d)
mmseqs convertalis --threads 24 $QUERY NR.split.$CHUNKS/NR.$NR/NR $QUERY.nr.$CHUNKS.$NR.db $QUERY.nr.$CHUNKS.$NR.m8
You can save this file as nr_chunks.slurm, edit the line with the QUERY location, and submit 100 MMSeqs2 jobs as an array job (using sbatch --array=1-100:1 ./nr_chunks.slurm). You now have 2,400 processors computing on your search, and each set of 24 processors is only consuming 25 GB RAM.
Note: I recommend that you adjust the number of threads; I am not sure that 24 is optimal!
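For completeness, here is a sketch of the two steps that sit either side of the array job: formatting your query proteins with mmseqs createdb before you submit (the fasta file name is just a placeholder), and concatenating the per-chunk tabular outputs once all 100 jobs have finished. Remember that the E-values are computed per chunk, as noted above.
# format the query proteins once, before submitting the array job
mmseqs createdb my_proteins.faa mmseqs_formatted_data.db
# after all the array jobs finish, combine the per-chunk results into one table
cat mmseqs_formatted_data.db.nr.100.*.m8 > mmseqs_formatted_data.db.nr.all.m8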
We have several projects that look for all the metagenomes in the cloud, and we have several ways of searching the SRA. Here, we’ll search for all the WGS metagenomes in the SRA using a Google BigQuery query.
You’ll need to log into Google console and access a project or create a new one.
We use temporary tables to store the two main searches: what are amplicon projects and what are metagenome/microbiome/metatranscriptome projects, and then we find the projects that are metagenomes:
create temp table AMPLICON(acc STRING) as select acc as amplicon from `nih-sra-datastore.sra.metadata` where assay_type = 'AMPLICON' or libraryselection = 'PCR';
create temp table METAGENOMES(acc STRING) as select acc from `nih-sra-datastore.sra.metadata` where librarysource = "METAGENOMIC" or librarysource = 'METATRANSCRIPTOMIC' or organism like "%microbiom%" OR organism like "%metagenom%" or organism like '%metatran%';
select acc from METAGENOMES where acc not in (select acc from AMPLICON);
Then save that as a JSON file to Google Drive.
We use jq to parse the JSON file. This is probably overkill because we only have one attribute in our data.
jq -r '.acc' bq-results-20221006-054328-1665035790273.json > SRA-metagenomes.txt
We can edit the above SQL to get all the information about all the metagenomes. Basically, we just change the second select statement.
create temp table AMPLICON(acc STRING) as select acc as amplicon from `nih-sra-datastore.sra.metadata` where assay_type = 'AMPLICON' or libraryselection = 'PCR';
select * from `nih-sra-datastore.sra.metadata` where acc not in (select acc from AMPLICON) and (librarysource = "METAGENOMIC" or librarysource = 'METATRANSCRIPTOMIC' or organism like "%microbiom%" OR organism like "%metagenom%");
Note: In this query, the parentheses are important to make sure we do the and and the or in the right place.
Then you can export the data as a JSON Newline file to Google Drive.
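If you would rather have a flat table than JSON, a small jq sketch like this works on the newline-delimited export (the input and output file names are just placeholders; pick whichever columns you need):
# pull a few columns out of the newline-delimited JSON export into a TSV
jq -r '[.acc, .bioproject, .librarysource, .organism] | @tsv' bq-results.json > metagenome-metadata.tsv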
At the moment, this returns 642,842 runs from the SRA.
Note: the study_type field that we searched previously (using study_type = "Metagenomics") does not appear to have mapped to the BigQuery table.
Note: the scientific_name field (which we previously searched using sample.scientific_name like "%microbiom%" OR sample.scientific_name like "%metagenom%") does not appear to have mapped to the BigQuery table either.
This is another in our series of posts on searching things in the SRA. As we noted previously, NCBI has moved most of the SRA into the clouds, which makes searching more convenient.
Let’s search through the SRA metadata attributes (fields).
We can log into the Google Cloud console (but you can also use AWS/Azure if you wish) and run a BigQuery search.
Note that the BigQuery searches all use standard SQL, and here are some videos explaining SQL. For example, to pull all the metadata for a single run:
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc = "DRR071086";
Suppose we have a whole list of IDs to search?
We can extend this approach using WHERE IN, like this:
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc in ("ERR10082948", "ERR10082949", "ERR10082950", "ERR10082951", "ERR10082952", "ERR10082953", "ERR10082954", "ERR10082955", "ERR10082956", "ERR10082957", "ERR10082959", "ERR10082960", "ERR10082961", "ERR10082963", "ERR10082964", "ERR10082965", "ERR10082966", "ERR10082967", "ERR10082968", "ERR10082970", "ERR10082971", "ERR10082972", "ERR10082973", "ERR10082974", "ERR10082975", "ERR10082976", "ERR10082977", "ERR10082978", "ERR10082979", "ERR10082980", "ERR10082981", "ERR10082982", "ERR10082983", "ERR10082984", "ERR10082985", "ERR10082986", "ERR10082987", "ERR10082989", "ERR10082990", "ERR10082991", "ERR10082992", "ERR10082993", "ERR10082994", "ERR10082995", "ERR10082996", "ERR10082997", "ERR10083000", "ERR10083002", "ERR10083003", "ERR10083004", "ERR10083005", "ERR10083006", "ERR10083008", "ERR10083009", "ERR10083010", "ERR10083011", "ERR10083013", "ERR10083015", "ERR10083016", "ERR10083017", "ERR10083018", "ERR10083020", "ERR10083021", "ERR10083022", "ERR10083023", "ERR10083024", "ERR10083025", "ERR10083026", "ERR10083027", "ERR10083028", "ERR10083029", "ERR10083030", "ERR10083031", "ERR10083033", "ERR10083034", "ERR10083035", "ERR10083036", "ERR10083037", "ERR10083038", "ERR10083039", "ERR10083043", "ERR10083044", "ERR10083046", "ERR10083047", "ERR10083048", "ERR10083049", "ERR10083050", "ERR10083051", "ERR10083054", "ERR10083055", "ERR10083056", "ERR10083057", "ERR10083058", "ERR10083059", "ERR10083060", "ERR10083061", "ERR10083062", "ERR10083063", "ERR10083064", "ERR10083065", "SRR21081047", "SRR21081048", "SRR21081049", "SRR21081050", "SRR21081051", "SRR21081052", "SRR21081053", "SRR21081054", "SRR21081055", "SRR21081056", "SRR21081057", "SRR21081058", "SRR21081059");
We can search the NCBI k-mer based taxonomy profiles too:
SELECT * FROM `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` WHERE acc = 'SRR21081434' order by ileft, ilevel
And then finally, search for records by the presence of crAssphage (taxonomy ID: 1211417):
SELECT m.acc, m.sample_acc, m.biosample, m.sra_study, m.bioproject
FROM `nih-sra-datastore.sra.metadata` as m, `nih-sra-datastore.sra_tax_analysis_tool.tax_analysis` as tax
WHERE m.acc=tax.acc and tax_id=1211417
ORDER BY m.bioproject, m.sra_study, m.biosample, m.sample_acc
LIMIT 100
(Note: without the LIMIT, this will return 140,864 records!)
OK, let’s take it to the next level!
We are going to create a new table in our BigQuery database.
Click on your project name and choose CREATE DATA SET. This opens a sidebar where you can give your data set a name! I called mine sra_searches, and because I am using the NCBI SRA data, I chose the us (multiple regions) location (although the SRA data is probably all in Iowa).
Now, expand your data set, click on the three dots, and choose Create Table.
Now we need to fill in four things:
- choose Upload in the Create Table From menu
- choose csv as the file format, even if you have a single list of accessions!
- give the table a name; I called mine accessions_20220903 (because later I will have other accessions to search!)
- choose Auto-detect for the table format (it works well, I don’t know why it is not the default!)
This will create your table and open it in the browser!
Now click on the data field that has the accessions and the query box will open to allow you to query them:
Run that search and make sure it works!
Now we can just use a sub-select to get all the data.
I recommend doing this on a part of your data first to make sure that it works, rather than doing it on all the data!!
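As a sketch of what that sub-select looks like (the dataset and table names are the ones used above, and string_field_0 is an assumption about the column name BigQuery auto-detects for a header-less CSV, so check what your column is actually called in the browser):
SELECT *
FROM `nih-sra-datastore.sra.metadata`
WHERE acc IN (SELECT string_field_0 FROM `sra_searches.accessions_20220903`);
Add a LIMIT while you are testing, as suggested above.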
Finally, I recommend that you export this to Google Drive using JSON:
It is always therapeutic to annotate phages, but in this instance, we are specifically thinking about how we annotate phages so that we can use them for therapeutic purposes.
Rob is giving a talk at ESCMID entitled “The annotation of therapeutic phages” where he discusses some of the issues that come up. This blog post accompanies that talk and provides links to some of the papers that he discusses.
It is generally accepted that lysogeny is bad for therapeutic phages, but there are ways around it.
Lysogeny can lead to superinfection exclusion, recombination with other phages, and the development of phage resistance.
Here are some tools that can be used to predict whether phages are lytic or lysogenic
We can overcome lysogeny, either through engineering phages, or through Gibson Assembly based on prophage sequences. These two papers suggest some cutting edge approaches to making that happen!
Phages encode a lot of toxins! They help the bacteria replicate and escape a nasty death, and provide a mechanism for the spread of the phage.
In Streptococcus, the presence of toxins helps the bacteria spread, and we know phages control bacterial virulence.
Obviously it would be bad if the phage encoded an antibiotic resistance cassette, and there is some evidence that they do occasionally:
But the jury is still out on how important this is! Many phages, especially lytic phages, may not care about antibiotics since they are going to kill the host anyway. There is some debate as to the importance of antibiotic resistance genes in phages.
Nonetheless, because of the overall importance of antibiotic resistance in bacterial genomes (which, after all, is the reason we are here), there are lots of databases that you can use to search for different antibiotic resistance genes.
New ways of identifying phages that have the potential for therapy are starting to emerge, and these are some of the ensemble tools that are trying to integrate multiple lines of evidence and provide support for phages for therapy.
Just a quick note about installing CentOS Stream 9, because I know I am going to have to do this again!
DeepMind’s AlphaFold is winning at predicting tertiary structures from primary amino acid sequences. We thought it would be fun to investigate how it performed on phage Lambda.
We took the NCBI version of λ and extracted all the proteins, and then ran them through AlphaFold. It was able to make a prediction for all of the proteins except three: NP_040594.1 (144 amino acids), NP_040597.1 (232 amino acids), and NP_040645.1 (158 amino acids).
As you can see, many of the structures are just predicted to be long alpha helices with little order, but some of the structures are complex and much closer to what you would expect for a well-folded protein.
There are, of course, a heap of caveats to this analysis, including the fact that we did not (at this time) filter out any of the existing phage λ structures so one would hope that those are really good!
You can download all the best-ranked structures for phage Lambda so you can view them in your favorite structure viewer.
Recently, NCBI released their new datasets API that might replace NCBI E-utils. At the moment, datasets is focused on genomes, genes, and viruses, but no doubt it will expand over time. [As an aside: I think the name is terrible, and they should use ncbi_datasets (see this tweet).]
Here is a rough guide to extracting some data about genomes using datasets.
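As a minimal sketch of what that looks like (assuming the datasets command-line tool is installed and on your PATH; the accession is just an example, E. coli K-12 MG1655):
# get a JSON summary of a genome assembly
datasets summary genome accession GCF_000005845.2 > GCF_000005845.2.summary.json
# download the assembly (this writes ncbi_dataset.zip by default)
datasets download genome accession GCF_000005845.2
unzip ncbi_dataset.zip -d GCF_000005845.2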
Python has a versioning module that will automatically get its version from your latest git tag and put that version in a pip repository.
Briefly, here are the components