phyloseq logo

Converting phyloseq objects to read in other languages

Phyloseq is an R package for microbiome analysis that incorporates several data types.

Occassionally our colleagues share a phyloseq object with as an .rds file (R Data Serialization format). It is quite simple to convert that for use in other languages (e.g. python or even Excel!)

Converting the data to .tsv format

This approach requires an R installation somewhere, but we don’t need many commands, so you can probably use a remote R installation on a server!

If you have not yet installed phyloseq, you can do so with bioconductor:

if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")

Next, we load the phyloseq package and read the .RDS file:

library("phyloseq");
packageVersion("phyloseq"); # check the version we are using
# you may need to use setwd("C:/Users/username/Downloads") to move to whereever you downloaded the file!
p <- readRDS("phyloseq.rds"); # change the filename here! 
print(p)

This will print typical output from a phyloseq object like:

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 3210 taxa and 11 samples ]
sample_data() Sample Data:       [ 11 samples by 12 sample variables ]
tax_table()   Taxonomy Table:    [ 3210 taxa by 7 taxonomic ranks ]

These are our base phyloseq objects, and we can explore them:

print(otu_table(p))
print(sample_data(p))

And we can also write them to tab separated text in a .tsv file:

write.table(otu_table(p), "p_otu.tsv", sep="\t")
write.table(sample_data(p), "p_sample.tsv", sep="\t")
write.table(tax_table(p), "p_tax.tsv", sep="\t")

Read those files into Python

You can now use pandas to read those files into Python:

import pandas as pd

otu = pd.read_csv("p_otu.tsv", sep="\t")
otu

# sometimes the sample metadata has characters that can't be read using `utf-8` so we have to use `latin-1`
samples = pd.read_csv("p_sample.tsv", sep="\t", encoding='latin-1')
samples

tax = pd.read_csv("p_tax.tsv", sep="\t")
tax
git logo

Git change remote from https to ssh

Here’s how to change your remote from https to ssh.

Start by checking your current remote:

git remote -v

and this will list your remotes for both push and fetch.

To set a remote, you can use

git remote set-url origin git@github.com:username/project.git

and then use remote again to confirm:

git remote -v

now your git pull and git push will work as you expect!

bash

Use column to display tsv files in columns

If you have a tab separated file and view it in a pager like less or more, the columns never line up. Here is a simple way to make those columns appear correct.

column -t file.tsv

For example, here is a file with three columns of words, displayed with cat

If we pass that to column with the -t option to detect the columns, we get nicely organised columns:

However, note that this is not exactly correct, notice that “Paradigm shift” has been split into two columns because the -t option uses whitespace by default, so to display the columns using tabs, we need to add a -s option:

column -t -s$'\t'

bash

awk use tab as input separator

By default, awk uses any whitespace to separate the input text. You can use the same construct as perl to change the input Field separator:

cat file.tsv | awk -F"\t" '!s[$1]'

The above example will split a file on tabs and then print the unique entries in the first column.

SQLite banner image

Automatically create a SQLlite database from a gzip file

This is a neat trick in a couple of steps to create a SQLite database from a gzip file.

First, make sure your text file is tab-separated, and has a header row that you would like to use as the table columns.

First, we make a FIFO object and stream the gzip file to that. Note that we background the zcat command here.

mkfifo tempfile
zcat sprot.uniprot.proc.gz > tempfile &

This doesn’t actually uncompress your file, it just sets a 0 byte file object that will read the uncompressed file.

Now, we open SQLIte3 with the name of our (new or existing) database.

sqlite3 db.sqlite

And then set tabs mode for the import and read the tempfile filehandle into a new table.

.mode tabs
.import tempfile newtable

SQLite3 will automatically create the table with column names the same as the headers, and the data will be loaded.

You can check by looking at the .schema

Adapter trimming

“plus ça change, plus c’est la même chose “

Jean-Baptiste Alphonse Karr

So its 2023 and the new [sequencing] kid on the block, MGI, hasn’t figured out adapter trimming.

Here is a quick primer [pun intended] on the easiest way to remove primers and filter reads.

Use fastp and give it this file of Illumina adapters to trim against.

I use this command to filter and remove adapters from our sequences

mkdir output
fastp -n 1 -l 100 -i fastq/$R1 -I fastq/$R2 -o output/$R1 -O output/$R2 --adapter_fasta IlluminaAdapters.fa

If you have a lot of files, you can easily wrap this in a for loop to process all the files in a directory

mkdir fastq_fastp
for R1 in $(find fastq -name \*R1\* -printf "%f\n"); do
    R2=${R1/R1/R2}
    fastp -n 1 -l 100 -i fastq/$R1 -I fastq/$R2 -o fastq_fastp/$R1 -O fastq_fastp/$R2 --adapter_fasta IlluminaAdapters.fa 
done

In addition to trimming all adapter sequences, this will remove any sequence with an N (yes, one or more), and remove any sequence shorter than 100 bp. fastp will also make sure that your reads are paired as well!

git logo

Checking out a remote git branch

If you are working on git, sometimes you create a branch on one machine, and then want to get that on your remote machine.

Step 1. Create a branch on the remote machine called rob

git checkout -b rob

Do some work, make some changes, commit some code.

Now I am working on a remote machine in the same repo, but I want my rob branch. Easy!

Git fetch just makes sure we are upto date.

Git branch -v -a lists all remote branches, and provides more information about each. You should see the branch listed as remotes/origin/rob but you don’t need the remotes/origin part.

Git switch allows you to move into the existing remote branch

git fetch
git branch -v -a
git switch rob
Jupyter

Installing The Littlest Jupyter Hub

  1. Start an instance using nectar or jetstream
  2. You don’t need a user account. Possibly
  3. Install jupyterhub:
curl -L https://tljh.jupyter.org/bootstrap.py | sudo python3 -  --admin rob
  1. Allow users to create usernames and passwords:
tljh-config set auth.type nativeauthenticator.NativeAuthenticator
tljh-config reload
  1. Create a username using the web interface
  2. Authorize those users:
add /hub/authorize to your URL. e.g.
http://203.101.227.65/hub/authorize
  1. Install R and Python
apt install r-base nodejs
  1. Start R and install ir-kernel
$ R
> install.packages('IRkernel')
> IRkernel::installspec(user = FALSE)
> quit()

Usermanagement:

If you are having issues with user management you can cheat and get into the SQLIte database

sqlite3  /opt/tljh/state/jupyterhub.sqlite
#find the schema for the user tables
.tables
.schema users_info
CREATE TABLE users_info (
        id INTEGER NOT NULL,
        username VARCHAR(128) NOT NULL,
        password BLOB NOT NULL,
        is_authorized BOOLEAN,
        login_email_sent BOOLEAN,
        email VARCHAR(128),
        has_2fa BOOLEAN,
        otp_secret VARCHAR(16),
        PRIMARY KEY (id)
);

Find out about a user:

select * from users_info;

Authorise a user:

update users_info set is_authorized=1 where username='rob';
select * from users_info;

Should be good to go now!

Calculate the SHA256 checksum

If you create a conda recipe you need to calculate the sha256 checksum. This is a quick post to explain how to do that.

We often submit things to PyPi and then use the PyPi versions to create conda installations. The beauty of this approach is that if you update the PyPi installation, you don’t need to do anything else: the conda bot will automagically notice the new version and update for you. Procrastination pays off again! We talk about this in our PhiSpy blog post.

In the bioconda recipe we usually use this to point to a specific PyPi package for conda to install:

{% set name = "pyctv_taxonomy" %}
{% set version = "0.25" %}
{% set sha256 = "332e54fed6640f61e5c4722c62b9df633921358ba0eb8daf6230711970da2ad9" %}

package:
  name: "{{ name|lower }}"
  version: '{{ version }}'

source:
  url: "https://pypi.io/packages/source/{{ name[0] }}/{{ name }}/{{ name }}-{{ version }}.tar.gz"
  sha256: '{{ sha256 }}'

Note that we have the name (which is lower case) and the version number, and the URL is constructed from the first character of the name, the name, and the name-version.tar.gz. So in this case, the URL would be https://pypi.io/packages/source/p/pyctv_taxonomy/pyctv_taxonomy-0.25.tar.gz

Now there are a couple of ways we can generate the sha256 sum:

URL=<code>https://pypi.io/packages/source/p/pyctv_taxonomy/pyctv_taxonomy-0.25.tar.gz</code>
wget -qO- $URL | shasum -a 256

or

URL=<code>https://pypi.io/packages/source/p/pyctv_taxonomy/pyctv_taxonomy-0.25.tar.gz</code>
curl -sL $URL | openssl sha256

In this case, they both give the same answer:

332e54fed6640f61e5c4722c62b9df633921358ba0eb8daf6230711970da2ad9