We often want to calculate Pearson correlation between different datasets, for example, we have used it to identify the hosts of different phages. Often, we want to calculate Pearson on really large matrices, and so our usual solution is to use crappy code and be patient!
However, recently Daniel Jones released turbocor, a fast, rust-based implementation, of pairwise Pearson correlations, and so we are intrigued to work with it. Here is a brief guide to making correlations using turbocor
.
Installing turbocor
Of course, we want to install it using conda but there are a couple of simple gotchas that are easy to overcome. Here are some step by step instructions
Specifically, hdf5-sys
requires hdf5=1.10.1
at the moment, and balks at hdf5=1.12.1
(the current version in anaconda).
Step 1: Create the conda environment and install hdf5
mamba create -n turbocor rust hdf5=1.10.1
Step 2: Activate the environment
conda activate turbocor
Step 3: Set your environment variables
You will need to set this to the base of your conda environment. Change the path as appropriate for your conda installation. See the hdf5 conda installation for more details.
export HDF5_DIR=$HOME/miniconda3/envs/turbocor
export RUSTFLAGS="-C link-args=-Wl,-rpath,$HOME/miniconda3/envs/turbocor/lib"
Step 4: Build the release
cargo build --release
and now the executable will be in the target/release
directory, so you can either add this to your path (e.g. PATH=$PATH:$PWD/target/release
) or remember the location (e.g. TURBOCOR=$PWD//target/release
).
Using turbocor
You need your data conceptually in a matrix. Turbocor
will read an hdf5 format file with a dataset
tag that points to a two-dimensional matrix.
Step 1: Convert your data to hdf5 format
If you have your data in a tab-separated (or even comma-separated) text file, you can use matrix_to_h5.py to convert that data into the hdf5
format file that you need.
For example, here is how to convert the matrix to an hdf5 file. In this example, matrix.tsv
is a tab-separated matrix file that has a header row. It outputs the data into matrix.h5
with a dataset tag mydata
and outputs a separate file with the column names (taken from the first column) into indices.tsv
.
python3 /home/edwa0468/GitHubs/EdwardsLab/h5py/matrix_to_h5.py --file matrix.tsv --output matrix.h5 --header --dataset mydata --indexfile indices.tsv
Step 2: Run turbocor
Next, we run turbocor on the matrix.h5
file. Here we use the $TURBOCOR
path we set earlier, and output the correlations to the matrix.cor
file.
$TURBOCOR/turbor compute --dataset mydata matrix.h5 matrix.cor
We convert that to a comma-separated list of row indices and their correlation coefficients:
Step 3: Extract the coefficients
$TURBOCOR/turbor topk 1000 matrix.cor
You can use the indices.tsv
file we created in step 1 to identify the row names of things that correlate with each other.
Enjoy!