
samtools / bbmap conda conflict

If you see the error samtools: error while loading shared libraries: libcrypto.so.1.0.0: cannot open shared object file: No such file or directory, it is probably a bbmap conflict.

As of the time of writing, bbmap installs samtools version 1.7, but samtools 1.14 is available, and around version 1.9 there were significant changes to the SSL libraries samtools uses.
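You can confirm the problem by checking which shared libraries your samtools binary is trying to load (ldd is a standard Linux tool; missing libraries show up as "not found"):

ldd $(which samtools) | grep -i libcrypto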

There are lots of GitHub issues and Biostars posts about this. Here is my simple solution:

mamba uninstall samtools

This will tell you what depends on samtools, and remove those packages too! If your samtools is an earlier version, you will also see that here.

Then,

mamba install samtools

Now you are up to date. Once you have confirmed that this version of samtools works, you can try reinstalling the offending package (the one that was removed alongside samtools) and see whether it tries to downgrade your samtools version. For me, that was bbmap.
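To confirm the fix, and to stop the reinstall from downgrading samtools again, you can pin the version during the reinstall. This is a sketch: the exact pin (>=1.9) is an assumption, so adjust it to the version you actually want.

samtools --version | head -n 1
mamba install bbmap "samtools>=1.9"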

How to use snakemake checkpoints

Snakemake checkpoints are a little complex to get your head around, so here are a few examples that will hopefully clarify some use cases.

Before we begin, we’re going to design a rule that makes an unknown number of randomly named files (but ten files at most!).

Here is a simple snakemake rule that will do this, and put them in a directory called first_directory.

OUTDIR = "first_directory"


rule all:
    input:
        OUTDIR


rule make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

There are a couple of things you need to know about this block:

  1. The bash variable $RANDOM returns a random integer between 0 and 32767.
  2. The modulo operation (RANDOM%10) reduces this to the remainder after dividing by 10, i.e. a number from 0 to 9. If your random number is 10, 20, 30, … etc, the remainder will be zero, so we add one to guarantee at least one file.
  3. The bash command seq will generate the sequence up to that number (e.g. seq 10 will make the loop run ten times).
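If you want to convince yourself about the arithmetic, you can run it on its own a few times:

N=$(((RANDOM%10)+1))
echo $N    # always between 1 and 10
seq $N     # prints 1..N, so the loop body runs N times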

We don’t know what files are there!

So we convert this rule to a checkpoint, by changing the first word in that definition from rule to checkpoint.

OUTDIR = "first_directory"


rule all:
    input:
        OUTDIR


checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

Notes:

  1. We have only changed one word: rule -> checkpoint.
  2. You can run that snakemake pipeline as-is, and it will still work.
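For example, assuming the file above is saved as Snakefile in the current directory, a typical invocation looks like this (recent versions of snakemake require the --cores flag):

snakemake --cores 1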

Now we can do something with that checkpoint.

Check to see which files were made

We can use that checkpoint as an input to another rule. Here, we use it as the input to rule all.

We need to get the directory that the checkpoint makes (OUTDIR), and then the files that are in that directory.

import os

OUTDIR = "first_directory"

def get_file_names(wildcards):
    # note 1: ck_output is the same as OUTDIR, but this requires
    # the checkpoint to complete before we can figure out what it is!

    # note 2: checkpoints will have attributes for each of the checkpoint
    # rules, accessible by name. Here we use make_some_files
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)



rule all:
    input:
        get_file_names


checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

Note: See how we used a checkpoint called make_some_files and then asked for checkpoints.make_some_files.get(**wildcards).output[0] in the function? That’s the connection!

At that point, snakemake knows not to evaluate the function until the checkpoint has completed.
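You can watch this happen with a dry run: snakemake will schedule the checkpoint, but it cannot fully list the inputs of the downstream rules until the checkpoint has actually run.

snakemake -n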

We can also use the output as wildcards in another rule.

import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"


def get_file_names(wildcards):
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

def dup_file_names(wildcards):
    du_output = checkpoints.make_some_files.get(**wildcards).output[0]
    SMPLS, = glob_wildcards(os.path.join(du_output, "{smpl}.txt"))
    return expand(os.path.join(SNDDIR, "{SM}.tsv"), SM=SMPLS)



rule all:
    input: 
        get_file_names,
        dup_file_names,


checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%10)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

rule duplicate:
    input:
        get_file_names
    output:
        os.path.join(SNDDIR, "{SAMPLE}.tsv")
    shell:
        """
        touch {output}
        """

Here, I am going to use a global variable to read the wildcards once and then use them in different places. Note the lambda in the list_all_files rule below: it defers the expand until after get_file_names has filled in SMP.


import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"

# SMP starts empty; get_file_names fills it in once the checkpoint has run
SMP = []

def get_file_names(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    global SMP
    SMP, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(ck_output, "{SAMPLE}.txt"), SAMPLE=SMP)

def get_second_files(wildcards):
    ck_output = checkpoints.make_five_files.get(**wildcards).output[0]
    SMP2, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    return expand(os.path.join(SNDDIR, "{SM}.tsv"), SM=SMP2)


rule all:
    input: 
        "list_of_files.txt",
        get_second_files


checkpoint make_five_files:
    output:
        directory(OUTDIR)
    params:
        o = OUTDIR
    shell:
        """
        mkdir {output};
        for D in $(seq 1 5); do
            touch {params.o}/$RANDOM.txt
        done
        """

rule copy_files:
    input:
        get_file_names
    output:
        os.path.join(SNDDIR, "{SAMPLE}.tsv")
    shell:
        """
        touch {output}
        """


rule list_all_files:
    input:
        get_file_names,
        lambda wildcards: expand(os.path.join(SNDDIR, "{s}.tsv"), s=SMP)
    output:
        "list_of_files.txt"
    shell:
        """
        echo {input} > {output}
        """

What about two different variables?

No problem! We can use two checkpoints, and then combine them in a single expand statement.

For example, let’s make two random sets of files, and then a third set of files that combines all their filenames.

Note: here each checkpoint makes at most 5 random files; if each made up to 10 we could end up with 100 combined files, but the answer is the same.

import os

OUTDIR = "first_directory"
SNDDIR = "second_directory"
THRDIR = "third_directory"


def combine(wildcards):
    # read the first set of outputs
    ck_output = checkpoints.make_some_files.get(**wildcards).output[0]
    FIRSTS, = glob_wildcards(os.path.join(ck_output, "{sample}.txt"))
    # read the second set of outputs
    sn_output = checkpoints.make_more_files.get(**wildcards).output[0]
    SECONDS, = glob_wildcards(os.path.join(sn_output, "{smpl}.txt"))
    return expand(os.path.join(THRDIR, "{first}.{second}.tsv"), first=FIRSTS, second=SECONDS)



rule all:
    input: 
        combine


checkpoint make_some_files:
    output:
        directory(OUTDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%5)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

checkpoint make_more_files:
    output:
        directory(SNDDIR)
    shell:
        """
        mkdir {output};
        N=$(((RANDOM%5)+1));
        for D in $(seq $N); do
            touch {output}/$RANDOM.txt
        done
        """

rule make_third_files:
    input:
        OUTDIR,
        SNDDIR,
    output:
        os.path.join(THRDIR, "{first}.{second}.tsv")
    shell:
        """
        touch {output}
        """

Hopefully this will get you started, but let us know if not!

samtools flags

I can never remember the order of the columns in a SAM file, so here they are:

  1. QNAME: Query name of the read or the read pair
  2. FLAG: Bitwise flag (pairing, strand, mate strand, etc.)
  3. RNAME: Reference sequence name
  4. POS: 1-Based leftmost position of clipped alignment
  5. MAPQ: Mapping quality (Phred-scaled)
  6. CIGAR: Extended CIGAR string (operations: MIDNSHP)
  7. MRNM: Mate reference name (‘=’ if same as RNAME)
  8. MPOS: 1-based leftmost mate position
  9. ISIZE: Inferred insert size (TLEN in the current SAM specification)
  10. SEQ: Sequence on the same strand as the reference
  11. QUAL: Query quality (ASCII-33=Phred base quality)

There are lots more details in the samtools manual.

You should also use the samtools flag explainer to understand column 2.
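You can also decode a FLAG value straight from the command line with the samtools flags subcommand, and pulling one record apart with tr makes the column order easy to see (your.bam is a placeholder filename here):

samtools flags 99                                   # decode flag 99 into its component bits
samtools view your.bam | head -n 1 | tr '\t' '\n'   # one alignment record, one column per line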

Updating your ARC RMS research outputs

This is an Australian thing! If you are not from Australia, you probably don’t need to do this, but in case you do, here it is!

The ARC’s Research Management System has an outputs section that is, of course, mandatory, and is, equally of course, different from everyone else’s system.

I try to curate my references in Google Scholar, ORCID, and my resume, and I don’t want to manually enter them into RMS, especially when a grant deadline is close. So here is one way to avoid that.


Submitting a PhiSpy update to pip and conda

First, make sure everything is up to date in GitHub.

We are going to call this release version 4.0, and we will have release candidates, starting at rc1.

Next, create a release on GitHub. Strictly speaking you don’t need to do that, but it is a great thing to do.

PyPI Release

The PyPI instructions cover this, but I have abstracted out the parts we need to focus on (since we have a setup.py already!).

As a regular user, we build everything. This makes a new release that we will upload:

python3 setup.py sdist bdist_wheel

This will create the source tarball and the wheel file in the dist directory. Then we need to upload those to PyPI.
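Before uploading, it is worth validating what you built; twine check is a standard twine subcommand that verifies the metadata and that the long description will render:

python3 -m twine check dist/*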

We are going to use the PyPI test interface (TestPyPI) to make sure that everything is OK. Do not skip this step!

If you need an API key, navigate to the PyPI login page. However, if you have done this before, you probably don’t need to save it again 😉

python3 -m twine upload --repository testpypi dist/PhiSpy-4.0.0rc1.tar.gz

Note that you can not upload the wheel: PyPI does not accept plain Linux binary wheels (they need to be manylinux builds).

Now we are going to test it out. Let’s make a virtual environment and install it there.

virtualenv test_phispy
cd test_phispy
source bin/activate
which pip

This should tell you that the current pip is from your virtual environment. If it is not, solve that problem!

For PhiSpy, we have a couple of dependencies that you should install with regular pip before you can install your new release candidate:

pip3 install scikit-learn biopython

This will install other things like numpy that you need.

Now you can install your new release.

pip install -i https://test.pypi.org/simple/ PhiSpy==4.0.0rc1
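An alternative I have not tested here, using pip's standard --extra-index-url option, is to let pip fall back to the real PyPI for the dependencies instead of installing them yourself first:

pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ PhiSpy==4.0.0rc1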

If you are not sure of the exact URL, logging into the TestPyPI page will show your available projects, including the newly uploaded release. If you click on the version you want, you can get the link to download and install it.

Once you are happy and have run some tests, log in to the real PyPI page (good to do anyway, even if you have an API key).

Now you can upload the final version to PyPI for everyone to access:

python3 -m twine upload dist/PhiSpy-4.0.0.tar.gz

It’s worth logging into the real PyPI page to make sure that you can download it!

Making a CONDA release

It turns out that for most code all you have to do is wait! The conda bots will take care of incrementing to the next version and running the continuous integration tests for you.

However, if you need to update the recipe manually, you probably need to change the version in meta.yaml, and then you should update the SHA hash:

URL=https://pypi.io/packages/source/p/phispy/PhiSpy-4.2.17.tar.gz
wget -O- $URL | shasum -a 256

and then paste the output of that into the sha256 field. In this case, the shasum should be

38bb8f072e2eba8efe0c46258ad9b45940eed8f126901af9d455ad0bae396e99

Note: the format for this block is:

# the source URL follows PyPI's layout: first letter / lowercase name / tarball
TOOL=PhiSpy-4.2.17.tar.gz
TL=$(echo $TOOL | cut -f 1 -d '-' | awk '{print tolower($0)}')
URL=https://pypi.io/packages/source/${TL:0:1}/$TL/$TOOL
wget -O- $URL | shasum -a 256

SoCal Hackathon 2019

We are pleased to announce the second installment of the SoCal Bioinformatics Hackathon.

From 9-11 January, 2019, the NCBI will help run a bioinformatics hackathon in Southern California hosted by the Computational Sciences Research Center at San Diego State University! We are going to put a few hundred thousand metagenomic datasets on cloud infrastructure and identify known, taxonomically definable, and novel viruses! We’re specifically looking for folks who have experience in computational virus hunting or adjacent fields! If this describes you, please apply! This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for virological analyses from high-throughput experiments. The event is open to anyone selected for the hackathon and willing to travel to SDSU (see below).
