Mapping UniRef100 to PhAnToMe

UniRef100 is another non-redundant database. In this post, I describe how to map the UniRef100 proteins to the proteins in the PhAnToMe database and get the subsystems for each.

This is similar to the description of how to map things to the SEED using the SEED servers, but this time we’ll download everything and do it locally.


Quick guide:

  1. Download UniRef100 from EBI. You could also use another UniRef! ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref100/
  2. Download the latest PhAnToMe annotations from http://www.phantome.org/Downloads/proteins/all_sequences/
  3. Download the code to map between the two: http://edwards-sdsu.cvs.sourceforge.net/viewvc/edwards-sdsu/bioinformatics/bin/map_phantome_to_uniref.pl?view=log
  4. Run the code to map between the datasets: perl map_phantome_to_uniref.pl phage_proteins_1317466802.fasta uniref100.fasta.gz > phantome.uniref.map.txt


UniRef100 is a collection of non-redundant proteins. From their README file:

The UniProt Reference Clusters (UniRef) provide clustered sets (UniRef100, UniRef90 and UniRef50 clusters) of sequences from the UniProt Knowledgebase and selected UniParc records, in order to obtain complete coverage of sequence space at several resolutions (100%, >90% and >50%) while hiding redundant sequences (but not their descriptions) from view. 

To map the IDs from this file to our annotations, we need to figure out how things go together.

To start with, we need to download the two fasta files. For this example, I downloaded the uniref100.fasta.gz file with a timestamp of 9/21/11 (it was 2.7 GB compressed) and the PhAnToMe file phage_proteins_1317466802.fasta.gz (I include the creation timestamp in the filename so you know which one is the most recent, or you can go back to a previous version).

I’m going to start by parsing out the PhAnToMe sequences, because there are far fewer of them and I can keep them all in memory while I scan through the UniRef100 sequences. I had two choices: one was to use MD5 sums to match the sequences, but the PhAnToMe file is only 22 MB, so there is really no reason not to just keep all of those sequences in memory as a hash.

The format of the PhAnToMe sequence headers is:

>fig|12018.1.peg.1 [Phage maturase] [Phage_entry_and_exit] [12018.1] [Enterobacteria phage GA]

In this format we have the ID, followed by the [functional role], [subsystem], [genome id], and [organism information].
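
Here is a minimal sketch of that hash-building step, assuming the uncompressed phage_proteins_1317466802.fasta file is in the working directory; it keys the hash on the sequence itself and is not the exact code from the script linked below:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read the PhAnToMe fasta file and key a hash on the protein sequence,
    # storing everything after the '>' (ID, functional role, subsystem,
    # genome id, and organism) as the value.
    my %phantome;                      # sequence => header annotation
    my ($header, $seq) = ('', '');
    open(my $in, '<', 'phage_proteins_1317466802.fasta')
        or die "Can't open the PhAnToMe file: $!";
    while (my $line = <$in>) {
        chomp $line;
        if ($line =~ /^>(.*)/) {
            $phantome{$seq} = $header if $seq;   # save the previous record
            ($header, $seq) = ($1, '');
        }
        else {
            $seq .= $line;                       # join wrapped sequence lines
        }
    }
    $phantome{$seq} = $header if $seq;           # and the last record
    close $in;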

The UniRef header format looks like this (this is the first sequence in the file, which I found using the command gunzip -c uniref100.fasta.gz | head):

>UniRef100_Q6GZX4 Putative transcription factor 001R n=1 Tax=Frog virus 3 (isolate Goorha) RepID=001R_FRG3G

To parse these sequences, I am going to use the Perl regular expression:

m/^>(\S+)\s+(.*?)\s+n=\d+\s+Tax=(.*?)\s+RepID=(\S+)/;
my ($id, $org, $repid) = ($1, $3, $4);

so that I can capture the uniref100 id, the organism name, and the representative protein id.
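
Applied to the example header above, the expression pulls out the three fields like this (just a quick illustrative check, not part of the script):

    my $hdr = '>UniRef100_Q6GZX4 Putative transcription factor 001R n=1 Tax=Frog virus 3 (isolate Goorha) RepID=001R_FRG3G';
    if ($hdr =~ m/^>(\S+)\s+(.*?)\s+n=\d+\s+Tax=(.*?)\s+RepID=(\S+)/) {
        my ($id, $org, $repid) = ($1, $3, $4);
        print "$id\t$org\t$repid\n";
        # UniRef100_Q6GZX4   Frog virus 3 (isolate Goorha)   001R_FRG3G
    }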

I wrote some simple code to read the two files and map the protein sequences between the two data sets. You can download it from our SourceForge repository.
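
The script itself is the place to look for the details, but the overall flow is roughly this (a sketch only, continuing from the %phantome hash built above; the real script may differ in details such as the output columns):

    # Stream the gzipped UniRef100 file, parse each header, and whenever a
    # sequence is identical to a PhAnToMe sequence, print the mapping.
    open(my $uni, '-|', 'gunzip -c uniref100.fasta.gz')
        or die "Can't read uniref100.fasta.gz: $!";
    my ($uid, $uorg, $urep, $useq) = ('', '', '', '');
    while (my $line = <$uni>) {
        chomp $line;
        if ($line =~ m/^>(\S+)\s+(.*?)\s+n=\d+\s+Tax=(.*?)\s+RepID=(\S+)/) {
            # report the record we just finished reading, if it matched
            print join("\t", $uid, $urep, $uorg, $phantome{$useq}), "\n"
                if $useq && exists $phantome{$useq};
            ($uid, $uorg, $urep, $useq) = ($1, $3, $4, '');
        }
        else {
            $useq .= $line;
        }
    }
    # don't forget the last record in the file
    print join("\t", $uid, $urep, $uorg, $phantome{$useq}), "\n"
        if $useq && exists $phantome{$useq};
    close $uni;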

To run the code, use the command:

perl map_phantome_to_uniref.pl phage_proteins_1317466802.fasta uniref100.fasta.gz > phantome.uniref.map.txt

On my computer this took 1 minute 39 seconds to map between the two data sets. Not too bad, since it took more than 2 hours to download uniref100.fasta.gz from EBI!

How good are our annotations? See for yourself: here is a table I compiled of the data as of October 6th, 2011. I added some column headers to the table, just to make it more obvious which column is which. [The file is gzip compressed and is 6.7 MB.]