Hypothetical proteins: regular expressions in Perl

When working with genome annotations, we often need to determine whether a protein is hypothetical (putative) or not. Annotators use many variants that all essentially mean "we don't know what this protein does": it is hypothetical. These Perl regular expressions let you test whether a protein's annotation marks it as hypothetical. [Note: we also have a Python version that does the same thing.]

This code came from the SEED source files, and we have collected and collated it over the years.

sub hypo {
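# Take the annotation from the only argument, or from the second argument
# when two are passed (for example, when the sub is called as a method).
my $x = (@_ == 1) ? $_[0] : $_[1];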

if (! $x) { return 1 }
if ($x =~ /lmo\d+ protein/i) { return 1 }
if ($x =~ /hypoth/i) { return 1 }
if ($x =~ /conserved protein/i) { return 1 }
if ($x =~ /gene product/i) { return 1 }
if ($x =~ /interpro/i) { return 1 }
if ($x =~ /B[sl][lr]\d/i) { return 1 }
if ($x =~ /^U\d/) { return 1 }
if ($x =~ /^orf[^_]/i) { return 1 }
if ($x =~ /uncharacterized/i) { return 1 }
if ($x =~ /pseudogene/i) { return 1 }
if ($x =~ /^predicted/i) { return 1 }
if ($x =~ /AGR_/) { return 1 }
if ($x =~ /similar to/i) { return 1 }
if ($x =~ /similarity/i) { return 1 }
if ($x =~ /glimmer/i) { return 1 }
if ($x =~ /unknown/i) { return 1 }
if (($x =~ /domain/i) ||
($x =~ /^y[a-z]{2,4}\b/i) ||
($x =~ /complete/i) ||
($x =~ /ensang/i) ||
($x =~ /unnamed/i) ||
($x =~ /EG:/) ||
($x =~ /orf\d+/i) ||
($x =~ /RIKEN/) ||
($x =~ /Expressed/i) ||
($x =~ /[a-zA-Z]{2,3}\|/) ||
($x =~ /predicted by Psort/) ||
($x =~ /^bh\d+/i) ||
($x =~ /cds_/i) ||
($x =~ /^[a-z]{2,3}\d+[^:\+\-0-9]/i) ||
($x =~ /similar to/i) ||
($x =~ / identi/i) ||
($x =~ /ortholog of/i) ||
(index($x, "Phage protein") == 0) ||
($x =~ /structural feature/i)) { return 1 }
return 0;
}
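
The same idea can also be organized as a table of precompiled patterns, which makes it easier to add or remove variants. The following is only a minimal sketch under stated assumptions: it uses a small subset of the patterns from the function above, the names is_hypothetical and @HYPO_PATTERNS are illustrative and not part of the SEED code, and the example annotations are made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Illustrative subset of the patterns used in hypo(); not the full list.
my @HYPO_PATTERNS = (
    qr/hypoth/i,
    qr/conserved protein/i,
    qr/uncharacterized/i,
    qr/^predicted/i,
    qr/unknown/i,
    qr/^orf[^_]/i,
    qr/[a-zA-Z]{2,3}\|/,    # database-style IDs such as "gi|12345"
);

# Hypothetical helper (not from SEED): returns 1 if the annotation
# matches any pattern in the table, 0 otherwise.
sub is_hypothetical {
    my ($annotation) = @_;

    # An empty or undefined annotation counts as hypothetical, as in hypo().
    return 1 unless $annotation;

    for my $re (@HYPO_PATTERNS) {
        return 1 if $annotation =~ $re;
    }
    return 0;
}

# Made-up annotation strings, for illustration only.
my @examples = (
    'DNA polymerase III alpha subunit',
    'Hypothetical protein',
    'conserved protein of unknown function',
    'gi|12345',
    '',
);

for my $annotation (@examples) {
    my $label = is_hypothetical($annotation) ? 'hypothetical' : 'not hypothetical';
    printf "%-42s %s\n", "'$annotation'", $label;
}
```

Because this sketch covers fewer patterns than the full function, it will classify some annotations differently; it is meant to show the structure, not to replace the original.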