Parsing Newick Trees

We often need to parse “newick” format phylogentic trees to figure out some information. Writing a parser is good for the soul, because the best way to do it is through recursion.

After the readmore, I provide some perl code for parsing newick phylogenetic trees into a lightweight data structure. Each node consists of an array of three things [left child, right child, and distance]. If the node is a leaf then the node consists of [“node”, the node name, and the distance]. It allows for very easy analysis of the tree, and simple ways to get data back. I also provide some example code for printing out the root-to-tip distance of every leaf in the tree.


This is based in part on code from perlmonks … thanks!!

 sub FromStringWithDistance {          my ($self, $in)=@_;          # do we have a distance. If so we want to trim that off and save it for later          my $distance="";          my $rparen=rindex($in, ")");          my $rcolon=rindex($in, ":");          if ($rparen [is less than] $rcolon) {                  $distance = substr($in, $rcolon+1);                  $in = substr($in, 0, $rcolon); # adjust the string to exclude the distance          }          # if we don't have a comma we are just the node          if (index($in, ",") == -1) {                  return ["node", $in, $distance];          }                             # when we get here, we need to find the first level comma to split on          my $p=0;          # find the split point where we are at one level above nothing!          if (index($in, "(") != -1) {                  my $depth = 0; # the depth we are in the tree                  for (my $i=0; $i [is less than]  length($in); $i++) {                          my $c = substr($in, $i, 1);                          ++$depth if ($c eq '(');                          --$depth if ($c eq ')');                          if ($c eq ',' && $depth == 1) {                                  $p = $i;                                  last;                          }                  }          }          else {                  $p=index($in, ",");          }           # now we have the split points we need to figure out if the left and right halves begin and end with parens, and ignore them if they do          my ($start, $endlength)=(0, length($in)-$p-1);          if (index($in, "(") == 0 && rindex($in, ")") == length($in)-1) {                  $start=1; $endlength--;          }                             return [$self->FromStringWithDistance(substr($in, $start, $p-1 )), $self->FromStringWithDistance(substr($in, $p+1, $endlength)), $distance]; } 

[Note: obviously you need to change the two instances of [is less than] to the < sign, but for some reason, when I put that in the code it is breaking the html markup!

Now that we have the recursion done, we just need to call ourselves with the whole tree and start things going.



 sub parse {       my ($self, $tree)=@_;       # remove the trailing ;       $tree =~ s/;s*$//;       return $self->FromStringWithDistance($tree); }  

You should put these in a package somewhere, and then bless your object. Then you can call it any time you need it.

For example, to calculate the root to tip distance of every leaf in the tree, you will need to create a package that I called ParseTree that has those two routines (along with a typical perl “new” routine to create and bless the object). Then a simple script will print all the distances out:



 use ParseTree;  my $parser = new ParseTree;   my $file = shift || die "Tree file (newick format) to parse"; open(IN, $file) || die "Can't open $file"; my $rawtree=""; while () {chomp; $rawtree .= $_}  # parse the tree my $tree = $parser->parse($rawtree);   # collect the distance of every leaf from the root in %dist my %dist; parsenode($tree, 0); map {print "$_t$dist{$_}n"} keys %dist;   sub parsenode {     my ($node, $dist)=@_;     my ($left, $right, $newdist)=@$node;     if ($newdist) {$dist+=$newdist}     if (ref($left) eq "ARRAY" && ref($right) eq "ARRAY") {          parsenode($left, $dist);          parsenode($right, $dist);     }     elsif ($left eq "node") {          # we are a leaf in the format ['node', 'node name', distance]          $dist{$right}=$dist;     }     else {          print STDERR "Uh oh : $left and $rightn";     } }   

This will print out the name of every node and its distance from the root of the tree (or the midpoint of the base of the tree if it is an unrooted tree).

All of this code is available from the bioinformatics realm of my lab’s sourceforge repository: