A time-optimal algorithm to build the SPR subgraph on a set of trees

30 Jun 2016, by Erick

We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods. The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another. Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree. The work to understand this graph has been led by Chris Whidden, including learning about how the graph structure influences Bayesian phylogenetic inference and learning about the overall structure of the graph.

These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes. In the course of our work we have been getting progressively better at constructing this graph efficiently. In our latest work we develop a time-optimal algorithm.

Chris’ insight driving this new algorithm is that we shouldn’t be focusing on the trees and checking for pairs of adjacencies, but should rather shift focus to enumerating the potential adjacencies themselves. These adjacencies can be formalized as structures called agreement forests, which in this case have two components. If one is clever, and Chris is very clever, you can quickly store these forests and recognize if you’ve seen them before. The strategy then is to move through the trees in an arbitrary sequential order, storing all of the potential adjacencies of each tree. If a given tree returns a same adjacency as another previous tree, then connect the trees in the graph.

Although for this paper we obtained an asymptotically time-optimal algorithm, there is still interesting work to be done in order to get a fast implementation. For example, we could be more thoughtful about exactly how the forests get serialized, which should lead to a faster look up in the central trie. But not having done any coding at all, much less profiling, we don’t know where the bottlenecks will lie. This paper was in part motivated by a fun discussion on phylobabble.

all posts