- 05 Sep 2017 » Using genotype abundance to improve phylogenetic inference
- 26 Jun 2017 » Probabilistic Path Hamiltonian Monte Carlo
- 07 Jun 2017 » Smart proposals win for online phylogenetics using sequential Monte Carlo.
- 26 Oct 2016 » Incorporating new sequences into posterior distributions using SMC
- 22 Jul 2016 » Summer high school and undergraduate students 2016
- 11 Jul 2016 » Analysis of a slightly gentler discretization of time-trees
- 30 Jun 2016 » A time-optimal algorithm to build the SPR subgraph on a set of trees
- 05 Jun 2016 » An information-theoretic analysis of phylogenetic regularization

When doing computational biology, listen to biologists. I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.

In this particular case, the starting point was the stunningly beautiful work of Gabriel Victora’s lab visualizing germinal center dynamics in living mice. For those not yet initiated into the beauty of B cell repertoire, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binder reproduces more. As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts. Such single-cell sequencing, and consequent abundance information, is now becoming commonplace. *How should we use this abundance information in phylogenetics?*

Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered. Indeed, they...
*(full post)*

Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions. HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state. These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.

Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down. When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction. We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal. Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the...
*(full post)*

Sometimes projects take years to bear fruit.

As I described previously, Aaron Darling and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for *online* Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences. In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting. The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”. These two projects have just recently resulted in intertwined submitted papers.

The SMC implementation, which we called `sts`

for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it. One of the things that kept us from wrapping it up was problems with *particle degeneracy*, which is as follows. I think of...
*(full post)*

The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular. In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.

The bummer about Bayesian computation? It takes so darn long for those chains to converge. And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics? *If you get a new sequence you have to start over from scratch.*

I’ve been thinking about this for several years with Aaron Darling, and in particular about Sequential Monte Carlo (SMC) for this application. SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection. You can think about it as a probabilistically...
*(full post)*

I definitely didn’t set out to have three high school and two undergrad students this summer.

But they’re fantastic, and making real contributions to our scientific work! Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s partis software. Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages. Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff. All of them are learning about science and coding.

These students rock, and I can’t wait to see what great things they bring into the world with their talent.

Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference. As such, we and others think a lot about how algorithms move between topologies, typically formalizing this information as a path through a graph representing tree topologies as vertices and edges as moves from one tree to another. Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.

But what about time-trees? Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time. These are absolute times of the sort one could put on a calendar. To first approximation, working with time-trees is...
*(full post)*

We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods. The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another. Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree. The work to understand this graph has been led by Chris Whidden, including learning about how the graph structure influences Bayesian phylogenetic inference and learning about the overall structure of the graph.

These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes. In the course of our work we have been getting progressively better at constructing this graph efficiently. In our latest work we develop a time-optimal algorithm.

Chris’...
*(full post)*

How frequently are genes transferred horizontally? A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places. For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.

Using this approach requires that we have accurate trees for the genes. That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations which give signal for tree building. Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.

A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each...
*(full post)*

all posts