30 Mar 2021 » In search of a programmer to develop next-generation Bayesian phylogenetics algorithms for genomic epidemiology

The challenge

In the current SARS-CoV-2 outbreak, we have all learned the importance of epidemiology. Epidemiology is currently undergoing a revolution due to easy and cheap access to viral genetic sequences, available in real time as the epidemics unfold. This helps us understand viral spread because viruses mutate as they are passed between individuals, and by using shared mutations we can infer transmission history.

Such detective work using viral genetic sequences is beautifully expressed on the nextstrain.org platform. This platform allows people to understand viral transmission between geographic areas using the relationships among viral genomes. Specifically, it represents these relationships in terms of a phylogenetic tree, analogous to a “tree of life” but for a single viral outbreak.

Although nextstrain is wonderful, and perfect for some audiences, it has some limitations. Some of these limitations derive from the fact that it infers a single tree structure, when in fact each tree structure is a hypothesis with a certain level of support that we can obtain... (full post)

29 Mar 2021 » Postdoc position available for variational Bayes phylogenetic inference

This description focuses on methodological challenges, and is written for individuals with a background in modern Bayesian inference. For more detail about the real-world impact we hope this work will have, see the companion job ad for a programmer position.

The project

Bayesian phylogenetic (evolutionary tree) inference is important for genomic epidemiology and for our understanding of evolution. Trees, along with associated information, are complicated objects of inference, with intertwined discrete (tree structure) and continuous (dates, rates) components. Random-walk Markov chain Monte Carlo, implemented in packages such as BEAST (~20,000 citations) and MrBayes (>70,000 citations), is currently the only widely applied inference technique.

We have recently developed a rich means of parameterizing tree distributions with a fixed parameter set. This renders them accessible to more modern inference techniques, such as variational Bayes. We have developed a proof-of-concept application of phylogenetic variational Bayes using modern general-purpose gradient estimators. Our collaborative group also has preliminary integrations with both PyTorch and TensorFlow.

To achieve the promise of variational... (full post)

15 Jan 2021 » Postdoctoral position to develop Bayesian phylogenetic methods for B cell receptor sequence lineages


The goal of our project is to develop, implement, and apply Bayesian evolutionary algorithms for the analysis of B cell receptor sequence lineages. These lineages are important for understanding the events leading to the development of high-affinity antibodies.

We are motivated to:


We will work together to develop novel models, implement these models in open-source software, write tests to verify correctness, apply the methods to a variety of data sets, and write papers describing the results.

We’ll have the opportunity to work with a broad range of leading researchers, including:

19 Oct 2020 » Life changes

Hi everyone. Through a combination of COVID and the arrival of a second child, I haven’t had time to write about our recent work. I’ll be back to posting at some point, but right now I’m focusing on being a dad and supporting my trainees. Thanks for understanding.

23 Jan 2020 » A Bayesian phylogenetic hidden Markov model for B cell receptor sequences


A brief description of antibody affinity maturation

In order to defend against a very large and ever-mutating pool of pathogens, your body randomly generates, and then optimizes, a large collection of antibodies. These antibodies are displayed as so-called B cell receptors on the surface of specialized B cells. The random generation is a process called V(D)J... (full post)

24 Aug 2019 » Variational Bayesian phylogenetic inference

In late 2017 we were stuck without a clear way forward for our research on Bayesian phylogenetic inference methods.

We knew that we should be using gradient (i.e., multidimensional derivative) information to aid in finding the posterior, but couldn’t think of a way to find the right gradient. Indeed, we had recently finished our work on a variant of Hamiltonian Monte Carlo (HMC) that used the branch length gradient to guide exploration, along with a probabilistic means of hopping from one tree structure to another when a branch length became zero. Although this project was a lot of fun and was an ICML paper, it wasn’t the big advance that we needed: these continuous branch length gradients weren’t contributing enough to the fundamental challenge of keeping the sampler in the good region of phylogenetic tree structures. But it was hard to even imagine a good solution to the central question: how can we take gradients in the discrete space of phylogenetic trees?

Meanwhile,... (full post)

18 Jun 2019 » Bayesian phylogenetic inference without sampling trees

Most every description of Bayesian phylogenetics I’ve read proceeds as follows:

With statements like these in popular (and otherwise excellent!) reviews, it’s not surprising that people confuse Bayesian phylogenetics and Markov chain Monte Carlo (MCMC). Well, let’s be clear.

MCMC is one way to approximate a Bayesian phylogenetic posterior distribution. It is not the only way.

In this post I’ll describe two of our recent papers that together give a systematic, rather than random, means of approximating a phylogenetic posterior distribution.

Without a... (full post)

05 Dec 2018 » Generalizing tree probability estimation via Bayesian networks

Posterior probability estimation of phylogenetic tree topologies from an MCMC sample is currently a pretty simple affair. You run your sampler, you get out some tree topologies, you count them up, normalize to get a probability, and done. It doesn’t seem like there’s a lot of room for improvement, right?


Let’s step back a little and think like statisticians. The posterior probability of a tree topology is an unknown quantity. By running an MCMC sampler, we get a histogram, the normalized version of which will converge to the true posterior in the limit of a large number of samples. We can use that simple histogram estimate, but nothing is stopping us from using other estimators of the per-topology posterior distribution that may have nicer properties.
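To make the "count them up and normalize" estimator concrete, here is a minimal sketch in Python. It is an illustration only, not code from our software: the function name `topology_posterior` and the toy sample are hypothetical, and it assumes each sampled topology has already been written as a canonical string, so that two samples with the same topology compare equal.

```python
from collections import Counter


def topology_posterior(sampled_topologies):
    """Histogram estimate: normalized counts of each sampled topology.

    Assumes every topology is given in a canonical string form, so that
    identical topologies are identical strings (an assumption of this sketch).
    """
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    return {topology: count / total for topology, count in counts.items()}


# Hypothetical MCMC output: four sampled topologies on taxa A, B, and C.
samples = ["((A,B),C);", "((A,B),C);", "((A,C),B);", "((A,B),C);"]
estimate = topology_posterior(samples)
```

Note that any topology the sampler never visited is simply absent from `estimate`, which is the same as giving it probability zero. That is one reason to look for estimators with nicer properties when sampling is sparse.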

For real-valued samples we might use kernel density estimates to smooth noisy sampled distributions, which may reduce error when sampling is sparse. Because the number of phylogenies is huge, MCMC is computationally expensive, and we are naturally... (full post)
