Matsen group: general

Postdoc position available: Bayesian phylogenetics in the densely sampled regime

matsen@fredhutch.org (Frederick A. Matsen) — Fri, 16 Jul 2021 00:00:00 UTC

The project

Statistical phylogenetic (evolutionary tree) methods have been essential for understanding the SARS-CoV-2 epidemic, whether for understanding origins, global spread, or lineage dynamics of the virus. These methods are extremely mature, with optimized code and software packages implementing complex models. However, these methods were developed with the “classical” sampling regime in mind: a relatively small number of sequences with relatively large divergences between them.

Methods for the classical sampling regime work to integrate out the uncertainty we have in ancestral sequences. Although the Felsenstein algorithm does allow for efficient calculation and updating of phylogenetic likelihoods, even this is not enough to handle the massive trees we would like to use for SARS-CoV-2. Furthermore, the Felsenstein algorithm only works for IID models between sites.

With SARS-CoV-2 we are in a completely different sampling regime, with over 2 million genomes for a virus without very much evolutionary divergence. That means that we frequently sample identical viruses, and we often sequence the direct ancestor of a given virus. This greatly limits the uncertainty that we have in the ancestral states of the genome. However, the transmission history is quite uncertain, motivating a Bayesian approach.

There are some interesting opportunities in this new regime. For example, du Plessis, McCrone, Zarebski, Hill, Ruis, et al. (2021) replace the classical phylogenetic likelihood with a proxy that counts the number of substitutions that could have happened along a branch. This reduces computation time by orders of magnitude, and allows the model to focus on the important aspects of uncertainty: how the virus spread between individuals.

I think that there are many more opportunities in this new regime, including for substitution model complexity (think whole-sequence modeling), online (i.e. incremental updating) inference, integration with epidemiological models, and inference algorithms (it’s not going to be MCMC).

There are other settings that we care about for which we have dense sampling, and for which complex sequence substitution models are quite important. Specifically, I’m thinking about the evolution of antibodies that happens inside of our lymph nodes when we are vaccinated or infected. Our collaborator Gabriel Victora and his lab sample these evolutionary histories in great depth. We are also very interested in the interplay between sequence and fitness.

It’s still early stages, but thus far it looks like this will become a collaborative project with:

Trevor Bedford, an evolutionary biologist and genomic epidemiologist known for his co-development of the nextstrain platform
JT McCrone, a postdoc in the Rambaut group working on scaling Bayesian phylogenetics for SARS-CoV-2
Vladimir Minin, a leading statistician especially known for his work in “phylodynamics:” the intersection of phylogenetics, immunology, and epidemiology
Marc Suchard, another leading statistician working on phylodynamics, who has developed much of the statistical framework for complex data integration in BEAST, as well as high performance algorithms

and hopefully many others in the phylogenetics community.

Environment

The position will come with a competitive postdoc-level salary with great benefits for two years, with the ability to extend if things are going well. The environment is lively yet casual, with a strong emphasis on collaborative work. The collaborative aspect should make this position especially interesting for cross-disciplinary work: postdocs can work with programmers in the group to implement inferential models, and with statisticians in the group to formulate them as needed.

The Center is housed in a lovely campus on Lake Union a short walk from downtown, and a slightly longer walk from the University of Washington. The Matsen group is in the newly-remodeled Steam Plant building overlooking the lake. Powerful computing resources and helpful IT staff await. Ideally you’d want to be on campus but long-term remote work is possible from these states: Alabama, Alaska, Arizona, California, Colorado, Hawaii, Idaho, Maryland, Minnesota, Montana, New York, Ohio, Oregon, South Carolina, and Texas.

We believe that science is for everyone. We have had researchers with a variety of backgrounds, including Latinx, Black, Asian, and Middle Eastern. We have had women, men, gay, and straight, and we welcome people of all sexual orientations and gender identities. We have had successful high schoolers, postdocs, people who were the first in their family to attend college, and one who had decided that college wasn’t for them. We have had researchers with backgrounds in biology, physics, statistics, math, and computer science.

We acknowledge the historical and present barriers for underrepresented groups, and work to increase diversity, equity and inclusion in computational biology. Members of underrepresented groups are especially encouraged to apply.

Please read our expectations of group members. If you apply for this position, I expect that you will fulfill these expectations. I enthusiastically solicit feedback on these expectations or requests for clarification.

You can find out more about our group by visiting our group website.

Qualifications

This position is most suited to someone with a PhD in statistics, computer science, biology, or another relevant field. However, we will consider exceptionally skilled candidates with BS or MS degrees.

Essential skills

We are looking for someone who has:

experience doing methods development for Bayesian inference, or experience doing high-performance software development using graphs
clear ability to perform independent research
the ability to work and collaborate with a team.

Additional helpful skills

Ideally the candidate would have:

knowledge of Bayesian phylogenetics and genomic epidemiology
experience with C++, and with modern C++ idioms
desire to improve software development skills towards clean and tested code
experience with a modern git-based workflow
experience with Docker and continuous integration
experience developing in a Linux environment

Applying

If you are interested in this position, please submit the following materials:

Two representative publications.
A CV summarizing your education and work experience so far.
The names and email addresses of three references.
A code sample showing work that you are proud of. This has to be nontrivial, but doesn’t have to be long. Ideally it would be publicly accessible, e.g. on GitHub, but if that’s not possible an emailed attachment is fine too.

Please send these materials to: if you’re interested.

Postdoc position available: variational Bayes phylogenetic inference

matsen@fredhutch.org (Frederick A. Matsen) — Mon, 29 Mar 2021 00:00:00 UTC

The project

Bayesian phylogenetic (evolutionary tree) inference is important for genomic epidemiology and for our understanding of evolution. Trees, along with associated information, are complicated objects of inference, with intertwined discrete (tree structure) and continuous (dates, rates) structure. Random-walk Markov Chain Monte Carlo, implemented in packages such as BEAST (~20,000 citations) and MrBayes (>70,000 citations), is currently the only widely-applied inference technique.

We have recently developed a rich means of parameterizing tree distributions with a fixed parameter set. This renders them accessible to more modern inference techniques, such as variational Bayes. We have developed a proof-of-concept application of phylogenetic variational Bayes using modern general-purpose gradient estimators. Our collaborative group also has preliminary integrations with both PyTorch and TensorFlow.

To achieve the promise of variational Bayes phylogenetics, we will develop:

structure learning methods that will infer the discrete aspect of our variational approximation
fitting methods that leverage the special structure of our variational phylogenetic models
a modeling framework that integrates with PyTorch, enabling rich models that leverage covariates such as travel history.

This will be a collaborative project with

Marc Suchard, a leading statistician especially known for his work in “phylodynamics:” the intersection of phylogenetics, immunology, and epidemiology
Mathieu Fourment, who has led the development of fixed-tree variational inference for time-tree models
Cheng Zhang (张成), who has led the development of flexible-tree variational inference.

These other groups will focus more on the third aspect, whereas the Fred Hutch group will focus more on the first and second aspects.

We are implementing these algorithms in our Python-interface C++17 library.

Environment

The position will come with a competitive postdoc-level salary with great benefits for two years, with the ability to extend if things are going well. The environment is lively yet casual, with a strong emphasis on collaborative work. The Center is housed in a lovely campus on Lake Union a short walk from downtown, and a slightly longer walk from the University of Washington. The Matsen group is in the newly-remodeled Steam Plant building overlooking the lake. Powerful computing resources and helpful IT staff await. Ideally you’d want to be on campus but long-term remote work is possible from these states: Alabama, Alaska, Arizona, California, Colorado, Hawaii, Idaho, Maryland, Minnesota, Montana, New York, Ohio, Oregon, South Carolina, and Texas.

You can find out more about our group by visiting: 

Qualifications

This position requires a PhD in statistics, computer science, biology, or another relevant field.

Essential skills

We are looking for someone who has:

experience doing methods development for a challenging Bayesian estimation problem
clear ability to perform independent research
the ability to work and collaborate with a team.

Additional helpful skills

Ideally the candidate would have:

experience with PyTorch or TensorFlow
experience with C++, and with modern C++ idioms
experience with a modern git-based workflow
experience with Docker and continuous integration
experience developing in a Linux environment

Applying

If you are interested in this position, please submit the following materials:

Two representative publications.
A CV summarizing your education and work experience so far.
The names and email addresses of three references.
A code sample showing work that you are proud of. This has to be nontrivial, but doesn’t have to be long. Ideally it would be publicly accessible, e.g. on GitHub, but if that’s not possible an emailed attachment is fine too.

Please send these materials to: if you’re interested.

Life changes

matsen@fredhutch.org (Frederick A. Matsen) — Mon, 19 Oct 2020 00:00:00 UTC

Hi everyone. Through a combination of COVID and the arrival of a second child, I haven’t had time to write about our recent work. I’ll be back to posting at some point, but right now I’m focusing on being a dad and supporting my trainees. Thanks for understanding.

A Bayesian phylogenetic hidden Markov model for B cell receptor sequences

matsen@fredhutch.org (Frederick A. Matsen) — Thu, 23 Jan 2020 00:00:00 UTC

Summary

antibodies develop within you via an evolutionary process
understanding these evolutionary patterns is important for understanding how we respond to infection and vaccination
we have found using Bayesian methods that evolutionary inferences are uncertain in this regime
our most recent work develops a “Bayesian phylogenetic hidden Markov model,” which takes into account uncertainty in both the V(D)J recombination process and the evolutionary process
this work reveals substantial amino-acid uncertainty in the inference of the unmutated common ancestor of VRC01, an important and heavily-studied anti-HIV antibody
our results are described in a preprint which is now being revised for PLOS Computational Biology

A brief description of antibody affinity maturation

In order to defend against a very large and ever-mutating pool of pathogens, your body randomly generates, and then optimizes, a large collection of antibodies. These antibodies are displayed as so-called B cell receptors on the surface of specialized B cells. The random generation is a process called V(D)J recombination, in which a collection of candidate genes are randomly selected, trimmed a random amount, and then joined by random nucleotides. The optimization is called “affinity maturation,” in which antibody-making B cells are rewarded for being able to bind antigen by being allowed to divide, during which they further mutate their B cell receptor to continue improving binding. The left-hand image below is a cartoon of affinity maturation, simplified from a diagram in Victora and Mesin (2014).
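To make the three steps of V(D)J recombination concrete (select, trim, join), here is a toy simulation in Python. Everything in it is an assumption made for illustration: the gene sequences are made-up strings rather than real germline genes, the choices are uniform (real recombination is far from uniform), and `recombine`, `max_trim`, and `max_insert` are hypothetical names.

```python
import random

def recombine(v_genes, d_genes, j_genes, max_trim=3, max_insert=4, seed=0):
    """Toy V(D)J recombination. Assumes every gene is longer than 2 * max_trim."""
    rng = random.Random(seed)

    def random_nucleotides():
        length = rng.randint(0, max_insert)
        return "".join(rng.choice("ACGT") for _ in range(length))

    # Randomly select one candidate gene of each type.
    v, d, j = rng.choice(v_genes), rng.choice(d_genes), rng.choice(j_genes)
    # Trim a random amount from each end that takes part in a junction.
    v = v[:len(v) - rng.randint(0, max_trim)]
    d = d[rng.randint(0, max_trim):]
    d = d[:len(d) - rng.randint(0, max_trim)]
    j = j[rng.randint(0, max_trim):]
    # Join the trimmed genes with random nucleotides at both junctions.
    return v + random_nucleotides() + d + random_nucleotides() + j

# Made-up toy "genes", one candidate of each type.
v_genes = ["CAGGTGCAGCTGGTG"]
d_genes = ["GGTATAGCAGC"]
j_genes = ["ACTACTTTGACTAC"]
rearranged = recombine(v_genes, d_genes, j_genes, seed=1)
print(rearranged)
```

Because the trimming and the inserted nucleotides are random, many different sequences can arise from the same three genes, which is why the unmutated ancestor is itself uncertain.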

If we were omniscient, we would be able to see this process unfold and record the series of division events and the genetic sequences of the B cell receptors. But, as mere mortals, we have to be satisfied with sequencing some subset of the cells and then reconstructing the process that led to these observed cells. This includes both the tree structure and the states at the internal nodes of the tree. This is a little more complex than the usual phylogenetic tree and ancestral state reconstruction, because we have partial (but highly informative) knowledge about the ancestral state: we know it was sampled from some random process for which we know the ensemble of genes that could have rearranged in order to make the unmutated ancestral sequence. More on this below.

We would like to know the pathways of affinity maturation

The goal of vaccination is to stimulate and affinity-mature antibodies that will be able to block infection. Therefore, if we wish to design better vaccines and understand the impact of existing vaccines, we can use sequencing and sequence analysis methods to understand the development of antibodies. This might be in response to a vaccine, or it might be in response to a viral infection.

Such analysis is especially important for difficult viruses such as HIV. Despite decades of research and an enormous global budget, we still do not have an effective vaccine. This failure stems largely from HIV’s astounding diversity and mutation rate, which make it incredibly difficult for your body to make antibodies that block a usefully large range of HIV strains. Specifically, development of such antibodies typically requires a lot of mutation, including some relatively rare events.

In order to better understand what we can do to stimulate these difficult-to-elicit antibodies, the research community is studying the success stories: individuals who do manage to make antibodies that block a diversity of HIV strains. These super-antibodies are called “broadly neutralizing antibodies” or bNAbs. There is a tremendous amount of interest in a particular bNAb called VRC01. For example, there is currently a clinical trial to find out whether VRC01 infusion can block infection. Although this trial will tell us if VRC01 can block HIV infection in a realistic setting, what we really want is to be able to immunize such that the body makes something like VRC01 from scratch.

This is why it’s important to get a detailed understanding of the steps of affinity maturation taken by VRC01, which means understanding the unmutated common ancestor, as well as the series of mutations. The better we can understand the history of such antibodies, the better we can understand the barriers to eliciting them with vaccination, and the better we can design vaccinations to overcome those barriers.

Correspondingly, researchers have made beautiful and detailed computational dissections of this lineage and of other antibodies in the same class. A 2018 paper had as a primary result a new inference of the unmutated common ancestor. These computational methods are then followed by biochemical analysis of these predicted sequences, which will guide vaccine design.

These biochemical characterizations are only meaningful to the extent that the computational inferences are correct. Next I will describe the main point of this post, which is that if we use Bayesian methods we find that antibody phylogenetics is uncertain.

Bayesian phylogenetic methods describe tree uncertainty

Given a model of sequence evolution and some sequence data, there is in general some ambiguity about the correct evolutionary history. This is depicted in the following cartoon:

If we are only interested in the best-fitting tree, we can pick the left-hand one, but it fits the data only a little better than the other one. Thus if we really care about the outcome of the analysis, we should consider both of these trees as potential explanations of the data. The goal of Bayesian phylogenetics is to find all of the credible phylogenetic trees, and assign each of these a probability that it is the correct tree.

We can “boil” these trees down to quantities of actual interest to researchers. For example, antibody researchers are commonly interested in the sequence of mutations leading to a specific antibody of interest, rather than the full tree containing those mutations. We can represent the possible mutation paths and their probabilities in a diagram like this one, which was made using a simpler version of the methods described in this blog post. We were inspired in this representation by lovely work from Jesse Bloom’s lab.

When we apply Bayesian methods to real data, we find substantial uncertainty in antibody trees. This wouldn’t surprise an experienced phylogenetics researcher, because antibody sequences are relatively short, and mutation is typically focused in specific areas. Thus, we think that Bayesian methods should be the method of choice for researchers who care a lot about the details of the sequence of mutations leading to an antibody of interest. (If you cared a lot about the slope of a regression, wouldn’t you want to get a confidence interval? 😊)

A Bayesian phylogenetic hidden Markov model for B cell receptor sequences

There is something special and interesting about phylogenetics in this regime: we have information about the sequence at the root of the tree. This is because we know that it came from V(D)J recombination, and there are databases of the various V, D, and J genes present in the population that go into that recombination. We can formalize this knowledge as a probabilistic model of V(D)J recombination, which can be used as a prior on root sequences for our B cell phylogeny.

Putting this together with the usual Bayesian phylogenetic machinery, we have a posterior that looks like so:

where the box at the top of the tree is meant to represent a probabilistic model of the V(D)J recombination process. Samples from this posterior integrate uncertainty in both the recombination process and the phylogenetic tree.
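One way to write such a posterior down, in notation that is assumed here for illustration rather than taken from the preprint: let $\tau$ denote the tree together with its branch lengths and substitution model parameters, $r$ the unmutated root sequence, and $D$ the observed sequences. Then

\[p(\tau, r \mid D) \propto p(D \mid \tau, r) \, p(r) \, p(\tau)\]

where $p(r)$ is the V(D)J recombination prior on root sequences and $p(\tau)$ is a conventional tree prior. Treating the root sequence and the tree as independent a priori is a simplifying assumption of this sketch.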

Amrit Dhar, a statistics PhD student working with Vladimir Minin and me, led development of a way of sampling from the posterior of these structures. In doing so, he had to cope with the complexities of V(D)J recombination modeling (with assistance from Duncan Ralph), as well as the complexities of doing Bayesian phylogenetics. The methods and validations are described in a preprint which is now being revised for PLOS Computational Biology.

Substantial uncertainty in the unmutated ancestor of VRC01

Using Amrit’s method, we find substantial uncertainty in the inference of the VRC01 unmutated common ancestor for the CDR3 region, which is a key region determining binding. We can visualize it like so, with the heights of the letters being the posterior probability of that letter’s amino acid at each site:

Substantial uncertainty is evident at a number of sites, and with quite different amino acids. For example, Tyrosine (Y) and Asparagine (N) have very different biochemical properties. We might expect these variants to have different binding properties.
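As a rough sketch of how letter heights like these can be computed, the Python below tabulates, for each site, the fraction of posterior samples carrying each amino acid. The function name `site_posteriors` and the four sampled sequences are made up for illustration; they are not VRC01 data and not the code used for the figure.

```python
from collections import Counter

def site_posteriors(sampled_sequences):
    """For each site, the fraction of posterior samples carrying each amino acid."""
    n_samples = len(sampled_sequences)
    length = len(sampled_sequences[0])
    if any(len(seq) != length for seq in sampled_sequences):
        raise ValueError("all sampled sequences must have the same length")
    return [
        {aa: count / n_samples
         for aa, count in Counter(seq[i] for seq in sampled_sequences).items()}
        for i in range(length)
    ]

# Made-up posterior samples of a four-residue ancestral segment.
samples = ["ARDY", "ARDY", "ARDN", "ARGN"]
posteriors = site_posteriors(samples)
for position, probs in enumerate(posteriors, start=1):
    print(position, sorted(probs.items(), key=lambda item: -item[1]))
```

In this toy input the first two sites are certain, while the last site is split evenly between Y and N, which is the kind of pattern that shows up as two letters of similar height.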

On the other hand, when the data supports a single clear answer, then the method reports it. For example, we find very little uncertainty in the inferred ancestor of PC64, another antibody lineage of great interest to HIV researchers.

Some final thoughts

Bayesian methods are expensive to run, and it would require a staggering amount of compute to run this method on every inferred clonal family in a repertoire. We don’t. We only run it when we really care about understanding the collection of predicted ancestral sequences, such as when we want to express inferred antibodies and test their properties in the lab.

There’s a lot left to do here. We would like to make the method faster (see our other research on accelerating Bayesian phylogenetic inference) so we can scale the method to more sequences. Our method does not take into account the context-sensitive nature of the B cell receptor mutation process, like IgPhyML does. Also, we’d ideally like to have a method that also incorporates uncertainty concerning which sequences are in the tree, which is not known a priori.

Thank you to Amrit, whose incredible determination pushed this challenging project through to completion, to Duncan, for VDJ recombination consults and integration with partis, and to Vladimir, for being an awesome collaborator from the vision to the final details. See the preprint for many other thank-yous, but I’d like to especially credit the Overbaugh lab for keeping us motivated to work on this challenging problem.

I would also like to credit Tom Kepler, whose pioneering work gave the first means of integrating phylogeny and rearrangement inference. His software performed remarkably well in our benchmarks for the task of providing a point estimate of the tree.

Please comment and ask questions here.

We’re always interested in hearing from people interested in our work who might want to come work with us as students or postdocs. Please drop me a line!

Variational Bayesian phylogenetic inference

matsen@fredhutch.org (Frederick A. Matsen) — Sat, 24 Aug 2019 00:00:00 UTC

In late 2017 we were stuck without a clear way forward for our research on Bayesian phylogenetic inference methods.

We knew that we should be using gradient (i.e. multidimensional derivative) information to aid in finding the posterior, but couldn’t think of a way to find the right gradient. Indeed, we had recently finished our work on a variant of Hamiltonian Monte Carlo (HMC) that used the branch length gradient to guide exploration, along with a probabilistic means of hopping from one tree structure to another when a branch became zero. Although this project was a lot of fun and was an ICML paper, it wasn’t the big advance that we needed: these continuous branch length gradients weren’t contributing enough to the fundamental challenge of keeping the sampler in the good region of phylogenetic tree structures. But it was hard to even imagine a good solution to the central question: how can we take gradients in the discrete space of phylogenetic trees?

Meanwhile, in another line of research we were trying to separate out the process of exploring discrete tree structures from that of handling the continuous branch length parameters. As I described in a previous post, this combined a systematic search strategy modeled after what maximum-likelihood phylogenetic inference programs do with efficient marginal likelihood estimators to “integrate out” the branch lengths. This worked well for some data sets, but was bound to fail for any data set in which the posterior was spread across too many trees. Indeed, any method that needs to do a calculation on each tree in the credible set is bound to fail for large and flat posterior distributions.

At this point I was feeling despondent. I didn’t know how to take the gradient in discrete tree structure space, and the sampling-based methods we wanted to avoid seemed like the only approach that could work for flat posteriors. The only opportunity I could see was the Höhna-Drummond and Larget work on parametrizing tree structure posteriors; however, we had previously shown that these parametrizations were insufficiently flexible to represent the shape of true phylogenetic posteriors. Perhaps we could generalize them?

Cheng Zhang, when he was a postdoc in my group, took that vague idea and built a completely new means of inferring phylogenetic posteriors: variational Bayes phylogenetic inference. In this post I hope to explain this advance to the phylogenetics community.

How variational inference and the Metropolis-Hastings ratio each get around the normalizing constant problem

Bayesian phylogenetic inference targets the posterior distribution $p(\mathbf{z} \mid D)$ on structures $\mathbf{z}$ consisting of phylogenetic trees along with associated model parameters including branch lengths. Bayes’ rule tells us that the posterior is proportional to the likelihood times the prior:

\[p(\mathbf{z} \mid D) \propto p(D \mid \mathbf{z}) \, p(\mathbf{z})\]

We can efficiently evaluate the two terms on the right hand side: the likelihood $p(D \mid \mathbf{z})$ via Felsenstein’s tree-pruning algorithm and the prior $p(\mathbf{z})$. However, it’s still quite hard to get correct values for the posterior $p(\mathbf{z} \mid D)$ because of the unknown proportionality constant hidden in $\propto$. We will call the likelihood times the prior on the right hand side of Bayes’ rule, $p(D \mid \mathbf{z}) \, p(\mathbf{z})$, the unnormalized posterior.
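To illustrate that the unnormalized posterior really is cheap to evaluate, here is a minimal Python sketch of tree pruning for a single alignment column under the Jukes-Cantor model, followed by the product with a prior. The nested-tuple tree encoding, the exponential branch length prior with mean 0.1, and all function names are assumptions chosen for this sketch; they do not correspond to the interface of any particular package.

```python
import math

NUCLEOTIDES = "ACGT"

def jc_transition(branch_length):
    """Jukes-Cantor transition probabilities for a branch measured in expected
    substitutions per site: returns (p_same, p_different)."""
    decay = math.exp(-4.0 * branch_length / 3.0)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay

def partial_likelihoods(node):
    """Felsenstein pruning for one site.

    A leaf is an observed nucleotide (a string); an internal node is a pair of
    (child, branch_length) tuples. Returns, for each possible state at this
    node, the probability of the observed data below it."""
    if isinstance(node, str):
        return [1.0 if state == node else 0.0 for state in NUCLEOTIDES]
    result = [1.0] * 4
    for child, branch_length in node:
        child_partials = partial_likelihoods(child)
        p_same, p_diff = jc_transition(branch_length)
        for i in range(4):
            result[i] *= sum(
                (p_same if i == j else p_diff) * child_partials[j]
                for j in range(4)
            )
    return result

def site_likelihood(root):
    """p(D | z) for one site, with a uniform distribution on the root state."""
    return sum(0.25 * value for value in partial_likelihoods(root))

def log_unnormalized_posterior(root, branch_lengths, prior_mean=0.1):
    """log p(D | z) + log p(z), with independent exponential priors on the
    branch lengths (an assumed prior, chosen only for this sketch)."""
    log_prior = sum(-math.log(prior_mean) - b / prior_mean for b in branch_lengths)
    return math.log(site_likelihood(root)) + log_prior

# The tree ((A:0.1, A:0.1):0.05, C:0.15), observed at a single site.
branches = [0.1, 0.1, 0.05, 0.15]
inner = (("A", branches[0]), ("A", branches[1]))
tree = ((inner, branches[2]), ("C", branches[3]))
print(site_likelihood(tree), log_unnormalized_posterior(tree, branches))
```

Nothing in this calculation requires knowing $p(D)$, which is the quantity that is hard to obtain.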

The difficulty posed by the unknown proportionality constant is analogous to surveyors trying to calculate the average absolute height of a mountain range using only relative height measurements: they have to cover the entire mountain range before feeling confident that they can translate their relative measurements into an absolute estimate of the average height.

The Metropolis-Hastings algorithm avoids this problem by only working in terms of ratios of posterior probabilities. This cancels out the hidden proportionality constant, but at the cost of not directly giving an estimate of the posterior probability. Such an estimate then comes from running a Metropolis-Hastings sampler, which in the phylogenetic case doesn’t scale to data sets with many sequences, as I described in a previous post.
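The cancellation can be seen in a toy sampler. The Python sketch below runs Metropolis-Hastings on four states arranged in a ring, using only differences of log unnormalized posterior values. The state space, the weights, and the function name are hypothetical and stand in for tree space only by analogy.

```python
import math
import random
from collections import Counter

def metropolis_hastings(log_unnormalized, n_states, n_steps, seed=0):
    """Random-walk Metropolis-Hastings on states 0..n_states-1 arranged in a ring.

    The proposal moves one step left or right with equal probability, so it is
    symmetric and the acceptance ratio only involves the target."""
    rng = random.Random(seed)
    state = 0
    visits = Counter()
    for _ in range(n_steps):
        proposal = (state + rng.choice([-1, 1])) % n_states
        # Difference of log unnormalized posteriors: log p(D) cancels here.
        log_ratio = log_unnormalized(proposal) - log_unnormalized(state)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            state = proposal
        visits[state] += 1
    return [visits[s] / n_steps for s in range(n_states)]

weights = [1.0, 2.0, 4.0, 8.0]  # hypothetical unnormalized posterior values
estimate = metropolis_hastings(lambda s: math.log(weights[s]), len(weights), 200_000)
exact = [w / sum(weights) for w in weights]
print(estimate, exact)
```

The posterior probabilities only emerge as visit frequencies after a long run, which is the cost described above.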

Variational inference takes a different approach, fitting a variational approximation $q_\phi(\mathbf{z})$ to the posterior $p(\mathbf{z} \mid D)$. This approximation is parameterized in terms of some parameters $\phi$. Once we have fit this approximation, we use it in place of our actual posterior for whatever downstream analyses we have in mind. It is an inferential method that can be used in place of Metropolis-Hastings.

The fitting procedure for the variational approximation avoids the normalizing constant problem by taking a measure of “goodness of fit” that only requires evaluating the unnormalized posterior. In the most common formulation, this is the Kullback-Leibler divergence $\text{KL}(q_\phi(\mathbf{z}) \parallel p(\mathbf{z} \mid D))$, in which the normalizing constant appears only as an additive term $\log p(D)$ that can be pulled out of the expectation. We can then ignore that constant when optimizing.
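Written out, using $p(\mathbf{z} \mid D) = p(D \mid \mathbf{z}) \, p(\mathbf{z}) / p(D)$, the step is:

\[\text{KL}(q_\phi(\mathbf{z}) \parallel p(\mathbf{z} \mid D)) = \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}) - \log p(\mathbf{z} \mid D)\right] = \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}) - \log\big(p(D \mid \mathbf{z}) \, p(\mathbf{z})\big)\right] + \log p(D)\]

Because $\log p(D)$ does not depend on $\phi$, minimizing the divergence is the same as maximizing $\mathbb{E}_{q_\phi}\left[\log\big(p(D \mid \mathbf{z}) \, p(\mathbf{z})\big) - \log q_\phi(\mathbf{z})\right]$, a quantity usually called the evidence lower bound (ELBO).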

This optimization process happens by stochastic gradient ascent on the goodness of fit (equivalently, descent on the divergence), in which one samples from the current approximation $q_\phi$ and uses that sample to take an optimization step in terms of $\phi$ to improve the fit. That’s what I’m showing in the above figure, in which the pink points represent samples from the current variational approximation. We take those points and calculate a gradient in terms of the variational parameters $\phi$ using the unnormalized posterior, and then take a gradient ascent step. I show a lot of points, corresponding to the fact that we use a multi-sample gradient estimator to decrease variance in the gradient estimate.

Intuitively, one can simply imagine that after a sample from $q_\phi$ one would like to fiddle with $\phi$ so as to improve fit of the variational approximation, just as if the posterior were “data” and we were fitting a statistical model. Early in the fitting procedure, this will involve increasing the probability of generating samples $\mathbf{z}$ that had a high unnormalized posterior and decreasing the probability of generating those that did not. If you want to learn more, see the excellent review article by Blei et al. for background, and our ICLR paper for details about gradients.
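Here is a minimal sketch of that loop in Python on a toy problem: a categorical $q_\phi$ over four states, fit to a target known only up to a constant. It uses a generic score-function gradient estimator with a leave-one-out baseline, which is one simple multi-sample estimator and not necessarily the one used in our paper. The weights, step size, and function names are all assumptions made for illustration.

```python
import math
import random

def softmax(phi):
    shift = max(phi)
    exps = [math.exp(p - shift) for p in phi]
    total = sum(exps)
    return [e / total for e in exps]

def fit_categorical_vi(log_unnormalized, n_steps=3000, n_samples=8,
                       learning_rate=0.1, seed=0):
    """Fit q_phi = softmax(phi) to a discrete target known up to a constant."""
    rng = random.Random(seed)
    n_states = len(log_unnormalized)
    phi = [0.0] * n_states
    for _ in range(n_steps):
        q = softmax(phi)
        samples = rng.choices(range(n_states), weights=q, k=n_samples)
        # Per-sample signal: log unnormalized posterior minus log q.
        signal = [log_unnormalized[z] - math.log(q[z]) for z in samples]
        total_signal = sum(signal)
        gradient = [0.0] * n_states
        for z, f in zip(samples, signal):
            # Leave-one-out baseline: the mean signal of the other samples.
            baseline = (total_signal - f) / (n_samples - 1)
            for k in range(n_states):
                # d/dphi_k of log q(z) for a softmax is 1[k == z] - q[k].
                score = (1.0 if k == z else 0.0) - q[k]
                gradient[k] += (f - baseline) * score / n_samples
        # Gradient ascent step on the evidence lower bound.
        phi = [p + learning_rate * g for p, g in zip(phi, gradient)]
    return softmax(phi)

weights = [1.0, 2.0, 4.0, 8.0]  # hypothetical unnormalized posterior values
fitted = fit_categorical_vi([math.log(w) for w in weights])
exact = [w / sum(weights) for w in weights]
print(fitted, exact)
```

Samples whose signal is above the baseline get their probability pushed up and the others get pushed down, and the normalizing constant never appears because it shifts every signal by the same amount.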

However, I’d like to clarify a point that seems to cause confusion, including in the minds of reviewers who rejected our grant application: there is a clear distinction between the general technique of variational inference (VI) and a specific variational parameterization, such as mean-field VI. Mean-field VI makes strong independence assumptions which limit the flexibility of variational approximations; indeed it is not appropriate even for some simple hierarchical models. In contrast, VI is a general technique that will work given an appropriate approximating density and fitting algorithm. I describe evidence below that our parameterization for phylogenetic posteriors is sufficiently rich. More generally, there are now many methods that use richer families of variational approximations such as normalizing flows.

How do we obtain a variational approximation of a phylogenetic posterior?

You may be thinking “well this all sounds very nice, but how are we going to parameterize a discrete set of phylogenetic trees using real-valued parameters $\phi$?” This is not at all obvious, and is the subject of a previous post (where we also give credit to the originators of this approach). In short, one approximates the phylogenetic posterior using a series of conditional probabilities, like so:

We showed in our 2018 NeurIPS paper that this parametrization was sufficiently rich to approximate the shape of phylogenetic posteriors on real data to high accuracy. In fact, in that paper (Table 1) we showed that the variational approximation fit to an MCMC sample was significantly more accurate than just using the MCMC samples in the usual way.
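To give a feel for the "series of conditional probabilities" idea, here is a simplified Python sketch for rooted trees on four taxa: the probability of a topology is the probability of its root split times the probability of each further split given its parent split. The data structures, the hard-coded probabilities, and the function names are assumptions made for illustration and are much simpler than the real parameterization. In practice the probabilities would be derived from the real-valued parameters $\phi$, for instance through a softmax, rather than written down by hand.

```python
def clade(node):
    """Set of taxon names below a node; a leaf is a string, an internal node a pair."""
    if isinstance(node, str):
        return frozenset([node])
    left, right = node
    return clade(left) | clade(right)

def subsplit(node):
    """The unordered pair of child clades at an internal node."""
    left, right = node
    return frozenset([clade(left), clade(right)])

def tree_probability(tree, root_probs, conditional_probs):
    """Probability of a rooted topology: the root split probability times the
    probability of each further split given its parent split."""
    probability = root_probs.get(subsplit(tree), 0.0)
    stack = [tree]
    while stack:
        parent = stack.pop()
        for child in parent:
            if len(clade(child)) <= 2:
                continue  # a leaf or a pair of leaves can only resolve one way
            key = (subsplit(parent), subsplit(child))
            probability *= conditional_probs.get(key, 0.0)
            stack.append(child)
    return probability

def split(a, b):
    """Helper for single-letter taxa: split("A", "BCD") is the subsplit A | BCD."""
    return frozenset([frozenset(a), frozenset(b)])

# Hypothetical probabilities for four taxa, with support on four topologies.
root_probs = {split("AB", "CD"): 0.5, split("A", "BCD"): 0.5}
conditional_probs = {
    (split("A", "BCD"), split("B", "CD")): 0.7,
    (split("A", "BCD"), split("C", "BD")): 0.2,
    (split("A", "BCD"), split("D", "BC")): 0.1,
}

trees = [
    (("A", "B"), ("C", "D")),
    ("A", ("B", ("C", "D"))),
    ("A", ("C", ("B", "D"))),
    ("A", ("D", ("B", "C"))),
]
for tree in trees:
    print(tree, tree_probability(tree, root_probs, conditional_probs))
```

The appeal of this kind of factorization is that a fixed, modest set of conditional probabilities can assign probability to many trees, because splits are shared between them.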

For full variational inference, we also layer on a variational distribution of branch lengths in terms of another set of variational parameters $\psi$. I’m not going to describe how those work, but Cheng found a nice parameterization that used “just the right amount” of tree structure. See our 2018 ICLR paper for a full description.

With the complete variational parameterization in hand, all that remains is to fit it to the posterior. This required deft coding and a lot of tinkering on Cheng’s part, using control variate ideas for the tree structure and the reparametrization trick for branch lengths. The result? An algorithm that can outperform MCMC in terms of the number of likelihood computations.

The phylogenetic reader may also be interested in Table 1 of the ICLR paper, which shows that importance sampling using the full variational approximation gives marginal likelihood estimates that are quite concordant with those of the stepping-stone method, though with lower variance. Stepping-stone is a computationally expensive gold-standard method, whereas our method only required 1000 importance samples (and thus only 1000 likelihood evaluations once the variational approximation was fit). That’s promising!
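The idea behind that estimator can be sketched on a toy discrete problem: draw samples from the fitted approximation $q$, weight each by the unnormalized posterior divided by $q$, and average. The Python below is a generic importance sampling sketch under assumed inputs (four states, hypothetical weights, a hand-written `approximation`), not the code behind the table.

```python
import math
import random

def log_marginal_likelihood(log_unnormalized, q, n_samples=1000, seed=0):
    """Importance-sampling estimate of log p(D), the log of the sum over states
    of the unnormalized posterior, using samples from an approximation q."""
    rng = random.Random(seed)
    samples = rng.choices(range(len(q)), weights=q, k=n_samples)
    # Importance weights on the log scale: unnormalized posterior over q.
    log_weights = [log_unnormalized[z] - math.log(q[z]) for z in samples]
    # log-mean-exp, shifted by the maximum for numerical stability.
    shift = max(log_weights)
    mean = sum(math.exp(lw - shift) for lw in log_weights) / n_samples
    return shift + math.log(mean)

weights = [1.0, 2.0, 4.0, 8.0]          # hypothetical unnormalized posterior values
approximation = [0.1, 0.15, 0.25, 0.5]  # a slightly imperfect fitted q
estimate = log_marginal_likelihood([math.log(w) for w in weights], approximation)
print(estimate, math.log(sum(weights)))
```

The closer $q$ is to the true posterior, the more nearly constant the importance weights become, which is why a well-fit variational approximation can give low-variance estimates from relatively few samples.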

What’s next?

We’re working hard to realize the promise of variational Bayes phylogenetic inference. On the coding front, we’re developing the libsbn library along with a team including Mathieu Fourment. The concept behind this Python-interface C++ library is that you can express interesting parts of your phylogenetic model in Python/TensorFlow/PyTorch/whatever and let an optimized library handle the tree structure and likelihood computations for you. It’s not quite useful yet, but we already have the essential data structures, as well as likelihood computation and branch length gradients using BEAGLE. I’m having a blast hacking on it, and it shouldn’t be too long before it can perform inference.

But the really fun part about variational inference is the ability to develop tricks that accelerate convergence. VI is fundamentally an optimization algorithm, and we can do whatever we want to do to accelerate that optimization. For example, stochastic variational inference accelerates inference by taking random subsets of the data. We need to be careful about how to do that in the phylogenetic case (we can’t naively subsample tips of the tree) but we are currently pursuing ideas along those lines. In contrast, MCMC is a fairly constrained algorithm, and clever algorithms run the risk of either disturbing detailed balance or leading to an impossible-to-calculate proposal density.

I haven’t mentioned continuous model parameters other than branch lengths, and our initial work only used the simplest phylogenetic model: Jukes-Cantor without an explicit model of rates across sites. Mathieu is working out the gradients of nucleotide model parameters, which will allow us to formulate variational approximations of those too.

There’s still a lot to be done, and I’m having the time of my research life working in this area. I’d love to hear any comments, and don’t hesitate to reach out with questions.

We’re always interested in hearing from people interested in our work who might want to come work with us as students or postdocs. Please drop me a line!

I’m very grateful to Cheng Zhang for his creativity and skill in making this project happen. He is now tenure-track faculty at Peking University in Beijing. I’d also like to thank our growing team of collaborators working on this subject.

Also, if you are interested in this area, check out the work of Mathieu Fourment and Aaron Darling, which is an independent development from ours.

Bayesian phylogenetic inference without sampling trees

matsen@fredhutch.org (Frederick A. Matsen) — Tue, 18 Jun 2019 00:00:00 UTC

Most every description of Bayesian phylogenetics I’ve read proceeds as follows:

“Bayesian phylogenetic analyses are conducted using a simulation technique known as Markov chain Monte Carlo (MCMC).” (Alfaro & Holder, 2006)
“Posterior probabilities are obtained by exploring tree space using a sampling technique, called Markov chain Monte Carlo (MCMC).” (Lemey et al, The Phylogenetic Handbook)
“Once the biologist has decided on the data, model and prior, the next step is to obtain a sample from the posterior. This is done by using MCMC…” (Nascimento et al, 2017.)

With statements like these in popular (and otherwise excellent!) reviews, it’s not surprising that people confuse Bayesian phylogenetics and Markov chain Monte Carlo (MCMC). Well, let’s be clear.

MCMC is one way to approximate a Bayesian phylogenetic posterior distribution. It is not the only way.

In this post I’ll describe two of our recent papers that together give a systematic, rather than random, means of approximating a phylogenetic posterior distribution.

Without a doubt MCMC is the most popular means of approximating the posterior. MCMC is wonderfully simple. To implement a sampler (assuming you have a computable likelihood and prior) all you have to do is devise a proposal distribution that tries out a new tree and/or model parameter, and then accept/reject based on the Metropolis-Hastings ratio. If your proposal is reasonably good, then MCMC will converge (in the limit of a large number of samples) to the posterior distribution.
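To make the accept/reject step concrete, here is a minimal sketch of a Metropolis-Hastings sampler on a toy discrete state space standing in for tree space. The five states, their unnormalized weights, and the ring-shaped proposal are all made up for illustration. Because this proposal is symmetric, the Hastings correction cancels and only the ratio of unnormalized posteriors remains.

```python
import math
import random
from collections import Counter

# Hypothetical unnormalized posterior weights for five "trees" (illustration only).
WEIGHTS = [1.0, 4.0, 2.0, 0.5, 2.5]


def log_unnormalized_posterior(state):
    return math.log(WEIGHTS[state])


def propose(state, rng):
    """Symmetric proposal: step to one of the two neighbors on a ring."""
    return (state + rng.choice([-1, 1])) % len(WEIGHTS)


def metropolis_hastings(n_steps, rng, start=0):
    state = start
    samples = []
    for _ in range(n_steps):
        candidate = propose(state, rng)
        log_ratio = (log_unnormalized_posterior(candidate)
                     - log_unnormalized_posterior(state))
        # Accept with probability min(1, ratio); the normalizing constant cancels.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            state = candidate
        samples.append(state)
    return samples


samples = metropolis_hastings(200_000, random.Random(1))
counts = Counter(samples)
estimated = [counts[s] / len(samples) for s in range(len(WEIGHTS))]
exact = [w / sum(WEIGHTS) for w in WEIGHTS]
```

With enough steps, `estimated` approaches `exact`, even though the sampler only ever evaluated ratios of unnormalized weights.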

However, the simplicity of MCMC implies inherent limitations. It is not a “smart” algorithm. For example, one cannot easily adapt the proposal distribution to reflect what the MCMC is learning about the posterior.

This is problematic because the number of trees grows super-exponentially, and most of those trees are terrible explanations of a given data set. Thus, naive random tree modification proposals are likely to take us out of the high-posterior region. This is manifested in either timid tree proposal distributions, or as an unacceptably high rejection rate for proposals in an MCMC run.

Before we start talking about alternatives, I want to emphasize that people have done wonderful and important work within the MCMC framework. It has brought us all of the biological insights we have learned using Bayesian phylogenetic methodology, from deep divergence inference with complex models to genetic epidemiology. Methodologically, many authors have made this possible, from developing and testing tree proposals to leveraging Metropolis-Coupled Monte Carlo, which although not a panacea certainly improves topological mixing.

Let’s start discussing alternatives by backing up a bit.

What is Bayesian phylogenetics trying to do?

In the Bayesian phylogenetic framework, we are interested in the posterior distribution $p(\tau, \theta \mid D)$ on tree topologies $\tau$ and model parameters $\theta$ (including branch lengths) given sequence data $D$. Bayes’ rule tells us that the posterior is proportional to the likelihood times the prior:

\[p(\tau, \theta \mid D) \propto p(D \mid \tau, \theta) \, p(\tau, \theta)\]

Because this is a proportionality, we can easily evaluate ratios of posterior probabilities (such as to compute the Metropolis-Hastings ratio), but getting the true value of the posterior is intractable for phylogenetics.

Now, it’s common to be interested in the tree topologies $\tau$ rather than the joint distribution on the topology and all of the associated continuous parameters. For instance, one might want to test monophyly of a given clade. That is, we would like to know $p(\tau \mid D)$.

The common way to do this with MCMC is to run your chain, count the number of times you saw each topology $\tau_i$, then divide by the number of samples from your chain. By ignoring the continuous parameters, we effectively marginalize them out.
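As a minimal sketch of that counting estimator: the snippet below assumes each retained MCMC sample has already been reduced to a canonical topology string, so that identical topologies compare equal. The strings in `chain` are hypothetical.

```python
from collections import Counter


def topology_frequencies(sampled_topologies):
    """Estimate p(topology | D) as the fraction of samples showing each topology."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    return {topology: count / total for topology, count in counts.items()}


# Hypothetical chain output: one canonical topology string per retained sample.
chain = ["((A,B),C,D)", "((A,C),B,D)", "((A,B),C,D)", "((A,B),C,D)"]
posterior = topology_frequencies(chain)
```

Branch lengths and other continuous parameters never enter the computation, which is what marginalizes them out.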

In our work we wondered if we could develop an alternative means of getting $p(\tau \mid D)$ without a sampling-based method such as MCMC. Specifically, we wanted to avoid any randomized movement through tree space.

Consider that

\[p(\tau_i \mid D) = \frac{p(D \mid \tau_i) \, p(\tau_i)}{\sum_j p(D \mid \tau_j) \, p(\tau_j)}\]

where the sum in the denominator of the ratio is over all trees $\tau_j$, $p(D \mid \tau_j)$ is the marginal likelihood over continuous parameters

\[p(D \mid \tau_j) = \int_\theta p(D \mid \tau_j, \theta) \, p(\theta) \, d\theta,\]

and $p(\tau_j)$ is the prior on tree topology $\tau_j$.

Thus we can approximate the per-topology posterior distribution $p(\tau_i \mid D)$ given

some systematic way of identifying a collection of “good” trees $\tau_j$ that contain most of the posterior probability weight in the denominator of the ratio
some efficient way of estimating the marginal likelihood $p(D \mid \tau)$.
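Given those two ingredients, the normalization in the ratio above is a few lines of arithmetic. Here is a sketch that works in log space, since phylogenetic log-likelihoods are typically far too negative to exponentiate directly. The function name and the numbers are illustrative assumptions, and the sum runs only over the supplied "good" trees, so any probability mass outside that set is ignored.

```python
import math


def topology_posterior(log_marginal_likelihoods, log_priors):
    """Normalize p(D | tau) p(tau) over a finite collection of topologies."""
    log_joint = {
        tree: log_marginal_likelihoods[tree] + log_priors[tree]
        for tree in log_marginal_likelihoods
    }
    # Log-sum-exp: subtract the maximum before exponentiating to avoid underflow.
    biggest = max(log_joint.values())
    log_norm = biggest + math.log(
        sum(math.exp(value - biggest) for value in log_joint.values())
    )
    return {tree: math.exp(value - log_norm) for tree, value in log_joint.items()}


# Hypothetical values: two topologies under a uniform topology prior.
log_ml = {"t1": -1000.0, "t2": -1000.0 - math.log(3.0)}
log_prior = {"t1": math.log(0.5), "t2": math.log(0.5)}
posterior = topology_posterior(log_ml, log_prior)
```

Here `t1` has three times the marginal likelihood of `t2`, so it receives three quarters of the posterior mass.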

The question is, then, can we obtain these two ingredients?

Ingredient 1: efficiently finding a “good” set of trees

Current maximum-likelihood and Bayesian phylogenetic algorithms are opposites in terms of objective and method. Maximum-likelihood algorithms systematically zoom up to the top of the likelihood surface with no regard for trees that serve as nearly-as-good explanations of the data. Bayesian algorithms explore tree space randomly, wasting effort by returning to the same trees many times, but given enough time do a good job of exploring the whole posterior region.

We decided to combine the systematic search of ML algorithms with the Bayesian objective, such that we would systematically find all of the “good” trees. To do so, we did the tree rearrangements that one usually does with these algorithms, but kept track of all of the trees that were above some likelihood threshold rather than just allowing rearrangements that result in an improvement of likelihood.
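The idea can be sketched as a breadth-first search that expands every tree above the threshold instead of only following improvements. This is a toy illustration, not the libpll-based implementation: the "trees" are integers on a line, `neighbors` stands in for tree rearrangements, and the log-likelihood values are invented.

```python
from collections import deque


def high_likelihood_set(starts, neighbors, log_likelihood, threshold):
    """Collect every state reachable from the starts through states above threshold."""
    good = {s for s in starts if log_likelihood(s) >= threshold}
    queue = deque(good)
    while queue:
        tree = queue.popleft()
        for candidate in neighbors(tree):
            if candidate not in good and log_likelihood(candidate) >= threshold:
                good.add(candidate)
                queue.append(candidate)
    return good


# Hypothetical likelihood surface with two peaks separated by a valley.
LOG_LIK = [-50, -12, -10, -11, -40, -45, -13, -9, -30, -60]


def toy_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LOG_LIK)]


one_start = high_likelihood_set([2], toy_neighbors, LOG_LIK.__getitem__, -15)
two_starts = high_likelihood_set([2, 7], toy_neighbors, LOG_LIK.__getitem__, -15)
```

From the single start the search never crosses the valley and misses the second peak, which is why a collection of starting points matters.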

Note that I said “likelihood” and not “posterior” here. In fact, by “likelihood” I mean the likelihood of the tree with the maximum-likelihood assignment of branch lengths (and other model parameters). One of the surprising results of our work is that this likelihood acts as a good proxy for the posterior when looking for this “good” set of trees.

We find that this strategy works reasonably well if one uses a collection of starting points obtained by running RAxML starting at several hundred random trees. These starting points appear to cover all of the local peaks one finds in the posterior distribution. We implemented our algorithm using the libpll library from the Stamatakis group, which was a pleasant foundation. We added a multithreading strategy that allowed different workers to spread out across tree space. We report these results in Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies, with Whidden, Claywell, Fisher, Magee, and Fourment.

Ingredient 2: efficiently estimating the per-tree marginal likelihood

The other component we needed was a way to evaluate the per-tree marginal likelihood $p(D \mid \tau)$. Please note that we are doing this marginal likelihood estimation with a single fixed tree topology at a time, which is in contrast to many applications of phylogenetic marginal likelihood estimation in which one is comparing one evolutionary model to another while integrating out the tree topology as well.

Because marginal likelihood in this formulation is a problem with continuous parameters only, there are many existing methods for estimating it. We also developed a few of our own specifically for this application, giving 19 methods in total. This immediately brought to mind the classic 1978 paper by Moler & Van Loan: Nineteen Dubious Ways to Compute the Exponential of a Matrix. Accordingly, we named our paper 19 Dubious Ways to Compute the Marginal Likelihood of a Phylogenetic Tree Topology with Fourment, Magee, Whidden, Bilge, and Minin.

We were surprised to find that some fast methods could be quite accurate, and some slow methods showed rather poor accuracy. One of the star algorithms here was a “Gamma Laplus” method devised by the group that used Laplace-like approximation to fit a Gamma distribution, which can then be used directly to compute a marginal likelihood. One can then boost the accuracy with a relatively small computational cost by adding an importance sampling step.
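To illustrate only the importance sampling step, here is a sketch on a one-parameter toy problem whose marginal likelihood is known exactly (1/4). Everything in it is an assumption made for illustration: the "likelihood" is not a phylogenetic likelihood, and the Gamma proposal parameters are picked by hand rather than fit by the Laplus procedure.

```python
import math
import random


def log_unnormalized_posterior(theta):
    """Toy stand-in for log p(D | tau, theta) + log p(theta)."""
    log_likelihood = math.log(theta) - theta  # hypothetical likelihood
    log_prior = -theta                        # Exponential(1) prior on a branch length
    return log_likelihood + log_prior


def log_gamma_pdf(x, shape, scale):
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


def importance_sampled_log_marginal(n_samples, shape, scale, rng):
    """Average the weights p(D, theta) / q(theta) over draws theta ~ q = Gamma."""
    log_weights = []
    for _ in range(n_samples):
        theta = rng.gammavariate(shape, scale)  # second argument is the scale
        log_weights.append(
            log_unnormalized_posterior(theta) - log_gamma_pdf(theta, shape, scale)
        )
    biggest = max(log_weights)
    mean_weight = sum(math.exp(lw - biggest) for lw in log_weights) / n_samples
    return biggest + math.log(mean_weight)


estimate = math.exp(
    importance_sampled_log_marginal(1000, shape=2.0, scale=0.6, rng=random.Random(0))
)
```

The closer the Gamma proposal is to the true posterior shape, the lower the variance of the weights, which is why a good fitted approximation needs relatively few samples.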

So, does it work?

Yes, combining these two strategies does work as a proof of principle. There are some caveats, though. The first caveat is that we used the Jukes-Cantor model, so we didn’t have to marginalize out model parameters other than branch lengths. This seems tractable: it would take another paper’s worth of work with more interesting variational parameterizations, but I think we could deal with substitution-model and rate-variation parameters.

The second caveat is a more inherent issue for any method that attempts to individually explore every “good” tree. Sometimes tree posteriors are just really diffuse! For example, there are some data sets for which our extremely long “Golden” MrBayes runs never once sampled the same topology twice.

In fact, thinking about what to do with these very diffuse posterior distributions is what led us down the road of thinking more about density estimation on the set of phylogenetic trees, which then in turn led us to investigate full variational Bayes phylogenetic inference, which I’ll write about in an upcoming post.

This was a big and complex project that required a lot of hard work to pull off. For the first paper, I’d like to highlight the stamina of Chris Whidden and the programming prowess of Brian Claywell, as well as thank Thayer Fisher for starting off the project as a summer undergraduate project. For the second paper, I’d like to thank Mathieu Fourment, who implemented every one of the 19 methods with almost unbelievable gumption and skill, Andy Magee, who is the unsung hero of the project for contributing implementations and analysis, and Arman Bilge, who did important work early on for the Laplace-type methods. Vladimir was as usual a wonderful collaborator.

Generalizing tree probability estimation via Bayesian networks

matsen@fredhutch.org (Frederick A. Matsen) — Wed, 05 Dec 2018 00:00:00 UTC

Posterior probability estimation of phylogenetic tree topologies from an MCMC sample is currently a pretty simple affair. You run your sampler, you get out some tree topologies, you count them up, normalize to get a probability, and done. It doesn’t seem like there’s a lot of room for improvement, right?

Wrong.

Let’s step back a little and think like statisticians. The posterior probability of a tree topology is an unknown quantity. By running an MCMC sampler, we get a histogram, the normalized version of which will converge to the true posterior in the limit of a large number of samples. We can use that simple histogram estimate, but nothing is stopping us from taking other estimators of the per-topology posterior distribution that may have nicer properties.

For real-valued samples we might use kernel density estimates to smooth noisy sampled distributions, which may reduce error when sampling is sparse. Because the number of phylogenies is huge, MCMC is computationally expensive, and we are naturally impatient, one is often in the sparsely-sampled regime for topology posteriors. Can we smooth out stochastic under- and over-estimates of topology posterior probabilities by using similarities between trees? (See upper-right cartoon.) This smoothing should also extend the posterior to unsampled topologies.

The question is, then, how do we do something like a kernel density estimate in tree space? In a beautiful line of work started by Höhna and Drummond and extended by Larget one can think of each tree as being determined by local choices about how groups of leaves (“clades”) get split apart recursively down the tree. Their work assumed independence between these clade splitting probabilities.

This is a super-cool idea, but the formulation didn’t seem to work well for tree probability estimation from posterior samples on real data. For example, Chris Whidden and I noticed that this procedure underestimated the posterior for sub-peaks and overestimated the posterior between peaks. This says that the conditional independence assumption on clades made by this method was too strong. But this doesn’t doom the entire approach! We just need to take a more flexible family of distributions over phylogenetic tree topologies.

I suggested this direction to Cheng Zhang, a postdoc in my group, and within a week he figured out the right construction that generalized this earlier work but allowed for much more complex distributions. Cheng’s construction parameterizes a tree in terms of “subsplits,” which are the choices about how to split up a given clade. To allow for more complex distributions than the previous conditional independence assumptions allow, he encodes tree probabilities in terms of a collection of subsplit-valued random variables that are placed in a Bayesian network.

The simplest such network enables dependence of a subsplit on the parent subsplit, which in tree terms means that when assigning a probability to a given subsplit we are influenced by what the sister clade is. More complex networks can encode more complex dependence structures. To our surprise, the simplest formulation worked well: allowing split frequencies for clades to depend on the sister clade gives a sufficiently flexible set of distributions to be able to fit complex tree-valued distributions.

In the simplest version one can write out the probability for a given tree like so:

where the qs are inferred probability distributions that we call conditional subsplit distributions.
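As a rough sketch of how such a product of conditional subsplit probabilities can be evaluated for a rooted tree, consider the code below. The nested-tuple tree encoding, the layout of the table `q` (keyed by parent subsplit and the clade being split), and all of the numbers are assumptions made for illustration, not the representation used in the paper.

```python
def clade(tree):
    """Leaf set below a node; a tree is a leaf name or a pair of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return clade(left) | clade(right)


def subsplit(tree):
    left, right = tree
    return frozenset([clade(left), clade(right)])


def tree_probability(tree, q, parent=None):
    # Leaves and two-leaf clades admit only one resolution, so they contribute 1.
    if isinstance(tree, str) or len(clade(tree)) <= 2:
        return 1.0
    split = subsplit(tree)
    # Probability of this subsplit given the parent subsplit (None at the root).
    prob = q.get((parent, clade(tree)), {}).get(split, 0.0)
    for child in tree:
        prob *= tree_probability(child, q, parent=split)
    return prob


def ss(a, b):
    """Helper: build a subsplit from two strings of single-letter leaf names."""
    return frozenset([frozenset(a), frozenset(b)])


# Hypothetical conditional subsplit distributions for four taxa.
q = {
    (None, frozenset("ABCD")): {ss("AB", "CD"): 0.6, ss("ABC", "D"): 0.4},
    (ss("ABC", "D"), frozenset("ABC")): {ss("AB", "C"): 0.9, ss("A", "BC"): 0.1},
}

balanced = (("A", "B"), ("C", "D"))
caterpillar = ((("A", "B"), "C"), "D")
p_balanced = tree_probability(balanced, q)
p_caterpillar = tree_probability(caterpillar, q)
```

Because the table is indexed by the parent subsplit, the probability of splitting the clade ABC can differ depending on which sister clade it has. Any tree assembled from subsplits in the table gets a nonzero probability, whether or not that exact tree was ever sampled.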

In addition to more complex dependence structure, Cheng’s approach also more formally treats this whole procedure as an exercise in estimating an approximating distribution. Where previous efforts estimated probabilities by counting, one can do better in the unrooted case for subsplit networks by optimizing the parameterized distribution on trees to match an empirical sampled distribution of unrooted trees via expectation maximization. One can also take some weak priors to handle the sparsely-sampled case.

We’ve written up these results in a paper that has been accepted to the NeurIPS (previously NIPS¹) conference as a Spotlight presentation. I’m proud of Cheng for this accomplishment, but consequently the paper is written more for a machine-learning audience than for a phylogenetics audience. If you aren’t familiar with the Bayesian network formalism it may be a tough read. The key thing to keep in mind is that the network (paper Figure 2) encodes the tree as a collection of subsplits assigned to the nodes of the network, and the edges describe probabilistic dependence. For example, the reason we can think of a conditional subsplit distribution as conditioning on the sister clade (see figure above) is because parent-child relationships in the subsplit Bayesian networks must take values such that the child subsplit is compatible with the parent subsplit.

If you don’t read anything else, flip to Table 1 and check out how much better these estimates are on big posteriors from real data than what everyone does right now, which is just to use the simple fraction. Magic! Hopefully it makes sense that we are smoothing out our MCMC posterior, and extending it to unsampled trees. If you have questions, I hope you will head on over to Phylobabble and ask them— let’s have a discussion!

Subsplit Bayesian networks open up a lot of opportunities for new ways of inferring posteriors on trees. Stay tuned!

We are always looking for folks to contribute in this area. If you’re interested, get in touch!

1: I’m happy to report that the NIPS conference has changed its name to NeurIPS. This is an important move that signals at least a desire by the board for diversity and inclusion in machine learning. We can all hope that it is followed with concrete action.

Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity

matsen@fredhutch.org (Frederick A. Matsen) — Tue, 15 May 2018 00:00:00 UTC

High-throughput sequencing of our adaptive immune repertoires holds great promise for understanding immune state. These sequences implicitly contain a wealth of information on past and present exposures to infectious and autoimmune diseases, to environmental stimuli, and even to tumor-derived antigens. In principle, we should be able to use these sequences of rearranged receptors to infer their eliciting antigens, either individually or collectively.

We’re starting to see neat progress in these areas for T cell receptors (TCRs). Some recent studies compare TCR repertoire between individuals who do or do not have some immune state, such as an immunization, an autoimmune disease or a viral infection and work to find sequence-level differences between the repertoires. The Walczak-Mora team recently upped the bar by not requiring a control cohort. There has also been interesting progress on predicting epitope specificity from TCR sequence using structurally-informed sequence analysis.

Phil Bradley, just down the hall from us, wanted to take a different approach, asking: given appropriate statistical analysis of a sufficiently large data set, can we infer pathogen-responsive TCRs from co-occurrence and HLA information alone? (If you don’t remember about HLA, it determines the sequence of MHC, the hot dog bun presenting peptides to T cells for recognition.) He showed that this indeed was the case, one example of which is shown in the figure above. Each point is a cluster of TCR sequences, where clustering is performed based on both co-occurrence and on TCR sequence similarity. Only TCR sequences that are significantly associated with an HLA type are allowed to participate in the clustering, and only clusters that were significant in terms of family-wise error rate are shown. These clusters are plotted with respect to the cluster size and a co-occurrence score.

The surprising result is that this procedure, which knows nothing about what stimulated the TCRs to expand, identifies previously-labeled TCR sequences corresponding to certain immune states. You probably recognize EBV, MS, and CMV, but we also see B19=parvovirus B19, INF=influenza, RA=rheumatoid arthritis, T1D=type 1 diabetes, and others. That’s pretty neat! This, along with other fun surprises, is published in eLife.

I made very minor contributions to this manuscript, but wanted to write about it because I think it’s an exciting advance. This proof of concept is definitely motivating us to think harder about what sorts of statistical frameworks would be useful for doing this sort of research more comprehensively. Thanks to Will and Phil, to the Hansen lab for the neat data, and to the study participants.

The Bayesian optimist's guide to adaptive immune receptor repertoire analysis

matsen@fredhutch.org (Frederick A. Matsen) — Sat, 12 May 2018 00:00:00 UTC

Immune receptor sequencing is stochastic through and through. We have cells with random V(D)J rearrangements that are stimulated through some random process of exposures, which lead to some random amount of expansion, and in the B cell case there is some random process of mutation and selection. So why don’t we use methods incorporating that uncertainty into our analysis?

We’ve tried to do this in our work, and have made some progress, but there is so much left to be done. When Sarah Cobey and Patrick Wilson kindly invited me to contribute to their special issue of Immunological Reviews, I knew I wanted to step back and ask:

If computation was no barrier, how would we design an analysis framework that integrated out uncertainty in unknown quantities and took advantage of the hierarchical structure inherent in immune receptor data?

I teamed up with Branden Olson, a Statistics PhD student in the lab, and went to work. It was a fun exercise to think through all of the steps of immune repertoire development and ask: what is the most realistic model under which inference should be possible, and what is the most realistic model for which we can perform simulation? This was more effort than anticipated, but 230 references later the final version is now up on arXiv and accessible for free (though I understand if you want to wait a few months to pay $38 and get it from the journal website).

In addition to dreaming up research directions, I wanted to explain to my immunologist pals why I think probabilistic analysis methods are crucial, and describe the basics of Bayesian analysis via simple metaphors. Ideally this will lead to a little more crosstalk between communities. Traditionally, statisticians and lab biologists have been on independent tracks (see image above) even though they investigate the same underlying phenomena. I hope that in the future we can unify these tracks by developing statistical models based on mechanism and design experiments based on statistical inferences.

I also hope that this serves as an invitation to the computational statistics community. As we say at the end: “The computational statistician interested in immune receptor modeling is blessed with a complex biological system to analyze, intractable computational problems heaped on top of one another, and an ever-expanding collection of data sets generated from various in-vivo and in-vitro perturbations.”

Come play!

Benchmarking tree and ancestral sequence inference for B cell receptor sequences

matsen@fredhutch.org (Frederick A. Matsen) — Wed, 02 May 2018 00:00:00 UTC

Phylogenetic tools, in particular for ancestral sequence reconstruction, get used a lot in the B cell receptor (BCR) sequence analysis world. For example, they get used to reconstruct intermediate antibodies that then get synthesized in the lab and tested for binding (Wu et al., 2011). But how well do phylogenetic tools work in this parameter regime? Although there have been countless benchmarking studies for phylogenetics, the case of B cell sequence evolution is different from the usual setting for phylogenetics:

Sampling and sequencing, especially for direct sequencing of germinal centers, is dense compared to divergence between sequences. Because of the resulting distribution of short branch lengths, zero-length branches and multifurcations representing simultaneous divergence are common.
The somatic hypermutation (SHM) process in affinity maturation is a highly nucleotide-context-dependent process.
Repertoire sequencing typically focuses on the coding sequence of antibodies, which are under very strong selective constraint. This contrasts with the neutral evolution assumptions of most phylogenetic algorithms, as well as the simulation software assumptions traditionally used for phylogenetics benchmarks.
In contrast to typical phylogenetic problems where the root sequence is unknown, one has significant information about the root sequence for BCR sequences: namely, that it’s a recombination of V, (D), and J genes, which are somewhat well characterized.

BCR sequences also offer additional opportunities for validation. Specifically, the irreversible class switching process gives us a marker that should only go in one direction along a tree branch. If it goes another direction, this indicates problems with the tree reconstruction.

Before I sketch the results of our analysis, I should mention differences between our work and another recent paper that also set up a benchmark of phylogenetic methods. Much of that paper concerns the results of phylogenetic inference using a “toy” clonal family inference method with necessarily bad performance, whereas here we assume that clonal families have been properly inferred. In addition, we simulate sequences under selection using an affinity-based model (which we show makes the inferential problem significantly more difficult), we compare accuracy of ancestral sequence inference, we include additional software tools (several of which are BCR-specific), and we use class-switching data as a further non-simulation means of benchmarking methods.

For this work, Kristian cooked up a simulator for B cell affinity maturation. Although quite a lot of simulators have been written, going back to Clone, none of these did what we wanted, which was to use a context model to simulate mutations, and then use the corresponding amino acid sequences for a selection step. Kristian’s model is simple, but nonetheless we feel that it does an appropriate job of simulating sequences for the purposes of benchmarking methods. We show that the simulated data broadly speaking “looks like” germinal center data.

You can read the full results on bioRxiv, but here are the things that surprised us:

Picking between equally parsimonious trees using a context-sensitive model works surprisingly well. This makes us want to continue working on incorporating full context models into phylogenetic methods.
PHYLIP is quite a good choice! I thought that the BCR community was fairly behind the times by not using some of the more modern maximum-likelihood packages, but IQ-TREE is the only recently-developed package that does ML on trees and ancestral sequence inference, and it performs significantly worse (although it’s much faster and nicer to use!).
IgPhyML is a cool project that works to integrate hotspot motifs and Goldman-Yang codon modeling, which it does by marginalizing out hotspot motifs when they extend across a codon boundary. It does reasonably but not as well as we expected, which may be because we are benchmarking on the moderately-sized trees with which we have experience rather than the very deep broadly-neutralizing trees investigated in the IgPhyML paper.
The class-switching data gave noisier results than we had hoped for, giving error bars of the same magnitude as differences between methods. However, it confirmed that picking equally parsimonious trees using a context-sensitive model increases accuracy. Perhaps with better sampling or just more data we can learn more from class-switching data in the future.

There’s quite a lot more to do here, both in terms of method development and benchmarking, and we look forward to watching this area mature in the coming years. Thanks to Kristian for his great work!

Predicting B cell receptor substitution profiles using public repertoire data

matsen@fredhutch.org (Frederick A. Matsen) — Thu, 19 Apr 2018 00:00:00 UTC

Can we predict how sites of an antibody will tolerate amino acid substitutions? Kristian Davidsen posed this question shortly after he arrived in my group, pointing out that being able to do such prediction would be quite useful. For example, engineered antibodies sometimes aggregate into clumps or have other properties that make them useless for mass production. If we could figure out ways to change the amino acid sequence of an antibody without changing binding properties, that could help us avoid aggregation and make a more useful antibody.

How to start to address this complex and high-dimensional question? Although people have started to do deep mutational scanning on antibodies, this type of data is hard to come by. On the other hand, B cell repertoire (i.e. antibody-coding) sequence data is becoming plentiful. B cells undergo affinity maturation to improve binding in collections of sequences called “clonal families” grouped by naive ancestor sequence (more background here). Although it’s not quite the same, we can use the frequency of an amino acid at a given site in that clonal family as a proxy for the suitability of that amino acid for an antibody binding the same target. Or perhaps such a clonal-family amino-acid frequency is simply an interesting object in itself.
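The per-site amino-acid frequency of a clonal family is simple to compute once the sequences are aligned. A minimal sketch, assuming equal-length aligned amino-acid strings (the four sequences are invented):

```python
from collections import Counter


def site_frequencies(aligned_sequences):
    """Return, for each site, a dict mapping amino acid -> frequency in the family."""
    n = len(aligned_sequences)
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        profile.append({aa: count / n for aa, count in counts.items()})
    return profile


# Hypothetical clonal family of four aligned sequences.
family = ["CARDY", "CARDY", "CAKDY", "CTRDY"]
profile = site_frequencies(family)
```

Here the third site comes out as 75% R and 25% K, while the conserved sites have a single amino acid at frequency 1.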

In any case, our goal became: given a single sequence from a clonal family, can we predict the amino acid frequency of the collection of sequences in the clonal family? We follow Sheng, Schramm et al. (2017) in calling this sort of thing a substitution profile. Inferring a substitution profile from a single sequence might sound hard or impossible, but several features of the affinity maturation process lean in our favor:

There are a finite number of germline ancestor sequences from which diversification begins, and we can do a good job of inferring from which ancestor a given B cell sequence derives.
Simply because of the mutation process, some sites are more likely to mutate than others (recently covered here).
There’s lots of other repertoire data that we can use to watch the affinity maturation process.

This last one is sort of special, and deserves a bit of explanation. If we had a database containing every B cell sequence that had ever occurred, one could simply look for clonal families containing the sequence given to us, and take the average amino acid profile of those clonal family sequences. Unfortunately we don’t have access to such a database, but we can at least look for somewhat similar sequences and learn from their substitution profiles.

The previous Sheng-Schramm work, as well as contemporaneous work by Kirik et al. (2017), also indicates that various germline genes diversify in various characteristic ways (this sentiment also appears in Duncan’s first B cell paper and I’m sure many other previous works). This tells us that a profile based on germline gene identity should also inform a predicted substitution profile. Also, the context-sensitive neutral process given a germline gene should be helpful.

How do we combine these various sorts of information, especially considering that what is helpful for prediction at one site might not be helpful for another? Well, our group, consisting of Kristian, Amrit Dhar, and Vladimir Minin, decided to use a penalized tensor regression framework. That sounds fancy, but it just means that a single profile is a weighted linear combination of the profiles from each of the sources of information (see picture above). The weights may differ from site to site, but the kind of penalization we put on keeps them from changing too much between neighboring sites. It also zeroes out coefficients that don’t seem to be helping out-of-sample prediction. We find that different sources of information are useful for different parts of the B cell receptor sequence, in a way that corresponds to intuition about the “framework” and “complementarity determining” regions.
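The prediction step, a per-site weighted linear combination of source profiles, can be sketched as follows. This shows only how fitted weights would be applied. The penalized fitting that chooses them is not shown, and the source names, weights, and profiles are hypothetical.

```python
def combine_profiles(source_profiles, site_weights):
    """Weighted sum of per-site profiles.

    source_profiles[s][i] maps amino acid -> frequency for source s at site i;
    site_weights[s][i] is the weight given to source s at site i.
    """
    n_sites = len(next(iter(source_profiles.values())))
    combined = []
    for i in range(n_sites):
        site = {}
        for source, profiles in source_profiles.items():
            weight = site_weights[source][i]
            for aa, freq in profiles[i].items():
                site[aa] = site.get(aa, 0.0) + weight * freq
        combined.append(site)
    return combined


# Two hypothetical information sources over two sites.
source_profiles = {
    "germline": [{"A": 1.0}, {"R": 1.0}],
    "similar_sequences": [{"A": 0.5, "T": 0.5}, {"R": 0.25, "K": 0.75}],
}
site_weights = {
    "germline": [0.5, 0.0],           # zeroed out at the second site
    "similar_sequences": [0.5, 1.0],
}
predicted = combine_profiles(source_profiles, site_weights)
```

A weight of zero at a site mirrors what the penalty does when a source does not help out-of-sample prediction there.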

In any case, we show that integrating these diverse sources of information can help prediction, and provide a pre-trained prediction algorithm to do so. The code and parameters are on Github and the paper is on arXiv. So have at it with your sequences, and let us know how it fares!

I think that predicting substitution profiles is an interesting and useful goal. It did take a little getting used to, because we previously worked super hard to get per-residue natural selection estimates for B cell receptors by carefully separating the mutation and selection processes; here these substitution profiles just smash all that complexity down to a simpler object. There’s more to be done here: as data sets get bigger and machine learning algorithms get smarter, I look forward to seeing prediction improve! Thanks to Amrit, Kristian, and Vladimir for a fun project.

Postdoc opening to learn about antibody development during HIV superinfection

matsen@fredhutch.org (Frederick A. Matsen) — Wed, 10 Jan 2018 00:00:00 UTC

Please see https://b-t.cr/t/506 for details.

Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data

matsen@fredhutch.org (Frederick A. Matsen) — Fri, 01 Dec 2017 00:00:00 UTC

Every B cell receptor sequence in a repertoire came from a V(D)J recombination of germline genes. Each individual has only certain alleles of these genes in their germline, and knowing this set improves the accuracy of all aspects of BCR sequence analysis, from alignment to phylogenetic ancestral sequence reconstruction. This germline allele set can be estimated directly from BCR sequence data, and it’s time to treat such estimation as part of standard BCR sequence analysis pipelines.

This central message is not new, but it’s worth emphasizing because doing germline set inference is not part of most current studies of B cell receptor (BCR) sequences.

Indeed, the most common way to annotate sequences is to align them one by one to the full set of alleles present in the IMGT database, which has hundreds of alleles. Each individual has only a fraction of these alleles in their genome.

Unsurprisingly, aligning sequences one by one to the whole IMGT set can cause problems. Imagine that A and B are two germline alleles in IMGT that are similar to one another. Sequences deriving from germline allele A can somatically hypermutate to look more similar to the B allele than the A allele from which they came. If we allow A and B in our germline repertoire, such sequences will be incorrectly annotated as being from B when they are from A. This will certainly lead to an incorrect estimation of the naive sequence from which they came.
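The A-versus-B scenario can be reproduced with a toy example. The sequences below are made up, and plain Hamming distance stands in for a real aligner, so treat this as an illustration of the failure mode rather than of any actual annotation tool.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def annotate(read, germline_set):
    """Assign the read to the closest allele in germline_set (name -> sequence)."""
    return min(germline_set, key=lambda name: hamming(read, germline_set[name]))


# Hypothetical alleles that differ at two positions.
allele_a = "ACGTACGT"
allele_b = "ACGTTCGA"
# A read that truly derives from A but has hypermutated at three positions,
# two of which happen to match B.
read = "GCGTTCGA"

full_database = {"A": allele_a, "B": allele_b}
per_sample_set = {"A": allele_a}  # this individual does not carry B
```

With the full database the read is assigned to B (one mismatch versus three), while restricting to the per-sample set recovers the true origin A.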

In addition, it’s known through the work of many groups that the total set of germline genes is much larger than that represented in IMGT. This is not surprising given that this region is tricky to sequence directly, and that so far genetic studies have been primarily done on people of European ancestry. Here again, if we are missing a sequence from our germline set, we will have problems with all of our downstream analyses.

Thus, we should be estimating per-sample germline sets for BCR sequence data. This is not a trivial task. In 2010, Scott Boyd and others were the first to use high-throughput sequencing data of rearranged BCRs to estimate per-sample germline sets with a combination of computation, expert judgement, and statistics. In 2015, the Kleinstein group made a big step by developing TIgGER, an automated method for inferring germline sets that weren’t too far from existing alleles, and more recently the Hedestam group developed IgDiscover, a method that could start more “from scratch” for species where we have little or no germline information.

The motivation for Duncan’s work came from analyzing sequence data from diverse sources, and seeing clear evidence of alleles that were not represented in IMGT. He tried the existing tools but became frustrated with their usability. He started by re-implementing TIgGER, and then realized that he could use the same input information (their “mutation accumulation” plot depicted above) but in a way that more directly tests for the presence of new alleles, by considering the goodness of fit for one- vs two-component fits. In classic Duncan fashion, he has done a ton of validation, varying many different parameters in his simulation and also comparing the results of the different methods on experimental data sets. The work is now up on arXiv and is part of his partis suite of repertoire analysis tools.

There’s still a lot to be done here, and our knowledge of this highly diverse and important locus will continue to improve as more sequencing data of all types comes in. This is one example of many showing how analysis of a whole data set at once is more powerful for each individual sequence than one-at-a-time analysis of sequences.

Survival analysis of DNA mutation motifs with penalized proportional hazards

matsen@fredhutch.org (Frederick A. Matsen) — Tue, 14 Nov 2017 00:00:00 UTC

We are equipped with purpose-built molecular machinery to mutate our genome so that we can become immune to pathogens. This is truly a thing of wonder.

More specifically, I’m talking about mutations in B cells, the cells that make antibodies. Once a randomly-generated antibody expressed on the outside of the B cell finds something it’s good at binding, the cell boosts the mutation rate of its antibody-coding region by about one million fold. Those that have better binding are rewarded by stimulation to divide further. The result of this Darwinian mutation and selection process is antibodies with improved binding properties.

The mutation process is wonderfully complex and interesting. Being statisticians, we paid the highest tribute we can to a process we think is beautiful: we developed a statistical model of it. This work was led by the dynamic duo of Jean Feng and David Shaw, while Vladimir Minin, Noah Simon and I kibitzed. Our model is known in statistics as a type of proportional hazards model. These models were introduced in Sir David Cox’s paper Regression Models and Life-Tables, which, with over 4600 citations, is the second most cited paper in statistics.

These models are typically used to infer rates of failure, such as that of humans getting disease. During our life span we get a sequence of diseases, some of which predispose us to other diseases. By considering sequences of diseases across many individuals, we can use these proportional hazards models to infer the rate of getting various diseases given disease history.

There is an analogous situation for B cell sequences in that the mutation process depends significantly on the identity of the nearby bases. We can observe lots of mutated sequences, and do a similar sort of inference: when a position mutates, it changes the mutability of nearby bases. Unfortunately we don’t know the order in which the mutations occurred, and thus don’t know what sequences had increased mutability, so we have to do Gibbs sampling over orders. The paper describing these methods and some results is published in Annals of Applied Statistics.
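Here is a small sketch of why the unknown order matters. It assumes a simplified model in which each position mutates at most once, the hazard of a position depends only on the 3-mer centered on it, and the identity of the new base is ignored; the hazard table and sequences are hypothetical. For a handful of mutations we can enumerate all orders exactly, which is what Gibbs sampling approximates when there are too many to enumerate.

```python
import itertools
import math


def motif(seq, i):
    """3-mer centered at position i, padded with 'N' at the sequence ends."""
    padded = "N" + seq + "N"
    return padded[i:i + 3]


def order_likelihood(start, mutations, order, log_hazard):
    """Probability of seeing the mutations in this order.

    mutations maps position -> new base; order is a tuple of those positions.
    At each step the next position is chosen with probability proportional to
    its hazard among the positions that have not yet mutated.
    """
    seq = list(start)
    at_risk = set(range(len(start)))
    lik = 1.0
    for pos in order:
        current = "".join(seq)
        hazards = {
            i: math.exp(log_hazard.get(motif(current, i), 0.0)) for i in at_risk
        }
        lik *= hazards[pos] / sum(hazards.values())
        seq[pos] = mutations[pos]  # mutating changes the motifs of the neighbors
        at_risk.remove(pos)
    return lik


def order_posterior(start, mutations, log_hazard):
    """Exact distribution over orders by enumeration (feasible only when small)."""
    orders = list(itertools.permutations(mutations))
    liks = [order_likelihood(start, mutations, o, log_hazard) for o in orders]
    total = sum(liks)
    return {o: lik / total for o, lik in zip(orders, liks)}


# Toy data: "AGC" is a made-up hot motif; positions 1 and 2 are observed mutated.
posterior = order_posterior("AGCA", {1: "T", 2: "T"}, {"AGC": 1.0})
```

In this toy case the order in which position 1 mutates first is favored, because once position 2 mutates the hot motif around position 1 is destroyed.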

We were inspired by the very nice work of the Kleinstein lab developing similar sorts of models using simpler methods. However, we wanted a more flexible modeling framework and for the complexity of the models to automatically scale to the signal in the data, which we did using penalization with the LASSO. What you see in the figure above is how we can set up a hierarchical model with a penalty that zeroes out 5-mer terms when they don’t contribute anything above the corresponding 3-mer term (the last base being unimportant gives the block-like structure, while when the first base is unimportant it gives the 4-fold repetitive pattern you can see when zooming out). We are also indebted to Steve and his team, especially Jason Vander Heiden, for supplying us with sequence data. They are a class act.

There’s a lot of interest in context-sensitive mutation processes these days, such as Kelly Harris’ work on how we can watch context-sensitive mutabilities change through evolutionary time, and Ludmil Alexandrov’s work on mutation processes in cancer. In both of these cases, they are in the process of transitioning from a statistical description of these processes to linking them with specific mutagens and repair processes.

Here too we would like to use statistics to learn more about the mechanisms behind these context-sensitive mutations. What’s neat about the framework that Jean and David developed is that now we can design features that correspond to specific mechanistic hypotheses and test how much they impact mutation rates. Stay tuned!

Using genotype abundance to improve phylogenetic inference

matsen@fredhutch.org (Frederick A. Matsen) — Tue, 05 Sep 2017 00:00:00 UTC

When doing computational biology, listen to biologists. I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.

In this particular case, the starting point was the stunningly beautiful work of Gabriel Victora’s lab visualizing germinal center dynamics in living mice. For those not yet initiated into the beauty of B cell repertoire, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binder reproduces more. As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts. Such single-cell sequencing, and consequent abundance information, is now becoming commonplace. How should we use this abundance information in phylogenetics?

Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered. Indeed, they were building trees by hand, using several criteria about what makes for a believable evolutionary scenario. One of their intuitions was that more abundant genotypes have more opportunity to leave mutant descendants. Therefore, when we are doing inference, we should prefer trees that attach branches to frequently observed genotypes compared to less frequently observed genotypes (see picture, in which the frequency of a given genotype is the number inside the circle; we call this structure a genotype collapsed tree or GCtree).

To have an objective computational method we need to formalize this intuition. Will DeWitt, Vladimir Minin, and I formulated it in terms of an “infinite type” branching process, in which every mutation creates a new type. We can augment existing sequence-based optimality criteria with the likelihood of the tree under our branching process model. In our case we decided to show that this works by ranking maximum-parsimony trees (there are often many equally parsimonious trees). Parsimony is in wide use in the B cell analysis community because it is a defensible choice when sampling is dense relative to mutations (as in the case of germinal centers), and it allows inference of zero branch lengths (leading to inference of sampled ancestral genotypes and multifurcations). We showed under simulation that more highly ranked trees were more correct than lower ranked trees. With the paired heavy and light chain data from the Victora lab, we were also able to do a biological validation by showing that trees that should be the same are more similar when using our algorithm than without. The result is now up on arXiv.

If you are muttering to yourself that we should be using this model as a prior for a Bayesian analysis, we hear you. Hopefully this motivates additional work in that sphere for abundance-based models. We do note that the limited amount of mutation described before will lead to a fairly flat posterior. Furthermore, although one can infer sampled ancestors using an RJMCMC and multifurcations using phycas, these two features do not exist yet under one roof.

Will did a great job with this project, which is a nice complement to his existing publications as he heads into the UW Genome Sciences PhD program! We had a great time working with Luka and Gabriel, and look forward to more collaboration in the future.

Probabilistic Path Hamiltonian Monte Carlo

matsen@fredhutch.org (Frederick A. Matsen) — Mon, 26 Jun 2017 00:00:00 UTC

Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions. HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state. These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.

Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down. When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction. We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal. Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the Hamiltonian in HMC). The acceptance ratio is determined only by how well our numerical integration performs: better numerical integration means a higher acceptance probability.
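The soap-in-a-bathtub recipe can be written down in a few lines for a one-dimensional target. This is a generic textbook sketch of HMC with a leapfrog integrator, using a standard normal posterior as a stand-in; it is not the phylogenetic sampler discussed below, and the step size and trajectory length are arbitrary choices.

```python
import math
import random


def U(q):
    """Potential energy: negative log posterior of a standard normal."""
    return 0.5 * q * q


def grad_U(q):
    return q


def leapfrog(q, p, grad_U, step, n_steps):
    """Numerically integrate the frictionless slide for n_steps of size step."""
    p = p - 0.5 * step * grad_U(q)      # half step for momentum
    for i in range(n_steps):
        q = q + step * p                # full step for position
        if i < n_steps - 1:
            p = p - step * grad_U(q)    # full step for momentum
    p = p - 0.5 * step * grad_U(q)      # final half step for momentum
    return q, p


def hmc_step(q, U, grad_U, step, n_steps, rng):
    """One HMC proposal plus Metropolis accept/reject."""
    p = rng.gauss(0.0, 1.0)             # random initial momentum
    q_new, p_new = leapfrog(q, p, grad_U, step, n_steps)
    # With exact integration the total energy would be conserved and every
    # proposal accepted; the acceptance test only corrects for integration error.
    energy_error = (U(q_new) + 0.5 * p_new ** 2) - (U(q) + 0.5 * p ** 2)
    accept = rng.random() < math.exp(min(0.0, -energy_error))
    return (q_new if accept else q), accept
```

Shrinking `step` reduces the energy error and so raises the acceptance probability, at the cost of more gradient evaluations per proposal.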

Vu had been noodling around with phylogenetic HMC when we heard that Arman Bilge (at that time in Auckland) had an implementation as well. These implementations not only moved through branch length space according to usual HMC, but also moved between topologies. They did this as follows: once a branch length hits zero, leading to four branches joined at a single node, one can regroup those branches together randomly in another configuration and continue. (If you are a phylogenetics person, this is a random nearest-neighbor interchange around the zero length branch.) This randomness, which is rather different than the deterministic paths of classical HMC once a momentum is drawn, is why we call the algorithm Probabilistic Path Hamiltonian Monte Carlo (PPHMC).
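As a rough sketch of the topology move, consider a single internal branch separating four subtrees. The representation below (a pair of pairs of labels) and the choice to pick uniformly among all three groupings, reflect the branch length, and negate its momentum are simplifying assumptions for illustration, not a transcription of either implementation.

```python
import random


def resolutions(a, b, c, d):
    """The three ways to group four subtrees around one internal branch."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def move_branch(grouping, length, momentum, step, rng):
    """Advance one branch length; on hitting zero, regroup at random and bounce.

    Returns the (possibly new) grouping, the new nonnegative branch length,
    and the (possibly negated) momentum.
    """
    new_length = length + step * momentum
    if new_length >= 0.0:
        return grouping, new_length, momentum
    (a, b), (c, d) = grouping
    # The branch reached zero partway through the step: the four subtrees meet
    # at a single node, so pick a grouping at random (a random NNI) and spend
    # the rest of the step growing the branch again.
    new_grouping = rng.choice(resolutions(a, b, c, d))
    return new_grouping, -new_length, -momentum
```

The call to `rng.choice` is where the path stops being a deterministic function of the initial momentum.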

The primary challenge in theoretical development is that the PPHMC paths are no longer deterministic. Thus concepts such as reversibility and volume preservation, which are typical components of correctness proofs for HMC, need to be generalized to probabilistic equivalents. Vu had to work pretty hard to develop these elements and show that they led to ergodicity.

On the implementation front, Arman was also working hard to build an efficient sampler. However, the HMC integrator had difficulty going from one tree topology to another without incurring substantial error. We thrashed around for a while trying to improve things with a “careful” integrator that would find the crossing time and perhaps re-calculate gradients at that time, but proving that such a method would work seemed very hard.

Then, magically, our newest postdoc Cheng Zhang showed up and saved us with a smoothing surrogate function. This surrogate exchanges the discontinuity in the derivative for discontinuity in the potential energy, but we can deal with that using a “refraction” method introduced by Afshar and Domke in 2015. This approach allows us to maintain a low error, and thus make very long trajectories with a high acceptance rate.

I’m happy to announce that our manuscript has been accepted to the 2017 International Conference on Machine Learning. Practically speaking, this work is definitely a proof of concept. We have taken an algorithm that was previously only defined for smooth spaces and extended it to orthant complexes, which are basically Euclidean spaces with boundary glued along those boundaries in intricate ways. Our implementation is not fully optimized, but even if it were I’m not sure that it would out-compete good old MCMC for phylogenetics without some additional tricks.

To know if this flavor of sampler is going to be useful, we really need to better understand what I call the local to global question in phylogenetics. That is, to what extent does local information tell us about where to modify the tree? This is straightforward for posteriors on Euclidean spaces: the gradient points us towards a maximum. But for trees, does the gradient (local information) tell us anything about what parts of the tree should be modified (global information)? We’ll be thinking about this a lot in the coming months!

Smart proposals win for online phylogenetics using sequential Monte Carlo.

matsen@fredhutch.org (Frederick A. Matsen) — Wed, 07 Jun 2017 00:00:00 UTC

Sometimes projects take years to bear fruit.

As I described previously, Aaron Darling and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for online Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences. In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting. The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”. These two projects have just recently resulted in intertwined submitted papers.

The SMC implementation, which we called sts for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it. One of the things that kept us from wrapping it up was problems with particle degeneracy, which is as follows. I think of SMC as a probabilistically correct version of evolutionary computation, in which trees get to reproduce if they have a high posterior. Every time we have a new sequence, we attach it to our existing population of trees via some proposal distribution suggesting where to add it. Particle degeneracy, then, means that you obtain a low diversity population due to a few trees “taking over” the population because they are significantly better than the rest.
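Particle degeneracy is easy to see numerically. The sketch below uses the standard effective sample size formula and plain multinomial resampling; the particle labels and weights are invented, and sts itself may use a different resampling scheme.

```python
import random


def effective_sample_size(weights):
    """(sum of weights)^2 / (sum of squared weights).

    Equals the number of particles when weights are equal, and approaches 1
    when a single particle carries almost all of the weight.
    """
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def resample(particles, weights, rng):
    """Multinomial resampling: particles reproduce in proportion to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


trees = ["tree1", "tree2", "tree3", "tree4"]
balanced = [1.0, 1.0, 1.0, 1.0]
degenerate = [1000.0, 1.0, 1.0, 1.0]  # one tree is far better than the rest
```

With the balanced weights the effective sample size is 4; with the degenerate weights it is barely above 1, and resampling returns a population that is almost entirely copies of `tree1`.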

In working with sts, we found that we could mitigate the effect of particle degeneracy by developing “smart” proposals that use the data and corresponding likelihood function to decide where to try next. This is in contrast to previous work by Bouchard-Côté et al 2012, which shows that for a different formulation of SMC based on subtree merging one does better with simple proposals (compare Teh et al. 2008). Our proposals have to decide to what edge to attach, where along the edge to attach, and how long of a branch length to use for the attachment (see picture above).

Although the group put in hard work, there was a lot more needed to put all this together and actually get a working sampler. Luckily, Aaron recruited a super sharp and motivated postdoc in Mathieu Fourment. Mathieu tried a lot of variants of proposal distributions, showed that “heated” proposal distributions were effective for picking attachment branches, and added a parsimony-based proposal. He also showed that a high effective sample size is necessary but not sufficient to have a good SMC posterior sample, and that we can develop a time-competitive sampler compared to running MrBayes again and again. That work is now up on bioRxiv.
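One way to read "heated" here is that the per-edge attachment likelihoods are raised to a power between 0 and 1 before being normalized into a proposal distribution. The sketch below implements that reading; the function name, the `beta` parameter, and the example log-likelihoods are assumptions, and the paper should be consulted for the exact definition.

```python
import math


def heated_attachment_probs(log_liks, beta):
    """Proposal over attachment edges proportional to likelihood ** beta.

    beta = 1 proposes edges in proportion to their likelihood; beta = 0 is
    uniform over edges; values in between flatten the distribution so that
    worse-looking edges still get tried.
    """
    scaled = [beta * ll for ll in log_liks]
    top = max(scaled)  # subtract the max for numerical stability
    unnorm = [math.exp(s - top) for s in scaled]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical log-likelihoods of attaching the new sequence to three edges.
edge_log_liks = [-10.0, -12.0, -15.0]
sharp = heated_attachment_probs(edge_log_liks, 1.0)
flat = heated_attachment_probs(edge_log_liks, 0.5)
```

Whatever proposal is used, the SMC weights have to divide out the proposal probability, so heating changes which edges get explored rather than the distribution being targeted.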

Smart proposals need to be designed carefully. One of the bottlenecks was proposing the new pendant branch length, and for that the lcfit surrogate function (the subject of Connor’s talk at Evolution 2013) worked well. This spurred us to finish and write up the lcfit work, despite the fact that Aberer et al. 2015 wrote a paper in which they describe how to use standard probability distribution functions as surrogate functions for branch length proposals. This took a bit of wind out of our sails, but our purpose-built lcfit surrogate still has some interesting advantages over common functions, such as that it has the right “shape” for likelihood curves, even when branch lengths become long. We were surprised to find that it does well for complex settings with heterogeneous models. On a practical level, it’s implemented as a stand-alone C library so it can be easily incorporated into other programs, in contrast to the work of Aberer and co, which is tightly integrated into ExaBayes. In the end we have turned it into a short paper with a long appendix which is now up on arXiv. Brian Claywell is the hero of this lcfit work: he was persistent in developing creative strategies to fit the surrogate function in a variety of settings.

There is a lot to be done for online Bayesian phylogenetics still. We don’t even try to sample model parameters, and haven’t tried making BEASTly rooted trees. There are many opportunities for optimization, some of which are described in the paper. For the parallelism nerds out there, I also note the existence of the particle cascade.

I certainly hope that this contribution guides future developers towards useful online Bayesian phylogenetics samplers: it would be great to be able to keep trees up to date as the genomes keep rolling in from projects such as ZIBRA. If you are interested in more details, get in touch!

Incorporating new sequences into posterior distributions using SMC

matsen@fredhutch.org (Frederick A. Matsen) — Wed, 26 Oct 2016 00:00:00 UTC

The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular. In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.

The bummer about Bayesian computation? It takes so darn long for those chains to converge. And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics? If you get a new sequence you have to start over from scratch.

I’ve been thinking about this for several years with Aaron Darling, and in particular about Sequential Monte Carlo (SMC) for this application. SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection. You can think about it as a probabilistically correct genetic algorithm– one that is guaranteed to sample the correct posterior distribution given an infinite number of particles.

Although SMC has been applied before in phylogenetics, it has not been used in an “online” setting to update posterior distributions as new sequences appear. Aaron and I worked up a proof-of-concept implementation of SMC with Connor McCoy, but when Connor left for Google the implementation lost some steam.

However, when Vu arrived I was still curious about the theory behind our little SMC implementation. Such SMC algorithms raise some really interesting mathematical questions concerning how the phylogenetic likelihood surface changes as new sequences are added to the data set. If it changes radically, then the prospects for SMC are dim. We call this the subtree optimality question: do high-likelihood trees on n taxa arise from attaching branches to high-likelihood trees on n-1 taxa? Years ago I considered a similar question with Angie Cueto but that was for the distance-based objective function behind neighbor-joining, and others have thought about it from the empirical side under the guise of taxon sampling.

As described in an arXiv preprint, Vu developed a theory with some real surprises! First, an induction-y proof leads to consistency: as the number of particles goes to infinity, we maintain a correct posterior at every stage. Then he directly took on the subtree optimality problem by writing out the relevant likelihoods and using bounds. (We think that this is the first theoretical result for likelihood on this question.)

Then the big win: using these bounds on the ratio of likelihoods for the parent and child particles, he was able to show that the effective sample size (ESS) of the sampler is bounded below by a constant multiple of the number of particles. This is pretty neat for two reasons: first, it’s good to know that we are getting a better posterior estimate as we increase our computational effort, and it’s nice that the ESS goes up linearly with this effort. Second, this constant doesn’t depend on the size of the tree, so this bodes well for building big trees by incrementally adding taxa.

Of course, with this sort of theory we can’t get a reasonable estimate on the size of this key constant, and indeed it could be uselessly small. However, I’m still encouraged by these results, and the paper points to some interesting directions. For example, because SMC is continually maintaining an estimate of the posterior distribution, one can mix in MCMC moves in ways that otherwise would violate detailed balance, such as using an MCMC transition kernel that focuses effort around the newly added edge. In this way we might use an “SMC” algorithm with relatively few particles which in practice resembles a principled highly parallel MCMC. On the other hand we might use clever tricks to scale the SMC component up to zillions of particles.

All this strengthens my enthusiasm for continuing this work. Luckily, Aaron has recruited Mathieu Fourment to work on getting a useful implementation, and every day we are getting good news about his improvements. So stay tuned!

Summer high school and undergraduate students 2016

matsen@fredhutch.org (Frederick A. Matsen) — Fri, 22 Jul 2016 00:00:00 UTC

I definitely didn’t set out to have three high school and two undergrad students this summer.

But they’re fantastic, and making real contributions to our scientific work! Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s partis software. Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages. Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff. All of them are learning about science and coding.

These students rock, and I can’t wait to see what great things they bring into the world with their talent.