Matsen group: general
http://matsen.fredhutch.org/rss/general.rss
general newsfeed
en-us
Tue, 25 Aug 2020 13:56:15 UTC
A Bayesian phylogenetic hidden Markov model for B cell receptor sequences
http://matsen.fredhutch.org/general/2020/01/23/linearham.html
Thu, 23 Jan 2020 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/phylo-hmm.png" width="275" class="pull-right" /></p>
<h4 id="summary">Summary</h4>
<ul>
<li>antibodies develop within you via an evolutionary process</li>
<li>understanding these evolutionary patterns is important for understanding how we respond to infection and vaccination</li>
<li>we have found using Bayesian methods that evolutionary inferences are uncertain in this regime</li>
<li>our most recent work develops a “Bayesian phylogenetic hidden Markov model,” which takes into account uncertainty in both the V(D)J recombination process and the evolutionary process</li>
<li>this work reveals substantial amino-acid uncertainty in the inference of the unmutated common ancestor of VRC01, an important and heavily-studied anti-HIV antibody</li>
<li>our results are described in <a href="https://arxiv.org/abs/1906.11982">a preprint</a> which is now being revised for <em>PLOS Computational Biology</em></li>
</ul>
<h4 id="a-brief-description-of-antibody-affinity-maturation">A brief description of antibody affinity maturation</h4>
<p>In order to defend against a very large and ever-mutating pool of pathogens, your body randomly generates, and then optimizes, a large collection of antibodies.
These antibodies are displayed as so-called <em>B cell receptors</em> on the surface of specialized B cells.
The random generation is a process called V(D)J recombination, in which a collection of candidate genes are randomly selected, trimmed a random amount, and then joined by random nucleotides.
The optimization is called “affinity maturation,” in which antibody-making B cells are rewarded for being able to bind antigen by being allowed to divide, during which they further mutate their B cell receptor to continue improving binding.
The left hand image below is a cartoon of affinity maturation, simplified from a diagram in <a href="http://dx.doi.org/10.1016/j.coi.2014.02.010">Victora and Mesin (2014)</a>.</p>
<p><img src="https://matsen.fredhutch.org/images/gc-victora-mesin-simplified.png" width="37%" />
<img src="https://matsen.fredhutch.org/images/gc-tree.png" width="50%" style="float:right" /></p>
<p>If we were omniscient, we would be able to see this process unfold and record the series of division events and the genetic sequences of the B cell receptors.
But, as mere mortals, we have to be satisfied with sequencing some subset of the cells and then reconstructing the process that led to these observed cells.
This includes both the tree structure and the states at the internal nodes of the tree.
This is a little more complex than the usual phylogenetic tree and ancestral state reconstruction, because we have partial (but highly informative) knowledge about the ancestral state: we know it was sampled from some random process for which we know the ensemble of genes that could have rearranged in order to make the unmutated ancestral sequence.
More on this below.</p>
<h4 id="we-would-like-to-know-the-pathways-of-affinity-maturation">We would like to know the pathways of affinity maturation</h4>
<p>The goal of vaccination is to stimulate and affinity-mature antibodies that will be able to block infection.
Therefore, if we wish to design better vaccines and understand the impact of existing vaccines, we can use sequencing and sequence analysis methods to understand the development of antibodies.
This might be in response to a vaccine, or it might be in response to a viral infection.</p>
<p>Such analysis is especially important for difficult viruses such as HIV.
Despite decades of research and an enormous global budget, we still do not have an effective vaccine.
This failure stems largely from HIV’s astounding diversity and mutation rate, which make it incredibly difficult for your body to make antibodies that block a usefully large range of HIV strains.
Specifically, development of such antibodies typically requires a lot of mutation, including some relatively rare events.</p>
<p>In order to better understand what we can do to stimulate these difficult-to-elicit antibodies, the research community is studying the success stories: individuals that do manage to make antibodies that block a diversity of HIV strains.
These super-antibodies are called “broadly neutralizing antibodies” or bNAbs.
There is a tremendous amount of interest in a particular bNAb called VRC01.
For example, there is currently <a href="https://www.hvtn.org/en/media-room/news-releases/antibody-mediated-prevention-studies-fully-enrolled.html">a clinical trial to find if VRC01 infusion can block infection</a>.
Although this trial will tell us if VRC01 can block HIV infection in a realistic setting, what we really want is to be able to immunize such that the body makes something like VRC01 from scratch.</p>
<p>This is why it’s important to get a detailed understanding of the steps of affinity maturation taken by VRC01, which means understanding the unmutated common ancestor, as well as the series of mutations.
The better we can understand the history of such antibodies, the better we can understand the barriers to eliciting them with vaccination, and the better we can design vaccinations to overcome those barriers.</p>
<p>Correspondingly, researchers have made beautiful and detailed <a href="http://dx.doi.org/10.1016/j.cell.2015.03.004">computational dissections of this lineage</a> and <a href="http://dx.doi.org/10.1016/j.immuni.2013.04.012">of other antibodies in the same class</a>.
A 2018 <a href="http://dx.doi.org/10.1016/j.immuni.2018.10.015">paper</a> had as a primary result a new inference of the unmutated common ancestor.
These computational methods are then followed by biochemical analysis of these predicted sequences, which will guide vaccine design.</p>
<p>These biochemical characterizations are only meaningful to the extent that the computational inferences are correct.
Next I will describe the main point of this post, which is that if we use Bayesian methods we find that <em>antibody phylogenetics is uncertain</em>.</p>
<h4 id="bayesian-phylogenetic-methods-describe-tree-uncertainty">Bayesian phylogenetic methods describe tree uncertainty</h4>
<p>Given a model of sequence evolution and some sequence data, there is in general some ambiguity about the correct evolutionary history.
This is depicted in the following cartoon:</p>
<p><img src="https://matsen.fredhutch.org/images/phylo-suboptimal.png" width="100%" /></p>
<p>If we are only interested in the best-fitting tree, we can pick the left-hand one, but it fits the data only a little better than the other one.
Thus if we really care about the outcome of the analysis, we should consider both of these trees as potential explanations of the data.
The goal of Bayesian phylogenetics is to find all of the credible phylogenetic trees, and assign each of these a probability that it is the correct tree.</p>
<p>We can “boil” these trees down to quantities of actual interest to researchers.
For example, antibody researchers are commonly interested in the sequence of mutations leading to a specific antibody of interest, rather than the full tree containing those mutations.
We can represent the possible mutation paths and their probabilities in a diagram <a href="https://www.nature.com/articles/s41467-019-09481-7/figures/1">like this one</a>, which was made using a simpler version of the methods described in this blog post.
We were inspired in this representation by <a href="http://dx.doi.org/10.7554/eLife.00631">lovely work</a> from Jesse Bloom’s lab.</p>
<p>When we apply Bayesian methods to real data, we find substantial uncertainty in antibody trees.
This wouldn’t surprise an experienced phylogenetics researcher, because antibody sequences are relatively short, and mutation is typically focused in specific areas.
Thus, we think that Bayesian methods should be the method of choice for researchers who care a lot about the details of the sequence of mutations leading to an antibody of interest.
(If you cared a lot about the slope of a regression, wouldn’t you want to get a confidence interval? 😊)</p>
<h4 id="a-bayesian-phylogenetic-hidden-markov-model-for-b-cell-receptor-sequences">A Bayesian phylogenetic hidden Markov model for B cell receptor sequences</h4>
<p>There is something special and interesting about phylogenetics in this regime: we have information about the sequence at the root of the tree.
This is because we know that it came from V(D)J recombination, and there are databases of the various V, D, and J genes present in the population that go into that recombination.
We can formalize this knowledge as a probabilistic model of V(D)J recombination, which can be used as a prior on root sequences for our B cell phylogeny.</p>
<p>Putting this together with the usual Bayesian phylogenetic machinery, we have a posterior that looks like so:</p>
<p><img src="https://matsen.fredhutch.org/images/phylo-hmm.png" width="100%" /></p>
<p>where the box at the top of the tree is meant to represent a probabilistic model of the V(D)J recombination process.
Samples from this posterior integrate uncertainty in both the recombination process and the phylogenetic tree.</p>
<p>Amrit Dhar, a statistics PhD student working with Vladimir Minin and me, led development of a way of sampling from the posterior of these structures.
In doing so, he had to cope with the complexities of V(D)J recombination modeling (with assistance from Duncan Ralph), as well as the complexities of doing Bayesian phylogenetics.
The methods and validations are described in <a href="https://arxiv.org/abs/1906.11982">a preprint</a> which is now being revised for <em>PLOS Computational Biology</em>.</p>
<h4 id="substantial-uncertainty-in-the-unmutated-ancestor-of-vrc01">Substantial uncertainty in the unmutated ancestor of VRC01</h4>
<p>Using Amrit’s method, we find substantial uncertainty in the inference of the VRC01 unmutated common ancestor for the CDR3 region, which is a key region determining binding.
We can visualize it like so, with the heights of the letters being the posterior probability of that letter’s amino acid at each site:</p>
<p><img src="https://matsen.fredhutch.org/images/vrc01-logo.png" width="100%" /></p>
<p>Substantial uncertainty is evident at a number of sites, and with quite different amino acids.
For example, Tyrosine (Y) and Asparagine (N) have very different biochemical properties.
We might expect these variants to have different binding properties.</p>
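<p>As a rough illustration of how letter heights like these can be computed, here is a minimal Python sketch. It assumes we already have a list of ancestral sequences sampled from the posterior; the sequences below are made up for illustration and are not the VRC01 results, and <code>site_posteriors</code> is a hypothetical helper name.</p>

```python
from collections import Counter

# Hypothetical posterior samples of an ancestral amino-acid sequence
# (made-up data for illustration, not the VRC01 results).
sampled_ancestors = ["ARDY", "ARDN", "ARDY", "AKDY"]


def site_posteriors(samples):
    """For each site, map each amino acid to its frequency among the samples."""
    n_samples = len(samples)
    n_sites = len(samples[0])
    posteriors = []
    for site in range(n_sites):
        counts = Counter(seq[site] for seq in samples)
        posteriors.append({aa: count / n_samples for aa, count in counts.items()})
    return posteriors


# Each dict gives the letter heights for one column of a logo plot.
logo_heights = site_posteriors(sampled_ancestors)
for site, probs in enumerate(logo_heights):
    print(site, probs)
```

<p>In this toy example the last site is split between Y and N, which is the kind of ambiguity a single point estimate would hide.</p>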
<p>On the other hand, when the data supports a single clear answer, then the method reports it.
For example, we find very little uncertainty in the inferred ancestor of <a href="http://dx.doi.org/10.1016/j.immuni.2017.11.002">PC64</a>, another antibody lineage of great interest to HIV researchers.</p>
<h4 id="some-final-thoughts">Some final thoughts</h4>
<p>Bayesian methods are expensive to run, and it would require a staggering amount of compute to run this method on every inferred clonal family in a repertoire.
We don’t.
We only run it when we really care about understanding the collection of predicted ancestral sequences, such as when we want to express inferred antibodies and test their properties in the lab.</p>
<p>There’s a lot left to do here.
We would like to make the method faster (see our other research on accelerating Bayesian phylogenetic inference) so we can scale the method to more sequences.
Our method does not take into account the context-sensitive nature of the B cell receptor mutation process, like <a href="https://igphyml.readthedocs.io/en/latest/">IgPhyML</a> does.
Also, we’d ideally like to have a method that also incorporates uncertainty concerning which sequences are in the tree, which is <a href="/general/2016/04/16/partis-clustering.html">not known a priori</a>.</p>
<p>Thank you to Amrit, whose incredible determination pushed this challenging project through to completion, to Duncan, for V(D)J recombination consults and integration with partis, and to Vladimir, for being an awesome collaborator from the vision to the final details.
See the preprint for many other thank-yous, but I’d like to especially credit the Overbaugh lab for keeping us motivated to work on this challenging problem.</p>
<p>I would also like to credit Tom Kepler, whose <a href="http://dx.doi.org/10.12688/f1000research.2-103.v1">pioneering work</a> gave the first means of integrating phylogeny and rearrangement inference.
His software performed remarkably well in our benchmarks for the task of providing a point estimate of the tree.</p>
<hr />
<p>Please comment and ask questions <a href="https://b-t.cr/t/a-bayesian-phylogenetic-hidden-markov-model-for-b-cell-receptor-sequences/818">here</a>.</p>
<hr />
<p><strong>We’re always interested in hearing from people interested in our work who might want to come work with us as students or postdocs. Please drop me a line!</strong></p>
<p><br /></p>
Variational Bayesian phylogenetic inference
http://matsen.fredhutch.org/general/2019/08/24/vbpi.html
Sat, 24 Aug 2019 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/csd.png" width="275" class="pull-right" /></p>
<p>In late 2017 we were stuck without a clear way forward for our research on Bayesian phylogenetic inference methods.</p>
<p>We knew that we should be using gradient (i.e. multidimensional derivative) information to aid in finding the posterior, but couldn’t think of a way to find the <em>right</em> gradient.
Indeed, we had recently finished our work on a variant of Hamiltonian Monte Carlo (HMC) that used the branch length gradient to guide exploration, along with a probabilistic means of hopping from one tree structure to another when a branch became zero.
Although this project was a lot of fun and was <a href="http://proceedings.mlr.press/v70/dinh17a.html">an ICML paper</a>, it wasn’t the big advance that we needed: these continuous branch length gradients weren’t contributing enough to the fundamental challenge of keeping the sampler in the good region of phylogenetic tree structures.
But it was hard to even imagine a good solution to the central question: <em>how can we take gradients in the discrete space of phylogenetic trees?</em></p>
<p>Meanwhile, in another line of research we were trying to separate the process of exploring discrete tree structures from that of handling the continuous branch length parameters.
As I described <a href="/general/2019/06/18/pt.html">in a previous post</a>, this combined a systematic search strategy modeled after what maximum-likelihood phylogenetic inference programs do, along with efficient marginal likelihood estimators to “integrate out” the branch lengths.
This worked well for some data sets, but was bound to fail for any data set in which the posterior was spread across too many trees.
Indeed, <em>any method that needs to do a calculation on each tree in the credible set is bound to fail for large and flat posterior distributions.</em></p>
<p>At this point I was feeling despondent.
I didn’t know how to take the gradient in discrete tree structure space, and the sampling-based methods we wanted to avoid seemed like the only approach that could work for flat posteriors.
The only opportunity I could see was the <a href="http://dx.doi.org/10.1093/sysbio/syr074">Höhna-Drummond</a> and <a href="http://dx.doi.org/10.1093/sysbio/syt014">Larget</a> work on parametrizing tree structure posteriors; however, we had previously shown that these parametrizations were <a href="http://dx.doi.org/10.1093/sysbio/syv006">insufficiently flexible to represent the shape of true phylogenetic posteriors</a>.
Perhaps we could generalize them?</p>
<p>Cheng Zhang, when he was a postdoc in my group, took that vague idea and built a completely new means of inferring phylogenetic posteriors: <em>variational Bayes phylogenetic inference</em>.
In this post I hope to explain this advance to the phylogenetics community.</p>
<h3 id="how-variational-inference-and-the-metropolis-hastings-ratio-each-get-around-the-normalizing-constant-problem">How variational inference and the Metropolis-Hastings ratio each get around the normalizing constant problem</h3>
<p>Bayesian phylogenetic inference targets the posterior distribution <script type="math/tex">p(\mathbf{z} \mid D)</script> on structures <script type="math/tex">\mathbf{z}</script> consisting of phylogenetic trees along with associated model parameters including branch lengths.
Bayes’ rule tells us that the posterior is proportional to the likelihood times the prior:</p>
<script type="math/tex; mode=display">p(\mathbf{z} \mid D) \propto p(D \mid \mathbf{z}) \, p(\mathbf{z})</script>
<p>We can efficiently evaluate the two terms on the right hand side: the likelihood <script type="math/tex">p(D \mid \mathbf{z})</script> via <a href="https://en.wikipedia.org/wiki/Felsenstein%27s_tree-pruning_algorithm">Felsenstein’s tree-pruning algorithm</a> and the prior <script type="math/tex">p(\mathbf{z})</script>.
However, it’s still quite hard to get correct values for the posterior <script type="math/tex">p(\mathbf{z} \mid D)</script> because of the unknown proportionality constant hidden in <script type="math/tex">\propto</script>.
We will call the likelihood times the prior on the right hand side of Bayes’ rule, <script type="math/tex">p(D \mid \mathbf{z}) \, p(\mathbf{z})</script>, the <em>unnormalized posterior</em>.</p>
<p>The difficulty posed by the unknown proportionality constant is analogous to surveyors trying to calculate the average absolute height of a mountain range using only relative height measurements: they have to cover the entire mountain range before feeling confident that they can translate their relative measurements into an absolute estimate of the average height.</p>
<p>The <a href="https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Step-by-step_instructions">Metropolis-Hastings algorithm</a> avoids this problem by only working in terms of ratios of posterior probabilities.
This cancels out the hidden proportionality constant, but at the cost of not directly giving an estimate of the posterior probability.
Such an estimate then comes from running a Metropolis-Hastings sampler, which in the phylogenetic case doesn’t scale to data sets with many sequences <a href="/general/2019/06/18/pt.html">as I described in a previous post</a>.</p>
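<p>To see the cancellation explicitly, here is the ratio of posterior probabilities for a current state <script type="math/tex">\mathbf{z}</script> and a proposed state <script type="math/tex">\mathbf{z}'</script>. This is just Bayes’ rule written out, with <script type="math/tex">p(D)</script> denoting the unknown normalizing constant:</p>
<script type="math/tex; mode=display">\frac{p(\mathbf{z}' \mid D)}{p(\mathbf{z} \mid D)} = \frac{p(D \mid \mathbf{z}') \, p(\mathbf{z}') \, / \, p(D)}{p(D \mid \mathbf{z}) \, p(\mathbf{z}) \, / \, p(D)} = \frac{p(D \mid \mathbf{z}') \, p(\mathbf{z}')}{p(D \mid \mathbf{z}) \, p(\mathbf{z})}</script>
<p>Every term on the right hand side is an unnormalized posterior, which we can evaluate.</p>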
<p>Variational inference takes a different approach, fitting a <em>variational approximation</em> <script type="math/tex">q_\phi(\mathbf{z})</script> to the posterior <script type="math/tex">p(\mathbf{z} \mid D)</script>.
This approximation is parameterized in terms of some parameters <script type="math/tex">\phi</script>.
Once we have fit this approximation, we use it in place of our actual posterior for whatever downstream analyses we have in mind.
It is an inferential method that can be used in place of Metropolis-Hastings.</p>
<p><img src="https://matsen.fredhutch.org/images/variational-gradient.png" width="100%" /></p>
<p>The fitting procedure for the variational approximation avoids the normalizing constant problem by taking a measure of “goodness of fit” that only requires evaluating the unnormalized posterior.
In the most common formulation, this is the Kullback-Leibler divergence <script type="math/tex">\text{KL}(q_\phi(\mathbf{z}) \parallel p(\mathbf{z} \mid D))</script>, in which the normalizing constant enters only through the term <script type="math/tex">\log p(D)</script>, which can be pulled out of the expectation.
We can then ignore that constant when optimizing.</p>
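<p>Spelled out, the step looks like this. Here <script type="math/tex">\mathbb{E}_{q_\phi}</script> denotes an expectation over <script type="math/tex">\mathbf{z}</script> drawn from <script type="math/tex">q_\phi</script>, a notation introduced only for this derivation:</p>
<script type="math/tex; mode=display">\text{KL}(q_\phi(\mathbf{z}) \parallel p(\mathbf{z} \mid D)) = \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}) - \log p(\mathbf{z} \mid D)\right] = \mathbb{E}_{q_\phi}\left[\log q_\phi(\mathbf{z}) - \log\left(p(D \mid \mathbf{z}) \, p(\mathbf{z})\right)\right] + \log p(D)</script>
<p>Because <script type="math/tex">\log p(D)</script> does not depend on <script type="math/tex">\phi</script>, minimizing the divergence is the same as minimizing the first term, which involves only the unnormalized posterior and the approximation itself.</p>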
<p>This optimization process happens by stochastic gradient descent, in which one samples from the current approximation <script type="math/tex">q_\phi</script> and uses that sample to take an optimization step in terms of <script type="math/tex">\phi</script> to improve the fit.
That’s what I’m showing in the above figure, in which the pink points represent samples from the current variational approximation.
We take those points and calculate a gradient in terms of the variational parameters <script type="math/tex">\phi</script> using the un-normalized posterior, and then take a gradient step that improves the fit.
I show a lot of points, corresponding to the fact that we use a <a href="http://arxiv.org/abs/1602.06725">multi-sample gradient estimator</a> to decrease variance in the gradient estimate.</p>
<p>Intuitively, one can simply imagine that after a sample from <script type="math/tex">q_\phi</script> one would like to fiddle with <script type="math/tex">\phi</script> so as to improve fit of the variational approximation, just as if the posterior was “data” and we were fitting a statistical model.
Early in the fitting procedure, this will involve increasing the probability of generating samples <script type="math/tex">\mathbf{z}</script> that had a high un-normalized posterior and decreasing the probability of generating those that did not.
If you want to learn more, see the <a href="http://dx.doi.org/10.1080/01621459.2017.1285773">excellent review article</a> by Blei <em>et al</em> for background, and our ICLR paper for details about gradients.</p>
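<p>To make the fitting loop concrete, here is a toy Python sketch. It is not the estimator from our paper: it assumes a posterior over just three discrete states with made-up unnormalized values, uses a softmax over <code>phi</code> as the approximation, and takes a simple score-function gradient with the sample mean as a baseline. All names and numbers are hypothetical.</p>

```python
import math
import random

# Hypothetical unnormalized posterior values for three discrete states.
# The normalized posterior would be [0.6, 0.3, 0.1], but the fitting
# loop below never uses the normalizing constant.
unnormalized = [6.0, 3.0, 1.0]


def softmax(phi):
    biggest = max(phi)
    exps = [math.exp(p - biggest) for p in phi]
    total = sum(exps)
    return [e / total for e in exps]


def fit_variational(weights, n_steps=2000, n_samples=64, step_size=0.1, seed=0):
    rng = random.Random(seed)
    n_states = len(weights)
    phi = [0.0] * n_states  # start from a uniform approximation
    for _ in range(n_steps):
        q = softmax(phi)
        # Sample from the current approximation.
        draws = rng.choices(range(n_states), weights=q, k=n_samples)
        # Per-sample signal: log unnormalized posterior minus log q.
        signal = [math.log(weights[z]) - math.log(q[z]) for z in draws]
        baseline = sum(signal) / n_samples  # reduces gradient variance
        grad = [0.0] * n_states
        for z, s in zip(draws, signal):
            for j in range(n_states):
                # d/dphi_j log q(z) = 1[z == j] - q_j for a softmax.
                indicator = 1.0 if z == j else 0.0
                grad[j] += (s - baseline) * (indicator - q[j]) / n_samples
        # Step uphill on the fit (equivalently, downhill on the KL divergence).
        phi = [p + step_size * g for p, g in zip(phi, grad)]
    return softmax(phi)


q_fit = fit_variational(unnormalized)
print(q_fit)  # should end up near the normalized posterior
```

<p>Using several samples per step and subtracting a baseline both serve the same purpose as the multi-sample estimator mentioned above: keeping the gradient estimate from being too noisy.</p>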
<p>However, I’d like to clarify a point that seems to cause confusion, including in the minds of reviewers who rejected our grant application: there is a clear distinction between the general technique of variational inference (VI) and a specific variational parameterization, such as mean-field VI.
Mean-field VI makes strong independence assumptions that limit the flexibility of variational approximations; indeed, it is not appropriate <a href="http://arxiv.org/abs/1802.02538">even for some simple hierarchical models</a>.
In contrast, VI is a general technique that will work given an appropriate approximating density and fitting algorithm.
I describe evidence below that our parameterization for phylogenetic posteriors is sufficiently rich.
More generally, there are now many methods that use richer families of variational approximations, such as <a href="http://arxiv.org/abs/1505.05770">normalizing flows</a>.
<!--
While we are discussing generalizations, one can also substitute the KL divergence [with an alternative](http://arxiv.org/abs/1611.00328).
--></p>
<h3 id="how-do-we-obtain-a-variational-approximation-of-a-phylogenetic-posterior">How do we obtain a variational approximation of a <em>phylogenetic</em> posterior?</h3>
<p>You may be thinking “well this all sounds very nice, but how are we going to parameterize a discrete set of phylogenetic trees using real-valued parameters <script type="math/tex">\phi</script>?”
This is not at all obvious, and is the subject of <a href="/general/2018/12/05/sbn.html">a previous post</a> (where we also credit the originators of this approach).
In short, one approximates the phylogenetic posterior using a series of conditional probabilities, like so:</p>
<p><img src="https://matsen.fredhutch.org/images/full-vbpi-parametrization.png" width="100%" /></p>
<p>We showed in <a href="https://papers.nips.cc/paper/7418-generalizing-tree-probability-estimation-via-bayesian-networks.pdf">our 2018 NeurIPS paper</a> that this parametrization was sufficiently rich to approximate the shape of phylogenetic posteriors on real data to high accuracy.
In fact, in that paper (Table 1) we showed that the variational approximation fit to an MCMC sample was significantly <em>more accurate</em> than just using the MCMC samples in the usual way.</p>
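<p>To give a feel for how a product of conditional probabilities can assign a probability to a whole tree, here is a deliberately simplified Python sketch. It conditions each split only on the clade being split, which is less context than the parametrization described in the linked post uses. The table of probabilities is made up and the function names are hypothetical.</p>

```python
def clade(*taxa):
    return frozenset(taxa)


def split(left, right):
    # An unordered pair of child clades.
    return frozenset([left, right])


# Hypothetical real-valued parameters: for each parent clade, the
# probability of each way of splitting it into two child clades.
conditional = {
    clade("A", "B", "C", "D"): {
        split(clade("A", "B"), clade("C", "D")): 0.5,
        split(clade("A"), clade("B", "C", "D")): 0.5,
    },
    clade("B", "C", "D"): {
        split(clade("B"), clade("C", "D")): 0.8,
        split(clade("C"), clade("B", "D")): 0.2,
    },
}


def leaf_set(tree):
    """A tree is a taxon name or a pair (left_subtree, right_subtree)."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return leaf_set(left) | leaf_set(right)


def tree_probability(tree, table):
    if isinstance(tree, str):
        return 1.0
    left, right = tree
    parent = leaf_set(tree)
    if len(parent) == 2:
        here = 1.0  # a two-taxon clade can only split one way
    else:
        here = table.get(parent, {}).get(split(leaf_set(left), leaf_set(right)), 0.0)
    return here * tree_probability(left, table) * tree_probability(right, table)


caterpillar = ("A", ("B", ("C", "D")))
balanced = (("A", "B"), ("C", "D"))
other = ("A", ("C", ("B", "D")))
for t in (caterpillar, balanced, other):
    print(t, tree_probability(t, conditional))
```

<p>The three trees supported by this toy table get probabilities that sum to one, and the entries of the table are exactly the kind of real-valued parameters that a gradient-based fitting procedure can adjust.</p>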
<p>For full variational inference, we also layer on a variational distribution of branch lengths in terms of another set of variational parameters <script type="math/tex">\psi</script>.
I’m not going to describe how those work, but Cheng found a nice parameterization that used “just the right amount” of tree structure.
See <a href="https://openreview.net/pdf?id=SJVmjjR9FX">our 2018 ICLR paper</a> for a full description.</p>
<p>With the complete variational parameterization in hand, all that remains is to fit it to the posterior.
This required deft coding and a lot of tinkering on Cheng’s part, using control variate ideas for the tree structure and the reparametrization trick for branch lengths.
The result?
An algorithm that can outperform MCMC in terms of the number of likelihood computations.</p>
<p>The phylogenetic reader may also be interested in Table 1 of the ICLR paper, which shows that importance sampling using the full variational approximation gives marginal likelihood estimates that are quite concordant with, though lower in variance than, those of the <a href="http://dx.doi.org/10.1093/sysbio/syq085">stepping-stone method</a>.
Stepping-stone is a computationally expensive gold-standard method, whereas our method only required 1000 importance samples (and thus only 1000 likelihood evaluations once the variational approximation was fit).
That’s promising!</p>
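<p>The idea behind importance sampling with a fitted approximation can be sketched in a few lines of Python. This toy version assumes three discrete states with made-up unnormalized posterior values (whose sum plays the role of the marginal likelihood) and a made-up fitted approximation; in the real setting each state would be a tree with branch lengths and each weight a likelihood times a prior.</p>

```python
import random

# Hypothetical unnormalized posterior values; their sum (10.0) plays the
# role of the marginal likelihood we want to estimate.
unnormalized = [6.0, 3.0, 1.0]
# Hypothetical fitted approximation: close to, but not exactly,
# the normalized posterior [0.6, 0.3, 0.1].
approximation = [0.55, 0.33, 0.12]


def importance_sampling_estimate(weights, q, n_samples=1000, seed=0):
    rng = random.Random(seed)
    draws = rng.choices(range(len(q)), weights=q, k=n_samples)
    # Average the ratio of unnormalized posterior to approximation.
    return sum(weights[z] / q[z] for z in draws) / n_samples


estimate = importance_sampling_estimate(unnormalized, approximation)
print(estimate)  # should be close to 10.0
```

<p>The closer the approximation is to the true posterior, the closer every ratio is to the same constant, and the lower the variance of the estimate.</p>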
<h3 id="whats-next">What’s next?</h3>
<p>We’re working hard to realize the promise of variational Bayes phylogenetic inference.
On the coding front, we’re developing the <a href="https://github.com/phylovi/libsbn/">libsbn</a> library along with a <a href="https://github.com/orgs/phylovi/people">team</a> including Mathieu Fourment.
The concept behind this Python-interface C++ library is that you can express interesting parts of your phylogenetic model in Python/TensorFlow/PyTorch/whatever and let an optimized library handle the tree structure and likelihood computations for you.
It’s not quite useful yet, but we already have the essential data structures, as well as likelihood computation and branch length gradients using <a href="http://dx.doi.org/10.1093/sysbio/syz020">BEAGLE</a>.
I’m having a blast hacking on it, and it shouldn’t be too long before it can perform inference.</p>
<p>But the really fun part about variational inference is the ability to develop tricks that accelerate convergence.
VI is fundamentally an optimization algorithm, and we can do whatever we want to do to accelerate that optimization.
For example, <a href="http://www.jmlr.org/papers/volume14/hoffman13a/hoffman13a.pdf">stochastic variational inference</a> accelerates inference by taking random subsets of the data.
We need to be careful about how to do that in the phylogenetic case (we can’t naively subsample tips of the tree) but we are currently pursuing ideas along those lines.
In contrast, MCMC is a fairly constrained algorithm, and clever algorithms run the risk of either disturbing detailed balance or leading to an impossible-to-calculate proposal density.</p>
<p>I haven’t mentioned continuous model parameters other than branch lengths, and our initial work only used the simplest phylogenetic model: Jukes-Cantor without an explicit model of rates across sites.
Mathieu is working out the gradients of nucleotide model parameters, which will allow us to formulate variational approximations of those too.</p>
<p>There’s still a lot to be done, and I’m having the time of my research life working in this area.
I’d love to hear any comments, and don’t hesitate to <a href="https://matsen.fredhutch.org/members.html">reach out</a> with questions.</p>
<p><strong>We’re always interested in hearing from people interested in our work who might want to come work with us as students or postdocs. Please drop me a line!</strong></p>
<hr />
<p>I’m very grateful to Cheng Zhang for his creativity and skill in making this project happen.
He is now tenure-track faculty at Peking University in Beijing.
I’d also like to thank our growing team of collaborators working on this subject.</p>
<p>Also, if you are interested in this area, check out <a href="http://dx.doi.org/10.1101/702944">the work of Mathieu Fourment and Aaron Darling</a>, which is an independent development from ours.</p>
<p><br /></p>
Bayesian phylogenetic inference without sampling trees
http://matsen.fredhutch.org/general/2019/06/18/pt.html
Tue, 18 Jun 2019 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/pt-tree-graph.png" width="250" class="pull-right" /></p>
<p>Most every description of Bayesian phylogenetics I’ve read proceeds as follows:</p>
<ul>
<li>“Bayesian phylogenetic analyses are conducted using a simulation technique known as Markov chain Monte Carlo (MCMC).” (<a href="http://dx.doi.org/10.1146/annurev.ecolsys.37.091305.110021">Alfaro & Holder, 2006</a>)</li>
<li>“Posterior probabilities are obtained by exploring tree space using a sampling technique, called Markov chain Monte Carlo (MCMC).” (Lemey et al, <em>The Phylogenetic Handbook</em>)</li>
<li>“Once the biologist has decided on the data, model and prior, the next step is to obtain a sample from the posterior. This is done by using MCMC…” (<a href="http://dx.doi.org/10.1038/s41559-017-0280-x">Nascimento et al, 2017</a>.)</li>
</ul>
<p>With statements like these in popular (and otherwise excellent!) reviews, it’s not surprising that people confuse Bayesian phylogenetics and Markov chain Monte Carlo (MCMC).
Well, let’s be clear.</p>
<p><em>MCMC is one way to approximate a Bayesian phylogenetic posterior distribution. It is not the only way.</em></p>
<p>In this post I’ll describe two of our recent papers that together give a systematic, rather than random, means of approximating a phylogenetic posterior distribution.</p>
<p>Without a doubt MCMC is the most popular means of approximating the posterior.
MCMC is wonderfully simple.
To implement a sampler (assuming you have a computable likelihood and prior) all you have to do is devise a proposal distribution that tries out a new tree and/or model parameter, and then accept/reject based on the <a href="https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm">Metropolis-Hastings ratio</a>.
If your proposal is reasonably good, then MCMC will converge (in the limit of a large number of samples) to the posterior distribution.</p>
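<p>Here is a toy Python sketch of that recipe. It assumes only three states standing in for tree topologies, with made-up unnormalized posterior values, and a symmetric proposal that picks one of the other states uniformly. A real phylogenetic sampler would propose tree rearrangements and parameter changes instead.</p>

```python
import random

# Hypothetical unnormalized posterior values for three tree topologies.
unnormalized = [6.0, 3.0, 1.0]


def metropolis_hastings(weights, n_steps, seed=0):
    rng = random.Random(seed)
    n_states = len(weights)
    current = 0
    visits = [0] * n_states
    for _ in range(n_steps):
        # Symmetric proposal: pick one of the other states uniformly.
        proposal = rng.choice([s for s in range(n_states) if s != current])
        # Only the ratio of unnormalized values is needed, so the
        # normalizing constant never has to be computed.
        ratio = weights[proposal] / weights[current]
        if rng.random() < min(1.0, ratio):
            current = proposal
        visits[current] += 1
    return [v / n_steps for v in visits]


frequencies = metropolis_hastings(unnormalized, n_steps=50_000)
print(frequencies)  # should be close to [0.6, 0.3, 0.1]
```

<p>Because the proposal is symmetric, the proposal densities cancel in the Metropolis-Hastings ratio and only the ratio of unnormalized posterior values remains.</p>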
<p>However, the simplicity of MCMC implies inherent limitations.
It is not a “smart” algorithm.
For example, one cannot easily adapt the proposal distribution to reflect what the MCMC is learning about the posterior.</p>
<p>This is problematic because the number of trees grows super-exponentially, and most of those trees are terrible explanations of a given data set.
Thus, naive random tree modification proposals are likely to take us out of the high-posterior region.
This is manifested in either timid tree proposal distributions, or as an unacceptably high rejection rate for proposals in an MCMC run.</p>
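<p>To get a sense of the scale, the number of bifurcating topologies on labeled taxa has a standard closed form as a double factorial. The short Python sketch below (function names are my own) evaluates it for a few sizes.</p>

```python
def double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1 or 2."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def n_unrooted_trees(n_taxa):
    """Unrooted bifurcating topologies on n labeled taxa (n >= 3): (2n - 5)!!"""
    return double_factorial(2 * n_taxa - 5)


def n_rooted_trees(n_taxa):
    """Rooted bifurcating topologies on n labeled taxa (n >= 2): (2n - 3)!!"""
    return double_factorial(2 * n_taxa - 3)


for n in (4, 10, 20, 50):
    print(n, n_unrooted_trees(n), n_rooted_trees(n))
```

<p>Already at ten taxa there are about two million unrooted topologies, so a proposal that picks a tree blindly has almost no chance of landing on a good one.</p>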
<p>Before we start talking about alternatives, I want to emphasize that people have done wonderful and important work within the MCMC framework.
It has brought us <em>all</em> of the biological insights we have learned using Bayesian phylogenetic methodology, from deep divergence inference with complex models to genetic epidemiology.
Methodologically, many authors have made this possible, from developing and testing tree proposals to leveraging Metropolis-coupled MCMC, which, although <a href="http://dx.doi.org/10.1093/sysbio/syy008">not a panacea</a>, certainly <a href="http://dx.doi.org/10.1093/sysbio/syv006">improves topological mixing</a>.</p>
<p>Let’s start discussing alternatives by backing up a bit.</p>
<h3 id="what-is-bayesian-phylogenetics-trying-to-do">What is Bayesian phylogenetics trying to do?</h3>
<p>In the Bayesian phylogenetic framework, we are interested in the posterior distribution <script type="math/tex">p(\tau, \theta \mid D)</script> on tree topologies <script type="math/tex">\tau</script> and model parameters <script type="math/tex">\theta</script> (including branch lengths) given sequence data <script type="math/tex">D</script>.
Bayes’ rule tells us that the posterior is proportional to the likelihood times the prior:</p>
<script type="math/tex; mode=display">p(\tau, \theta \mid D) \propto p(D \mid \tau, \theta) \, p(\tau, \theta)</script>
<p>Because this is a proportionality, we can easily evaluate ratios of posterior probabilities (such as to compute the Metropolis-Hastings ratio), but getting the true value of the posterior is intractable for phylogenetics.</p>
<p>Now, it’s common to be interested in the tree topologies <script type="math/tex">\tau</script> rather than the joint distribution on the topology and all of the associated continuous parameters.
For instance, one might want to test monophyly of a given clade.
That is, we would like to know <script type="math/tex">p(\tau \mid D)</script>.</p>
<p>The common way to do this with MCMC is to run your chain, count the number of times you saw each topology <script type="math/tex">\tau_i</script>, then divide by the number of samples from your chain.
By ignoring the continuous parameters, we effectively marginalize them out.</p>
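<p>As a minimal sketch of this counting estimator, assume each sampled topology has already been reduced to some canonical, hashable label (here, hand-written Newick-like strings, which is an assumption of the example). The continuous parameters attached to each sample are simply dropped.</p>

```python
from collections import Counter

def topology_posterior(sampled_topologies):
    """Estimate p(tau | D) by the fraction of MCMC samples showing each topology."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    return {topology: n / total for topology, n in counts.items()}

# Toy chain output: four samples over two topologies.
chain = ["((A,B),(C,D))", "((A,C),(B,D))", "((A,B),(C,D))", "((A,B),(C,D))"]
print(topology_posterior(chain))
```

<p>In practice the labels must be canonical, so that two samples with the same topology but different leaf ordering or branch lengths are counted together.</p>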
<p><strong>In our work we wondered if we could develop an alternative means of getting <script type="math/tex">p(\tau \mid D)</script> without a sampling-based method such as MCMC.</strong>
Specifically, we wanted to avoid any randomized movement through tree space.</p>
<p>Consider that</p>
<script type="math/tex; mode=display">p(\tau_i \mid D) = \frac{p(D \mid \tau_i) \, p(\tau_i)}{\sum_j p(D \mid \tau_j) \, p(\tau_j)}</script>
<p>where the sum in the denominator of the ratio is over all trees <script type="math/tex">\tau_j</script>, <script type="math/tex">p(D \mid \tau_j)</script> is the marginal likelihood over continuous parameters</p>
<script type="math/tex; mode=display">p(D \mid \tau_j) = \int_\theta p(D \mid \tau_j, \theta) \, p(\theta) \, d\theta,</script>
<p>and <script type="math/tex">p(\tau_j)</script> is the prior on tree topology <script type="math/tex">\tau_j</script>.</p>
<p>Thus we can approximate the per-topology posterior distribution <script type="math/tex">p(\tau_i \mid D)</script> given</p>
<ol>
<li>some systematic way of identifying a collection of “good” trees <script type="math/tex">\tau_j</script> that contain most of the posterior probability weight in the denominator of the ratio</li>
<li>some efficient way of estimating the marginal likelihood <script type="math/tex">p(D \mid \tau)</script>.</li>
</ol>
<p>The question is, then, can we obtain these two ingredients?</p>
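<p>Given those two ingredients, the last step is just normalization over the collected trees. Here is a small sketch, assuming we already hold a log marginal likelihood and a log prior for each tree in the “good” set (the dictionaries and their keys are hypothetical). Working in log space with the usual log-sum-exp trick avoids underflow, since phylogenetic log likelihoods are typically large negative numbers.</p>

```python
import math

def approx_topology_posterior(log_marginals, log_priors):
    """Normalize p(D | tau) p(tau) over a finite set of trees, in log space."""
    log_terms = {t: log_marginals[t] + log_priors[t] for t in log_marginals}
    top = max(log_terms.values())
    log_norm = top + math.log(sum(math.exp(v - top) for v in log_terms.values()))
    return {t: math.exp(v - log_norm) for t, v in log_terms.items()}

# Toy input: tree "t2" has three times the marginal likelihood of "t1".
log_marginals = {"t1": -1000.0, "t2": -1000.0 + math.log(3.0)}
log_priors = {"t1": 0.0, "t2": 0.0}   # any constant works for a uniform prior
print(approx_topology_posterior(log_marginals, log_priors))
```

<p>The result is only as good as the coverage of the set: any posterior mass on trees left out of the sum is silently reassigned to the trees that were found.</p>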
<h3 id="ingredient-1-efficiently-finding-a-good-set-of-trees">Ingredient 1: efficiently finding a “good” set of trees</h3>
<p>Current maximum-likelihood and Bayesian phylogenetic algorithms are opposites in terms of objective and method.
Maximum-likelihood algorithms systematically zoom up to the top of the likelihood surface with no regard for trees that serve as nearly-as-good explanations of the data.
Bayesian algorithms explore tree space randomly, wasting effort by returning to the same trees many times, but given enough time do a good job of exploring the whole posterior region.</p>
<p>We decided to combine the systematic search of ML algorithms with the Bayesian objective, such that we would systematically find all of the “good” trees.
To do so, we did the tree rearrangements that one usually does with these algorithms, but kept track of all of the trees that were above some likelihood threshold rather than just allowing rearrangements that result in an improvement of likelihood.</p>
<p>Note that I said “likelihood” and not “posterior” here.
In fact, by “likelihood” I mean the likelihood of the tree with the maximum-likelihood assignment of branch lengths (and other model parameters).
One of the surprising results of our work is that this likelihood acts as a good proxy for the posterior when looking for this “good” set of trees.</p>
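<p>The search idea can be sketched generically as a graph traversal that expands every tree above the threshold, not only improving moves. This is an illustration and not the libpll-based implementation: <code>neighbors</code> stands in for tree rearrangements, <code>log_lik</code> for the branch-length-optimized likelihood, and the “trees” in the toy usage are just integers on a line.</p>

```python
from collections import deque

def high_likelihood_set(start_trees, neighbors, log_lik, threshold):
    """Collect every tree reachable from the starts with log_lik >= threshold."""
    seen = set()
    good = {}
    queue = deque()

    def visit(tree):
        if tree in seen:
            return                  # never evaluate the same tree twice
        seen.add(tree)
        ll = log_lik(tree)
        if ll >= threshold:         # keep and expand only the "good" trees
            good[tree] = ll
            queue.append(tree)

    for tree in start_trees:
        visit(tree)
    while queue:
        for nb in neighbors(queue.popleft()):
            visit(nb)
    return good

# Toy landscape: states 0..20 with a single peak at 10.
toy_neighbors = lambda i: [j for j in (i - 1, i + 1) if 0 <= j <= 20]
toy_log_lik = lambda i: -float((i - 10) ** 2)
good = high_likelihood_set([10], toy_neighbors, toy_log_lik, threshold=-9.0)
print(sorted(good))
```

<p>Because below-threshold trees are never expanded, the search can only reach good trees connected to a starting point through other good trees, which is one reason a diverse set of starting points matters.</p>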
<p>We find that this strategy works reasonably well if one uses a collection of starting points obtained by running RAxML starting at several hundred random trees.
These starting points appear to cover all of the local peaks one finds in the posterior distribution.
We implemented our algorithm using the <a href="https://github.com/xflouris/libpll/">libpll library</a> from the Stamatakis group, which was a pleasant foundation.
We added a multithreading strategy that allowed different workers to spread out across tree space.
We report these results in <a href="https://arxiv.org/abs/1811.11007">Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies</a>, with Whidden, Claywell, Fisher, Magee, and Fourment.</p>
<h3 id="ingredient-2-efficiently-estimating-the-per-tree-marginal-likelihood">Ingredient 2: efficiently estimating the per-tree marginal likelihood</h3>
<p>The other component we needed was a way to evaluate the per-tree marginal likelihood <script type="math/tex">p(D \mid \tau)</script>.
Please note that we are doing this marginal likelihood estimation with a single fixed tree topology at a time, which is in contrast to many applications of phylogenetic marginal likelihood estimation in which one is comparing one evolutionary model to another while integrating out the tree topology as well.</p>
<p>Because marginal likelihood in this formulation is a problem with continuous parameters only, there are many existing methods for estimating it.
We also developed a few of our own specifically for this application, giving 19 methods in total.
This immediately brought to mind the classic 1978 paper by Moler & Van Loan: <a href="https://doi.org/10.1137/1020098">Nineteen Dubious Ways to Compute the Exponential of a Matrix</a>.
Accordingly, we named our paper <a href="https://arxiv.org/abs/1811.11804">19 Dubious Ways to Compute the Marginal Likelihood of a Phylogenetic Tree Topology</a> with Fourment, Magee, Whidden, Bilge, and Minin.</p>
<p>We were surprised to find that some fast methods could be quite accurate, and some slow methods showed rather poor accuracy.
One of the star algorithms here was a “Gamma Laplus” method devised by the group that used Laplace-like approximation to fit a Gamma distribution, which can then be used directly to compute a marginal likelihood.
One can then boost the accuracy with a relatively small computational cost by adding an importance sampling step.</p>
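<p>To illustrate the general idea (and only the idea: this is not the group’s implementation, and the details of the published method may differ), here is a toy with a single branch length <i>t</i> between two sequences under Jukes-Cantor and an exponential prior. We fit a Gamma distribution by matching the mode and curvature of the log joint density, read off a marginal likelihood directly, refine it with importance sampling from the fitted Gamma, and compare both against brute-force quadrature. The site counts, prior rate, and all function names are assumptions of the example.</p>

```python
import math
import random

N_SAME, N_DIFF = 90, 10   # toy data: matching / mismatching sites
PRIOR_RATE = 10.0         # exponential prior on the branch length (assumed)

def log_joint(t):
    """log p(D | t) + log p(t) for two sequences under Jukes-Cantor."""
    e = math.exp(-4.0 * t / 3.0)
    log_same = math.log(0.25 * (0.25 + 0.75 * e))   # site with identical bases
    log_diff = math.log(0.25 * (0.25 - 0.25 * e))   # site with one specific mismatch
    return N_SAME * log_same + N_DIFF * log_diff + math.log(PRIOR_RATE) - PRIOR_RATE * t

def find_mode(f, lo=1e-6, hi=2.0, iters=200):
    """Ternary search for the maximizer of a unimodal function."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def fit_gamma(f, mode, h=1e-4):
    """Gamma(shape, rate) with the same mode and curvature as f at its mode."""
    curvature = -(f(mode + h) - 2.0 * f(mode) + f(mode - h)) / (h * h)
    return 1.0 + curvature * mode ** 2, curvature * mode

def log_gamma_pdf(t, shape, rate):
    return shape * math.log(rate) - math.lgamma(shape) + (shape - 1.0) * math.log(t) - rate * t

def log_marginal_direct(f, mode, shape, rate):
    """If exp(f) is close to Z times the Gamma pdf, then log Z = f - log pdf at the mode."""
    return f(mode) - log_gamma_pdf(mode, shape, rate)

def log_marginal_importance(f, shape, rate, n=20000, seed=1):
    """Importance sampling estimate of log Z with the fitted Gamma as proposal."""
    rng = random.Random(seed)
    log_w = []
    for _ in range(n):
        t = rng.gammavariate(shape, 1.0 / rate)   # second argument is the scale
        log_w.append(f(t) - log_gamma_pdf(t, shape, rate))
    top = max(log_w)
    return top + math.log(sum(math.exp(w - top) for w in log_w) / n)

def log_marginal_quadrature(f, upper=3.0, n=20000):
    """Midpoint-rule reference, feasible only because t is one-dimensional."""
    h = upper / n
    vals = [f((i + 0.5) * h) for i in range(n)]
    top = max(vals)
    return top + math.log(h * sum(math.exp(v - top) for v in vals))

mode = find_mode(log_joint)
shape, rate = fit_gamma(log_joint, mode)
print(log_marginal_direct(log_joint, mode, shape, rate),
      log_marginal_importance(log_joint, shape, rate),
      log_marginal_quadrature(log_joint))
```

<p>On a real tree there is one branch length per edge, so quadrature is out of the question, and that is exactly where a cheap fitted approximation plus a modest importance sampling correction becomes attractive.</p>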
<h3 id="so-does-it-work">So, does it work?</h3>
<p>Yes, combining these two strategies does work as a proof of principle.
There are some caveats, though.
The first caveat is that we used the Jukes-Cantor model, so we didn’t have to marginalize out model parameters other than branch lengths.
This seems tractable: it would take another paper’s worth of work with more interesting variational parameterizations, but I think we could deal with substitution-model and rate-variation parameters.</p>
<p>The second caveat is a more inherent issue for any method that attempts to individually explore every “good” tree.
Sometimes tree posteriors are just really diffuse!
For example, there are some data sets for which our extremely long “Golden” MrBayes runs never once sampled the same topology twice.</p>
<p>In fact, thinking about what to do with these very diffuse posterior distributions is what led us down the road of thinking more about <a href="/general/2018/12/05/sbn.html">density estimation on the set of phylogenetic trees</a>, which then in turn led us to investigate full variational Bayes phylogenetic inference, which I’ll write about in an upcoming post.</p>
<hr />
<p>This was a big and complex project that required a lot of hard work to pull off.
For the first paper, I’d like to highlight the stamina of Chris Whidden and the programming prowess of Brian Claywell, as well as thank Thayer Fisher for starting off the project as a summer undergraduate project.
For the second paper, I’d like to thank Mathieu Fourment, who implemented every one of the 19 methods with almost unbelievable gumption and skill, Andy Magee, who is the unsung hero of the project for contributing implementations and analysis, and Arman Bilge, who did important work early on for the Laplace-type methods.
Vladimir was as usual a wonderful collaborator.</p>
Generalizing tree probability estimation via Bayesian networks
http://matsen.fredhutch.org/general/2018/12/05/sbn.html
Wed, 05 Dec 2018 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2018/12/05/sbn.html<p><img src="https://matsen.fredhutch.org/images/tree-smoother.png" width="350" class="pull-right" /></p>
<p>Posterior probability estimation of phylogenetic tree topologies from an MCMC sample is currently a pretty simple affair.
You run your sampler, you get out some tree topologies, you count them up, normalize to get a probability, and done.
It doesn’t seem like there’s a lot of room for improvement, right?</p>
<p>Wrong.</p>
<p>Let’s step back a little and think like statisticians.
The posterior probability of a tree topology is an unknown quantity.
By running an MCMC sampler, we get a histogram, the normalized version of which will converge to the true posterior in the limit of a large number of samples.
We can use that simple histogram estimate, but nothing is stopping us from taking other estimators of the per-topology posterior distribution that may have nicer properties.</p>
<p>For real-valued samples we might use kernel density estimates to smooth noisy sampled distributions, which may reduce error when sampling is sparse.
Because the number of phylogenies is huge, MCMC is computationally expensive, and we are naturally impatient, one is often in the sparsely-sampled regime for topology posteriors.
Can we smooth out stochastic under- and over-estimates of topology posterior probabilities by using similarities between trees? (See upper-right cartoon.)
This smoothing should also extend the posterior to unsampled topologies.</p>
<p>The question is, then, how do we do something like a kernel density estimate in tree space?
In a beautiful line of work <a href="http://dx.doi.org/10.1093/sysbio/syr074">started by Höhna and Drummond</a> and <a href="http://dx.doi.org/10.1093/sysbio/syt014">extended by Larget</a> one can think of each tree as being determined by local choices about how groups of leaves (“clades”) get split apart recursively down the tree.
Their work assumed independence between these clade splitting probabilities.</p>
<p>This is a super-cool idea, but the formulation didn’t seem to work well for tree probability estimation from posterior samples on real data.
For example, <a href="http://dx.doi.org/10.1093/sysbio/syv006">Chris Whidden and I</a> noticed that this procedure underestimated the posterior for sub-peaks and overestimated the posterior between peaks.
This says that the conditional independence assumption on clades made by this method was too strong.
But this doesn’t doom the entire approach!
We just need to take a more flexible family of distributions over phylogenetic tree topologies.</p>
<p>I suggested this direction to Cheng Zhang, a postdoc in my group, and within a week he figured out the right construction that generalized this earlier work but allowed for much more complex distributions.
Cheng’s construction parameterizes a tree in terms of “subsplits,” which are the choices about how to split up a given clade.
To allow for more complex distributions than the previous conditional independence assumptions allow, he encodes tree probabilities in terms of a collection of subsplit-valued random variables that are placed in a Bayesian network.</p>
<p>The simplest such network enables dependence of a subsplit on the parent subsplit, which in tree terms means that when assigning a probability to a given subsplit we are influenced by what the sister clade is.
More complex networks can encode more complex dependence structures.
To our surprise, the simplest formulation worked well: allowing split frequencies for clades to depend on the sister clade gives a sufficiently flexible set of distributions to be able to fit complex tree-valued distributions.</p>
<p>In the simplest version one can write out the probability for a given tree like so:</p>
<p><img src="https://matsen.fredhutch.org/images/csd-example.png" style="width: 100%" /></p>
<p>where the <i>q</i>s are inferred probability distributions that we call conditional subsplit distributions.</p>
<p>In addition to more complex dependence structure, Cheng’s approach also more formally treats this whole procedure as an exercise in estimating an approximating distribution.
Where previous efforts estimated probabilities by counting, one can do better in the unrooted case for subsplit networks by optimizing the parameterized distribution on trees to match an empirical sampled distribution of unrooted trees via expectation maximization.
One can also take some weak priors to handle the sparsely-sampled case.</p>
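<p>Here is a rough sketch of the simplest version for <em>rooted</em> trees, where counting is enough (the unrooted case described above needs expectation maximization, which this example does not attempt). Trees are nested pairs of leaf names, each subsplit is conditioned on its parent subsplit together with the clade being split, and the tree probability is the product of the conditional probabilities. The representation and function names are invented for illustration and are not taken from the paper’s code.</p>

```python
from collections import Counter, defaultdict

def clade(tree):
    """Set of leaf names below a node; leaves are strings, internal nodes are pairs."""
    if isinstance(tree, str):
        return frozenset([tree])
    left, right = tree
    return clade(left) | clade(right)

def keyed_subsplits(tree, parent=None):
    """Yield ((parent subsplit, clade being split), subsplit) for each internal node."""
    if isinstance(tree, str):
        return
    left, right = tree
    here = frozenset([clade(left), clade(right)])
    yield (parent, clade(tree)), here
    for child in (left, right):
        yield from keyed_subsplits(child, here)

def fit_conditional_subsplits(trees):
    """Estimate conditional subsplit distributions by counting over sampled trees."""
    counts = defaultdict(Counter)
    for tree in trees:
        for key, split in keyed_subsplits(tree):
            counts[key][split] += 1
    return {key: {s: n / sum(c.values()) for s, n in c.items()}
            for key, c in counts.items()}

def tree_probability(tree, csd):
    """Product of conditional subsplit probabilities; zero if any piece is unseen."""
    prob = 1.0
    for key, split in keyed_subsplits(tree):
        prob *= csd.get(key, {}).get(split, 0.0)
    return prob

t1 = ((("A", "B"), "C"), (("D", "E"), "F"))
t2 = (("A", ("B", "C")), ("D", ("E", "F")))
unsampled = ((("A", "B"), "C"), ("D", ("E", "F")))   # mixes pieces of t1 and t2
csd = fit_conditional_subsplits([t1, t2])
print(tree_probability(t1, csd), tree_probability(unsampled, csd))
```

<p>With these two samples, the sampled tree <code>t1</code> and the never-sampled tree <code>unsampled</code> both get probability 0.25: the estimate has been spread over every tree that can be assembled from observed parent-child subsplit pairs.</p>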
<p>We’ve written up these results in a <a href="https://papers.nips.cc/paper/7418-generalizing-tree-probability-estimation-via-bayesian-networks">paper that has been accepted to the NeurIPS</a> (previously NIPS<sup><a href="#footnote1">1</a></sup>) conference as a Spotlight presentation.
I’m proud of Cheng for this accomplishment, but consequently the paper is written more for a machine-learning audience than for a phylogenetics audience.
If you aren’t familiar with the Bayesian network formalism it may be a tough read.
The key thing to keep in mind is that the network (paper Figure 2) encodes the tree as a collection of subsplits assigned to the nodes of the network, and the edges describe probabilistic dependence.
For example, the reason we can think of a conditional subsplit distribution as conditioning on the sister clade (see figure above) is because parent-child relationships in the subsplit Bayesian networks must take values such that the child subsplit is compatible with the parent subsplit.</p>
<p>If you don’t read anything else, flip to Table 1 and check out how much better these estimates are on big posteriors from real data than what everyone does right now, which is just to use the simple fraction.
Magic!
Hopefully it makes sense that we are smoothing out our MCMC posterior, and extending it to unsampled trees.
If you have questions, I hope you will head on over to <a href="https://www.phylobabble.org/t/generalizing-tree-probability-estimation-via-bayesian-networks/1067">Phylobabble</a> and ask them. Let’s have a discussion!</p>
<p>Subsplit Bayesian networks open up a lot of opportunities for new ways of inferring posteriors on trees.
Stay tuned!</p>
<p>We are always looking for folks to contribute in this area.
If you’re interested, get in touch!</p>
<hr />
<p><a name="footnote1">1</a>: I’m happy to report that the <a href="https://nips.cc/Conferences/2018/News?article=2118">NIPS conference has changed its name to NeurIPS</a>.
This is an important move that signals at least a desire by the board for diversity and inclusion in machine learning.
We can all hope that it is followed with concrete action.</p>
Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity
http://matsen.fredhutch.org/general/2018/05/15/pubtcr.html
Tue, 15 May 2018 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2018/05/15/pubtcr.html<p><img src="https://matsen.fredhutch.org/images/pubtcrs.png" width="470" class="pull-right" /></p>
<p>High-throughput sequencing of our adaptive immune repertoires holds great promise for understanding immune state.
These sequences implicitly contain a wealth of information on past and present exposures to infectious and autoimmune diseases, to environmental stimuli, and even to tumor-derived antigens.
In principle, we should be able to use these sequences of rearranged receptors to infer their eliciting antigens, either individually or collectively.</p>
<p>We’re starting to see neat progress in these areas for T cell receptors (TCRs).
Some recent studies compare TCR repertoires between individuals who do or do not have some immune state, such as <a href="https://academic.oup.com/bioinformatics/article/30/22/3181/2390867">an immunization</a>, <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1814-6">an autoimmune disease</a>, or a <a href="http://dx.doi.org/10.1038/ng.3822">viral infection</a>, and work to find sequence-level differences between the repertoires.
The Walczak-Mora team recently <a href="https://elifesciences.org/articles/33050">upped the bar</a> by not requiring a control cohort.
There has also been interesting progress on <a href="https://www.nature.com/articles/nature23091">predicting epitope specificity from TCR sequence</a> using structurally-informed sequence analysis.</p>
<p><a href="https://www.fredhutch.org/en/labs/profiles/bradley-phil.html">Phil Bradley</a>, just down the hall from us, wanted to take a different approach, asking <em>given appropriate statistical analysis of a sufficiently large data set, can we infer pathogen-responsive TCRs from co-occurrence and HLA information alone?</em>
(If you don’t remember about HLA, it determines the sequence of MHC, the <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex#/media/File:MHC_Binding_Diagram.png">hot dog bun</a> presenting peptides for recognition to T cells.)
He showed that this indeed was the case, one example of which is shown in the figure above.
Each point is a cluster of TCR sequences, where clustering is performed based on both co-occurrence and on TCR sequence similarity.
Only TCR sequences that are significantly associated with an HLA type are allowed to participate in the clustering, and only clusters that were significant in terms of family-wise error rate are shown.
These clusters are plotted with respect to the cluster size and a co-occurrence score.</p>
<p>The surprising result is that this procedure, which knows nothing about what stimulated the TCRs to expand, identifies previously-labeled TCR sequences corresponding to certain immune states.
You probably recognize EBV, MS, and CMV, but we also see B19=parvovirus B19, INF=influenza, RA=rheumatoid arthritis, T1D=type 1 diabetes, and others.
That’s pretty neat!
This, along with other fun surprises, is published in <a href="http://dx.doi.org/10.7554/eLife.38358">eLife</a>.</p>
<p>I made very minor contributions to this manuscript, but wanted to write about it because I think it’s an exciting advance.
This proof of concept is definitely motivating us to think harder about what sorts of statistical frameworks would be useful for doing this sort of research more comprehensively.
Thanks to Will and Phil, to the Hansen lab for the neat data, and to the study participants.</p>
The Bayesian optimist's guide to adaptive immune receptor repertoire analysis
http://matsen.fredhutch.org/general/2018/05/12/bayesian-optimist.html
Sat, 12 May 2018 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2018/05/12/bayesian-optimist.html<p><img src="https://matsen.fredhutch.org/images/box-loop.png" width="470" class="pull-right" /></p>
<p>Immune receptor sequencing is stochastic through and through.
We have cells with random V(D)J rearrangements that are stimulated through some random process of exposures, which lead to some random amount of expansion, and in the B cell case there is some random process of mutation and selection.
So why don’t we use methods incorporating that uncertainty into our analysis?</p>
<p>We’ve tried to do this in our work, and have made some progress, but there is so much left to be done.
When Sarah Cobey and Patrick Wilson kindly invited me to contribute to their special issue of <em>Immunological Reviews</em>, I knew I wanted to step back and ask:</p>
<p><em>If computation was no barrier, how would we design an analysis framework that integrated out uncertainty in unknown quantities and took advantage of the hierarchical structure inherent in immune receptor data?</em></p>
<p>I teamed up with Branden Olson, a Statistics PhD student in the lab, and went to work.
It was a fun exercise to think through all of the steps of immune repertoire development and ask: what is the most realistic model under which inference should be possible, and what is the most realistic model for which we can perform simulation?
This was more effort than anticipated, but 230 references later the final version is now <a href="https://arxiv.org/abs/1804.10964">up on arXiv</a> and accessible for free (though I understand if you want to wait a few months to pay $38 and get it from the journal website).</p>
<p>In addition to dreaming up research directions, I wanted to explain to my immunologist pals why I think probabilistic analysis methods are crucial, and describe the basics of Bayesian analysis via simple metaphors.
Ideally this will lead to a little more crosstalk between communities.
Traditionally, statisticians and lab biologists have been on independent tracks (see image above) even though they investigate the same underlying phenomena.
I hope that in the future we can unify these tracks by developing statistical models based on mechanism and design experiments based on statistical inferences.</p>
<p>I also hope that this serves as an invitation to the computational statistics community.
As we say at the end: “The computational statistician interested in immune receptor modeling is blessed with a complex biological system to analyze, intractable computational problems heaped on top of one another, and an ever-expanding collection of data sets generated from various in-vivo and in-vitro perturbations.”</p>
<p>Come play!</p>
Benchmarking tree and ancestral sequence inference for B cell receptor sequences
http://matsen.fredhutch.org/general/2018/05/02/bcr-phylo-benchmark.html
Wed, 02 May 2018 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2018/05/02/bcr-phylo-benchmark.html<p><img src="https://matsen.fredhutch.org/images/bcr-phylo-benchmark.png" width="470" class="pull-right" /></p>
<p>Phylogenetic tools, in particular for ancestral sequence reconstruction, get used a lot in the B cell receptor (BCR) sequence analysis world.
For example, they get used to reconstruct intermediate antibodies that then get synthesized in the lab and tested for binding (<a href="http://dx.doi.org/10.1126/science.1207532">Wu et al., 2011</a>).
But how well do phylogenetic tools work in this parameter regime?
Although there have been countless benchmarking studies for phylogenetics, the case of B cell sequence evolution is different than the usual setting for phylogenetics:</p>
<ul>
<li>Sampling and sequencing, especially for direct sequencing of germinal centers, is dense compared to divergence between sequences. Because of the resulting distribution of short branch lengths, zero-length branches and multifurcations representing simultaneous divergence are common.</li>
<li>The somatic hypermutation (SHM) process in affinity maturation is a highly <a href="/general/2017/11/14/motif.html">nucleotide-context-dependent process</a>.</li>
<li>Repertoire sequencing typically focuses on the coding sequence of antibodies, which are under very strong selective constraint. This contrasts with the neutral evolution assumptions of most phylogenetic algorithms, as well as the simulation software assumptions traditionally used for phylogenetics benchmarks.</li>
<li>In contrast to typical phylogenetic problems where the root sequence is unknown, one has significant information about the root sequence for BCR sequences: namely, that it’s a recombination of V, (D), and J genes, which are somewhat well characterized.</li>
</ul>
<p>BCR sequences also offer additional opportunities for validation.
Specifically, the irreversible <a href="https://en.wikipedia.org/wiki/Immunoglobulin_class_switching">class switching process</a> gives us a marker that should only go in one direction along a tree branch.
If it goes another direction, this indicates problems with the tree reconstruction.</p>
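<p>As a sketch of how such a check could look, assume we have a rooted tree given as parent-to-child edges and an isotype label for every node, including internal ones (how those internal labels are obtained is outside this example, and the input format is hypothetical). The ranking follows the order of the human heavy-chain constant genes along the locus, and treating it as a strict ranking is a simplification.</p>

```python
# Human heavy-chain constant genes in locus order; class switching deletes
# the intervening DNA, so a lineage should only move rightward in this list.
ISOTYPE_ORDER = ["IgM", "IgD", "IgG3", "IgG1", "IgA1", "IgG2", "IgG4", "IgE", "IgA2"]
RANK = {isotype: i for i, isotype in enumerate(ISOTYPE_ORDER)}

def class_switch_violations(edges, isotype_of):
    """Return the (parent, child) edges where the child isotype is upstream of the parent's."""
    return [(parent, child) for parent, child in edges
            if RANK[isotype_of[child]] < RANK[isotype_of[parent]]]

edges = [("root", "n1"), ("n1", "s1"), ("n1", "s2"), ("root", "s3")]
isotype_of = {"root": "IgM", "n1": "IgG1", "s1": "IgA1", "s2": "IgM", "s3": "IgM"}
print(class_switch_violations(edges, isotype_of))
```

<p>Here the edge from <code>n1</code> (IgG1) to <code>s2</code> (IgM) is flagged, which would suggest that <code>s2</code> has been placed under the wrong ancestor.</p>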
<p>Before I sketch the results of our analysis, I should mention differences between our work and another <a href="http://dx.doi.org/10.1093/bioinformatics/btx533">recent paper</a> that also set up a benchmark of phylogenetic methods.
Much of that paper concerns the results of phylogenetic inference using a “toy” clonal family inference method with necessarily bad performance, whereas here we assume that clonal families have been properly inferred.
In addition, we simulate sequences under selection using an affinity-based model (which we show makes the inferential problem significantly more difficult), we compare accuracy of ancestral sequence inference, we include additional software tools (several of which are BCR-specific), and we use class-switching data as a further non-simulation means of benchmarking methods.</p>
<p>For this work, Kristian cooked up a simulator for B cell affinity maturation.
Although quite a lot of simulators have been written, going back to <a href="https://link.springer.com/chapter/10.1007%2F978-3-642-71984-4_13">Clone</a>, none of these did what we wanted, which was to use a context model to simulate mutations, and then use the corresponding amino acid sequences for a selection step.
Kristian’s model is simple, but nonetheless we feel that it does an appropriate job of simulating sequences for the purposes of benchmarking methods.
We show that the simulated data broadly speaking “looks like” germinal center data.</p>
<p>You can read the full results <a href="https://www.biorxiv.org/content/early/2018/04/25/307736">on bioRxiv</a>, but here are the things that surprised us:</p>
<ul>
<li>Picking between equally parsimonious trees using a context-sensitive model works surprisingly well. This makes us want to continue working on incorporating full context models into phylogenetic methods.</li>
<li>PHYLIP is quite a good choice! I thought that the BCR community was fairly behind the times by not using some of the more modern maximum-likelihood packages, but IQ-TREE is the only recently-developed package that does ML on trees and ancestral sequence inference, and it performs significantly worse (although it’s much faster and nicer to use!).</li>
<li><a href="http://dx.doi.org/10.1534/genetics.116.196303">IgPhyML</a> is a cool project that works to integrate hotspot motifs and Goldman-Yang codon modeling, which it does by marginalizing out hotspot motifs when they extend across a codon boundary. It does reasonably but not as well as we expected, which may be because we are benchmarking on the moderately-sized trees with which we have experience rather than the very deep broadly-neutralizing trees investigated in the IgPhyML paper.</li>
<li>The class-switching data gave noisier results than we had hoped for, giving error bars of the same magnitude as differences between methods. However, it confirmed that picking equally parsimonious trees using a context-sensitive model increases accuracy. Perhaps with better sampling or just more data we can learn more from class-switching data in the future.</li>
</ul>
<p>There’s quite a lot more to do here, both in terms of method development and benchmarking, and we look forward to watching this area mature in the coming years.
Thanks to Kristian for his great work!</p>
Predicting B cell receptor substitution profiles using public repertoire data
http://matsen.fredhutch.org/general/2018/04/19/spurf.html
Thu, 19 Apr 2018 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2018/04/19/spurf.html<p><img src="https://matsen.fredhutch.org/images/spurf.png" width="470" class="pull-right" /></p>
<p>Can we predict how sites of an antibody will tolerate amino acid substitutions?
Kristian Davidsen posed this question shortly after he arrived in my group, pointing out that being able to do such prediction would be quite useful.
For example, engineered antibodies sometimes aggregate into clumps or have other properties that make them useless for mass production.
If we could figure out ways to change the amino acid sequence of an antibody without changing binding properties, that could help us avoid aggregation and make a more useful antibody.</p>
<p>How to start to address this complex and high-dimensional question?
Although people have started to do <a href="http://dx.doi.org/10.7554/eLife.23156">deep mutational scanning on antibodies</a>, this type of data is hard to come by.
On the other hand, B cell repertoire (i.e. antibody-coding) sequence data is becoming plentiful.
B cells undergo affinity maturation to improve binding in collections of sequences called “clonal families” grouped by naive ancestor sequence (more background <a href="/general/2016/04/16/partis-clustering.html">here</a>).
Although it’s not quite the same, we can use the frequency of an amino acid at a given site in that clonal family as a proxy for the suitability of that amino acid for an antibody binding the same target.
Or perhaps such a clonal-family amino-acid frequency is simply an interesting object in itself.</p>
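<p>For concreteness, the per-site amino-acid frequencies of a clonal family can be computed as below. This is a minimal sketch that assumes the family’s sequences are already aligned to equal length; the toy sequences and the function name are made up.</p>

```python
from collections import Counter

def site_frequencies(aligned_seqs):
    """Per-site amino-acid frequencies for equal-length aligned sequences."""
    length = len(aligned_seqs[0])
    if any(len(seq) != length for seq in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for site in range(length):
        counts = Counter(seq[site] for seq in aligned_seqs)
        profile.append({aa: n / len(aligned_seqs) for aa, n in counts.items()})
    return profile

family = ["CARD", "CARE", "CTRD", "CARD"]   # toy clonal family
for site, freqs in enumerate(site_frequencies(family)):
    print(site, freqs)
```

<p>Raw frequencies like these ignore the family’s tree structure, so a clade that happens to be heavily sampled can dominate a site.</p>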
<p>In any case, our goal became:
<em>given a single sequence from a clonal family, can we predict the amino acid frequency of the collection of sequences in the clonal family?</em>
We follow <a href="http://dx.doi.org/10.3389/fimmu.2017.00537">Sheng, Schramm et al. (2017)</a> in calling this sort of thing a <em>substitution profile</em>.
Inferring a substitution profile from a single sequence might sound hard or impossible, but several features of the affinity maturation process lean in our favor:</p>
<ol>
<li>There are a finite number of germline ancestor sequences from which diversification begins, and we can do a good job of inferring from which ancestor a given B cell sequence derives.</li>
<li>Simply because of the mutation process, some sites are more likely to mutate than others (recently covered <a href="/general/2017/11/14/motif.html">here</a>).</li>
<li>There’s lots of other repertoire data that we can use to watch the affinity maturation process.</li>
</ol>
<p>This last one is sort of special, and deserves a bit of explanation.
If we had a database containing every B cell sequence that had ever occurred, one could simply look for clonal families containing the sequence given to us, and take the average amino acid profile of those clonal family sequences.
Unfortunately we don’t have access to such a database, but we can at least look for somewhat similar sequences and learn from their substitution profiles.</p>
<p>The previous Sheng-Schramm work, as well as contemporaneous work by <a href="http://dx.doi.org/10.3389/fimmu.2017.01433">Kirik et al. (2017)</a>, also indicates that various germline genes diversify in various characteristic ways (this sentiment also appears in <a href="http://dx.doi.org/10.1371/journal.pcbi.1004409">Duncan’s first B cell paper</a> and I’m sure many other previous works).
This tells us that a profile based on germline gene identity should also inform a predicted substitution profile.
Also, the context-sensitive neutral process given a germline gene should be helpful.</p>
<p>How do we combine these various sorts of information, especially considering that what is helpful for prediction at one site might not be helpful for another?
Well, our group, consisting of Kristian, Amrit Dhar, and Vladimir Minin, decided to use a penalized tensor regression framework.
That sounds fancy, but it just means that a single profile is a weighted linear combination of the profiles from each of the sources of information (see picture above).
The weights may differ from site to site, but the kind of penalization we put on keeps them from changing too much between neighboring sites.
It also zeroes out coefficients that don’t seem to be helping out-of-sample prediction.
We find that different sources of information are useful for different parts of the B cell receptor sequence, in a way that corresponds to intuition about the “framework” and “complementarity determining” regions.</p>
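<p>The “weighted linear combination” can be sketched in a few lines. The penalty below is only a guess at the flavor described above: an absolute-value term that pushes weights to zero plus a term on differences between neighboring sites. The exact form used in the paper is not given in this post, so treat <code>penalty</code>, <code>lam_sparse</code>, and <code>lam_smooth</code> as assumptions of the example.</p>

```python
def combine_profiles(profiles, weights):
    """Site-wise weighted sum of profiles.

    profiles[k][i][a]: frequency of amino acid a at site i from source k.
    weights[i][k]:     weight of source k at site i.
    """
    n_sources = len(profiles)
    n_aa = len(profiles[0][0])
    return [[sum(weights[i][k] * profiles[k][i][a] for k in range(n_sources))
             for a in range(n_aa)]
            for i in range(len(weights))]

def penalty(weights, lam_sparse, lam_smooth):
    """Illustrative penalty: sparsity plus smoothness across neighboring sites."""
    sparse = sum(abs(w) for row in weights for w in row)
    smooth = sum(abs(weights[i + 1][k] - weights[i][k])
                 for i in range(len(weights) - 1)
                 for k in range(len(weights[0])))
    return lam_sparse * sparse + lam_smooth * smooth

# Two sources, two sites, and a two-letter alphabet to keep the toy readable.
profiles = [[[1.0, 0.0], [0.5, 0.5]],
            [[0.0, 1.0], [0.5, 0.5]]]
weights = [[0.75, 0.25], [0.0, 1.0]]
print(combine_profiles(profiles, weights))
print(penalty(weights, lam_sparse=1.0, lam_smooth=1.0))
```

<p>Fitting would then mean choosing the weights to minimize prediction error plus this penalty on held-out clonal families, which is where the real work lies.</p>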
<p>In any case, we show that integrating these diverse sources of information can help prediction, and provide a pre-trained prediction algorithm to do so.
The code and parameters are <a href="https://github.com/krdav/SPURF">on Github</a> and the paper is <a href="http://arxiv.org/abs/1802.06406">on arXiv</a>.
So have at it with your sequences, and let us know how it fares!</p>
<p>I think that predicting substitution profiles is an interesting and useful goal.
It did take a little getting used to, because we previously <a href="http://dx.doi.org/10.1098/rstb.2014.0244">worked super hard</a> to get per-residue natural selection estimates for B cell receptors by carefully separating the mutation and selection processes; here these substitution profiles just smash all that complexity down to a simpler object.
There’s more to be done here: as data sets get bigger and machine learning algorithms get smarter, I look forward to seeing prediction improve!
Thanks to Amrit, Kristian, and Vladimir for a fun project.</p>
Postdoc opening to learn about antibody development during HIV superinfection
http://matsen.fredhutch.org/general/2018/01/10/ab-postdoc.html
Wed, 10 Jan 2018 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2018/01/10/ab-postdoc.html<p>Please see <a href="https://b-t.cr/t/506">https://b-t.cr/t/506</a> for details.</p>
Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data
http://matsen.fredhutch.org/general/2017/12/01/gl-inf.html
Fri, 01 Dec 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/12/01/gl-inf.html<p><img src="https://matsen.fredhutch.org/images/gl-inf-fits.png" width="350" class="pull-right" /></p>
<p>Every B cell receptor sequence in a repertoire came from a V(D)J recombination of germline genes.
Each individual has only certain alleles of these genes in their germline, and knowing this set improves the accuracy of all aspects of BCR sequence analysis, from alignment to phylogenetic ancestral sequence reconstruction.
This germline allele set can be estimated directly from BCR sequence data, and it’s time to treat such estimation as part of standard BCR sequence analysis pipelines.</p>
<p>This central message is not new, but it’s worth emphasizing because doing germline set inference is not part of most current studies of B cell receptor (BCR) sequences.</p>
<p>Indeed, the most common way to annotate sequences is to align them one by one to the full set of alleles present in the IMGT database, which has hundreds of alleles.
Each individual has only a fraction of these alleles in their genome.</p>
<p>Unsurprisingly, aligning sequences one by one to the whole IMGT set can cause problems.
Imagine that A and B are two germline alleles in IMGT that are similar to one another.
Sequences deriving from germline allele A can somatically hypermutate to look more similar to the B allele than the A allele from which they came.
If we allow A and B in our germline repertoire, such sequences will be incorrectly annotated as being from B when they are from A.
This will certainly lead to an incorrect estimation of the naive sequence from which they came.</p>
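<p>A toy illustration of this failure mode, using made-up eight-base "alleles" and plain Hamming distance in place of a real aligner (both are assumptions for the sake of a short example):</p>

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def nearest_allele(read, alleles):
    """Name of the allele with the fewest mismatches to the read."""
    return min(alleles, key=lambda name: hamming(read, alleles[name]))

# Hypothetical germline alleles A and B, differing at two positions.
database = {"A": "ACGTACGT", "B": "ACGTTCGA"}
# A read that truly derives from A but has accumulated three mutations,
# two of which happen to match B.
read = "TCGTTCGA"

print(nearest_allele(read, database))                 # annotated as B
print(nearest_allele(read, {"A": database["A"]}))     # annotated as A
```

<p>Restricting the candidate set to the alleles actually present in the individual removes the opportunity for this kind of misannotation.</p>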
<p>In addition, it’s known through the work of many groups that the total set of germline genes is much larger than that represented in IMGT.
This is not surprising given that this region is tricky to sequence directly, and that so far genetic studies have been primarily done on people of European ancestry.
Here again, if we are missing a sequence from our germline set, we will have problems with all of our downstream analyses.</p>
<p>Thus, we should be estimating per-sample germline sets for BCR sequence data.
This is not a trivial task.
In 2010, <a href="http://dx.doi.org/10.4049/jimmunol.1000445">Scott Boyd and others</a> were the first to use high-throughput sequencing data of rearranged BCRs to estimate per-sample germline sets with a combination of computation, expert judgement, and statistics.
In 2015, the Kleinstein group made a big step by developing TIgGER, an <a href="http://dx.doi.org/10.1073/pnas.1417683112">automated method for inferring germline sets</a> that weren’t too far from existing alleles, and more recently the Hedestam group developed IgDiscover, a <a href="http://dx.doi.org/10.1038/ncomms13642">method that could start more “from scratch”</a> for species where we have little or no germline information.</p>
<p>The motivation for Duncan’s work came from analyzing sequence data from diverse sources, and seeing clear evidence of alleles that were not represented in IMGT.
He tried the existing tools but became frustrated first with software usability.
He then started by re-implementing TIgGER, and then realized that he could use the same input information (their “mutation accumulation” plot depicted above) but in a way that more directly tests for the presence of new alleles, by considering the goodness of fit for one- vs two-component fits.
In classic Duncan fashion, he has done a ton of validation, varying many different parameters in his simulation and also comparing the results of the different methods on experimental data sets.
The work is now <a href="https://arxiv.org/abs/1711.05843">up on arXiv</a> and is part of his <a href="https://github.com/psathyrella/partis">partis</a> suite of repertoire analysis tools.</p>
<p>There’s still a lot to be done here, and our knowledge of this highly diverse and important locus will continue to improve as more sequencing data of all types comes in.
This is one example of many showing how analysis of a whole data set at once is more powerful for each individual sequence than one-at-a-time analysis of sequences.</p>
Survival analysis of DNA mutation motifs with penalized proportional hazards
http://matsen.fredhutch.org/general/2017/11/14/motif.html
Tue, 14 Nov 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/11/14/motif.html<p><img src="https://matsen.fredhutch.org/images/motif-samm-example.png" width="350" class="pull-right" /></p>
<p>We are equipped with purpose-built molecular machinery to mutate our genome so that we can become immune to pathogens. This is truly a thing of wonder.</p>
<p>More specifically, I’m talking about mutations in B cells, the cells that make antibodies.
Once a randomly-generated antibody expressed on the outside of the B cell finds something it’s good at binding, the cell boosts the mutation rate of its antibody-coding region by about one million fold.
Those that have better binding are rewarded by stimulation to divide further.
The result of this Darwinian mutation and selection process is antibodies with improved binding properties.</p>
<p>The mutation process is wonderfully complex and interesting. Being statisticians, we paid the highest tribute we can to a process we think is beautiful: we developed a statistical model of it. This work was led by the dynamic duo of <a href="http://students.washington.edu/jeanfeng/">Jean Feng</a> and <a href="https://github.com/dawahs">David Shaw</a>, while <a href="http://www.stat.washington.edu/vminin/">Vladimir Minin</a>, <a href="http://faculty.washington.edu/nrsimon/">Noah Simon</a> and I kibitzed.
Our model is known in statistics as a type of <a href="https://en.wikipedia.org/wiki/Proportional_hazards_model">proportional hazards model</a>. These models were introduced in Sir David Cox’s paper <a href="https://www.jstor.org/stable/2985181"><em>Regression Models and Life-Tables</em></a>, which, with over 4600 citations, is <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.231.5042&rep=rep1&type=pdf">the second most cited paper in statistics</a>.</p>
<p>These models are typically used to infer rates of failure, such as that of humans getting disease.
During our life span we get a sequence of diseases, some of which predispose us to other diseases.
By considering sequences of diseases across many individuals, we can use these proportional hazards models to infer the rate of getting various diseases given disease history.</p>
<p>There is an analogous situation for B cell sequences in that the mutation process depends significantly on the identity of the nearby bases.
We can observe lots of mutated sequences, and do a similar sort of inference: when a position mutates, it changes the mutability of nearby bases.
Unfortunately we don’t know the order in which the mutations occurred, and thus don’t know what sequences had increased mutability, so we have to do Gibbs sampling over orders.
The paper describing these methods and some results is <a href="http://dx.doi.org/10.1214/18-AOAS1233">published in Annals of Applied Statistics</a>.</p>
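<p>As a rough sketch of why the order matters, the following computes a Cox-style partial likelihood for one proposed ordering of mutations, where each position's hazard depends on its current motif. The function names, the 3-mer motif, and the rule that each position mutates at most once are simplifying assumptions for illustration, and the Gibbs sampler over orders is not shown.</p>

```python
import math

def kmer_at(seq, i, k=3):
    """The k-mer centered at position i, padding the ends with 'N'."""
    half = k // 2
    padded = "N" * half + seq + "N" * half
    return padded[i:i + k]

def order_log_likelihood(start, mutations, theta, k=3):
    """Log partial likelihood of one ordering of mutations.

    start:     unmutated sequence
    mutations: list of (position, new_base) in the proposed order
    theta:     dict mapping k-mer to log hazard (missing motifs get 0.0)
    """
    seq = list(start)
    at_risk = set(range(len(seq)))
    loglik = 0.0
    for pos, new_base in mutations:
        current = "".join(seq)
        log_hazard = {i: theta.get(kmer_at(current, i, k), 0.0) for i in at_risk}
        log_total = math.log(sum(math.exp(v) for v in log_hazard.values()))
        # Probability that this position is the next one to mutate.
        loglik += log_hazard[pos] - log_total
        seq[pos] = new_base     # the mutation changes the motifs of its neighbors
        at_risk.remove(pos)     # assumed: each position mutates at most once
    return loglik
```

<p>Two orderings of the same set of mutations generally get different likelihoods, because an early mutation can create or destroy a highly mutable motif for a later one.</p>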
<p>We were inspired by the very nice <a href="http://dx.doi.org/10.3389/fimmu.2013.00358">work</a> of the <a href="http://medicine.yale.edu/lab/kleinstein/">Kleinstein lab</a> developing similar sorts of models using simpler methods.
However, we wanted a more flexible modeling framework and for the complexity of the models to automatically scale to the signal in the data, which we did using penalization with the LASSO.
What you see in the figure above is how we can set up a hierarchical model with a penalty that zeroes out 5-mer terms when they don’t contribute anything above the corresponding 3-mer term (the last base being unimportant gives the block-like structure, while when the first base is unimportant it gives the 4-fold repetitive pattern you can see when zooming out).
We are also indebted to Steve and his team, especially Jason Vander Heiden, for supplying us with sequence data.
They are a class act.</p>
<p>There’s a lot of interest in context-sensitive mutation processes these days, such as <a href="https://elifesciences.org/articles/24284">Kelly Harris’ work</a> on how we can watch context-sensitive mutabilities change through evolutionary time, and <a href="https://doi.org/10.1038/nature12477">Ludmil Alexandrov’s work</a> on mutation processes in cancer.
In both of these cases, they are in the process of transitioning from a statistical description of these processes to linking them with specific mutagens and repair processes.</p>
<p>Here too we would like to use statistics to learn more about the mechanisms behind these context-sensitive mutations.
What’s neat about the framework that Jean and David developed is that now we can design features that correspond to specific mechanistic hypotheses and test how much they impact mutation rates.
Stay tuned!</p>
Using genotype abundance to improve phylogenetic inference
http://matsen.fredhutch.org/general/2017/09/05/gctree-phylo.html
Tue, 05 Sep 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/09/05/gctree-phylo.html<p><img src="https://matsen.fredhutch.org/images/gctree-phylo.png" width="350" class="pull-right" /></p>
<p>When doing computational biology, listen to biologists.
I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.</p>
<p>In this particular case, the starting point was the <a href="http://dx.doi.org/10.1126/science.aad3439">stunningly beautiful work of Gabriel Victora’s lab</a> visualizing germinal center dynamics in living mice.
For those not yet initiated into the beauty of B cell repertoire, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binder reproduces more.
As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts.
Such single-cell sequencing, and consequent abundance information, is now becoming commonplace.
<em>How should we use this abundance information in phylogenetics?</em></p>
<p>Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered.
Indeed, they were building trees by hand, using several criteria about what makes for a believable evolutionary scenario.
One of their intuitions was that <em>more abundant genotypes have more opportunity to leave mutant descendants</em>.
Therefore, when we are doing inference, we should prefer trees that attach branches to frequently observed genotypes compared to less frequently observed genotypes (see picture, in which the frequency of a given genotype is the number inside the circle; we call this structure a genotype collapsed tree or <em>GCtree</em>).</p>
<p>To have an objective computational method we need to formalize this intuition.
Will DeWitt, Vladimir Minin, and I formulated it in terms of an “infinite type” branching process, in which every mutation creates a new type.
We can augment existing sequence-based optimality criteria with the likelihood of the tree under our branching process model.
In our case we decided to show that this works by ranking maximum-parsimony trees (there are often many equally parsimonious trees).
Parsimony is in wide use in the B cell analysis community because it is a defensible choice when sampling is dense relative to mutations (as in the case of germinal centers), and it allows inference of zero branch lengths (leading to inference of sampled ancestral genotypes and multifurcations).
We showed under simulation that more highly ranked trees were more correct than lower ranked trees.
With the paired heavy and light chain data from the Victora lab, we were also able to do a biological validation by showing that trees that should be the same are more similar when using our algorithm than without.
The result is now <a href="http://arxiv.org/abs/1708.08944">up on arXiv</a>.</p>
<p>If you are muttering to yourself that we should be using this model as a prior for a Bayesian analysis, we hear you.
Hopefully this motivates additional work in that sphere for abundance-based models.
We do note that the limited amount of mutation described before will lead to a fairly flat posterior.
Furthermore, although one can infer sampled ancestors using an <a href="http://dx.doi.org/10.1371/journal.pcbi.1003919">RJMCMC</a> and multifurcations using <a href="http://dx.doi.org/10.1093/sysbio/syu132">phycas</a>, these two features do not exist yet under one roof.</p>
<p>Will did a great job with this project, which is a nice complement to his existing publications as he heads into the UW Genome Sciences PhD program!
We had a great time working with Luka and Gabriel, and look forward to more collaboration in the future.</p>
Probabilistic Path Hamiltonian Monte Carlo
http://matsen.fredhutch.org/general/2017/06/26/pphmc.html
Mon, 26 Jun 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/06/26/pphmc.html<p><img src="https://matsen.fredhutch.org/images/hmc-tub.png" width="350" class="pull-right" /></p>
<p>Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions.
HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state.
These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.</p>
<p>Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down.
When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction.
We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal.
Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the Hamiltonian in HMC).
The acceptance ratio is determined only by how well our numerical integration performs: better numerical integration means a higher acceptance probability.</p>
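<p>For readers who have not seen it written out, here is a minimal sketch of classical HMC for a one-dimensional target, using the standard leapfrog integrator. This is textbook HMC on a smooth space, not the phylogenetic sampler discussed below; the function names and default step sizes are illustrative assumptions.</p>

```python
import math
import random

def leapfrog(q, p, grad_u, step, n_steps):
    """Numerically integrate the soap's path with the leapfrog scheme."""
    p = p - 0.5 * step * grad_u(q)          # half step for momentum
    for i in range(n_steps):
        q = q + step * p                    # full step for position
        if i < n_steps - 1:
            p = p - step * grad_u(q)        # full step for momentum
    p = p - 0.5 * step * grad_u(q)          # final half step for momentum
    return q, p

def hmc_step(q, u, grad_u, step=0.1, n_steps=20, rng=random):
    """One HMC transition, where u(q) is the negative log posterior."""
    p0 = rng.gauss(0.0, 1.0)                # random initial momentum
    q_new, p_new = leapfrog(q, p0, grad_u, step, n_steps)
    h_old = u(q) + 0.5 * p0 * p0            # total energy before
    h_new = u(q_new) + 0.5 * p_new * p_new  # total energy after
    # Exact integration would conserve energy and always accept;
    # numerical error is the only reason to reject.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q
```

<p>For a standard normal target one would pass <code class="highlighter-rouge">u = lambda q: 0.5 * q * q</code> and <code class="highlighter-rouge">grad_u = lambda q: q</code>.</p>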
<p>Vu had been noodling around with phylogenetic HMC when we heard that Arman Bilge (at that time in Auckland) had an implementation as well.
These implementations not only moved through branch length space according to usual HMC, but also moved between topologies.
They did this as follows: once a branch length hits zero, leading to four branches joined at a single node, one can regroup those branches together randomly in another configuration and continue.
(If you are a phylogenetics person, this is a random nearest-neighbor interchange around the zero length branch.)
This randomness, which is rather different than the deterministic paths of classical HMC once a momentum is drawn, is why we call the algorithm Probabilistic Path Hamiltonian Monte Carlo (PPHMC).</p>
<p>The primary challenge in theoretical development is that the PPHMC paths are no longer deterministic.
Thus concepts such as reversibility and volume preservation, which are typical components of correctness proofs for HMC, need to be generalized to probabilistic equivalents.
Vu had to work pretty hard to develop these elements and show that they led to ergodicity.</p>
<p>On the implementation front, Arman was also working hard to build an efficient sampler.
However, the HMC integrator had difficulty going from one tree topology to another without incurring substantial error.
We thrashed around for a while trying to improve things with a “careful” integrator that would find the crossing time and perhaps re-calculate gradients at that time, but proving that such a method would work seemed very hard.</p>
<p>Then, magically, our newest postdoc Cheng Zhang showed up and saved us with a smoothing surrogate function.
This surrogate exchanges the discontinuity in the derivative for discontinuity in the potential energy, but we can deal with that using a “refraction” method introduced by Afshar and Domke in 2015.
This approach allows us to maintain a low error, and thus make very long trajectories with a high acceptance rate.</p>
<p>I’m happy to announce that our <a href="https://arxiv.org/abs/1702.07814">manuscript</a> has been accepted to the 2017 International Conference on Machine Learning.
Practically speaking, this work is definitely a proof of concept.
We have taken an algorithm that was previously only defined for smooth spaces and extended it to orthant complexes, which are basically Euclidean spaces with boundary glued along those boundaries in intricate ways.
Our implementation is not fully optimized, but even if it were I’m not sure that it would out-compete good old MCMC for phylogenetics without some additional tricks.</p>
<p>To know if this flavor of sampler is going to be useful, we really need to better understand what I call the <em>local to global</em> question in phylogenetics.
That is, to what extent does local information tell us about where to modify the tree?
This is straightforward for posteriors on Euclidean spaces: the gradient points us towards a maximum.
But for trees, does the gradient (local information) tell us anything about what parts of the tree should be modified (global information)?
We’ll be thinking about this a lot in the coming months!</p>
Smart proposals win for online phylogenetics using sequential Monte Carlo.
http://matsen.fredhutch.org/general/2017/06/07/sts.html
Wed, 07 Jun 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/06/07/sts.html<p><img src="https://matsen.fredhutch.org/images/sts-proposal.png" width="350" class="pull-right" /></p>
<p>Sometimes projects take years to bear fruit.</p>
<p>As I described <a href="/general/2016/10/26/smc-theory.html">previously</a>, <a href="http://darlinglab.org/">Aaron Darling</a> and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for <em>online</em> Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences.
In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting.
The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”.
These two projects have just recently resulted in intertwined submitted papers.</p>
<p>The SMC implementation, which we called <code class="highlighter-rouge">sts</code> for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it.
One of the things that kept us from wrapping it up was problems with <em>particle degeneracy</em>, which is as follows.
I think of SMC as a probabilistically correct version of evolutionary computation, in which trees get to reproduce if they have a high posterior.
Every time we have a new sequence, we attach it to our existing population of trees via some proposal distribution suggesting where to add it.
Particle degeneracy, then, means that you obtain a low diversity population due to a few trees “taking over” the population because they are significantly better than the rest.</p>
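<p>A common way to quantify degeneracy is the effective sample size (ESS) of the particle weights. The sketch below uses the usual formula, one over the sum of squared normalized weights; it is a generic diagnostic with hypothetical function names, not code from <code class="highlighter-rouge">sts</code>.</p>

```python
import math

def normalized_weights(log_weights):
    """Convert log weights to weights that sum to one, avoiding overflow."""
    top = max(log_weights)
    raw = [math.exp(lw - top) for lw in log_weights]
    total = sum(raw)
    return [w / total for w in raw]

def effective_sample_size(log_weights):
    """ESS = 1 / sum of squared normalized weights.

    Equals the number of particles when all weights are equal, and
    approaches one when a single particle dominates.
    """
    weights = normalized_weights(log_weights)
    return 1.0 / sum(w * w for w in weights)
```

<p>When a few trees take over, most of the weight sits on a handful of particles and the ESS collapses toward one, even though the nominal population size is unchanged.</p>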
<p>In working with <code class="highlighter-rouge">sts</code>, we found that we could mitigate the effect of particle degeneracy by developing “smart” proposals that use the data and corresponding likelihood function to decide where to try next.
This is in contrast to previous work by <a href="http://dx.doi.org/10.1093/sysbio/syr131">Bouchard-Côté et al. 2012</a>, which shows that for a different formulation of SMC based on subtree merging one does better with simple proposals (compare <a href="http://papers.nips.cc/paper/3266-bayesian-agglomerative-clustering-with-coalescents">Teh et al. 2008</a>).
Our proposals have to decide to what edge to attach, where along the edge to attach, and how long of a branch length to use for the attachment (see picture above).</p>
<p>Although the group put in hard work, there was a lot more needed to put all this together and actually get a working sampler.
Luckily, Aaron recruited a super sharp and motivated postdoc in <a href="https://au.linkedin.com/in/mfourment">Mathieu Fourment</a>.
Mathieu tried a lot of variants of proposal distributions, showed that “heated” proposal distributions were effective for picking attachment branches, and added a parsimony-based proposal.
He also showed that a high effective sample size is necessary but not sufficient to have a good SMC posterior sample, and that we can develop a time-competitive sampler compared to running MrBayes again and again.
That work is now up <a href="http://biorxiv.org/content/early/2017/06/02/145219">on bioRxiv</a>.</p>
<p>Smart proposals need to be designed carefully.
One of the bottlenecks was proposing the new pendant branch length, and for that the lcfit surrogate function (the subject of Connor’s talk at Evolution 2013) worked well.
This spurred us to finish and write up the lcfit work, despite the fact that <a href="http://dx.doi.org/10.1093/sysbio/syv051">Aberer et al. 2015</a> wrote a paper in which they describe how to use standard probability distribution functions as surrogate functions for branch length proposals.
This took a bit of wind out of our sails, but our purpose-built lcfit surrogate still has some interesting advantages over common functions, such as that it has the right “shape” for likelihood curves, even when branch lengths become long.
We were surprised to find that it does well for complex settings with heterogeneous models.
On a practical level, it’s implemented as a stand-alone C library so it can be easily incorporated into other programs, in contrast to the work of Aberer and co, which is tightly integrated into ExaBayes.
In the end we have turned it into a short paper with a long appendix which is now <a href="https://arxiv.org/abs/1706.00659">up on arXiv</a>.
Brian Claywell is the hero of this lcfit work– he was persistent in developing creative strategies to fit the surrogate function in a variety of settings.</p>
<p>There is a lot to be done for online Bayesian phylogenetics still.
We don’t even try to sample model parameters, and haven’t tried making BEASTly rooted trees.
There are many opportunities for optimization, some of which are described in the paper.
For the parallelism nerds out there, I also note the existence of the <a href="https://arxiv.org/abs/1407.2864">particle cascade</a>.</p>
<p>I certainly hope that this contribution guides future developers towards useful online Bayesian phylogenetics samplers: it would be great to be able to keep trees up to date as the genomes keep rolling in from projects such as <a href="http://www.zibraproject.org/">ZIBRA</a>.
If you are interested in more details, get in touch!</p>
Incorporating new sequences into posterior distributions using SMC
http://matsen.fredhutch.org/general/2016/10/26/smc-theory.html
Wed, 26 Oct 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/10/26/smc-theory.html<p><img src="https://matsen.fredhutch.org/images/smc.png" width="350" class="pull-right" />
The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular.
In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.</p>
<p>The bummer about Bayesian computation?
It takes so darn long for those chains to converge.
And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics?
<em>If you get a new sequence you have to start over from scratch.</em></p>
<p>I’ve been thinking about this for several years with <a href="http://darlinglab.org/">Aaron Darling</a>, and in particular about Sequential Monte Carlo (SMC) for this application.
SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection.
You can think about it as a probabilistically correct genetic algorithm– one that is guaranteed to sample the correct posterior distribution given an infinite number of particles.</p>
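<p>The "reproduction and selection" loop can be sketched generically as follows. This is a bare-bones SMC generation with multinomial resampling; the callbacks <code class="highlighter-rouge">propose</code> and <code class="highlighter-rouge">log_incremental_weight</code> are hypothetical placeholders for whatever extends a particle (for example, attaching a new sequence to a tree) and scores the result.</p>

```python
import math
import random

def smc_step(particles, log_weights, propose, log_incremental_weight, rng=random):
    """One SMC generation: resample by weight, then extend and reweight.

    propose(parent, rng) returns an extended particle.
    log_incremental_weight(parent, child) returns the log of
    target(child) / (target(parent) * proposal density of child given parent).
    """
    # Selection: particles survive in proportion to their weights.
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]
    parents = rng.choices(particles, weights=weights, k=len(particles))
    # Reproduction: each survivor proposes an extended state.
    children = []
    new_log_weights = []
    for parent in parents:
        child = propose(parent, rng)
        children.append(child)
        new_log_weights.append(log_incremental_weight(parent, child))
    return children, new_log_weights
```

<p>Resampling at every generation is the simplest choice; many samplers resample only when the weights become too uneven.</p>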
<p>Although <a href="http://sysbio.oxfordjournals.org/content/61/4/579">SMC has been applied before in phylogenetics</a>, it has not been used in an “online” setting to update posterior distributions as new sequences appear.
Aaron and I worked up a proof-of-concept implementation of SMC with Connor McCoy, but when Connor left for Google the implementation lost some steam.</p>
<p>However, when Vu arrived I was still curious about the theory behind our little SMC implementation.
Such SMC algorithms raise some really interesting mathematical questions concerning how the phylogenetic likelihood surface changes as new sequences are added to the data set.
If it changes radically, then the prospects for SMC are dim.
We call this the <em>subtree optimality question</em>: do high-likelihood trees on <em>n</em> taxa arise from attaching branches to high-likelihood trees on <em>n-1</em> taxa?
Years ago I considered a <a href="http://dx.doi.org/10.1007/s11538-010-9556-x">similar question with Angie Cueto</a> but that was for the distance-based objective function behind neighbor-joining, and others have thought about it from the empirical side under the guise of taxon sampling.</p>
<p>As described in an <a href="https://arxiv.org/abs/1610.08148">arXiv preprint</a>, Vu developed a theory with some real surprises!
First, an induction-y proof leads to consistency: as the number of particles goes to infinity, we maintain a correct posterior at every stage.
Then he directly took on the subtree optimality problem by writing out the relevant likelihoods and using bounds.
(We think that this is the first theoretical result for likelihood on this question.)</p>
<p>Then the big win: using these bounds on the ratio of likelihoods for the parent and child particles, he was able to show that the effective sample size (ESS) of the sampler is bounded below by a constant multiple of the number of particles.
This is pretty neat for two reasons: first, it’s good to know that we are getting a better posterior estimate as we increase our computational effort, and it’s nice that the ESS goes up linearly with this effort.
Second, this constant doesn’t depend on the size of the tree, so this bodes well for building big trees by incrementally adding taxa.</p>
<p>Of course, with this sort of theory we can’t get a reasonable estimate on the size of this key constant, and indeed it could be uselessly small.
However, I’m still encouraged by these results, and the paper points to some interesting directions.
For example, because SMC is continually maintaining an estimate of the posterior distribution, one can mix in MCMC moves in ways that otherwise would violate detailed balance, such as using an MCMC transition kernel that focuses effort around the newly added edge.
In this way we might use an “SMC” algorithm with relatively few particles which in practice resembles a principled highly parallel MCMC.
On the other hand we might use <a href="http://jmlr.org/proceedings/papers/v32/jun14.pdf">clever tricks</a> to scale the SMC component up to zillions of particles.</p>
<p>All this strengthens my enthusiasm for continuing this work.
Luckily, Aaron has recruited Mathieu Fourment to work on getting a useful implementation, and every day we are getting good news about his improvements.
So stay tuned!</p>
Summer high school and undergraduate students 2016
http://matsen.fredhutch.org/general/2016/07/22/summer-scholars.html
Fri, 22 Jul 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/07/22/summer-scholars.html<p><img src="https://matsen.fredhutch.org/images/summer-students-2016.jpg" width="385" class="pull-right" />
I definitely didn’t set out to have three high school and two undergrad students this summer.</p>
<p>But they’re fantastic, and making real contributions to our scientific work!
Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s <a href="https://github.com/psathyrella/partis/">partis</a> software.
Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages.
Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff.
All of them are learning about science and coding.</p>
<p>These students rock, and I can’t wait to see what great things they bring into the world with their talent.</p>
Analysis of a slightly gentler discretization of time-trees
http://matsen.fredhutch.org/general/2016/07/11/discrete-time-tree.html
Mon, 11 Jul 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/07/11/discrete-time-tree.html<p><img src="/images/NNI_VS_rNNI.png" width="395" class="pull-right" /></p>
<p>Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference.
As such, we and others think a lot about how algorithms move between topologies, typically formalizing this as a path through a graph whose vertices are tree topologies and whose edges are moves from one tree to another.
Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.</p>
<p>But what about time-trees?
Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time.
These are absolute times of the sort one could put on a calendar.
To first approximation, working with time-trees is synonymous with running <a href="http://beast.bio.ed.ac.uk/">BEAST</a>, and we are indebted to the BEAST authors for making time-trees a central object of study in phylogenetics.
Such timing information is essential for studying demographic models, and in many respects time-trees are the “correct” trees to use because real evolution does happen in a rooted fashion and events do have times associated with them.</p>
<p>Because timing is so central to the time-tree definition, and because it’s been shown that defining
<a href="http://dx.doi.org/10.1109/BIBE.2008.4696663">the discrete component of time-tree moves with reference to timing information</a>
works well, perhaps we shouldn’t be so coarse when we discretize time-trees.
Rather than throw away all timing information, an alternative is to discretize calendar time and have at most one event per discretized time.
If we have the same number of times as we have events on the tree, this is equivalent to just giving a ranking, or total order, on the events in the tree which is compatible with the tree structure.
One can be a little less coarse still by allowing branch lengths to take on integer values.</p>
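<p>To get a feel for how much extra information a ranking carries, here is a minimal sketch that counts the rankings compatible with a given rooted binary topology.
It assumes that only internal nodes are ranked (all leaves are treated as contemporaneous), and the nested-tuple tree encoding and the name <code>count_rankings</code> are made up for illustration rather than taken from the preprint.</p>

```python
from math import comb


def count_rankings(tree):
    """Return (number of internal nodes, number of compatible rankings).

    A tree is either a leaf name (str) or a pair (left, right) of subtrees.
    Assumption: only internal nodes receive ranks.
    """
    if isinstance(tree, str):
        return 0, 1
    left, right = tree
    n_left, r_left = count_rankings(left)
    n_right, r_right = count_rankings(right)
    # The root is the oldest event. The events in the two subtrees can be
    # interleaved in any way that keeps the order within each subtree.
    interleavings = comb(n_left + n_right, n_left)
    return n_left + n_right + 1, interleavings * r_left * r_right


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(count_rankings(balanced))     # (3, 2)
print(count_rankings(caterpillar))  # (3, 1)
```

<p>The balanced four-leaf topology admits two rankings because either cherry can be the older one, whereas the caterpillar topology admits only one.</p>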
<p>The goal of our <a href="http://biorxiv.org/content/early/2016/07/12/063362">most recent preprint</a> is to analyze the space of these discretized time-trees.
The study was led by Alex Gavryushkin while he was at the Centre for Computational Evolution in Auckland.
Chris Whidden and I contributed bits here and there.
As for classical discretized trees, we build a graph with the discretized time-trees as vertices and a basic set of moves on these trees as edges.
The goal is to understand basic properties of this graph, such as shortest paths and neighborhood sizes.</p>
<p>Adding timing information makes a substantial difference in the graph structure.
The simplest example of this is depicted above, which compares the usual nearest-neighbor interchange (NNI) with its ranked tree equivalent (RNNI).
RNNI only allows interchange of nodes when those nodes have adjacent ranks.
The picture shows an essential difference between shortest paths for NNI and RNNI: under NNI, if one would like to move the attachment points of two leaves a good ways up the tree, it requires fewer moves to first bundle the leaves into a two-taxon subtree, move that subtree up the tree, then break apart the subtree.
On the other hand, for RNNI, a shortest RNNI graph path simply moves the attachment points individually up the tree.
This is an important difference: for example, the <a href="https://www.researchgate.net/publication/2643042_On_Computing_the_Nearest_Neighbor_Interchange_Distance">computational hardness proof of NNI</a> hinges on the bundling strategy resulting in shorter paths.</p>
<p>The most significant advance of the paper is the application of techniques from a paper by <a href="http://dx.doi.org/10.1137/0405034">Sleator, Tarjan, and Thurston</a> to bound the number of trees in arbitrary-diameter neighborhoods.
The idea is to develop a “grammar” of transformations such that every tree in a neighborhood with a given radius <em>k</em> can be written as a word of length <em>k</em> in the grammar.
Then, the number of trees in the neighborhood is bounded above by the number of letters to the power of the word length.
Further refinements lead to some interesting bounds.
In an interesting twist, these neighborhood size bounds provide a generalized version of a counter-example like that shown in the figure, which shows in more generality that the arguments for the computational hardness proof of NNI do not hold.</p>
<p>Some very nice work from Alex.
There’s going to be more to this story: stay tuned!</p>
A time-optimal algorithm to build the SPR subgraph on a set of trees
http://matsen.fredhutch.org/general/2016/06/30/sprgraphs.html
Thu, 30 Jun 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/06/30/sprgraphs.html<p><img src="/images/sprgraphs.png" width="385" class="pull-right" /></p>
<p>We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods.
The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another.
Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree.
The work to understand this graph has been led by Chris Whidden, including <a href="/general/2014/05/13/sprmix.html">learning about how the graph structure influences Bayesian phylogenetic inference</a> and <a href="/general/2015/04/02/rspr-curvature-nodata.html">learning about the overall structure</a> of the graph.</p>
<p>These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes.
In the course of our work we have been getting progressively better at constructing this graph efficiently.
In our <a href="http://arxiv.org/abs/1606.08893">latest work</a> we develop a time-optimal algorithm.</p>
<p>Chris’ insight driving this new algorithm is that we shouldn’t be focusing on the trees and checking for pairs of adjacencies, but should rather shift focus to enumerating the potential adjacencies themselves.
These adjacencies can be formalized as structures called <em>agreement forests</em>, which in this case have two components.
If one is clever, and Chris is very clever, one can quickly store these forests and recognize whether they have been seen before.
The strategy then is to move through the trees in an arbitrary sequential order, storing all of the potential adjacencies of each tree.
If a given tree yields the same adjacency as a previous tree, then connect the two trees in the graph.</p>
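<p>The strategy can be sketched in a few lines for small rooted binary trees.
This is a simplified illustration, not the algorithm from the paper: it assumes the input trees are distinct, encodes each tree as nested <code>frozenset</code>s so that child order does not matter, and uses a plain dictionary where the paper uses a trie.
The names <code>node</code>, <code>prunings</code> and <code>spr_subgraph</code> are hypothetical.</p>

```python
from collections import defaultdict


def node(left, right):
    """An internal node; a frozenset makes the child order irrelevant."""
    return frozenset((left, right))


def prunings(tree):
    """Yield every (pruned subtree, remaining tree) pair for a rooted tree.

    Leaves are strings. Each pair plays the role of a two-component forest.
    """
    if isinstance(tree, str):
        return
    left, right = tuple(tree)
    for child, sibling in ((left, right), (right, left)):
        # Prune the whole child: what remains is its sibling.
        yield child, sibling
        # Prune something deeper inside the child.
        for pruned, rest in prunings(child):
            yield pruned, node(rest, sibling)


def spr_subgraph(trees):
    """Return the set of index pairs (i, j), i < j, of SPR-adjacent trees."""
    seen = defaultdict(list)  # forest -> indices of trees that produced it
    edges = set()
    for i, tree in enumerate(trees):
        for forest in set(prunings(tree)):
            for j in seen[forest]:
                edges.add((j, i))
            seen[forest].append(i)
    return edges


t0 = node(node("a", "b"), node("c", "d"))
t1 = node(node(node("a", "b"), "c"), "d")
t2 = node(node("a", "c"), node("b", "d"))
print(sorted(spr_subgraph([t0, t1, t2])))  # [(0, 1), (1, 2)]
```

<p>Each tree is visited once and no pair of trees is ever compared directly, which is the shift in focus described above.</p>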
<p>Although for this paper we obtained an asymptotically time-optimal algorithm, there is still interesting work to be done in order to get a fast implementation.
For example, we could be more thoughtful about exactly how the forests get serialized, which should lead to a faster look up in the central <a href="https://en.wikipedia.org/wiki/Trie">trie</a>.
But not having done any coding at all, much less profiling, we don’t know where the bottlenecks will lie.
This paper was in part motivated by a fun <a href="http://phylobabble.org/t/how-to-recognize-a-rearranged-tree/599">discussion on phylobabble</a>.</p>
Consistency and convergence rate of phylogenetic inference via regularization
http://matsen.fredhutch.org/general/2016/06/05/regularization.html
Sun, 05 Jun 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/06/05/regularization.html<p><img src="https://matsen.fredhutch.org/images/treefix.png" width="470" class="pull-right" /></p>
<p>How frequently are genes <a href="https://en.wikipedia.org/wiki/Horizontal_gene_transfer">transferred horizontally</a>?
A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places.
For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.</p>
<p>Using this approach requires that we have accurate trees for the genes.
That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations which give signal for tree building.
Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.</p>
<p>A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each gene individually, figuring that this is the best we can do with our incomplete data.
However, noisy estimates with our sparse data will definitely bias our estimates of the impact of horizontal gene transfer (HGT) upwards.
That is, if we get trees that are inaccurate in lots of random ways, it’s going to look like a lot of HGT.</p>
<p>We can do better by adding extra information into the inferential problem.
For example, in this case we know that gene evolution is linked with species evolution.
Thus in the absence of evidence to the contrary, it makes sense to assume that the gene tree follows a species tree.
From a statistical perspective, this motivates a <a href="https://en.wikipedia.org/wiki/Shrinkage_estimator">shrinkage estimator</a>, which combines other information with the raw data in order to obtain an estimator with better properties than estimation using the data alone.</p>
<p>One way of doing this is to take a Bayesian approach to the full estimation problem, which might involve priors on gene trees that pull them towards the species tree using a generative model; this approach has been elegantly implemented in phylogenetics by programs such as <a href="http://genome.cshlp.org/content/23/2/323.short">PHYLDOG</a>, <a href="http://mbe.oxfordjournals.org/content/28/1/273.full">SPIMAP</a> and <a href="http://dx.doi.org/10.1093/molbev/msp274">*BEAST</a>.
These programs are principled yet somewhat computationally expensive.</p>
<p>Another direction involves taking some sort of distance between the gene and species tree, and working to trade off a good value of the phylogenetic likelihood versus a good (small) value for this distance.
This distance-based approach works surprisingly well!
A <a href="http://dx.doi.org/10.1093/sysbio/sys076">2013 paper by Wu et al.</a> proposed a method called TreeFix, which they showed performed almost as well as full SPIMAP inference, <em>even in simulations under the SPIMAP generative model!</em>
The cartoon above is from their paper, and illustrates that it makes sense to trade off some likelihood (height) for a lower reconciliation cost (lighter color).</p>
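<p>The trade-off can be written as a generic penalized objective: pick the tree that maximizes log-likelihood minus a weight times the distance to the species tree.
The sketch below is only that generic form, not TreeFix’s actual search procedure, and the candidate names, log-likelihoods and costs are made-up numbers for illustration.</p>

```python
def penalized_choice(candidates, lam):
    """Return the candidate maximizing loglik - lam * cost.

    candidates maps a tree name to a (log-likelihood, cost) pair, where cost
    is some distance between that gene tree and the species tree.
    """
    def score(name):
        loglik, cost = candidates[name]
        return loglik - lam * cost

    return max(candidates, key=score)


# Hypothetical values: the ML tree fits the data best but is far from the
# species tree, while a nearly-as-likely tree is much closer to it.
candidates = {
    "ml_tree": (-1000.0, 6.0),
    "near_species_tree": (-1001.5, 1.0),
    "species_tree": (-1010.0, 0.0),
}

print(penalized_choice(candidates, lam=0.0))   # ml_tree
print(penalized_choice(candidates, lam=1.0))   # near_species_tree
print(penalized_choice(candidates, lam=10.0))  # species_tree
```

<p>With no penalty we recover plain maximum likelihood, and with a very large penalty we simply return the species tree; the interesting regime is in between.</p>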
<p>This definitely got my attention, and made me wonder if one could develop relevant theory, as the theoretical justification for such an approach doesn’t follow “automatically” like it does for a procedure doing inference under a full probabilistic model.
Justification also doesn’t follow from the usual statistical theory for regularized estimators, because trees aren’t your typical statistical objects.
Then, one day <a href="http://vucdinh.github.io/">Vu</a>’s high-school and college friend <a href="https://sites.google.com/site/lamho86/">Lam Si Tung Ho</a> was visiting, and I suggested this problem to them.
<strong>They crushed it.</strong>
What resulted went far beyond what I originally imagined: a manuscript that not only provides a solid theoretical basis for penalized likelihood approaches in phylogenetics, but also develops many useful techniques for theoretical phylogenetics.
We have just put the manuscript <a href="http://arxiv.org/abs/1606.03059">up on arXiv</a>, which develops a likelihood estimator which is regularized in terms of the <a href="http://comet.lehman.cuny.edu/stjohn/research/treespaceReview.pdf">Billera-Holmes-Vogtmann</a> (BHV) distance to a species tree.</p>
<p>First, the main results.
The regularized estimator is “adaptive fast converging,” meaning that it can correctly reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length.
Perhaps more remarkable, though, is that Vu and Lam have explicit bounds on the convergence of the estimator simultaneously in terms of both branch length and topology (via the BHV distance).
This goes beyond the standard theoretical phylogenetics framework of “did we get the right topology or not”.
Surprisingly, the theory and bounds all work even if the species tree estimate is distant from the true gene tree, though of course one gets tighter bounds if it is close to the true gene tree.</p>
<p>Second, the new theoretical tools.</p>
<ul>
<li>a uniform (i.e. not depending on the tree) bound on how far the likelihood of a collection of sequences generated from a model deviates from its expected value</li>
<li>an upper bound on the BHV distance between two trees based on the Kullback-Leibler divergence between their expected per-site likelihood functions</li>
<li>analysis of the asymptotics of the regularization term close to and far from the species tree</li>
</ul>
<p>Of course, I don’t think that biologists will plug in their desired error into our bounds and just sequence an amount of DNA required to achieve that level of error.
That’s absurd.
What I hope is that this paper will add a theoretical aspect to the body of evidence that regularization is a principled method for phylogenetic estimation, and help convince phylogenetic practitioners that raw phylogenetic estimates are inherently limited.
We have all sorts of additional information these days that we can use for phylogenetic inference: let’s use it!
Just ask your local statistician: that community has been impressed by the <a href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator">surprising effectiveness of regularization</a> since the 50’s, and regularization in various forms has become a mainstay of modern statistical inference.</p>
Likelihood-based clustering of B cell clonal families
http://matsen.fredhutch.org/general/2016/04/16/partis-clustering.html
Sat, 16 Apr 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/04/16/partis-clustering.html<p><img src="/images/bcell-mess.png" width="350" class="pull-right" />
Antibodies are encoded by B cell receptor (BCR) sequences, which (simplifying somewhat) arise via a two-stage process.
The first stage is a random recombination process creating so-called naive B cells, which are the common ancestors in the trees to the right.
The second is initiated when such a naive cell (perhaps weakly) binds an antigen, and consists of a mutation and selection process to improve the binding of the BCR to the antigen.
The history of this process can be thought of as a phylogenetic tree descending from these naive common ancestors.
Sequencing the B cell receptors resulting from these processes in high throughput yields an implicit record of these complex processes in a single individual.</p>
<p>The situation from a phylogenetic inference perspective is a mess.
Sampled sequences have typically been mutated substantially from their ancestral naive sequence.
Thus given a pair of sequences sampled from the repertoire, it’s not clear if they share a common ancestral naive sequence, that is, if they even belong in the same tree.
Furthermore, in healthy individuals the size of these trees is very small compared to the number of sequences in the total repertoire.
This makes for a difficult, and very interesting, clustering problem.</p>
<p>Duncan Ralph and I have been working on this problem since he arrived several years ago, and I am very happy to announce that we’ve put up a manuscript describing that work on <a href="http://arxiv.org/abs/1603.08127">arXiv</a>.
This paper builds on our previous work on BCR sequence annotation and alignment <a href="/general/2015/03/23/partis-annotation.html">using a hidden Markov model</a>.
Using this HMM, we can define a likelihood for observing a given set of sequences distributed among a given collection of clusters.
This likelihood integrates over possible alternative annotations, which are formalized as paths through the HMM.</p>
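<p>“Integrating over paths” is what the forward algorithm does for any HMM.
Here is a minimal sketch on a toy two-state model, checked against a brute-force sum over every path.
The states, probabilities and function names are invented for illustration and have nothing to do with the actual partis HMM.</p>

```python
from itertools import product


def forward_likelihood(obs, init, trans, emit):
    """Sum P(obs, path) over all hidden paths using the forward recursion."""
    states = list(init)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def brute_force_likelihood(obs, init, trans, emit):
    """The same quantity, computed by enumerating every path explicitly."""
    states = list(init)
    total = 0.0
    for path in product(states, repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, symbol in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][symbol]
        total += p
    return total


# Toy left-to-right model: start in state "V", possibly switch to "J".
init = {"V": 1.0, "J": 0.0}
trans = {"V": {"V": 0.7, "J": 0.3}, "J": {"V": 0.0, "J": 1.0}}
emit = {
    "V": {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1},
    "J": {"A": 0.1, "C": 0.4, "G": 0.1, "T": 0.4},
}

seq = "AGCT"
print(forward_likelihood(seq, init, trans, emit))
print(brute_force_likelihood(seq, init, trans, emit))
```

<p>The two numbers agree, but the forward recursion takes time linear in the sequence length while the brute-force sum grows exponentially.</p>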
<p>We had this general idea within the first several months of thinking about the problem together.
However, there’s a big difference between writing down an elegant formulation, even one with fast dynamic programming machinery, and actually building a system that scales to data sets of hundreds of thousands or millions of sequences.
This is where Duncan showed a tremendous amount of creativity and persistence by assembling layers of approximations and heuristics on top of this essential idea, and by developing a software package meant for others to use.</p>
<p>The code is available as part of the continuing development of <a href="https://github.com/psathyrella/partis/">partis</a>.
It also includes Duncan’s sophisticated BCR simulation package.</p>
<p>We’re under no illusions that we have “solved” this problem, and there’s still a lot to be done.
However, we believe that the likelihood-based approach in general and Duncan’s code in particular is a substantial advance over current methods, which use single-linkage clustering based on nucleotide edit distance.
If you work with BCR sequences, we hope you’ll give partis a spin and let us know what you think.</p>
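<p>For comparison, the single-linkage baseline mentioned above is easy to sketch.
This version assumes equal-length sequences and uses Hamming distance as a stand-in for a general nucleotide edit distance; the threshold and the function names are made up for illustration.</p>

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs, threshold):
    """Cluster sequences: any pair within `threshold` ends up together."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            same_length = len(seqs[i]) == len(seqs[j])
            if same_length and hamming(seqs[i], seqs[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seq in enumerate(seqs):
        clusters.setdefault(find(i), []).append(seq)
    return sorted(sorted(cluster) for cluster in clusters.values())


seqs = ["AAAA", "AAAT", "AATT", "GGGG"]
print(single_linkage(seqs, threshold=1))
# [['AAAA', 'AAAT', 'AATT'], ['GGGG']]
```

<p>Note that AAAA and AATT land in the same cluster even though they differ at two positions, because AAAT links them.</p>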