Matsen Group: home

18 May 2025 » The next five years are going to be amazing for learning biological processes through probabilistic models

I am looking forward to a future in which AI will greatly reduce friction in designing and implementing models that directly reflect underlying biological knowledge and are directly interpretable. This can be contrasted with the “monolithic” paradigm of AI that works to develop a big black-box model. In that paradigm, one feeds all the data into a large model, and the model makes predictions. One can interpret the monolithic model post-hoc, meaning that one can query the model to try to figure out how it is reasoning. I do not doubt that this monolithic paradigm is going to be interesting and useful for some situations, but it’s not the one that I’m most excited about.

Rather, I love building models that directly reflect underlying biology; the parameters of these models can be directly interpreted in scientific terms. For example, in recent work... (full post)

07 Apr 2025 » Two new approaches to learn about antibody somatic hypermutation

Somatic hypermutation (SHM) is one of the craziest things I’ve ever heard of: we have enzymes that intentionally damage and mutate our DNA! This process is the foundation of the affinity maturation process that enables us to generate high-affinity antibodies against just about any target. Without it, we’d be limited to “just” random recombinants of germline genes.

Our SHM work was motivated by a desire to reconcile statistical analysis of somatic hypermutation with what we know mechanistically from lab experiments. Whenever one can bring reality into modeling there is the potential for a double win: one can learn something about the underlying mechanism, and one can get better inferences by using a model that reflects underlying reality.

My interest was piqued by several fascinating papers. On the computational side there were Spisak et al. 2020 and Zhou and Kleinstein 2020, who found evidence for an effect of absolute position along the sequence in SHM rates. The Spisak paper in particular showed a very irregular pattern of per-site effects that seemed surprising. On the lab... (full post)

03 Apr 2025 » Let's blog!

I love reading about others’ work in publications, but also in a more informal way that ties motivation, results, history, and context together. That’s what I’ve tried to do in this blog, and I’m going to start that again now.

This is an intentional return to Web 1.0.

Scientists have been unpaid content creators on social media platforms that may or may not have their best interests at heart. In contrast, email and blogs are powerful tools that use open protocols.

There is a modern argument against long-form blogs: modern attention spans are insufficient to read a long-format blog post. I hear that and loudly rebel against it. If reading 1300 words strains our attention span, let the effort strengthen our attention muscle. This is a hill I’m willing to die on for the thinking world and for my children.

I’ll announce blog posts on social media, but I would most like to connect via email and RSS. If you would like to join our Google Groups group, see the link below, and if... (full post)

16 Jul 2021 » Postdoc position available: Bayesian phylogenetics in the densely sampled regime

The project

Statistical phylogenetic (evolutionary tree) methods have been essential for understanding the SARS-CoV-2 epidemic, whether for understanding origins, global spread, or lineage dynamics of the virus. These methods are extremely mature, with optimized code and software packages implementing complex models. However, these methods were developed with the “classical” sampling regime in mind: a relatively small number of sequences with relatively large divergences between them.

Methods for the classical sampling regime work to integrate out the uncertainty we have in ancestral sequences. Although the Felsenstein algorithm does allow for efficient calculation and updating of phylogenetic likelihoods, even this is not enough to handle the massive trees we would like to use for SARS-CoV-2. Furthermore, the Felsenstein algorithm only applies to models in which sites are independent and identically distributed (IID).
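To make "integrating out ancestral sequences" concrete, here is a minimal sketch of the Felsenstein pruning recursion. It is an illustration only, not code from any of the packages mentioned: the Jukes-Cantor (JC69) substitution model, the nested-list tree representation, and all function names are assumptions chosen for brevity.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of base i becoming base j over branch length t
    (t in expected substitutions per site). Assumed model, for illustration."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial_likelihoods(node, site):
    """For each base x, the probability of the observed leaf bases below
    `node` at column `site`, given that `node` itself has base x."""
    if isinstance(node, str):  # leaf: an aligned sequence string
        return [1.0 if x == node[site] else 0.0 for x in BASES]
    result = [1.0] * len(BASES)
    for child, t in node:  # internal node: list of (child, branch_length)
        child_l = partial_likelihoods(child, site)
        for a, x in enumerate(BASES):
            # Sum over the unknown base y at the child end of the branch.
            result[a] *= sum(
                jc69_prob(x, y, t) * child_l[b] for b, y in enumerate(BASES)
            )
    return result

def log_likelihood(tree, n_sites):
    """Sum per-site log-likelihoods; the sum over sites is exactly where
    the independence-between-sites assumption enters."""
    total = 0.0
    for site in range(n_sites):
        root = partial_likelihoods(tree, site)
        total += math.log(sum(0.25 * l for l in root))  # uniform root prior
    return total

# Hypothetical three-leaf tree with two alignment columns.
tree = [("AC", 0.1), ([("AC", 0.05), ("AT", 0.05)], 0.1)]
print(log_likelihood(tree, 2))
```

Each internal node sums over its own unknown base once, so the cost grows linearly with the number of nodes rather than exponentially. Even so, every likelihood evaluation touches every node and every site, which is what becomes costly on trees with millions of tips.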

With SARS-CoV-2 we are in a completely different sampling regime, with over 2 million genomes for a virus without very much evolutionary divergence. That means that we frequently sample identical viruses, and we often sequence the direct ancestor of a given virus. This greatly limits the uncertainty... (full post)

29 Mar 2021 » Postdoc position available: variational Bayes phylogenetic inference

The project

Bayesian phylogenetic (evolutionary tree) inference is important for genomic epidemiology and for our understanding of evolution. Trees, along with associated information, are complicated objects of inference, with intertwined discrete (tree structure) and continuous (dates, rates) structure. Random-walk Markov chain Monte Carlo, implemented in packages such as BEAST (~20,000 citations) and MrBayes (>70,000 citations), is currently the only widely applied inference technique.

We have recently developed a rich means of parameterizing tree distributions with a fixed parameter set. This renders them accessible to more modern inference techniques, such as variational Bayes. We have developed a proof-of-concept application of phylogenetic variational Bayes using modern general-purpose gradient estimators. Our collaborative group also has preliminary integrations with both PyTorch and TensorFlow.

To achieve the promise of variational Bayes phylogenetics, we will develop:

structure learning methods that will infer the discrete aspect of our variational approximation
fitting methods that leverage the special structure of our variational phylogenetic models
a modeling framework that integrates with PyTorch, enabling rich models that leverage covariates such as... (full post)

19 Oct 2020 » Life changes

Hi everyone. Through a combination of COVID and the arrival of a second child, I haven’t had time to write about our recent work. I’ll be back to posting at some point, but right now I’m focusing on being a dad and supporting my trainees. Thanks for understanding.

23 Jan 2020 » A Bayesian phylogenetic hidden Markov model for B cell receptor sequences

Summary

antibodies develop within you via an evolutionary process
understanding these evolutionary patterns is important for understanding how we respond to infection and vaccination
we have found using Bayesian methods that evolutionary inferences are uncertain in this regime
our most recent work develops a “Bayesian phylogenetic hidden Markov model,” which takes into account uncertainty in both the V(D)J recombination process and the evolutionary process
this work reveals substantial amino-acid uncertainty in the inference of the unmutated common ancestor of VRC01, an important and heavily-studied anti-HIV antibody
our results are described in a preprint which is now being revised for PLOS Computational Biology

A brief description of antibody affinity maturation

In order to defend against a very large and ever-mutating pool of pathogens, your body randomly generates, and then optimizes, a large collection of antibodies. These antibodies are displayed as so-called B cell receptors on the surface of specialized B cells. The random generation is a process called V(D)J recombination, in which a collection of candidate genes are randomly selected, trimmed... (full post)

Complete list of all posts