- 26 Oct 2016 » Incorporating new sequences into posterior distributions using SMC
- 22 Jul 2016 » Summer high school and undergraduate students 2016
- 11 Jul 2016 » Analysis of a slightly gentler discretization of time-trees
- 30 Jun 2016 » A time-optimal algorithm to build the SPR subgraph on a set of trees
- 05 Jun 2016 » An information-theoretic analysis of phylogenetic regularization
- 18 Apr 2016 » Postdoctoral position to develop next-generation Bayesian phylogenetic methods
- 16 Apr 2016 » Likelihood-based clustering of B cell clonal families
- 23 Feb 2016 » http://B-T.CR, a discussion site for immune repertoire analysis

The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular. In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.

The bummer about Bayesian computation? It takes so darn long for those chains to converge. And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics? *If you get a new sequence you have to start over from scratch.*

I’ve been thinking about this for several years with Aaron Darling, and in particular about Sequential Monte Carlo (SMC) for this application. SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection. You can think about it as a probabilistically...
*(full post)*

I definitely didn’t set out to have three high school and two undergrad students this summer.

But they’re fantastic, and making real contributions to our scientific work! Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s partis software. Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages. Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff. All of them are learning about science and coding.

These students rock, and I can’t wait to see what great things they bring into the world with their talent.

Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference. As such, we and others think a lot about how algorithms move between topologies, typically formalizing this information as a path through a graph representing tree topologies as vertices and edges as moves from one tree to another. Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.

But what about time-trees? Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time. These are absolute times of the sort one could put on a calendar. To first approximation, working with time-trees is...
*(full post)*

We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods. The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another. Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree. The work to understand this graph has been led by Chris Whidden, including learning about how the graph structure influences Bayesian phylogenetic inference and learning about the overall structure of the graph.

These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes. In the course of our work we have been getting progressively better at constructing this graph efficiently. In our latest work we develop a time-optimal algorithm.

Chris’...
*(full post)*

How frequently are genes transferred horizontally? A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places. For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.

Using this approach requires that we have accurate trees for the genes. That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations which give signal for tree building. Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.

A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each...
*(full post)*

*Although we have recently made a hire in this area, we continue to look for strong junior scientists to work on this and related projects.*

There is a lot more sequence data than in the early 2000’s, but inferential algorithms for Bayesian phylogenetic inference haven’t changed much since that time. There have definitely been advances, such as more clever proposal distributions, swapping out heated chains, and GPU-enabled likelihood calculations, but the core remains the same: propose a new state via a small branch length and/or tree structure perturbation, and accept or reject according to the Metropolis choice. Furthermore, the community has packed more and more complexity into priors and models, which would lead to a computational bottleneck even with a fixed number of sequences.

It’s time to improve inferential algorithms. Part of my inspiration lies in watching the revolution that has occurred in computational statistics in the last decade, in which the menu has...
*(full post)*

Antibodies are encoded by B cell receptor (BCR) sequences, which (simplifying somewhat) arise via a two-stage process. The first stage is a random recombination process creating a so-called naive B cell, which are the common ancestors in the trees to the right. The second is initiated when such a naive cell (perhaps weakly) binds an antigen, and consists of a mutation and selection process to improve the binding of the BCR to the antigen. The history of this process can be thought of as a phylogenetic tree descending from these naive common ancestors. One can sequence the B cell receptors resulting from these processes in high throughput, which form an implicit record of these complex processes in a single individual.

The situation from a phylogenetic inference perspective is a mess. Sampled sequences have typically been mutated substantially from their ancestral naive sequence. Thus given a pair of sequences sampled from the repertoire, it’s not clear...
*(full post)*

In my dream academic utopia, there would be free and open discussion and information sharing. I love open-access journals and preprint servers, but some communication is better suited for a discussion format. For example open problems, information about resources and software, and discussion of standardization all benefit from having a more widely open discussion and a faster update time than journals or even preprint servers can provide. Social media can host this discussion, but I think that it can be useful to put a dedicated website in place for discussion.

For these reasons, together with Simon Frost I’ve started a Discourse forum at http://B-T.CR/ for immune repertoire analysis. If you aren’t already aware, the specificities of our adaptive immune cells are encoded in the sequences of the B and T cell receptors (BCRs and TCRs, hence the name), and these can now be sequenced in high throughput. The result is a fascinating...
*(full post)*

all posts