- 05 Jun 2016 » An information-theoretic analysis of phylogenetic regularization
- 18 Apr 2016 » Postdoctoral position to develop next-generation Bayesian phylogenetic methods
- 16 Apr 2016 » Likelihood-based clustering of B cell clonal families
- 23 Feb 2016 » http://B-T.CR, a discussion site for immune repertoire analysis
- 25 Nov 2015 » New results on the subtree prune regraft distance for unrooted trees
- 01 Sep 2015 » Postdoctoral position to study molecular evolution and phylogenetics of immune cells
- 27 Aug 2015 » High school students 2015
- 22 Jul 2015 » Tanglegrams!

How frequently are genes transferred horizontally? A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places. For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.

Using this approach requires accurate trees for the genes. That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations that provide signal for tree building. Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.

A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each...
*(full post)*

There is far more sequence data available now than in the early 2000s, but the algorithms for Bayesian phylogenetic inference haven’t changed much since that time. There have definitely been advances, such as more clever proposal distributions, state swapping between heated chains, and GPU-enabled likelihood calculations, but the core remains the same: propose a new state via a small branch-length and/or tree-topology perturbation, and accept or reject according to the Metropolis criterion. Furthermore, the community has packed more and more complexity into priors and models, which creates a computational bottleneck even with a fixed number of sequences.
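To make that core loop concrete, here is a minimal sketch of one Metropolis update. A single branch length with a made-up log posterior stands in for a full tree state; the function names and the toy model are illustrative assumptions, not any particular package’s API.

```python
import math
import random

def metropolis_step(state, log_post, proposal_sd=0.1, rng=random):
    """One Metropolis update: perturb the state, then accept or reject.

    A toy stand-in for a phylogenetic branch-length move: `state` is a
    single branch length and `log_post` its unnormalized log posterior.
    """
    proposed = state + rng.gauss(0.0, proposal_sd)
    if proposed <= 0:                 # branch lengths must stay positive;
        return state                  # out-of-support proposals are rejected
    log_ratio = log_post(proposed) - log_post(state)
    if math.log(rng.random()) < log_ratio:
        return proposed               # accept the perturbed state
    return state                      # reject: stay where we are

def log_post(t, rate=1.0, n_mut=3, n_sites=100):
    """Toy unnormalized log posterior: a Poisson-style likelihood for
    n_mut mutations over n_sites sites, times an exponential prior."""
    return n_mut * math.log(t) - n_sites * t - rate * t

random.seed(1)
chain = [0.5]
for _ in range(2000):
    chain.append(metropolis_step(chain[-1], log_post))
```

Real samplers differ in the move set and bookkeeping, but every iteration reduces to this propose/accept-or-reject skeleton.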

It’s time to improve inferential algorithms. Part of my inspiration lies in watching the revolution that has occurred in computational statistics in the last decade, in which the menu has greatly expanded beyond MCMC to include sequential Monte Carlo (SMC), Hamiltonian Monte Carlo (HMC), and variational inference, to name a few. These algorithms are making...
*(full post)*

Antibodies are encoded by B cell receptor (BCR) sequences, which (simplifying somewhat) arise via a two-stage process. The first stage is a random recombination process that creates so-called naive B cells, which are the common ancestors in the trees to the right. The second stage is initiated when such a naive cell binds an antigen (perhaps weakly), and consists of a mutation and selection process that improves the binding of the BCR to the antigen. The history of this process can be thought of as a phylogenetic tree descending from these naive common ancestors. One can sequence the resulting B cell receptors in high throughput, and these sequences form an implicit record of this complex process within a single individual.

The situation from a phylogenetic inference perspective is a mess. Sampled sequences have typically been mutated substantially from their ancestral naive sequence. Thus given a pair of sequences sampled from the repertoire, it’s not clear...
*(full post)*

In my dream academic utopia, there would be free and open discussion and information sharing. I love open-access journals and preprint servers, but some communication is better suited to a discussion format. For example, open problems, information about resources and software, and discussion of standardization all benefit from a more open forum and a faster update cycle than journals or even preprint servers can provide. Social media can host this discussion, but I think it can be useful to have a dedicated website in place for it.

For these reasons, together with Simon Frost I’ve started a Discourse forum at http://B-T.CR/ for immune repertoire analysis. If you aren’t already aware, the specificities of our adaptive immune cells are encoded in the sequences of the B and T cell receptors (BCRs and TCRs, hence the name), and these can now be sequenced in high throughput. The result is a fascinating...
*(full post)*

Chris Whidden and I have been working to understand the properties of phylogenetic Markov chain Monte Carlo (MCMC) by studying the graph formed by all phylogenetic trees (as vertices) and tree rearrangements (as edges). These tree rearrangements are ways of modifying one tree to obtain another. For example, here we have a picture of modifying a tree via a so-called unrooted SPR move.

It’s natural to ask for the shortest path in this graph between two vertices, i.e. the number of unrooted SPR moves required to turn one tree into the other. This number is called the unrooted SPR (uSPR) distance. It turns out that the problem is hard in general but fixed-parameter tractable, meaning that the complexity isn’t too bad for pairs of trees that aren’t too different from one another. The current best algorithm for unrooted SPR distance cannot compute distances larger than 7, or reliably compare...
*(full post)*
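The shortest-path framing can be phrased as breadth-first search over an implicitly defined graph. Below is a minimal generic sketch; since enumerating true uSPR neighbors is itself nontrivial, a toy move set (adjacent transpositions on tuples) stands in for SPR moves, purely for illustration.

```python
from collections import deque

def bfs_distance(start, goal, neighbors):
    """Shortest path between two vertices of an implicitly defined graph.

    In the SPR setting, vertices would be trees and `neighbors` would
    enumerate all trees one SPR move away; here it is fully generic.
    """
    if start == goal:
        return 0
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        state, d = frontier.popleft()
        for nxt in neighbors(state):
            if nxt == goal:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None  # goal unreachable from start

def swap_neighbors(perm):
    """Toy move set: adjacent transpositions (a stand-in for SPR moves)."""
    for i in range(len(perm) - 1):
        p = list(perm)
        p[i], p[i + 1] = p[i + 1], p[i]
        yield tuple(p)

# Reversing (1,2,3) requires three adjacent swaps.
assert bfs_distance((1, 2, 3), (3, 2, 1), swap_neighbors) == 3
```

The catch, of course, is that the phylogenetic move graph is astronomically large, which is exactly why specialized fixed-parameter algorithms are needed rather than naive search.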

*Although we have recently made a hire in this area, we continue to look for good people to work on this and related projects.*

Our adaptive immune systems continually update themselves to neutralize and destroy pathogens. The receptor sequences of antibody-making B cells undergo a Darwinian process of mutation and selection which improves their binding to antigen.^{1} It is now possible to sequence these B cell receptors (BCRs) in high throughput, giving a profound new perspective on how the immune system responds to infection.^{2} Although the elements of B cell affinity maturation (recombination, point mutation, and selection) are the same as those of molecular evolution in other settings, there are many important differences. These differences, along with the volume of sequence data available, bring new challenges for phylogenetics and molecular evolution.

The translational medical consequences of improved methods will be significant. Improved methods will especially help in understanding the...
*(full post)*

This summer we had two high school students, Andrew and Kate. They were quite sharp, so we threw them in the deep end with some real science projects. Andrew taught himself shell programming and Docker, and learned enough B cell analysis to work on making Bioboxes for validating B cell sequence analysis software. Kate taught herself shell programming and Python, and picked up some undergraduate abstract algebra to investigate the SPR graph by characterizing its distribution of pairwise distances. They were both very independent, and a pleasure to have in the office!

Say we care about a function on pairs of trees (such as subtree-prune-regraft distance) that doesn’t make reference to the labels as such, but simply uses them as markers to ensure that the leaves end up in the right place. We’d like to calculate this function for all trees of a certain type. However, doing so for *every* pair of labeled trees is a waste, because if we just relabel the two trees in the same way, we will get the same result.

So, how many computations do we actually need to do?

It turns out that we only need to do one calculation per *tanglegram*. A tanglegram is a pair of trees along with a bijection between the leaves of those trees. They have been investigated in coevolutionary analyses before, and there’s a considerable literature concerning how to draw them in the plane with the minimal number of crossings. However, the symmetries and number of...
*(full post)*
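The invariance that makes this counting work can be checked directly on a toy example: a label-agnostic function of a tree pair (here, the number of shared splits, a cousin of Robinson-Foulds distance) is unchanged when both trees are relabeled by the same bijection. The split encoding below is an illustrative assumption, not any standard library’s representation.

```python
from itertools import permutations

LEAVES = frozenset(range(1, 6))

def split(side):
    """Canonical bipartition of the leaves: the set {side, complement}."""
    side = frozenset(side)
    return frozenset({side, LEAVES - side})

def relabel(tree, mapping):
    """Apply one leaf relabeling to a tree given as a set of splits."""
    return frozenset(
        frozenset(frozenset(mapping[x] for x in side) for side in sp)
        for sp in tree
    )

def shared_splits(t1, t2):
    """A label-agnostic pair function: count bipartitions in common."""
    return len(t1 & t2)

# Two unrooted 5-leaf trees, each encoded by its nontrivial splits.
t1 = frozenset({split({1, 2}), split({1, 2, 3})})
t2 = frozenset({split({1, 2}), split({3, 4})})

# Relabeling BOTH trees by the same bijection never changes the value,
# so one computation per tanglegram covers every jointly relabeled pair.
base = shared_splits(t1, t2)
for perm in permutations(sorted(LEAVES)):
    m = dict(zip(sorted(LEAVES), perm))
    assert shared_splits(relabel(t1, m), relabel(t2, m)) == base
```

The loop above runs over all 120 joint relabelings and finds the same value each time, which is exactly why tabulating one value per tanglegram, rather than per labeled pair, suffices.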

all posts