Matsen group: general
http://matsen.fredhutch.org/rss/general.rss
general newsfeed
en-us
Fri, 17 Nov 2017 04:20:18 PST
Survival analysis of DNA mutation motifs with penalized proportional hazards
http://matsen.fredhutch.org/general/2017/11/14/motif.html
Tue, 14 Nov 2017 00:00:00 PST
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="http://matsen.fredhutch.org/images/motif-samm-example.png" width="350" class="pull-right" /></p>
<p>We are equipped with purpose-built molecular machinery to mutate our genome so that we can become immune to pathogens. This is truly a thing of wonder.</p>
<p>More specifically, I’m talking about mutations in B cells, the cells that make antibodies.
Once a randomly-generated antibody expressed on the outside of the B cell finds something it’s good at binding, the cell boosts the mutation rate of its antibody-coding region by about one million fold.
Those that have better binding are rewarded by stimulation to divide further.
The result of this Darwinian mutation and selection process is antibodies with improved binding properties.</p>
<p>The mutation process is wonderfully complex and interesting. Being statisticians, we paid the highest tribute we can to a process we think is beautiful: we developed a statistical model of it. This work was led by the dynamic duo of <a href="http://students.washington.edu/jeanfeng/">Jean Feng</a> and <a href="https://github.com/dawahs">David Shaw</a>, while <a href="http://www.stat.washington.edu/vminin/">Vladimir Minin</a>, <a href="http://faculty.washington.edu/nrsimon/">Noah Simon</a> and I kibitzed.
Our model is known in statistics as a type of <a href="https://en.wikipedia.org/wiki/Proportional_hazards_model">proportional hazards model</a>. These models were introduced in Sir David Cox’s paper <a href="https://www.jstor.org/stable/2985181"><em>Regression Models and Life-Tables</em></a>, which, with over 4600 citations, is <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.231.5042&rep=rep1&type=pdf">the second most cited paper in statistics</a>.</p>
<p>These models are typically used to infer rates of failure, such as that of humans getting disease.
During our life span we get a sequence of diseases, some of which predispose us to other diseases.
By considering sequences of diseases across many individuals, we can use these proportional hazards models to infer the rate of getting various diseases given disease history.</p>
<p>There is an analogous situation for B cell sequences in that the mutation process depends significantly on the identity of the nearby bases.
We can observe lots of mutated sequences, and do a similar sort of inference: when a position mutates, it changes the mutability of nearby bases.
Unfortunately we don’t know the order in which the mutations occurred, and thus don’t know what sequences had increased mutability, so we have to do Gibbs sampling over orders.
We have just posted our preprint describing the methods and some results to <a href="https://arxiv.org/abs/1711.04057">arXiv</a>.</p>
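<p>To make the role of the mutation order concrete, here is a minimal sketch of the log-probability of one <em>known</em> order under a proportional hazards model. It is not the code or the exact model from the preprint: it assumes each position mutates at most once, uses a 3-mer centered on each position as the only feature, and ignores which base is substituted. The names <code>motif_at</code>, <code>order_log_likelihood</code> and <code>theta</code> are hypothetical.</p>

```python
import math

def motif_at(seq, i):
    """Return the 3-mer centered at position i, padding the ends with 'N'."""
    padded = "N" + seq + "N"
    return padded[i:i + 3]

def order_log_likelihood(start, end, order, theta):
    """Log-probability of mutating `start` into `end` in the given position order.

    theta maps a 3-mer motif to a log hazard; motifs not listed get log hazard 0.
    Only the choice of which position mutates next is modeled (an assumption).
    """
    seq = list(start)
    unmutated = set(range(len(start)))
    loglik = 0.0
    for pos in order:
        current = "".join(seq)
        # Hazard of every position still at risk, given the current context.
        log_hazards = {i: theta.get(motif_at(current, i), 0.0) for i in unmutated}
        log_total = math.log(sum(math.exp(h) for h in log_hazards.values()))
        loglik += log_hazards[pos] - log_total
        # Apply the mutation; this changes the motifs seen by neighboring positions.
        seq[pos] = end[pos]
        unmutated.remove(pos)
    return loglik
```

<p>Because each mutation changes the motifs of its neighbors, different orders of the same set of mutations get different probabilities, which is why the unknown order has to be sampled rather than ignored.</p>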
<p>We were inspired by the very nice <a href="http://dx.doi.org/10.3389/fimmu.2013.00358">work</a> of the <a href="http://medicine.yale.edu/lab/kleinstein/">Kleinstein lab</a> developing similar sorts of models using simpler methods.
However, we wanted a more flexible modeling framework and for the complexity of the models to automatically scale to the signal in the data, which we did using penalization with the LASSO.
What you see in the figure above is how we can set up a hierarchical model with a penalty that zeroes out 5-mer terms when they don’t contribute anything above the corresponding 3-mer term. When the last base is unimportant this gives the block-like structure, while when the first base is unimportant it gives the 4-fold repetitive pattern you can see when zooming out.
We are also indebted to Steve and his team, especially Jason Vander Heiden, for supplying us with sequence data.
They are a class act.</p>
<p>There’s a lot of interest in context-sensitive mutation processes these days, such as <a href="https://elifesciences.org/articles/24284">Kelly Harris’ work</a> on how we can watch context-sensitive mutabilities change through evolutionary time, and <a href="https://doi.org/10.1038/nature12477">Ludmil Alexandrov’s work</a> on mutation processes in cancer.
In both of these cases, they are in the process of transitioning from a statistical description of these processes to linking them with specific mutagens and repair processes.</p>
<p>Here too we would like to use statistics to learn more about the mechanisms behind these context-sensitive mutations.
What’s neat about the framework that Jean and David developed is that now we can design features that correspond to specific mechanistic hypotheses and test how much they impact mutation rates.
Stay tuned!</p>
Using genotype abundance to improve phylogenetic inference
http://matsen.fredhutch.org/general/2017/09/05/gctree-phylo.html
Tue, 05 Sep 2017 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="http://matsen.fredhutch.org/images/gctree-phylo.png" width="350" class="pull-right" /></p>
<p>When doing computational biology, listen to biologists.
I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.</p>
<p>In this particular case, the starting point was the <a href="http://dx.doi.org/10.1126/science.aad3439">stunningly beautiful work of Gabriel Victora’s lab</a> visualizing germinal center dynamics in living mice.
For those not yet initiated into the beauty of B cell repertoires, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binders reproduce more.
As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts.
Such single-cell sequencing, and consequent abundance information, is now becoming commonplace.
<em>How should we use this abundance information in phylogenetics?</em></p>
<p>Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered.
Indeed, they were building trees by hand, using several criteria about what makes for a believable evolutionary scenario.
One of their intuitions was that <em>more abundant genotypes have more opportunity to leave mutant descendants</em>.
Therefore, when we are doing inference, we should prefer trees that attach branches to frequently observed genotypes over trees that attach them to less frequently observed genotypes (see picture, in which the frequency of a given genotype is the number inside the circle; we call this structure a genotype collapsed tree or <em>GCtree</em>).</p>
<p>To have an objective computational method we need to formalize this intuition.
Will DeWitt, Vladimir Minin, and I formulated it in terms of an “infinite type” branching process, in which every mutation creates a new type.
We can augment existing sequence-based optimality criteria with the likelihood of the tree under our branching process model.
In our case we decided to show that this works by ranking maximum-parsimony trees (there are often many equally parsimonious trees).
Parsimony is in wide use in the B cell analysis community because it is a defensible choice when sampling is dense relative to mutations (as in the case of germinal centers), and it allows inference of zero branch lengths (leading to inference of sampled ancestral genotypes and multifurcations).
We showed under simulation that more highly ranked trees were more correct than lower ranked trees.
With the paired heavy and light chain data from the Victora lab, we were also able to do a biological validation by showing that trees that should be the same are more similar when using our algorithm than without.
The result is now <a href="http://arxiv.org/abs/1708.08944">up on arXiv</a>.</p>
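<p>The abundance intuition can be illustrated with a toy score for ranking equally parsimonious trees. This is a deliberately simplified stand-in and <em>not</em> the branching process likelihood from the paper: it assumes the number of mutant children of a genotype is Poisson with mean proportional to its abundance. The rate <code>lam</code>, the dictionary encoding of a tree, and the function names are all hypothetical.</p>

```python
import math

def toy_log_score(tree, lam=0.5):
    """Sum of Poisson log-probabilities of each genotype's mutant-child count.

    tree: dict mapping genotype name -> (abundance, list of mutant children).
    Assumes children ~ Poisson(lam * abundance); for illustration only.
    """
    score = 0.0
    for abundance, children in tree.values():
        # Unobserved (abundance 0) ancestors are treated as abundance 1 (assumption).
        mean = lam * max(abundance, 1)
        k = len(children)
        score += k * math.log(mean) - mean - math.lgamma(k + 1)
    return score

def rank_trees(trees, lam=0.5):
    """Return candidate trees sorted from best to worst toy score."""
    return sorted(trees, key=lambda t: toy_log_score(t, lam), reverse=True)
```

<p>Under this toy score, a tree that hangs three mutants off a genotype observed ten times outranks one that chains the same mutants off genotypes observed once each.</p>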
<p>If you are muttering to yourself that we should be using this model as a prior for a Bayesian analysis, we hear you.
Hopefully this motivates additional work in that sphere for abundance-based models.
We do note that the limited amount of mutation described above will lead to a fairly flat posterior.
Furthermore, although one can infer sampled ancestors using an <a href="http://dx.doi.org/10.1371/journal.pcbi.1003919">RJMCMC</a> and multifurcations using <a href="http://dx.doi.org/10.1093/sysbio/syu132">phycas</a>, these two features do not exist yet under one roof.</p>
<p>Will did a great job with this project, which is a nice complement to his existing publications as he heads into the UW Genome Sciences PhD program!
We had a great time working with Luka and Gabriel, and look forward to more collaboration in the future.</p>
Probabilistic Path Hamiltonian Monte Carlo
http://matsen.fredhutch.org/general/2017/06/26/pphmc.html
Mon, 26 Jun 2017 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="http://matsen.fredhutch.org/images/hmc-tub.png" width="350" class="pull-right" /></p>
<p>Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions.
HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state.
These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.</p>
<p>Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down.
When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction.
We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal.
Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the Hamiltonian in HMC).
The acceptance ratio is determined only by how well our numerical integration performs: better numerical integration means a higher acceptance probability.</p>
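<p>For readers who like to see the soap slide, here is a minimal sketch of a single classical HMC step on a one-dimensional posterior, using the standard leapfrog integrator. It is plain textbook HMC, not the phylogenetic sampler discussed below; the function name <code>hmc_step</code> and its default step size and trajectory length are illustrative choices.</p>

```python
import math
import random

def hmc_step(x, log_post, grad_log_post, step_size=0.1, n_steps=20, rng=random):
    """One HMC proposal plus accept/reject step for a 1-D target."""
    p0 = rng.gauss(0.0, 1.0)  # random initial momentum
    x_new, p = x, p0
    # Leapfrog integration of Hamiltonian dynamics with potential U = -log_post.
    p += 0.5 * step_size * grad_log_post(x_new)
    for i in range(n_steps):
        x_new += step_size * p
        if i < n_steps - 1:
            p += step_size * grad_log_post(x_new)
    p += 0.5 * step_size * grad_log_post(x_new)
    # Total energy H = potential + kinetic; exact integration would conserve it.
    h_old = -log_post(x) + 0.5 * p0 * p0
    h_new = -log_post(x_new) + 0.5 * p * p
    accept_prob = math.exp(min(0.0, h_old - h_new))
    if rng.random() < accept_prob:
        return x_new, accept_prob
    return x, accept_prob
```

<p>The acceptance probability depends only on the energy error <code>h_new - h_old</code>, so shrinking <code>step_size</code> pushes it towards one at the cost of more gradient evaluations per proposal.</p>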
<p>Vu had been noodling around with phylogenetic HMC when we heard that Arman Bilge (at that time in Auckland) had an implementation as well.
These implementations not only moved through branch length space according to usual HMC, but also moved between topologies.
They did this as follows: once a branch length hits zero, leading to four branches joined at a single node, one can regroup those branches together randomly in another configuration and continue.
(If you are a phylogenetics person, this is a random nearest-neighbor interchange around the zero length branch.)
This randomness, which is rather different from the deterministic paths of classical HMC once a momentum is drawn, is why we call the algorithm Probabilistic Path Hamiltonian Monte Carlo (PPHMC).</p>
<p>The primary challenge in theoretical development is that the PPHMC paths are no longer deterministic.
Thus concepts such as reversibility and volume preservation, which are typical components of correctness proofs for HMC, need to be generalized to probabilistic equivalents.
Vu had to work pretty hard to develop these elements and show that they led to ergodicity.</p>
<p>On the implementation front, Arman was also working hard to build an efficient sampler.
However, the HMC integrator had difficulty going from one tree topology to another without incurring substantial error.
We thrashed around for a while trying to improve things with a “careful” integrator that would find the crossing time and perhaps re-calculate gradients at that time, but proving that such a method would work seemed very hard.</p>
<p>Then, magically, our newest postdoc Cheng Zhang showed up and saved us with a smoothing surrogate function.
This surrogate exchanges the discontinuity in the derivative for discontinuity in the potential energy, but we can deal with that using a “refraction” method introduced by Afshar and Domke in 2015.
This approach allows us to maintain a low error, and thus make very long trajectories with a high acceptance rate.</p>
<p>I’m happy to announce that our <a href="https://arxiv.org/abs/1702.07814">manuscript</a> has been accepted to the 2017 International Conference on Machine Learning.
Practically speaking, this work is definitely a proof of concept.
We have taken an algorithm that was previously only defined for smooth spaces and extended it to orthant complexes, which are basically Euclidean spaces with boundary glued along those boundaries in intricate ways.
Our implementation is not fully optimized, but even if it were I’m not sure that it would out-compete good old MCMC for phylogenetics without some additional tricks.</p>
<p>To know if this flavor of sampler is going to be useful, we really need to better understand what I call the <em>local to global</em> question in phylogenetics.
That is, to what extent does local information tell us about where to modify the tree?
This is straightforward for posteriors on Euclidean spaces: the gradient points us towards a maximum.
But for trees, does the gradient (local information) tell us anything about what parts of the tree should be modified (global information)?
We’ll be thinking about this a lot in the coming months!</p>
Smart proposals win for online phylogenetics using sequential Monte Carlo.
http://matsen.fredhutch.org/general/2017/06/07/sts.html
Wed, 07 Jun 2017 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="http://matsen.fredhutch.org/images/sts-proposal.png" width="350" class="pull-right" /></p>
<p>Sometimes projects take years to bear fruit.</p>
<p>As I described <a href="/general/2016/10/26/smc-theory.html">previously</a>, <a href="http://darlinglab.org/">Aaron Darling</a> and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for <em>online</em> Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences.
In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting.
The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”.
These two projects have just recently resulted in intertwined submitted papers.</p>
<p>The SMC implementation, which we called <code>sts</code> for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it.
One of the things that kept us from wrapping it up was the problem of <em>particle degeneracy</em>, which works as follows.
I think of SMC as a probabilistically correct version of evolutionary computation, in which trees get to reproduce if they have a high posterior.
Every time we have a new sequence, we attach it to our existing population of trees via some proposal distribution suggesting where to add it.
Particle degeneracy, then, means that you obtain a low diversity population due to a few trees “taking over” the population because they are significantly better than the rest.</p>
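<p>Particle degeneracy is usually diagnosed with the effective sample size (ESS) of the particle weights. The following is a minimal generic sketch, not code from <code>sts</code>: it normalizes log-weights, computes the standard ESS estimate 1 / Σ w<sub>i</sub><sup>2</sup>, and performs multinomial resampling. The function names are hypothetical.</p>

```python
import math
import random

def normalize(log_weights):
    """Turn log-weights into normalized weights, guarding against overflow."""
    m = max(log_weights)
    w = [math.exp(lw - m) for lw in log_weights]
    total = sum(w)
    return [wi / total for wi in w]

def effective_sample_size(weights):
    """ESS estimate for normalized weights: 1 / sum(w_i^2)."""
    return 1.0 / sum(w * w for w in weights)

def resample(particles, weights, rng=random):
    """Multinomial resampling: draw a new, equally weighted population."""
    return rng.choices(particles, weights=weights, k=len(particles))
```

<p>With equal weights the ESS equals the number of particles; when one tree dominates, the ESS drops towards one and resampling fills the population with copies of that tree.</p>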
<p>In working with <code>sts</code>, we found that we could mitigate the effect of particle degeneracy by developing “smart” proposals that use the data and corresponding likelihood function to decide where to try next.
This is in contrast to previous work by <a href="http://dx.doi.org/10.1093/sysbio/syr131">Bouchard-Côté et al 2012</a>, which shows that for a different formulation of SMC based on subtree merging one does better with simple proposals (compare <a href="http://papers.nips.cc/paper/3266-bayesian-agglomerative-clustering-with-coalescents">Teh et al. 2008</a>).
Our proposals have to decide which edge to attach to, where along the edge to attach, and what branch length to use for the attachment (see picture above).</p>
<p>Although the group put in hard work, there was a lot more needed to put all this together and actually get a working sampler.
Luckily, Aaron recruited a super sharp and motivated postdoc in <a href="https://au.linkedin.com/in/mfourment">Mathieu Fourment</a>.
Mathieu tried a lot of variants of proposal distributions, showed that “heated” proposal distributions were effective for picking attachment branches, and added a parsimony-based proposal.
He also showed that a high effective sample size is necessary but not sufficient to have a good SMC posterior sample, and that we can develop a time-competitive sampler compared to running MrBayes again and again.
That work is now up <a href="http://biorxiv.org/content/early/2017/06/02/145219">on bioRxiv</a>.</p>
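<p>One way to picture a “heated” proposal for the attachment edge is sketched below. The exact form of heating used in the paper is not spelled out here, so this assumes the common convention of raising each edge’s likelihood to a power <code>beta</code> between 0 and 1, which flattens the distribution. The function names and the default <code>beta</code> are hypothetical.</p>

```python
import math
import random

def heated_edge_probabilities(edge_log_likelihoods, beta=0.5):
    """Probabilities proportional to likelihood**beta over candidate edges."""
    m = max(edge_log_likelihoods)
    w = [math.exp(beta * (ll - m)) for ll in edge_log_likelihoods]
    total = sum(w)
    return [wi / total for wi in w]

def propose_edge(edge_log_likelihoods, beta=0.5, rng=random):
    """Sample an edge index and return it with its proposal probability."""
    probs = heated_edge_probabilities(edge_log_likelihoods, beta)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return idx, probs[idx]
```

<p>Setting <code>beta=0</code> recovers a uniform proposal and <code>beta=1</code> proposes in proportion to the likelihood. The proposal probability is returned alongside the index because an SMC sampler needs it to compute the particle weight.</p>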
<p>Smart proposals need to be designed carefully.
One of the bottlenecks was proposing the new pendant branch length, and for that the lcfit surrogate function (the subject of Connor’s talk at Evolution 2013) worked well.
This spurred us to finish and write up the lcfit work, despite the fact that <a href="http://dx.doi.org/10.1093/sysbio/syv051">Aberer et al. 2015</a> wrote a paper in which they describe how to use standard probability distribution functions as surrogate functions for branch length proposals.
This took a bit of wind out of our sails, but our purpose-built lcfit surrogate still has some interesting advantages over common functions, such as having the right “shape” for likelihood curves, even when branch lengths become long.
We were surprised to find that it does well for complex settings with heterogeneous models.
On a practical level, it’s implemented as a stand-alone C library so it can be easily incorporated into other programs, in contrast to the work of Aberer and co, which is tightly integrated into ExaBayes.
In the end we have turned it into a short paper with a long appendix which is now <a href="https://arxiv.org/abs/1706.00659">up on arXiv</a>.
Brian Claywell is the hero of this lcfit work: he was persistent in developing creative strategies to fit the surrogate function in a variety of settings.</p>
<p>There is a lot to be done for online Bayesian phylogenetics still.
We don’t even try to sample model parameters, and haven’t tried making BEASTly rooted trees.
There are many opportunities for optimization, some of which are described in the paper.
For the parallelism nerds out there, I also note the existence of the <a href="https://arxiv.org/abs/1407.2864">particle cascade</a>.</p>
<p>I certainly hope that this contribution guides future developers towards useful online Bayesian phylogenetics samplers: it would be great to be able to keep trees up to date as the genomes keep rolling in from projects such as <a href="http://www.zibraproject.org/">ZIBRA</a>.
If you are interested in more details, get in touch!</p>
Incorporating new sequences into posterior distributions using SMC
http://matsen.fredhutch.org/general/2016/10/26/smc-theory.html
Wed, 26 Oct 2016 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="http://matsen.fredhutch.org/images/smc.png" width="350" class="pull-right" />
The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular.
In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.</p>
<p>The bummer about Bayesian computation?
It takes so darn long for those chains to converge.
And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics?
<em>If you get a new sequence you have to start over from scratch.</em></p>
<p>I’ve been thinking about this for several years with <a href="http://darlinglab.org/">Aaron Darling</a>, and in particular about Sequential Monte Carlo (SMC) for this application.
SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection.
You can think about it as a probabilistically correct genetic algorithm: one that is guaranteed to sample the correct posterior distribution given an infinite number of particles.</p>
<p>Although <a href="http://sysbio.oxfordjournals.org/content/61/4/579">SMC has been applied before in phylogenetics</a>, it has not been used in an “online” setting to update posterior distributions as new sequences appear.
Aaron and I worked up a proof-of-concept implementation of SMC with Connor McCoy, but when Connor left for Google the implementation lost some steam.</p>
<p>However, when Vu arrived I was still curious about the theory behind our little SMC implementation.
Such SMC algorithms raise some really interesting mathematical questions concerning how the phylogenetic likelihood surface changes as new sequences are added to the data set.
If it changes radically, then the prospects for SMC are dim.
We call this the <em>subtree optimality question</em>: do high-likelihood trees on <em>n</em> taxa arise from attaching branches to high-likelihood trees on <em>n-1</em> taxa?
Years ago I considered a <a href="http://dx.doi.org/10.1007/s11538-010-9556-x">similar question with Angie Cueto</a> but that was for the distance-based objective function behind neighbor-joining, and others have thought about it from the empirical side under the guise of taxon sampling.</p>
<p>As described in an <a href="https://arxiv.org/abs/1610.08148">arXiv preprint</a>, Vu developed a theory with some real surprises!
First, an induction-y proof leads to consistency: as the number of particles goes to infinity, we maintain a correct posterior at every stage.
Then he directly took on the subtree optimality problem by writing out the relevant likelihoods and using bounds.
(We think that this is the first theoretical result for likelihood on this question.)</p>
<p>Then the big win: using these bounds on the ratio of likelihoods for the parent and child particles, he was able to show that the effective sample size (ESS) of the sampler is bounded below by a constant multiple of the number of particles.
This is pretty neat for two reasons: first, it’s good to know that we are getting a better posterior estimate as we increase our computational effort, and it’s nice that the ESS goes up linearly with this effort.
Second, this constant doesn’t depend on the size of the tree, so this bodes well for building big trees by incrementally adding taxa.</p>
<p>Of course, with this sort of theory we can’t get a reasonable estimate on the size of this key constant, and indeed it could be uselessly small.
However, I’m still encouraged by these results, and the paper points to some interesting directions.
For example, because SMC is continually maintaining an estimate of the posterior distribution, one can mix in MCMC moves in ways that otherwise would violate detailed balance, such as using an MCMC transition kernel that focuses effort around the newly added edge.
In this way we might use an “SMC” algorithm with relatively few particles which in practice resembles a principled highly parallel MCMC.
On the other hand we might use <a href="http://jmlr.org/proceedings/papers/v32/jun14.pdf">clever tricks</a> to scale the SMC component up to zillions of particles.</p>
<p>All this strengthens my enthusiasm for continuing this work.
Luckily, Aaron has recruited Mathieu Fourment to work on getting a useful implementation, and every day we are getting good news about his improvements.
So stay tuned!</p>
Summer high school and undergraduate students 2016
http://matsen.fredhutch.org/general/2016/07/22/summer-scholars.html
Fri, 22 Jul 2016 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="http://matsen.fredhutch.org/images/summer-students-2016.jpg" width="385" class="pull-right" />
I definitely didn’t set out to have three high school and two undergrad students this summer.</p>
<p>But they’re fantastic, and making real contributions to our scientific work!
Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s <a href="https://github.com/psathyrella/partis/">partis</a> software.
Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages.
Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff.
All of them are learning about science and coding.</p>
<p>These students rock, and I can’t wait to see what great things they bring into the world with their talent.</p>
Analysis of a slightly gentler discretization of time-trees
http://matsen.fredhutch.org/general/2016/07/11/discrete-time-tree.html
Mon, 11 Jul 2016 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="/images/NNI_VS_rNNI.png" width="395" class="pull-right" /></p>
<p>Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference.
As such, we and others think a lot about how algorithms move between topologies, typically formalizing this as a path through a graph whose vertices are tree topologies and whose edges are moves from one tree to another.
Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.</p>
<p>But what about time-trees?
Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time.
These are absolute times of the sort one could put on a calendar.
To first approximation, working with time-trees is synonymous with running <a href="http://beast.bio.ed.ac.uk/">BEAST</a>, and we are indebted to the BEAST authors for making time-trees a central object of study in phylogenetics.
Such timing information is essential for studying demographic models, and in many respects time-trees are generally the “correct” tree to use because real evolution does happen in a rooted fashion and events do have times associated with them.</p>
<p>Because timing is so central to the time-tree definition, and because it’s been shown that defining
<a href="http://dx.doi.org/10.1109/BIBE.2008.4696663">the discrete component of time-tree moves with reference to timing information</a>
works well, perhaps we shouldn’t be so coarse when we discretize time-trees.
Rather than throw away all timing information, an alternative is to discretize calendar time and have at most one event per discretized time.
If we have the same number of times as we have events on the tree, this is equivalent to just giving a ranking, or total order, on the events in the tree which is compatible with the tree structure.
One can be a little less coarse still by allowing branch lengths to take integer values.</p>
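<p>The coarsest of these discretizations, replacing times with a ranking, can be sketched in a few lines. This is only an illustration of the idea, not code from the preprint; it assumes no two events share a time, and the names <code>rank_events</code> and <code>is_compatible</code> are hypothetical.</p>

```python
def rank_events(node_times):
    """Map each node to its rank, with rank 0 for the most recent event.

    node_times: dict of node name -> time before present (larger = older).
    Assumes no two events share a time, so the order is total.
    """
    if len(set(node_times.values())) != len(node_times):
        raise ValueError("ties would not give a total order")
    ordered = sorted(node_times, key=node_times.get)
    return {node: rank for rank, node in enumerate(ordered)}

def is_compatible(ranks, parent):
    """Check that every node is ranked strictly more recent than its parent."""
    return all(ranks[child] < ranks[par] for child, par in parent.items())
```

<p>Two time-trees with the same topology and the same order of events map to the same ranked tree, whatever their exact times, which is the sense in which the ranking is a discretization.</p>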
<p>The goal of our <a href="http://biorxiv.org/content/early/2016/07/12/063362">most recent preprint</a> is to analyze the space of these discretized time-trees.
The study was led by Alex Gavryushkin while he was at the Centre for Computational Evolution in Auckland.
Chris Whidden and I contributed bits here and there.
As for classical discretized trees, we build a graph with the discretized time-trees as vertices and a basic set of moves on these trees as edges.
The goal is to understand basic properties of this graph, such as shortest paths and neighborhood sizes.</p>
<p>Adding timing information makes a substantial difference in the graph structure.
The simplest example of this is depicted above, which compares the usual nearest-neighbor interchange (NNI) with its ranked tree equivalent (RNNI).
RNNI only allows interchange of nodes when those nodes have adjacent ranks.
The picture shows an essential difference between shortest paths for NNI and RNNI: if one would like to move the attachment points of two leaves a good way up the tree under NNI, it requires fewer moves to first bundle the leaves into a two-taxon subtree, move that subtree up the tree, then break apart the subtree.
On the other hand, for RNNI, a shortest RNNI graph path simply moves the attachment points individually up the tree.
This is an important difference: for example, the <a href="https://www.researchgate.net/publication/2643042_On_Computing_the_Nearest_Neighbor_Interchange_Distance">computational hardness proof of NNI</a> hinges on the bundling strategy resulting in shorter paths.</p>
<p>The most significant advance of the paper is the application of techniques from a paper by <a href="http://dx.doi.org/10.1137/0405034">Sleator, Tarjan, and Thurston</a> to bound the number of trees in arbitrary diameter neighborhoods.
The idea is to develop a “grammar” of transformations such that every tree in a neighborhood with a given radius <em>k</em> can be written as a word of length <em>k</em> in the grammar.
Then, the number of trees in the neighborhood is bounded above by the number of letters to the power of the word length.
Further refinements lead to some interesting bounds.
In an interesting twist, these neighborhood size bounds provide a generalized version of a counter-example like that shown in the figure, which shows in more generality that the arguments for the computational hardness proof of NNI do not hold.</p>
<p>Some very nice work from Alex.
There’s going to be more to this story, so stay tuned!</p>
A time-optimal algorithm to build the SPR subgraph on a set of trees
http://matsen.fredhutch.org/general/2016/06/30/sprgraphs.html
Thu, 30 Jun 2016 00:00:00 PDT
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="/images/sprgraphs.png" width="385" class="pull-right" /></p>
<p>We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods.
The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another.
Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree.
The work to understand this graph has been led by Chris Whidden, including <a href="/general/2014/05/13/sprmix.html">learning about how the graph structure influences Bayesian phylogenetic inference</a> and <a href="/general/2015/04/02/rspr-curvature-nodata.html">learning about the overall structure</a> of the graph.</p>
<p>These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes.
In the course of our work we have been getting progressively better at constructing this graph efficiently.
In our <a href="http://arxiv.org/abs/1606.08893">latest work</a> we develop a time-optimal algorithm.</p>
<p>Chris’ insight driving this new algorithm is that we shouldn’t be focusing on the trees and checking for pairs of adjacencies, but should rather shift focus to enumerating the potential adjacencies themselves.
These adjacencies can be formalized as structures called <em>agreement forests</em>, which in this case have two components.
If one is clever, and Chris is very clever, one can quickly store these forests and recognize whether they have been seen before.
The strategy then is to move through the trees in an arbitrary sequential order, storing all of the potential adjacencies of each tree.
If a given tree yields the same adjacency as a previously processed tree, then connect the two trees in the graph.</p>
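<p>The strategy above can be sketched in a few lines of Python. This is only a toy illustration under several assumptions that are not from the paper: trees are rooted and binary, they are written as nested tuples of leaf names, a plain dictionary stands in for the trie, and the names <code>canon</code>, <code>prunings</code> and <code>spr_subgraph</code> are hypothetical. It makes no claim to the time-optimality of the real algorithm.</p>

```python
from collections import defaultdict


def canon(tree):
    """Canonical form of a rooted tree: leaves are strings, internal
    nodes are tuples of children sorted in a fixed (repr) order."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canon(child) for child in tree), key=repr))


def prunings(tree):
    """Yield (pruned_subtree, remainder) for every non-root node of a
    rooted binary tree. Each pair is a two-component forest."""
    if isinstance(tree, str):
        return
    left, right = tree
    # Pruning a child of the root leaves its sibling as the remainder.
    yield left, right
    yield right, left
    # Pruning deeper inside one child keeps the sibling attached.
    for child, sibling in ((left, right), (right, left)):
        for pruned, rest in prunings(child):
            yield pruned, (rest, sibling)


def spr_subgraph(trees):
    """Return the set of index pairs (j, i), j < i, of distinct input
    trees that share a two-component forest, i.e. are one rooted SPR
    move apart. Assumes the trees are distinct and share one leaf set."""
    seen = defaultdict(set)  # forest key -> indices of trees that produced it
    edges = set()
    for i, tree in enumerate(trees):
        keys = {(canon(p), canon(r)) for p, r in prunings(tree)}
        for key in keys:
            for j in seen[key]:
                edges.add((j, i))
            seen[key].add(i)
    return edges


caterpillar = ((("a", "b"), "c"), "d")
balanced_1 = (("a", "b"), ("c", "d"))
balanced_2 = (("a", "c"), ("b", "d"))
print(sorted(spr_subgraph([caterpillar, balanced_1, balanced_2])))
```

<p>The point of the design is that no pair of trees is ever compared directly: each tree is visited once, and edges appear only when a stored forest is seen again.</p>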
<p>Although for this paper we obtained an asymptotically time-optimal algorithm, there is still interesting work to be done in order to get a fast implementation.
For example, we could be more thoughtful about exactly how the forests get serialized, which should lead to a faster lookup in the central <a href="https://en.wikipedia.org/wiki/Trie">trie</a>.
But not having done any coding at all, much less profiling, we don’t know where the bottlenecks will lie.
This paper was in part motivated by a fun <a href="http://phylobabble.org/t/how-to-recognize-a-rearranged-tree/599">discussion on phylobabble</a>.</p>
An information-theoretic analysis of phylogenetic regularization
http://matsen.fredhutch.org/general/2016/06/05/regularization.html
Sun, 05 Jun 2016 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/06/05/regularization.html<p><img src="/images/treefix.png" width="385" class="pull-right" /></p>
<p>How frequently are genes <a href="https://en.wikipedia.org/wiki/Horizontal_gene_transfer">transferred horizontally</a>?
A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places.
For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.</p>
<p>Using this approach requires that we have accurate trees for the genes.
That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations which give signal for tree building.
Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.</p>
<p>A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each gene individually, figuring that this is the best we can do with our incomplete data.
However, noisy estimates with our sparse data will definitely bias our estimates of the impact of horizontal gene transfer (HGT) upwards.
That is, if we get trees that are inaccurate in lots of random ways, it’s going to look like a lot of HGT.</p>
<p>We can do better by adding extra information into the inferential problem.
For example, in this case we know that gene evolution is linked with species evolution.
Thus in the absence of evidence to the contrary, it makes sense to assume that the gene tree follows a species tree.
From a statistical perspective, this motivates a <a href="https://en.wikipedia.org/wiki/Shrinkage_estimator">shrinkage estimator</a>, which combines other information with the raw data in order to obtain an estimator with better properties than estimation using the data alone.</p>
<p>One way of doing this is to take a Bayesian approach to the full estimation problem, which might involve priors on gene trees that pull them towards the species tree using a generative model; this approach has been elegantly implemented in phylogenetics by programs such as <a href="http://genome.cshlp.org/content/23/2/323.short">PHYLDOG</a>, <a href="http://mbe.oxfordjournals.org/content/28/1/273.full">SPIMAP</a> and <a href="http://dx.doi.org/10.1093/molbev/msp274">*BEAST</a>.
These programs are principled yet somewhat computationally expensive.</p>
<p>Another direction involves taking some sort of distance between the gene and species tree, and working to trade off a good value of the phylogenetic likelihood versus a good (small) value for this distance.
This distance-based approach works surprisingly well!
A <a href="http://dx.doi.org/10.1093/sysbio/sys076">2013 paper by Wu et al</a> proposed a method called TreeFix, which they showed performed almost as well as full SPIMAP inference, <em>even in simulations under the SPIMAP generative model!</em>
The cartoon above is from their paper, and illustrates that it makes sense to trade off some likelihood (height) for a lower reconciliation cost (lighter color).</p>
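<p>Schematically, the trade-off described above can be written as a penalized objective. This is a generic sketch rather than the exact objective of TreeFix or of any particular paper; the symbols are assumptions introduced here for illustration.</p>

```latex
\hat{T}_{\lambda}
  \;=\; \operatorname*{arg\,max}_{T}
  \Big\{\, \ell(T \mid D) \;-\; \lambda \, d(T, S) \,\Big\}
```

<p>Here \(\ell(T \mid D)\) is the log-likelihood of a gene tree \(T\) given sequence data \(D\), \(S\) is the species tree, \(d\) is some distance or cost between gene tree and species tree, and \(\lambda \ge 0\) sets the strength of the penalty. Setting \(\lambda = 0\) recovers plain maximum likelihood, while large \(\lambda\) pulls the estimate towards the species tree.</p>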
<p>This definitely got my attention, and made me wonder if one could develop relevant theory, as the theoretical justification for such an approach doesn’t follow “automatically” like it does for a procedure doing inference under a full probabilistic model.
Justification also doesn’t follow from the usual statistical theory for regularized estimators, because trees aren’t your typical statistical objects.
Then, one day <a href="http://vucdinh.github.io/">Vu</a>’s high-school and college friend <a href="https://sites.google.com/site/lamho86/">Lam Si Tung Ho</a> was visiting, and I suggested this problem to them.
<strong>They crushed it.</strong>
What resulted went far beyond what I originally imagined: a manuscript that not only provides a solid theoretical basis for penalized likelihood approaches in phylogenetics, but also develops many useful techniques for theoretical phylogenetics.
We have just put the manuscript <a href="http://arxiv.org/abs/1606.03059">up on arXiv</a>, which develops a likelihood estimator which is regularized in terms of the <a href="http://comet.lehman.cuny.edu/stjohn/research/treespaceReview.pdf">Billera-Holmes-Vogtman</a> (BHV) distance to a species tree.</p>
<p>First, the main results.
The regularized estimator is “adaptive fast converging,” meaning that it can correctly reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length.
Perhaps more remarkable, though, is that Vu and Lam have explicit bounds on the convergence of the estimator simultaneously in terms of both branch length and topology (via the BHV distance).
This goes beyond the standard theoretical phylogenetics framework of “did we get the right topology or not”.
Surprisingly, the theory and bounds all work even if the species tree estimate is distant from the true gene tree, though of course one gets tighter bounds if it is close to the true gene tree.</p>
<p>Second, the new theoretical tools.</p>
<ul>
<li>a uniform (i.e. not depending on the tree) bound on the deviation of the likelihood of a collection of sequences generated from a model from its expected value</li>
<li>an upper bound on the BHV distance between two trees based on the Kullback-Leibler divergence between their expected per-site likelihood functions</li>
<li>analysis of the asymptotics of the regularization term close to and far from the species tree.</li>
</ul>
<p>Of course, I don’t think that biologists will plug in their desired error into our bounds and just sequence an amount of DNA required to achieve that level of error.
That’s absurd.
What I hope is that this paper will add a theoretical aspect to the body of evidence that regularization is a principled method for phylogenetic estimation, and help convince phylogenetic practitioners that raw phylogenetic estimates are inherently limited.
We have all sorts of additional information these days that we can use for phylogenetic inference– let’s use it!
Just ask your local statistician: that community has been impressed by the <a href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator">surprising effectiveness of regularization</a> since the 50’s, and regularization in various forms has become a mainstay of modern statistical inference.</p>
Postdoctoral position to develop next-generation Bayesian phylogenetic methods
http://matsen.fredhutch.org/general/2016/04/18/altphylo-postdoc.html
Mon, 18 Apr 2016 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/04/18/altphylo-postdoc.html<p><img src="/images/altphlo-random-starts.png" width="260" class="pull-right" />
<em>Although we have recently made a hire in this area, we continue to look for strong junior scientists to work on this and related projects.</em></p>
<p>There is a lot more sequence data than in the early 2000’s, but inferential algorithms for Bayesian phylogenetic inference haven’t changed much since that time.
There have definitely been advances, such as more clever proposal distributions, swapping out heated chains, and GPU-enabled likelihood calculations, but the core remains the same: propose a new state via a small branch length and/or tree structure perturbation, and accept or reject according to the <a href="https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation_of_the_Metropolis-Hastings_algorithm">Metropolis choice</a>.
Furthermore, the community has packed more and more complexity into priors and models, which would lead to a computational bottleneck even with a fixed number of sequences.</p>
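<p>For readers who have not seen it written out, the propose-then-accept-or-reject core looks roughly like the following. This is a minimal sketch on a one-dimensional toy target (a single branch length with an exponential density), not code from any phylogenetics package; the function names and the target are assumptions made for illustration.</p>

```python
import math
import random


def metropolis(log_target, initial, step_size, n_steps, rng):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    state = initial
    log_p = log_target(state)
    samples = []
    for _ in range(n_steps):
        # Propose a small perturbation of the current state.
        proposal = state + rng.gauss(0.0, step_size)
        log_p_new = log_target(proposal)
        # Metropolis choice: accept with probability min(1, ratio).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            state, log_p = proposal, log_p_new
        samples.append(state)
    return samples


def log_exponential(t, rate=10.0):
    """Unnormalized log density of an exponential on t > 0 (toy target)."""
    return -rate * t if t > 0 else float("-inf")


rng = random.Random(0)
samples = metropolis(log_exponential, initial=0.1, step_size=0.1,
                     n_steps=50000, rng=rng)
kept = samples[5000:]  # discard burn-in
print(sum(kept) / len(kept))  # should be near the target mean 1 / rate
```

<p>In real phylogenetic MCMC the state is a tree with branch lengths and the proposals are tree rearrangements, but the accept/reject step has this same shape.</p>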
<p>It’s time to improve inferential algorithms.
Part of my inspiration lies in watching the revolution that has occurred in computational statistics in the last decade, in which the menu has greatly expanded beyond MCMC to include sequential Monte Carlo (SMC), Hamiltonian Monte Carlo (HMC), and variational inference, to name a few.
These algorithms are making qualitative improvements in the scale of data that can be incorporated into analysis, and come from new mathematical foundations with provable statistical properties.
The other part of the inspiration, well, just comes from being annoyed that we have this fairly hard limit of 1,000 sequences for Bayesian phylogenetic inference.
So, <a href="http://www.stat.washington.edu/vminin/">Vladimir Minin</a> and I teamed up on a successful grant application to develop new inferential methods.</p>
<p>We thus have an open postdoc position to work on fundamentally new methods for phylogenetic inference.
If you join us, you’ll be able to work with a truly marvelous team.
Vladimir is an excellent collaborator with a deep knowledge of phylogenetics and statistics.
<a href="http://www.armanbilge.com/research/">Arman Bilge</a>, with whom Vu and I are already developing phylogenetic HMC, is joining us for his PhD.
<a href="https://scholar.google.com/citations?user=itc4x9kAAAAJ&hl=en">Chris Whidden</a> is a world expert in the discrete structure of tree space.
<a href="http://vucdinh.github.io/">Vu Dinh</a> is greatly expanding our knowledge of the phylogenetic likelihood function and is using that knowledge to develop new inferential methods.</p>
<p>To round out this team, we need someone with serious programming chops who is motivated to build new mathematically-founded methods.
We really want to make a difference for end users of phylogenetics, so we can’t only be proving theorems.
The focus will be on building solid prototype implementations and then working with the very clever RevBayes and BEAST2 developers for more general distribution.
We’ll also be collaborating with <a href="http://darlinglab.org/">Aaron Darling</a>’s group on online phylogenetic SMC.
<em>[Doesn’t that sound fun? It does to me.]</em></p>
<p>The position will come with a competitive postdoc-level salary with great benefits for two years, with possibility of extension.
Fred Hutchinson Cancer Research Center, home of about 190 faculty including three Nobel laureates, is an independent, nonprofit research institution dedicated to the development and advancement of biomedical research.
The environment is lively yet casual, with a strong emphasis on collaborative work.
The Center is housed in a lovely campus next to Lake Union a short walk from downtown, and a slightly longer walk from the University of Washington.
Powerful computing resources and a helpful IT staff await.</p>
<p>If you are interested in this position, please send Erick some representative publications, code samples, and a CV.</p>
Likelihood-based clustering of B cell clonal families
http://matsen.fredhutch.org/general/2016/04/16/partis-clustering.html
Sat, 16 Apr 2016 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/04/16/partis-clustering.html<p><img src="/images/bcell-mess.png" width="350" class="pull-right" />
Antibodies are encoded by B cell receptor (BCR) sequences, which (simplifying somewhat) arise via a two-stage process.
The first stage is a random recombination process creating so-called naive B cells, which are the common ancestors in the trees to the right.
The second is initiated when such a naive cell (perhaps weakly) binds an antigen, and consists of a mutation and selection process to improve the binding of the BCR to the antigen.
The history of this process can be thought of as a phylogenetic tree descending from these naive common ancestors.
One can sequence the B cell receptors resulting from these processes in high throughput, and the resulting sequences form an implicit record of these complex processes in a single individual.</p>
<p>The situation from a phylogenetic inference perspective is a mess.
Sampled sequences have typically been mutated substantially from their ancestral naive sequence.
Thus given a pair of sequences sampled from the repertoire, it’s not clear if they share a common ancestral naive sequence— that is, if they even belong in the same tree.
Furthermore, in healthy individuals the size of these trees is very small compared to the number of sequences in the total repertoire.
This makes for a difficult, and very interesting, clustering problem.</p>
<p>Duncan Ralph and I have been working on this problem since he arrived several years ago, and I am very happy to announce that we’ve put up a manuscript describing that work on <a href="http://arxiv.org/abs/1603.08127">arXiv</a>.
This paper builds on our previous work on BCR sequence annotation and alignment <a href="/general/2015/03/23/partis-annotation.html">using a hidden Markov model</a>.
Using this HMM, we can define a likelihood for observing a given set of sequences distributed among a given collection of clusters.
This likelihood integrates over possible alternative annotations, which are formalized as paths through the HMM.</p>
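<p>To make “integrating over paths through the HMM” concrete, here is a minimal sketch of the standard forward algorithm on a toy two-state HMM, checked against brute-force enumeration of every path. The states, probabilities and function names are hypothetical and far simpler than the HMM used for BCR annotation; the sketch also handles a single sequence only.</p>

```python
from itertools import product


def forward_likelihood(obs, init, trans, emit):
    """P(obs) summed over all hidden paths, via the forward recursion."""
    states = list(init)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())


def brute_force_likelihood(obs, init, trans, emit):
    """Same quantity by explicit enumeration (exponential in len(obs))."""
    total = 0.0
    for path in product(list(init), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for prev, cur, symbol in zip(path, path[1:], obs[1:]):
            p *= trans[prev][cur] * emit[cur][symbol]
        total += p
    return total


# Hypothetical toy model: two hidden "segments" with different base usage.
init = {"V": 0.6, "J": 0.4}
trans = {"V": {"V": 0.7, "J": 0.3}, "J": {"V": 0.1, "J": 0.9}}
emit = {
    "V": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    "J": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
}
print(forward_likelihood("ATGC", init, trans, emit))
print(brute_force_likelihood("ATGC", init, trans, emit))
```

<p>The dynamic programming version does work linear in the sequence length, which is what makes summing over all annotations feasible at all.</p>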
<p>We had this general idea within the first several months of thinking about the problem together.
However, there’s a big difference between writing down an elegant formulation, even one with fast dynamic programming machinery, and actually building a system that scales to data sets of hundreds of thousands or millions of sequences.
This is where Duncan showed a tremendous amount of creativity and persistence by assembling layers of approximations and heuristics on top of this essential idea, and by developing a software package meant for others to use.</p>
<p>The code is available as part of the continuing development of <a href="https://github.com/psathyrella/partis/">partis</a>.
It also includes Duncan’s sophisticated BCR simulation package.</p>
<p>We’re under no illusions that we have “solved” this problem, and there’s still a lot to be done.
However, we believe that the likelihood-based approach in general and Duncan’s code in particular is a substantial advance over current methods, which use single-linkage clustering based on nucleotide edit distance.
If you work with BCR sequences, we hope you’ll give partis a spin and let us know what you think.</p>
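<p>For comparison, the single-linkage baseline mentioned above can be sketched as follows. This is an illustration only, with two simplifying assumptions: Hamming distance on equal-length sequences stands in for a general edit distance, and the threshold and function names are made up for the example.</p>

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs, threshold):
    """Cluster sequences so that any two within `threshold` mismatches
    end up together, directly or through a chain of close neighbors.
    Returns a sorted list of clusters, each a list of indices."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming(seqs[i], seqs[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


print(single_linkage(["AAAA", "AAAT", "AATT", "GGGG"], threshold=1))
```

<p>Note the chaining behavior: the first and third sequences differ at two positions, yet they are merged because each is one mismatch away from the second.</p>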
http://B-T.CR, a discussion site for immune repertoire analysis
http://matsen.fredhutch.org/general/2016/02/23/btcr.html
Tue, 23 Feb 2016 00:00:00 PSTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/02/23/btcr.html<p><img src="http://b-t.cr/uploads/default/original/1X/5d5d816ed777fda7d45634dc788b2046508d1834.png" width="350" class="pull-right" />
In my dream academic utopia, there would be free and open discussion and information sharing.
I love open-access journals and preprint servers, but some communication is better suited for a discussion format.
For example, open problems, information about resources and software, and discussion of standardization all benefit from having a more widely open discussion and a faster update time than journals or even preprint servers can provide.
Social media can host this discussion, but I think that it can be useful to put a dedicated website in place for discussion.</p>
<p>For these reasons, together with <a href="http://www.vet.cam.ac.uk/directory/sdf22@cam.ac.uk">Simon Frost</a> I’ve started a <a href="https://www.discourse.org/">Discourse</a> forum at <a href="http://B-T.CR/">http://B-T.CR/</a> for immune repertoire analysis.
If you aren’t already aware, the specificities of our adaptive immune cells are encoded in the sequences of the B and T cell receptors (BCRs and TCRs, hence the name), and these can now be sequenced in high throughput.
The result is a fascinating but challenging-to-interpret pile of sequences that, if we were smart enough, would tell us a whole lot about health history and prognosis.</p>
<p>This field, at least in its most recent high-throughput incarnation, is quite young and so we have a real opportunity to coordinate research efforts, reduce the pain of finding the right resources, and increase the fun of doing science.
If you are interested in this sort of work, I hope you will join and participate!
You can keep track of what’s happening on the forum by signing up for email notifications, through <a href="http://b-t.cr/latest.rss">RSS</a>, or by following <a href="https://twitter.com/bcr_tcr">@bcr_tcr</a> on Twitter.</p>
New results on the subtree prune regraft distance for unrooted trees
http://matsen.fredhutch.org/general/2015/11/25/uspr.html
Wed, 25 Nov 2015 00:00:00 PSTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/11/25/uspr.html<p><img src="/images/uspr.png" width="350" class="pull-right" />
In our previous work, Chris Whidden and I have been working to understand properties of phylogenetic Markov chain Monte Carlo (MCMC) by learning about the graph formed by all phylogenetic trees (as vertices) and tree rearrangements (as edges).
These tree rearrangements are ways of modifying one tree to get another.
For example, here we have a picture of modifying a tree via so-called unrooted SPR.</p>
<p>It’s natural to ask for the shortest path in this graph between two vertices, i.e. the number of unrooted SPR moves required to modify one tree to make another.
The number of such moves is called the unrooted SPR (uSPR) distance.
It turns out that the problem is hard in general but fixed parameter tractable, meaning that the complexity isn’t too bad for pairs of trees that aren’t too different from one another.
The current best algorithm for unrooted SPR distance cannot compute distances larger than 7, or reliably compare trees with more than 30 leaves.
This is in rather stark contrast to the <em>rooted</em> case, for which Chris’ latest algorithms can compute rooted SPR (rSPR) distances of greater than 50 with hundreds of taxa.
The construct behind efficient algorithms for rSPR is called an <em>agreement forest</em>; a 2001 theorem due to Allen and Steel shows that computing rSPR distance is equivalent to finding a maximum agreement forest.</p>
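<p>As a reminder of what “distance in the graph” means, here is a minimal breadth-first search on an explicitly stored graph. It is a sketch on a made-up four-vertex graph with hypothetical vertex names; it is not how SPR distances are computed in practice, since the number of trees grows super-exponentially with the number of leaves and the graph cannot be stored.</p>

```python
from collections import deque


def graph_distance(adjacency, source, target):
    """Length of a shortest path between two vertices of an unweighted
    graph given as a dict of neighbor lists, or None if disconnected."""
    if source == target:
        return 0
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                if neighbor == target:
                    return dist[neighbor]
                queue.append(neighbor)
    return None


# Hypothetical toy graph: vertices stand for trees, edges for single moves.
toy = {
    "T1": ["T2"],
    "T2": ["T1", "T3"],
    "T3": ["T2", "T4"],
    "T4": ["T3"],
}
print(graph_distance(toy, "T1", "T4"))
```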
<p>Thus, in order to try to improve the state of algorithms for the uSPR distance, our original goal was to define an agreement-forest like object for the unrooted case.
This didn’t work!
In fact Chris came up with some wild counter-examples showing that every characteristic one might want in an agreement forest does not hold in the unrooted case (see <a href="https://twitter.com/ematsen/status/635980017046937600">this tweet</a> for a picture).
These examples guided the rest of our work.</p>
<p>In the end, we developed a new distance, called the <em>replug distance</em>, that bounds the uSPR distance from below and is much easier to compute than the full uSPR distance.
Chris then developed a clever algorithm to use this and other approximate uSPR distances together to calculate the full uSPR distance.
This algorithm can compute uSPR distances as large as 14 between trees with up to 50 leaves.
He also proved a conjecture from 2008 that a certain type of tree simplification preserves uSPR distance.</p>
<p><a href="http://arxiv.org/abs/1511.07529">The paper</a> describing these results just went up on arXiv.
We tried to make it as simple and explicit as possible, but it ended up being a rather long and technical beast of a paper with a lot of custom notation and intermediate definitions.
Nevertheless, it’s a real Whidden opus.</p>
<p>For my part, I was hoping to connect the uSPR distance with a more established area of mathematics so that we could leverage some powerful existing machinery.
In particular, the theory of <a href="https://en.wikipedia.org/wiki/Matroid">matroids</a> seemed like a good fit, as the matroid independence condition could be used to ensure that all intermediate graph structures were actually trees.
This didn’t seem to be helpful in the end, although it did get us started thinking about <em>socket forests</em>.
I’d be glad to hear any ideas that anyone else has for connecting this sort of work with other parts of math.</p>
Postdoctoral position to study molecular evolution and phylogenetics of immune cells
http://matsen.fredhutch.org/general/2015/09/01/ab-postdoc.html
Tue, 01 Sep 2015 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/09/01/ab-postdoc.html<p><img src="/images/infant-ab.png" class="pull-right" />
<em>Although we have recently made a hire in this area, we continue to look for good people to work on this and related projects.</em></p>
<p>Our adaptive immune systems continually update themselves to neutralize and destroy pathogens.
The receptor sequences of antibody-making B cells undergo a Darwinian process of mutation and selection which improves their binding to antigen.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>
It is now possible to sequence these B cell receptors (BCRs) in high throughput, giving a profound new perspective on how the immune system responds to infection.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>
Although the elements of B cell affinity maturation are the same as molecular evolution in other settings, being based on recombination, point mutation, and selection, there are many important differences.
These differences, along with the volume of sequence data available, bring new challenges for phylogenetics and molecular evolution.</p>
<p>The translational medical consequences of improved methods will be significant.
Improved methods will especially help in understanding the development of <a href="http://dx.doi.org/10.1016/j.cell.2015.03.004">broadly neutralizing antibodies against HIV</a>.
The current best hope for an effective HIV vaccine is to try to elicit such antibodies, but in order to do so we need to understand in detail how such antibodies develop.
However, the effective antibodies naturally produced in adults require many years of mutation from an unlikely starting point.
Thus they will be exceedingly difficult to elicit with a vaccine.</p>
<p>Fortunately, <a href="http://research.fredhutch.org/overbaugh/en.html">Julie Overbaugh</a>’s group here at the Fred Hutch recently published <a href="http://www.nature.com/nm/journal/v20/n6/full/nm.3565.html">a landmark paper</a>, in which they found broadly neutralizing antibodies in HIV-positive infants.
These antibodies mature in significantly less time than in adults (above graphic from a <a href="http://www.nature.com/nm/journal/v20/n6/full/nm.3598.html">News & Views about the paper</a>).
The maturation pathway of these antibodies is not yet understood.</p>
<p>This open postdoc position is to develop new methods for B cell receptor sequence analysis and apply them to sequences from the Overbaugh lab infant study.
Our group is dedicated to bringing modern model-based statistical<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> molecular evolution and phylogenetics approaches to BCR sequence analysis that scale to large data sets.
Thus far we have developed means of doing <a href="http://dx.doi.org/10.1098/rstb.2014.0244">selection inference</a> on B cell receptors, <a href="http://arxiv.org/abs/1503.04224">VDJ annotation</a> (in press, PLOS Comp Bio), and clonal family inference (paper out soon).
Having made progress in these arenas, we are now starting to improve phylogenetic methods for B cell receptors.</p>
<p>We would like someone who is strong in computational statistics and/or molecular evolution, but who is motivated by and engaged with the underlying biology, or someone who is strong in immunology but is interested in computational methods.
There aren’t any formal requirements really, except the ability to make things happen with data.
The methods-development part of this postdoc will be joint with <a href="http://bedford.io">Trevor Bedford</a>, <a href="http://www.stat.washington.edu/vminin/">Vladimir Minin</a>, and Harlan Robins, co-founder of <a href="http://www.adaptivebiotech.com">Adaptive Biotechnologies</a>.</p>
<p>The position will come with a competitive postdoc-level salary with great benefits for two years.
There is some possibility of extension.
Fred Hutchinson Cancer Research Center, home of about 190 faculty including three Nobel laureates, is an independent, nonprofit research institution dedicated to the development and advancement of biomedical research.
The environment is lively yet casual, with a strong emphasis on collaborative work.
The Center is housed in a lovely campus next to Lake Union a short walk from downtown, and a slightly longer walk from the University of Washington.
Powerful computing resources and a helpful IT staff await.</p>
<!-- You can find out more about our group by visiting http://matsen.fredhutch.org/. -->
<p>If you are interested in this position, please send Erick some representative publications, code samples, and a CV.</p>
<p><em>If you are more interested in the evolutionary dynamics of B cells, please see <a href="http://bedford.io/blog/postdoc-repertoire-dynamics/">the related opening on Trevor’s website</a>.</em></p>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="http://dx.doi.org/10.1098/rstb.2014.0235">Cobey et al, 2015</a> <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="http://dx.doi.org/10.1038/nbt.2782">Georgiou et al, 2014</a> <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Here <em>model-based</em> means that our ideas about how the system works are formalized into a probabilistic model, and <em>statistical inference</em> means that uncertainty at every stage is pushed through to obtain parameter estimates with appropriate statements of confidence in those estimates. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
High school students 2015
http://matsen.fredhutch.org/general/2015/08/27/high-school.html
Thu, 27 Aug 2015 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/08/27/high-school.html<p><img src="/images/high-school-2015.jpg" class="pull-right" />
This summer we had two high school students, Andrew and Kate.
They were quite sharp, so we threw them in the deep end with some real science projects.
Andrew taught himself shell programming and Docker, and learned enough B cell analysis to work on making <a href="http://bioboxes.org">Bioboxes</a> to be used for <a href="https://github.com/btcr/">validation of B cell sequence analysis software</a>.
Kate taught herself shell programming and Python, and learned some undergraduate abstract algebra in order to study the SPR graph by characterizing its distribution of pairwise distances.
They were both very independent, and a pleasure to have in the office!</p>
Tanglegrams!
http://matsen.fredhutch.org/general/2015/07/22/tanglegrams.html
Wed, 22 Jul 2015 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/07/22/tanglegrams.html<p><img src="/images/no-flip-tanglegram.png" class="pull-right" />
Say we care about a function on pairs of trees (such as subtree-prune-regraft distance) that doesn’t make reference to the labels as such, but simply uses them as markers to ensure that the leaves end up in the right place.
We’d like to calculate this function for all trees of a certain type.
However, doing so for <em>every</em> pair of labeled trees is a waste, because if we just relabel the two trees in the same way, we will get the same result.</p>
<p>So, how many computations do we actually need to do?</p>
<p>It turns out that we only need to do one calculation per <em>tanglegram</em>.
A tanglegram is a pair of trees along with a bijection between the leaves of those trees.
They have been investigated in coevolutionary analyses before, and there’s a considerable literature concerning how to draw them in the plane with the minimal number of crossings.
However, the symmetries and number of tanglegrams on a given number of leaves have not been explored until now.</p>
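<p>For very small numbers of leaves one can count tanglegrams by brute force, which makes the savings concrete. The following sketch enumerates all ordered pairs of leaf-labeled rooted binary trees and groups them into classes under simultaneous relabeling of both trees. The representation (nested frozensets) and the function names are choices made for this example, and the approach is only feasible for a handful of leaves.</p>

```python
from itertools import combinations, permutations


def labeled_trees(leaves):
    """Yield every rooted binary tree on the given leaf labels.
    A leaf is its label; an internal node is a frozenset of two subtrees."""
    leaves = tuple(leaves)
    if len(leaves) == 1:
        yield leaves[0]
        return
    first, rest = leaves[0], leaves[1:]
    # Choose which other leaves share a side of the root with `first`,
    # always leaving at least one leaf for the other side.
    for k in range(len(rest)):
        for side in combinations(rest, k):
            left = (first,) + side
            right = tuple(x for x in rest if x not in side)
            for left_tree in labeled_trees(left):
                for right_tree in labeled_trees(right):
                    yield frozenset((left_tree, right_tree))


def relabel(tree, mapping):
    """Apply a leaf relabeling to a tree."""
    if isinstance(tree, frozenset):
        return frozenset(relabel(child, mapping) for child in tree)
    return mapping[tree]


def count_tanglegrams(n):
    """Count ordered pairs of labeled trees up to simultaneous relabeling."""
    leaves = range(n)
    trees = list(labeled_trees(leaves))
    perms = [dict(zip(leaves, p)) for p in permutations(leaves)]
    seen = set()
    count = 0
    for left in trees:
        for right in trees:
            if (left, right) in seen:
                continue
            count += 1  # a new equivalence class
            for m in perms:
                seen.add((relabel(left, m), relabel(right, m)))
    return count


n_trees = len(list(labeled_trees(range(4))))
print(n_trees * n_trees, "labeled pairs vs", count_tanglegrams(4), "tanglegrams")
```

<p>With four leaves there are 15 labeled trees and hence 225 labeled pairs, but only 13 tanglegrams.</p>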
<p>We recently posted two papers to the arXiv on tanglegrams.
In the first, a <a href="http://arxiv.org/abs/1507.04784">mathematical phylogenetics paper</a>, we do a preliminary algebraic and combinatorial study of tanglegrams, including unrooted and/or unordered tanglegrams (the picture above shows a rooted and an unrooted tanglegram).
There we describe in detail how tanglegrams are equivalent to certain double cosets of the symmetric group.
In the second, a <a href="http://arxiv.org/abs/1507.04976">combinatorics paper</a>, we (mostly my coauthors Billey and Konvalinka) derive an explicit formula for the number of rooted ordered tanglegrams.
This is pretty cool– there isn’t an equivalent formula for tree shapes (i.e. unlabeled trees not embedded in the plane).
The derivation is lovely.</p>
<p>In any case, by using tanglegrams rather than pairs of trees, one gets an order-factorial (of the number of leaves) reduction in computation.
Now we are applying this knowledge of tanglegrams with our high school student Kate to learn about tree space.</p>
New paper on the shape of the phylogenetic likelihood function
http://matsen.fredhutch.org/general/2015/07/15/stationary.html
Wed, 15 Jul 2015 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/07/15/stationary.html<p><img src="/images/like-curve-shape.png" class="pull-right" />
Imagine we have a tree, sequence data for the leaves of that tree, and some fixed mutation rate matrix.
Then we fix all of the branch lengths of that tree except for one.
The likelihood function restricted to that branch gives a function from the positive real numbers to the unit interval.
Question: what is the shape of that function?</p>
<p>I asked Vu this question when he arrived.
As described in our <a href="http://arxiv.org/abs/1507.03647">new paper on arXiv</a>, the answer is rather interesting, and more complex than I would have thought.
Vu did a fantastic job with this project, taking (surprisingly to me) an algebraic approach, defining the <em>characteristic polynomial</em> of a likelihood function, defining an algebraic structure on <em>conditional frequency patterns</em>, then using a result about path-connected subgroups.</p>
<p>To summarize, if the model is quite simple (JC, F81), then the likelihood has a single maximum.
However, likelihoods under more complex models such as K2P can take on arbitrarily weird shapes, such as having many global maxima.
We’re not saying that this happens all the time or even that it happens much at all for real trees on real data, but we have developed foundations that allow us to analyze phylogenetic likelihoods in more detail.
Having more such understanding will allow more confident statements about phylogenies and more effective tree reconstruction.</p>
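<p>The simplest possible instance of a one-branch likelihood is a two-taxon tree under JC, where everything is available in closed form. The sketch below is that special case only, not the analysis in the paper; the site counts are invented, and the function names are hypothetical. It evaluates the log-likelihood on a grid of branch lengths and compares the grid maximum with the closed-form maximizer.</p>

```python
import math


def jc69_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of branch length t > 0 between two sequences
    under JC69, given the number of sites and of differing sites."""
    e = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # a site shows one given identical pair
    p_diff = 0.25 * (0.25 - 0.25 * e)  # a site shows one given differing pair
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)


def jc69_mle(n_sites, n_diff):
    """Closed-form maximizer, valid when fewer than 3/4 of sites differ."""
    p = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


n_sites, n_diff = 100, 20  # made-up data
grid = [0.01 * i for i in range(1, 301)]
values = [jc69_log_likelihood(t, n_sites, n_diff) for t in grid]
best = grid[values.index(max(values))]
# Count interior grid points that are strict local maxima.
n_peaks = sum(
    1 for i in range(1, len(values) - 1)
    if values[i] > values[i - 1] and values[i] > values[i + 1]
)
print(best, jc69_mle(n_sites, n_diff), n_peaks)
```

<p>On this grid the curve rises to a single peak and then falls, in line with the single-maximum statement for simple models.</p>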
First paper on the curvature of tree space
http://matsen.fredhutch.org/general/2015/04/02/rspr-curvature-nodata.html
Thu, 02 Apr 2015 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/04/02/rspr-curvature-nodata.html<p><img src="/images/curvature-poincare.png" class="pull-right" />
Imagine a graph with vertices representing the trees on a given number of taxa, and edges connecting pairs of trees that can be transformed into each other by a single rearrangement (see below for the “rSPR” example).
All popular likelihood-based tree inference algorithms perform some traversal of this graph: Bayesian algorithms, for example, perform Markov chain Monte Carlo (MCMC) on it.
In our <a href="/general/2014/05/13/sprmix.html">recent Sys Bio paper</a>, Chris Whidden and I demonstrated graph effects on phylogenetic MCMC: the graph structure combined with the likelihood function led to bottlenecks in tree space where it was difficult to move from one peak of good trees to another.</p>
<p>These results motivated us to learn more about these graph structures.
Consider the best-studied of these graphs, the rSPR graph, in which vertices represent <em>rooted</em> trees, and edges are rooted subtree-prune-regraft operations, in which a rooted subtree is cut off of a tree and then reattached somewhere else in that tree with the same rooting.
Chris has done fundamental work on inferring shortest paths in this graph, <a href="http://dx.doi.org/10.1137/110845045">developing practical fixed-parameter tractable algorithms</a> and <a href="http://dx.doi.org/10.1093/sysbio/syu023">continuing to make them faster</a>.</p>
<p>Given that such graphs form the underlying space explored by phylogenetic tree inference algorithms, one would hope that we had a good understanding of this best-studied version of the graph.
However, all that is known about the rSPR graph is the <a href="http://dx.doi.org/10.1016/j.jcta.2011.04.013">diameter</a> (the maximum pairwise distance between vertices) and a <a href="http://dx.doi.org/10.1007/s00026-003-0192-0">recursive formula to calculate the degree of a vertex</a> (how many trees are one rSPR move away from a given tree vertex).
This is a great start, but is not enough detail to understand bottlenecks.</p>
<p>So Chris and I set out to learn more about the rSPR graph.
Our explicit intent was to learn characteristics that would be useful for understanding the difficulty of moving about tree space.
For this, we decided to apply <a href="http://dx.doi.org/10.1016/j.jfa.2008.11.001">recent advances in differential geometry</a> to understand the <em>curvature</em> of rSPR tree-space.
Pairs of vertices in a graph are positively curved with respect to a random walk if such walking brings walker positions closer together on average, and negatively curved if walking takes walker positions farther apart.
The image above in this post shows a tiling of the Poincaré disk. This tiling is a negatively curved graph: the uniformly-selected-neighbor random walk on average increases the distance between walkers.
We were able to compute curvatures on the rSPR graph for trees with up to 7 taxa under simple random walks, and to prove some theorems describing curvature for larger numbers of taxa.
These results are in a <a href="http://arxiv.org/abs/1504.00304">preprint now up on arXiv</a>, which is a fun start, although lots remains to be done.</p>
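<p>For readers who want to play with the definition, here is a small sketch of one standard formalization, coarse Ricci curvature, which I am assuming matches the notion described above: the curvature of a pair (x, y) is 1 − W(x, y) / d(x, y), where d is graph distance and W is the optimal transport distance between the one-step distributions of walkers started at x and y. To stay dependency-free, the sketch assumes x and y have the same degree, so that the optimal transport between uniform neighbor distributions reduces to a minimum-cost matching that can be brute-forced. The example graphs and all function names are illustrative only; this is not the code used for the rSPR graph.</p>

```python
from collections import deque
from itertools import permutations

def bfs_distances(adj, source):
    """Shortest-path distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def curvature(adj, x, y):
    """Coarse Ricci curvature of (x, y) for the uniformly-selected-neighbor walk.

    Assumption: x and y have equal degree, so transport between the two
    uniform neighbor distributions is optimized by a perfect matching.
    """
    nx, ny = sorted(adj[x]), sorted(adj[y])
    if len(nx) != len(ny):
        raise ValueError("this sketch only handles vertices of equal degree")
    dist = {u: bfs_distances(adj, u) for u in nx}
    best = min(sum(dist[u][v] for u, v in zip(nx, perm))
               for perm in permutations(ny))
    transport = best / len(nx)  # average distance walkers must be moved
    return 1 - transport / bfs_distances(adj, x)[y]

def complete_graph(n):
    return {i: [j for j in range(n) if j != i] for i in range(n)}

def cycle_graph(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def petersen_graph():
    edges = [(i, (i + 1) % 5) for i in range(5)]           # outer 5-cycle
    edges += [(i, i + 5) for i in range(5)]                # spokes
    edges += [(5 + i, 5 + (i + 2) % 5) for i in range(5)]  # inner pentagram
    adj = {i: [] for i in range(10)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

for name, graph in [("complete K5", complete_graph(5)),
                    ("cycle C6", cycle_graph(6)),
                    ("Petersen", petersen_graph())]:
    print(name, curvature(graph, 0, 1))
```

<p>On these toy graphs, adjacent vertices come out positively curved in the complete graph, flat in the 6-cycle, and negatively curved in the Petersen graph, which locally looks like a tree. The brute-force matching is only sensible for small degrees; anything larger needs a proper optimal transport solver.</p>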
New paper on annotation of BCR sequences
http://matsen.fredhutch.org/general/2015/03/23/partis-annotation.html
Mon, 23 Mar 2015 00:00:00 PDTmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2015/03/23/partis-annotation.html<p>The antigen binding properties of antibodies are determined by the sequences of their corresponding B cell receptors (BCRs).
These BCR sequences are created in “draft” form by VDJ recombination, which randomly selects and trims the ends of V, D, and J genes, then joins them together with additional random nucleotides.
If they pass initial screening and bind an antigen, these sequences then undergo an evolutionary process of mutation and selection, “revising” the BCR to improve binding to its cognate antigen.</p>
<p><img src="/images/annotation-problem.png" class="pull-left" />
Our <a href="/general/2014/03/23/mebcell.html">first paper on BCRs</a> concerned natural selection as part of the “revision” process, and when Duncan joined the group we got to work on the “drafting” part.
Specifically, the first step was to work on the <em>annotation problem</em>: given a BCR sequence, which nucleotides came from which genes or non-templated insertions?
We recently <a href="http://arxiv.org/abs/1503.04224">posted a paper on arXiv</a> describing our approach.
Like previous work, we use a hidden Markov model (HMM) for this problem, but unlike previous work, our emission and transition probabilities are parameter-rich categorical distributions, which are inferred “on the fly” for each data set.
We are motivated to do so by noting that these distributions deviate significantly, and reproducibly, from standard parametric distributions.
In our simulations we see significantly better performance using these parameter-rich distributions.
Next we will use the same framework to cluster BCR sequences by rearrangement event.</p>
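<p>As a toy illustration of the “drafting” process and of what an annotation looks like, here is a sketch that simulates a rearrangement and records the true label of every nucleotide. Everything in it is an assumption made for illustration: the three short “germline” segments are made up, trimming and insertion lengths are drawn uniformly, and there is no somatic mutation. It is not our HMM, just a generator for the kind of hidden labeling the HMM is trying to recover.</p>

```python
import random

# Hypothetical toy "germline" segments; real V, D, and J genes are much longer.
V_GENE = "CAGGTGCAGCTGGTGCAG"
D_GENE = "GGTATAGCAGCA"
J_GENE = "ACTACTTTGACTACTGG"

def random_insertion(rng, max_len=4):
    """Non-templated nucleotides added at a junction (uniform, for simplicity)."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))

def simulate_rearrangement(rng, max_trim=3):
    """Return (sequence, labels), where labels[i] says where base i came from:
    'V', 'D', or 'J' for a gene, 'N' for a non-templated insertion."""
    v = V_GENE[:len(V_GENE) - rng.randint(0, max_trim)]  # trim the 3' end of V
    d_start = rng.randint(0, max_trim)                    # trim both ends of D
    d_end = len(D_GENE) - rng.randint(0, max_trim)
    d = D_GENE[d_start:d_end]
    j = J_GENE[rng.randint(0, max_trim):]                 # trim the 5' end of J
    segments = [("V", v), ("N", random_insertion(rng)), ("D", d),
                ("N", random_insertion(rng)), ("J", j)]
    sequence = "".join(seg for _, seg in segments)
    labels = "".join(label * len(seg) for label, seg in segments)
    return sequence, labels

rng = random.Random(0)  # fixed seed so the example is reproducible
sequence, labels = simulate_rearrangement(rng)
print(sequence)
print(labels)
```

<p>The annotation problem is the reverse direction: given only the first printed line, infer the second.</p>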