Matsen group: general
http://matsen.fredhutch.org/rss/general.rss
general newsfeed
en-us
Thu, 13 Dec 2018 17:14:21 UTC
Open postdoc position to work on variational Bayesian phylogenetic inference
http://matsen.fredhutch.org/general/2018/12/11/altphylo-postdoc.html
Tue, 11 Dec 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/bayesian_phylo_hearts.png" width="350" class="pull-right" /></p>
<p>We are obsessed with finding efficient alternatives to random-walk MCMC for Bayesian phylogenetic inference.
We have developed online sequential Monte Carlo <a href="http://dx.doi.org/10.1093/sysbio/syx087">theory</a> and <a href="http://dx.doi.org/10.1093/sysbio/syx090">algorithms</a>, <a href="http://proceedings.mlr.press/v70/dinh17a.html">phylogenetic Hamiltonian Monte Carlo</a>, and inference via <a href="http://arxiv.org/abs/1811.11007">direct topology search</a> and <a href="http://arxiv.org/abs/1811.11804">efficient marginal likelihood computation</a>.</p>
<p>Come work with us on a strategy that is producing very promising results: variational Bayesian phylogenetic inference based on <a href="https://papers.nips.cc/paper/7418-generalizing-tree-probability-estimation-via-bayesian-networks">subsplit Bayesian networks</a>.
There are lots of opportunities for projects to flesh out this direction.
We would like to find someone who can collaborate with us on methods development and implementation, so both knowledge of Bayesian statistics and programming expertise are needed.
Experience with an existing code base for phylogenetics would be a big plus.</p>
<p>We’re stoked but are happy to wait for the right person to fill the position.
If you aren’t ready until this summer, no problem!</p>
<p><a href="https://careers-fhcrc.icims.com/jobs/12587/post-doctoral-research-fellow/job">Apply here</a> or just get in touch.</p>
Generalizing tree probability estimation via Bayesian networks
http://matsen.fredhutch.org/general/2018/12/05/sbn.html
Wed, 05 Dec 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/tree-smoother.png" width="350" class="pull-right" /></p>
<p>Posterior probability estimation of phylogenetic tree topologies from an MCMC sample is currently a pretty simple affair.
You run your sampler, you get out some tree topologies, you count them up, normalize to get a probability, and done.
It doesn’t seem like there’s a lot of room for improvement, right?</p>
<p>Wrong.</p>
<p>Let’s step back a little and think like statisticians.
The posterior probability of a tree topology is an unknown quantity.
By running an MCMC sampler, we get a histogram, the normalized version of which will converge to the true posterior in the limit of a large number of samples.
We can use that simple histogram estimate, but nothing is stopping us from taking other estimators of the per-topology posterior distribution that may have nicer properties.</p>
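<p>To make the simple histogram estimate concrete, here is a minimal Python sketch of it.
The sampled topologies are made up for illustration, and representing each topology as a canonical Newick string is an assumption of this sketch, not something any particular sampler guarantees.</p>

```python
from collections import Counter


def topology_posterior(sampled_topologies):
    """Estimate topology posterior probabilities as normalized sample counts."""
    counts = Counter(sampled_topologies)
    total = sum(counts.values())
    return {topology: n / total for topology, n in counts.items()}


# Hypothetical MCMC output: ten sampled topologies on four taxa.
samples = (
    ["((A,B),(C,D));"] * 6
    + ["((A,C),(B,D));"] * 3
    + ["((A,D),(B,C));"]
)
estimate = topology_posterior(samples)
for topology, probability in estimate.items():
    print(topology, probability)
```

<p>Any topology that never shows up in the sample gets probability zero under this estimator, which is exactly the limitation that smoothing is meant to address.</p>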
<p>For real-valued samples we might use kernel density estimates to smooth noisy sampled distributions, which may reduce error when sampling is sparse.
Because the number of phylogenies is huge, MCMC is computationally expensive, and we are naturally impatient, one is often in the sparsely-sampled regime for topology posteriors.
Can we smooth out stochastic under- and over-estimates of topology posterior probabilities by using similarities between trees? (See upper-right cartoon.)
This smoothing should also extend the posterior to unsampled topologies.</p>
<p>The question is, then, how do we do something like a kernel density estimate in tree space?
In a beautiful line of work <a href="http://dx.doi.org/10.1093/sysbio/syr074">started by Höhna and Drummond</a> and <a href="http://dx.doi.org/10.1093/sysbio/syt014">extended by Larget</a> one can think of each tree as being determined by local choices about how groups of leaves (“clades”) get split apart recursively down the tree.
Their work assumed independence between these clade splitting probabilities.</p>
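<p>Here is a rough Python sketch of the clade-splitting idea under the independence assumption.
It is an illustration rather than the published algorithms: trees are assumed to be rooted and bifurcating, they are encoded as nested tuples of leaf names, and all function names are hypothetical.</p>

```python
from collections import Counter


def clade_splits(tree):
    """Return (leaf set, [(clade, split), ...]) for a rooted binary tree.

    A tree is either a leaf name (str) or a pair of subtrees.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    left_leaves, left_splits = clade_splits(tree[0])
    right_leaves, right_splits = clade_splits(tree[1])
    clade = left_leaves | right_leaves
    split = frozenset([left_leaves, right_leaves])
    return clade, left_splits + right_splits + [(clade, split)]


def fit_clade_split_counts(trees):
    """Count how often each clade appears and how often it splits each way."""
    split_counts, clade_counts = Counter(), Counter()
    for tree in trees:
        for clade, split in clade_splits(tree)[1]:
            split_counts[(clade, split)] += 1
            clade_counts[clade] += 1
    return split_counts, clade_counts


def tree_probability(tree, split_counts, clade_counts):
    """Multiply the conditional split frequencies, treating clades independently."""
    probability = 1.0
    for clade, split in clade_splits(tree)[1]:
        if clade_counts[clade] == 0:
            return 0.0  # this clade was never observed in the sample
        probability *= split_counts[(clade, split)] / clade_counts[clade]
    return probability


# Two hypothetical sampled trees on six taxa.
sampled = [
    ((("A", "B"), "C"), (("D", "E"), "F")),
    ((("A", "C"), "B"), (("D", "F"), "E")),
]
split_counts, clade_counts = fit_clade_split_counts(sampled)

# An unsampled tree that mixes the two: (AB)C from the first, (DF)E from the second.
unsampled = ((("A", "B"), "C"), (("D", "F"), "E"))
p_unsampled = tree_probability(unsampled, split_counts, clade_counts)
p_sampled = tree_probability(sampled[0], split_counts, clade_counts)
print(p_unsampled, p_sampled)
```

<p>In this toy sample the unsampled mixture gets the same probability as each sampled tree, because the way one clade splits is treated as independent of how any other clade splits.</p>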
<p>This is a super-cool idea, but the formulation didn’t seem to work well for tree probability estimation from posterior samples on real data.
For example, <a href="http://dx.doi.org/10.1093/sysbio/syv006">Chris Whidden and I</a> noticed that this procedure underestimated the posterior for sub-peaks and overestimated the posterior between peaks.
This says that the conditional independence assumption on clades made by this method was too strong.
But this doesn’t doom the entire approach!
We just need to take a more flexible family of distributions over phylogenetic tree topologies.</p>
<p>I suggested this direction to Cheng Zhang, a postdoc in my group, and within a week he figured out the right construction that generalized this earlier work but allowed for much more complex distributions.
Cheng’s construction parameterizes a tree in terms of “subsplits,” which are the choices about how to split up a given clade.
To allow for more complex distributions than the previous conditional independence assumptions allow, he encodes tree probabilities in terms of a collection of subsplit-valued random variables that are placed in a Bayesian network.</p>
<p>The simplest such network enables dependence of a subsplit on the parent subsplit, which in tree terms means that when assigning a probability to a given subsplit we are influenced by what the sister clade is.
More complex networks can encode more complex dependence structures.
To our surprise, the simplest formulation worked well: allowing split frequencies for clades to depend on the sister clade gives a sufficiently flexible set of distributions to be able to fit complex tree-valued distributions.</p>
<p>In the simplest version one can write out the probability for a given tree like so:</p>
<p><img src="https://matsen.fredhutch.org/images/csd-example.png" width="1100" /></p>
<p>where the <i>q</i>s are inferred probability distributions that we call conditional subsplit distributions.</p>
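<p>As a rough illustration of conditioning on the parent subsplit, here is a Python sketch that estimates such conditional distributions by simple counting.
It makes several simplifying assumptions that are not from the paper: trees are rooted and bifurcating, they are encoded as nested tuples of leaf names, and the function names are hypothetical.</p>

```python
from collections import Counter


def leaves(tree):
    """Leaf set of a tree given as a leaf name (str) or a pair of subtrees."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaves(tree[0]) | leaves(tree[1])


def parent_child_subsplits(tree, parent=None):
    """Yield (parent subsplit, subsplit) per internal node; the root has parent None."""
    if isinstance(tree, str):
        return
    subsplit = frozenset([leaves(tree[0]), leaves(tree[1])])
    yield parent, subsplit
    for child in tree:
        yield from parent_child_subsplits(child, subsplit)


def fit_conditional_subsplits(trees):
    """Count subsplits within each (parent subsplit, clade) context."""
    pair_counts, context_counts = Counter(), Counter()
    for tree in trees:
        for parent, subsplit in parent_child_subsplits(tree):
            clade = frozenset().union(*subsplit)
            pair_counts[(parent, subsplit)] += 1
            context_counts[(parent, clade)] += 1
    return pair_counts, context_counts


def tree_probability(tree, pair_counts, context_counts):
    """Product of estimated conditional subsplit probabilities along the tree."""
    probability = 1.0
    for parent, subsplit in parent_child_subsplits(tree):
        clade = frozenset().union(*subsplit)
        seen = context_counts[(parent, clade)]
        if seen == 0:
            return 0.0  # this context never occurred in the sample
        probability *= pair_counts[(parent, subsplit)] / seen
    return probability


# Hypothetical sample: clade ABC splits as (AB)C when its sister is D,
# and as (AC)B when its sister is E.
t1 = (((("A", "B"), "C"), "D"), "E")
t2 = (((("A", "C"), "B"), "E"), "D")
pair_counts, context_counts = fit_conditional_subsplits([t1, t2])

# Same outer shape as t1, but with the ABC split taken from t2.
t3 = (((("A", "C"), "B"), "D"), "E")
p_t1 = tree_probability(t1, pair_counts, context_counts)
p_t3 = tree_probability(t3, pair_counts, context_counts)
print(p_t1, p_t3)
```

<p>In this toy sample, <code>t3</code> gets probability zero because (AC)B was never seen next to sister D, whereas an estimate that ignores the sister clade would give it positive mass.</p>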
<p>In addition to more complex dependence structure, Cheng’s approach also more formally treats this whole procedure as an exercise in estimating an approximating distribution.
Where previous efforts estimated probabilities by counting, one can do better in the unrooted case for subsplit networks by optimizing the parameterized distribution on trees to match an empirical sampled distribution of unrooted trees via expectation maximization.
One can also take some weak priors to handle the sparsely-sampled case.</p>
<p>We’ve written up these results in a <a href="https://papers.nips.cc/paper/7418-generalizing-tree-probability-estimation-via-bayesian-networks">paper that has been accepted to the NeurIPS</a> (previously NIPS<sup><a href="#footnote1">1</a></sup>) conference as a Spotlight presentation.
I’m proud of Cheng for this accomplishment, but consequently the paper is written more for a machine-learning audience than for a phylogenetics audience.
If you aren’t familiar with the Bayesian network formalism it may be a tough read.
The key thing to keep in mind is that the network (paper Figure 2) encodes the tree as a collection of subsplits assigned to the nodes of the network, and the edges describe probabilistic dependence.
For example, the reason we can think of a conditional subsplit distribution as conditioning on the sister clade (see figure above) is because parent-child relationships in the subsplit Bayesian networks must take values such that the child subsplit is compatible with the parent subsplit.</p>
<p>If you don’t read anything else, flip to Table 1 and check out how much better these estimates are on big posteriors from real data than what everyone does right now, which is just to use the simple fraction.
Magic!
Hopefully it makes sense that we are smoothing out our MCMC posterior, and extending it to unsampled trees.
If you have questions, I hope you will head on over to <a href="https://www.phylobabble.org/t/generalizing-tree-probability-estimation-via-bayesian-networks/1067">Phylobabble</a> and ask them. Let’s have a discussion!</p>
<p>Subsplit Bayesian networks open up a lot of opportunities for new ways of inferring posteriors on trees.
Stay tuned!</p>
<p>Also, we’d like to <a href="/general/2018/12/11/altphylo-postdoc.html">recruit a postdoc</a> to work in this area.
If you’re interested, get in touch!</p>
<hr />
<p><a name="footnote1">1</a>: I’m happy to report that the <a href="https://nips.cc/Conferences/2018/News?article=2118">NIPS conference has changed its name to NeurIPS</a>.
This is an important move that signals at least a desire by the board for diversity and inclusion in machine learning.
We can all hope that it is followed with concrete action.</p>
Human T cell receptor occurrence patterns encode immune history, genetic background, and receptor specificity
http://matsen.fredhutch.org/general/2018/05/15/pubtcr.html
Tue, 15 May 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/pubtcrs.png" width="470" class="pull-right" /></p>
<p>High-throughput sequencing of our adaptive immune repertoires holds great promise for understanding immune state.
These sequences implicitly contain a wealth of information on past and present exposures to infectious and autoimmune diseases, to environmental stimuli, and even to tumor-derived antigens.
In principle, we should be able to use these sequences of rearranged receptors to infer their eliciting antigens, either individually or collectively.</p>
<p>We’re starting to see neat progress in these areas for T cell receptors (TCRs).
Some recent studies compare TCR repertoire between individuals who do or do not have some immune state, such as <a href="https://academic.oup.com/bioinformatics/article/30/22/3181/2390867">an immunization</a>, <a href="https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1814-6">an autoimmune disease</a> or a <a href="http://dx.doi.org/10.1038/ng.3822">viral infection</a> and work to find sequence-level differences between the repertoires.
The Walczak-Mora team recently <a href="https://elifesciences.org/articles/33050">upped the bar</a> by not requiring a control cohort.
There has also been interesting progress on <a href="https://www.nature.com/articles/nature23091">predicting epitope specificity from TCR sequence</a> using structurally-informed sequence analysis.</p>
<p><a href="https://www.fredhutch.org/en/labs/profiles/bradley-phil.html">Phil Bradley</a>, just down the hall from us, wanted to take a different approach, asking <em>given appropriate statistical analysis of a sufficiently large data set, can we infer pathogen-responsive TCRs from co-occurrence and HLA information alone?</em>
(If you don’t remember about HLA, it determines the sequence of MHC, the <a href="https://en.wikipedia.org/wiki/Major_histocompatibility_complex#/media/File:MHC_Binding_Diagram.png">hot dog bun</a> presenting peptides for recognition to T cells.)
He showed that this indeed was the case, one example of which is shown in the figure above.
Each point is a cluster of TCR sequences, where clustering is performed based on both co-occurrence and on TCR sequence similarity.
Only TCR sequences that are significantly associated with an HLA type are allowed to participate in the clustering, and only clusters that were significant in terms of family-wise error rate are shown.
These clusters are plotted with respect to the cluster size and a co-occurrence score.</p>
<p>The surprising result is that this procedure, which knows nothing about what stimulated the TCRs to expand, identifies previously-labeled TCR sequences corresponding to certain immune states.
You probably recognize EBV, MS, and CMV, but we also see B19=parvovirus B19, INF=influenza, RA=rheumatoid arthritis, T1D=type 1 diabetes, and others.
That’s pretty neat!
This, along with other fun surprises, is up in a manuscript <a href="https://www.biorxiv.org/content/early/2018/05/02/313106">on bioRxiv</a>.</p>
<p>I made very minor contributions to this manuscript, but wanted to write about it because I think it’s an exciting advance.
This proof of concept is definitely motivating us to think harder about what sorts of statistical frameworks would be useful for doing this sort of research more comprehensively.
Thanks to Will and Phil, to the Hansen lab for the neat data, and to the study participants.</p>
The Bayesian optimist's guide to adaptive immune receptor repertoire analysis
http://matsen.fredhutch.org/general/2018/05/12/bayesian-optimist.html
Sat, 12 May 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/box-loop.png" width="470" class="pull-right" /></p>
<p>Immune receptor sequencing is stochastic through and through.
We have cells with random V(D)J rearrangements that are stimulated through some random process of exposures, which lead to some random amount of expansion, and in the B cell case there is some random process of mutation and selection.
So why don’t we use methods incorporating that uncertainty into our analysis?</p>
<p>We’ve tried to do this in our work, and have made some progress, but there is so much left to be done.
When Sarah Cobey and Patrick Wilson kindly invited me to contribute to their special issue of <em>Immunological Reviews</em>, I knew I wanted to step back and ask:</p>
<p><em>If computation was no barrier, how would we design an analysis framework that integrated out uncertainty in unknown quantities and took advantage of the hierarchical structure inherent in immune receptor data?</em></p>
<p>I teamed up with Branden Olson, a Statistics PhD student in the lab, and went to work.
It was a fun exercise to think through all of the steps of immune repertoire development and ask: what is the most realistic model under which inference should be possible, and what is the most realistic model for which we can perform simulation?
This was more effort than anticipated, but 230 references later the final version is now <a href="https://arxiv.org/abs/1804.10964">up on arXiv</a> and accessible for free (though I understand if you want to wait a few months to pay $38 and get it from the journal website).</p>
<p>In addition to dreaming research directions, I wanted to explain to my immunologist pals why I think probabilistic analysis methods are crucial, and describe the basics of Bayesian analysis via simple metaphors.
Ideally this will lead to a little more crosstalk between communities.
Traditionally, statisticians and lab biologists have been on independent tracks (see image above) even though they investigate the same underlying phenomena.
I hope that in the future we can unify these tracks by developing statistical models based on mechanism and design experiments based on statistical inferences.</p>
<p>I also hope that this serves as an invitation to the computational statistics community.
As we say at the end: “The computational statistician interested in immune receptor modeling is blessed with a complex biological system to analyze, intractable computational problems heaped on top of one another, and an ever-expanding collection of data sets generated from various in-vivo and in-vitro perturbations.”</p>
<p>Come play!</p>
Benchmarking tree and ancestral sequence inference for B cell receptor sequences
http://matsen.fredhutch.org/general/2018/05/02/bcr-phylo-benchmark.html
Wed, 02 May 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/bcr-phylo-benchmark.png" width="470" class="pull-right" /></p>
<p>Phylogenetic tools, in particular for ancestral sequence reconstruction, get used a lot in the B cell receptor (BCR) sequence analysis world.
For example, they get used to reconstruct intermediate antibodies that then get synthesized in the lab and tested for binding (<a href="http://dx.doi.org/10.1126/science.1207532">Wu et al., 2011</a>).
But how well do phylogenetic tools work in this parameter regime?
Although there have been countless benchmarking studies for phylogenetics, the case of B cell sequence evolution is different than the usual setting for phylogenetics:</p>
<ul>
<li>Sampling and sequencing, especially for direct sequencing of germinal centers, is dense compared to divergence between sequences. Because of the resulting distribution of short branch lengths, zero-length branches and multifurcations representing simultaneous divergence are common.</li>
<li>The somatic hypermutation (SHM) process in affinity maturation is a highly <a href="/general/2017/11/14/motif.html">nucleotide-context-dependent process</a>.</li>
<li>Repertoire sequencing typically focuses on the coding sequence of antibodies, which are under very strong selective constraint. This contrasts with the neutral evolution assumptions of most phylogenetic algorithms, as well as the simulation software assumptions traditionally used for phylogenetics benchmarks.</li>
<li>In contrast to typical phylogenetic problems where the root sequence is unknown, one has significant information about the root sequence for BCR sequences: namely, that it’s a recombination of V, (D), and J genes, which are somewhat well characterized.</li>
</ul>
<p>BCR sequences also offer additional opportunities for validation.
Specifically, the irreversible <a href="https://en.wikipedia.org/wiki/Immunoglobulin_class_switching">class switching process</a> gives us a marker that should only go in one direction along a tree branch.
If it goes another direction, this indicates problems with the tree reconstruction.</p>
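<p>As a sketch of how such a check could look, the Python below flags tree edges on which the isotype moves “backwards”.
Everything here is illustrative: the isotype ranking follows the usual ordering of human heavy-chain constant genes (with IgM and IgD tied, since both can be expressed before switching), the tree is a hypothetical list of parent-child edges, and we assume every node, including internal ones, has been assigned an isotype.</p>

```python
# Assumed ranking of human isotypes by position of the constant gene;
# switching can only move to a higher rank. IgM and IgD share rank 0.
ISOTYPE_RANK = {
    "IgM": 0, "IgD": 0,
    "IgG3": 1, "IgG1": 2, "IgA1": 3,
    "IgG2": 4, "IgG4": 5, "IgE": 6, "IgA2": 7,
}


def class_switch_violations(edges, isotype):
    """Return the (parent, child) edges where the child has a lower rank than the parent."""
    return [
        (parent, child)
        for parent, child in edges
        if ISOTYPE_RANK[isotype[child]] < ISOTYPE_RANK[isotype[parent]]
    ]


# Hypothetical inferred tree with an isotype label on every node.
edges = [("root", "n1"), ("root", "n2"), ("n1", "s1"), ("n1", "s2"), ("n2", "s3")]
isotype = {
    "root": "IgM", "n1": "IgG1", "n2": "IgM",
    "s1": "IgG1", "s2": "IgG3", "s3": "IgA1",
}
violations = class_switch_violations(edges, isotype)
print(violations)  # the n1 -> s2 edge goes from IgG1 back to IgG3
```

<p>The number of such edges gives a simple non-simulation score for a reconstructed tree: fewer violations is better.</p>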
<p>Before I sketch the results of our analysis, I should mention differences between our work and another <a href="http://dx.doi.org/10.1093/bioinformatics/btx533">recent paper</a> that also set up a benchmark of phylogenetic methods.
Much of that paper concerns the results of phylogenetic inference using a “toy” clonal family inference method with necessarily bad performance, whereas here we assume that clonal families have been properly inferred.
In addition, we simulate sequences under selection using an affinity-based model (which we show makes the inferential problem significantly more difficult), we compare accuracy of ancestral sequence inference, we include additional software tools (several of which are BCR-specific), and we use class-switching data as a further non-simulation means of benchmarking methods.</p>
<p>For this work, Kristian cooked up a simulator for B cell affinity maturation.
Although quite a lot of simulators have been written, going back to <a href="https://link.springer.com/chapter/10.1007%2F978-3-642-71984-4_13">Clone</a>, none of these did what we wanted, which was to use a context model to simulate mutations, and then use the corresponding amino acid sequences for a selection step.
Kristian’s model is simple, but nonetheless we feel that it does an appropriate job of simulating sequences for the purposes of benchmarking methods.
We show that the simulated data broadly speaking “looks like” germinal center data.</p>
<p>You can read the full results <a href="https://www.biorxiv.org/content/early/2018/04/25/307736">on bioRxiv</a>, but here are the things that surprised us:</p>
<ul>
<li>Picking between equally parsimonious trees using a context-sensitive model works surprisingly well. This makes us want to continue working on incorporating full context models into phylogenetic methods.</li>
<li>PHYLIP is quite a good choice! I thought that the BCR community was fairly behind the times by not using some of the more modern maximum-likelihood packages, but IQ-TREE is the only recently-developed package that does ML on trees and ancestral sequence inference, and it performs significantly worse (although it’s much faster and nicer to use!).</li>
<li><a href="http://dx.doi.org/10.1534/genetics.116.196303">IgPhyML</a> is a cool project that works to integrate hotspot motifs and Goldman-Yang codon modeling, which it does by marginalizing out hotspot motifs when they extend across a codon boundary. It does reasonably but not as well as we expected, which may be because we are benchmarking on the moderately-sized trees with which we have experience rather than the very deep broadly-neutralizing trees investigated in the IgPhyML paper.</li>
<li>The class-switching data gave noisier results than we had hoped for, giving error bars of the same magnitude as differences between methods. However, it confirmed that picking between equally parsimonious trees using a context-sensitive model increases accuracy. Perhaps with better sampling or just more data we can learn more from class-switching data in the future.</li>
</ul>
<p>There’s quite a lot more to do here, both in terms of method development and benchmarking, and we look forward to watching this area mature in the coming years.
Thanks to Kristian for his great work!</p>
Predicting B cell receptor substitution profiles using public repertoire data
http://matsen.fredhutch.org/general/2018/04/19/spurf.html
Thu, 19 Apr 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/spurf.png" width="470" class="pull-right" /></p>
<p>Can we predict how sites of an antibody will tolerate amino acid substitutions?
Kristian Davidsen posed this question shortly after he arrived in my group, pointing out that being able to do such prediction would be quite useful.
For example, engineered antibodies sometimes aggregate into clumps or have other properties that make them useless for mass production.
If we could figure out ways to change the amino acid sequence of an antibody without changing binding properties, that could help us avoid aggregation and make a more useful antibody.</p>
<p>How to start to address this complex and high-dimensional question?
Although people have started to do <a href="http://dx.doi.org/10.7554/eLife.23156">deep mutational scanning on antibodies</a> this type of data is hard to come by.
On the other hand, B cell repertoire (i.e. antibody-coding) sequence data is becoming plentiful.
B cells undergo affinity maturation to improve binding in collections of sequences called “clonal families” grouped by naive ancestor sequence (more background <a href="/general/2016/04/16/partis-clustering.html">here</a>).
Although it’s not quite the same, we can use the frequency of an amino acid at a given site in that clonal family as a proxy for the suitability of that amino acid for an antibody binding the same target.
Or perhaps such a clonal-family amino-acid frequency is simply an interesting object in itself.</p>
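<p>For concreteness, here is a small Python sketch of computing per-site amino acid frequencies for one clonal family.
The sequences are made up, and the sketch assumes they are already aligned to a common length; gaps or other non-standard characters are simply ignored when normalizing.</p>

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_frequencies(aligned_sequences):
    """Return one {amino acid: frequency} dict per aligned site."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for site in range(length):
        counts = Counter(seq[site] for seq in aligned_sequences)
        total = sum(counts[aa] for aa in AMINO_ACIDS)
        # Only amino acids that actually occur at this site are kept.
        profile.append(
            {aa: counts[aa] / total for aa in AMINO_ACIDS if counts[aa]}
        )
    return profile


# Hypothetical clonal family of four short aligned amino acid sequences.
family = ["CARDY", "CARDY", "CAKDY", "CTRDY"]
profile = amino_acid_frequencies(family)
for site, frequencies in enumerate(profile):
    print(site, frequencies)
```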
<p>In any case, our goal became:
<em>given a single sequence from a clonal family, can we predict the amino acid frequency of the collection of sequences in the clonal family?</em>
We follow <a href="http://dx.doi.org/10.3389/fimmu.2017.00537">Sheng, Schramm et al. (2017)</a> in calling this sort of thing a <em>substitution profile</em>.
Inferring a substitution profile from a single sequence might sound hard or impossible, but several features of the affinity maturation process lean in our favor:</p>
<ol>
<li>There are a finite number of germline ancestor sequences from which diversification begins, and we can do a good job of inferring from which ancestor a given B cell sequence derives.</li>
<li>Simply because of the mutation process, some sites are more likely to mutate than others (recently covered <a href="/general/2017/11/14/motif.html">here</a>).</li>
<li>There’s lots of other repertoire data that we can use to watch the affinity maturation process.</li>
</ol>
<p>This last one is sort of special, and deserves a bit of explanation.
If we had a database containing every B cell sequence that had ever occurred, one could simply look for clonal families containing the sequence given to us, and take the average amino acid profile of those clonal family sequences.
Unfortunately we don’t have access to such a database, but we can at least look for somewhat similar sequences and learn from their substitution profiles.</p>
<p>The previous Sheng-Schramm work, as well as contemporaneous work by <a href="http://dx.doi.org/10.3389/fimmu.2017.01433">Kirik et al. (2017)</a>, also indicates that various germline genes diversify in various characteristic ways (this sentiment also appears in <a href="http://dx.doi.org/10.1371/journal.pcbi.1004409">Duncan’s first B cell paper</a> and I’m sure many other previous works).
This tells us that a profile based on germline gene identity should also inform a predicted substitution profile.
Also, the context-sensitive neutral process given a germline gene should be helpful.</p>
<p>How do we combine these various sorts of information, especially considering that what is helpful for prediction at one site might not be helpful for another?
Well, our group, consisting of Kristian, Amrit Dhar, and Vladimir Minin, decided to use a penalized tensor regression framework.
That sounds fancy, but it just means that a single profile is a weighted linear combination of the profiles from each of the sources of information (see picture above).
The weights may differ from site to site, but the kind of penalization we put on keeps them from changing too much between neighboring sites.
It also zeroes out coefficients that don’t seem to be helping out-of-sample prediction.
We find that different sources of information are useful for different parts of the B cell receptor sequence, in a way that corresponds to intuition about the “framework” and “complementarity determining” regions.</p>
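<p>To illustrate what “weighted linear combination with penalization” means, here is a toy Python sketch.
It is not the SPURF implementation: the two-letter alphabet, the profiles, and the weights are invented, and the particular penalty (an absolute-value term that encourages zeros plus an absolute-difference term between neighboring sites) is an assumed form that matches the behavior described above.</p>

```python
def combine_profiles(source_profiles, weights):
    """Per-site weighted sum of profiles.

    source_profiles[k][i][a]: frequency of state a at site i from source k.
    weights[k][i]: weight of source k at site i.
    """
    n_sources = len(source_profiles)
    n_sites = len(source_profiles[0])
    n_states = len(source_profiles[0][0])
    return [
        [
            sum(weights[k][i] * source_profiles[k][i][a] for k in range(n_sources))
            for a in range(n_states)
        ]
        for i in range(n_sites)
    ]


def penalty(weights, sparsity=1.0, smoothness=1.0):
    """Assumed penalty: push weights to zero and keep neighboring sites similar."""
    absolute = sum(abs(w) for row in weights for w in row)
    neighbor_change = sum(
        abs(row[i + 1] - row[i]) for row in weights for i in range(len(row) - 1)
    )
    return sparsity * absolute + smoothness * neighbor_change


# Two hypothetical information sources over three sites and a two-letter alphabet.
source_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
source_b = [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
weights = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]

combined = combine_profiles([source_a, source_b], weights)
cost = penalty(weights)
print(combined, cost)
```

<p>Fitting would then mean choosing the weights to predict held-out profiles well while keeping this penalty small; that optimization step is left out of the sketch.</p>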
<p>In any case, we show that integrating these diverse sources of information can help prediction, and provide a pre-trained prediction algorithm to do so.
The code and parameters are <a href="https://github.com/krdav/SPURF">on Github</a> and the paper is <a href="http://arxiv.org/abs/1802.06406">on arXiv</a>.
So have at it with your sequences, and let us know how it fares!</p>
<p>I think that predicting substitution profiles is an interesting and useful goal.
It did take a little getting used to, because we previously <a href="http://dx.doi.org/10.1098/rstb.2014.0244">worked super hard</a> to get per-residue natural selection estimates for B cell receptors by carefully separating the mutation and selection processes; here these substitution profiles just smash all that complexity down to a simpler object.
There’s more to be done here: as data sets get bigger and machine learning algorithms get smarter, I look forward to seeing prediction improve!
Thanks to Amrit, Kristian, and Vladimir for a fun project.</p>
Postdoc opening to learn about antibody development during HIV superinfection
http://matsen.fredhutch.org/general/2018/01/10/ab-postdoc.html
Wed, 10 Jan 2018 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p>Please see <a href="https://b-t.cr/t/506">https://b-t.cr/t/506</a> for details.</p>
Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data
http://matsen.fredhutch.org/general/2017/12/01/gl-inf.html
Fri, 01 Dec 2017 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/gl-inf-fits.png" width="350" class="pull-right" /></p>
<p>Every B cell receptor sequence in a repertoire came from a V(D)J recombination of germline genes.
Each individual has only certain alleles of these genes in their germline, and knowing this set improves the accuracy of all aspects of BCR sequence analysis, from alignment to phylogenetic ancestral sequence reconstruction.
This germline allele set can be estimated directly from BCR sequence data, and it’s time to treat such estimation as part of standard BCR sequence analysis pipelines.</p>
<p>This central message is not new, but it’s worth emphasizing because doing germline set inference is not part of most current studies of B cell receptor (BCR) sequences.</p>
<p>Indeed, the most common way to annotate sequences is to align them one by one to the full set of alleles present in the IMGT database, which has hundreds of alleles.
Each individual has only a fraction of these alleles in their genome.</p>
<p>Unsurprisingly, aligning sequences one by one to the whole IMGT set can cause problems.
Imagine that A and B are two germline alleles in IMGT that are similar to one another.
Sequences deriving from germline allele A can somatically hypermutate to look more similar to the B allele than the A allele from which they came.
If we allow A and B in our germline repertoire, such sequences will be incorrectly annotated as being from B when they are from A.
This will certainly lead to an incorrect estimation of the naive sequence from which they came.</p>
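<p>Here is a toy Python version of the A-versus-B scenario.
The allele sequences and the read are invented, and nearest-allele assignment by Hamming distance is a deliberately crude stand-in for real annotation software.</p>

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def annotate(read, germline_set):
    """Assign the read to the closest allele in the given germline set."""
    return min(germline_set, key=lambda name: hamming(read, germline_set[name]))


# Hypothetical database alleles that differ at two positions.
database = {"A": "ACGTACGTAC", "B": "ACGTTCGAAC"}

# A read that truly derives from A, but whose somatic mutations happen to
# hit both A/B difference positions, plus one further mutation at the end.
read = "ACGTTCGAAT"

# This individual only carries allele A.
per_sample_set = {"A": database["A"]}

with_full_database = annotate(read, database)
with_per_sample_set = annotate(read, per_sample_set)
print(with_full_database, with_per_sample_set)
```

<p>With the full database the read lands on B, one mismatch away, instead of its true origin A, three mismatches away; restricting to the alleles the individual actually carries removes the wrong option.</p>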
<p>In addition, it’s known through the work of many groups that the total set of germline genes is much larger than that represented in IMGT.
This is not surprising given that this region is tricky to sequence directly, and that so far genetic studies have been primarily done on people of European ancestry.
Here again, if we are missing a sequence from our germline set, we will have problems with all of our downstream analyses.</p>
<p>Thus, we should be estimating per-sample germline sets for BCR sequence data.
This is not a trivial task.
In 2010, <a href="http://dx.doi.org/10.4049/jimmunol.1000445">Scott Boyd and others</a> were the first to use high-throughput sequencing data of rearranged BCRs to estimate per-sample germline sets with a combination of computation, expert judgement, and statistics.
In 2015, the Kleinstein group made a big step by developing TIgGER, an <a href="http://dx.doi.org/10.1073/pnas.1417683112">automated method for inferring germline sets</a> that weren’t too far from existing alleles, and more recently the Hedestam group developed IgDiscover, a <a href="http://dx.doi.org/10.1038/ncomms13642">method that could start more “from scratch”</a> for species where we have little or no germline information.</p>
<p>The motivation for Duncan’s work came from analyzing sequence data from diverse sources, and seeing clear evidence of alleles that were not represented in IMGT.
He tried the existing tools but became frustrated first with software usability.
He then started by re-implementing TIgGER, and then realized that he could use the same input information (their “mutation accumulation” plot depicted above) but in a way that more directly tests for the presence of new alleles, by considering the goodness of fit for one- vs two-component fits.
In classic Duncan fashion, he has done a ton of validation, varying many different parameters in his simulation and also comparing the results of the different methods on experimental data sets.
The work is now <a href="https://arxiv.org/abs/1711.05843">up on arXiv</a> and is part of his <a href="https://github.com/psathyrella/partis">partis</a> suite of repertoire analysis tools.</p>
<p>There’s still a lot to be done here, and our knowledge of this highly diverse and important locus will continue to improve as more sequencing data of all types comes in.
This is one example of many showing how analysis of a whole data set at once is more powerful for each individual sequence than one-at-a-time analysis of sequences.</p>
Survival analysis of DNA mutation motifs with penalized proportional hazards
http://matsen.fredhutch.org/general/2017/11/14/motif.html
Tue, 14 Nov 2017 00:00:00 UTC
matsen@fredhutch.org (Frederick A. Matsen)
<p><img src="https://matsen.fredhutch.org/images/motif-samm-example.png" width="350" class="pull-right" /></p>
<p>We are equipped with purpose-built molecular machinery to mutate our genome so that we can become immune to pathogens. This is truly a thing of wonder.</p>
<p>More specifically, I’m talking about mutations in B cells, the cells that make antibodies.
Once a randomly-generated antibody expressed on the outside of the B cell finds something it’s good at binding, the cell boosts the mutation rate of its antibody-coding region by about one million fold.
Those that have better binding are rewarded by stimulation to divide further.
The result of this Darwinian mutation and selection process is antibodies with improved binding properties.</p>
<p>The mutation process is wonderfully complex and interesting. Being statisticians, we paid the highest tribute we can to a process we think is beautiful: we developed a statistical model of it. This work was led by the dynamic duo of <a href="http://students.washington.edu/jeanfeng/">Jean Feng</a> and <a href="https://github.com/dawahs">David Shaw</a>, while <a href="http://www.stat.washington.edu/vminin/">Vladimir Minin</a>, <a href="http://faculty.washington.edu/nrsimon/">Noah Simon</a> and I kibitzed.
Our model is known in statistics as a type of <a href="https://en.wikipedia.org/wiki/Proportional_hazards_model">proportional hazards model</a>. These models were introduced in Sir David Cox’s paper <a href="https://www.jstor.org/stable/2985181"><em>Regression Models and Life-Tables</em></a>, which with over 4600 citations is <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.231.5042&rep=rep1&type=pdf">the second most cited paper in statistics</a>.</p>
<p>These models are typically used to infer rates of failure, such as that of humans getting disease.
During our life span we get a sequence of diseases, some of which predispose us to other diseases.
By considering sequences of diseases across many individuals, we can use these proportional hazards models to infer the rate of getting various diseases given disease history.</p>
<p>There is an analogous situation for B cell sequences in that the mutation process depends significantly on the identity of the nearby bases.
We can observe lots of mutated sequences, and do a similar sort of inference: when a position mutates, it changes the mutability of nearby bases.
Unfortunately we don’t know the order in which the mutations occurred, and thus don’t know what sequences had increased mutability, so we have to do Gibbs sampling over orders.
We have just posted our preprint describing the methods and some results to <a href="https://arxiv.org/abs/1711.04057">arXiv</a>.</p>
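<p>To make the "a mutation changes the mutability of nearby bases" idea concrete, here is a toy sketch. It is not the code from the preprint: the function name, the <code>theta</code> values, and the choice of 3-mer motifs are assumptions made for illustration. Each position gets a hazard <code>exp(theta[motif])</code> for the motif centered on it, and with a shared baseline hazard the chance that a position is the next to mutate is its hazard divided by the total.</p>

```python
import math

def next_mutation_probs(seq, theta, k=3):
    """Probability that each position of seq is the next one to mutate.

    theta maps a k-mer motif to a log-hazard (hypothetical values);
    motifs missing from theta get log-hazard 0. Positions too close to
    the ends to have a full k-mer are given hazard 0 in this sketch.
    """
    half = k // 2
    hazards = []
    for i in range(len(seq)):
        if i < half or i + half >= len(seq):
            hazards.append(0.0)
        else:
            motif = seq[i - half:i + half + 1]
            hazards.append(math.exp(theta.get(motif, 0.0)))
    total = sum(hazards)
    return [h / total for h in hazards]

theta = {"AGC": 1.5}  # hypothetical: positions centered in AGC are "hot"
before = next_mutation_probs("TAGCTA", theta)
after = next_mutation_probs("TAGTTA", theta)  # the C at index 3 became T
```

<p>Comparing <code>before[2]</code> with <code>after[2]</code> shows the point: the G at index 2 sits in a hot motif until its neighbor mutates, after which it is no more mutable than the other interior positions. This is also why the unknown order of mutations matters for inference.</p>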
<p>We were inspired by the very nice <a href="http://dx.doi.org/10.3389/fimmu.2013.00358">work</a> of the <a href="http://medicine.yale.edu/lab/kleinstein/">Kleinstein lab</a> developing similar sorts of models using simpler methods.
However, we wanted a more flexible modeling framework and for the complexity of the models to automatically scale to the signal in the data, which we did using penalization with the LASSO.
What you see in the figure above is how we can set up a hierarchical model with a penalty that zeroes out 5-mer terms when they don’t contribute anything above the corresponding 3-mer term (the last base being unimportant gives the block-like structure, while when the first base is unimportant it gives the 4-fold repetitive pattern you can see when zooming out).
We are also indebted to Steve and his team, especially Jason Vander Heiden, for supplying us with sequence data.
They are a class act.</p>
<p>There’s a lot of interest in context-sensitive mutation processes these days, such as <a href="https://elifesciences.org/articles/24284">Kelley Harris’ work</a> on how we can watch context-sensitive mutabilities change through evolutionary time, and <a href="https://doi.org/10.1038/nature12477">Ludmil Alexandrov’s work</a> on mutation processes in cancer.
In both of these cases, they are in the process of transitioning from a statistical description of these processes to linking them with specific mutagens and repair processes.</p>
<p>Here too we would like to use statistics to learn more about the mechanisms behind these context-sensitive mutations.
What’s neat about the framework that Jean and David developed is that now we can design features that correspond to specific mechanistic hypotheses and test how much they impact mutation rates.
Stay tuned!</p>
Using genotype abundance to improve phylogenetic inference
http://matsen.fredhutch.org/general/2017/09/05/gctree-phylo.html
Tue, 05 Sep 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/09/05/gctree-phylo.html<p><img src="https://matsen.fredhutch.org/images/gctree-phylo.png" width="350" class="pull-right" /></p>
<p>When doing computational biology, listen to biologists.
I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.</p>
<p>In this particular case, the starting point was the <a href="http://dx.doi.org/10.1126/science.aad3439">stunningly beautiful work of Gabriel Victora’s lab</a> visualizing germinal center dynamics in living mice.
For those not yet initiated into the beauty of B cell repertoire, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binder reproduces more.
As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts.
Such single-cell sequencing, and consequent abundance information, is now becoming commonplace.
<em>How should we use this abundance information in phylogenetics?</em></p>
<p>Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered.
Indeed, they were building trees by hand, using several criteria about what makes for a believable evolutionary scenario.
One of their intuitions was that <em>more abundant genotypes have more opportunity to leave mutant descendants</em>.
Therefore, when we are doing inference, we should prefer trees that attach branches to frequently observed genotypes compared to less frequently observed genotypes (see picture, in which the frequency of a given genotype is the number inside the circle; we call this structure a genotype collapsed tree or <em>GCtree</em>).</p>
<p>To have an objective computational method we need to formalize this intuition.
Will DeWitt, Vladimir Minin, and I formulated it in terms of an “infinite type” branching process, in which every mutation creates a new type.
We can augment existing sequence-based optimality criteria with the likelihood of the tree under our branching process model.
In our case we decided to show that this works by ranking maximum-parsimony trees (there are often many equally parsimonious trees).
Parsimony is in wide use in the B cell analysis community because it is a defensible choice when sampling is dense relative to mutations (as in the case of germinal centers), and it allows inference of zero branch lengths (leading to inference of sampled ancestral genotypes and multifurcations).
We showed under simulation that more highly ranked trees were more correct than lower ranked trees.
With the paired heavy and light chain data from the Victora lab, we were also able to do a biological validation by showing that trees that should be the same are more similar when using our algorithm than without.
The result is now <a href="http://arxiv.org/abs/1708.08944">up on arXiv</a>.</p>
<p>If you are muttering to yourself that we should be using this model as a prior for a Bayesian analysis, we hear you.
Hopefully this motivates additional work in that sphere for abundance-based models.
We do note that the limited amount of mutation described above will lead to a fairly flat posterior.
Furthermore, although one can infer sampled ancestors using an <a href="http://dx.doi.org/10.1371/journal.pcbi.1003919">RJMCMC</a> and multifurcations using <a href="http://dx.doi.org/10.1093/sysbio/syu132">phycas</a>, these two features do not exist yet under one roof.</p>
<p>Will did a great job with this project, which is a nice complement to his existing publications as he heads into the UW Genome Sciences PhD program!
We had a great time working with Luka and Gabriel, and look forward to more collaboration in the future.</p>
Probabilistic Path Hamiltonian Monte Carlo
http://matsen.fredhutch.org/general/2017/06/26/pphmc.html
Mon, 26 Jun 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/06/26/pphmc.html<p><img src="https://matsen.fredhutch.org/images/hmc-tub.png" width="350" class="pull-right" /></p>
<p>Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions.
HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state.
These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.</p>
<p>Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down.
When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction.
We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal.
Calculating these proposals requires simulating the soap dynamics, which is done by numerically integrating physics equations (hence the Hamiltonian in HMC).
The acceptance ratio is determined only by how well our numerical integration performs: better numerical integration means a higher acceptance probability.</p>
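<p>For readers who like to see the soap slide, here is a minimal sketch of one-dimensional HMC using the standard leapfrog integrator. It targets a standard normal rather than anything phylogenetic, and the function names, step size, and trajectory length are assumptions chosen for illustration. The acceptance probability depends only on the energy error that the integrator accumulates.</p>

```python
import math
import random

def leapfrog(q, p, grad_U, step, n_steps):
    """Approximate frictionless motion on the potential U (leapfrog scheme)."""
    p -= 0.5 * step * grad_U(q)          # half step for momentum
    for i in range(n_steps):
        q += step * p                    # full step for position
        if i < n_steps - 1:
            p -= step * grad_U(q)        # full step for momentum
    p -= 0.5 * step * grad_U(q)          # final half step for momentum
    return q, p

def hmc_step(q, U, grad_U, step, n_steps, rng):
    """One HMC proposal plus accept/reject; returns (state, accept_prob)."""
    p0 = rng.gauss(0.0, 1.0)             # random momentum to start
    q_new, p_new = leapfrog(q, p0, grad_U, step, n_steps)
    h_old = U(q) + 0.5 * p0 * p0         # total energy before
    h_new = U(q_new) + 0.5 * p_new * p_new
    accept_prob = min(1.0, math.exp(h_old - h_new))
    if rng.random() < accept_prob:
        return q_new, accept_prob
    return q, accept_prob

def U(q):
    return 0.5 * q * q                   # -log density of a standard normal

def grad_U(q):
    return q

rng = random.Random(0)
q, probs = 0.0, []
for _ in range(500):
    q, a = hmc_step(q, U, grad_U, step=0.1, n_steps=20, rng=rng)
    probs.append(a)
mean_accept = sum(probs) / len(probs)
```

<p>With this small step size the energy is nearly conserved, so <code>mean_accept</code> stays close to 1 even though each proposal travels a long way. Increasing <code>step</code> makes the integration cruder.</p>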
<p>Vu had been noodling around with phylogenetic HMC when we heard that Arman Bilge (at that time in Auckland) had an implementation as well.
These implementations not only moved through branch length space according to usual HMC, but also moved between topologies.
They did this as follows: once a branch length hits zero, leading to four branches joined at a single node, one can regroup those branches together randomly in another configuration and continue.
(If you are a phylogenetics person, this is a random nearest-neighbor interchange around the zero length branch.)
This randomness, which is rather different than the deterministic paths of classical HMC once a momentum is drawn, is why we call the algorithm Probabilistic Path Hamiltonian Monte Carlo (PPHMC).</p>
<p>The primary challenge in theoretical development is that the PPHMC paths are no longer deterministic.
Thus concepts such as reversibility and volume preservation, which are typical components of correctness proofs for HMC, need to be generalized to probabilistic equivalents.
Vu had to work pretty hard to develop these elements and show that they led to ergodicity.</p>
<p>On the implementation front, Arman was also working hard to build an efficient sampler.
However, the HMC integrator had difficulty going from one tree topology to another without incurring substantial error.
We thrashed around for a while trying to improve things with a “careful” integrator that would find the crossing time and perhaps re-calculate gradients at that time, but proving that such a method would work seemed very hard.</p>
<p>Then, magically, our newest postdoc Cheng Zhang showed up and saved us with a smoothing surrogate function.
This surrogate exchanges the discontinuity in the derivative for discontinuity in the potential energy, but we can deal with that using a “refraction” method introduced by Afshar and Domke in 2015.
This approach allows us to maintain a low error, and thus make very long trajectories with a high acceptance rate.</p>
<p>I’m happy to announce that our <a href="https://arxiv.org/abs/1702.07814">manuscript</a> has been accepted to the 2017 International Conference on Machine Learning.
Practically speaking, this work is definitely a proof of concept.
We have taken an algorithm that was previously only defined for smooth spaces and extended it to orthant complexes, which are basically Euclidean spaces with boundary glued along those boundaries in intricate ways.
Our implementation is not fully optimized, but even if it was I’m not sure that it would out-compete good old MCMC for phylogenetics without some additional tricks.</p>
<p>To know if this flavor of sampler is going to be useful, we really need to better understand what I call the <em>local to global</em> question in phylogenetics.
That is, to what extent does local information tell us about where to modify the tree?
This is straightforward for posteriors on Euclidean spaces: the gradient points us towards a maximum.
But for trees, does the gradient (local information) tell us anything about what parts of the tree should be modified (global information)?
We’ll be thinking about this a lot in the coming months!</p>
Smart proposals win for online phylogenetics using sequential Monte Carlo.
http://matsen.fredhutch.org/general/2017/06/07/sts.html
Wed, 07 Jun 2017 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2017/06/07/sts.html<p><img src="https://matsen.fredhutch.org/images/sts-proposal.png" width="350" class="pull-right" /></p>
<p>Sometimes projects take years to bear fruit.</p>
<p>As I described <a href="/general/2016/10/26/smc-theory.html">previously</a>, <a href="http://darlinglab.org/">Aaron Darling</a> and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for <em>online</em> Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences.
In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting.
The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”.
These two projects have just recently resulted in intertwined submitted papers.</p>
<p>The SMC implementation, which we called <code>sts</code> for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it.
One of the things that kept us from wrapping it up was problems with <em>particle degeneracy</em>, which is as follows.
I think of SMC as a probabilistically correct version of evolutionary computation, in which trees get to reproduce if they have a high posterior.
Every time we have a new sequence, we attach it to our existing population of trees via some proposal distribution suggesting where to add it.
Particle degeneracy, then, means that you obtain a low diversity population due to a few trees “taking over” the population because they are significantly better than the rest.</p>
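<p>A common way to put a number on particle degeneracy is the effective sample size (ESS) of the particle weights. The sketch below uses the usual formula, the squared sum of weights divided by the sum of squared weights. The function name and example weights are illustrative assumptions, not taken from <code>sts</code>.</p>

```python
def effective_sample_size(weights):
    """ESS = (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

# 100 equally weighted particles: every particle counts.
even = effective_sample_size([1.0] * 100)

# One particle far better than the other 99: it "takes over".
degenerate = effective_sample_size([1000.0] + [1.0] * 99)
```

<p>Here <code>even</code> is 100, while <code>degenerate</code> is only a little above 1, even though both populations contain 100 particles.</p>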
<p>In working with <code>sts</code>, we found that we could mitigate the effect of particle degeneracy by developing “smart” proposals that use the data and corresponding likelihood function to decide where to try next.
This is in contrast to previous work by <a href="http://dx.doi.org/10.1093/sysbio/syr131">Bouchard-Côté et al. 2012</a>, which shows that for a different formulation of SMC based on subtree merging one does better with simple proposals (compare <a href="http://papers.nips.cc/paper/3266-bayesian-agglomerative-clustering-with-coalescents">Teh et al. 2008</a>).
Our proposals have to decide to what edge to attach, where along the edge to attach, and how long of a branch length to use for the attachment (see picture above).</p>
<p>Although the group put in hard work, there was a lot more needed to put all this together and actually get a working sampler.
Luckily, Aaron recruited a super sharp and motivated postdoc in <a href="https://au.linkedin.com/in/mfourment">Mathieu Fourment</a>.
Mathieu tried a lot of variants of proposal distributions, showed that “heated” proposal distributions were effective for picking attachment branches, and added a parsimony-based proposal.
He also showed that a high effective sample size is necessary but not sufficient to have a good SMC posterior sample, and that we can develop a time-competitive sampler compared to running MrBayes again and again.
That work is now up <a href="http://biorxiv.org/content/early/2017/06/02/145219">on bioRxiv</a>.</p>
<p>Smart proposals need to be designed carefully.
One of the bottlenecks was proposing the new pendant branch length, and for that the lcfit surrogate function (the subject of Connor’s talk at Evolution 2013) worked well.
This spurred us to finish and write up the lcfit work, despite the fact that <a href="http://dx.doi.org/10.1093/sysbio/syv051">Aberer et al. 2015</a> wrote a paper in which they describe how to use standard probability distribution functions as surrogate functions for branch length proposals.
This took a bit of wind out of our sails, but our purpose-built lcfit surrogate still has some interesting advantages over common functions, such as that it has the right “shape” for likelihood curves, even when branch lengths become long.
We were surprised to find that it does well for complex settings with heterogeneous models.
On a practical level, it’s implemented as a stand-alone C library so it can be easily incorporated into other programs, in contrast to the work of Aberer and co, which is tightly integrated into ExaBayes.
In the end we have turned it into a short paper with a long appendix which is now <a href="https://arxiv.org/abs/1706.00659">up on arXiv</a>.
Brian Claywell is the hero of this lcfit work– he was persistent in developing creative strategies to fit the surrogate function in a variety of settings.</p>
<p>There is a lot to be done for online Bayesian phylogenetics still.
We don’t even try to sample model parameters, and haven’t tried making BEASTly rooted trees.
There are many opportunities for optimization, some of which are described in the paper.
For the parallelism nerds out there, I also note the existence of the <a href="https://arxiv.org/abs/1407.2864">particle cascade</a>.</p>
<p>I certainly hope that this contribution guides future developers towards useful online Bayesian phylogenetics samplers: it would be great to be able to keep trees up to date as the genomes keep rolling in from projects such as <a href="http://www.zibraproject.org/">ZIBRA</a>.
If you are interested in more details, get in touch!</p>
Incorporating new sequences into posterior distributions using SMC
http://matsen.fredhutch.org/general/2016/10/26/smc-theory.html
Wed, 26 Oct 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/10/26/smc-theory.html<p><img src="https://matsen.fredhutch.org/images/smc.png" width="350" class="pull-right" />
The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular.
In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.</p>
<p>The bummer about Bayesian computation?
It takes so darn long for those chains to converge.
And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics?
<em>If you get a new sequence you have to start over from scratch.</em></p>
<p>I’ve been thinking about this for several years with <a href="http://darlinglab.org/">Aaron Darling</a>, and in particular about Sequential Monte Carlo (SMC) for this application.
SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection.
You can think about it as a probabilistically correct genetic algorithm– one that is guaranteed to sample the correct posterior distribution given an infinite number of particles.</p>
<p>Although <a href="http://sysbio.oxfordjournals.org/content/61/4/579">SMC has been applied before in phylogenetics</a>, it has not been used in an “online” setting to update posterior distributions as new sequences appear.
Aaron and I worked up a proof-of-concept implementation of SMC with Connor McCoy, but when Connor left for Google the implementation lost some steam.</p>
<p>However, when Vu arrived I was still curious about the theory behind our little SMC implementation.
Such SMC algorithms raise some really interesting mathematical questions concerning how the phylogenetic likelihood surface changes as new sequences are added to the data set.
If it changes radically, then the prospects for SMC are dim.
We call this the <em>subtree optimality question</em>: do high-likelihood trees on <em>n</em> taxa arise from attaching branches to high-likelihood trees on <em>n-1</em> taxa?
Years ago I considered a <a href="http://dx.doi.org/10.1007/s11538-010-9556-x">similar question with Angie Cueto</a> but that was for the distance-based objective function behind neighbor-joining, and others have thought about it from the empirical side under the guise of taxon sampling.</p>
<p>As described in an <a href="https://arxiv.org/abs/1610.08148">arXiv preprint</a>, Vu developed a theory with some real surprises!
First, an induction-y proof leads to consistency: as the number of particles goes to infinity, we maintain a correct posterior at every stage.
Then he directly took on the subtree optimality problem by writing out the relevant likelihoods and using bounds.
(We think that this is the first theoretical result for likelihood on this question.)</p>
<p>Then the big win: using these bounds on the ratio of likelihoods for the parent and child particles, he was able to show that the effective sample size (ESS) of the sampler is bounded below by a constant multiple of the number of particles.
This is pretty neat for two reasons: first, it’s good to know that we are getting a better posterior estimate as we increase our computational effort, and it’s nice that the ESS goes up linearly with this effort.
Second, this constant doesn’t depend on the size of the tree, so this bodes well for building big trees by incrementally adding taxa.</p>
<p>Of course, with this sort of theory we can’t get a reasonable estimate on the size of this key constant, and indeed it could be uselessly small.
However, I’m still encouraged by these results, and the paper points to some interesting directions.
For example, because SMC is continually maintaining an estimate of the posterior distribution, one can mix in MCMC moves in ways that otherwise would violate detailed balance, such as using an MCMC transition kernel that focuses effort around the newly added edge.
In this way we might use an “SMC” algorithm with relatively few particles which in practice resembles a principled highly parallel MCMC.
On the other hand we might use <a href="http://jmlr.org/proceedings/papers/v32/jun14.pdf">clever tricks</a> to scale the SMC component up to zillions of particles.</p>
<p>All this strengthens my enthusiasm for continuing this work.
Luckily, Aaron has recruited Mathieu Fourment to work on getting a useful implementation, and every day we are getting good news about his improvements.
So stay tuned!</p>
Summer high school and undergraduate students 2016
http://matsen.fredhutch.org/general/2016/07/22/summer-scholars.html
Fri, 22 Jul 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/07/22/summer-scholars.html<p><img src="https://matsen.fredhutch.org/images/summer-students-2016.jpg" width="385" class="pull-right" />
I definitely didn’t set out to have three high school and two undergrad students this summer.</p>
<p>But they’re fantastic, and making real contributions to our scientific work!
Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s <a href="https://github.com/psathyrella/partis/">partis</a> software.
Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages.
Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff.
All of them are learning about science and coding.</p>
<p>These students rock, and I can’t wait to see what great things they bring into the world with their talent.</p>
Analysis of a slightly gentler discretization of time-trees
http://matsen.fredhutch.org/general/2016/07/11/discrete-time-tree.html
Mon, 11 Jul 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/07/11/discrete-time-tree.html<p><img src="/images/NNI_VS_rNNI.png" width="395" class="pull-right" /></p>
<p>Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference.
As such, we and others think a lot about how algorithms move between topologies, typically formalizing this information as a path through a graph with tree topologies as vertices and moves from one tree to another as edges.
Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.</p>
<p>But what about time-trees?
Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time.
These are absolute times of the sort one could put on a calendar.
To first approximation, working with time-trees is synonymous with running <a href="http://beast.bio.ed.ac.uk/">BEAST</a>, and we are indebted to the BEAST authors for making time-trees a central object of study in phylogenetics.
Such timing information is essential for studying demographic models, and in many respects time-trees are generally the “correct” tree to use because real evolution does happen in a rooted fashion and events do have times associated with them.</p>
<p>Because timing is so central to the time-tree definition, and because it’s been shown that defining
<a href="http://dx.doi.org/10.1109/BIBE.2008.4696663">the discrete component of time-tree moves with reference to timing information</a>
works well, perhaps we shouldn’t be so coarse when we discretize time-trees.
Rather than throw away all timing information, an alternative is to discretize calendar time and have at most one event per discretized time.
If we have the same number of times as we have events on the tree, this is equivalent to just giving a ranking, or total order, on the events in the tree which is compatible with the tree structure.
One can be a little less coarse still by allowing branch lengths to take on integer values.</p>
<p>The goal of our <a href="http://biorxiv.org/content/early/2016/07/12/063362">most recent preprint</a> is to analyze the space of these discretized time-trees.
The study was led by Alex Gavryushkin while he was at the Centre for Computational Evolution in Auckland.
Chris Whidden and I contributed bits here and there.
As for classical discretized trees, we build a graph with the discretized time-trees as vertices and a basic set of moves on these trees as edges.
The goal is to understand basic properties of this graph, such as shortest paths and neighborhood sizes.</p>
<p>Adding timing information makes a substantial difference in the graph structure.
The simplest example of this is depicted above, which compares the usual nearest-neighbor interchange (NNI) with its ranked tree equivalent (RNNI).
RNNI only allows interchange of nodes when those nodes have adjacent ranks.
The picture shows an essential difference between shortest paths for NNI and RNNI: if one would like to move the attachment point of two leaves a good ways up the tree, under NNI it requires fewer moves to first bundle the leaves into a two-taxon subtree, move that subtree up the tree, then break apart the subtree.
On the other hand, for RNNI, a shortest RNNI graph path simply moves the attachment points individually up the tree.
This is an important difference: for example, the <a href="https://www.researchgate.net/publication/2643042_On_Computing_the_Nearest_Neighbor_Interchange_Distance">computational hardness proof of NNI</a> hinges on the bundling strategy resulting in shorter paths.</p>
<p>The most significant advance of the paper is the application of techniques from a paper by <a href="http://dx.doi.org/10.1137/0405034">Sleator, Tarjan, and Thurston</a> to bound the number of trees in arbitrary diameter neighborhoods.
The idea is to develop a “grammar” of transformations such that every tree in a neighborhood with a given radius <em>k</em> can be written as a word of length <em>k</em> in the grammar.
Then, the number of trees in the neighborhood is bounded above by the number of letters to the power of the word length.
Further refinements lead to some interesting bounds.
In an interesting twist, these neighborhood size bounds provide a generalized version of a counter-example like that shown in the figure, which shows in more generality that the arguments for the computational hardness proof of NNI do not hold.</p>
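<p>The word-counting bound is easy to try out on a toy example. The sketch below does not use trees at all: as a stand-in (our assumption, chosen only because it is simple) it takes permutations of <code>n</code> items, with swaps of adjacent entries as the moves. Each swap is a letter, and one extra "do nothing" letter pads shorter words to length <code>k</code>, so the neighborhood of radius <code>k</code> can have at most <code>n ** k</code> elements.</p>

```python
def ball_size(n, k):
    """Count permutations of range(n) within k adjacent swaps of the identity."""
    start = tuple(range(n))
    seen = {start}
    frontier = [start]
    for _ in range(k):                      # breadth-first search, k layers
        next_frontier = []
        for perm in frontier:
            for i in range(n - 1):
                swapped = list(perm)
                swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
                swapped = tuple(swapped)
                if swapped not in seen:
                    seen.add(swapped)
                    next_frontier.append(swapped)
        frontier = next_frontier
    return len(seen)

n, k = 5, 3
letters = (n - 1) + 1     # n - 1 swaps plus one padding letter
size = ball_size(n, k)    # exact neighborhood size by enumeration
bound = letters ** k      # (number of letters) ** (word length)
```

<p>The exact count is well below the bound because many different words lead to the same element. Refinements of the grammar tighten bounds of this kind by removing such redundancy.</p>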
<p>Some very nice work from Alex.
There’s going to be more to this story– stay tuned!</p>
A time-optimal algorithm to build the SPR subgraph on a set of trees
http://matsen.fredhutch.org/general/2016/06/30/sprgraphs.html
Thu, 30 Jun 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/06/30/sprgraphs.html<p><img src="/images/sprgraphs.png" width="385" class="pull-right" /></p>
<p>We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods.
The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another.
Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree.
The work to understand this graph has been led by Chris Whidden, including <a href="/general/2014/05/13/sprmix.html">learning about how the graph structure influences Bayesian phylogenetic inference</a> and <a href="/general/2015/04/02/rspr-curvature-nodata.html">learning about the overall structure</a> of the graph.</p>
<p>These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes.
In the course of our work we have been getting progressively better at constructing this graph efficiently.
In our <a href="http://arxiv.org/abs/1606.08893">latest work</a> we develop a time-optimal algorithm.</p>
<p>Chris’ insight driving this new algorithm is that we shouldn’t be focusing on the trees and checking for pairs of adjacencies, but should rather shift focus to enumerating the potential adjacencies themselves.
These adjacencies can be formalized as structures called <em>agreement forests</em>, which in this case have two components.
If one is clever, and Chris is very clever, you can quickly store these forests and recognize if you’ve seen them before.
The strategy then is to move through the trees in an arbitrary sequential order, storing all of the potential adjacencies of each tree.
If a given tree returns the same adjacency as a previous tree, then connect the trees in the graph.</p>
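<p>The "enumerate potential adjacencies and look for repeats" strategy can be sketched without any tree machinery. In the toy below, strings stand in for trees and "the string with one position masked" stands in for a two-component agreement forest. Both are assumptions for illustration, as are the function names, and a plain dictionary replaces the trie used in the paper. Two equal-length strings differ at exactly one position precisely when they share a masked key.</p>

```python
from collections import defaultdict

def build_graph(items, adjacency_keys):
    """Connect items that share at least one potential-adjacency key."""
    seen = defaultdict(list)   # key -> items that have produced it so far
    edges = set()
    for item in items:         # arbitrary sequential order
        for key in adjacency_keys(item):
            for other in seen[key]:
                if other != item:
                    edges.add(frozenset((other, item)))
            seen[key].append(item)
    return edges

def one_position_masked(s):
    """Toy keys: (position, string with that position removed)."""
    return [(i, s[:i] + s[i + 1:]) for i in range(len(s))]

edges = build_graph(["AAA", "AAB", "ABB", "BBB"], one_position_masked)
```

<p>No pair of items is ever compared directly. The work is driven by the number of keys each item generates and the number of edges actually found, which is the shift in focus described above.</p>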
<p>Although for this paper we obtained an asymptotically time-optimal algorithm, there is still interesting work to be done in order to get a fast implementation.
For example, we could be more thoughtful about exactly how the forests get serialized, which should lead to a faster look up in the central <a href="https://en.wikipedia.org/wiki/Trie">trie</a>.
But not having done any coding at all, much less profiling, we don’t know where the bottlenecks will lie.
This paper was in part motivated by a fun <a href="http://phylobabble.org/t/how-to-recognize-a-rearranged-tree/599">discussion on phylobabble</a>.</p>
Consistency and convergence rate of phylogenetic inference via regularization
http://matsen.fredhutch.org/general/2016/06/05/regularization.html
Sun, 05 Jun 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/06/05/regularization.html<p><img src="https://matsen.fredhutch.org/images/treefix.png" width="470" class="pull-right" /></p>
<p>How frequently are genes <a href="https://en.wikipedia.org/wiki/Horizontal_gene_transfer">transferred horizontally</a>?
A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places.
For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.</p>
<p>Using this approach requires that we have accurate trees for the genes.
That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations which give signal for tree building.
Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.</p>
<p>A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each gene individually, figuring that this is the best we can do with our incomplete data.
However, noisy estimates with our sparse data will definitely bias our estimates of the impact of horizontal gene transfer (HGT) upwards.
That is, if we get trees that are inaccurate in lots of random ways, it’s going to look like a lot of HGT.</p>
<p>We can do better by adding extra information into the inferential problem.
For example, in this case we know that gene evolution is linked with species evolution.
Thus in the absence of evidence to the contrary, it makes sense to assume that the gene tree follows a species tree.
From a statistical perspective, this motivates a <a href="https://en.wikipedia.org/wiki/Shrinkage_estimator">shrinkage estimator</a>, which combines other information with the raw data in order to obtain an estimator with better properties than estimation using the data alone.</p>
<p>One way of doing this is to take a Bayesian approach to the full estimation problem, which might involve priors on gene trees that pull them towards the species tree using a generative model; this approach has been elegantly implemented in phylogenetics by programs such as <a href="http://genome.cshlp.org/content/23/2/323.short">PHYLDOG</a>, <a href="http://mbe.oxfordjournals.org/content/28/1/273.full">SPIMAP</a> and <a href="http://dx.doi.org/10.1093/molbev/msp274">*BEAST</a>.
These programs are principled yet somewhat computationally expensive.</p>
<p>Another direction involves taking some sort of distance between the gene and species tree, and working to trade off a good value of the phylogenetic likelihood versus a good (small) value for this distance.
This distance-based approach works surprisingly well!
A <a href="http://dx.doi.org/10.1093/sysbio/sys076">2013 paper by Wu et al.</a> proposed a method called TreeFix, which they showed performed almost as well as full SPIMAP inference, <em>even in simulations under the SPIMAP generative model!</em>
The cartoon above is from their paper, and illustrates that it makes sense to trade off some likelihood (height) for a lower reconciliation cost (lighter color).</p>
<p>This definitely got my attention, and made me wonder if one could develop relevant theory, as the theoretical justification for such an approach doesn’t follow “automatically” like it does for a procedure doing inference under a full probabilistic model.
Justification also doesn’t follow from the usual statistical theory for regularized estimators, because trees aren’t your typical statistical objects.
Then, one day <a href="http://vucdinh.github.io/">Vu</a>’s high-school and college friend <a href="https://sites.google.com/site/lamho86/">Lam Si Tung Ho</a> was visiting, and I suggested this problem to them.
<strong>They crushed it.</strong>
What resulted went far beyond what I originally imagined: a manuscript that not only provides a solid theoretical basis for penalized likelihood approaches in phylogenetics, but also develops many useful techniques for theoretical phylogenetics.
We have just put the manuscript <a href="http://arxiv.org/abs/1606.03059">up on arXiv</a>; it develops a likelihood estimator that is regularized in terms of the <a href="http://comet.lehman.cuny.edu/stjohn/research/treespaceReview.pdf">Billera-Holmes-Vogtmann</a> (BHV) distance to a species tree.</p>
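<p>To make the trade-off concrete, here is a toy penalized objective in Python. It is only a sketch of the general idea: the candidate trees, their log-likelihoods, their distances to the species tree, and the weight <code>lam</code> are all invented numbers, and the estimator in the manuscript works over continuous tree space with the BHV distance rather than over a short list of candidates.</p>

```python
def penalized_score(log_lik, distance, lam):
    """Log-likelihood minus a penalty for being far from the species tree."""
    return log_lik - lam * distance


def best_tree(candidates, lam):
    """Name of the candidate with the highest penalized score."""
    winner = max(
        candidates,
        key=lambda c: penalized_score(c["log_lik"], c["dist"], lam),
    )
    return winner["name"]


# Hypothetical candidates: all values are made up for illustration.
candidates = [
    {"name": "ml_tree", "log_lik": -1000.0, "dist": 4.0},
    {"name": "compromise", "log_lik": -1001.5, "dist": 1.0},
    {"name": "species_tree", "log_lik": -1010.0, "dist": 0.0},
]

choices = {lam: best_tree(candidates, lam) for lam in (0.0, 1.0, 10.0)}
```

<p>With no penalty the maximum-likelihood tree wins, with a moderate penalty the compromise tree wins, and with a heavy penalty the estimate collapses onto the species tree. Choosing how the penalty weight should scale with sequence length is exactly the kind of question the theory addresses.</p>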
<p>First, the main results.
The regularized estimator is “adaptive fast converging,” meaning that it can correctly reconstruct all edges of length greater than any given threshold from gene sequences of polynomial length.
Perhaps more remarkable, though, is that Vu and Lam have explicit bounds on the convergence of the estimator simultaneously in terms of both branch length and topology (via the BHV distance).
This goes beyond the standard theoretical phylogenetics framework of “did we get the right topology or not”.
Surprisingly, the theory and bounds all work even if the species tree estimate is distant from the true gene tree, though of course one gets tighter bounds if it is close to the true gene tree.</p>
<p>Second, the new theoretical tools.</p>
<ul>
<li>a uniform (i.e. not depending on the tree) bound on how far the likelihood of a collection of sequences generated from a model deviates from its expected value</li>
<li>an upper bound on the BHV distance between two trees based on the Kullback-Leibler divergence between their expected per-site likelihood functions</li>
<li>analysis of the asymptotics of the regularization term both close to and far from the species tree.</li>
</ul>
<p>Of course, I don’t think that biologists will plug in their desired error into our bounds and just sequence an amount of DNA required to achieve that level of error.
That’s absurd.
What I hope is that this paper will add a theoretical aspect to the body of evidence that regularization is a principled method for phylogenetic estimation, and help convince phylogenetic practitioners that raw phylogenetic estimates are inherently limited.
We have all sorts of additional information these days that we can use for phylogenetic inference, so let’s use it!
Just ask your local statistician: that community has been impressed by the <a href="https://en.wikipedia.org/wiki/James%E2%80%93Stein_estimator">surprising effectiveness of regularization</a> since the 50’s, and regularization in various forms has become a mainstay of modern statistical inference.</p>
Postdoctoral position to develop next-generation Bayesian phylogenetic methods
http://matsen.fredhutch.org/general/2016/04/18/altphylo-postdoc.html
Mon, 18 Apr 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/04/18/altphylo-postdoc.html<p><img src="/images/altphlo-random-starts.png" width="260" class="pull-right" />
<em>Although we have recently made a hire in this area, we continue to look for strong junior scientists to work on this and related projects.</em></p>
<p>There is a lot more sequence data than in the early 2000’s, but inferential algorithms for Bayesian phylogenetic inference haven’t changed much since that time.
There have definitely been advances, such as more clever proposal distributions, swapping out heated chains, and GPU-enabled likelihood calculations, but the core remains the same: propose a new state via a small branch length and/or tree structure perturbation, and accept or reject according to the <a href="https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm#Formal_derivation_of_the_Metropolis-Hastings_algorithm">Metropolis choice</a>.
Furthermore, the community has packed more and more complexity into priors and models, which would lead to a computational bottleneck even with a fixed number of sequences.</p>
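<p>For concreteness, the propose-then-accept-or-reject core described above looks roughly like this in Python. The sketch is deliberately tiny and makes several assumptions of its own: the "state" is a single branch length between two sequences, the likelihood is Jukes-Cantor with made-up counts (100 sites, 10 differences), the prior is exponential, and the proposal is a symmetric uniform nudge. Real packages also perturb tree structure and use far more elaborate proposals.</p>

```python
import math
import random


def log_posterior(t, n_sites=100, n_diffs=10, prior_rate=10.0):
    """Unnormalized log posterior of branch length t (toy Jukes-Cantor model)."""
    if t <= 0:
        return -math.inf
    # Probability that a site differs between the two sequences under JC69.
    p = -0.75 * math.expm1(-4.0 * t / 3.0)
    log_lik = n_diffs * math.log(p) + (n_sites - n_diffs) * math.log(1.0 - p)
    return log_lik - prior_rate * t  # exponential prior, up to a constant


def metropolis(n_steps, start=0.5, step=0.05, seed=0):
    """Random-walk Metropolis sampler for the branch length."""
    rng = random.Random(seed)
    t, lp = start, log_posterior(start)
    samples = []
    for _ in range(n_steps):
        proposal = t + rng.uniform(-step, step)  # small symmetric perturbation
        lp_new = log_posterior(proposal)
        # Metropolis choice: accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            t, lp = proposal, lp_new
        samples.append(t)
    return samples


samples = metropolis(20000)
kept = samples[len(samples) // 2:]  # discard the first half as burn-in
posterior_mean = sum(kept) / len(kept)
```

<p>Every step costs a full likelihood evaluation and moves the state only a little, which is why this scheme struggles as data sets and models grow.</p>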
<p>It’s time to improve inferential algorithms.
Part of my inspiration lies in watching the revolution that has occurred in computational statistics in the last decade, in which the menu has greatly expanded beyond MCMC to include sequential Monte Carlo (SMC), Hamiltonian Monte Carlo (HMC), and variational inference, to name a few.
These algorithms are making qualitative improvements in the scale of data that can be incorporated into analysis, and come from new mathematical foundations with provable statistical properties.
The other part of the inspiration, well, just comes from being annoyed that we have this fairly hard limit of 1,000 sequences for Bayesian phylogenetic inference.
So, <a href="http://www.stat.washington.edu/vminin/">Vladimir Minin</a> and I teamed up on a successful grant application to develop new inferential methods.</p>
<p>We thus have an open postdoc position to work on fundamentally new methods for phylogenetic inference.
If you join us, you’ll be able to work with a truly marvelous team.
Vladimir is an excellent collaborator with a deep knowledge of phylogenetics and statistics.
<a href="http://www.armanbilge.com/research/">Arman Bilge</a>, with whom Vu and I are already developing phylogenetic HMC, is joining us for his PhD.
<a href="https://scholar.google.com/citations?user=itc4x9kAAAAJ&hl=en">Chris Whidden</a> is a world expert in the discrete structure of tree space.
<a href="http://vucdinh.github.io/">Vu Dinh</a> is greatly expanding our knowledge of the phylogenetic likelihood function and is using that knowledge to develop new inferential methods.</p>
<p>To round out this team, we need someone with serious programming chops who is motivated to build new mathematically-founded methods.
We really want to make a difference for end users of phylogenetics, so we can’t only be proving theorems.
The focus will be on building solid prototype implementations and then working with the very clever RevBayes and BEAST2 developers for more general distribution.
We’ll also be collaborating with <a href="http://darlinglab.org/">Aaron Darling</a>’s group on online phylogenetic SMC.
<em>[Doesn’t that sound fun? It does to me.]</em></p>
<p>The position will come with a competitive postdoc-level salary with great benefits for two years, with the possibility of extension.
Fred Hutchinson Cancer Research Center, home of about 190 faculty including three Nobel laureates, is an independent, nonprofit research institution dedicated to the development and advancement of biomedical research.
The environment is lively yet casual, with a strong emphasis on collaborative work.
The Center is housed in a lovely campus next to Lake Union a short walk from downtown, and a slightly longer walk from the University of Washington.
Powerful computing resources and a helpful IT staff await.</p>
<p>If you are interested in this position, please send Erick some representative publications, code samples, and a CV.</p>
Likelihood-based clustering of B cell clonal families
http://matsen.fredhutch.org/general/2016/04/16/partis-clustering.html
Sat, 16 Apr 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/04/16/partis-clustering.html<p><img src="/images/bcell-mess.png" width="350" class="pull-right" />
Antibodies are encoded by B cell receptor (BCR) sequences, which (simplifying somewhat) arise via a two-stage process.
The first stage is a random recombination process creating so-called naive B cells, which are the common ancestors in the trees to the right.
The second is initiated when such a naive cell (perhaps weakly) binds an antigen, and consists of a mutation and selection process to improve the binding of the BCR to the antigen.
The history of this process can be thought of as a phylogenetic tree descending from these naive common ancestors.
One can sequence the B cell receptors resulting from these processes in high throughput, which form an implicit record of these complex processes in a single individual.</p>
<p>The situation from a phylogenetic inference perspective is a mess.
Sampled sequences have typically been mutated substantially from their ancestral naive sequence.
Thus, given a pair of sequences sampled from the repertoire, it’s not clear whether they share a common ancestral naive sequence, that is, whether they even belong in the same tree.
Furthermore, in healthy individuals the size of these trees is very small compared to the number of sequences in the total repertoire.
This makes for a difficult, and very interesting, clustering problem.</p>
<p>Duncan Ralph and I have been working on this problem since he arrived several years ago, and I am very happy to announce that we’ve put up a manuscript describing that work on <a href="http://arxiv.org/abs/1603.08127">arXiv</a>.
This paper builds on our previous work on BCR sequence annotation and alignment <a href="/general/2015/03/23/partis-annotation.html">using a hidden Markov model</a>.
Using this HMM, we can define a likelihood for observing a given set of sequences distributed among a given collection of clusters.
This likelihood integrates over possible alternative annotations, which are formalized as paths through the HMM.</p>
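<p>Summing over all paths through an HMM is what the classic forward algorithm does, and a minimal Python version fits in a few lines. The two-state model below is a made-up toy, not the partis HMM: the state names, alphabet, and probabilities are all placeholders chosen so the arithmetic is easy to check by hand.</p>

```python
import math


def forward_log_likelihood(obs, init, trans, emit):
    """Log P(obs) under an HMM, summed over every hidden-state path."""
    states = list(init)
    # alpha[s] = P(observations so far, current hidden state is s)
    alpha = {s: init[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return math.log(sum(alpha.values()))


# Hypothetical two-state model over a two-letter alphabet.
init = {"x": 0.5, "y": 0.5}
trans = {"x": {"x": 0.9, "y": 0.1}, "y": {"x": 0.2, "y": 0.8}}
emit = {"x": {"A": 0.7, "C": 0.3}, "y": {"A": 0.1, "C": 0.9}}

log_lik = forward_log_likelihood("AC", init, trans, emit)
```

<p>For the sequence <code>AC</code> the four possible paths contribute a total probability of 0.165. A production implementation would work in log space or rescale <code>alpha</code> at each step to avoid underflow on long sequences.</p>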
<p>We had this general idea within the first several months of thinking about the problem together.
However, there’s a big difference between writing down an elegant formulation, even one with fast dynamic programming machinery, and actually building a system that scales to data sets of hundreds of thousands or millions of sequences.
This is where Duncan showed a tremendous amount of creativity and persistence by assembling layers of approximations and heuristics on top of this essential idea, and by developing a software package meant for others to use.</p>
<p>The code is available as part of the continuing development of <a href="https://github.com/psathyrella/partis/">partis</a>.
It also includes Duncan’s sophisticated BCR simulation package.</p>
<p>We’re under no illusions that we have “solved” this problem, and there’s still a lot to be done.
However, we believe that the likelihood-based approach in general, and Duncan’s code in particular, represent a substantial advance over current methods, which use single-linkage clustering based on nucleotide edit distance.
If you work with BCR sequences, we hope you’ll give partis a spin and let us know what you think.</p>
http://B-T.CR, a discussion site for immune repertoire analysis
http://matsen.fredhutch.org/general/2016/02/23/btcr.html
Tue, 23 Feb 2016 00:00:00 UTCmatsen@fredhutch.org (Frederick A. Matsen)http://matsen.fredhutch.org/general/2016/02/23/btcr.html<p><img src="http://b-t.cr/uploads/default/original/1X/5d5d816ed777fda7d45634dc788b2046508d1834.png" width="350" class="pull-right" />
In my dream academic utopia, there would be free and open discussion and information sharing.
I love open-access journals and preprint servers, but some communication is better suited for a discussion format.
For example, open problems, information about resources and software, and discussion of standardization all benefit from having a more widely open discussion and a faster update time than journals or even preprint servers can provide.
Social media can host this discussion, but I think that it can be useful to put a dedicated website in place for discussion.</p>
<p>For these reasons, together with <a href="http://www.vet.cam.ac.uk/directory/sdf22@cam.ac.uk">Simon Frost</a> I’ve started a <a href="https://www.discourse.org/">Discourse</a> forum at <a href="http://B-T.CR/">http://B-T.CR/</a> for immune repertoire analysis.
If you aren’t already aware, the specificities of our adaptive immune cells are encoded in the sequences of the B and T cell receptors (BCRs and TCRs, hence the name), and these can now be sequenced in high throughput.
The result is a fascinating but challenging-to-interpret pile of sequences that, if we were smart enough, would tell us a whole lot about health history and prognosis.</p>
<p>This field, at least in its most recent high-throughput incarnation, is quite young and so we have a real opportunity to coordinate research efforts, reduce the pain of finding the right resources, and increase the fun of doing science.
If you are interested in this sort of work, I hope you will join and participate!
You can keep track of what’s happening on the forum by signing up for email notifications, through <a href="http://b-t.cr/latest.rss">RSS</a>, or by following <a href="https://twitter.com/bcr_tcr">@bcr_tcr</a> on Twitter.</p>