In search of a C++ mentor for next-generation viral tracking algorithms

The challenge

In the current nCov-19 outbreak, we have all learned the importance of epidemiology. Epidemiology is currently undergoing a revolution thanks to cheap and easy access to viral genetic sequences, available in real time as the epidemic unfolds. These sequences help us understand viral spread because viruses mutate as they are passed from individual to individual, and by using shared mutations we can infer transmission history.

Such detective work using viral genetic sequences is beautifully expressed on the nextstrain.org platform. This platform allows people to understand viral transmission between geographic areas using the relationship of the viral genomes to one another. Specifically, it represents these relationships in terms of a phylogenetic tree, analogous to a “tree of life” but for a single viral outbreak.

Although nextstrain is wonderful, and perfect for some audiences, it has some limitations. These all derive from the fact that it infers a single tree structure, when in fact each tree structure is a hypothesis with a certain level of support that we can obtain with our finite data. The right way to proceed is called Bayesian phylogenetics, in which we infer a so-called posterior distribution on trees: the ensemble of trees that can credibly explain the data, along with a probability that each one is correct.
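To make "posterior distribution on trees" concrete, here is the standard Bayes' rule statement in simplified form. The symbols are my own shorthand for this sketch: D is the sequence data and τ ranges over tree structures. It leaves out branch lengths and other model parameters, which a real analysis also has to account for.

```latex
P(\tau \mid D) \;=\; \frac{P(D \mid \tau)\, P(\tau)}{\sum_{\tau'} P(D \mid \tau')\, P(\tau')}
```

The sum in the denominator runs over every possible tree structure, and the number of those grows faster than exponentially with the number of sequences. That is why the posterior has to be approximated rather than computed directly.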

This lack of a probabilistic foundation for phylogenetic inferences on nextstrain has real consequences. For example, in this Twitter thread Trevor Bedford (one of the co-founders of nextstrain and a dear colleague) does a back-of-the-envelope calculation to evaluate whether two strains of nCov-19 represent one or two introductions into Washington. If, instead, we were using Bayesian phylogenetics, we could evaluate that probability rigorously, using all sources of information.

The price for the rigor and flexibility of Bayesian phylogenetics is speed. The current (non-Bayesian) pipeline for processing sequence data and building nCov-19 trees on nextstrain (as of late March 2020) takes one hour. Current methods for Bayesian phylogenetics would take several weeks on the same data. So we need a new method if we are going to make Bayesian inferences actionable.

Our project

My research group has been working diligently for 5 years on developing categorically faster methods for Bayesian phylogenetics. Our best new inferential framework uses a technique called variational Bayes. If you want to learn more about it, read on my blog here and here. We are implementing these algorithms in a C++17 library with a Python interface, located at https://github.com/phylovi/libsbn. This library aims to enable integration with TensorFlow for flexible modeling.
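For readers who have not met variational Bayes, the general idea (in its textbook form, not the specifics of our parameterization) is to pick a family of tractable distributions q with parameters φ and optimize φ to maximize a lower bound on the log marginal likelihood. The notation is an assumption of this sketch: D is the data and τ stands for a tree.

```latex
\log P(D) \;\ge\; \mathbb{E}_{q_\phi(\tau)}\!\left[ \log P(D, \tau) - \log q_\phi(\tau) \right]
```

Maximizing the right-hand side over φ is equivalent to minimizing the Kullback-Leibler divergence from q to the true posterior. This turns inference into an optimization problem rather than a long sampling run.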

I am a mathematician by background, not a C++ programmer, but I do like programming and have done my best to learn best practices and apply them in this project. Although it has some warts, I am proud of the code and am always looking for ways to make it better.

The ask

I am wondering if any C++ experts would be interested in being volunteer mentors for us. We have designed efficient algorithms, but would appreciate advice on implementation. For example, we are currently struggling with page faults on a big mmap-ped block of memory. We’d also love to have input on best-practice code structure and tooling. If you are familiar with TensorFlow, even better.
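To give a flavor of the kind of question we have, here is a minimal sketch of one generic way to deal with page faults on a large mapping: touch every page up front so the faults happen before the hot loop, and hint the expected access pattern to the kernel. This is not code from libsbn, and I am not claiming it fixes our problem. The class name `MmappedBlock` and its methods are hypothetical, and it assumes a POSIX system with an anonymous private mapping.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical RAII wrapper (not part of libsbn) around an anonymous,
// private mmap-ped block of doubles. POSIX only.
class MmappedBlock {
 public:
  explicit MmappedBlock(std::size_t count)
      : count_(count), byte_count_(count * sizeof(double)) {
    void* p = mmap(nullptr, byte_count_, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
    data_ = static_cast<double*>(p);
  }
  ~MmappedBlock() { munmap(data_, byte_count_); }

  // The mapping has a single owner, so copying is disabled.
  MmappedBlock(const MmappedBlock&) = delete;
  MmappedBlock& operator=(const MmappedBlock&) = delete;

  double* data() { return data_; }
  std::size_t size() const { return count_; }

  // Read and write back one byte per page. The write forces the kernel to
  // back each page with real memory now, and writing back the value just
  // read leaves the contents unchanged.
  void Prefault() {
    const long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    volatile char* bytes = reinterpret_cast<volatile char*>(data_);
    for (std::size_t offset = 0; offset < byte_count_;
         offset += static_cast<std::size_t>(page_size)) {
      bytes[offset] = bytes[offset];
    }
  }

  // Tell the kernel we expect to sweep through the block in order.
  // Returns true if the hint was accepted; it is advisory only.
  bool AdviseSequential() {
    return madvise(data_, byte_count_, MADV_SEQUENTIAL) == 0;
  }

 private:
  std::size_t count_;
  std::size_t byte_count_;
  double* data_ = nullptr;
};
```

A caller would construct the block once, call `Prefault()` before the timing-sensitive section, and then work through `data()` as an ordinary array. Whether prefaulting, a different `madvise` hint, or huge pages is the right choice depends on the access pattern, which is exactly the sort of thing we would like advice on.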

We are hoping to have occasional consults, although more contributions and PRs would of course be welcome too.

The disclaimer

I want to make it clear that our code will not be ready to guide epidemiological choices in the current outbreak. My goal is for it to be ready to deploy in such a setting in 5 years. There is no guarantee that it will be the best means of doing rapid Bayesian phylogenetic analysis. However, two of the three primary architects of the BEAST program (the definitive package for Bayesian inference in viruses: see, e.g., this dissection of Ebolavirus spread) are working with us on variational inference.

I do not want to “profiteer” on people’s current interest in viruses because of COVID. If you can figure out a way to directly contribute to controlling the current outbreak, please do so. On the other hand, we need rapid Bayesian phylogenetics for viral tracking, and I believe that variational inference is the way to do it. If the current epidemic makes you want to contribute to what could become the foundation for next-generation viral tracking, we would welcome some help.

Drop me a line if you’re interested.