Edge principal components / squash clustering paper up on arXiv

27 Jul 2011, by Erick

Dear Microbial Ecology community:

Have you ever been annoyed with how the axes of principal components plots don’t come with a clear interpretation? Ever wondered why a hierarchical clustering algorithm chose to do a specific merge on your data?

Steve Evans and I decided to take on these issues and develop some more transparent ordination and clustering methods for phylogenetic placement data. Our new methods are called edge principal components analysis (edge PCA) and squash clustering. In edge PCA, the principal components are weights on the edges of the “reference” phylogenetic tree; these weights can be visualized as a thickening and coloring of the edges. Thus, using our visualization tool, thick red edges are those that move points in the positive direction along the corresponding PCA axis, and thick blue edges are those that move points in the negative direction. Edge PCA also uses the structure of the tree to find consistent differences between nearby species, which can result in greater resolution than distance-based methods. In squash clustering, the internal nodes of the clustering tree correspond to distributions of mass on the reference tree, and distances between internal nodes are Kantorovich-Rubinstein (earth-mover’s) distances between those distributions. These mass distributions can be visualized so users can view what is driving a particular merge.
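To make the squash clustering idea a bit more concrete, here is a minimal Python sketch. It is not the implementation from the paper. The toy tree, the sample names, and all function names are hypothetical, and it makes two simplifying assumptions: mass sits only on nodes of the reference tree (not along edges), and a merge "squashes" two clusters by taking the sample-count-weighted average of their mass distributions. On a tree, the Kantorovich-Rubinstein distance can be computed by summing, over edges, the branch length times the absolute difference in mass on the far side of that edge.

```python
from itertools import combinations

# Hypothetical toy reference tree: child -> (parent, branch length).
# The root "r" has no entry of its own.
TREE = {
    "x": ("r", 1.0),
    "y": ("r", 1.0),
    "a": ("x", 0.5),
    "b": ("x", 0.5),
}


def depth(node):
    """Number of edges between a node and the root."""
    d = 0
    while node in TREE:
        node = TREE[node][0]
        d += 1
    return d


def mass_below(mass):
    """For each edge (named by its child node), total mass on the side away from the root."""
    below = {node: mass.get(node, 0.0) for node in TREE}
    # Visit the deepest nodes first so a child's total is complete
    # before it is added to its parent.
    for node in sorted(TREE, key=depth, reverse=True):
        parent, _ = TREE[node]
        if parent in below:
            below[parent] += below[node]
    return below


def kr_distance(p, q):
    """Earth-mover's distance between two unit-mass distributions on the tree."""
    bp, bq = mass_below(p), mass_below(q)
    return sum(length * abs(bp[n] - bq[n]) for n, (_, length) in TREE.items())


def squash(p, q, n_p=1, n_q=1):
    """Assumed merge rule: average the two distributions, weighted by sample counts."""
    total = n_p + n_q
    return {k: (n_p * p.get(k, 0.0) + n_q * q.get(k, 0.0)) / total
            for k in set(p) | set(q)}


def squash_cluster(samples):
    """Repeatedly merge the closest pair; return (name_a, name_b, distance) per merge."""
    clusters = {name: (mass, 1) for name, mass in samples.items()}
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: kr_distance(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        (p, n_p), (q, n_q) = clusters.pop(a), clusters.pop(b)
        merges.append((a, b, kr_distance(p, q)))
        # The new internal node is itself a mass distribution on the tree.
        clusters[f"({a},{b})"] = (squash(p, q, n_p, n_q), n_p + n_q)
    return merges


# Three hypothetical samples, each with all of its mass on a single node.
samples = {"s1": {"a": 1.0}, "s2": {"b": 1.0}, "s3": {"y": 1.0}}
merges = squash_cluster(samples)
```

Because every internal node of the clustering tree is an ordinary mass distribution, you can look at the squashed distribution for any merge in `merges` to see which parts of the reference tree are pulling two clusters together.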

It took us a little while to get the manuscript into its final version, and I’ve just put it up on the arXiv preprint server. You can also see an example tree showing the edge PCA axes.
