v1.1.alpha15: gaps, rates, and better support for more sequences

07 Feb 2014, by Erick

We are releasing a new version today. One change deserves special attention because it can alter placement results: I have changed how non-informative columns (such as columns that are gaps in a query sequence) are masked out of the final alignments used for placement. The difference is fully explained on the pplacer documentation page; in short, non-informative sites can have subtle effects on placement, so it matters whether they are masked. I also fixed a bug that misaligned rate assignments when placing on a tree built with FastTree.

I do not expect the changes to substantially impact your results. In a trial run, they changed classification results for 4-6% of sequences, and almost all of those changes moved a classification between the species level and the corresponding genus, in either direction. That said, I can’t promise what will happen on your data.

On the bright side, pplacer is now guaranteed to give identical results for a given query sequence, irrespective of any other sequences in your query FASTA file.
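
To make the idea of masking concrete, here is a minimal Python sketch. It is not pplacer's actual code, and it assumes that the mask is chosen from the query sequence alone, which is one way to make the result independent of other queries. The names `GAP_CHARS`, `informative_columns`, and `mask_alignment` are made up for this illustration, as is the choice of which characters count as gaps.

```python
# Hypothetical illustration only; this is not how pplacer is implemented.
GAP_CHARS = frozenset("-?")  # assumed set of non-informative characters


def informative_columns(query):
    """Return the indices of columns where the query has a non-gap character."""
    return [i for i, ch in enumerate(query) if ch not in GAP_CHARS]


def mask_alignment(reference, query):
    """Restrict the reference rows and the query to the query's informative columns."""
    lengths = {len(seq) for seq in reference.values()} | {len(query)}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    keep = informative_columns(query)
    masked_ref = {
        name: "".join(seq[i] for i in keep) for name, seq in reference.items()
    }
    masked_query = "".join(query[i] for i in keep)
    return masked_ref, masked_query


reference = {"ref_a": "ACGT-ACG", "ref_b": "ACGTTACG"}
masked_ref, masked_query = mask_alignment(reference, "--GTTA--")
```

Because the columns to keep are computed from one query at a time in this sketch, adding or removing other query sequences cannot change what that query is compared against.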

I should also note that taxtastic was making reference packages for FastTree amino acid trees that were incorrectly marked as using an empirical rate matrix. This bug has been fixed, and I encourage you to update to the master branch there (you can update via pip).

This release has a number of new features that will interest people pushing lots of data through pplacer. In particular, pplacer, guppy, and rppr support gzip-compressed input sequence and .jplace files, and deduplicate_sequences.py supports gzip-compressed FASTA files.
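
If you have your own scripts that feed sequences into this pipeline, reading gzip-compressed FASTA needs nothing beyond the Python standard library. The following is a rough sketch and not the code in deduplicate_sequences.py; the helper names `open_maybe_gzip` and `read_fasta` are hypothetical, and detecting compression from a `.gz` suffix is an assumption made for simplicity.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file, gzip-compressed or not."""
    name, chunks = None, []
    with open_maybe_gzip(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header ends the previous record, if there was one.
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)
```

The same loop then works for both plain and compressed files, for example `for name, seq in read_fasta("queries.fasta.gz"): ...`, where the file name is a placeholder.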

As usual, there were also lots of bugfixes and improvements: see the CHANGELOG for details.
