Algebraic and combinatorial phylogenetics


Mike Steel (University of Canterbury)

Arndt von Haeseler (CIBIV)

Piotr Zwiernik (Uiversitat Pompeu Fabra)



June 19-23: Part 1 (20 hours)
July 3-7: Part 2 (10 hours)
Details about the schedule will appear soon


Facultat de Matemàtiques i Estadística, UPC

Campus Ciutadella, UPF


Over the last ten years a tremendous progress have been observed in the study and design of mathematical models and phylogenetic reconstruction methods. Phylogenetics is the study of ancestral relationships among species. Its interest relies not only on the study of the evolutionary processes that led to the extant species, but also on the direct relation to genomics (comparative genomics) and in emerging fields such as traceability of cancer cells. Although it is more or less known that statistics, probability, computation, differential equations and algorithmics play an important role in biology, biostatistics and bioinformatics, it is probably less evident that areas like combinatorics, geometry and algebra can also interact with biology. The main aim of this course is twofold: to awake the interest of mathematicians in the applied research in biology, and to bring together mathematicians from different disciplines.
The course will go from the basics of stochastic processes and combinatorial objects related to phylogenetics to the deep understanding of the reconstruction methods used nowadays and the new algebraic and combinatorial tools that are emerging in the field.



Part 1: Phylogenetics and combinatorics (Mike Steel and Arndt von Haeseler)

Introduction to phylogenetics: Brief historical overview, and some of the ways in which mathematics can help biologists address fundamental questions. Background on graphs and trees, and some key definitions and properties of of rooted vs unrooted phylogenetic trees

Basic combinatorics of discrete trees: Enumeration of phylogenies. Trees as hierarchies and split systems. Strict, majority and Adams consensus. The Robinson-Foulds metric. Tree rearrangement operations (NNI, SPR,TBR). Consensus methods and Arrow-type results.

Tree shape and random discrete phylogenies: Properties of tree shape under discrete random models of speciation and extinction. Balance indices and application of Pólya-Urn models.

Pulling trees apart, and putting trees together: Subtree-based tree reconstruction, quartet (and rooted triplet) encoding. The Aho algorithm, supertree methods (Majority (+), MRP, R*). Phylogenetic decisiveness (of topology and branch lengths) under patchy taxon coverage.

Phylogenies based on discrete characters: Character data. Convexity, homoplasy-free evolution and ‘perfect phylogeny’ for binary and non-binary data. The 4-character result. Maximum parsimony. Links between parsimony score and TBR distance (Bryant), and the 2-character solution to the maximum parsimony problem.

Continuous phylogenies and distance-based tree reconstruction: Distance-based tree reconstruction, the 4-point condition, neighbor joining. Paulin’s formula and cyclic permutations, balanced minimum evolution (BME). Phylogenetic diversity and the greedy algorithm.

Evolution on a tree (Part one): Markov processes on a line and on a tree. Site substitution models, their classification, and properties. ‘Felsenstein zone’ and phylogenetic mixture models.

Evolution on a tree (Part two): Maximum likelihood and Bayesian methods. Identifiability issues and phylogenetic mixture models. Information-theoretic and algebraic techniques for tree reconstruction models.

Evolution of trees: Modeling speciation and extinction with birth-death models. Edge lengths and predicting phylogenetic diversity loss. Kingman’s coalescent, the multispecies coalescent and lineage sorting. Consistent reconstruction of species trees from conflicting gene trees (Rosenberg et al). Models of genome evolution and lateral gene transfer.

Phylogenetic networks: Networks that represent uncertainty (implicit) vs explicit reticulate evolution (hybrid species, lateral gene transfer, recombination/re-assortment). Embedded trees, softwired and hardwired clusters. Types of networks (pedigree, haplotype, median, level-1). Properties of normal and regular networks. NeighborNet.

Part 2: Semialgebraic statistics, latent tree models, and phylogenetics (Piotr Zwiernik)

This lecture course is divided into three parts. In part 1, I will present a basic geometric perspective on statistical models and show how algebraic geometry can be useful in statistics and phylogenetics. In part 2, I will define graphical models and graphical models with hidden variables focusing on latent tree models. Graphical models with hidden variables form a rich class of statistical models that leads to many interesting open problems in (real) algebraic geometry. In part 3, I will use this setting to analyze the structure of phylogenetic tree models.

There will be the workshop Algebraic and Combinatorial Phylogenetics between parts 1 and 2. See:


A. Dress, K. Huber, J. Koolen, V. Moulton and A. Spillner, “Basic phylogenetic combinatorics”, Cambridge University Press 2012.
L. Pachter and B. Sturmfels, Algebraic statistics for computational biology, Cambridge University Press 2005.
C. Semple, M. Steel, “Phylogenetics”, Oxford University Press, 2003
M. Steel, “Random paths in evolution”, in preparation
P. Zwiernik, “Semialgebraic statistics and latent tree models”, Chapman&Hall, 2015.
Additional references will be given later.