Bayesian inference using Markov chain Monte Carlo

Part 2: Case studies from Phylogenetics

Tim Vaughan

Computational Evolution Group (cEvo)
Department of Biosystems Science and Engineering, ETH Zurich

Bayesian Statistics Mini-course
Univerisity of Hohenheim, 2018

Case 1: Bayesian Phylogenetic Inference

Modelling genetic change

Given two or more aligned nucleoide or amino acid sequences, usually the first goal is to calculate some measure of sequence similarity (or conversely distance).
The easiest distance to compute is the p-distance: the number of differences between two aligned sequences relative to their length.
- The p-distance is the Hamming distance normalized by the length of the sequence. Therefore it is the proportion of positions at which the sites differ.
- The p-distance can also be considered the probability that the two sequences differ at a random site.

Modelling genetic change

p-distance

$p$-distance = 3/7 $\simeq 0.43$.

Proportion of differences between two sequences.
Usually understimates the true distance: genetic (or evolutionary) distance $d$.

Multiple, parallel and back-substitutions

Relationship between $p$-distance and genetic distance

Continuous-time Markov chains (CTMCs)

CTMCs are the continuous-time generalization of Markov chains: state $X(t)$ is a function of a continuous time parameter.
Obey the Chapman-Kolmogorov equation: $$p(x_1,t_1|x_0,t_0)=\sum_{x_{1/2}}p(x_1,t_1|x_{1/2},t_{1/2})p(x_{1/2},t_{1/2}|x_0,t_0)$$ where $p(x_1,t_1)\equiv P(X(t_1)=x_1)$ and $t_0<t_{1/2}<t_1$.
Can be written in the following differential form: \begin{align*} \frac{d}{dt}p(x,t|x_0,t_0)&=\sum_{x'\neq x}\left[p(x',t|x_0,t_0)Q_{xx'}-p(x,t|x_0,_0)Q_{x'x}\right]\\ &=\sum_{x'}Q_{x'x}p(x',t|x_0,t_0) \end{align*} defining $Q_{xx}=-\sum_{x'\neq x}Q_{x'x}$.
This form is known as the Master Equation or Kolmogorov forward equation. Linear, so solution is "simply" $\vec{p}(t)=\exp[-Qt]\vec{p}(0)$.

CTMC example

Consider a system with two states (0 and 1) and a transition rate matrix $$Q = \left[\begin{array}{cc} - & 2 \\ 1 & - \end{array}\right]$$

Gives rise to the following trajectories:

Run Demo

Time $\Delta t$ spent in state before transition: $$ P(\Delta t|x) = \lambda e^{-\lambda \Delta t}$$ where $\lambda = \sum_{x'\neq x}Q_{x'x}$.

Jukes-Cantor model of nucelotide substitution

$$ Q = \mu\left[\begin{array}{cccc} -1 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\ \frac{1}{3} & -1 & \frac{1}{3} & \frac{1}{3}\\ \frac{1}{3} & \frac{1}{3} & -1 & \frac{1}{3}\\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} &-1 \end{array}\right] $$

\begin{align*} P(X(t)=x|X(0)=x_0) &= \left[\exp(-Qt)\right]_{xx_0}\\ &=\left\{\begin{array}{ll} \frac{1}{4}+\frac{3}{4}e^{-\frac{4}{3}\mu t} & \text{for } x=x_0\\ \frac{1}{4}-\frac{1}{4}e^{-\frac{4}{3}\mu t} & \text{for } x\neq x_0 \end{array}\right. \end{align*}

Differences per site

Genetic distance under Jukes-Cantor

The probabilty that a site is distinct after time $t$ is \begin{align*} p_{\text{diff}}&=\sum_{x'\neq x} P(X(t)=x'|X(0)=x)\\ &=\frac{3}{4} - \frac{3}{4}e^{-\frac{4}{3}\mu t} \end{align*}

Equating $p_{\text{diff}}$ with the $p$-distance and solving for $\mu t$ yields $$\hat{d} = \widehat{\mu t} = -\frac{3}{4}\log(1-\frac{4}{3}p).$$

Accounts for multiple, parallel, back-substitutions.
Only correct if $p$ is exact: infinite sequences.
Diverges if $p≥3/4$.

Other CTMC substitution models: Kimura 80

$$ Q = \left[\begin{array}{cccc} \cdot & \alpha & \beta & \beta \\ \alpha & \cdot & \beta & \beta \\ \beta & \beta & \cdot & \alpha \\ \beta & \beta & \alpha &\cdot \end{array}\right] $$

Also "Kimura two parameter" model.
Allows different rates for transitions and transversions.
Equilibrium base frequencies equal.

Other CTMC substitution models: HKY

$$ Q = \left[\begin{array}{cccc} \cdot & \alpha\pi_G & \beta\pi_G & \beta\pi_G \\ \alpha\pi_A & \cdot & \beta\pi_A & \beta\pi_A \\ \beta\pi_C & \beta\pi_C & \cdot & \alpha\pi_C \\ \beta\pi_T & \beta\pi_T & \alpha\pi_T &\cdot \end{array}\right] $$

Due to Hasegawa, Kishino, Yano (1985).
Allows different transition transversion rates.
Allows equilibrium base frequencies to differ.
Most complex model for which analytical solutions are available.

Other CTMC substitution models: GTR

$$ Q = \left[\begin{array}{cccc} \cdot & \alpha\pi_G & \beta\pi_G & \gamma\pi_G \\ \alpha\pi_A & \cdot & \delta\pi_A & \epsilon\pi_A \\ \beta\pi_C & \delta\pi_C & \cdot & \eta\pi_C \\ \gamma\pi_T & \epsilon\pi_T & \eta\pi_T &\cdot \end{array}\right] $$

All models so-far have been reversible.
GTR is the most general time-reversible model.
Reversibility is a mathematical convenience: no biological reason!
No useful analytical solution available (matrix exponential doesn't count).

Variable rates among sites

Gamma distribution rate heterogeneity allows for variation in evolutionary rate among sites according to $$f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}r^{\alpha-1}e^{-\alpha r}$$

Analytical solution exists for JC distance with site rate heterogeneity: $$\hat{d}=-\frac{3}{4}\alpha\left[1-\left(1-\frac{4}{3}p\right)^{-1/\alpha}\right]$$

Genetic distance variation between models

HIV-1B vs HIV-O/SIVcpz/HIV-1C: env gene:

	$p$-distance	JC69	K80	Tajima-Nei
HIV-O	0.391	0.552	0.560	0.572
SIVcpz	0.266	0.337	0.340	0.427
HIV-1C	0.163	0.184	0.187	0.189

HIV-1B vs HIV-O/HIV-1C: pol gene:

$p$-distance JC69 K80 Tajima-Nei

HIV-O 0.257 0.315 0.318 0.324

HIV-1C 0.103 0.111 0.113 0.114

Least-squares Phylogenetic Inference

Can use genetic distance as the basis for least-squares-based phylogenetic inference.
Define cost of tree as the (weighted) sum of the squares of the difference between the genetic distance on the tree and the distance estimated from the sequence alignment: $$C(T) = \sum_i\sum_j w_{ij}(d_{ij}(T)-\hat{d}_{ij})^2$$
Search for tree $T$ with lowest $C(T)$.

Comments:

Ignores correlations due to shared ancestry.
Consistent estimator: converges to truth in the limit of infinitely long alignment.

The Phylogenetic Likelihood

$$P(A|T, \mu)$$

$A$ is a multiple sequence alignment of $n$ sequences,
$T$ is a phylogenetic tree with $n$ leaves, and
$\mu$ are the parameters of the chosen CTMC-based model of sequence evolution.
Just a product of site pattern propabilities: note the obvious generalizations to multiple alignments that share a tree, or to phylogenetic networks that allow different sites to evolve under distinct "local" trees.

What is the likelihood of a tree?

$$L(T|A)\equiv P(A|T) = \sum_x\sum_w P(x)P(A|x)P(w|x)P(G|w)P(G|w)$$
For $n$ taxa there are $4^{n-1}$ possible internal node states!
Solve using Dynamic Programming!

Felsenstein's Pruning Algorithm

Introduced by Joseph Felsenstein, 1973.
Recursion based on the conditional likelihood of a subtree below node $k$ having state $s$: $L_k(s)$.
For leaves, $L_k(s)=\delta_{s,l}$ where $l$ is the leaf state.

For internal nodes, $$L_k(s) = \left(\sum_xP(x|s)L_{c_l}(x)\right)\left(\sum_yP(y|s)L_{c_r}(y)\right)$$

Time complexity for $m$ sites: $m(n-1)4^2$. This is the workhorse of computational phylogenetics.

Maximum likelihood for Phylogenetics

PhyML (Guindon and Gascuel, 2003)
- Hill-climbing algorithm which attempts to find ML trees.
RAxML (e.g. Stamatakis, 2005)
- Another hill-climbing algorithm, optimized for large trees.
FastTree (Price, Dehal, Arkin, 2009)
- Uses efficient Neighbour-Joining algorithm to find starting tree.
- Updates tree using heuristics based on tree length.
- Inference from alignments of $>10^5$ sequences feasible.

Problems with ML

Only ever gives point estimates: no indication of uncertainty.
- Addressed using ad hocery: bootstrap algorithms.
$L(\theta|D)$ is not a probability!
No clear way of incorporating additional (prior?) information into the analysis.

Need a better approach!

The Phylogenetic Posterior

$$P(T,\mu,\theta|A) = \frac{1}{P(A)}P(A|T,\mu)P(T|\theta)P(\mu,\theta)$$

$P(T|\theta)$ is the "tree prior" parameterized by $\theta$.
$P(\mu,\theta)$ are the parameter priors.

Questions

What is $P(A)$?
Is the tree prior really a prior? (I.e. does it depend on the data?)

The Neutrality Assumption

Because of the way we've factorized the joint probability for the data and model parameters, we are implicitly assuming that our alignment could have been produced in the following fashion:

Separating the process of tree generation from that of sequence evolution implies neutrality.

Tree Priors

Tree priors allow us to specify a generative model for the genealogy.
While this model may not involve genetic evolution, it may depend on speciation rates, population sizes, migration rates, etc. etc.
A huge fraction of phylogenetics inference research is focused on developing and efficiently implementing tree priors!

The Yule model for speciation

Simple model for speciation due to Yule (1924).
Constant rate branching process where branching occurs uniformly across extant lineages at a rate of $\lambda$ per lineage per unit time.
Number of lineages $k$ evolves under the following master equation: $$\dot{p}(k,t) = \lambda (k-1)p(k-1,t) - \lambda k p(k,t)$$
Write this in chemical kinetics formalism: $X \overset{\lambda}{\longrightarrow} 2X$

Birth-death(-sampling) priors

Most of the recent groundwork on this is due to Tanja Stadler (ETH, Zürich).
Generalization of Yule process to account for extinction and sampling.
Linear speciation, extinction and sampling rates.
Prior is over sampled trees: assumes unobserved full tree.

Tanja Stadler, J. Theor. Biol., 2010

Coalescent priors

Original theory due to Kingman (1980).
Model the shape of gene trees within a species.
Can be obtained as the limit of a number of population genetic models, including the Wright-Fisher model and the Moran model.
Provide an important link between tree shape and population size!

Drummond et al., TIEE, 2003

Multi-species coalescent priors

Hierarchical models that embed gene trees within species trees.
Species tree described using a birth-death prior.
Gene trees described by coalescent priors for populations which subdivide at speciation events.

Leliaert et al., Euro. J. Phycology, 2014

MCMC in Tree Space

Need to identify a set of proposals $q_j(x'|x)$ when $x$ is a point in the space of rooted time trees.

Stopping criteria

How can we tell when a phylogenetic MCMC calculation has reached equilibrium?

How do we know when we've collected enough samples?

One approach is to compute ESS (using empirical estimates of autocorrelation time) for each parameter and a number of tree summary statistics (e.g. tree height and tree "length").
- Assume that once all of these scores are sufficiently high (e.g. > 200) we have adequately sampled the posterior.
Examine output of several independent (and independently initialized chains. A necessary (not sufficient) condition for convergence is that sample distributions converge as the chains converge to equilibrium.
- Always do this!

Are we there yet?

The AWTY application applies a number of different statistics to assess the convergence of the tree state. It relies heavily on comparing the result of multiple runs.

Nylander, et al., Bioinformatics, 2008

Post-processing: Parameter samples

Logs of individual parameters can be considered samples from the marginal posteriors of those parameters.

Post-processing: Tree samples

Summary statistics (e.g. age of the MRCA of a pair of taxa) can be computed and their marginal distributions assessed.
A number of different approaches can be used to produce a "summary" tree meant to represent the best guess at the true tree. (None of these algorithms are perfect!!)

Strict consensus: Include only clades that appear in ALL sampled trees.
Majority rule consensus: Include clades that appear in >50% of the sampled trees.
Maximum clade credibility tree: The sampled topology for which the product of the posterior clade probabilities is maximized.

In BEAST, summary trees are produced using the MCC tree method via the program TreeAnnotator.

Bayesian Phylogenetic Inference Software

Popular Bayesian phyogenetic inference software:

MrBayes (Huelsenbeck and Ronquist, Bioinformatics, 2001): Early CLI program for phylogenetic inference.
RevBayes (Höhna et al., Syst. Biol., 2016): R-like syntax for specifying phylogenetic models.
BEAST/BEAST2 (Drummond and Rambaut, 2007; Bouckaert et al., 2014): XML specification of phylogenetic models. Extensible.

Some software implementing special models:

MIGRATE: Performs inference under the Structured Coalescent. (I.e. subpopulation sizes, migration rates, ancestral locations.)
ClonalFrame/ClonalOrigin: Infer bacterial Ancestral Recombination Graphs (generalizations of trees when recombination is present).

... there are many others!

Case 2: Bayesian inference of bacterial
ancestral recombination graphs

Why study bacterial phylogenetics?

Bacteria play important roles (both positive and negative) in the health of animals and plants.
Many bacteria possess interesting and experimentally accessible evolutionary dynamics.
Bacterial genomes are measurably evolving over relatively short study periods.

Drummond & Rambaut, TIEE (2003)

Rainey & Travisano, Nature (1998)

Cartoon bacterial population genetics

Effect of events can be plasmid transfer, insertion or homologous recombination.
Rate of recombination varies and is subject to selection.
Focus on homologous recombination: only event which doesn't alter the length of the sequence

The Problem for Phylogenetic Inference

For many bacteria, the ratio between the recombination rate and the mutation rate is extremely high.

Vos and Didelot, The ISME Journal (2009)

Existing solutions

Pre-processing of data to identify and remove non-vertically inherited material. (eg. START: Jolley et al., 2001)

Pros	Cons
Can use standard tools for phylogenetic inference.	Data is being thrown away. Ad hoc, may bias results.

Explicit modelling of bacterial recombination.
(eg. ClonalFrame and ClonalOrigin: Didelot et al., 2007, 2010)

Pros	Cons
Can make use of all data. Can infer additional parameters such as recombination rates. May yield increased confidence in estimates.	Models can be complex, with many parameters. Both computationally and statistically challenging. Existing implementations lack flexibility.

The Coalescent with Gene Conversion

Wiuf, 1999; Wiuf and Hein, 2000

Parameters
$N(t)g$	Coalescence rate
$\rho_s$	Recombination rate
$\delta$	Expected tract length

Problem

The space of possible ancestral recombination graphs is extremely large. Even two samples have infinitely many distinct ancestries!

Full ARG

Approximation 1: ClonalFrame

Didelot et al., 2007 (ClonalFrame); Didelot and Wilson, 2015 (ClonalFrameML)

Approximation 2: ClonalOrigin

Didelot et al., 2010 (ClonalOrigin); Ansari and Didelot, 2014 (ClonalOrigin')

Our Goal

To conduct Bayesian inference of the full ARG under the ClonalOrigin model, optionally under non-trivial population size dynamics.

Inference under the ClonalOrigin Model

Inference follows the standard Bayesian phylogenetic tradition:

\begin{equation*} f(G,N,\mu,\rho,\delta|A) \propto P_{F}(A|G,\mu)f_{CGC}(G|N,\rho,\delta)f_{\text{prior}}(N ,\mu,\rho,\delta) \end{equation*}

where

$A$ is the sequence alignment,
$\mu$ are the substitution model parameters, and
$G$ is the full sample genealogy including clonal frame
$T$ and $M$ conversions $\{C_i\}_{i\in[1\ldots M]}$.

The genealogy density under ClonalOrigin model can be expanded

\begin{equation*} f_{CGC}(G|\rho',\delta,N)=\left(\prod_{i=1}^M f(C_i|T,N,\delta)\right)P(M|T,\rho)f_C(T|N) \end{equation*}

Identifiability

Despite using a simplified model, an infinite number of ARGs still possess the same likelihood given a sequence alignment.

Very important for an MCMC algorithm to propose state changes which minimize effect on likelihood.

MCMC State Proposal Strategy

Attachment points and site tracts of new conversions drawn from prior.

Validation 1: Sampling from the prior

It is trivial to sample directly from the approximate ARG prior via coalescent simulation.
We expect precise agreement between our MCMC algorithm and the coalescent simulation when no data is involved.

One should alwayas test that a new sampler exactly converges to a known distribution: this provides a strict necessary criterion for correctness.

Validation 2: Inference from simulated data

True ARG:

Randomly-selected ARG from MCMC:

Summary networks

True ARG:

Summary ARG from MCMC:

Application to Escherichia coli

Applied method to E. coli O157/O26 sequences collected by Tessy George (^mEpiLab, Massey University, New Zealand).
Sequences for 53 rMLST genes (21.1 kb total) from each of 23 human and cattle-derived isolates.
Analyzed using fixed expected tract length parameter and informative prior on conversion rate parameter.

Application to Escherichia coli

Software implementation

tgvaughan.github.io/bacter

BEAST 2 package that performs inference under the ClonalOrigin model.
Joint inference of the clonal frame, recombinant edges and parameters.
Can be combined with usual variety of substitution models, parametric population models and parameter priors.
GUI analysis configuration via BEAUti.

Recombination and Demographic Inference

Relationship used by Li and Durbin, 2011 to infer human demographic history from pairs of autosomes.

Bayesian Skyline Plots in Bacter

Bacter implements a non-parametric demographic history model inspired by the standard Bayesian Skyline Plot (Drummond et al., 2005).
In principle can reconstruct non-trivial dynamics from very few genomes.
In practice, this is limited by the computational expense of the algorithm.

Piecewise constant/linear population model

Inference example

Bacter limitations

Computational complexity scales with the number of proposed conversions.
- This can be arbitrarily large, even for small sample sets.
- Cannot use the BEAGLE phylogenetic likelihood library to speed things up due to peculiarities of ARG likelihood computation.
MCMC algorithm used does not intelligently locate conversions.
- Looking at addressing this in the near future.
Network summary algorithm can produce peculiar results.
- More research needs to be done to find a better algorithm.

Summary

Bayesian phylogenetic inference is generally done by adding a tree prior to the tree likelihood. This implicitly assumes neutral evolution.
Tree priors link the shape of the tree to parameters such as speciation rates, extinction rates, sampling rates, population sizes and migration rates.
Development of these priors is a particularly rich area of current research.
Due to the immense size of the state spaces involved and the use of distributions which are analytically intractable, almost all Bayesian phylogenetic inference uses MCMC.
Deciding on stopping criteria is difficult. Many necessary requirements for convergence, but no sufficient requirements!
Methods for summarizing sampled tree distributions exist, but may be misleading. At the moment, sticking to summary statistics for particular characteristics of interest is the only safe option.

	$p$-distance	JC69	K80	Tajima-Nei
HIV-O	0.257	0.315	0.318	0.324
HIV-1C	0.103	0.111	0.113	0.114

Case 1: Bayesian Phylogenetic Inference

Modelling genetic change

Modelling genetic change

p-distance

Multiple, parallel and back-substitutions

Relationship between $p$-distance and genetic distance

Continuous-time Markov chains (CTMCs)

CTMC example

Jukes-Cantor model of nucelotide substitution

Differences per site

Genetic distance under Jukes-Cantor

Other CTMC substitution models: Kimura 80

Other CTMC substitution models: HKY

Other CTMC substitution models: GTR

Variable rates among sites

Genetic distance variation between models

Least-squares Phylogenetic Inference

The Phylogenetic Likelihood

What is the likelihood of a tree?

Felsenstein's Pruning Algorithm

Maximum likelihood for Phylogenetics

Problems with ML

The Phylogenetic Posterior

The Neutrality Assumption

Tree Priors

The Yule model for speciation

Birth-death(-sampling) priors

Coalescent priors

Multi-species coalescent priors

MCMC in Tree Space

Stopping criteria

Are we there yet?

Post-processing: Parameter samples

Post-processing: Tree samples

Bayesian Phylogenetic Inference Software

Case 2: Bayesian inference of bacterial ancestral recombination graphs

Why study bacterial phylogenetics?

Cartoon bacterial population genetics

The Problem for Phylogenetic Inference

Existing solutions

The Coalescent with Gene Conversion

Full ARG

Approximation 1: ClonalFrame

Approximation 2: ClonalOrigin

Inference under the ClonalOrigin Model

Identifiability

MCMC State Proposal Strategy

Validation 1: Sampling from the prior

Validation 2: Inference from simulated data

Summary networks

Application to Escherichia coli

Application to Escherichia coli

Software implementation

Recombination and Demographic Inference

Bayesian Skyline Plots in Bacter

Piecewise constant/linear population model

Inference example

Bacter limitations

Summary

Recommended Reading

Case 2: Bayesian inference of bacterial
ancestral recombination graphs