Bayesian inference using Markov chain Monte Carlo
Part 2: Case studies from Phylogenetics
Tim Vaughan
Computational Evolution Group (cEvo)
Department of Biosystems Science and Engineering, ETH Zurich
Bayesian Statistics Mini-course
Univerisity of Hohenheim, 2018
Case 1: Bayesian Phylogenetic Inference
Modelling genetic change
- Given two or more aligned nucleoide or amino acid sequences,
usually the first goal is to calculate some measure of sequence
similarity (or conversely distance).
- The easiest distance to compute is the p-distance: the number of
differences between two aligned sequences relative to their length.
- The p-distance is the Hamming distance normalized by the
length of the sequence. Therefore it is the proportion of positions
at which the sites differ.
- The p-distance can also be considered the probability that the two
sequences differ at a random site.
Modelling genetic change
p-distance
$p$-distance = 3/7 $\simeq 0.43$.
- Proportion of differences between two sequences.
- Usually understimates the true distance: genetic (or evolutionary) distance $d$.
Multiple, parallel and back-substitutions
Relationship between $p$-distance and genetic distance
Continuous-time Markov chains (CTMCs)
- CTMCs are the continuous-time generalization of Markov chains: state $X(t)$ is
a function of a continuous time parameter.
- Obey the Chapman-Kolmogorov equation:
$$p(x_1,t_1|x_0,t_0)=\sum_{x_{1/2}}p(x_1,t_1|x_{1/2},t_{1/2})p(x_{1/2},t_{1/2}|x_0,t_0)$$
where $p(x_1,t_1)\equiv P(X(t_1)=x_1)$ and $t_0<t_{1/2}<t_1$.
- Can be written in the following differential form:
\begin{align*}
\frac{d}{dt}p(x,t|x_0,t_0)&=\sum_{x'\neq x}\left[p(x',t|x_0,t_0)Q_{xx'}-p(x,t|x_0,_0)Q_{x'x}\right]\\
&=\sum_{x'}Q_{x'x}p(x',t|x_0,t_0)
\end{align*}
defining $Q_{xx}=-\sum_{x'\neq x}Q_{x'x}$.
- This form is known as the Master Equation or Kolmogorov forward equation. Linear, so solution is "simply" $\vec{p}(t)=\exp[-Qt]\vec{p}(0)$.
CTMC example
Consider a system with two states (0 and 1) and a transition rate matrix
$$Q = \left[\begin{array}{cc}
- & 2 \\
1 & -
\end{array}\right]$$
Gives rise to the following trajectories:
Time $\Delta t$ spent in state before transition:
$$ P(\Delta t|x) = \lambda e^{-\lambda \Delta t}$$
where $\lambda = \sum_{x'\neq x}Q_{x'x}$.
Jukes-Cantor model of nucelotide substitution
$$
Q = \mu\left[\begin{array}{cccc}
-1 & \frac{1}{3} & \frac{1}{3} & \frac{1}{3}\\
\frac{1}{3} & -1 & \frac{1}{3} & \frac{1}{3}\\
\frac{1}{3} & \frac{1}{3} & -1 & \frac{1}{3}\\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} &-1
\end{array}\right]
$$
\begin{align*}
P(X(t)=x|X(0)=x_0) &= \left[\exp(-Qt)\right]_{xx_0}\\
&=\left\{\begin{array}{ll}
\frac{1}{4}+\frac{3}{4}e^{-\frac{4}{3}\mu t} & \text{for } x=x_0\\
\frac{1}{4}-\frac{1}{4}e^{-\frac{4}{3}\mu t} & \text{for } x\neq x_0
\end{array}\right.
\end{align*}
Genetic distance under Jukes-Cantor
The probabilty that a site is distinct after time $t$ is
\begin{align*}
p_{\text{diff}}&=\sum_{x'\neq x} P(X(t)=x'|X(0)=x)\\
&=\frac{3}{4} - \frac{3}{4}e^{-\frac{4}{3}\mu t}
\end{align*}
Equating $p_{\text{diff}}$ with the
$p$-distance and solving for $\mu t$ yields
$$\hat{d} = \widehat{\mu t} = -\frac{3}{4}\log(1-\frac{4}{3}p).$$
- Accounts for multiple, parallel, back-substitutions.
- Only correct if $p$ is exact: infinite sequences.
- Diverges if $p≥3/4$.
Other CTMC substitution models: Kimura 80
$$
Q = \left[\begin{array}{cccc}
\cdot & \alpha & \beta & \beta \\
\alpha & \cdot & \beta & \beta \\
\beta & \beta & \cdot & \alpha \\
\beta & \beta & \alpha &\cdot
\end{array}\right]
$$
- Also "Kimura two parameter" model.
- Allows different rates for transitions and transversions.
- Equilibrium base frequencies equal.
Other CTMC substitution models: HKY
$$
Q = \left[\begin{array}{cccc}
\cdot & \alpha\pi_G & \beta\pi_G & \beta\pi_G \\
\alpha\pi_A & \cdot & \beta\pi_A & \beta\pi_A \\
\beta\pi_C & \beta\pi_C & \cdot & \alpha\pi_C \\
\beta\pi_T & \beta\pi_T & \alpha\pi_T &\cdot
\end{array}\right]
$$
- Due to Hasegawa, Kishino, Yano (1985).
- Allows different transition transversion rates.
- Allows equilibrium base frequencies to differ.
- Most complex model for which analytical solutions are available.
Other CTMC substitution models: GTR
$$
Q = \left[\begin{array}{cccc}
\cdot & \alpha\pi_G & \beta\pi_G & \gamma\pi_G \\
\alpha\pi_A & \cdot & \delta\pi_A & \epsilon\pi_A \\
\beta\pi_C & \delta\pi_C & \cdot & \eta\pi_C \\
\gamma\pi_T & \epsilon\pi_T & \eta\pi_T &\cdot
\end{array}\right]
$$
- All models so-far have been reversible.
- GTR is the most general time-reversible model.
- Reversibility is a mathematical convenience: no biological reason!
- No useful analytical solution available (matrix exponential doesn't count).
Variable rates among sites
Gamma distribution rate heterogeneity allows for variation
in evolutionary rate among sites according to
$$f(r) = \frac{\alpha^{\alpha}}{\Gamma(\alpha)}r^{\alpha-1}e^{-\alpha r}$$
Analytical solution exists for JC distance with site rate heterogeneity:
$$\hat{d}=-\frac{3}{4}\alpha\left[1-\left(1-\frac{4}{3}p\right)^{-1/\alpha}\right]$$
Genetic distance variation between models
- HIV-1B vs HIV-O/SIVcpz/HIV-1C: env gene:
| $p$-distance | JC69 | K80 | Tajima-Nei |
HIV-O | 0.391 | 0.552 | 0.560 | 0.572 |
SIVcpz | 0.266 | 0.337 | 0.340 | 0.427 |
HIV-1C | 0.163 | 0.184 | 0.187 | 0.189 |
- HIV-1B vs HIV-O/HIV-1C: pol gene:
| $p$-distance | JC69 | K80 | Tajima-Nei |
HIV-O | 0.257 | 0.315 | 0.318 | 0.324 |
HIV-1C | 0.103 | 0.111 | 0.113 | 0.114 |
Least-squares Phylogenetic Inference
- Can use genetic distance as the basis for least-squares-based phylogenetic inference.
- Define cost of tree as the (weighted) sum of the squares of the
difference between the genetic distance on the tree and the
distance estimated from the sequence alignment:
$$C(T) = \sum_i\sum_j w_{ij}(d_{ij}(T)-\hat{d}_{ij})^2$$
- Search for tree $T$ with lowest $C(T)$.
Comments:
- Ignores correlations due to shared ancestry.
- Consistent estimator: converges to truth in the limit of infinitely long alignment.
The Phylogenetic Likelihood
$$P(A|T, \mu)$$
- $A$ is a multiple sequence alignment of $n$ sequences,
- $T$ is a phylogenetic tree with $n$ leaves, and
- $\mu$ are the parameters of the chosen CTMC-based model of sequence evolution.
- Just a product of site pattern propabilities: note the obvious
generalizations to multiple alignments that share a tree, or to phylogenetic networks that allow
different sites to evolve under distinct "local" trees.
What is the likelihood of a tree?
- $$L(T|A)\equiv P(A|T) = \sum_x\sum_w P(x)P(A|x)P(w|x)P(G|w)P(G|w)$$
- For $n$ taxa there are $4^{n-1}$ possible internal node states!
- Solve using Dynamic Programming!
Felsenstein's Pruning Algorithm
- Introduced by Joseph Felsenstein, 1973.
- Recursion based on the conditional likelihood of a subtree below node $k$ having state $s$: $L_k(s)$.
- For leaves, $L_k(s)=\delta_{s,l}$ where $l$ is the leaf state.
For internal nodes,
$$L_k(s) = \left(\sum_xP(x|s)L_{c_l}(x)\right)\left(\sum_yP(y|s)L_{c_r}(y)\right)$$
Time complexity for $m$ sites: $m(n-1)4^2$. This is the workhorse of computational phylogenetics.
Maximum likelihood for Phylogenetics
- PhyML (Guindon and Gascuel, 2003)
- Hill-climbing algorithm which attempts to find ML trees.
- RAxML (e.g. Stamatakis, 2005)
- Another hill-climbing algorithm, optimized for large trees.
- FastTree (Price, Dehal, Arkin, 2009)
- Uses efficient Neighbour-Joining algorithm to find starting tree.
- Updates tree using heuristics based on tree length.
- Inference from alignments of $>10^5$ sequences feasible.
Problems with ML
- Only ever gives point estimates: no indication of uncertainty.
- Addressed using ad hocery: bootstrap algorithms.
- $L(\theta|D)$ is not a probability!
- No clear way of incorporating additional (prior?) information into the analysis.
Need a better approach!
The Phylogenetic Posterior
$$P(T,\mu,\theta|A) = \frac{1}{P(A)}P(A|T,\mu)P(T|\theta)P(\mu,\theta)$$
- $P(T|\theta)$ is the "tree prior" parameterized by $\theta$.
- $P(\mu,\theta)$ are the parameter priors.
Questions
- What is $P(A)$?
- Is the tree prior really a prior? (I.e. does it depend on the data?)
The Neutrality Assumption
Because of the way we've factorized the joint probability for the data and model parameters, we
are implicitly assuming that our alignment could have been produced in the following fashion:
Separating the process of tree generation from that of
sequence evolution implies neutrality.
Tree Priors
- Tree priors allow us to specify a generative model for the genealogy.
- While this model may not involve genetic evolution, it may depend on
speciation rates, population sizes, migration rates, etc. etc.
- A huge fraction of phylogenetics inference research is focused on
developing and efficiently implementing tree priors!
The Yule model for speciation
- Simple model for speciation due to Yule (1924).
- Constant rate branching process where
branching occurs uniformly across extant lineages
at a rate of $\lambda$ per lineage per unit time.
- Number of lineages $k$ evolves under the following master equation:
$$\dot{p}(k,t) = \lambda (k-1)p(k-1,t) - \lambda k p(k,t)$$
- Write this in chemical kinetics formalism:
$X \overset{\lambda}{\longrightarrow} 2X$
Birth-death(-sampling) priors
- Most of the recent groundwork on this is due to Tanja Stadler (ETH, Zürich).
- Generalization of Yule process to account for extinction and sampling.
- Linear speciation, extinction and sampling rates.
- Prior is over sampled trees: assumes unobserved full tree.
Coalescent priors
- Original theory due to Kingman (1980).
- Model the shape of gene trees within a species.
- Can be obtained as the limit of a number of population genetic models,
including the Wright-Fisher model and the Moran model.
- Provide an important link between tree shape and population size!
Drummond et al., TIEE, 2003
Multi-species coalescent priors
- Hierarchical models that embed gene trees within species trees.
- Species tree described using a birth-death prior.
- Gene trees described by coalescent priors for populations which
subdivide at speciation events.
Leliaert et al., Euro. J. Phycology, 2014
MCMC in Tree Space
Need to identify a set of proposals $q_j(x'|x)$ when $x$ is a point in
the space of rooted time trees.
Stopping criteria
- How can we tell when a phylogenetic MCMC calculation has reached equilibrium?
- How do we know when we've collected enough samples?
- One approach is to compute ESS (using empirical estimates of autocorrelation time) for each
parameter and a number of tree summary statistics (e.g. tree height and tree "length").
- Assume that once all of these scores are sufficiently high
(e.g. > 200) we have adequately sampled the posterior.
- Examine output of several independent (and independently
initialized chains. A necessary (not sufficient) condition for
convergence is that sample distributions converge as the chains
converge to equilibrium.
Are we there yet?
The AWTY application applies a number of different statistics to assess
the convergence of the tree state. It relies heavily on comparing
the result of multiple runs.
Post-processing: Parameter samples
Logs of individual parameters can be considered samples from the marginal posteriors of those parameters.
Post-processing: Tree samples
- Summary statistics (e.g. age of the MRCA of a pair of taxa) can be
computed and their marginal distributions assessed.
- A number of different approaches can be used to produce a "summary" tree meant
to represent the best guess at the true tree. (None of these algorithms are perfect!!)
- Strict consensus
- Include only clades that appear in ALL sampled trees.
- Majority rule consensus
- Include clades that appear in >50% of the sampled trees.
- Maximum clade credibility tree
- The sampled topology for which the product of the posterior clade probabilities is maximized.
In BEAST, summary trees are produced using the MCC tree method via the program TreeAnnotator.
Bayesian Phylogenetic Inference Software
Popular Bayesian phyogenetic inference software:
- MrBayes (Huelsenbeck and Ronquist, Bioinformatics, 2001)
- Early CLI program for phylogenetic inference.
- RevBayes (Höhna et al., Syst. Biol., 2016)
- R-like syntax for specifying phylogenetic models.
- BEAST/BEAST2 (Drummond and Rambaut, 2007; Bouckaert et al., 2014)
- XML specification of phylogenetic models. Extensible.
Some software implementing special models:
- MIGRATE
- Performs inference under the Structured Coalescent. (I.e. subpopulation sizes, migration rates, ancestral locations.)
- ClonalFrame/ClonalOrigin
- Infer bacterial Ancestral Recombination Graphs (generalizations of trees when recombination is present).
... there are many others!
Case 2: Bayesian inference of bacterial
ancestral recombination graphs
Why study bacterial phylogenetics?
- Bacteria play important roles (both positive and
negative) in the health of animals and
plants.
- Many bacteria possess interesting and
experimentally accessible evolutionary
dynamics.
- Bacterial genomes are measurably evolving over
relatively short study periods.
Cartoon bacterial population genetics
- Effect of events can be plasmid transfer, insertion or homologous recombination.
- Rate of recombination varies and is subject to selection.
- Focus on homologous recombination: only event which doesn't alter the length of the sequence
The Problem for Phylogenetic Inference
For many bacteria, the ratio between the recombination rate and the mutation rate is extremely high.
Existing solutions
- Pre-processing of data to identify and remove
non-vertically inherited material. (eg. START: Jolley et al., 2001)
Pros | Cons |
- Can use standard tools for
phylogenetic inference.
|
- Data is being thrown away.
- Ad hoc, may bias results.
|
- Explicit modelling of bacterial recombination.
(eg. ClonalFrame and ClonalOrigin: Didelot et al., 2007, 2010)
Pros | Cons |
- Can make use of all data.
- Can infer additional parameters such as recombination rates.
- May yield increased confidence in estimates.
|
- Models can be complex, with many parameters.
- Both computationally and statistically challenging.
- Existing implementations lack flexibility.
|
The Coalescent with Gene Conversion
Wiuf, 1999; Wiuf and Hein, 2000
Parameters |
$N(t)g$ | Coalescence rate |
$\rho_s$ | Recombination rate |
$\delta$ | Expected tract length |
Problem
The space of possible ancestral recombination graphs is extremely
large. Even two samples have infinitely many distinct ancestries!
Full ARG
Our Goal
To conduct Bayesian inference of the full ARG under the ClonalOrigin
model, optionally under non-trivial population size dynamics.
Inference under the ClonalOrigin Model
Inference follows the standard Bayesian phylogenetic tradition:
\begin{equation*}
f(G,N,\mu,\rho,\delta|A) \propto P_{F}(A|G,\mu)f_{CGC}(G|N,\rho,\delta)f_{\text{prior}}(N
,\mu,\rho,\delta)
\end{equation*}
where
- $A$ is the sequence alignment,
- $\mu$ are the substitution model parameters, and
- $G$ is the full sample genealogy including clonal frame
$T$ and $M$ conversions $\{C_i\}_{i\in[1\ldots M]}$.
The genealogy density under ClonalOrigin model can be expanded
\begin{equation*}
f_{CGC}(G|\rho',\delta,N)=\left(\prod_{i=1}^M f(C_i|T,N,\delta)\right)P(M|T,\rho)f_C(T|N)
\end{equation*}
Identifiability
Despite using a simplified model, an infinite number of ARGs still
possess the same likelihood given a sequence alignment.
Very important for an MCMC algorithm to propose state changes which minimize effect on likelihood.
MCMC State Proposal Strategy
- Attachment points and site tracts of new conversions drawn from prior.
Validation 1: Sampling from the prior
- It is trivial to sample directly from the approximate ARG prior
via coalescent simulation.
- We expect precise agreement between our MCMC algorithm and the
coalescent simulation when no data is involved.
One should alwayas test that a new sampler exactly converges to a known
distribution: this provides a strict necessary criterion for
correctness.
Validation 2: Inference from simulated data
True ARG:
Randomly-selected ARG from MCMC:
True ARG:
Summary ARG from MCMC:
Application to Escherichia coli
- Applied method to E. coli O157/O26 sequences collected by Tessy George (mEpiLab, Massey University, New Zealand).
- Sequences for 53 rMLST genes (21.1 kb total) from each of 23 human and cattle-derived isolates.
- Analyzed using fixed expected tract length parameter and informative prior on conversion rate parameter.
Application to Escherichia coli
Software implementation
tgvaughan.github.io/bacter
- BEAST 2 package that performs inference under the ClonalOrigin model.
- Joint inference of the clonal frame, recombinant edges and parameters.
- Can be combined with usual variety of substitution models, parametric population models and parameter priors.
- GUI analysis configuration via BEAUti.
Recombination and Demographic Inference
- Relationship used by
Li and Durbin, 2011
to infer human demographic history from pairs of autosomes.
Bayesian Skyline Plots in Bacter
- Bacter implements a non-parametric demographic history model inspired by the standard Bayesian Skyline Plot (Drummond et al., 2005).
- In principle can reconstruct non-trivial dynamics from very few genomes.
- In practice, this is limited by the computational expense of the algorithm.
Piecewise constant/linear population model
Inference example
Bacter limitations
- Computational complexity scales with the number of proposed conversions.
- This can be arbitrarily large, even for small sample sets.
- Cannot use the BEAGLE phylogenetic likelihood library to speed things up due
to peculiarities of ARG likelihood computation.
- MCMC algorithm used does not intelligently locate conversions.
- Looking at addressing this in the near future.
- Network summary algorithm can produce peculiar results.
- More research needs to be done to find a better algorithm.
Summary
- Bayesian phylogenetic inference is generally done by adding a tree
prior to the tree likelihood. This implicitly assumes neutral
evolution.
- Tree priors link the shape of the tree to parameters such as
speciation rates, extinction rates, sampling rates, population
sizes and migration rates.
- Development of these priors is a particularly rich area of current
research.
- Due to the immense size of the state spaces involved and the use of
distributions which are analytically intractable, almost all
Bayesian phylogenetic inference uses MCMC.
- Deciding on stopping criteria is difficult. Many necessary
requirements for convergence, but no sufficient requirements!
- Methods for summarizing sampled tree distributions exist, but may
be misleading. At the moment, sticking to summary statistics for
particular characteristics of interest is the only safe
option.