An MCMC algorithm for Bayesian inference of hard polytomies

Tim Vaughan, Oliver Pybus, Tanja Stadler

Stadler Group, D-BSSE, ETH Zürich

New Zealand Phylogenomics Meeting, 12th February, 2019

Recap: What are polytomies?

Polytomies are features of phylogenetic trees: internal nodes with more than two children, also known as multifurcations.

These can represent two different things:

Soft polytomies merely represent an inability to resolve the order of speciation/birth events due to insufficient data.
Hard polytomies represent a true or nearly true multifurcation of the parent species/transmission lineage.

We are interested in hard polytomies here.

When may polytomies occur?

Gene tree context:: Some marine species are known to have "sweepstakes reproductive success", meaning that offspring distributions have extremely high variance.
Transmission tree context:: Super-spreading events where single individuals "instantaneously" produce a large number of secondary infections.
Species trees:: Simultaneous speciation events. Has been hypothesized several times in the literature.

Polytomies in sampled trees?

Consider a population in which the distribution for the number of offspring $m$ is such that $P(m>2)>0$.

In complete trees, polytomies will appear whenever the number of offspring exceeds 2.
In sampled trees, polytomies will occur with vanishingly small probability.

$\implies$ Require offspring distribution such that number of sampled polytomies is significant.

Modelling polytomy generation

Birth-death models:
Augment exsting birth-death models to include higher order birth events: $$X \overset{\lambda_m}{\longrightarrow}(m+1) X$$
$$X \overset{\mu}{\longrightarrow} 0$$
Coalescent models:
The traditional coalescent excludes polytomies (almost surely). Under the WF model, this is because higher order coalescences occur with probability $O(1/N^2)$ or lower.

The $\Lambda$-Coalescent

Introduced by Sagitov (1999) and Pitman (1999).
Given $n$ extant lineages, each $k$-tuple of lineages undergoes coalescence at a rate $$\lambda_{n,k} = \int_0^1 x^{k-2}(1-x)^{n-k}\Lambda(dx)$$ where $\Lambda$ is a distribution on $[0,1]$.
Total coalescence rate is therefore: $$\lambda_{\mathrm{tot}} = \sum_{k=2}^{n}\binom{n}{k}\lambda_{n,k}$$
Regular coalescent corresponds to the case when $\Lambda(x)=\delta(x)$.

The $\Lambda$-Coalescent (ctd.)

Pitman (1999) derived the coalescent rates using a very general argument.
Requires only that the inclusion of lineages in a merger event is determined by exchangeable Bernoulli trials.
Together with a consistency requirement: $$\lambda_{n,k} = \lambda_{n+1,k} + \lambda_{n+1,k+1}$$ and de Finetti's theorem, the rates must be those quoted on the previous slide.
Equivalent to selecting $k-2$ of each of the $n-2$ remaining lineages with probability $x\sim\Lambda(x)$.
Note that these "rates" are probabilities conditional on a pairwise coalescence occuring. The event rate per generation is $c_N$: the per-generation pairwise coalescence probability.

The Wright-Fisher Process

Every generation constrained to have the same size.

The Galton-Watson Process

Every individual produces an i.i.d. number of offspring, with $\mathbb{E}[m]>1$.

Constraint: Each new generation is sampled (without replacement) from the intermediate generation.

The $\beta$-Coalescent

When the probability of $\geq n$ offspring decays as $1/n^{\alpha}$, the coalescent limit yields (Schweinsberg 2003) a $\Lambda$-coalescent with: $$\Lambda(x) = \frac{x^{1-\alpha}(1-x)^{\alpha-1}}{B(2-\alpha,\alpha)}$$ where $B(q,r) = \Gamma(q)\Gamma(r)/\Gamma(q+r)$

This implies that the coalescent rates become $$\lambda_{n,k} = \frac{B(k-\alpha,n-k+\alpha)}{B(2-\alpha,\alpha)}$$

Taking the limit $\alpha\rightarrow 2$ yields the Kingman coalescent.

Time is measured in units of $N/\sigma^2\equiv N_e$ generations, where $\sigma^2$ is the variance of the offspring distribution.

Example $\beta$-coalescent trees

$\alpha=2.0$ (regular coalescent):

Example $\beta$-coalescent trees

$\alpha=1.5$

Example $\beta$-coalescent trees

$\alpha=1.0$

The $\beta$-coalescent density

Transforming time $dt = dt'/N_e(t')$ allows us to account for population dynamics.
For a given tree, the likelihood for $\alpha$ seems close to true values.

Inference Strategy

Usual phylogenetic posterior:

$$P(T,\mu,\theta,\alpha|A) = \frac{1}{P(A)} P(A|T,\mu)P(T|N_e,\alpha)P(\mu,N_e,\alpha)$$

$P(T|N_e,\alpha)$ is the $\Lambda$-coalescent tree prior: $$\prod_i\frac{\lambda_{n_i,k_i}}{N_e}\exp\left[-\frac{\tau_i}{N_e}\sum_{k'=2}^{n_i}\binom{n_i}{k'}\lambda_{n_i,k'}\right]$$
$P(\mu,N_e,\alpha)=P(\mu)P(N_e)P(\alpha)$ are the parameter priors.

Space of trees with (potential) polytomes $\neq$ space of binary trees.

Need a reversible-jump MCMC algorithm to sample time trees under this model.

Previous work

Lewis, Holder and Holsinger, Sys Bio, 2005 (unrooted trees)
Guindon, Guo and Welch, Theor Pop Biol, 2016 (spatial continuum model)

Proposal distributions:
Subtree Prune+Regraft

Tuning parameters:

Exponential rate for root attachments.
Coalescent attachment probability.

Proposal distributions:
Narrow Exchange

Proposal distributions:
Subtree Slide

Tuning parameters:

Rate for non-coalescent attachment..
Coalescent attachment probability.

Proposal distributions:
Expand/Collapse

Tuning parameters:

Exponential rate for expantion above root.

Validation: Sampling from prior

Validation: Inference from simulated data

Investigation: does structure confound inference of $\alpha$?

Simulation study:

Simulated trees with 60 taxa under the 3-deme structured coalescent (equal population sizes and migration rates, even leaf distribution among types).
Varied migration rate from very low (highly structured trees) to very high (no discernable structure).
Simulated sequences down each tree to produce alignments.
Analyzed each alignment under the beta-coalescent.

Example structured trees

Inferred beta-coalescent parameters

$\implies$ little/no confusion between structure and multifurcation?

BEAST package

Pitchfork: github.com/tgvaughan/pitchfork
Easy BEAUti interface, integrates with other models easily.
Infer $\alpha$ and (parametric) population dynamics jointly together with multifurcating trees.
Help identify superspreading in transmission trees?

Application to Ebola

Gytis Dudas et al. (Nature, 2017) have published a curated set of 1610 Ebola virus genomes sampled during the 2014-2015 West African outbreak.

Restrict analysis to oubtreak within the Kailahun region.
How do results from inference under the Beta-coalescent differ from results inferred under classic birth-death models?

Application to Ebola

Tree inferred under stochastic SIR birth-death model.

Application to Ebola

Tree inferred under constant-$N_e$ Beta-coalescent model.

Application to Ebola

Steps necessary for "serious" phylodynamic analyses

Account for structure and population dynamics.
Allow for mixture of offspring distributions: $$\Lambda(x) = p_0\delta(x) + (1-p_0)Beta(2-\alpha,\alpha)(x)$$
Apply a spike and slab prior to $p_0$: allows for posterior probability on the presence of non-Kingman dynamics.
These steps are complicated by the dependence of the $\beta$-coalescent timescale on the offspring distribution through $\alpha$.

Summary

Populations where individuals may occasionally produce a significant portion of the following generation lead to the possibility of observable polytomies in sampled trees.
Bayesian phylogenetic inference under such models can be performed using a reversible-jump MCMC algorithm.
Numerical results suggest that it is possible to recover parameters of the offsping distribution from genetic data.
Analysis of Ebola virus data shows that this model predicts transmission trees with multiple polytomies, although interpretation requires care.

Work in progress: Development of birth-death model

Consider the following birth/death model: $$X \overset{\lambda_m(x)}{\longrightarrow} (m+1)X$$ $$X \overset{\mu}{\longrightarrow} 0$$ where $$\lambda_m(x)=\lambda e^{-x}\frac{x^m}{m!}$$

The probability of an individual at time $t$ in the past giving rise to $i$ lineages in the present $p_i(t)$ is: \begin{align} \frac{d}{dt}p_0(t) &= -(\lambda+\mu)p_0(t) + \mu + \lambda e^{x(p_0(t)-1)}p_0(t)\\ \frac{d}{dt}p_1(t) &= -(\lambda+\mu)p_1(t) + \lambda(x p_0(t)+1)e^{x(p_0(t)-1)}p_1(t) \end{align}

Sadly I can't find an analytical solution to these. Thus we need numerical integration.

Acknowledgements

Tanja Stadler and the Computational Evolution (cEvo) Group

Patrick Hoscheit

Oliver Pybus

Recap: What are polytomies?

When may polytomies occur?

Polytomies in sampled trees?

Modelling polytomy generation

The $\Lambda$-Coalescent

The $\Lambda$-Coalescent (ctd.)

The Wright-Fisher Process

The Galton-Watson Process

The $\beta$-Coalescent

Example $\beta$-coalescent trees

Example $\beta$-coalescent trees

Example $\beta$-coalescent trees

The $\beta$-coalescent density

Inference Strategy

Previous work

Proposal distributions: Subtree Prune+Regraft

Proposal distributions: Narrow Exchange

Proposal distributions: Subtree Slide

Proposal distributions: Expand/Collapse

Validation: Sampling from prior

Validation: Inference from simulated data

Validation: Inference from simulated data

Investigation: does structure confound inference of $\alpha$?

Example structured trees

Inferred beta-coalescent parameters

BEAST package

Application to Ebola

Application to Ebola

Application to Ebola

Application to Ebola

Steps necessary for "serious" phylodynamic analyses

Summary

Work in progress: Development of birth-death model

Acknowledgements

Proposal distributions:
Subtree Prune+Regraft

Proposal distributions:
Narrow Exchange

Proposal distributions:
Subtree Slide

Proposal distributions:
Expand/Collapse