An MCMC algorithm for Bayesian inference of hard polytomies
Tim Vaughan, Oliver Pybus, Tanja Stadler
Stadler Group, D-BSSE, ETH Zürich
New Zealand Phylogenomics Meeting, 12th February, 2019

Recap: What are polytomies?

Polytomies are features of phylogenetic trees: internal nodes with more than two children, also known as multifurcations.

These can represent two different things:

  • Soft polytomies merely represent an inability to resolve the order of speciation/birth events due to insufficient data.
  • Hard polytomies represent a true or nearly true multifurcation of the parent species/transmission lineage.

We are interested in hard polytomies here.

When may polytomies occur?

Gene tree context:
Some marine species are known to have "sweepstakes reproductive success", meaning that offspring distributions have extremely high variance.
Transmission tree context:
Super-spreading events where single individuals "instantaneously" produce a large number of secondary infections.
Species trees:
Simultaneous speciation events. Has been hypothesized several times in the literature.

Polytomies in sampled trees?

Consider a population in which the distribution for the number of offspring $m$ is such that $P(m>2)>0$.

  • In complete trees, polytomies will appear whenever the number of offspring exceeds 2.
  • In sampled trees, polytomies will occur with vanishingly small probability.

$\implies$ Require offspring distribution such that number of sampled polytomies is significant.

Modelling polytomy generation


  • Birth-death models:
    Augment exsting birth-death models to include higher order birth events: $$X \overset{\lambda_m}{\longrightarrow}(m+1) X$$
    $$X \overset{\mu}{\longrightarrow} 0$$
  • Coalescent models:
    The traditional coalescent excludes polytomies (almost surely). Under the WF model, this is because higher order coalescences occur with probability $O(1/N^2)$ or lower.

The $\Lambda$-Coalescent

  • Introduced by Sagitov (1999) and Pitman (1999).
  • Given $n$ extant lineages, each $k$-tuple of lineages undergoes coalescence at a rate $$\lambda_{n,k} = \int_0^1 x^{k-2}(1-x)^{n-k}\Lambda(dx)$$ where $\Lambda$ is a distribution on $[0,1]$.
  • Total coalescence rate is therefore: $$\lambda_{\mathrm{tot}} = \sum_{k=2}^{n}\binom{n}{k}\lambda_{n,k}$$
  • Regular coalescent corresponds to the case when $\Lambda(x)=\delta(x)$.

The $\Lambda$-Coalescent (ctd.)

  • Pitman (1999) derived the coalescent rates using a very general argument.
  • Requires only that the inclusion of lineages in a merger event is determined by exchangeable Bernoulli trials.
  • Together with a consistency requirement: $$\lambda_{n,k} = \lambda_{n+1,k} + \lambda_{n+1,k+1}$$ and de Finetti's theorem, the rates must be those quoted on the previous slide.
  • Equivalent to selecting $k-2$ of each of the $n-2$ remaining lineages with probability $x\sim\Lambda(x)$.
  • Note that these "rates" are probabilities conditional on a pairwise coalescence occuring. The event rate per generation is $c_N$: the per-generation pairwise coalescence probability.

The Wright-Fisher Process

Every generation constrained to have the same size.

The Galton-Watson Process

Every individual produces an i.i.d. number of offspring, with $\mathbb{E}[m]>1$.

Constraint: Each new generation is sampled (without replacement) from the intermediate generation.

The $\beta$-Coalescent

When the probability of $\geq n$ offspring decays as $1/n^{\alpha}$, the coalescent limit yields (Schweinsberg 2003) a $\Lambda$-coalescent with: $$\Lambda(x) = \frac{x^{1-\alpha}(1-x)^{\alpha-1}}{B(2-\alpha,\alpha)}$$ where $B(q,r) = \Gamma(q)\Gamma(r)/\Gamma(q+r)$

This implies that the coalescent rates become $$\lambda_{n,k} = \frac{B(k-\alpha,n-k+\alpha)}{B(2-\alpha,\alpha)}$$

Taking the limit $\alpha\rightarrow 2$ yields the Kingman coalescent.

Time is measured in units of $N/\sigma^2\equiv N_e$ generations, where $\sigma^2$ is the variance of the offspring distribution.

Example $\beta$-coalescent trees

$\alpha=2.0$ (regular coalescent):

Example $\beta$-coalescent trees

$\alpha=1.5$

Example $\beta$-coalescent trees

$\alpha=1.0$

The $\beta$-coalescent density

  • Transforming time $dt = dt'/N_e(t')$ allows us to account for population dynamics.
  • For a given tree, the likelihood for $\alpha$ seems close to true values.

Inference Strategy

Usual phylogenetic posterior:

$$P(T,\mu,\theta,\alpha|A) = \frac{1}{P(A)} P(A|T,\mu)P(T|N_e,\alpha)P(\mu,N_e,\alpha)$$
  • $P(T|N_e,\alpha)$ is the $\Lambda$-coalescent tree prior: $$\prod_i\frac{\lambda_{n_i,k_i}}{N_e}\exp\left[-\frac{\tau_i}{N_e}\sum_{k'=2}^{n_i}\binom{n_i}{k'}\lambda_{n_i,k'}\right]$$
  • $P(\mu,N_e,\alpha)=P(\mu)P(N_e)P(\alpha)$ are the parameter priors.

Space of trees with (potential) polytomes $\neq$ space of binary trees.

Need a reversible-jump MCMC algorithm to sample time trees under this model.

Previous work

Proposal distributions:
Subtree Prune+Regraft

Tuning parameters:

  • Exponential rate for root attachments.
  • Coalescent attachment probability.

Proposal distributions:
Narrow Exchange

Proposal distributions:
Subtree Slide

Tuning parameters:

  • Rate for non-coalescent attachment..
  • Coalescent attachment probability.

Proposal distributions:
Expand/Collapse

Tuning parameters:

  • Exponential rate for expantion above root.

Validation: Sampling from prior

Validation: Inference from simulated data

Validation: Inference from simulated data

Investigation: does structure confound inference of $\alpha$?

Simulation study:

  • Simulated trees with 60 taxa under the 3-deme structured coalescent (equal population sizes and migration rates, even leaf distribution among types).
  • Varied migration rate from very low (highly structured trees) to very high (no discernable structure).
  • Simulated sequences down each tree to produce alignments.
  • Analyzed each alignment under the beta-coalescent.

Example structured trees

Inferred beta-coalescent parameters

$\implies$ little/no confusion between structure and multifurcation?

BEAST package

  • Pitchfork: github.com/tgvaughan/pitchfork
  • Easy BEAUti interface, integrates with other models easily.
  • Infer $\alpha$ and (parametric) population dynamics jointly together with multifurcating trees.
  • Help identify superspreading in transmission trees?

Application to Ebola

  • Gytis Dudas et al. (Nature, 2017) have published a curated set of 1610 Ebola virus genomes sampled during the 2014-2015 West African outbreak.
  • Restrict analysis to oubtreak within the Kailahun region.
  • How do results from inference under the Beta-coalescent differ from results inferred under classic birth-death models?

Application to Ebola

Tree inferred under stochastic SIR birth-death model.

Application to Ebola

Tree inferred under constant-$N_e$ Beta-coalescent model.

Application to Ebola

Steps necessary for "serious" phylodynamic analyses

  • Account for structure and population dynamics.
  • Allow for mixture of offspring distributions: $$\Lambda(x) = p_0\delta(x) + (1-p_0)Beta(2-\alpha,\alpha)(x)$$
  • Apply a spike and slab prior to $p_0$: allows for posterior probability on the presence of non-Kingman dynamics.
  • These steps are complicated by the dependence of the $\beta$-coalescent timescale on the offspring distribution through $\alpha$.

Summary

  • Populations where individuals may occasionally produce a significant portion of the following generation lead to the possibility of observable polytomies in sampled trees.
  • Bayesian phylogenetic inference under such models can be performed using a reversible-jump MCMC algorithm.
  • Numerical results suggest that it is possible to recover parameters of the offsping distribution from genetic data.
  • Analysis of Ebola virus data shows that this model predicts transmission trees with multiple polytomies, although interpretation requires care.

Work in progress: Development of birth-death model

Consider the following birth/death model: $$X \overset{\lambda_m(x)}{\longrightarrow} (m+1)X$$ $$X \overset{\mu}{\longrightarrow} 0$$ where $$\lambda_m(x)=\lambda e^{-x}\frac{x^m}{m!}$$

The probability of an individual at time $t$ in the past giving rise to $i$ lineages in the present $p_i(t)$ is: \begin{align} \frac{d}{dt}p_0(t) &= -(\lambda+\mu)p_0(t) + \mu + \lambda e^{x(p_0(t)-1)}p_0(t)\\ \frac{d}{dt}p_1(t) &= -(\lambda+\mu)p_1(t) + \lambda(x p_0(t)+1)e^{x(p_0(t)-1)}p_1(t) \end{align}

Sadly I can't find an analytical solution to these. Thus we need numerical integration.

Acknowledgements

Tanja Stadler and the Computational Evolution (cEvo) Group


Patrick Hoscheit

Oliver Pybus