An MCMC algorithm for Bayesian inference of hard polytomies
Tim Vaughan, Oliver Pybus, Tanja Stadler
Stadler Group, D-BSSE, ETH Zürich
MCEB, 28th June, 2018

Recap: What are polytomies?

Polytomies are features of phylogenetic trees: internal nodes with more than two children, also known as multifurcations.

These can represent two different things:

  • Soft polytomies merely represent an inability to resolve the order of speciation/birth events due to insufficient data.
  • Hard polytomies represent a true or nearly true multifurcation of the parent species/transmission lineage.

We are interested in hard polytomies here.

When may polytomies occur?

  • Gene tree context:
    Some marine species are known to have "sweepstakes reproductive success", meaning that offspring distributions have extremely high variance.
  • Transmission tree context:
    Super-spreading events where single individuals "instantaneously" produce a large number of secondary infections.
  • Species trees:
    Simultaneous speciation events. Has been hypothesized several times in the literature.
  • Polytomies in sampled trees?

    Consider a population in which the distribution for the number of offspring $m$ is such that $P(m>2)>0$.

    • In complete trees, polytomies will appear whenever the number of offspring exceeds 2.
    • In sampled trees, polytomies will occur with vanishingly small probability.

    $\implies$ Require offspring distribution such that $P(m>fN)$ is significant.

    Modelling polytomy generation


    • Birth-death models:
      Augment exsting birth-death models to include higher order birth events: $$X \overset{\lambda_k}{\longrightarrow}(k+1) X$$
      $$X \overset{\mu}{\longrightarrow} 0$$
    • Coalescent models:
      The traditional coalescent excludes polytomies (almost surely). Under the WF model, this is because higher order coalescences occur with probability $O(1/N^2)$ or lower.

    The $\Lambda$-Coalescent

    • Introduced by Sagitov (1999) and Pitman (1999).
    • Given $n$ extant lineages, each $k$-tuple of lineages undergoes coalescence at a rate $$\lambda_{n,k} = \int_0^1 x^{k-2}(1-x)^{n-k}\Lambda(dx)$$ where $\Lambda$ is a distribution on $[0,1]$.
    • Total coalescence rate is therefore: $$\lambda_{\mathrm{tot}} = \sum_{k=2}^{n}\binom{n}{k}\lambda_{n,k}$$
    • Regular coalescent corresponds to the case when $\Lambda(x)=\delta(x)$.

    The $\Lambda$-Coalescent (ctd.)

    • Pitman (1999) derived the coalescent rates using a very general argument.
    • Requires only that the inclusion of lineages in a merger event is determined by exchangeable Bernoulli trials.
    • Together with a consistency requirement: $$\lambda_{n,k} = \lambda_{n+1,k} + \lambda_{n+1,k+1}$$ and de Finetti's theorem, the rates must be those quoted on the previous slide.
    • Equivalent to selecting $k-2$ of each of the $n-2$ remaining lineages with probability $x\sim\Lambda(x)$.
    • Note that these "rates" are probabilities conditional on a pairwise coalescence occuring. The event rate per generation is $c_N$: the per-generation pairwise coalescence probability.

    The Wright-Fisher Process

    Every generation constrained to have the same size.

    The Galton-Watson Process

    Every individual produces an i.i.d. number of offspring, with $\mathbb{E}[m]>1$.

    Constraint: Each new generation is sampled (without replacement) from the intermediate generation.

    The $\beta$-Coalescent

    When the probability of $\geq n$ offspring decays as $1/n^{\alpha}$, the coalescent limit yields (Schweinsberg 2003) a $\Lambda$-coalescent with: $$\Lambda(x) = \frac{x^{1-\alpha}(1-x)^{\alpha-1}}{B(2-\alpha,\alpha)}$$ where $B(q,r) = \Gamma(q)\Gamma(r)/\Gamma(q+r)$

    This implies that the coalescent rates become $$\lambda_{n,k} = \frac{B(k-\alpha,n-k+\alpha)}{B(2-\alpha,\alpha)}$$

    Taking the limit $\alpha\rightarrow 2$ yields the Kingman coalescent.

    Time is measured in units of $N/\sigma^2\equiv N_e$ generations, where $\sigma^2$ is the variance of the offspring distribution.

    Example $\beta$-coalescent trees

    $\alpha=2.0$ (regular coalescent):

    Example $\beta$-coalescent trees

    $\alpha=1.5$

    Example $\beta$-coalescent trees

    $\alpha=1.0$

    The $\beta$-coalescent density

    • Use the same cumlulative rate distributions for this computation.
    • Transforming time $dt = dt'/N_e(t')$ allows us to account for population dynamics.
    • For a given tree, the likelihood for $\alpha$ seems close to true values.

    Inference Strategy

    Usual phylogenetic posterior:

    $$P(T,\mu,\theta,\alpha|A) = \frac{1}{P(A)} P(A|T,\mu)P(T|N_e,\alpha)P(\mu,N_e,\alpha)$$
    • $P(T|N_e,\alpha)$ is the $\Lambda$-coalescent tree prior: $$\prod_i\frac{\lambda_{n_i,k_i}}{N_e}\exp\left[-\frac{\tau_i}{N_e}\sum_{k'=2}^{n_i}\binom{n_i}{k'}\lambda_{n_i,k'}\right]$$
    • $P(\mu,N_e,\alpha)=P(\mu)P(N_e)P(\alpha)$ are the parameter priors.

    Space of trees with (potential) polytomes $\neq$ space of binary trees.

    Need a reversible-jump MCMC algorithm to sample time trees under this model.

    Previous work

    Proposal distributions:
    Subtree Prune+Regraft

    Tuning parameters:

    • Exponential rate for root attachments.
    • Coalescent attachment probability.

    Proposal distributions:
    Narrow Exchange

    Proposal distributions:
    Subtree Slide

    Tuning parameters:

    • Rate for non-coalescent attachment..
    • Coalescent attachment probability.

    Proposal distributions:
    Expand/Collapse

    Tuning parameters:

    • Exponential rate for expantion above root.

    Validation: Sampling from prior

    Validation: Inference from simulated data

    How best to summarize tree distributions with polytomies?

    Validation: Inference from simulated data

    Investigation: does structure confound inference of $\alpha$?

    Simulation study:

    • Simulated trees with 60 taxa under the 3-deme structured coalescent (equal population sizes and migration rates, even leaf distribution among types).
    • Varied migration rate from very low (highly structured trees) to very high (no discernable structure).
    • Simulated sequences down each tree to produce alignments.
    • Analyzed each alignment under the beta-coalescent.

    Example structured trees

    Inferred beta-coalescent parameters

    $\implies$ little/no confusion between structure and multifurcation?

    BEAST package

    • Pitchfork: github.com/tgvaughan/pitchfork
    • Easy BEAUti interface, integrates with other models easily.
    • Infer $\alpha$ and (parametric) population dynamics jointly together with multifurcating trees.
    • Help identify superspreading in transmission trees?

    Application to Ebola

    • Gytis Dudas et al. (Nature, 2017) have published a curated set of 1610 Ebola virus genomes sampled during the 2014-2015 West African outbreak.
    • Restrict analysis to oubtreak within the Kailahun region.
    • How do results from inference under the Beta-coalescent differ from results inferred under classic birth-death models?

    Application to Ebola

    Tree inferred under stochastic SIR birth-death model.

    Application to Ebola

    Tree inferred under constant-$N_e$ Beta-coalescent model.

    Application to Ebola

    Steps necessary for "serious" phylodynamic analyses

    • Account for structure and population dynamics.
    • Allow for mixture of offspring distributions: $$\Lambda(x) = p_0\delta(x) + (1-p_0)Beta(2-\alpha,\alpha)(x)$$
    • Apply a spike and slab prior to $p_0$: allows for posterior probability on the presence of non-Kingman dynamics.

    Summary

    • Populations where individuals may occasionally produce a significant portion of the following generation lead to the possibility of observable polytomies in sampled trees.
    • Bayesian phylogenetic inference under such models can be performed using a reversible-jump MCMC algorithm.
    • Numerical results suggest that it is possible to recover parameters of the offsping distribution from genetic data.
    • Analysis of Ebola virus data shows that this model predicts transmission trees with multiple polytomies, although interpretation requires care.

    Future plan: Development of birth-death model

    Consider the following birth/death model: $$X \overset{\lambda_m(x)}{\longrightarrow} (m+1)X$$ $$X \overset{\mu}{\longrightarrow} 0$$ where $$\lambda_m(x)=\lambda e^{-x}\frac{x^m}{m!}$$

    The probability of an individual at time $t$ in the past giving rise to $i$ lineages in the present $p_i(t)$ is: \begin{align} \frac{d}{dt}p_0(t) &= -(\lambda+\mu)p_0(t) + \mu + \lambda e^{x(p_0(t)-1)}p_0(t)\\ \frac{d}{dt}p_1(t) &= -(\lambda+\mu)p_1(t) + \lambda(x p_0(t)+1)e^{x(p_0(t)-1)}p_1(t) \end{align}

    Sadly I can't find an analytical solution to these. Thus we need numerical integration.

    Acknowledgements

    Tanja Stadler and the Computational Evolution (cEvo) Group


    Patrick Hoscheit

    Oliver Pybus