Phylogenetic Treatment of Recombination
With a particular focus on Bacterial Recombination
Tim Vaughan
cEvo group, D-BSSE, The ETH Zurich
Taming the BEAST, 22nd June, 2018

Recombination primer

  • Occurs in eukaryotes, bacteria and viruses.
  • Eukaryotic recombination occurs during meiosis via chromosomal crossover.
  • Bacterial recombination occurs via
    • phage-mediated transduction,
    • natural transformation, and
    • conjugation.
  • Viral recombination occurs when multiple strains infect a single cell. (Example: reassortment in segmented viral genomes.)
commons.wikimedia.org

Recombination and Phylogenetics

Ignoring recombination when present:

  • Biases inference of internal node heights times.
  • Biases inference of topology.
  • Poor mixing due to badly fitting model.

Relationship to Species Phylogenies

Accounting for Recombination

  1. Pre-process of data to identify and remove non-vertically inherited material.
    Pros Cons
    • Can use standard tools for phylogenetic inference.
    • Data is being thrown away.
    • Ad hoc, may bias results.
  2. Explicitly model recombination processes.
    Pros Cons
    • Can make use of all data.
    • Can infer additional parameters such as recombination rates.
    • May yield increased confidence in estimates.
    • Models can be complex, with many parameters.
    • Both computationally and statistically challenging.

Wright-Fisher with Recombination

  • Consider an WF model with male and female diploid individuals.
  • Focus on a small segment of the genome.
  • Each child selects 1 male and 1 female parent uniformly at random.
  • With probability $r$ (which depends on the segment size) the two chromosome pairs of a parent are recombined before being passed on.

Wright-Fisher model with recombination

  • Since the specific pairing of chromosomes only matters over a single generation, in the long term the haploid approximation is very good:
  • Each child in $i+1$ selects a parent uniformly at random from generation $i$.
  • With probability $r$ an additional parent is selected.
  • In this case, a breakpoint chosen uniformly on the chromosome, and everything to either the left or right replaced by the homologous section of the second parent's chromosome.
  • (Equivalent to selecting 2 parents for each child and letting the resulting sequences recombine with some probability.)

Model 1: Coalescent with Recombination

For fixed recombination rate $\rho=r/g$ in the limit $r\ll 1$, $g\ll 1$ and $N\gg 1$, the genealogical process is the coalescent with recombination (Hudson, 1983):

  • Coalescence rate: $T_c(k)=\binom{k}{2}\frac{1}{N}$
  • Recombination rate: $T_r(k)=\rho k$.
  • Recombination break points chosen uniformly along sequence: everything to the left descends from one parent, everything else to the other.
  • Each site possess a local tree.
  • Local trees may find MRCAs (grey sites) before grand (G)MRCA of the process.

CwR example ARG

Coalescent with gene conversion

A gene conversion event refers to the replacement of a single contiguous stretch of sequence with a homologous stretch from a different parent. Incorporated into a modified CwR process by Wiuf and Hein (2000).

  • Allows similar patterns of site ancestry but with fewer events.
  • Model applicable to prokaryotic recombination.
  • Coalescence and recombination rates as for CwR.
  • Conversions initiate at randomly chosen starting sites and extend for $d$ sites where $P(d)=(1-\delta^{-1})^{d-1}\delta^{-1}$ (geometric distribution).

Bayesian inference of ARGs

Can easily write down an expression for the posterior distribution for the ARG given an alignment:

$$P(G,\rho,N,\mu|A) = \frac{1}{P(A)}P(A|G,\mu)P(G|\rho,N)P(\rho,N,\mu)$$
  • $G$ is the recombination graph,
  • $\rho$ is the recombination rate.

In practice, this is non-trivial since:

  1. the likelihood for $G$, $P(A|G,\mu)$, is invariant under many features of $G$,
  2. the likelihood surface often contains many distinct peaks, and
  3. the state space of ARGs is huge: much larger than for trees.

Despite this, many approximate algorithms exist.

Algorithms for Bayesian ARG inference

Sequentially Markov Coalescent (SMC)

  • Sites between breakpoints exponentially distributed with rate $\rho T$ where $T$ is the total length of the current local tree.
  • Neglects some possible recombinations, e.g. those that do not affect the data.

Population inference using SMC

  • Li & Durbin (2011) developed an SMC-based HMM on pairs of alignments.
  • Baum-Welch used to estimate parameters and hidden states (local TMRCAs).
  • TMRCA distribution used to produce estimates of population size dynamics.

Bacteria

  • Bacteria play critical roles in the health of animals and plants.
  • Many bacteria possess interesting and experimentally accessible evolutionary dynamics.
  • Bacterial genomes are measurably evolving over relatively short study periods.
    Drummond & Rambaut, TIEE (2003)
Rainey & Travisano, Nature (1998)

Cartoon bacterial model

  • Effect of events can be plasmid transfer, insertion or homologous recombination.
  • Rate of recombination varies and is subject to selection.
  • Focus on homologous recombination: only event which doesn't alter the length of the sequence.

Bacterial Recombination Rates

For many bacteria, the ratio between the recombination rate and the mutation rate is extremely high.

Vos and Didelot, The ISME Journal (2009)

The Coalescent with Gene Conversion

Wiuf, 1999; Wiuf and Hein, 2000
Parameters
$1/N(t)$ Coalescence rate
$\rho_s$ Recombination rate
$\delta$ Expected tract length
Problem
The space of possible ancestral recombination graphs is extremely large. Even two samples have infinitely many distinct ancestries!

Full ARG


 

Approximation 1: ClonalFrame


Didelot et al., 2007 (ClonalFrame); Didelot and Wilson, 2015 (ClonalFrameML)

Approximation 2: ClonalOrigin


Didelot et al., 2010 (ClonalOrigin); Ansari and Didelot, 2014 (ClonalOrigin')
Our Goal
To conduct Bayesian inference of the full ARG under the ClonalOrigin model, optionally under non-trivial population size dynamics.

Inference under the ClonalOrigin Model

Inference follows the standard Bayesian phylogenetic tradition:

\begin{equation*} f(G,N,\mu,\rho,\delta|A) \propto P_{F}(A|G,\mu)f_{CGC}(G|N,\rho,\delta)f_{\text{prior}}(N ,\mu,\rho,\delta) \end{equation*}

where

  • $A$ is the sequence alignment,
  • $\mu$ are the substitution model parameters, and
  • $G$ is the full sample genealogy including clonal frame
    $T$ and $M$ conversions $\{C_i\}_{i\in[1\ldots M]}$.

The genealogy density under ClonalOrigin model can be expanded

\begin{equation*} f_{CGC}(G|\rho',\delta,N)=\left(\prod_{i=1}^M f(C_i|T,N,\delta)\right)P(M|T,\rho)f_C(T|N) \end{equation*}

Identifiability

Despite using a simplified model, an infinite number of ARGs still possess the same likelihood given a sequence alignment.

Very important for an MCMC algorithm to propose state changes which minimize effect on likelihood.

MCMC State Proposal Strategy

  • Attachment points and site tracts of new conversions drawn from prior.

Validation 1: Sampling from the prior

  • It is trivial to sample directly from the approximate ARG prior via coalescent simulation.
  • We expect precise agreement between our MCMC algorithm and the coalescent simulation when no data is involved.
One should always test that a new sampler exactly converges to a known distribution: this provides a strict necessary criterion for correctness.

Validation 2: Inference from simulated data

True ARG:

Randomly-selected ARG from MCMC:

Summary networks

True ARG:

Summary ARG from MCMC:

Application to Escherichia coli

  • Applied method to E. coli O157/O26 sequences collected by Tessy George (mEpiLab, Massey University, New Zealand).
  • Sequences for 53 rMLST genes (21.1 kb total) from each of 23 human and cattle-derived isolates.
  • Analyzed using fixed expected tract length parameter and informative prior on conversion rate parameter.

Application to Escherichia coli

Software implementation


tgvaughan.github.io/bacter

  • BEAST 2 package that performs inference under the ClonalOrigin model.
  • Joint inference of the clonal frame, recombinant edges and parameters.
  • Can be combined with usual variety of substitution models, parametric population models and parameter priors.
  • GUI analysis configuration via BEAUti.

Bayesian Skyline Plots in Bacter

  • Bacter implements a non-parametric demographic history model inspired by the standard Bayesian Skyline Plot (Drummond et al., 2005).
  • In principle can reconstruct non-trivial dynamics from very few genomes.
  • In practice, this is limited by the computational expense of the algorithm.

Piecewise constant/linear population model

Inference example

Bacter limitations

  • Computational complexity scales with the number of proposed conversions.
    • This can be arbitrarily large, even for small sample sets.
    • Cannot use the BEAGLE phylogenetic likelihood library to speed things up due to peculiarities of ARG likelihood computation.

  • MCMC algorithm used does not intelligently locate conversions.
    • Looking at addressing this in the near future.

  • Network summary algorithm can produce peculiar results.
    • More research needs to be done to find a better algorithm.


Summary

  1. It is important to be aware of the role recombination plays in shaping the phylogeny of a set of samples.
  2. Recombination networks are distinct from (nested within!) species networks.
  3. Network inference is both statistically and computationally demanding.
  4. Model-based methods for inferring recombination networks exist.
  5. Within BEAST 2, the available method is Bacter, which focuses on bacterial recombination.
  6. This method will be taught in the following tutorial!