Bayesian inference using Markov chain Monte Carlo
Part 1: Foundations and Implementation
Tim Vaughan
Computational Evolution Group (cEvo)
Department of Biosystems Science and Engineering, ETH Zurich
Bayesian Statistics Mini-course
Univerisity of Hohenheim, 2018

Introduction

Bayesian inference

  • Bayesian inference is enabled by the Bayesian interpretation of probabilities:
    • Probabilities are just as applicable to hypotheses as they are to data.
    • Inference of models and/or model parameters can be conducted using the standard calculus of probabilities (sum rule, product rule, etc).
  • In contrast, sampling theory approaches assign probabilities only to data:
    • Inference of models and/or parameters is a distinct procedure.

The Problem

Suppose we have a vector $\vec{d}$ representing some collected data. Assuming a model $M$, what do the data tell us about the parameters $\vec{\theta}_M$ of this model?

The Bayesian solution is simple to derive and tremendously intuitive:

$$P(\vec{\theta}_M|\vec{d},M) = \frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M)}{P(\vec{d}|M)}$$

For models with large parameter space volumes, actually evaluating this probability for a particular $\vec{\theta}_M$ is HARD.

$$P(\vec{d}|M)=\sum_{\vec{\theta}_M} P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$

The Problem (Continued)

Idea due to MacKay, 2003

A practical example from phylogenetics

The posterior probability for a particular time tree $T$ given a multiple sequence alignment $A$ and a neutral substitution model and a coalescent prior on tree space is

$$P(T,\mu,\theta|A) = \frac{P(A|T,\mu)P(T|\theta)P(\mu,\theta)}{P(A)}$$

where $\mu$ and $\theta$ are substitution model and tree prior parameters.

How difficult is it to compute the denominator?

$$P(A)=\sum_{T,\mu,\theta}P(A|T,\mu)P(T|\theta)P(\mu,\theta)$$

How big is tree space?

The number $N$ of distinct labelled rooted tree topologies grows rapidly with the number $m$ of leaves (i.e. sequences):

$$N_m = \frac{(2m-3)!}{2^{m-2}(m-2)!}$$

Computational solutions to the normalization problem

  1. Use brute-force enumeration all possible states: $$\sum_{\vec{\theta}_M}P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$
    • Only computationally feasible for relatively small problems.
  2. Use a variational approach (AKA "variational Bayes"):
    find $Q(\vec{\theta}_M)$ minimizing $\sum_{\vec{\theta}_M}Q(\vec{\theta}_M)\log\frac{Q(\vec{\theta}_M)}{P(\vec{\theta}_M|\vec{d},M)}$.
    • Requires a parametric ansatz for the posterior. Not always easy to find, particularly in "interesting" state spaces such as tree space.

The MCMC Algorithm

Monte Carlo Methods

  • Broad class of methods which use stochastic algorithms to mathematical problems.
  • A large sub-class of these methods are targeted at the numerical solution of analytically intractable integrals: \begin{align} I &= \int_{\vec{x}\in\mathcal{V}} f(\vec{x})p(\vec{x})\\ &= E_p[f(\vec{x})] \end{align}
  • By the law of large numbers: $$I = \lim_{N\rightarrow\infty} \frac{1}{N}\sum_{i=1}^N f(\vec{x}^{(i)})$$ where $\vec{x}^{(i)}$ are draws from the distribution $p(\vec{x})$.

Monte Carlo Methods (Continued)

Wikipedia

Here $\mathcal{V}=[0,1]^2$, $p(\vec{x})=1$ and $f(\vec{x})=I(|\vec{x}|^2<1)$.

Importance Sampling

  • What happens if we can't sample directly from $p(\vec{x})$?
  • Solution: suppose we have a function $Q(\vec{x})$ also defined on $\mathcal{V}$ from which we can easily generate samples:
  • We can then use $$I=\int_{\vec{x}\in\mathcal{V}}f(\vec{x})\frac{p(\vec{x})}{q(\vec{x})}q(\vec{x}) = E_q\left[f(\vec{x})\frac{p(\vec{x})}{q(\vec{x})}\right]$$
  • The factor $p(\vec{x}^{(i)})/q(\vec{x}^{(i)})$ is the weight of sample $\vec{x}^{(i)}$.

Importance Sampling in a Bayesian context

  • Importance sampling is a very general and useful idea with many applications in Bayesian statistics.
  • A very naïve application would be the computation of the normalization factor in Bayes theorem: $$P(\vec{d}|M)=\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}P(\vec{d}|\vec{\theta}_M^{(i)})$$ where $\vec{\theta}_M^{(i)}\sim P(\vec{\theta}_M|M)$.
  • Similarly, one could compute expectations of parameters under the posterior.
  • Note that the equation above still holds if unbiased noise is added to the likelihood.

Approximate Bayesian Computation

  • Suppose we can't even evaluate the likelihood $P(\vec{d}|\vec{\theta}_M,M)$. What then?
  • An example from population genetics might be the probability of a sequence alignment $A$ given some population parameters $\phi=(N(t), \rho, \ldots)$. This probability $P(A|\phi)$ is difficult to compute as it requires summing over a large number of latent variables (including the sampled phylogenetic tree/network).
  • ABC: Replace the likelihood with a cost function depending on the difference between summary statistics evaluated on data $\vec{d}_s$ simulated under the $M$ from parameters $\vec{\theta}_M$ and the same summaries evaluate using the observed data $\vec{d}$.
  • While this method seems less than rigorous, there are systematic ways of applying it.

Markov chain Monte Carlo methods

  • Until now we have assumed each sample $\vec{x}^{(i)}$ is statistically independent.
  • Markov chain Monte Carlo methods instead produce a sequence of correlated samples $\vec{x}^{(1)}, \vec{x}^{(2)},\ldots,\vec{x}^{(N)}$, where each element is drawn from a distribution depending only on the previous element: $$\vec{x}^{(i+1)} \sim Q(\vec{x}'|\vec{x}^{(i)})$$
  • Allows samples to adapt to shape of distribution:

The Gibbs sampler

  • To be useful, the Markov chain must be ergodic in the following sense: $$\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(\vec{x}^{(i)})=\int_{\vec{x}\in\mathcal{V}} f(\vec{x})p(\vec{x})$$
  • That is, the stationary distribution of the chain must match $p(\vec{x})$. Thus, the transition probabilities must be chosen so that they do not perturb the stationary distribution.
  • One way to achieve this is to use a sequence of transition functions which sample individual elements of the state vector from the conditional distributions $p(x_k|\vec{x}_{-k})$.
  • This is possible only when these conditional probabilities are known.

Metropolis-Hastings sampling

  • Suppose instead that we have the following transition probability: $$Q(\vec{x}'|\vec{x})=\alpha(\vec{x}',\vec{x})q(\vec{x}'|\vec{x}) + \delta(\vec{x}'-\vec{x})(1-\int_{\vec{y}\in\mathcal{V}}\alpha(\vec{y},\vec{x})q(\vec{y}|\vec{x}))$$
  • Here $q(\vec{x}'|\vec{x})$ is an arbitrary transition probability distribution (as long as ergodicity is satisfiable) and $\alpha(\vec{x}',\vec{x})$ is the acceptance probability.
  • Assuming that the stationary distribution of the chain satisfies detailed balance allows us to write $$p(\vec{x})Q(\vec{x}'|\vec{x}) = p(\vec{x}')Q(\vec{x}|\vec{x}')$$

Metropolis-Hastings sampling (continued)

  • From this it is straight-forward to identify the following solution for $\alpha$: $$\alpha(\vec{x}',\vec{x})=1\wedge\frac{p(\vec{x}')}{p(\vec{x})}\cdot\frac{q(\vec{x}|\vec{x}')}{q(\vec{x}'|\vec{x})}$$
  • The ratio $q(\vec{x}|\vec{x}')/q(\vec{x}'|\vec{x})$ is known as the Hastings ratio, and accounts for asymmetry in the proposal distribution.
  • This isn't the only possibility for $\alpha$:
    • It's possible to find Markov chains with the transition probability given that have $p(\vec{x})$ as a stationary distribution but which violate detailed balance.
    • It's also possible to find $\alpha$ such that both $\alpha(\vec{x}',\vec{x})$ and $\alpha(\vec{x},\vec{x}')$ are $<1$: although this would be silly. (Why?)

The MH algorithm

The complete (yet simple!) MH algorithm is as follows:
  1. Set $\vec{x}$ to an arbitrary initial state.
  2. Sample $\vec{x}'$ from a proposal distribution $q(\vec{x'}|\vec{x})$
  3. Evaluate the acceptance probability $\alpha(\vec{x}',\vec{x})$
  4. Sample $u\sim\text{Unif}(0,1)$.
  5. If $u<\alpha(\vec{x}',\vec{x})$ then set $\vec{x}'\leftarrow\vec{x}$.
  6. Log $\vec{x}$ to file.
  7. Go to 2 (until the heat death of the universe).

MCMC in practice

Implementation example

Suppose we want to sample from the following posterior for a single variable $x$:

Use a simple proposal distribution:

$$q_w(x'|x) = \begin{cases} 1/w & x-\frac{w}{2} < x' < x+\frac{w}{2}\\ 0 & \text{otherwise} \end{cases}$$

Implementation example

Regenerate

Implementation example

Regenerate

Implementation example

$w=1$:
$w=3$:
$w=5$:
Regenerate

How do we assess convergence?

The easiest approach is simply to run multiple independent chains, each initialized from a unique starting point.

  • Chains can be run in parallel.
  • Resulting sample distributions can be compared using (for example) Q-Q plots.
  • Once convergence is attained, combining the results of multiple chains is perfectly acceptable.

How do we assess mixing?

The key to assessing mixing is the autocorrelation function of the chain states:

$$\widehat{ACF}(l) = \frac{\sum_i x_i x_{i-l}}{\left(\sum_i x_i\right)^2}$$

Assessing Mixing (continued)

The lag required for this function to decay to within the vicinity of 0 is the gap between effectively independent samples, $\tau$.

If $N$ is the total number of MCMC samples, we then define $$N_{\text{eff}}=\frac{N}{\tau}$$ to be the effective sample size (ESS).

The ESS is a rough estimate of the number of actual samples a chain has generated.

You should really only consider the order of magnitude of the ESS.

Multivariate target distributions

  • MCMC is usually applied to posteriors with large numbers of dimensions.
  • This is precisely the regime in which MCMC shines.
  • A typical phylogenetic analysis (as we'll see later) may contain many hundreds of distinct parameters.
  • How do we develop algorithms to sample from posteriors over these parameter spaces?

Complex proposals

The usual approach is to decompose the proposal function $q(\vec{x}'|\vec{x})$ in terms of a large number of simpler proposals: $$q(\vec{x}'|\vec{x})=\sum_j w_j q_j(\vec{x}'|\vec{x})$$

  • Individual proposal functions $q_j(\vec{x}'|\vec{x})$ are sometimes called "operators", and the $w_j$ are called "weights".
  • Individual operators are often constructed to be reversible: i.e. maintain detailed balance.
  • However, individual operators are usually incapable of producing an ergodic chain on their own. (I.e. they may not be capable of exploring the entire support of the posterior.)

Example

These proposal operators are sufficient for ergodicity:

$q_1(\vec{x'}|\vec{x})=\delta(x_2'-x_2)g(x_1'|x_1)$, $q_2(\vec{x'}|\vec{x})=\delta(x_1'-x_1)g(x_2'|x_2)$

where $g(x'|x)$ is the PDF for $\text{Unif}(x-w/2,x+w/2)$

Example (continued)

Regenerate

Correlated variables

  • Chain is mixing slowly because of strong correlations between the two parameters.
  • Solution is to update both parameters simultaneously.
  • Anticipating correlations between parameters is a key skill in designing efficient MCMC algorithms.
  • In the Bayesian context, this often requires careful examination of the likelihood: non-identifiability is a source of such correlations.

Include a third proposal operator: $$q_3(\vec{x}'|\vec{x}) = \frac{1}{v}\int_{-v/2}^{v/2}\delta(x_1'-(x_1+a))\delta(x_2'-(x_2+a))\mathrm{d}a$$

Mixing improvement with new operator

Regenerate

Alternative approach: Hamiltonian Monte Carlo

  • The Hamiltonian formalization of classical mechanics relates the dynamics of phase space coordinates $(\vec{q},\vec{p})$ to the Hamiltonian function $H(\vec{q},\vec{p})=K(\vec{p}) + V(\vec{q})$ via Hamilton's equations: \begin{align} \dot{q}_i &= \frac{\partial H}{\partial p_i}\\ \dot{p}_i &= -\frac{\partial H}{\partial q_i} \end{align}
  • Hamiltonian (/hybrid) Monte Carlo is an MCMC algorithm that treats the parameter state vector as the position component of canonical coordinates in phase space.

Alternative approach: Hamiltonian Monte Carlo

  • HMC assumes that the system is in thermal contact with a heat bath of fixed temperature ("canonical ensemble"):
  • The finite temperature means that the probability of observing the system in an accessible state $(\vec{q},\vec{p})$ is $$P(\vec{q},\vec{p}) = \frac{1}{Z}e^{-H(\vec{q},\vec{p})/T}$$ where $Z=\sum_{\vec{q},\vec{p}}e^{-H(\vec{q},\vec{p})/T}$ is the partition function.

Alternative approach: Hamiltonian Monte Carlo

  • Note that the probability distribution factorizes $$P(\vec{q},\vec{p})=Z^{-1}\exp[-(K(\vec{p})+V(\vec{q}))/T]=P(\vec{q})P(\vec{p})$$
  • The MCMC method involves identifying the position coordinates $\vec{q}$ with the variables ($\vec{x}$) of interest , and defining $V(\vec{q})$ such that $P(\vec{q})$ is the target density, i.e. $$V(\vec{q}) = -T\log[ZP(\vec{q})]$$
  • The actual update step for the algorithm involves:
    1. Sampling a new $\vec{p}^*$ from $P(\vec{p})$ (easy, since $K(\vec{p})$ is quadratic and hence $P(\vec{p})$ is Gaussian,
    2. Evolving the state using Hamiltons equations from $(\vec{q},\vec{p}^*)$ to $(\vec{q}',\vec{p}')$.
    3. Performing an MCMC accept/reject for this new state compared to the original state $(\vec{q},\vec{p})$.

Alternative approach: Hamiltonian Monte Carlo

Radford Neal, MCMC using Hamltonian dynamics (in Handbook of Markov chain Monte Carlo, 2011)

Reversible Jump MCMC

The Problem

  • Suppose we have a posterior density of the form: $$P(\vec{x}_k,k) = \sum_{k}Pf(\vec{x_k}|k)P(k)$$ where the dimension of $\vec{x}_k$ depends on $k$.
  • "The number of things you don't know is one of the things you don't know."
  • This is a straight-forward way of dealing with model selection. (In fact, model selection and parameter inference are essentially the same thing...)
  • Can we still use MCMC to sample from this distribution?

The Solution

The Solution

Use the following form of the detailed balance criterion:

$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)Q(\vec{x}'_B|\vec{x}_A) = \int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)Q(\vec{x}'_A|\vec{x}_B)$$

Here $A$ and $B$ are any two subsets of the state space, while $Q(\vec{x}'|\vec{x})$ is the MCMC transition kernel: $$Q(\vec{x}'|\vec{x})=\alpha(\vec{x}',\vec{x})q(\vec{x}'|\vec{x}) + \delta(\vec{x}'-\vec{x})(1-\int_{\vec{y}\in\mathcal{V}}\alpha(\vec{y},\vec{x})q(\vec{y}|\vec{x}))$$

The DB criterion then reduces to:

$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)\alpha(\vec{x}'_B,\vec{x}_A)q(\vec{x}'_B|\vec{x}_A) = \int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)\alpha(\vec{x}'_A,\vec{x}_B)q(\vec{x}'_A|\vec{x}_B)$$

The Solution

$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)\alpha(\vec{x}'_B,\vec{x}_A)q(\vec{x}'_B|\vec{x}_A) = \int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)\alpha(\vec{x}'_A,\vec{x}_B)q(\vec{x}'_A|\vec{x}_B)$$
  • Deriving an acceptance probability from the above criterion is guaranteed to produce a chain that converges to the correct distribution (assuming your proposal is capable of exploring the state space, of course).
  • For the most part, $$\alpha(\vec{x}',\vec{x}) = 1\wedge \frac{P(\vec{x}')}{P(\vec{x})}\times\frac{q(\vec{x}|\vec{x}')}{q(\vec{x}'|\vec{x})}$$ just works.
  • In the case that proposals contain singularities (occasionally the case for model switching), the safest thing is to back up and derive $\alpha(\vec{x}',\vec{x})$ from the criterion directly.

Reversible jump is confusingly named?

  • Consider a proposal involving choosing $f\sim\text{Unif}(1/\beta,\beta$ then setting $\vec{x}'=f\vec{x}$.
  • The proposal function for this move is: $$q(\vec{x}'|\vec{x})=\frac{1}{\beta-\beta^{-1}}\int_{\beta^{-1}}^{\beta}\mathrm{d}f\delta(\vec{x}'-f\vec{x})$$
  • This does not involve a dimension change. However, due to the singularity it is easiest to derive the acceptance probability for the move from the DB criterion directly.
  • The $\alpha$ for other proposals which do involve a dimension change can often be derived simply by substituting the proposal densities directly into the usual expression.

Final words

  • One often hears "reversible jump" mentioned in hushed tones.
  • However, as long as you use the generalized detailed balance criterion, you won't go wrong.

DON'T PANIC!

Model Selection

The Marginal Likelihood

  • The posterior for parameters $\vec{\theta}_M$ of model $M$ given data $\vec{d}$ is: $$P(\vec{\theta}_M = \frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)}{P(\vec{d}|M)}$$
  • The troublesome denominator $$P(\vec{d}|M)=\sum_{\vec{\theta}_M}P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$ is the likelihood (or marginal likelihood) for $M$ given the data.
  • This function is central to Bayesian model selection:
    \begin{align} \frac{P(M_1|\vec{d})}{P(M_2|\vec{d})}&=\frac{P(\vec{d}|M_1)P(M_1)}{P(\vec{d}|M_2)P(M_2)}\\ &=\frac{P(\vec{d}|M_1)}{P(\vec{d}|M_2)}\text{ if }P(M_1)=P(M_2) \end{align}

Bayesian Occam's Razor

Interestingly, the marginal likelihood (aka the "evidence") for a model automatically penalizes against overfitting:

Idea due to MacKay, 2003

But how can we compute it?

The Harmonic Mean Estimator

Idea: use the fact that we can draw samples $\vec{\theta}_M^{(i)}$ from the posterior under the model: \begin{align} \lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(\vec{\theta}_M^{(i)}) &= \sum_{\vec{\theta}_M}f(\vec{\theta}_M)P(\vec{\theta}_M|\vec{d},M)\\ &= \sum_{\vec{\theta}_M}f(\vec{\theta}_M)\frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)}{P(\vec{d}|M)} \end{align}

  • $\implies\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(\vec{d}|\vec{\theta}_M^{(i)},M)}=\frac{1}{P(\vec{d}|M)}$.
  • Sadly, this has a reputation for being the worst Monte Carlo method ever invented!
    • Any idea why?

Thermodynamic Integration

  • This is the "correct" way to compute the marginal likelihood using MCMC.
  • From Lartillot and Phillipe (2006):
    • Suppose $p_i(\vec{\theta})=\frac{1}{Z_i}q_i(\vec{\theta})$, $i=0,1$, where $Z_i=\sum_{\vec{\theta}}q_i(\vec{\theta})$.
    • Define a continuous series of distributions $p_{\beta}$, $q_{\beta}$, $Z_{\beta}$ for $\beta\in[0,1]$.
    • It is easy to show that $$E\left[\frac{\partial\log q_{\beta}(\vec{\theta})}{\partial\beta}\right]=\frac{\partial Z_{\beta}}{\partial\beta}$$

Thermodynamic Integration (continued)

  • Thus, by using MCMC to produce estimates for $E[\partial\log q_{\beta}(\vec{\theta})/\partial\beta]$ under a series of distributions between $\beta=0$ and $\beta=1$ one can numerically integrate this sampled curve to give an estimate for $$\int\mathrm{d}\beta \frac{\partial \log Z_{\beta}}{\partial\beta}=\log Z_1-\log Z_0$$
  • In practice one can also slowly vary $\beta$ during the MCMC run.

Annealing-Melting Integration

  • The most straight-forward approach is to use something like the following $$q_{\beta}(\vec{\theta}_M) = P(\vec{d}|\vec{\theta}_M,M)^{\beta}P(\vec{\theta}_M|M)$$
  • Then $q_1(\vec{\theta}_M)$ is the unnormalized posterior and $q_0(\vec{\theta}_M)$ is the prior.
  • Integration of the expectation over $\beta$ then yields $$\log Z_1 - \log Z_0 = \log Z_1 = \log P(\vec{d}|M)$$

Model-switch Integration

  • One difficulty with the previous scheme is that it requires one to use MCMC to sample from the prior. This can be challenging as priors can be very broad.
  • An alternative is to use instead \begin{align} q_{\beta}(\vec{\theta})=&\left[P(\vec{d}|\vec{\theta}_{M},M_1)P(\vec{\theta}_{M}|M_1)\right]^{\beta}\\ &\times\left[P(\vec{d}|\vec{\theta}_{M},M_2)P(\vec{\theta}_{M}|M_2)\right]^{1-\beta} \end{align}
  • Integration of the expectation over $\beta$ then directly yields the log Bayes Factor: $$\log Z_1 - \log Z_0 = \log Z_1 = \log \frac{P(\vec{d}|M_1)}{P(\vec{d}|M_2)}$$

Model-switch Integration

In this case, the integral yields a BF of $10^{17}$ in favour of model 1 over model 0.

Summary of Part 1

  1. Metropolis-Hastings MCMC is one of a large family of Monte Carlo methods useful for Bayesian analysis.
  2. Its usefulness depends critically on the proposal mechanism. Getting this right involves either
    • knowing enough about the posterior to create well-behaved operators, or
    • using an adaptive method such as Hamiltonian Monte Carlo.
  3. Using MCMC to sample from mixtures of parameter densities of different dimensions is not only possible, it's easy!
  4. MCMC can also be used to directly compute model likelihoods and Bayes factors via thermodynamic integration.