Bayesian inference using Markov chain Monte Carlo

Part 1: Foundations and Implementation

Tim Vaughan

Computational Evolution Group (cEvo)
Department of Biosystems Science and Engineering, ETH Zurich

Bayesian Statistics Mini-course
Univerisity of Hohenheim, 2018

Introduction

Bayesian inference

Bayesian inference is enabled by the Bayesian interpretation of probabilities:
- Probabilities are just as applicable to hypotheses as they are to data.
- Inference of models and/or model parameters can be conducted using the standard calculus of probabilities (sum rule, product rule, etc).
In contrast, sampling theory approaches assign probabilities only to data:
- Inference of models and/or parameters is a distinct procedure.

The Problem

Suppose we have a vector $\vec{d}$ representing some collected data. Assuming a model $M$, what do the data tell us about the parameters $\vec{\theta}_M$ of this model?

The Bayesian solution is simple to derive and tremendously intuitive:

$$P(\vec{\theta}_M|\vec{d},M) = \frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M)}{P(\vec{d}|M)}$$

For models with large parameter space volumes, actually evaluating this probability for a particular $\vec{\theta}_M$ is HARD.

$$P(\vec{d}|M)=\sum_{\vec{\theta}_M} P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$

The Problem (Continued)

Idea due to MacKay, 2003

A practical example from phylogenetics

The posterior probability for a particular time tree $T$ given a multiple sequence alignment $A$ and a neutral substitution model and a coalescent prior on tree space is

$$P(T,\mu,\theta|A) = \frac{P(A|T,\mu)P(T|\theta)P(\mu,\theta)}{P(A)}$$

where $\mu$ and $\theta$ are substitution model and tree prior parameters.

How difficult is it to compute the denominator?

$$P(A)=\sum_{T,\mu,\theta}P(A|T,\mu)P(T|\theta)P(\mu,\theta)$$

How big is tree space?

The number $N$ of distinct labelled rooted tree topologies grows rapidly with the number $m$ of leaves (i.e. sequences):

$$N_m = \frac{(2m-3)!}{2^{m-2}(m-2)!}$$

Computational solutions to the normalization problem

Use brute-force enumeration all possible states: $$\sum_{\vec{\theta}_M}P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$
- Only computationally feasible for relatively small problems.
Use a variational approach (AKA "variational Bayes"):
find $Q(\vec{\theta}_M)$ minimizing $\sum_{\vec{\theta}_M}Q(\vec{\theta}_M)\log\frac{Q(\vec{\theta}_M)}{P(\vec{\theta}_M|\vec{d},M)}$.
- Requires a parametric ansatz for the posterior. Not always easy to find, particularly in "interesting" state spaces such as tree space.

The MCMC Algorithm

Monte Carlo Methods

Broad class of methods which use stochastic algorithms to mathematical problems.
A large sub-class of these methods are targeted at the numerical solution of analytically intractable integrals: \begin{align} I &= \int_{\vec{x}\in\mathcal{V}} f(\vec{x})p(\vec{x})\\ &= E_p[f(\vec{x})] \end{align}
By the law of large numbers: $$I = \lim_{N\rightarrow\infty} \frac{1}{N}\sum_{i=1}^N f(\vec{x}^{(i)})$$ where $\vec{x}^{(i)}$ are draws from the distribution $p(\vec{x})$.

Monte Carlo Methods (Continued)

Wikipedia

Here $\mathcal{V}=[0,1]^2$, $p(\vec{x})=1$ and $f(\vec{x})=I(|\vec{x}|^2<1)$.

Importance Sampling

What happens if we can't sample directly from $p(\vec{x})$?
Solution: suppose we have a function $Q(\vec{x})$ also defined on $\mathcal{V}$ from which we can easily generate samples:
We can then use $$I=\int_{\vec{x}\in\mathcal{V}}f(\vec{x})\frac{p(\vec{x})}{q(\vec{x})}q(\vec{x}) = E_q\left[f(\vec{x})\frac{p(\vec{x})}{q(\vec{x})}\right]$$
The factor $p(\vec{x}^{(i)})/q(\vec{x}^{(i)})$ is the weight of sample $\vec{x}^{(i)}$.

Importance Sampling in a Bayesian context

Importance sampling is a very general and useful idea with many applications in Bayesian statistics.
A very naïve application would be the computation of the normalization factor in Bayes theorem: $$P(\vec{d}|M)=\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}P(\vec{d}|\vec{\theta}_M^{(i)})$$ where $\vec{\theta}_M^{(i)}\sim P(\vec{\theta}_M|M)$.
Similarly, one could compute expectations of parameters under the posterior.
Note that the equation above still holds if unbiased noise is added to the likelihood.

Approximate Bayesian Computation

Suppose we can't even evaluate the likelihood $P(\vec{d}|\vec{\theta}_M,M)$. What then?
An example from population genetics might be the probability of a sequence alignment $A$ given some population parameters $\phi=(N(t), \rho, \ldots)$. This probability $P(A|\phi)$ is difficult to compute as it requires summing over a large number of latent variables (including the sampled phylogenetic tree/network).
ABC: Replace the likelihood with a cost function depending on the difference between summary statistics evaluated on data $\vec{d}_s$ simulated under the $M$ from parameters $\vec{\theta}_M$ and the same summaries evaluate using the observed data $\vec{d}$.
While this method seems less than rigorous, there are systematic ways of applying it.

Markov chain Monte Carlo methods

Until now we have assumed each sample $\vec{x}^{(i)}$ is statistically independent.
Markov chain Monte Carlo methods instead produce a sequence of correlated samples $\vec{x}^{(1)}, \vec{x}^{(2)},\ldots,\vec{x}^{(N)}$, where each element is drawn from a distribution depending only on the previous element: $$\vec{x}^{(i+1)} \sim Q(\vec{x}'|\vec{x}^{(i)})$$
Allows samples to adapt to shape of distribution:

The Gibbs sampler

To be useful, the Markov chain must be ergodic in the following sense: $$\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(\vec{x}^{(i)})=\int_{\vec{x}\in\mathcal{V}} f(\vec{x})p(\vec{x})$$
That is, the stationary distribution of the chain must match $p(\vec{x})$. Thus, the transition probabilities must be chosen so that they do not perturb the stationary distribution.
One way to achieve this is to use a sequence of transition functions which sample individual elements of the state vector from the conditional distributions $p(x_k|\vec{x}_{-k})$.
This is possible only when these conditional probabilities are known.

Metropolis-Hastings sampling

Suppose instead that we have the following transition probability: $$Q(\vec{x}'|\vec{x})=\alpha(\vec{x}',\vec{x})q(\vec{x}'|\vec{x}) + \delta(\vec{x}'-\vec{x})(1-\int_{\vec{y}\in\mathcal{V}}\alpha(\vec{y},\vec{x})q(\vec{y}|\vec{x}))$$
Here $q(\vec{x}'|\vec{x})$ is an arbitrary transition probability distribution (as long as ergodicity is satisfiable) and $\alpha(\vec{x}',\vec{x})$ is the acceptance probability.
Assuming that the stationary distribution of the chain satisfies detailed balance allows us to write $$p(\vec{x})Q(\vec{x}'|\vec{x}) = p(\vec{x}')Q(\vec{x}|\vec{x}')$$

Metropolis-Hastings sampling (continued)

From this it is straight-forward to identify the following solution for $\alpha$: $$\alpha(\vec{x}',\vec{x})=1\wedge\frac{p(\vec{x}')}{p(\vec{x})}\cdot\frac{q(\vec{x}|\vec{x}')}{q(\vec{x}'|\vec{x})}$$
The ratio $q(\vec{x}|\vec{x}')/q(\vec{x}'|\vec{x})$ is known as the Hastings ratio, and accounts for asymmetry in the proposal distribution.
This isn't the only possibility for $\alpha$:
- It's possible to find Markov chains with the transition probability given that have $p(\vec{x})$ as a stationary distribution but which violate detailed balance.
- It's also possible to find $\alpha$ such that both $\alpha(\vec{x}',\vec{x})$ and $\alpha(\vec{x},\vec{x}')$ are $<1$: although this would be silly. (Why?)

The MH algorithm

The complete (yet simple!) MH algorithm is as follows:

Set $\vec{x}$ to an arbitrary initial state.

Sample $\vec{x}'$ from a proposal distribution $q(\vec{x'}|\vec{x})$

Evaluate the acceptance probability $\alpha(\vec{x}',\vec{x})$

Sample $u\sim\text{Unif}(0,1)$.

If $u<\alpha(\vec{x}',\vec{x})$ then set $\vec{x}'\leftarrow\vec{x}$.

Log $\vec{x}$ to file.

Go to 2 (until the heat death of the universe).

MCMC in practice

Implementation example

Suppose we want to sample from the following posterior for a single variable $x$:

Use a simple proposal distribution:

$$q_w(x'|x) = \begin{cases} 1/w & x-\frac{w}{2} < x' < x+\frac{w}{2}\\ 0 & \text{otherwise} \end{cases}$$

Implementation example

Regenerate

Implementation example

Regenerate

Implementation example

$w=1$:

$w=3$:

$w=5$:

Regenerate

How do we assess convergence?

The easiest approach is simply to run multiple independent chains, each initialized from a unique starting point.

Chains can be run in parallel.
Resulting sample distributions can be compared using (for example) Q-Q plots.
Once convergence is attained, combining the results of multiple chains is perfectly acceptable.

How do we assess mixing?

The key to assessing mixing is the autocorrelation function of the chain states:

$$\widehat{ACF}(l) = \frac{\sum_i x_i x_{i-l}}{\left(\sum_i x_i\right)^2}$$

Assessing Mixing (continued)

The lag required for this function to decay to within the vicinity of 0 is the gap between effectively independent samples, $\tau$.

If $N$ is the total number of MCMC samples, we then define $$N_{\text{eff}}=\frac{N}{\tau}$$ to be the effective sample size (ESS).

The ESS is a rough estimate of the number of actual samples a chain has generated.

You should really only consider the order of magnitude of the ESS.

Multivariate target distributions

MCMC is usually applied to posteriors with large numbers of dimensions.
This is precisely the regime in which MCMC shines.
A typical phylogenetic analysis (as we'll see later) may contain many hundreds of distinct parameters.
How do we develop algorithms to sample from posteriors over these parameter spaces?

Complex proposals

The usual approach is to decompose the proposal function $q(\vec{x}'|\vec{x})$ in terms of a large number of simpler proposals: $$q(\vec{x}'|\vec{x})=\sum_j w_j q_j(\vec{x}'|\vec{x})$$

Individual proposal functions $q_j(\vec{x}'|\vec{x})$ are sometimes called "operators", and the $w_j$ are called "weights".
Individual operators are often constructed to be reversible: i.e. maintain detailed balance.
However, individual operators are usually incapable of producing an ergodic chain on their own. (I.e. they may not be capable of exploring the entire support of the posterior.)

Example

These proposal operators are sufficient for ergodicity:

$q_1(\vec{x'}|\vec{x})=\delta(x_2'-x_2)g(x_1'|x_1)$, $q_2(\vec{x'}|\vec{x})=\delta(x_1'-x_1)g(x_2'|x_2)$

where $g(x'|x)$ is the PDF for $\text{Unif}(x-w/2,x+w/2)$

Example (continued)

Regenerate

Correlated variables

Chain is mixing slowly because of strong correlations between the two parameters.
Solution is to update both parameters simultaneously.
Anticipating correlations between parameters is a key skill in designing efficient MCMC algorithms.
In the Bayesian context, this often requires careful examination of the likelihood: non-identifiability is a source of such correlations.

Include a third proposal operator: $$q_3(\vec{x}'|\vec{x}) = \frac{1}{v}\int_{-v/2}^{v/2}\delta(x_1'-(x_1+a))\delta(x_2'-(x_2+a))\mathrm{d}a$$

Mixing improvement with new operator

Regenerate

Alternative approach: Hamiltonian Monte Carlo

The Hamiltonian formalization of classical mechanics relates the dynamics of phase space coordinates $(\vec{q},\vec{p})$ to the Hamiltonian function $H(\vec{q},\vec{p})=K(\vec{p}) + V(\vec{q})$ via Hamilton's equations: \begin{align} \dot{q}_i &= \frac{\partial H}{\partial p_i}\\ \dot{p}_i &= -\frac{\partial H}{\partial q_i} \end{align}
Hamiltonian (/hybrid) Monte Carlo is an MCMC algorithm that treats the parameter state vector as the position component of canonical coordinates in phase space.

Alternative approach: Hamiltonian Monte Carlo

HMC assumes that the system is in thermal contact with a heat bath of fixed temperature ("canonical ensemble"):

The finite temperature means that the probability of observing the system in an accessible state $(\vec{q},\vec{p})$ is $$P(\vec{q},\vec{p}) = \frac{1}{Z}e^{-H(\vec{q},\vec{p})/T}$$ where $Z=\sum_{\vec{q},\vec{p}}e^{-H(\vec{q},\vec{p})/T}$ is the partition function.

Alternative approach: Hamiltonian Monte Carlo

Note that the probability distribution factorizes $$P(\vec{q},\vec{p})=Z^{-1}\exp[-(K(\vec{p})+V(\vec{q}))/T]=P(\vec{q})P(\vec{p})$$
The MCMC method involves identifying the position coordinates $\vec{q}$ with the variables ($\vec{x}$) of interest , and defining $V(\vec{q})$ such that $P(\vec{q})$ is the target density, i.e. $$V(\vec{q}) = -T\log[ZP(\vec{q})]$$
The actual update step for the algorithm involves:
1. Sampling a new $\vec{p}^*$ from $P(\vec{p})$ (easy, since $K(\vec{p})$ is quadratic and hence $P(\vec{p})$ is Gaussian,
2. Evolving the state using Hamiltons equations from $(\vec{q},\vec{p}^*)$ to $(\vec{q}',\vec{p}')$.
3. Performing an MCMC accept/reject for this new state compared to the original state $(\vec{q},\vec{p})$.

Alternative approach: Hamiltonian Monte Carlo

Radford Neal, MCMC using Hamltonian dynamics (in Handbook of Markov chain Monte Carlo, 2011)

Reversible Jump MCMC

The Problem

Suppose we have a posterior density of the form: $$P(\vec{x}_k,k) = \sum_{k}Pf(\vec{x_k}|k)P(k)$$ where the dimension of $\vec{x}_k$ depends on $k$.
"The number of things you don't know is one of the things you don't know."
This is a straight-forward way of dealing with model selection. (In fact, model selection and parameter inference are essentially the same thing...)
Can we still use MCMC to sample from this distribution?

The Solution

Use the following form of the detailed balance criterion:

$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)Q(\vec{x}'_B|\vec{x}_A) = \int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)Q(\vec{x}'_A|\vec{x}_B)$$

Here $A$ and $B$ are any two subsets of the state space, while $Q(\vec{x}'|\vec{x})$ is the MCMC transition kernel: $$Q(\vec{x}'|\vec{x})=\alpha(\vec{x}',\vec{x})q(\vec{x}'|\vec{x}) + \delta(\vec{x}'-\vec{x})(1-\int_{\vec{y}\in\mathcal{V}}\alpha(\vec{y},\vec{x})q(\vec{y}|\vec{x}))$$

The DB criterion then reduces to:

$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)\alpha(\vec{x}'_B,\vec{x}_A)q(\vec{x}'_B|\vec{x}_A) = \int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)\alpha(\vec{x}'_A,\vec{x}_B)q(\vec{x}'_A|\vec{x}_B)$$

The Solution

Deriving an acceptance probability from the above criterion is guaranteed to produce a chain that converges to the correct distribution (assuming your proposal is capable of exploring the state space, of course).
For the most part, $$\alpha(\vec{x}',\vec{x}) = 1\wedge \frac{P(\vec{x}')}{P(\vec{x})}\times\frac{q(\vec{x}|\vec{x}')}{q(\vec{x}'|\vec{x})}$$ just works.
In the case that proposals contain singularities (occasionally the case for model switching), the safest thing is to back up and derive $\alpha(\vec{x}',\vec{x})$ from the criterion directly.

Reversible jump is confusingly named?

Consider a proposal involving choosing $f\sim\text{Unif}(1/\beta,\beta$ then setting $\vec{x}'=f\vec{x}$.
The proposal function for this move is: $$q(\vec{x}'|\vec{x})=\frac{1}{\beta-\beta^{-1}}\int_{\beta^{-1}}^{\beta}\mathrm{d}f\delta(\vec{x}'-f\vec{x})$$
This does not involve a dimension change. However, due to the singularity it is easiest to derive the acceptance probability for the move from the DB criterion directly.
The $\alpha$ for other proposals which do involve a dimension change can often be derived simply by substituting the proposal densities directly into the usual expression.

Final words

One often hears "reversible jump" mentioned in hushed tones.
However, as long as you use the generalized detailed balance criterion, you won't go wrong.

DON'T PANIC!

Model Selection

The Marginal Likelihood

The posterior for parameters $\vec{\theta}_M$ of model $M$ given data $\vec{d}$ is: $$P(\vec{\theta}_M = \frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)}{P(\vec{d}|M)}$$
The troublesome denominator $$P(\vec{d}|M)=\sum_{\vec{\theta}_M}P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$ is the likelihood (or marginal likelihood) for $M$ given the data.
This function is central to Bayesian model selection:
\begin{align} \frac{P(M_1|\vec{d})}{P(M_2|\vec{d})}&=\frac{P(\vec{d}|M_1)P(M_1)}{P(\vec{d}|M_2)P(M_2)}\\ &=\frac{P(\vec{d}|M_1)}{P(\vec{d}|M_2)}\text{ if }P(M_1)=P(M_2) \end{align}

Bayesian Occam's Razor

Interestingly, the marginal likelihood (aka the "evidence") for a model automatically penalizes against overfitting:

Idea due to MacKay, 2003

But how can we compute it?

The Harmonic Mean Estimator

Idea: use the fact that we can draw samples $\vec{\theta}_M^{(i)}$ from the posterior under the model: \begin{align} \lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(\vec{\theta}_M^{(i)}) &= \sum_{\vec{\theta}_M}f(\vec{\theta}_M)P(\vec{\theta}_M|\vec{d},M)\\ &= \sum_{\vec{\theta}_M}f(\vec{\theta}_M)\frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)}{P(\vec{d}|M)} \end{align}

$\implies\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(\vec{d}|\vec{\theta}_M^{(i)},M)}=\frac{1}{P(\vec{d}|M)}$.
Sadly, this has a reputation for being the worst Monte Carlo method ever invented!
- Any idea why?

Thermodynamic Integration

This is the "correct" way to compute the marginal likelihood using MCMC.
From Lartillot and Phillipe (2006):
- Suppose $p_i(\vec{\theta})=\frac{1}{Z_i}q_i(\vec{\theta})$, $i=0,1$, where $Z_i=\sum_{\vec{\theta}}q_i(\vec{\theta})$.
- Define a continuous series of distributions $p_{\beta}$, $q_{\beta}$, $Z_{\beta}$ for $\beta\in[0,1]$.
- It is easy to show that $$E\left[\frac{\partial\log q_{\beta}(\vec{\theta})}{\partial\beta}\right]=\frac{\partial Z_{\beta}}{\partial\beta}$$

Thermodynamic Integration (continued)

Thus, by using MCMC to produce estimates for $E[\partial\log q_{\beta}(\vec{\theta})/\partial\beta]$ under a series of distributions between $\beta=0$ and $\beta=1$ one can numerically integrate this sampled curve to give an estimate for $$\int\mathrm{d}\beta \frac{\partial \log Z_{\beta}}{\partial\beta}=\log Z_1-\log Z_0$$
In practice one can also slowly vary $\beta$ during the MCMC run.

Annealing-Melting Integration

The most straight-forward approach is to use something like the following $$q_{\beta}(\vec{\theta}_M) = P(\vec{d}|\vec{\theta}_M,M)^{\beta}P(\vec{\theta}_M|M)$$
Then $q_1(\vec{\theta}_M)$ is the unnormalized posterior and $q_0(\vec{\theta}_M)$ is the prior.
Integration of the expectation over $\beta$ then yields $$\log Z_1 - \log Z_0 = \log Z_1 = \log P(\vec{d}|M)$$

Model-switch Integration

One difficulty with the previous scheme is that it requires one to use MCMC to sample from the prior. This can be challenging as priors can be very broad.
An alternative is to use instead \begin{align} q_{\beta}(\vec{\theta})=&\left[P(\vec{d}|\vec{\theta}_{M},M_1)P(\vec{\theta}_{M}|M_1)\right]^{\beta}\\ &\times\left[P(\vec{d}|\vec{\theta}_{M},M_2)P(\vec{\theta}_{M}|M_2)\right]^{1-\beta} \end{align}
Integration of the expectation over $\beta$ then directly yields the log Bayes Factor: $$\log Z_1 - \log Z_0 = \log Z_1 = \log \frac{P(\vec{d}|M_1)}{P(\vec{d}|M_2)}$$

Model-switch Integration

In this case, the integral yields a BF of $10^{17}$ in favour of model 1 over model 0.

Summary of Part 1

Metropolis-Hastings MCMC is one of a large family of Monte Carlo methods useful for Bayesian analysis.
Its usefulness depends critically on the proposal mechanism. Getting this right involves either
- knowing enough about the posterior to create well-behaved operators, or
- using an adaptive method such as Hamiltonian Monte Carlo.
Using MCMC to sample from mixtures of parameter densities of different dimensions is not only possible, it's easy!
MCMC can also be used to directly compute model likelihoods and Bayes factors via thermodynamic integration.