Bayesian inference using Markov chain Monte Carlo
Part 1: Foundations and Implementation
Tim Vaughan
Computational Evolution Group (cEvo)
Department of Biosystems Science and Engineering, ETH Zurich
Bayesian Statistics Mini-course
Univerisity of Hohenheim, 2018
Bayesian inference
- Bayesian inference is enabled by the Bayesian interpretation of probabilities:
- Probabilities are just as applicable to hypotheses as they are to data.
- Inference of models and/or model parameters can be conducted using the standard calculus of probabilities (sum rule, product rule, etc).
- In contrast, sampling theory approaches assign probabilities only to data:
- Inference of models and/or parameters is a distinct procedure.
The Problem
Suppose we have a vector $\vec{d}$ representing some collected data. Assuming a model $M$, what do the data tell us about the parameters $\vec{\theta}_M$ of this model?
The Bayesian solution is simple to derive and tremendously intuitive:
$$P(\vec{\theta}_M|\vec{d},M) = \frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M)}{P(\vec{d}|M)}$$
For models with large parameter space volumes,
actually evaluating this probability for a particular
$\vec{\theta}_M$ is HARD.
$$P(\vec{d}|M)=\sum_{\vec{\theta}_M} P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$
A practical example from phylogenetics
The posterior probability for a particular time tree $T$ given a multiple sequence alignment $A$ and a neutral substitution model and a coalescent prior on tree space is
$$P(T,\mu,\theta|A) = \frac{P(A|T,\mu)P(T|\theta)P(\mu,\theta)}{P(A)}$$
where $\mu$ and $\theta$ are substitution model and tree prior parameters.
How difficult is it to compute the denominator?
$$P(A)=\sum_{T,\mu,\theta}P(A|T,\mu)P(T|\theta)P(\mu,\theta)$$
How big is tree space?
The number $N$ of distinct labelled rooted tree topologies
grows rapidly with the number $m$ of leaves
(i.e. sequences):
$$N_m = \frac{(2m-3)!}{2^{m-2}(m-2)!}$$
Computational solutions to the normalization problem
- Use brute-force enumeration all possible states:
$$\sum_{\vec{\theta}_M}P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)$$
- Only computationally feasible for relatively small problems.
- Use a variational approach (AKA "variational Bayes"):
find $Q(\vec{\theta}_M)$ minimizing $\sum_{\vec{\theta}_M}Q(\vec{\theta}_M)\log\frac{Q(\vec{\theta}_M)}{P(\vec{\theta}_M|\vec{d},M)}$.
- Requires a parametric ansatz for the posterior. Not always easy to find, particularly in "interesting" state spaces such as tree space.
Implementation example
Suppose we want to sample from the following posterior for a single variable $x$:
Use a simple proposal distribution:
$$q_w(x'|x) = \begin{cases}
1/w & x-\frac{w}{2} < x' < x+\frac{w}{2}\\
0 & \text{otherwise}
\end{cases}$$
Implementation example
$w=1$:
How do we assess convergence?
The easiest approach is simply to run multiple independent
chains, each initialized from a unique starting point.
- Chains can be run in parallel.
- Resulting sample distributions can be compared using (for example) Q-Q plots.
- Once convergence is attained, combining the results of
multiple chains is perfectly acceptable.
How do we assess mixing?
The key to assessing mixing is the autocorrelation function of the chain states:
$$\widehat{ACF}(l) = \frac{\sum_i x_i x_{i-l}}{\left(\sum_i x_i\right)^2}$$
Assessing Mixing (continued)
The lag required for this function to decay to within the
vicinity of 0 is the gap between effectively independent
samples, $\tau$.
If $N$ is the total number of MCMC samples, we then define
$$N_{\text{eff}}=\frac{N}{\tau}$$
to be the effective sample size (ESS).
The ESS is a rough estimate of the number of actual
samples a chain has generated.
You should really only
consider the order of magnitude of the ESS.
Multivariate target distributions
- MCMC is usually applied to posteriors with large numbers of dimensions.
- This is precisely the regime in which MCMC shines.
- A typical phylogenetic analysis (as we'll see later) may contain many hundreds
of distinct parameters.
- How do we develop algorithms to sample from posteriors over
these parameter spaces?
Complex proposals
The usual approach is to decompose the proposal function $q(\vec{x}'|\vec{x})$ in terms of a large number of simpler proposals:
$$q(\vec{x}'|\vec{x})=\sum_j w_j q_j(\vec{x}'|\vec{x})$$
- Individual proposal functions $q_j(\vec{x}'|\vec{x})$ are
sometimes called "operators", and the $w_j$ are called
"weights".
- Individual operators are often constructed to be reversible:
i.e. maintain detailed balance.
- However, individual operators are usually incapable of
producing an ergodic chain on their own. (I.e. they may not be
capable of exploring the entire support of the posterior.)
Example
These proposal operators are sufficient for ergodicity:
$q_1(\vec{x'}|\vec{x})=\delta(x_2'-x_2)g(x_1'|x_1)$,
$q_2(\vec{x'}|\vec{x})=\delta(x_1'-x_1)g(x_2'|x_2)$
where $g(x'|x)$ is the PDF for $\text{Unif}(x-w/2,x+w/2)$
Correlated variables
- Chain is mixing slowly because of strong correlations between the two
parameters.
- Solution is to update both parameters simultaneously.
- Anticipating correlations between parameters is a key skill in designing efficient MCMC algorithms.
- In the Bayesian context, this often requires careful examination of the likelihood: non-identifiability is a source of such correlations.
Include a third proposal operator:
$$q_3(\vec{x}'|\vec{x}) = \frac{1}{v}\int_{-v/2}^{v/2}\delta(x_1'-(x_1+a))\delta(x_2'-(x_2+a))\mathrm{d}a$$
Mixing improvement with new operator
Alternative approach: Hamiltonian Monte Carlo
- The Hamiltonian formalization of classical mechanics
relates the dynamics of phase space coordinates $(\vec{q},\vec{p})$ to the
Hamiltonian function $H(\vec{q},\vec{p})=K(\vec{p}) + V(\vec{q})$ via Hamilton's
equations:
\begin{align}
\dot{q}_i &= \frac{\partial H}{\partial p_i}\\
\dot{p}_i &= -\frac{\partial H}{\partial q_i}
\end{align}
- Hamiltonian (/hybrid) Monte Carlo is an MCMC algorithm that
treats the parameter state vector as the position component of
canonical coordinates in phase space.
Alternative approach: Hamiltonian Monte Carlo
- HMC assumes that the system is in thermal contact
with a heat bath of fixed temperature ("canonical ensemble"):
- The finite temperature means that the probability of observing the system in an accessible state $(\vec{q},\vec{p})$ is
$$P(\vec{q},\vec{p}) = \frac{1}{Z}e^{-H(\vec{q},\vec{p})/T}$$
where $Z=\sum_{\vec{q},\vec{p}}e^{-H(\vec{q},\vec{p})/T}$ is the partition function.
Alternative approach: Hamiltonian Monte Carlo
- Note that the probability distribution factorizes
$$P(\vec{q},\vec{p})=Z^{-1}\exp[-(K(\vec{p})+V(\vec{q}))/T]=P(\vec{q})P(\vec{p})$$
- The MCMC method involves identifying the position coordinates $\vec{q}$ with the variables ($\vec{x}$) of interest , and defining $V(\vec{q})$ such that $P(\vec{q})$ is the target density, i.e.
$$V(\vec{q}) = -T\log[ZP(\vec{q})]$$
- The actual update step for the algorithm involves:
- Sampling a new $\vec{p}^*$ from $P(\vec{p})$ (easy,
since $K(\vec{p})$ is quadratic and hence $P(\vec{p})$ is
Gaussian,
- Evolving the state using Hamiltons equations from
$(\vec{q},\vec{p}^*)$ to $(\vec{q}',\vec{p}')$.
- Performing an MCMC accept/reject for this new state
compared to the original state $(\vec{q},\vec{p})$.
Alternative approach: Hamiltonian Monte Carlo
The Problem
- Suppose we have a posterior density of the form:
$$P(\vec{x}_k,k) = \sum_{k}Pf(\vec{x_k}|k)P(k)$$
where the dimension of $\vec{x}_k$ depends on $k$.
- "The number of things you don't know is one of the things you don't know."
- This is a straight-forward way of dealing with model selection. (In fact, model selection and parameter inference are essentially the same thing...)
- Can we still use MCMC to sample from this distribution?
The Solution
The Solution
Use the following form of the detailed balance criterion:
$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)Q(\vec{x}'_B|\vec{x}_A) =
\int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)Q(\vec{x}'_A|\vec{x}_B)$$
Here $A$ and $B$ are any two subsets of the state space, while $Q(\vec{x}'|\vec{x})$ is the MCMC transition kernel:
$$Q(\vec{x}'|\vec{x})=\alpha(\vec{x}',\vec{x})q(\vec{x}'|\vec{x}) + \delta(\vec{x}'-\vec{x})(1-\int_{\vec{y}\in\mathcal{V}}\alpha(\vec{y},\vec{x})q(\vec{y}|\vec{x}))$$
The DB criterion then reduces to:
$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)\alpha(\vec{x}'_B,\vec{x}_A)q(\vec{x}'_B|\vec{x}_A) =
\int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)\alpha(\vec{x}'_A,\vec{x}_B)q(\vec{x}'_A|\vec{x}_B)$$
The Solution
$$\int_{\vec{x}_A\in A}\int_{\vec{x}'_B\in B} P(\vec{x}_A)\alpha(\vec{x}'_B,\vec{x}_A)q(\vec{x}'_B|\vec{x}_A) =
\int_{\vec{x}_B\in B} \int_{\vec{x}'_A\in A} P(\vec{x}_B)\alpha(\vec{x}'_A,\vec{x}_B)q(\vec{x}'_A|\vec{x}_B)$$
- Deriving an acceptance probability from the above criterion is guaranteed to produce a chain that converges to the correct distribution (assuming your proposal is capable of exploring the state space, of course).
- For the most part,
$$\alpha(\vec{x}',\vec{x}) = 1\wedge \frac{P(\vec{x}')}{P(\vec{x})}\times\frac{q(\vec{x}|\vec{x}')}{q(\vec{x}'|\vec{x})}$$
just works.
- In the case that proposals contain singularities (occasionally the case for model switching), the safest thing is to back up and derive $\alpha(\vec{x}',\vec{x})$ from the criterion directly.
Reversible jump is confusingly named?
- Consider a proposal involving choosing $f\sim\text{Unif}(1/\beta,\beta$ then setting $\vec{x}'=f\vec{x}$.
- The proposal function for this move is:
$$q(\vec{x}'|\vec{x})=\frac{1}{\beta-\beta^{-1}}\int_{\beta^{-1}}^{\beta}\mathrm{d}f\delta(\vec{x}'-f\vec{x})$$
- This does not involve a dimension change. However, due to the singularity it is easiest to derive the acceptance probability for the move from the DB criterion directly.
- The $\alpha$ for other proposals which do involve a dimension change can often be derived simply by substituting the proposal densities directly into the usual expression.
Final words
- One often hears "reversible jump" mentioned in hushed tones.
- However, as long as you use the generalized detailed balance criterion, you won't go wrong.
DON'T PANIC!
Bayesian Occam's Razor
Interestingly, the marginal likelihood (aka the "evidence") for a model automatically penalizes against overfitting:
But how can we compute it?
The Harmonic Mean Estimator
Idea: use the fact that we can draw samples $\vec{\theta}_M^{(i)}$ from the posterior under the model:
\begin{align}
\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}f(\vec{\theta}_M^{(i)})
&= \sum_{\vec{\theta}_M}f(\vec{\theta}_M)P(\vec{\theta}_M|\vec{d},M)\\
&= \sum_{\vec{\theta}_M}f(\vec{\theta}_M)\frac{P(\vec{d}|\vec{\theta}_M,M)P(\vec{\theta}_M|M)}{P(\vec{d}|M)}
\end{align}
- $\implies\lim_{N\rightarrow\infty}\frac{1}{N}\sum_{i=1}^{N}\frac{1}{P(\vec{d}|\vec{\theta}_M^{(i)},M)}=\frac{1}{P(\vec{d}|M)}$.
- Sadly, this has a reputation for being the worst Monte Carlo method ever invented!
Thermodynamic Integration
- This is the "correct" way to compute the marginal likelihood using MCMC.
- From Lartillot and Phillipe (2006):
- Suppose $p_i(\vec{\theta})=\frac{1}{Z_i}q_i(\vec{\theta})$, $i=0,1$, where $Z_i=\sum_{\vec{\theta}}q_i(\vec{\theta})$.
- Define a continuous series of distributions $p_{\beta}$, $q_{\beta}$, $Z_{\beta}$ for $\beta\in[0,1]$.
- It is easy to show that
$$E\left[\frac{\partial\log q_{\beta}(\vec{\theta})}{\partial\beta}\right]=\frac{\partial Z_{\beta}}{\partial\beta}$$
Thermodynamic Integration (continued)
- Thus, by using MCMC to produce estimates for $E[\partial\log q_{\beta}(\vec{\theta})/\partial\beta]$ under a series of distributions between $\beta=0$ and $\beta=1$ one can numerically integrate this sampled curve to give an estimate for
$$\int\mathrm{d}\beta \frac{\partial \log Z_{\beta}}{\partial\beta}=\log Z_1-\log Z_0$$
- In practice one can also slowly vary $\beta$ during the MCMC run.
Annealing-Melting Integration
- The most straight-forward approach is to use something like the following
$$q_{\beta}(\vec{\theta}_M) = P(\vec{d}|\vec{\theta}_M,M)^{\beta}P(\vec{\theta}_M|M)$$
- Then $q_1(\vec{\theta}_M)$ is the unnormalized posterior and $q_0(\vec{\theta}_M)$ is the prior.
- Integration of the expectation over $\beta$ then yields
$$\log Z_1 - \log Z_0 = \log Z_1 = \log P(\vec{d}|M)$$
Model-switch Integration
- One difficulty with the previous scheme is that it requires one to use MCMC to sample from the prior. This can be challenging as priors can be very broad.
- An alternative is to use instead
\begin{align}
q_{\beta}(\vec{\theta})=&\left[P(\vec{d}|\vec{\theta}_{M},M_1)P(\vec{\theta}_{M}|M_1)\right]^{\beta}\\
&\times\left[P(\vec{d}|\vec{\theta}_{M},M_2)P(\vec{\theta}_{M}|M_2)\right]^{1-\beta}
\end{align}
- Integration of the expectation over $\beta$ then directly yields the log Bayes Factor:
$$\log Z_1 - \log Z_0 = \log Z_1 = \log \frac{P(\vec{d}|M_1)}{P(\vec{d}|M_2)}$$
Model-switch Integration
In this case, the integral yields a BF of $10^{17}$ in favour of model 1 over model 0.
Summary of Part 1
- Metropolis-Hastings MCMC is one of a large family of Monte Carlo methods useful for Bayesian analysis.
- Its usefulness depends critically on the proposal mechanism. Getting this right involves either
- knowing enough about the posterior to create well-behaved operators, or
- using an adaptive method such as Hamiltonian Monte Carlo.
- Using MCMC to sample from mixtures of parameter densities of different dimensions is not only possible, it's easy!
- MCMC can also be used to directly compute model likelihoods and Bayes factors via thermodynamic integration.