Gibbs Sampling from Hidden Markov Models

A number of students are either currently working with, or have been asking about, the algorithm for Gibbs sampling from a hidden Markov model described in

Scott, S.L., 2002. Bayesian Methods for Hidden Markov Models. Journal of the American Statistical Association, 97(457), pp.337-351.

Unfortunately, the derivation and notation used by Scott are not as clear as they could be, and so we re-derive the major results here.

A hidden Markov model is essentially a Markov chain in which the states are imperfectly observed.

Consider a Markov chain, and let \(s_{k}\) denote the state of the chain at time \(k\).  A hidden Markov model assumes these states are not directly observed, but that at each time step a measurement \(y_{k}\) is taken, and the distribution of \(y_{k}\) depends only upon the state \(s_{k}\) of the chain.
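
To make this concrete, here is a minimal simulation sketch of such a model, a two-state chain with Gaussian measurements; the prior, transition matrix and measurement means are made up purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    # Purely illustrative two-state chain with Gaussian measurements.
    prior = np.array([0.5, 0.5])            # p(s_1)
    trans = np.array([[0.9, 0.1],           # trans[i, j] = p(s_{k+1} = j | s_k = i)
                      [0.2, 0.8]])
    means = np.array([-1.0, 2.0])           # measurement mean for each state

    n = 100
    states = np.empty(n, dtype=int)
    y = np.empty(n)
    states[0] = rng.choice(2, p=prior)
    y[0] = rng.normal(means[states[0]], 1.0)
    for k in range(1, n):
        states[k] = rng.choice(2, p=trans[states[k - 1]])  # hidden Markov step
        y[k] = rng.normal(means[states[k]], 1.0)           # noisy measurement of the state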

The algorithm described by Scott Gibbs samples the sequence of states \(s_{1}, s_{2}, \ldots, s_{n}\), given the conditional density \(p(y_{k}|s_{k})\) of \(y_{k}\) given \(s_{k}\), the transition probabilities \(p(s_{k+1} | s_{k})\), and the prior distribution \(p(s_{1})\) for the initial state.

Forward Pass

The forward pass of the algorithm computes the sequence of joint distributions \(p(s_{k+1},s_{k}|Y_{k+1})\) where \(Y_{k}\) denotes the sequence of the first \(k\) observations \((y_{1}, y_{2}, \ldots, y_{k})\).

Given the prior distribution \(p(s_{1})\) of the initial state \(s_{1}\) and the conditional density \(p(y_{1}|s_{1})\) of \(y_{1}\) given \(s_{1}\), the joint distribution is \[ p(y_{1}, s_{1}) = p(y_{1}|s_{1})p(s_{1}).\] By Bayes’ rule the posterior distribution \(p(s_{1}|y_{1})\)  of \(s_{1}\) given \(y_{1}\) is \[ p(s_{1}|y_{1}) = \frac{p(y_{1},s_{1})}{\sum_{s_{1}} p(y_{1}, s_{1})}.\]
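
As a small numerical illustration of this first step (the prior and the evaluated values of \(p(y_{1}|s_{1})\) below are made up):

    import numpy as np

    prior = np.array([0.5, 0.3, 0.2])        # p(s_1), illustrative values
    lik_y1 = np.array([0.05, 0.40, 0.10])    # p(y_1 | s_1) evaluated at the observed y_1

    joint = lik_y1 * prior                   # p(y_1, s_1) = p(y_1 | s_1) p(s_1)
    posterior = joint / joint.sum()          # p(s_1 | y_1) by Bayes' rule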

Similarly, the joint distribution of \(y_{2}\), \(s_{2}\) and \(s_{1}\) conditional on \(y_{1}\) is \[p(y_{2}, s_{2},s_{1}|y_{1}) = p(y_{2}|s_{2})p(s_{2}|s_{1})p(s_{1}|y_{1})\] so\[p(s_{2},s_{1}|y_{1}, y_{2}) = \frac{p(y_{2}, s_{2},s_{1}|y_{1})}{\sum_{s_{2}} \sum_{s_{1}} p(y_{2}, s_{2},s_{1}|y_{1})}\] and the marginal distribution of \(s_{2}\) conditional on \(y_{1}\) and \(y_{2}\) is \[ p(s_{2}|y_{1},y_{2}) = \sum_{s_{1}} p(s_{2},s_{1}|y_{1}, y_{2}). \]

In general,\[p(y_{k+1}, s_{k+1},s_{k}|Y_{k}) = p(y_{k+1}|s_{k+1})p(s_{k+1}|s_{k})p(s_{k}|Y_{k}),\] and by Bayes’ rule\[p(s_{k+1},s_{k}|Y_{k+1}) = \frac{p(y_{k+1}, s_{k+1},s_{k}|Y_{k}) }{\sum_{s_{k+1}} \sum_{s_{k}} p(y_{k+1}, s_{k+1},s_{k}|Y_{k}) },\] and the marginal distribution required for the next step is calculated as \[p(s_{k+1}|Y_{k+1}) = \sum_{s_{k}} \; p(s_{k+1},s_{k}|Y_{k+1}).\]
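
The full forward pass might be sketched as follows, assuming the likelihoods \(p(y_{k}|s_{k})\) have already been evaluated at the observed data and stored row by row in a matrix lik; the function and variable names are illustrative, not Scott's.

    import numpy as np

    def forward_pass(prior, trans, lik):
        # prior : (S,)    p(s_1)
        # trans : (S, S)  trans[i, j] = p(s_{k+1} = j | s_k = i)
        # lik   : (n, S)  lik[k-1, s] = p(y_k | s_k = s), rows in time order
        n, S = lik.shape
        filtered = np.empty((n, S))        # filtered[k-1] = p(s_k | Y_k)
        joints = np.empty((n - 1, S, S))   # joints[k-1][i, j] = p(s_k = i, s_{k+1} = j | Y_{k+1})

        f = lik[0] * prior                 # p(y_1, s_1) = p(y_1 | s_1) p(s_1)
        filtered[0] = f / f.sum()          # p(s_1 | y_1)
        for k in range(1, n):
            # p(y_{k+1}, s_{k+1}, s_k | Y_k) = p(y_{k+1}|s_{k+1}) p(s_{k+1}|s_k) p(s_k|Y_k),
            # laid out with rows indexing s_k and columns indexing s_{k+1}.
            J = filtered[k - 1][:, None] * trans * lik[k][None, :]
            J /= J.sum()                   # Bayes' rule gives p(s_k, s_{k+1} | Y_{k+1})
            joints[k - 1] = J
            filtered[k] = J.sum(axis=0)    # marginalise out s_k
        return filtered, joints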

Reverse Pass

The reverse pass of the algorithm computes the sequence of joint distributions \(p(s_{k},s_{k-1}|Y_{n})\).

To compute \(p(s_{k},s_{k-1}|Y_{n})\), note that \[
p(s_{k},s_{k-1}|Y_{n})= p(s_{k-1}|s_{k},Y_{n}) p(s_{k}|Y_{n})= p(s_{k-1}|s_{k},Y_{k}) p(s_{k}|Y_{n}).
\] The key point here is that, conditional on \(s_{k}\), \(s_{k-1}\) is independent of \(y_{k+1},\ldots,y_{n}\).  As \(p(s_{k-1},s_{k} | Y_{k})=p(s_{k-1}|s_{k},Y_{k})p(s_{k}|Y_{k})\),\[p(s_{k},s_{k-1}|Y_{n}) = p(s_{k-1},s_{k}|Y_{k}) \frac{p(s_{k}|Y_{n})}{p(s_{k}|Y_{k})}.\] The recursion runs backwards from \(k=n\): \(p(s_{n}|Y_{n})\) is available from the forward pass, and the marginal \(p(s_{k-1}|Y_{n})\) needed at the next step is obtained by summing \(p(s_{k},s_{k-1}|Y_{n})\) over \(s_{k}\).
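
Continuing the sketch above, and reusing the filtered and joints arrays returned by the forward-pass sketch, the reverse pass could look like this:

    import numpy as np

    def reverse_pass(filtered, joints):
        # filtered : (n, S)       filtered[k-1] = p(s_k | Y_k)
        # joints   : (n-1, S, S)  joints[k-1][i, j] = p(s_k = i, s_{k+1} = j | Y_{k+1})
        n, S = filtered.shape
        smoothed = np.empty_like(joints)    # smoothed[k-1][i, j] = p(s_k = i, s_{k+1} = j | Y_n)
        marginal = filtered[-1]             # start of the recursion: p(s_n | Y_n)
        for k in range(n - 1, 0, -1):
            # p(s_k, s_{k+1} | Y_n) = p(s_k, s_{k+1} | Y_{k+1}) * p(s_{k+1} | Y_n) / p(s_{k+1} | Y_{k+1})
            smoothed[k - 1] = joints[k - 1] * (marginal / filtered[k])[None, :]
            marginal = smoothed[k - 1].sum(axis=1)   # p(s_k | Y_n), needed at the next step
        return smoothed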

Once all the joint distributions \(p(s_{k},s_{k-1}|Y_{n})\) are computed, the states are sampled by first sampling \(s_{n}\) from \(p(s_{n}|Y_{n})\), then sampling \(s_{n-1}\) from \[p(s_{n-1}|s_{n},Y_{n}) = \frac{p(s_{n},s_{n-1}|Y_{n})}{\sum_{s_{n-1}} p(s_{n},s_{n-1}|Y_{n})},\] and so on.  In general, \(s_{k-1}\) is sampled from \[p(s_{k-1}|s_{k},Y_{n}) = \frac{p(s_{k},s_{k-1}|Y_{n})}{\sum_{s_{k-1}} p(s_{k},s_{k-1}|Y_{n})}.\]
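
A sketch of this sampling step, again reusing the arrays produced by the earlier sketches:

    import numpy as np

    def sample_states(filtered, smoothed, rng=None):
        # Draw one state sequence s_1, ..., s_n from p(s_1, ..., s_n | Y_n).
        rng = np.random.default_rng() if rng is None else rng
        n, S = filtered.shape
        s = np.empty(n, dtype=int)
        s[n - 1] = rng.choice(S, p=filtered[-1])        # s_n ~ p(s_n | Y_n)
        for k in range(n - 1, 0, -1):
            # p(s_k | s_{k+1}, Y_n): the column of p(s_k, s_{k+1} | Y_n) picked out by the
            # sampled s_{k+1}, renormalised over s_k.
            col = smoothed[k - 1][:, s[k]]
            s[k - 1] = rng.choice(S, p=col / col.sum())
        return s

Given prior, trans and lik as in the forward-pass sketch, one draw of the state sequence would then be

    filtered, joints = forward_pass(prior, trans, lik)
    smoothed = reverse_pass(filtered, joints)
    states = sample_states(filtered, smoothed)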

Likelihood

The likelihood \(p(Y_{n})\) is not required for Gibbs sampling.  However, it is easily calculated during the forward pass of the algorithm.

At each iteration of the forward pass, additionally compute \[\begin{aligned}
p(Y_{k}, s_{k})
&= \sum_{s_{k-1}} p(y_{k} | s_{k}) p(s_{k}|s_{k-1})p(Y_{k-1},s_{k-1})\\
&= p(y_{k} | s_{k}) \sum_{s_{k-1}}  p(s_{k}|s_{k-1})p(Y_{k-1},s_{k-1}),\end{aligned}\] starting from \(p(Y_{1},s_{1}) = p(y_{1}|s_{1})p(s_{1})\). Then \[p(Y_{n}) =  \sum_{s_{n}} p(Y_{n} , s_{n}).\] However, as Scott discusses, it is more numerically stable to compute \(\log p(Y_{n})\).
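
One way to do this (a sketch, not necessarily Scott's exact scheme) is to carry the recursion in terms of \(\log p(Y_{k},s_{k})\) and combine terms with log-sum-exp:

    import numpy as np
    from scipy.special import logsumexp

    def log_likelihood(log_prior, log_trans, log_lik):
        # log_prior : (S,)    log p(s_1)
        # log_trans : (S, S)  log_trans[i, j] = log p(s_k = j | s_{k-1} = i)
        # log_lik   : (n, S)  log_lik[k-1, s] = log p(y_k | s_k = s)
        alpha = log_lik[0] + log_prior      # log p(Y_1, s_1)
        for row in log_lik[1:]:             # row holds log p(y_k | s_k) for the next k
            # log p(Y_k, s_k) = log p(y_k|s_k) + log sum_{s_{k-1}} p(s_k|s_{k-1}) p(Y_{k-1}, s_{k-1})
            alpha = row + logsumexp(log_trans + alpha[:, None], axis=0)
        return logsumexp(alpha)             # log p(Y_n)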
