The best introduction to HMMs and the Viterbi algorithm is the classic tutorial by Rabiner (see, e.g., <a href="https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf" rel="nofollow noreferrer">https://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf</a>).<p>What is missing from Steve's clear explanation is (1) a graphic depiction of the central data structure, called the "trellis" (Fig. 6 in the above paper), and (2) a citation to the original Viterbi paper [1].<p>For many years, HMMs - together with a Viterbi triphone decoder - defined the state of the art in ASR (automatic speech recognition), until neural networks eventually took over.<p>[1] A. J. Viterbi, “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm,” IEEE Trans. Inform. Theory, vol. IT-13, pp. 260-269, Apr. 1967.
My favorite story about Viterbi encoding is that apparently they launched the Voyager probe with a Viterbi encoder even though at the time there wasn't a computer on Earth capable of decoding it in real time (encoding is less computationally intensive than decoding). Years later they remotely switched over to Viterbi to enable higher bitrates.
This is very much like the least action principle [0]. The optimal path between two points is the combination of the optimal path from the starting point to an intermediate point and the optimal path from that intermediate point to the end point.<p>[0]: <a href="https://en.wikipedia.org/wiki/Stationary-action_principle" rel="nofollow noreferrer">https://en.wikipedia.org/wiki/Stationary-action_principle</a>
<i>The</i> text for understanding and implementing HMMs is "A Revealing Introduction to Hidden Markov Models" by Mark Stamp; the content is a Google search away!
The Viterbi algorithm, the Baum-Welch algorithm, the forward-backward algorithm, beam search, etc., are all very closely related, all using basically the same kind of dynamic programming. Viterbi calculates the max and argmax; Baum-Welch (forward-backward) calculates the sum (forward) and then also the posteriors (backward). The difference is just max vs. sum. And to get the argmax, you just backtrack. Beam search is an approximation of the max and argmax.<p>Btw, using automatic differentiation, just calculating the forward scores (sum) is enough; automatic differentiation will then automatically give you the backward pass.<p>This is not just for HMMs but applies to many kinds of search. Naturally for CTC (which can be seen as a simple special case of an HMM) and transducers, but also for encoder-decoder or decoder-only (language) models, and more.
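To make the "same DP, max vs. sum" point concrete, here is a minimal sketch with a toy two-state HMM (all numbers and names are made up for illustration): the two functions share the identical recursion, differing only in the reduction, and backtracking recovers the argmax path.

```python
# Toy HMM: states, probabilities, and observations are invented for this sketch.
states = [0, 1]
init = [0.6, 0.4]                      # P(state at t=0)
trans = [[0.7, 0.3], [0.4, 0.6]]       # trans[i][j] = P(next=j | cur=i)
emit = [[0.5, 0.5], [0.1, 0.9]]        # emit[i][o]  = P(obs=o | state=i)
obs = [0, 1, 1]

def viterbi(obs):
    # delta[j] = probability of the best path ending in state j
    delta = [init[j] * emit[j][obs[0]] for j in states]
    back = []
    for o in obs[1:]:
        prev = delta
        step = [max((prev[i] * trans[i][j], i) for i in states) for j in states]
        back.append([i for (_, i) in step])       # backpointers for argmax
        delta = [p * emit[j][o] for j, (p, _) in enumerate(step)]
    # Backtrack to recover the best state sequence
    path = [max(states, key=lambda j: delta[j])]
    for ptrs in reversed(back):
        path.append(ptrs[path[-1]])
    return list(reversed(path)), max(delta)

def forward(obs):
    # Identical recursion with sum in place of max:
    # alpha[j] = P(observations so far, state j)
    alpha = [init[j] * emit[j][obs[0]] for j in states]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in states) * emit[j][o]
                 for j in states]
    return sum(alpha)  # total likelihood P(obs), summed over all paths

path, best = viterbi(obs)
total = forward(obs)
assert best <= total  # one best path can't exceed the sum over all paths
```

In practice you would work in log space (so max stays max and sum becomes logsumexp) to avoid underflow on long sequences.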
The analogy doesn't make sense to me. What if the shortest path from Waterloo to New York doesn't go through Toronto?<p>And if the point is that you already know that it does, then... the rest is beyond obvious?<p>What am I missing?
This YouTube channel explains algorithms intuitively on a whiteboard:
<a href="https://youtu.be/IqXdjdOgXPM" rel="nofollow noreferrer">https://youtu.be/IqXdjdOgXPM</a>
The classic failure of this approach is a form of the hill-climbing problem (the fastest way from Cambridge to Palo Alto is to start by driving <i>east</i>, to the airport).