What's <i>actually happening</i> in an LLM is many orders of magnitude more complex than a Markov chain. However, I agree that they're an amazing pedagogical tool for the basic principles of how an LLM works, even to a non-technical audience.<p>Many people try to "explain" LLMs starting with the principles of neural networks. This rarely works well: there are some significant conceptual leaps required.<p>However, explaining that LLMs are really just iterated next-word prediction based on a statistical model of the preceding words <i>is</i> something that most people can grok, and in a useful way: in my experience, it actually helps give people a useful intuition for why and how models hallucinate and what kinds of things they're good and bad at.<p>Markov chains are a super simple iterated next-word prediction model, and can be explained in 15 minutes. They're a great way to explain LLMs to laypeople.
Though it seems like there is some sensitivity around the comparison of LLMs to Markov chains, and certainly an LLM is not generated in the same way, it is pretty accurate that an LLM could be represented by a sufficiently (i.e. very) complex Markov chain. The states of the chain would not be the tokens themselves, as in this example, but the total context window of N input tokens, which would fan out to states consisting of the new context N[1:] + A, where A is any of the tokens sampled from the resulting probability distribution, and the transition probabilities are just drawn from that same distribution according to the temperature setting.<p>You could even do some very hand-wavy math on how staggeringly complex the resulting Markov chain would get: BERT for example has a token vocabulary of 30,000 and a context window of 512 tokens. So the number of possible states would be 30,000^512, or ~1.9 x 10^2292, with each of those having a max fan-out to 30,000 other states. So clearly the LLM is a more compact representation of the concept.
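A quick sanity check of that hand-wavy arithmetic (the vocabulary and context sizes are the BERT figures quoted above):

```python
# State-space size of a BERT-sized "Markov chain": vocabulary of
# 30,000 tokens, context window of 512 tokens.
import math

vocab_size = 30_000
context_len = 512

# Number of possible context-window states = vocab_size ** context_len.
# Work in log10 rather than materializing the astronomically large integer.
log10_states = context_len * math.log10(vocab_size)
print(f"~10^{log10_states:.0f} possible states")  # ~10^2292
```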
It's true that Markov chains are very limited in their capabilities. But one thing I love about them is that they are one of the simplest and most intuitive ways to write code that *learns* from input data.<p>If you've never written something that *learns*, try it out! Here's a very primitive one I wrote recently that demonstrates the basic idea and explains it along the way.<p><a href="https://github.com/unoti/markov-basics/blob/main/markov-basics.ipynb">https://github.com/unoti/markov-basics/blob/main/markov-basi...</a><p>This starts with generating US city names, and ends with generating text based on the Dungeons and Dragons DM guide.
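To give a flavor of the city-name idea without opening the notebook, here is a minimal character-level sketch of the same concept (this is my own illustration, not the notebook's code):

```python
# A tiny character-level Markov chain that "learns" from example names
# and generates new ones. "^" and "$" mark the start and end of a name.
import random
from collections import defaultdict

def train(names):
    # Map each character to the list of characters observed to follow it.
    follows = defaultdict(list)
    for name in names:
        chars = ["^"] + list(name.lower()) + ["$"]
        for a, b in zip(chars, chars[1:]):
            follows[a].append(b)
    return follows

def generate(follows, max_len=12):
    out, state = [], "^"
    while len(out) < max_len:
        state = random.choice(follows[state])
        if state == "$":  # end-of-name marker reached
            break
        out.append(state)
    return "".join(out)

model = train(["springfield", "fairview", "riverside", "greenville"])
print(generate(model))  # prints a new, vaguely city-like name
```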
I have no idea if the author is aware, but it's worth noting why a Markov chain has its name and what the difference is from other probabilistic models. The Markov property states that the probability distribution of the next state in some system depends only upon the current state.<p>Obviously, language does not have this property and this has been known from the start, but Markov models are extremely computationally tractable, easy to implement, and easy to understand, while doing a good enough job. Introducing recursion and recurrent layers into multilayer neural architectures allowed explicit modeling of variable-length past context for the current state, which is a more accurate representation of how language works, but these were quite expensive to train. Transformer models introduced the attention mechanism to simulate recurrence without explicitly encoding it, reducing the training cost to make it tractable to train equivalently capable models with larger parameter sets on larger training sets, giving us the large language model.<p>This ability to explicitly capture variable-length context lookback is what makes things like few-shot and one-shot learning possible. The probability distribution is effectively self-modifying at inference time. It's not quite like an animal brain. There is no true plasticity with the strict separation between training and inference, but it gets a lot closer than a Markov model.<p>You can see in a lot of the comments here use cases people had for Markov models where they shine. If you're making a bot intended to model one specific topical forum on a single web site, context variance is reduced compared to trying to converse with and understand arbitrary people in arbitrary situations. You're able to capture the relevant context based on how you select your training data. In contrast, current LLMs allow you to train on all text you can find anywhere and the model will perform well in any context.
> This a scratch text area to test the autocomplete in another moment down would be four thousand miles i wish you know please ma am is this new zealand or conversations in another moment down stairs how funny it<p>^ to me it’s really a stretch that someone could read that and think “markov chains are practically an llm!”<p>They’re pretty limited, and everyone working on building LLMs knows about them and has probably used them at some point.<p>They both generate tokens/words on probabilities, but it’s hard to overstate how much simpler of a probability model a Markov chain is compared to an llm.<p>You’ve stated that markov chains have a limited use as a simple chat bot. How do you improve an mc so that it can perform other tasks, like classification, summarization, question answering, complex text parsing, text matching, etc, etc?<p>The language modeling done in llms has an open pathway to scaling and improving on this statistical model, and it’s incredible to me that this is still opaque to people. I don’t believe that llms are a path to intelligence, but I can certainly recognize that they’re going to emerge as a sophisticated and very useful productivity tool.
Sure, LLMs are like 1024-gram Markov chains (or whatever the context limit is). But there are problems: 1) the transition matrix is far too huge to represent, and 2) it treats all pairs of 1024-grams as completely different, even if the first 1023 words are the same.<p>Function approximation solves both issues, and the Transformer is the best function class we've found so far.
Markov chains are fun. I often use them when teaching a Python fundamentals course. You can create an implementation in around 100 lines of code that explores many features of the language: classes, functions, loops, dictionaries, and lists. Then, you can augment with tests, a command line app, typing, etc.
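A condensed sketch of the kind of class-based implementation described above (my own illustration of the teaching exercise, exercising classes, loops, dicts, and lists):

```python
# A compact class-based, word-level Markov chain.
import random
from collections import defaultdict

class MarkovChain:
    def __init__(self):
        # Each word maps to the list of words observed to follow it.
        self.transitions = defaultdict(list)

    def train(self, text):
        words = text.split()
        for current, nxt in zip(words, words[1:]):
            self.transitions[current].append(nxt)

    def generate(self, start, length=10):
        words = [start]
        for _ in range(length - 1):
            options = self.transitions.get(words[-1])
            if not options:  # dead end: no observed follower
                break
            words.append(random.choice(options))
        return " ".join(words)

chain = MarkovChain()
chain.train("the cat sat on the mat and the cat ran")
print(chain.generate("the"))
```

From here the exercise can grow naturally into tests, a CLI, and type annotations, as described above.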
Markov Chains are cool stuff. Back in 2015 in one of my senior class projects, I built a haiku bot that would take a seed word and make a haiku from a corpus of Tweets I collected from the firehose API.
This is what I've been saying to people: LLMs aren't much smarter than a Markov Chain - a lot of twitter bots and earlier chat agents were driven by them.<p>What I do see though is the infrastructure around LLMs making a difference.
Markov chains were one of the coolest discoveries of my programming career and I spent years using them to make forum and social media bots trained on people’s post histories for them.<p>I think that experience is part of why I’ve been generally unimpressed with a lot of LLM hype. Like yeah, it’s cool and definitely more useful than a markov chain - but for the amount of resources that went into it I’d expect the gap to be quite a bit larger than it really is.
An LLM with temperature 0 will always return the same output for the same input.<p>Considering this, LLMs are Markov Chains; it's just that the whole context can be considered as "one previous element" and the output sequence as a whole as the next element.<p>So, the whole block of text on the input is the previous element, and the whole block of text on the output is the next element.<p>Isn't it?
If we let Markov model mean a probability distribution over a sequence defined as P[ X(t) | X(t-1) ], then a transformer is specifically not a Markov model. The Markov property means each element is conditional on only the previous element ("context length" = 1).<p>A discrete table-based probability distribution over possible words, as in the link, is a Markov language model, but not a meaningful contestant among Markov models for language modelling. The latest contestants here are things like RWKV, RetNet, SSMs, and so on, which might be better thought of as HMMs.<p>An HMM is a Markov chain over hidden ("latent") states, rather than over tokens or words. Variations have been used for speech recognition and language modelling for 20-30 years.
Here's a good post on how effective these can be:<p><a href="https://colab.research.google.com/github/norvig/pytudes/blob/main/ipynb/Goldberg.ipynb" rel="nofollow">https://colab.research.google.com/github/norvig/pytudes/blob...</a>
Claude Shannon showed you could use n-grams to represent language. Then Chomsky invented automata / formal languages theory to show why it couldn’t work.<p>IIRC transformers are at the very bottom of the Chomsky hierarchy, and yet they are clearly able to master English grammar.<p>What gives?
If you want a quick insight into Markov Chains I recommend this resource by Orange Duck.<p>Orange Duck has an implementation of a Markov chain in 17 lines of Python. [0]<p>[0]: <a href="https://theorangeduck.com/page/17-line-markov-chain" rel="nofollow">https://theorangeduck.com/page/17-line-markov-chain</a>
Someone wrote a Markov chaining bot back on LambdaMOO in the early 90s, that would pretend to participate in conversation in the room, and I remember being blown away by it. Especially in the context of the limited computation power on that platform and the machine it was hosted on.
I spent an interesting evening recently with a grain of salt. <a href="https://en.wikipedia.org/wiki/Mark_V._Shaney" rel="nofollow">https://en.wikipedia.org/wiki/Mark_V._Shaney</a>
I had an exam at Uni where one of the problems was related to Markov Chains; brings back memories.<p>Also, for a lab we used Markov Chains to create 'music'. The results were terrible, but the exercise was good.
A resource I found invaluable in demystifying LLMs was Andrej Karpathy's Youtube series. In particular, his "Makemore" video built a basic LLM in the same vein as OP's.<p>It's long (2h), but well worth the time if anyone is curious about the technical workings of LLMs.<p><a href="https://www.youtube.com/watch?v=PaCmpygFfXo&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=3" rel="nofollow">https://www.youtube.com/watch?v=PaCmpygFfXo&list=PLAqhIrjkxb...</a>
While you <i>can</i> represent the probabilities as a table, IMHO the graph representation is MUCH better :<p><a href="https://en.wikipedia.org/wiki/Markov_chain#/media/File:Markovkate_01.svg" rel="nofollow">https://en.wikipedia.org/wiki/Markov_chain#/media/File:Marko...</a>
I agree! I'm an undergrad and I used Markov chains to explain how I'm able to predict what my colleagues are going to say! (BTW I'm a chess player so that's how I know what it is :)
It could be me not understanding something but:<p>> Since there is a 25% probability Alice is at the grocery store, we multiply that with the probility of her transitioning to the Planetarium: 25% ∗ 75%<p>Shouldn't this be 25% * 70%?
In 2016 I created a Twitter bot that replicated a political activist via a Markov chain model. It was impressive how many people liked those posts, including a mid-level politician.
Back around 2002 we built an IRC bot that logged everything in a channel, built up a frequency table of which word follows what word, and then randomly would spew out some text using a random walk of the frequency tables, with some logic for when to stop (I think the probability increased with sentence length).<p>Sometimes it came up with very funny stuff, but mostly nonsense.
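The approach described above can be sketched in a few lines; this is a guess at the mechanics (frequency-weighted random walk, stop probability growing with length), not the original bot's code:

```python
# Word-follower frequency table plus a random walk with a
# length-dependent stop probability, as the comment describes.
import random
from collections import Counter, defaultdict

freq = defaultdict(Counter)

def learn(line):
    words = line.lower().split()
    for a, b in zip(words, words[1:]):
        freq[a][b] += 1  # count how often b follows a

def spew(start, max_words=30):
    words = [start]
    while len(words) < max_words:
        followers = freq[words[-1]]
        if not followers:
            break
        # Weighted pick by observed follower frequency.
        nxt = random.choices(list(followers), weights=followers.values())[0]
        words.append(nxt)
        # Stop probability rises as the sentence gets longer.
        if random.random() < len(words) / max_words:
            break
    return " ".join(words)

learn("the bot said the channel was quiet")
learn("the channel was full of nonsense")
print(spew("the"))
```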