The paper: <a href="https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/470135129_1314438233309836_4712217603129928862_n.pdf?_nc_cat=111&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=WqSN1qsot3oQ7kNvgFWGG4j&_nc_zt=14&_nc_ht=scontent-sjc3-1.xx&_nc_gid=A2yO-vwOF4w2PIUX2gHIbXD&oh=00_AYBAR_B1_9ewVRJM5VYbJbdfm4Uk5INZY0t67hlpNccpAA&oe=676400C8" rel="nofollow">https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/470135129_...</a>
The summer that BERT came out I was working at a startup that was using character-based CNN models for classification. We were thinking a lot about alternate representations. Other members of the team were keen on word vectors, but I wasn't, particularly because it seemed the documents we were working on frequently had out-of-dictionary words, because those words were important, and because discarding them would lead to failure.<p>(We were working on "foundation models" too, so it's not just being out-of-dictionary in the final model that's a problem but being out-of-dictionary in the foundation model, which is more expensive to train.)<p>We were doing OK with character-based models for classification, but people believed that storing the "dictionary" inside the neural net was not a good use of the neural net, so there was a lot of enthusiasm for tokens.<p>Meanwhile I felt so sure that schemes like Word2Vec were doomed that I had left an earlier project using RNNs, where the goal was text understanding with a foundation model made by training an RNN to write fake abstracts for case reports from PubMed.<p>When byte-pair encoding was introduced, I remember telling people in a meeting that it was the first tokenization scheme we'd looked at that I could endorse.<p>I have to admit, though, that I wish we could work at the character level.
I really hope this works out. Death to tokenizers!<p>Interesting that it's a hierarchical structure but only two levels of hierarchy. Stacking more levels seems like an obvious direction for further research.<p>Note: I posted this comment on another related story[1] and the author replied:<p>"Author here :), I do think it’s a good direction to look into! That said, aside from it being a bit too much to do at once, you’d also have to be careful about how you distributed your FLOP budget across the hierarchy. With two levels, you can make one level (bytes/local encoder) FLOP efficient and the other (patches/global encoder) FLOP intensive. You’d also need to find a way to group patches into larger units. But ya, there are many directions to go from here!"<p>[1] <a href="https://news.ycombinator.com/item?id=42413430">https://news.ycombinator.com/item?id=42413430</a>
To create a patch, a small model is used to predict the likelihood of the next character in the input string. Input string: 'Lazy dog jumped over a fence.' Use the model to predict the likelihood of each next character.<p>For example:<p><pre><code> 100% sure the next character is 'a'.
Or maybe it's 10% sure it's 'a', 10% sure it's 'b', and so on.
</code></pre>
Then we chunk character estimates together.
How many characters?
Enough characters so that the total uncertainty (entropy) in each chunk is about the same.
And there you have your 'patch' (or 'token').
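A rough Python sketch of the grouping idea described above: accumulate per-character entropy from a small next-byte model until a budget is reached, so every chunk carries roughly the same total uncertainty. The `next_byte_probs(prefix)` interface, the function names, and the entropy budget are illustrative assumptions, not taken from the paper.<p><pre><code>import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a next-character distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def make_patches(data, next_byte_probs, budget=8.0):
    """Accumulate bytes into a patch until the summed next-byte entropy
    reaches the budget, so each patch carries roughly equal uncertainty."""
    patches, current, acc = [], bytearray(), 0.0
    for i, b in enumerate(data):
        acc += entropy_bits(next_byte_probs(data[:i]))
        current.append(b)
        if acc >= budget:                       # this chunk is "uncertain enough"
            patches.append(bytes(current))
            current, acc = bytearray(), 0.0
    if current:
        patches.append(bytes(current))
    return patches

# Toy usage: a model that is always maximally unsure (uniform over 256 bytes,
# i.e. 8 bits per position), so with a 16-bit budget patches come out 2 bytes
# each (plus a 1-byte remainder at the end).
uniform = lambda prefix: [1 / 256] * 256
print(make_patches(b"Lazy dog jumped over a fence.", uniform, budget=16.0))
</code></pre>A real predictor would be confident inside common words (low entropy, long patches) and unsure at word or phrase boundaries (high entropy, short patches), which is where the compute savings come from.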
Recent and related:<p><i>Sharing new research, models, and datasets from Meta FAIR</i> - <a href="https://news.ycombinator.com/item?id=42412360">https://news.ycombinator.com/item?id=42412360</a> - Dec 2024 (61 comments)
So the only thing teaching the model (the loss) is probability prediction in single-byte space. And that is enough? Looks very promising, if I'm not misunderstanding.
From my understanding this not only removes tokenization but also sampling, correct?<p>Sampling can be a pain point of LLMs, but it also enables interesting usages, like forcing a grammar so the model always outputs valid JSON, tuning temperature to get a more varied distribution, XTC sampling, etc.<p>What would be the equivalent of these in a BLT?<p>I can only think of providing the decoder an extra input of allowed/prohibited bytes and running the decode over and over until it outputs something valid; maybe there's a simpler and more obvious approach.
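Not something the paper covers, but one plausible byte-level analogue of grammar-constrained sampling is ordinary logit masking over the 256 byte values. A minimal sketch, assuming the decoder exposes a `next_byte_logits(prefix)` function and some external validity oracle `allowed_bytes(prefix)` (e.g. driven by a JSON state machine); both names are hypothetical.<p><pre><code>import math, random

def sample_constrained_byte(next_byte_logits, prefix, allowed_bytes, temperature=1.0):
    """Sample one byte, masking out anything the constraint forbids."""
    logits = next_byte_logits(prefix)              # 256 raw scores (assumed interface)
    allowed = allowed_bytes(prefix)                # non-empty set of byte values 0..255
    masked = [l / temperature if b in allowed else float("-inf")
              for b, l in enumerate(logits)]
    m = max(masked)                                # subtract max for numerical stability
    weights = [math.exp(l - m) for l in masked]    # exp(-inf) == 0.0 for banned bytes
    return random.choices(range(256), weights=weights, k=1)[0]

# Toy usage: a "model" that slightly prefers '}' and a constraint that only
# allows JSON-ish punctuation -- the sample is guaranteed to be one of those.
toy_logits = lambda prefix: [1.0 if b == ord("}") else 0.0 for b in range(256)]
only_json  = lambda prefix: {ord(c) for c in "{}[],:\" "}
print(chr(sample_constrained_byte(toy_logits, b"{", only_json)))
</code></pre>Since the alphabet is just 256 bytes, existing grammar/JSON constraint machinery would mainly need a byte-level transition function rather than a token-level one; temperature and similar tricks apply unchanged.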
I find it interesting how far linguistic and experience-based approaches have fallen out of fashion. Humans don't read character by character; even if we <i>can</i>, it's not a standard operating mode. We have word stems and understand modifications by endings. Tokenization doesn't replicate this experience (seriously, look at the tokens that appear in LLM vocabularies), nor does character or byte encoding. Humans have multiple ways to parse words: you can grok a full sentence, read a phrase, read word by word, or sound out a new word character by character. Very few papers explicitly claim that a method is good because it replicates the way a human would perform a task, or perceive the world.<p>I suspect that as LLM reliance increases we'll want to align the models to our experience more closely. I further suspect this will make the errors that models make more comprehensible.
> Unlike tokenization, BLT has no fixed vocabulary for patches.<p>iiuc this means: the vocabulary of patches is not known prior to training.<p>I guess once training has established a vocabulary of patches, that same fixed vocabulary is used for inference (if this is not true I don't see how it could work).<p>Right?
An interesting read on alternative tokenization methods.<p>Questions:<p>1. What's the goal of entropy-based byte-token grouping as tokenization? Is this tokenization method best suited for that goal?<p>2. What about simply using a byte-level sequence-to-sequence autoencoder with downsampling for tokenization? (Rough sketch of what I mean below.)
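One possible reading of question 2, sketched below: a convolutional byte-level autoencoder whose strided layers downsample the byte sequence into a shorter code sequence (the "tokens") and then reconstruct the bytes. This is purely illustrative and not something the paper proposes or evaluates; all names and sizes are made up.<p><pre><code>import torch
import torch.nn as nn

class ByteAutoencoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(256, d)
        # two stride-2 convolutions give a 4x shorter code sequence
        self.encoder = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d, d, kernel_size=4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(d, d, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose1d(d, d, kernel_size=4, stride=2, padding=1),
        )
        self.head = nn.Linear(d, 256)                # reconstructed byte logits

    def forward(self, byte_ids):                     # (B, T), T divisible by 4
        x = self.emb(byte_ids).transpose(1, 2)       # (B, d, T)
        codes = self.encoder(x)                      # (B, d, T/4): the "tokens"
        recon = self.decoder(codes).transpose(1, 2)  # (B, T, d)
        return self.head(recon), codes

# Toy usage: reconstruct a 32-byte string (padded with spaces).
byte_ids = torch.tensor([list(b"Lazy dog jumped over a fence.   ")])
logits, codes = ByteAutoencoder()(byte_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), byte_ids.reshape(-1))
</code></pre>The obvious difference from BLT is that this compresses at a fixed rate everywhere, whereas the entropy-based patching spends fewer "tokens" on predictable spans and more on surprising ones.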
Related quote from Karpathy:<p>Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.<p>• Why can't LLM spell words? Tokenization.<p>• Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.<p>• Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.<p>• Why is LLM bad at simple arithmetic? Tokenization.<p>• Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.<p>• Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.<p>• What is this weird warning I get about a "trailing whitespace"? Tokenization.<p>• Why the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.<p>• Why should I prefer to use YAML over JSON with LLMs? Tokenization.<p>• Why is LLM not actually end-to-end language modeling? Tokenization.<p>• What is the real root of suffering? Tokenization.
My notes:<p>It's a 3-component model.<p>- Encoder: takes byte groupings and outputs a hidden state/encoding called patches<p>- Transformer: takes these patch encodings in autoregressive fashion<p>- Decoder: takes the transformer-processed encodings and outputs bytes (rough sketch of this layout below)<p>Loss is byte-level cross-entropy (next-byte prediction).<p>How they group bytes:<p>- Use entropy thresholds: if a sequence of bytes has entropy lower than a threshold, group them<p>- This is a learned model (from data)<p>Why this helps over current byte-pair tokenization in LLMs:<p>- The encoder/decoder essentially act as a "learnable" tokenization scheme<p>- Better efficiency tradeoffs (for highly predictable sequences of bytes, the encoder can "offload" computation effort from the main transformer)<p>- History teaches us that end-to-end learned systems beat human-designed mechanisms
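A minimal PyTorch sketch of that three-part layout, for a single unbatched sequence. Layer sizes, the mean-pooling used to turn byte states into patch vectors, and the absence of causal masking are simplifications of my own; the paper uses cross-attention for the byte-to-patch hand-off, so treat this as a shape-level illustration only.<p><pre><code>import torch
import torch.nn as nn

def encoder(d_model, n_layers, n_heads=4):
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, n_layers)

class TinyBLT(nn.Module):
    def __init__(self, d_local=64, d_global=128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_local)
        self.local_encoder = encoder(d_local, 1)     # small, FLOP-cheap, runs per byte
        self.up = nn.Linear(d_local, d_global)
        self.latent = encoder(d_global, 4)           # large, runs once per patch
        self.down = nn.Linear(d_global, d_local)
        self.local_decoder = encoder(d_local, 1)     # small, maps back to bytes
        self.byte_head = nn.Linear(d_local, 256)     # next-byte logits

    def forward(self, byte_ids, patch_ids):
        # byte_ids: (T,) byte values 0..255; patch_ids: (T,) patch index per byte
        h = self.local_encoder(self.byte_emb(byte_ids).unsqueeze(0))   # (1, T, d_local)
        n_patches = int(patch_ids.max()) + 1
        # mean-pool the byte states belonging to each patch (paper: cross-attention)
        patch_vecs = torch.stack(
            [h[0, patch_ids == p].mean(dim=0) for p in range(n_patches)]
        ).unsqueeze(0)                                                  # (1, P, d_local)
        g = self.latent(self.up(patch_vecs))                           # (1, P, d_global)
        # broadcast each patch's output back to its bytes, then decode bytes
        per_byte = self.down(g)[0, patch_ids]                          # (T, d_local)
        out = self.local_decoder(h + per_byte.unsqueeze(0))
        return self.byte_head(out).squeeze(0)                          # (T, 256) logits

# Toy usage: bytes of a string plus a toy patching (here: one patch per word).
text = b"Lazy dog jumped"
byte_ids = torch.tensor(list(text))
patch_ids = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
logits = TinyBLT()(byte_ids, patch_ids)
</code></pre>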
I'm going to read this paper and the other one on latent sentences later today. I've always advocated that this kind of solution, together with latent sentence search, should get us to the next level of AI. Amazing work from Meta.