The paper: <a href="https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/470135129_1314438233309836_4712217603129928862_n.pdf?_nc_cat=111&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=WqSN1qsot3oQ7kNvgFWGG4j&_nc_zt=14&_nc_ht=scontent-sjc3-1.xx&_nc_gid=A2yO-vwOF4w2PIUX2gHIbXD&oh=00_AYBAR_B1_9ewVRJM5VYbJbdfm4Uk5INZY0t67hlpNccpAA&oe=676400C8" rel="nofollow">https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/470135129_...</a>
The summer that BERT came out I was working at a startup that was using character-based CNN models for classification. We were thinking a lot about alternate representations. Other members of the team were keen on word vectors, but I wasn't, particularly because it seemed the documents we were working on frequently had out-of-dictionary words, because those words were important, and because discarding them would lead to failure.<p>(We were working on "foundation models" too, so it's not just being out-of-dictionary in the final model that's a problem but being out-of-dictionary in the foundation model, which is more expensive to train.)<p>We were doing OK with character-based models for classification, but people believed that storing the "dictionary" inside the neural net was not a good use of the neural net, so there was a lot of enthusiasm for tokens.<p>Meanwhile I felt so sure that schemes like Word2Vec were doomed that I had left an earlier project using RNNs, where the goal was text understanding with a foundation model made by training an RNN to write fake abstracts for case reports from PubMed.<p>When byte-pair encoding was introduced, I remember telling people in a meeting that it was the first tokenization scheme we'd looked at that I could endorse.<p>I have to admit, though, that I wish we could work at the character level.
I really hope this works out. Death to tokenizers!<p>Interesting that it's a hierarchical structure but only two levels of hierarchy. Stacking more levels seems like an obvious direction for further research.<p>Note: I posted this comment on another related story[1] and the author replied:<p>"Author here :), I do think it’s a good direction to look into! That said, aside from it being a bit too much to do at once, you’d also have to be careful about how you distributed your FLOP budget across the hierarchy. With two levels, you can make one level (bytes/local encoder) FLOP efficient and the other (patches/global encoder) FLOP intensive. You’d also need to find a way to group patches into larger units. But ya, there are many directions to go from here!"<p>[1] <a href="https://news.ycombinator.com/item?id=42413430">https://news.ycombinator.com/item?id=42413430</a>
To create a patch, a small model is used to predict the likelihood of the next character in the input string. Input string: 'Lazy dog jumped over a fence.' Use the model to predict the likelihood of each next character.<p>For example:<p><pre><code> 100% sure the next character is 'a'.
Or maybe it's 10% sure it's 'a', 10% sure it's 'b', and so on.
</code></pre>
Then we chunk character estimates together.
How many characters?
Enough characters so that the total uncertainty (entropy) in each chunk is about the same.
And there you have your 'patch' (or 'token').
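A rough Python sketch of the grouping idea described above: accumulate per-character entropy from a small next-byte model until a budget is reached, so every chunk carries roughly the same total uncertainty. The `next_byte_probs(prefix)` interface, the function names, and the entropy budget are illustrative assumptions, not taken from the paper.<p><pre><code>import math

def entropy_bits(probs):
    """Shannon entropy (in bits) of a next-character distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def make_patches(data, next_byte_probs, budget=8.0):
    """Accumulate bytes into a patch until the summed next-byte entropy
    reaches the budget, so each patch carries roughly equal uncertainty."""
    patches, current, acc = [], bytearray(), 0.0
    for i, b in enumerate(data):
        acc += entropy_bits(next_byte_probs(data[:i]))
        current.append(b)
        if acc >= budget:                       # this chunk is "uncertain enough"
            patches.append(bytes(current))
            current, acc = bytearray(), 0.0
    if current:
        patches.append(bytes(current))
    return patches

# Toy usage: a model that is always maximally unsure (uniform over 256 bytes,
# i.e. 8 bits per position), so with a 16-bit budget patches come out 2 bytes
# each (plus a 1-byte remainder at the end).
uniform = lambda prefix: [1 / 256] * 256
print(make_patches(b"Lazy dog jumped over a fence.", uniform, budget=16.0))
</code></pre>A real predictor would be confident inside common words (low entropy, long patches) and unsure at word or phrase boundaries (high entropy, short patches), which is where the compute savings come from.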
Recent and related:<p><i>Sharing new research, models, and datasets from Meta FAIR</i> - <a href="https://news.ycombinator.com/item?id=42412360">https://news.ycombinator.com/item?id=42412360</a> - Dec 2024 (61 comments)
So the only thing teaching the model (the loss) is probability prediction in single-byte space. And that is enough? Looks very promising, if I'm not misunderstanding.
From my understanding this not only removes tokenization but also sampling, correct?<p>Sampling can be a pain point of LLMs, but it also enables interesting usages, like forcing a grammar so the model always outputs valid JSON, tuning temperature to get a more varied distribution, XTC sampling, etc.<p>What would be the equivalent of these in a BLT?<p>I can only think of providing the decoder an extra input of allowed/prohibited bytes and running the decode over and over until it outputs something valid; maybe there's a simpler and more obvious approach.
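Not something the paper covers, but one plausible byte-level analogue of grammar-constrained sampling is ordinary logit masking over the 256 byte values. A minimal sketch, assuming the decoder exposes a `next_byte_logits(prefix)` function and some external validity oracle `allowed_bytes(prefix)` (e.g. driven by a JSON state machine); both names are hypothetical.<p><pre><code>import math, random

def sample_constrained_byte(next_byte_logits, prefix, allowed_bytes, temperature=1.0):
    """Sample one byte, masking out anything the constraint forbids."""
    logits = next_byte_logits(prefix)              # 256 raw scores (assumed interface)
    allowed = allowed_bytes(prefix)                # non-empty set of byte values 0..255
    masked = [l / temperature if b in allowed else float("-inf")
              for b, l in enumerate(logits)]
    m = max(masked)                                # subtract max for numerical stability
    weights = [math.exp(l - m) for l in masked]    # exp(-inf) == 0.0 for banned bytes
    return random.choices(range(256), weights=weights, k=1)[0]

# Toy usage: a "model" that slightly prefers '}' and a constraint that only
# allows JSON-ish punctuation -- the sample is guaranteed to be one of those.
toy_logits = lambda prefix: [1.0 if b == ord("}") else 0.0 for b in range(256)]
only_json  = lambda prefix: {ord(c) for c in "{}[],:\" "}
print(chr(sample_constrained_byte(toy_logits, b"{", only_json)))
</code></pre>Since the alphabet is just 256 bytes, existing grammar/JSON constraint machinery would mainly need a byte-level transition function rather than a token-level one; temperature and similar tricks apply unchanged.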
I find it interesting how far linguistic and experience-based approaches have fallen out of fashion. Humans don't read character by character; even if we <i>can</i>, it's not a standard operating mode. We have word stems and understand modifications by endings. Tokenization doesn't replicate this experience (seriously, look at the tokens that appear in LLM vocabularies), nor does character or byte encoding. Humans have multiple ways to parse words: you can grok a full sentence, read a phrase, read word by word, or sound out a new word character by character. Very few papers explicitly claim that a method is good because it replicates the way a human would perform a task, or perceive the world.<p>I suspect that as LLM reliance increases we'll want to align the models to our experience more closely. I further suspect this will make the errors that models make more comprehensible.
> Unlike tokenization, BLT has no fixed vocabulary for patches.<p>iiuc this means: the vocabulary of patches is not known prior to training.<p>I guess once training has established a vocabulary of patches, that same fixed vocabulary is used for inference (if this is not true I don't see how it could work).<p>Right?
An interesting read on alternative tokenization methods.<p>Questions:<p>1. What's the goal of entropy-based byte-token grouping as tokenization? Is this tokenization method best suited for that goal?<p>2. What about simply using a byte-level sequence-to-sequence autoencoder with downsampling for tokenization? (Rough sketch of what I mean below.)
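One possible reading of question 2, sketched below: a convolutional byte-level autoencoder whose strided layers downsample the byte sequence into a shorter code sequence (the "tokens") and then reconstruct the bytes. This is purely illustrative and not something the paper proposes or evaluates; all names and sizes are made up.<p><pre><code>import torch
import torch.nn as nn

class ByteAutoencoder(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.emb = nn.Embedding(256, d)
        # two stride-2 convolutions give a 4x shorter code sequence
        self.encoder = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.Conv1d(d, d, kernel_size=4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(d, d, kernel_size=4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose1d(d, d, kernel_size=4, stride=2, padding=1),
        )
        self.head = nn.Linear(d, 256)                # reconstructed byte logits

    def forward(self, byte_ids):                     # (B, T), T divisible by 4
        x = self.emb(byte_ids).transpose(1, 2)       # (B, d, T)
        codes = self.encoder(x)                      # (B, d, T/4): the "tokens"
        recon = self.decoder(codes).transpose(1, 2)  # (B, T, d)
        return self.head(recon), codes

# Toy usage: reconstruct a 32-byte string (padded with spaces).
byte_ids = torch.tensor([list(b"Lazy dog jumped over a fence.   ")])
logits, codes = ByteAutoencoder()(byte_ids)
loss = nn.functional.cross_entropy(logits.reshape(-1, 256), byte_ids.reshape(-1))
</code></pre>The obvious difference from BLT is that this compresses at a fixed rate everywhere, whereas the entropy-based patching spends fewer "tokens" on predictable spans and more on surprising ones.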
Related quote from Karpathy:<p>Tokenization is at the heart of much weirdness of LLMs. Do not brush it off.<p>• Why can't LLM spell words? Tokenization.<p>• Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.<p>• Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.<p>• Why is LLM bad at simple arithmetic? Tokenization.<p>• Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.<p>• Why did my LLM abruptly halt when it sees the string "<|endoftext|>"? Tokenization.<p>• What is this weird warning I get about a "trailing whitespace"? Tokenization.<p>• Why the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.<p>• Why should I prefer to use YAML over JSON with LLMs? Tokenization.<p>• Why is LLM not actually end-to-end language modeling? Tokenization.<p>• What is the real root of suffering? Tokenization.
My notes:<p>It's a 3-component model.<p>- Encoder: takes byte groupings and outputs a hidden state/encoding called patches<p>- Transformer: takes these patch encodings in autoregressive fashion<p>- Decoder: takes the transformer-processed encodings and outputs bytes (rough sketch of this layout below)<p>Loss is byte-level cross-entropy (next-byte prediction).<p>How they group bytes:<p>- Use entropy thresholds: if a sequence of bytes has entropy lower than a threshold, group them<p>- This is a learned model (from data)<p>Why this helps over current byte-pair tokenization in LLMs:<p>- The encoder/decoder essentially act as a "learnable" tokenization scheme<p>- Better efficiency tradeoffs (for highly predictable sequences of bytes, the encoder can "offload" computation effort from the main transformer)<p>- History teaches us that end-to-end learned systems beat human-designed mechanisms
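A minimal PyTorch sketch of that three-part layout, for a single unbatched sequence. Layer sizes, the mean-pooling used to turn byte states into patch vectors, and the absence of causal masking are simplifications of my own; the paper uses cross-attention for the byte-to-patch hand-off, so treat this as a shape-level illustration only.<p><pre><code>import torch
import torch.nn as nn

def encoder(d_model, n_layers, n_heads=4):
    layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
    return nn.TransformerEncoder(layer, n_layers)

class TinyBLT(nn.Module):
    def __init__(self, d_local=64, d_global=128):
        super().__init__()
        self.byte_emb = nn.Embedding(256, d_local)
        self.local_encoder = encoder(d_local, 1)     # small, FLOP-cheap, runs per byte
        self.up = nn.Linear(d_local, d_global)
        self.latent = encoder(d_global, 4)           # large, runs once per patch
        self.down = nn.Linear(d_global, d_local)
        self.local_decoder = encoder(d_local, 1)     # small, maps back to bytes
        self.byte_head = nn.Linear(d_local, 256)     # next-byte logits

    def forward(self, byte_ids, patch_ids):
        # byte_ids: (T,) byte values 0..255; patch_ids: (T,) patch index per byte
        h = self.local_encoder(self.byte_emb(byte_ids).unsqueeze(0))   # (1, T, d_local)
        n_patches = int(patch_ids.max()) + 1
        # mean-pool the byte states belonging to each patch (paper: cross-attention)
        patch_vecs = torch.stack(
            [h[0, patch_ids == p].mean(dim=0) for p in range(n_patches)]
        ).unsqueeze(0)                                                  # (1, P, d_local)
        g = self.latent(self.up(patch_vecs))                           # (1, P, d_global)
        # broadcast each patch's output back to its bytes, then decode bytes
        per_byte = self.down(g)[0, patch_ids]                          # (T, d_local)
        out = self.local_decoder(h + per_byte.unsqueeze(0))
        return self.byte_head(out).squeeze(0)                          # (T, 256) logits

# Toy usage: bytes of a string plus a toy patching (here: one patch per word).
text = b"Lazy dog jumped"
byte_ids = torch.tensor(list(text))
patch_ids = torch.tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])
logits = TinyBLT()(byte_ids, patch_ids)
</code></pre>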
I'm going to read this paper and the other one on latent sentences later today. I've always advocated that this kind of solution, together with latent sentence search, should get us to the next level of AI. Amazing work from Meta.