The way this works is awesome. If I understand correctly, the idea is that, given (part of) a sentence, the actual next token in the sequence will usually be among the model's top-scoring predictions, so most next tokens can be mapped to very small numbers (0 if the actual next token is the model's best prediction, 1 if it is the second best, ...). These small numbers can be encoded very efficiently using trivial old techniques. And boom: done.

So for instance:

> In my pasta I put a lot of [cheese]

The LLM's top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil].

The real next token is "cheese", so I'll store "1". And so forth.
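Roughly, in code, what I have in mind (just a sketch of my understanding, not the actual implementation: the gpt2 stand-in, the helper names, and skipping the entropy coder that would actually shrink the rank stream are all my own simplifications):

    # Rank-based LLM compression sketch. Assumes a Hugging Face causal LM
    # (gpt2 as a stand-in); the ranks would still need an entropy coder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def next_token_ranking(context_ids: list[int]) -> torch.Tensor:
        """Token ids sorted from most to least likely given the context."""
        with torch.no_grad():
            logits = model(torch.tensor([context_ids])).logits[0, -1]
        return torch.argsort(logits, descending=True)

    def compress(text: str) -> list[int]:
        """First token id verbatim, then the rank of each real next token."""
        ids = tokenizer.encode(text)
        out = ids[:1]
        for i in range(1, len(ids)):
            ranking = next_token_ranking(ids[:i])
            out.append((ranking == ids[i]).nonzero().item())
        return out

    def decompress(stream: list[int]) -> str:
        """Invert compress() by replaying the model and taking the rank-th token."""
        ids = stream[:1]
        for rank in stream[1:]:
            ids.append(next_token_ranking(ids)[rank].item())
        return tokenizer.decode(ids)

    ranks = compress("In my pasta I put a lot of cheese")
    print(ranks)              # mostly tiny numbers -> cheap to encode
    print(decompress(ranks))  # round-trips back to the original text

Note that the decompressor has to replay the exact same model, paying a forward pass per token on both ends.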
Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2