The way this works is awesome. If I understand correctly, the idea is that, given (part of) a sentence, the actual next token in the sequence will usually be among the model's top-scoring predictions, so most next tokens can be mapped to very small numbers (0 if the actual next token is the model's best prediction, 1 if it is the second best, ...). These small numbers can be encoded very efficiently using trivial old techniques. And boom: done.

So for instance:

> In my pasta I put a lot of [cheese]

The LLM's top N tokens for "In my pasta I put a lot of" will be [0:tomato, 1:cheese, 2:oil].

The real next token is "cheese", so I'll store "1". And so forth.
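Roughly, in code, what I have in mind (just a sketch of my understanding, not the actual implementation: the gpt2 stand-in, the helper names, and skipping the entropy coder that would actually shrink the rank stream are all my own simplifications):

    # Rank-based LLM compression sketch. Assumes a Hugging Face causal LM
    # (gpt2 as a stand-in); the ranks would still need an entropy coder.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def next_token_ranking(context_ids: list[int]) -> torch.Tensor:
        """Token ids sorted from most to least likely given the context."""
        with torch.no_grad():
            logits = model(torch.tensor([context_ids])).logits[0, -1]
        return torch.argsort(logits, descending=True)

    def compress(text: str) -> list[int]:
        """First token id verbatim, then the rank of each real next token."""
        ids = tokenizer.encode(text)
        out = ids[:1]
        for i in range(1, len(ids)):
            ranking = next_token_ranking(ids[:i])
            out.append((ranking == ids[i]).nonzero().item())
        return out

    def decompress(stream: list[int]) -> str:
        """Invert compress() by replaying the model and taking the rank-th token."""
        ids = stream[:1]
        for rank in stream[1:]:
            ids.append(next_token_ranking(ids)[rank].item())
        return tokenizer.decode(ids)

    ranks = compress("In my pasta I put a lot of cheese")
    print(ranks)              # mostly tiny numbers -> cheap to encode
    print(decompress(ranks))  # round-trips back to the original text

Note that the decompressor has to replay the exact same model, paying a forward pass per token on both ends.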
Well, this is neat, but also very computationally expensive :D So for my small ESP32 LoRa devices I used this: https://github.com/antirez/smaz2