What they mean by "intelligence is compression" is that there's actually an equivalence between the two.

See, an "intelligence" is a predictor. Like an LLM, it predicts the next char/token depending on what it's seen before.

To turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is *good*, this error stream has very low entropy and can be squeezed further with arithmetic coding or something similar.

In the case of arithmetic coding you get lossless compression. (Additionally, because arithmetic coding is based on probabilities/frequencies, a predictor that outputs a probability distribution instead of a single prediction can be plugged in directly to crunch the entropy. There's a rough sketch of this at the end of this comment.)

Now look at, for instance, the speech codecs in GSM. They use Linear Predictive Coding, which has the same structure: a predictor plus a stored error stream. Except there the residual is coarsely quantized, which makes it a form of lossy compression (second sketch below).

And yes, you can probably make a pretty good text compressor (lossless, or perhaps even lossy, but I don't think you want that for text) by using an LLM to deterministically predict the likelihood of each token and storing only the errors/differences. It should be able to outperform zip or gzip, because it can use knowledge of the language to make its predictions.

There's a catch, however: with LLM compression you also need to store all those weights somewhere, because the decoder needs the same predictor. This is always the catch with compression; there is always some kind of "model" or predictor algorithm implicit in the compressor. In gzip's case it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable. But we don't count the size of gzip toward our compressed files either, because we get to compress a lot of files with one model. The same goes for this hypothetical LLM text compression scheme; you just need to compress a whole lot more text before it's worth it.

All that said, like many others have pointed out, 78% isn't a great score for MNIST.

Then again, I also don't think gzip compression is a good similarity measure for grayscale images. For one, two pixels with grayscale values 65 and 66 are just "different bytes" to gzip, even though they're nearly identical in gray level. You might even be able to increase the score by thresholding the training set to black/white 0/255.
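
To make the predictor-plus-arithmetic-coding idea concrete, here's a minimal sketch (Python, illustration only). The "predictor" is just an adaptive character-count model; an LLM emitting token probabilities would slot into the same place. It uses exact fractions instead of the streaming bit tricks a real arithmetic coder needs, so it's slow but easy to follow, and encoder and decoder run the identical model, which is the whole point:

    from fractions import Fraction
    from math import ceil, log2

    class AdaptiveModel:
        """Toy 'predictor': symbol counts turned into cumulative probability
        intervals. A stronger predictor (e.g. an LLM over tokens) would slot in here."""
        def __init__(self, alphabet):
            self.alphabet = list(alphabet)
            self.counts = {s: 1 for s in self.alphabet}   # start smoothed, never zero

        def intervals(self):
            total = sum(self.counts.values())
            table, low = {}, Fraction(0)
            for s in self.alphabet:
                p = Fraction(self.counts[s], total)
                table[s] = (low, low + p)
                low += p
            return table

        def update(self, symbol):
            self.counts[symbol] += 1

    def encode(message, alphabet):
        model = AdaptiveModel(alphabet)
        low, high = Fraction(0), Fraction(1)
        for s in message:
            span = high - low
            sl, sh = model.intervals()[s]
            low, high = low + span * sl, low + span * sh
            model.update(s)                 # decoder makes the exact same update
        return low, high                    # any number in [low, high) encodes the message

    def decode(value, length, alphabet):
        model = AdaptiveModel(alphabet)
        lo, hi = Fraction(0), Fraction(1)
        out = []
        for _ in range(length):
            span = hi - lo
            target = (value - lo) / span
            for s, (sl, sh) in model.intervals().items():
                if sl <= target < sh:       # which symbol's interval contains the code?
                    out.append(s)
                    lo, hi = lo + span * sl, lo + span * sh
                    model.update(s)
                    break
        return "".join(out)

    msg = "hello hello hello"
    alphabet = sorted(set(msg))
    low, high = encode(msg, alphabet)
    bits = ceil(-log2(float(high - low)))   # roughly the code length a real coder would emit
    print(f"~{bits} bits vs {8 * len(msg)} bits raw")
    print(decode(low, len(msg), alphabet) == msg)    # True: lossless round trip

The "error stream" view and the probability view are the same thing here: the more probability the model puts on the symbol that actually occurs, the less the interval shrinks and the fewer bits you pay.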
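
And a sketch of the lossy variant in the spirit of LPC. This is not the actual GSM codec (which fits a higher-order predictor per frame); it's just a first-order "predict the previous sample" DPCM loop, enough to show why a coarsely quantized error stream is cheap to store:

    import numpy as np

    # a smooth, speech-like test signal: neighbouring samples are highly correlated
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 800)
    x = np.sin(2 * np.pi * 5 * t) + 0.05 * rng.standard_normal(800)

    # closed-loop DPCM: predict each sample as the previous *reconstructed* sample
    # and quantize only the prediction error, so quantization error can't accumulate
    step = 0.05                          # coarse quantizer -> lossy
    recon = np.empty_like(x)
    residuals = np.empty_like(x)         # this error stream is what you'd store/transmit
    prev = 0.0
    for i in range(len(x)):
        e = x[i] - prev                  # prediction error
        q = np.round(e / step) * step    # quantized error
        residuals[i] = q
        recon[i] = prev + q              # the decoder reconstructs from q alone
        prev = recon[i]

    print(x.std(), residuals.std())      # residual spread is much smaller than the signal's
    print(np.abs(recon - x).max())       # reconstruction error bounded by step / 2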