What they mean by "intelligence is compression" is that there's actually an equivalence between the two.

See, an "intelligence" is a predictor. Like an LLM, it predicts the next char/token depending on what it's seen before.

To turn this into a compressor, you store the "difference" between the predicted token and the actual token. If the predictor is *good*, this error stream has very low entropy and can be squeezed further with arithmetic coding or something similar.

In the case of arithmetic coding you get lossless compression. (Additionally, because arithmetic coding is based on probabilities/frequencies, a predictor that outputs a probability distribution instead of a single prediction can be plugged in directly to crunch the entropy. There's a rough sketch of this at the end of this comment.)

Now look at, for instance, the speech codecs in GSM. They use Linear Predictive Coding, which has the same structure: a predictor plus a stored error stream. Except there the residual is coarsely quantized, which makes it a form of lossy compression (second sketch below).

And yes, you can probably make a pretty good text compressor (lossless, or perhaps even lossy, but I don't think you want that for text) by using an LLM to deterministically predict the likelihood of each token and storing only the errors/differences. It should be able to outperform zip or gzip, because it can use knowledge of the language to make its predictions.

There's a catch, however: with LLM compression you also need to store all those weights somewhere, because the decoder needs the same predictor. This is always the catch with compression; there is always some kind of "model" or predictor algorithm implicit in the compressor. In gzip's case it's a model that says "strings tend to repeat in patterns", which is of course "stored" (in some sense) in the gzip executable. But we don't count the size of gzip toward our compressed files either, because we get to compress a lot of files with one model. The same goes for this hypothetical LLM text compression scheme; you just need to compress a whole lot more text before it's worth it.

All that said, like many others have pointed out, 78% isn't a great score for MNIST.

Then again, I also don't think gzip compression is a good similarity measure for grayscale images. For one, two pixels with grayscale values 65 and 66 are just "different bytes" to gzip, even though they're nearly identical in gray level. You might even be able to increase the score by thresholding the training set to black/white 0/255.
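
To make the predictor-plus-arithmetic-coding idea concrete, here's a minimal sketch (Python, illustration only). The "predictor" is just an adaptive character-count model; an LLM emitting token probabilities would slot into the same place. It uses exact fractions instead of the streaming bit tricks a real arithmetic coder needs, so it's slow but easy to follow, and encoder and decoder run the identical model, which is the whole point:

    from fractions import Fraction
    from math import ceil, log2

    class AdaptiveModel:
        """Toy 'predictor': symbol counts turned into cumulative probability
        intervals. A stronger predictor (e.g. an LLM over tokens) would slot in here."""
        def __init__(self, alphabet):
            self.alphabet = list(alphabet)
            self.counts = {s: 1 for s in self.alphabet}   # start smoothed, never zero

        def intervals(self):
            total = sum(self.counts.values())
            table, low = {}, Fraction(0)
            for s in self.alphabet:
                p = Fraction(self.counts[s], total)
                table[s] = (low, low + p)
                low += p
            return table

        def update(self, symbol):
            self.counts[symbol] += 1

    def encode(message, alphabet):
        model = AdaptiveModel(alphabet)
        low, high = Fraction(0), Fraction(1)
        for s in message:
            span = high - low
            sl, sh = model.intervals()[s]
            low, high = low + span * sl, low + span * sh
            model.update(s)                 # decoder makes the exact same update
        return low, high                    # any number in [low, high) encodes the message

    def decode(value, length, alphabet):
        model = AdaptiveModel(alphabet)
        lo, hi = Fraction(0), Fraction(1)
        out = []
        for _ in range(length):
            span = hi - lo
            target = (value - lo) / span
            for s, (sl, sh) in model.intervals().items():
                if sl <= target < sh:       # which symbol's interval contains the code?
                    out.append(s)
                    lo, hi = lo + span * sl, lo + span * sh
                    model.update(s)
                    break
        return "".join(out)

    msg = "hello hello hello"
    alphabet = sorted(set(msg))
    low, high = encode(msg, alphabet)
    bits = ceil(-log2(float(high - low)))   # roughly the code length a real coder would emit
    print(f"~{bits} bits vs {8 * len(msg)} bits raw")
    print(decode(low, len(msg), alphabet) == msg)    # True: lossless round trip

The "error stream" view and the probability view are the same thing here: the more probability the model puts on the symbol that actually occurs, the less the interval shrinks and the fewer bits you pay.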
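
And a sketch of the lossy variant in the spirit of LPC. This is not the actual GSM codec (which fits a higher-order predictor per frame); it's just a first-order "predict the previous sample" DPCM loop, enough to show why a coarsely quantized error stream is cheap to store:

    import numpy as np

    # a smooth, speech-like test signal: neighbouring samples are highly correlated
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 800)
    x = np.sin(2 * np.pi * 5 * t) + 0.05 * rng.standard_normal(800)

    # closed-loop DPCM: predict each sample as the previous *reconstructed* sample
    # and quantize only the prediction error, so quantization error can't accumulate
    step = 0.05                          # coarse quantizer -> lossy
    recon = np.empty_like(x)
    residuals = np.empty_like(x)         # this error stream is what you'd store/transmit
    prev = 0.0
    for i in range(len(x)):
        e = x[i] - prev                  # prediction error
        q = np.round(e / step) * step    # quantized error
        residuals[i] = q
        recon[i] = prev + q              # the decoder reconstructs from q alone
        prev = recon[i]

    print(x.std(), residuals.std())      # residual spread is much smaller than the signal's
    print(np.abs(recon - x).max())       # reconstruction error bounded by step / 2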