Two points worth noting:<p>1. Gzip is not a suitable compressor for this use case, because it's limited to a 32KB window. So the input can only be correlated with the last 32KB of the reference texts.<p>2. You can save a great deal in computation by avoiding recompressing the reference texts over and over and over. Some compression algorithms support checkpointing the compression state so that it can be resumed from that point repeatedly ("dictionary-based compression", which is a distinct capability from just streaming compression, which generally can only be continued once).<p>I would personally shill for using Zstandard [0] instead for this purpose. Although I should disclose my bias: I'm a developer of Zstd. A few salient facts:<p>1. Zstd supports very large windows (up to 128MB, or up to 2GB in long mode).<p>2. Zstd is much faster than zlib.<p>3. Zstd has well-developed support for dictionary-based compression.<p>4. Additionally, it has a dictionary trainer that can reduce a corpus of reference documents to a compact summary document that aims to capture as much of the content as possible of the reference corpus. [1]<p>5. It has (more than one) python binding available. [2][3]<p>[0] <a href="https://github.com/facebook/zstd" rel="nofollow">https://github.com/facebook/zstd</a><p>[1] <a href="https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40" rel="nofollow">https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40</a><p>[2] <a href="https://pypi.org/project/zstandard" rel="nofollow">https://pypi.org/project/zstandard</a><p>[3] <a href="https://pypi.org/project/zstd" rel="nofollow">https://pypi.org/project/zstd</a>
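To make the dictionary-based route concrete, here is a minimal sketch using the zstandard binding [2]. train_dictionary and ZstdCompressor(dict_data=...) are real APIs from that package; the labels and corpora are placeholders for illustration, and dictionary size is a tunable guess:

    import zstandard as zstd

    # One list of sample documents (bytes) per label; placeholders here.
    corpora = {
        "sports":  [b"...", b"..."],
        "finance": [b"...", b"..."],
    }

    # Reduce each label's corpus to a compact dictionary (16 KB chosen arbitrarily).
    dicts = {
        label: zstd.train_dictionary(16 * 1024, samples)
        for label, samples in corpora.items()
    }

    def classify(document: bytes) -> str:
        # The label whose dictionary compresses the document best wins.
        sizes = {
            label: len(zstd.ZstdCompressor(dict_data=d).compress(document))
            for label, d in dicts.items()
        }
        return min(sizes, key=sizes.get)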
The University of Waikato in New Zealand has done a lot of research on using compression for named-entity tagging (names, locations, dates, people, and so on).<p>While it's not the best-performing paradigm for text sequence tagging, it is intellectually intriguing, as you say, because of the parallel between the concepts of "compression" and "understanding", even in the human brain. If we can't understand something, we have to memorize it; if we understand it, it takes hardly any space or cognitive load at all: basically just a name that is well-linked to other concepts.
You don't have to recompress the whole corpus to add a single document to it. All the compression algorithms mentioned here work in a streaming fashion. You could "just" save the internal state of the algorithm after compressing the training data, and then reuse that state for each classification task.
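With zlib that looks roughly like this (labeled_corpora is a placeholder mapping of label to training bytes; Compress.copy() is what makes the saved state reusable):

    import zlib

    def added_size(compressor, data: bytes) -> int:
        # Extra bytes needed to encode `data` on top of the saved state.
        baseline = len(compressor.copy().flush())
        with_doc = compressor.copy()
        return len(with_doc.compress(data)) + len(with_doc.flush()) - baseline

    # Pay the cost of compressing each labeled corpus exactly once.
    states = {}
    for label, corpus in labeled_corpora.items():   # placeholder: label -> corpus bytes
        c = zlib.compressobj()
        c.compress(corpus)   # output discarded; we only want the primed state
        states[label] = c

    def classify(document: bytes) -> str:
        # copy() clones the internal state, so each classification starts
        # from the fully primed compressor without recompressing the corpus.
        sizes = {label: added_size(c, document) for label, c in states.items()}
        return min(sizes, key=sizes.get)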
Either I'm missing something, or the author could have used Kolmogorov complexity (0), normalized compression distance (1), the research of Rudi Cilibrasi and Paul Vitányi, and some of the work that came after it (2).<p>(0) <a href="https://en.m.wikipedia.org/wiki/Kolmogorov_complexity" rel="nofollow">https://en.m.wikipedia.org/wiki/Kolmogorov_complexity</a><p>(1) <a href="https://en.m.wikipedia.org/wiki/Normalized_compression_distance" rel="nofollow">https://en.m.wikipedia.org/wiki/Normalized_compression_dista...</a><p>(2) for example: <a href="https://www.sciencedirect.com/science/article/pii/S0379073813000923" rel="nofollow">https://www.sciencedirect.com/science/article/pii/S037907381...</a>
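For reference, the normalized compression distance from (1) is easy to sketch with any off-the-shelf compressor (gzip here purely for illustration):

    import gzip

    def csize(data: bytes) -> int:
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy, cxy = csize(x), csize(y), csize(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)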
I've made use of this ability of compression algorithms to detect redundancy and similarity in a slightly different way: compress a codebase, and the most compressible source files are also likely candidates for refactoring.
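A rough sketch of that heuristic (the directory, glob pattern, and "top 10" cutoff are arbitrary):

    import gzip
    from pathlib import Path

    def ratio(path: Path) -> float:
        raw = path.read_bytes()
        return len(gzip.compress(raw)) / max(len(raw), 1)

    # Lowest ratio = most compressible = most internal repetition.
    files = sorted(Path("src").rglob("*.py"), key=ratio)
    for f in files[:10]:
        print(f"{ratio(f):.2f}  {f}")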
> "Take a labeled training set of documents... Return the label for which the size increased the least."<p>that's not unlike using an autoencoder to score a document as an anomaly amongst the other labels?
The Hutter Prize [0] is based on the idea that compressing Wikipedia well will lead to more advanced AI research. It doesn't seem to have done so yet, though the winning solutions aren't all open source, so it's hard to say; it's definitely along the same lines, though.<p>[0] <a href="https://en.wikipedia.org/wiki/Hutter_Prize" rel="nofollow">https://en.wikipedia.org/wiki/Hutter_Prize</a>
The theory behind this kind of approach is actually very deep. If we had a “perfect compressor” that returned the shortest halting program (on a prefix-free universal Turing machine) that outputs the original dataset, this would essentially be the best classifier we could create. But as the author notes, there are some heavy computational penalties to overcome, even to reach an approximation.
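Written out (my paraphrase, with C the size a real compressor produces, K the ideal Kolmogorov code length, and c_l the concatenated training documents for label l), the article's rule is roughly:

    \ell^*(x) = \arg\min_{\ell} \big[\, C(c_\ell \cdot x) - C(c_\ell) \,\big] \approx \arg\min_{\ell} K(x \mid c_\ell)

The approximation on the right only holds up to logarithmic terms, and the gap between C and K is exactly where those computational penalties live.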
I work for an educational company, and a proprietary reading-difficulty metric we license uses how well the text gzips as one part of its scoring system.<p>The better it gzips, the easier it is to read.<p>It's just one part of the score; I'm not certain how much weight it carries.
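I have no idea what the licensed metric actually computes, but a gzip-based feature presumably looks something like a plain compression ratio:

    import gzip

    def gzip_ratio(text: str) -> float:
        raw = text.encode("utf-8")
        # Lower ratio = more redundancy, which (per the above) correlates
        # with easier reading.
        return len(gzip.compress(raw)) / len(raw)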
It is worth pointing out that Yann LeCun, a prominent ML researcher, also worked on DjVu, an image compression format.<p>cf. <a href="https://en.wikipedia.org/wiki/DjVu" rel="nofollow">https://en.wikipedia.org/wiki/DjVu</a>
I took a course called "Information Theoretic Data Mining" that covered this and similar techniques. You can see a bunch of relevant references on the course site: <a href="https://eda.liacs.nl/teaching/itdm/" rel="nofollow">https://eda.liacs.nl/teaching/itdm/</a>.
I really enjoyed the article; it is (in my opinion) a nice case of curiosity-driven research.<p>I was wondering: what if someone wanted to try a similar approach on images rather than text? What kind of image compression algorithms (if any) would be a good fit for this scenario?
> However, it would most probably not be worth using such an approach because of the prohibitive computational cost.<p>Looking at the current state of the art (Transformers, etc.), computational cost is certainly not the main issue.
The same technique was extrapolated to images in this paper, where a CDN's corpus of cached images was classified into buckets and the optimizations found for each bucket's exemplar image were applied to the other images in that bucket.
Cool idea! Shouldn't this also work by concatenating the single document you want to classify with the already-compressed version of the concatenated class corpus (saving compute time)?
Redundant information in natural language is often the most important information. Now imagine the compression algorithm's input tokens being parsed, stemmed words instead of a raw byte stream; you can get some interesting concept maps pretty easily.
I worked for an internet scraping/statistics-gathering company some years ago, and we used this approach alongside a few others to find mailing addresses embedded in websites. Basically: use LZW-type compression with entropy information trained only on known addresses, then compress a document and look for the section of the document with the highest compression ratio.<p>It worked decently well, and surprisingly better than a lot of other, more standard approaches, simply because of the wild non-uniformity of human-generated content on the web.
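I don't know the original stack, but the same idea can be sketched with zlib's preset-dictionary support (zdict) standing in for the LZW model; the dictionary content, window size, and step are placeholders:

    import zlib

    # Known addresses concatenated as the preset dictionary (placeholder content).
    ADDRESS_DICT = b"123 Main St\nPO Box 42\nSpringfield, IL 62704\n"

    def compressed_size(chunk: bytes) -> int:
        c = zlib.compressobj(zdict=ADDRESS_DICT)
        return len(c.compress(chunk)) + len(c.flush())

    def most_address_like(document: bytes, window: int = 200, step: int = 50) -> bytes:
        # Slide a window over the document; the chunk that compresses best
        # against the address-trained dictionary is the likeliest address.
        return min(
            (document[i:i + window]
             for i in range(0, max(len(document) - window, 1), step)),
            key=lambda chunk: compressed_size(chunk) / max(len(chunk), 1),
        )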
This is truly interesting. Does this work for very long documents, i.e. an entire book? And if so, why are NLP researchers still struggling to increase transformer context width? The answer is here.