Two points worth noting:<p>1. Gzip is not a suitable compressor for this use case, because it's limited to a 32KB window. So the input can only be correlated with the last 32KB of the reference texts.<p>2. You can save a great deal in computation by avoiding recompressing the reference texts over and over and over. Some compression algorithms support checkpointing the compression state so that it can be resumed from that point repeatedly ("dictionary-based compression", which is a distinct capability from just streaming compression, which generally can only be continued once).<p>I would personally shill for using Zstandard [0] instead for this purpose. Although I should disclose my bias: I'm a developer of Zstd. A few salient facts:<p>1. Zstd supports very large windows (up to 128MB, or up to 2GB in long mode).<p>2. Zstd is much faster than zlib.<p>3. Zstd has well-developed support for dictionary-based compression.<p>4. Additionally, it has a dictionary trainer that can reduce a corpus of reference documents to a compact summary document that aims to capture as much of the content as possible of the reference corpus. [1]<p>5. It has (more than one) python binding available. [2][3]<p>[0] <a href="https://github.com/facebook/zstd" rel="nofollow">https://github.com/facebook/zstd</a><p>[1] <a href="https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40" rel="nofollow">https://github.com/facebook/zstd/blob/dev/lib/zdict.h#L40</a><p>[2] <a href="https://pypi.org/project/zstandard" rel="nofollow">https://pypi.org/project/zstandard</a><p>[3] <a href="https://pypi.org/project/zstd" rel="nofollow">https://pypi.org/project/zstd</a>
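To make the dictionary-based route concrete, here is a minimal sketch using the zstandard binding [2]. train_dictionary and ZstdCompressor(dict_data=...) are real APIs from that package; the labels and corpora are placeholders for illustration, and dictionary size is a tunable guess:

    import zstandard as zstd

    # One list of sample documents (bytes) per label; placeholders here.
    corpora = {
        "sports":  [b"...", b"..."],
        "finance": [b"...", b"..."],
    }

    # Reduce each label's corpus to a compact dictionary (16 KB chosen arbitrarily).
    dicts = {
        label: zstd.train_dictionary(16 * 1024, samples)
        for label, samples in corpora.items()
    }

    def classify(document: bytes) -> str:
        # The label whose dictionary compresses the document best wins.
        sizes = {
            label: len(zstd.ZstdCompressor(dict_data=d).compress(document))
            for label, d in dicts.items()
        }
        return min(sizes, key=sizes.get)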
The University of Waikato in New Zealand has done a lot of research on using compression for named-entity tagging (names, locations, dates, people, and so on).<p>While it's not the best-performing paradigm for text sequence tagging, it is intellectually intriguing, as you say, because of the parallel between the concepts of "compression" and "understanding", even in the human brain. If we can't understand something, we have to memorize it; if we understand it, it takes hardly any space or cognitive load at all: basically just a name that is well-linked to other concepts.
You don't have to recompress the whole corpus to add a single document to it. All the compression algorithms mentioned here work in a streaming fashion. You could "just" save the internal state of the algorithm after compressing the training data, and then reuse that state for each classification task.
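With zlib that looks roughly like this (labeled_corpora is a placeholder mapping of label to training bytes; Compress.copy() is what makes the saved state reusable):

    import zlib

    def added_size(compressor, data: bytes) -> int:
        # Extra bytes needed to encode `data` on top of the saved state.
        baseline = len(compressor.copy().flush())
        with_doc = compressor.copy()
        return len(with_doc.compress(data)) + len(with_doc.flush()) - baseline

    # Pay the cost of compressing each labeled corpus exactly once.
    states = {}
    for label, corpus in labeled_corpora.items():   # placeholder: label -> corpus bytes
        c = zlib.compressobj()
        c.compress(corpus)   # output discarded; we only want the primed state
        states[label] = c

    def classify(document: bytes) -> str:
        # copy() clones the internal state, so each classification starts
        # from the fully primed compressor without recompressing the corpus.
        sizes = {label: added_size(c, document) for label, c in states.items()}
        return min(sizes, key=sizes.get)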
Either I'm missing something, or the author could have used Kolmogorov complexity (0), normalized compression distance (1), the research of Rudi Cilibrasi and Paul Vitányi, and some of the work that came after it (2).<p>(0) <a href="https://en.m.wikipedia.org/wiki/Kolmogorov_complexity" rel="nofollow">https://en.m.wikipedia.org/wiki/Kolmogorov_complexity</a><p>(1) <a href="https://en.m.wikipedia.org/wiki/Normalized_compression_distance" rel="nofollow">https://en.m.wikipedia.org/wiki/Normalized_compression_dista...</a><p>(2) for example: <a href="https://www.sciencedirect.com/science/article/pii/S0379073813000923" rel="nofollow">https://www.sciencedirect.com/science/article/pii/S037907381...</a>
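For reference, the normalized compression distance from (1) is easy to sketch with any off-the-shelf compressor (gzip here purely for illustration):

    import gzip

    def csize(data: bytes) -> int:
        return len(gzip.compress(data))

    def ncd(x: bytes, y: bytes) -> float:
        # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy, cxy = csize(x), csize(y), csize(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)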
I've made use of this ability of compression algorithms to detect redundancy and similarity in a slightly different way: compress a codebase, and the most compressible source files are also likely candidates for refactoring.
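A rough sketch of that heuristic (the directory, glob pattern, and "top 10" cutoff are arbitrary):

    import gzip
    from pathlib import Path

    def ratio(path: Path) -> float:
        raw = path.read_bytes()
        return len(gzip.compress(raw)) / max(len(raw), 1)

    # Lowest ratio = most compressible = most internal repetition.
    files = sorted(Path("src").rglob("*.py"), key=ratio)
    for f in files[:10]:
        print(f"{ratio(f):.2f}  {f}")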
> "Take a labeled training set of documents... Return the label for which the size increased the least."<p>that's not unlike using an autoencoder to score a document as an anomaly amongst the other labels?
The Hutter Prize [0] is based on the idea that compressing Wikipedia well will lead to more advanced AI research. It doesn't seem to have done so yet, though the winning solutions aren't all open source, so it's hard to say; it's definitely along the same lines, though.<p>[0] <a href="https://en.wikipedia.org/wiki/Hutter_Prize" rel="nofollow">https://en.wikipedia.org/wiki/Hutter_Prize</a>
The theory behind this kind of approach is actually very deep. If we had a “perfect compressor” that returned the shortest halting program (on a prefix-free universal Turing machine) that outputs the original dataset, this would essentially be the best classifier we could create. But as the author notes, there are some heavy computational penalties to overcome, even to reach an approximation.
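Written out (my paraphrase, with C the size a real compressor produces, K the ideal Kolmogorov code length, and c_l the concatenated training documents for label l), the article's rule is roughly:

    \ell^*(x) = \arg\min_{\ell} \big[\, C(c_\ell \cdot x) - C(c_\ell) \,\big] \approx \arg\min_{\ell} K(x \mid c_\ell)

The approximation on the right only holds up to logarithmic terms, and the gap between C and K is exactly where those computational penalties live.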
I work for an educational company, and a proprietary reading-difficulty metric we license uses how well the text gzips as one part of its scoring system.<p>The better it gzips, the easier it is to read.<p>It's just one part of the score; I'm not certain how much weight it carries.
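I have no idea what the licensed metric actually computes, but a gzip-based feature presumably looks something like a plain compression ratio:

    import gzip

    def gzip_ratio(text: str) -> float:
        raw = text.encode("utf-8")
        # Lower ratio = more redundancy, which (per the above) correlates
        # with easier reading.
        return len(gzip.compress(raw)) / len(raw)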
It is worth pointing out that Yann LeCun, a prominent ML researcher, also worked on DjVu, an image compression format.<p>cf. <a href="https://en.wikipedia.org/wiki/DjVu" rel="nofollow">https://en.wikipedia.org/wiki/DjVu</a>
I took a course called "Information Theoretic Data Mining" that covered this and similar techniques. You can see a bunch of relevant references on the course site: <a href="https://eda.liacs.nl/teaching/itdm/" rel="nofollow">https://eda.liacs.nl/teaching/itdm/</a>.
I really enjoyed the article; it is (in my opinion) a nice case of curiosity-driven research.<p>I was wondering: what if someone wanted to try a similar approach on images rather than text? What kind of image compression algorithms (if any) would be a good fit for this scenario?
> However, it would most probably not be worth using such an approach because of the prohibitive computational cost.<p>Looking at the current state of the art (Transformers, etc.), computational cost is certainly not the main issue.
The same technique was extrapolated to images in this paper, where a CDN's corpus of cached images was classified into buckets and the optimizations found for each bucket's exemplar image were applied to the other images in that bucket.
Cool idea! Shouldn't this also work by concatenating the single document you want to classify with the already-compressed version of the concatenated class corpus (saving compute time)?
Redundant information in natural language is often the most important information. Now imagine the compression algorithm's input tokens being parsed, stemmed words instead of a raw byte stream; you can get some interesting concept maps pretty easily.
I worked for an internet scraping/statistics-gathering company some years ago, and we used this approach alongside a few others to find mailing addresses embedded in websites. Basically: use LZW-type compression with entropy information trained only on known addresses, then compress a document and look for the section of the document with the highest compression ratio.<p>It worked decently well, and surprisingly better than a lot of other, more standard approaches, simply because of the wild non-uniformity of human-generated content on the web.
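I don't know the original stack, but the same idea can be sketched with zlib's preset-dictionary support (zdict) standing in for the LZW model; the dictionary content, window size, and step are placeholders:

    import zlib

    # Known addresses concatenated as the preset dictionary (placeholder content).
    ADDRESS_DICT = b"123 Main St\nPO Box 42\nSpringfield, IL 62704\n"

    def compressed_size(chunk: bytes) -> int:
        c = zlib.compressobj(zdict=ADDRESS_DICT)
        return len(c.compress(chunk)) + len(c.flush())

    def most_address_like(document: bytes, window: int = 200, step: int = 50) -> bytes:
        # Slide a window over the document; the chunk that compresses best
        # against the address-trained dictionary is the likeliest address.
        return min(
            (document[i:i + window]
             for i in range(0, max(len(document) - window, 1), step)),
            key=lambda chunk: compressed_size(chunk) / max(len(chunk), 1),
        )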
This is truly interesting. Does this work for very long documents, i.e. an entire book? And if so, why are NLP researchers still struggling to increase transformer context width? The answer is here.