This is clever -- and useful in many settings that preclude the use of a deep neural network for classification.<p>Intuitively, the key idea is that if you have two documents, say, <i>x1</i> and <i>x2</i>, and a target document <i>y</i>, and <i>x1</i>'s statistical regularities are more similar to <i>y</i>'s than <i>x2</i>'s are, then <i>len(compress(x1+y)) - len(compress(y)) < len(compress(x2+y)) - len(compress(y))</i>, where "<i>+</i>" means concatenation and "<i>compress</i>" is a compression program like gzip.<p><i>len(compress(x1+y)) - len(compress(y))</i> is, quite literally, the number of additional bytes needed to compress <i>x1</i> given the statistical regularities already captured when compressing <i>y</i>. The more similar the statistical regularities of <i>x1</i> and <i>y</i>, the fewer additional bytes are needed.<p>Based on this idea, the authors run kNN with a distance function called normalized compression distance (NCD). Remarkably, this simple, intuitive method outperforms BERT on a variety of few-shot classification tasks!
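<p>For concreteness, here is a minimal sketch of the idea in Python. It is not the authors' exact code; the helper names, the toy training data, and the choice of <i>k</i> are illustrative assumptions, but the NCD formula and the gzip-based kNN classification follow the description above.<p><pre><code>
import gzip

def clen(s: str) -> int:
    # Length in bytes of the gzip-compressed UTF-8 encoding of s.
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: smaller means the two texts
    # share more statistical regularities under gzip.
    cx, cy, cxy = clen(x), clen(y), clen(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(y: str, train: list[tuple[str, str]], k: int = 3) -> str:
    # kNN over NCD: find the k labeled documents closest to y
    # and return the most common label among them.
    neighbors = sorted(train, key=lambda pair: ncd(pair[0], y))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)

# Toy usage (hypothetical data): two labeled documents, one query.
train = [
    ("the stock market fell sharply amid inflation fears", "finance"),
    ("the team scored in the final minute to win the match", "sports"),
]
print(classify("shares dropped as investors worried about rates", train, k=1))
</code></pre><p>No training step is involved: classification is just compression plus nearest-neighbor lookup, which is why the method is attractive where training or deploying a neural model is impractical.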