Classification via compression
topic: Machine Learning
source: Text classification with Python 3.14's zstd module • Max Halford
In general, the text is compressed with zstd using a custom dictionary, where the dictionary is built from the target text you want to measure similarity against. You then compress every text in the corpus with that dictionary and keep the one with the smallest compressed size.
This exploits the fact that the more a text's content overlaps with the dictionary, the smaller its compressed size will be.
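A minimal sketch of the dictionary trick, assuming Python 3.14's compression.zstd module (PEP 784); it uses the target text itself as a raw dictionary rather than a trained one, and the sample strings are invented for illustration:

```python
# Sketch only: assumes Python >= 3.14's compression.zstd (PEP 784).
# ZstdDict(..., is_raw=True) treats the target text itself as the
# dictionary; texts sharing content with it compress to fewer bytes.
from compression import zstd

target = (
    "compression based similarity: texts that share content with the "
    "dictionary compress to fewer bytes than unrelated texts do"
) * 4
dictionary = zstd.ZstdDict(target.encode(), is_raw=True)

corpus = [
    "texts that share content with the dictionary compress to fewer bytes",
    "the ferry to the island leaves twice a day in the winter season",
]

# Compress each candidate against the dictionary; the most similar
# text yields the smallest output.
sizes = {t: len(zstd.compress(t.encode(), zstd_dict=dictionary)) for t in corpus}
print(min(sizes, key=sizes.get))
```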
From the abstract of "'Low-Resource' Text Classification: A Parameter-Free Classification Method with Compressors" (Jiang et al., Findings of ACL 2023), the paper behind this idea:
Deep neural networks (DNNs) are often used for text classification due to their high accuracy. However, DNNs can be computationally intensive, requiring millions of parameters and large amounts of labeled data, which can make them expensive to use, to optimize, and to transfer to out-of-distribution (OOD) cases in practice. In this paper, we propose a non-parametric alternative to DNNs that's easy, lightweight, and universal in text classification: a combination of a simple compressor like gzip with a k-nearest-neighbor classifier. Without any training parameters, our method achieves results that are competitive with non-pretrained deep learning methods on six in-distribution datasets. It even outperforms BERT on all five OOD datasets, including four low-resource languages. Our method also excels in the few-shot setting, where labeled data are too scarce to train DNNs effectively.
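A minimal sketch of that recipe: gzip-compressed lengths feed NCD into a k-nearest-neighbour vote. The tiny training set, labels, and choice of k are made up for illustration:

```python
# NCD + kNN text classification in the spirit of the paper;
# the toy training data below is invented.
import gzip

def clen(s: str) -> int:
    # Compressed length in bytes.
    return len(gzip.compress(s.encode()))

def ncd(x: str, y: str) -> float:
    # Normalised Compression Distance: how much extra it costs to
    # compress x and y together versus separately.
    cx, cy = clen(x), clen(y)
    return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

train = [
    ("the striker scored twice in the second half", "sports"),
    ("the midfielder was booked for a late tackle", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("bond yields climbed after the inflation report", "finance"),
]

def classify(text: str, k: int = 3) -> str:
    # Majority vote among the k nearest neighbours by NCD.
    labels = [lbl for _, lbl in sorted(train, key=lambda p: ncd(text, p[0]))[:k]]
    return max(set(labels), key=labels.count)

print(classify("the goalkeeper saved a penalty"))  # expected: sports
```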
This is also applicable to image classification.
78% MNIST accuracy using GZIP in under 10 lines of code. | Jakob Serlier's Personal Site
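The same NCD recipe can hypothetically be sketched for images by treating each image's raw pixel bytes as the string to compress (this is not the post's actual code; the 28x28 8-bit arrays below are synthetic stand-ins for MNIST digits):

```python
# Hypothetical sketch: NCD between images via their raw pixel bytes.
import gzip
import numpy as np

def ncd_bytes(a: bytes, b: bytes) -> float:
    ca, cb = len(gzip.compress(a)), len(gzip.compress(b))
    cab = len(gzip.compress(a + b))
    return (cab - min(ca, cb)) / max(ca, cb)

rng = np.random.default_rng(0)
img_a = (rng.random((28, 28)) > 0.5).astype(np.uint8) * 255
img_b = img_a.copy()
img_b[:4] = 0  # near-duplicate of img_a
img_c = (rng.random((28, 28)) > 0.5).astype(np.uint8) * 255

print(ncd_bytes(img_a.tobytes(), img_b.tobytes()))  # small: similar images
print(ncd_bytes(img_a.tobytes(), img_c.tobytes()))  # near 1: unrelated
```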
Key principles
- Normalised Compression Distance (NCD): measures the distance between two objects by how much more it costs to compress them together than to compress them separately (see the formula below)
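Written out, using the standard definition, where C(·) denotes compressed length in bytes:

```latex
% NCD is near 0 for near-identical inputs and near 1 for unrelated ones.
\mathrm{NCD}(x, y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
```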
Related: