An Example of Gzip Based Text Classification in 58 Lines of Code

1 point by sonicrocketman, almost 2 years ago

2 comments

bediger4000, almost 2 years ago
I don't know if this is similar to https://www.researchgate.net/publication/10744818_Chain_Letters_and_Evolutionary_Histories, which, if I recall correctly, drew an evolutionary tree of old-school postal chain letters using normalized compression distance.

I've also wondered if gzip is the best way of doing the compression. It compresses by block, and the decompression table(s) are in the file. It's not terribly hard to write a Huffman encoder where the character frequency table resides in a separate file. Huffman encoding isn't done by block, so it's inefficient compared to gzip as a compression method, but normalized compression distance isn't really reliant on the best possible compression.
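
For readers unfamiliar with the term, normalized compression distance is usually defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. Below is a minimal Python sketch using the standard-library gzip module. The function names and the choice to join the two texts with a single space are assumptions made for illustration, not taken from the linked code or paper.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed input (header included)."""
    return len(gzip.compress(data))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    Smaller values mean the compressor found more shared structure.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    # Joining with a space is an illustrative assumption.
    cxy = compressed_size(bx + b" " + by)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Swapping in a different compressor only requires replacing `compressed_size`, which is one way to experiment with the question raised above about whether gzip is the best choice.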
sonicrocketman, almost 2 years ago
Hey all,

Don't be too harsh on the code. It was thrown together in an hour.

I used the method from the paper (linked on HN earlier this week) but had to tweak the code, because it was unclear what the input types should be and it had a few bugs. Hopefully I didn't break it in the process.

The classifier seems to work fine and gave me seemingly reasonable results in my admittedly limited testing.

Mostly this was an exercise to teach myself about this method, though I do have a possible use case for it in the future.

Very cool to see such a classic CS problem applied to new domains. Feels like a programmer's version of pharmaceutical drug repurposing.
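
As a rough idea of what a gzip-based classifier can look like, here is a minimal Python sketch that pairs a compression distance with a k-nearest-neighbor vote. It is not the 58-line code from the post, nor the paper's exact code; the names `ncd` and `classify`, the `(text, label)` training format, and the default `k` are all assumptions made for illustration.

```python
import gzip
from collections import Counter


def _clen(text: str) -> int:
    """Compressed length of a string, in bytes."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(a: str, b: str) -> float:
    """Normalized compression distance between two strings."""
    ca, cb = _clen(a), _clen(b)
    cab = _clen(a + " " + b)  # joining with a space is an assumption
    return (cab - min(ca, cb)) / max(ca, cb)


def classify(sample: str, training: list[tuple[str, str]], k: int = 3) -> str:
    """Return the majority label among the k training texts closest to sample."""
    nearest = sorted(training, key=lambda pair: ncd(sample, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tied vote, the label of the closest neighbor wins, since
    # Counter preserves insertion order among equal counts.
    return votes.most_common(1)[0][0]
```

A call looks like `classify("some new text", [("example text", "label"), ...], k=1)`. Very short inputs tend to give noisy distances, because the fixed gzip header overhead dominates the compressed size.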