I don't know if this is similar to https://www.researchgate.net/publication/10744818_Chain_Letters_and_Evolutionary_Histories, which, if I recall correctly, drew an evolutionary tree of old-school postal chain letters using normalized compression distance.

I've also wondered whether gzip is the best choice of compressor here. It compresses block by block, and the decompression tables are stored inside the file itself. It's not terribly hard to write a Huffman encoder where the character frequency table resides in a separate file. Huffman coding isn't block-based, so it's inefficient compared to gzip as a compressor, but normalized compression distance isn't really reliant on best-possible compression.
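For concreteness, the distance itself is just compressed sizes plugged into a ratio. A minimal Python sketch, assuming gzip as the compressor (the function names are mine; note gzip's fixed header overhead means even identical inputs score slightly above zero, and its 32KB window limits it on long inputs):

    import gzip, os

    def c(data):
        # Compressed length stands in for the (uncomputable) Kolmogorov complexity.
        return len(gzip.compress(data))

    def ncd(x, y):
        # NCD(x, y) = (C(x+y) - min(C(x), C(y))) / max(C(x), C(y))
        cx, cy = c(x), c(y)
        return (c(x + y) - min(cx, cy)) / max(cx, cy)

    a = b"the quick brown fox jumps over the lazy dog " * 20
    print(ncd(a, a))                    # near 0: concatenating adds almost nothing
    print(ncd(a, os.urandom(len(a))))   # near 1: random bytes share no structure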
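And here's a sketch of the side-table idea, where only the bitstream would count toward the distance; everything below is hypothetical illustration, not code from the paper:

    import heapq
    from collections import Counter

    def huffman_codes(freqs):
        # Map each symbol to a bit string, given a {symbol: count} table.
        # Heap entries: (weight, tiebreaker, {symbol: code so far}).
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        if len(heap) == 1:              # degenerate single-symbol input
            return {sym: "0" for sym in freqs}
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)
            f2, i, c2 = heapq.heappop(heap)
            # Merging two subtrees prefixes 0/1 onto their existing codes.
            merged = {s: "0" + c for s, c in c1.items()}
            merged.update({s: "1" + c for s, c in c2.items()})
            heapq.heappush(heap, (f1 + f2, i, merged))
        return heap[0][2]

    def compress(text):
        freqs = Counter(text)
        codes = huffman_codes(freqs)
        bits = "".join(codes[ch] for ch in text)
        # freqs would be written to a side file; the payload is only
        # the bitstream, so the table's size never enters the distance.
        return freqs, bits

    freqs, bits = compress("abracadabra")
    print(len(bits), "bits;", sorted(freqs.items()))

If every letter in a comparison shared one frequency table, C(x), C(y), and C(xy) would all be plain bitstream lengths, which I think is the appeal of keeping the table out of the payload.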