TechEcho
Shoco: a fast compressor for short strings

31 points by multipass almost 10 years ago

6 comments

dalke almost 10 years ago
I have a background project of exploring how to compress SMILES strings, which is a notation for storing chemical information. For example, "C" is methane, "CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The average length of a SMILES string for real-world molecules is about 50 characters.

I previously evaluated a special-purpose tool which identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES strings. I also tried the off-the-shelf femtozip, which got about 60% compression but had more decompression overhead than I like.

Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each), and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".
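
For readers unfamiliar with the n-gram plus dynamic-programming approach mentioned above, here is a minimal sketch of the encoding step only. It is not the tool dalke evaluated: the function name `optimal_parse` and the tiny dictionaries are hypothetical, and the cost model (one unit per emitted token) is an assumption chosen for simplicity. The point it illustrates is that dynamic programming finds the parse with the fewest tokens, where a greedy longest-match scan can miss it.

```python
def optimal_parse(text, ngrams):
    """Split text into the fewest tokens.

    A token is either an n-gram from the dictionary or a single
    literal character (assumed cost model: every token costs 1).
    """
    n = len(text)
    max_len = max((len(g) for g in ngrams), default=1)
    # best[i] is the minimal token count needed to encode text[:i].
    best = [0] + [float("inf")] * n
    # back[i] is the length of the last token in that minimal parse.
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            piece = text[i - length:i]
            if length == 1 or piece in ngrams:
                cost = best[i - length] + 1
                if cost < best[i]:
                    best[i] = cost
                    back[i] = length
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[i - back[i]:i])
        i -= back[i]
    tokens.reverse()
    return tokens


# Greedy left-to-right matching would take "ab" first and need 3 tokens;
# the dynamic program finds the 2-token parse instead.
print(optimal_parse("abcd", {"ab", "bcd"}))

# A SMILES-flavoured example with a made-up dictionary.
print(optimal_parse("C1CCCCC1", {"C1", "CC", "CCC"}))
```

A real encoder would weight tokens by their encoded bit length rather than counting them, but the recurrence stays the same.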
knodi123 almost 10 years ago
Look how well it can compress "fofofofofofofofofofofo".

50%

Look how well it can compress "ababababababababababab".

0%
rurban almost 10 years ago
Will test against smaz for our internal JSON compressed protocol. smaz compressed fine but was too slow. The ability to train the model sounds convincing.
Khao almost 10 years ago
I get negative compression percentage when I put words with "é" in the test box.
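
One way a negative percentage can arise, sketched under stated assumptions: if the reported figure is `1 - compressed_size / original_size`, any encoder that emits more bytes than it reads yields a negative value. The toy encoder below is hypothetical and is not taken from shoco's source; it simply assumes that every non-ASCII byte is preceded by an escape byte. Since "é" occupies two bytes in UTF-8, each such character would grow from two bytes to four under this scheme.

```python
ESCAPE = b"\x00"  # hypothetical escape marker for this toy format


def toy_encode(text: str) -> bytes:
    """Copy ASCII bytes through; prefix every non-ASCII byte with ESCAPE."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        if byte >= 0x80:
            out += ESCAPE
        out.append(byte)
    return bytes(out)


def savings(text: str) -> float:
    """Fraction of bytes saved; negative when the output is larger."""
    original = len(text.encode("utf-8"))
    return 1 - len(toy_encode(text)) / original


print(f"{savings('cafe'):.0%}")  # 4 bytes in, 4 bytes out
print(f"{savings('café'):.0%}")  # 5 bytes in, 7 bytes out
```

Here "café" is 5 bytes of UTF-8 and 7 bytes after escaping, so the savings come out at roughly -40%.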
techwizrd almost 10 years ago
I wonder what'd happen if you used this on base64 strings.
thrownaway2424 almost 10 years ago
I can't tell you how many times I've said to myself "if only these very short ASCII strings were even shorter!"