TechEcho

6 comments

dalke · almost 10 years ago

I have a background project exploring how to compress SMILES strings, a notation for storing chemical information. For example, "C" is methane, "CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The average length of a SMILES string for real-world molecules is about 50 characters.

I previously evaluated a special-purpose tool which identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES strings. I also tried the off-the-shelf femtozip, which got about 60% compression but had more decompression overhead than I'd like.

Shoco, trained on 1,455,763 SMILES strings (an average of 56 letters each) and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".
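To illustrate the n-gram plus dynamic programming idea mentioned above, here is a minimal Python sketch. It is not the tool being described: the function name `min_token_parse`, the example dictionary, and the cost model (every dictionary n-gram or literal character costs one token) are all assumptions made for illustration.

```python
def min_token_parse(text, ngrams):
    """Split text into the fewest tokens.

    A token is either an n-gram from the dictionary or a single literal
    character. Assumed cost model: every token costs the same.
    """
    max_len = max((len(g) for g in ngrams), default=1)
    n = len(text)
    # best[i] = fewest tokens needed to encode text[:i]
    best = [0] + [n + 1] * n
    # back[i] = length of the last token in the best parse of text[:i]
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for length in range(1, min(max_len, i) + 1):
            piece = text[i - length:i]
            # Single characters are always allowed as literals.
            if length == 1 or piece in ngrams:
                if best[i - length] + 1 < best[i]:
                    best[i] = best[i - length] + 1
                    back[i] = length
    # Walk the back-pointers from the end to recover the tokens.
    tokens = []
    i = n
    while i > 0:
        tokens.append(text[i - back[i]:i])
        i -= back[i]
    return tokens[::-1]


# Hypothetical dictionary, not trained on real data.
example_ngrams = {"c1cc", "ccc1", "CC"}
print(min_token_parse("c1ccccc1", example_ngrams))
```

The point of the dynamic programming step is that a greedy longest-match encoder can miss the shortest parse. With the dictionary {"ab", "abc", "cde"} and the input "abcde", greedy matching takes "abc" and then two literals, while the table above finds "ab" followed by "cde".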

knodi123 · almost 10 years ago

Look how well it can compress "fofofofofofofofofofofo": 50%.

Look how well it can compress "ababababababababababab": 0%.

rurban · almost 10 years ago

Will test against smaz for our internal JSON compressed protocol. smaz compressed fine but was too slow. The ability to train the model sounds convincing.

Khao · almost 10 years ago

I get negative compression percentage when I put words with "é" in the test box.
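One plausible reading of the negative percentage, sketched in Python: "é" takes two bytes in UTF-8, and any scheme that has to escape bytes it has no code for will emit more bytes than it reads. The escaping rule below (a 0x00 marker before every non-ASCII byte) is a hypothetical model, not Shoco's actual code, and both function names are made up for this example.

```python
def compression_percent(original: bytes, compressed: bytes) -> float:
    """Percentage saved; negative when the output is larger than the input."""
    return 100.0 * (1 - len(compressed) / len(original))


def escape_non_ascii(data: bytes) -> bytes:
    """Hypothetical scheme: copy ASCII bytes unchanged and put a 0x00
    marker in front of every byte with the high bit set."""
    out = bytearray()
    for b in data:
        if b >= 0x80:
            out.append(0x00)  # assumed escape marker
        out.append(b)
    return bytes(out)


raw = "é".encode("utf-8")          # two bytes: 0xC3 0xA9
packed = escape_non_ascii(raw)     # four bytes under the assumed scheme
print(len(raw), len(packed), compression_percent(raw, packed))
```

Under this model, text that is mostly accented characters grows instead of shrinking, which would show up as a negative compression figure.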

techwizrd · almost 10 years ago

I wonder what'd happen if you used this on base64 strings.

thrownaway2424 · almost 10 years ago

I can't tell you how many times I've said to myself "if only these very short ASCII strings were even shorter!"

Shoco: a fast compressor for short strings
