I have a background project exploring how to compress SMILES strings, a notation for storing chemical information. For example, "C" is methane, "CC" is ethane, "C=C" is ethene, "CCO" is ethyl alcohol, "C1CCCCC1" is cyclohexane, and "c1ccccc1", which contains aromatic carbons, is benzene. The average length of a SMILES string for real-world molecules is about 50 characters.<p>I previously evaluated a special-purpose tool which identifies the best n-grams and uses dynamic programming during encoding. That gets about 70% compression on SMILES strings. I also tried the off-the-shelf femtozip, which got about 60% compression but had more decompression overhead than I'd like.<p>Shoco, trained on 1,455,763 SMILES strings (average of 56 letters each), and tested with 100,000 strings from the training set, reports "average compression ratio: 47%".
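To illustrate what "dynamic programming during encoding" can buy over greedy longest-match, here is a minimal sketch. It is not the tool mentioned above; the function name `optimal_parse`, the toy n-gram set, and the cost model (every dictionary n-gram and every literal character costs one output token) are all assumptions made for the example.

```python
def optimal_parse(text, ngrams):
    """Split text into the fewest tokens, where a token is either a
    dictionary n-gram or a single literal character.

    Assumed cost model: one output token per n-gram or literal.
    """
    max_len = max((len(g) for g in ngrams), default=1)
    n = len(text)
    # best[i] = minimum number of tokens needed to encode text[i:]
    best = [0] * (n + 1)
    # choice[i] = length of the first token in an optimal parse of text[i:]
    choice = [1] * (n + 1)
    for i in range(n - 1, -1, -1):
        # Fallback: emit text[i] as a literal.
        best[i] = 1 + best[i + 1]
        choice[i] = 1
        # Try every dictionary n-gram that starts at position i.
        for length in range(2, min(max_len, n - i) + 1):
            if text[i:i + length] in ngrams and 1 + best[i + length] < best[i]:
                best[i] = 1 + best[i + length]
                choice[i] = length
    # Walk the recorded choices forward to recover the tokens.
    tokens = []
    i = 0
    while i < n:
        tokens.append(text[i:i + choice[i]])
        i += choice[i]
    return tokens


# Hypothetical toy dictionary, not derived from real SMILES statistics.
toy_ngrams = {"CC", "CCC", "CCO"}
print(optimal_parse("CCCCO", toy_ngrams))
```

With this toy dictionary, greedy longest-match would take "CCC" first and then be stuck emitting "C" and "O" as literals (three tokens), while the dynamic-programming parse finds "CC" + "CCO" (two tokens).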
I'll test this against smaz for our internal compressed JSON protocol. smaz compressed fine but was too slow. The ability to train the model sounds convincing.