TechEcho

1 comment

cosmie over 6 years ago

If you're doing anything with text, ftfy[1][2] is an oasis of sanity in a world filled with torment. No data source is safe from having its encoding butchered as it gets tossed through dozens of hops, from databases to applications to data pipelines and eventually to you, all of which make whatever default encoding assumptions are most convenient or backwards compatible for themselves, unintended data mutations be damned.

Once you've zapped as much mojibake[3] from your data as possible, follow it with a pass through csvclean[4] so you can be confident your data is delimited and escaped exactly how you expect, and can be processed and ingested safely. Then, when you need to cram it back into a legacy system that only supports ASCII, unidecode[5] for the win. And every now and then transliterate[6] comes to the rescue for the odd need.

[1] https://ftfy.readthedocs.io/en/latest/
[2] The main fix_text() function is by far the crown jewel. But there are quite a few handy helper functions in the library that don't get wrapped into fix_text() and have to be called independently when desired.
[3] https://en.wikipedia.org/wiki/Mojibake
[4] https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html
[5] https://github.com/iki/unidecode/
[6] https://pypi.org/project/transliterate/
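For anyone who hasn't met mojibake up close, here is a rough stdlib-only sketch of the three steps described above: repair the encoding, write properly escaped CSV, then fold to ASCII. It is not how ftfy, csvclean, or unidecode work internally. The function names are made up for illustration, the repair step only handles the single "UTF-8 read as Latin-1" mistake (ftfy's fix_text() detects and fixes far more), and the ASCII fold simply drops characters it can't decompose, where unidecode would transliterate them.

```python
import csv
import io
import unicodedata


def make_mojibake(text: str) -> str:
    """Simulate the classic bug: UTF-8 bytes decoded as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(text: str) -> str:
    """Reverse that one specific mistake (hypothetical helper, not ftfy).

    If the text does not look like UTF-8-read-as-Latin-1, return it unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def to_csv_row(fields: list) -> str:
    """Write one row with minimal quoting so commas and quotes are escaped."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue()


def to_ascii(text: str) -> str:
    """Crude ASCII fold: split off accents (NFKD), then drop non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


if __name__ == "__main__":
    broken = make_mojibake("café")
    fixed = undo_mojibake(broken)
    print(repr(broken), "->", repr(fixed))
    print(to_csv_row([fixed, "Zoë, Inc."]), end="")
    print(to_ascii(fixed))
```

The try/except in undo_mojibake() is the important design choice: blindly re-encoding text that was never broken either raises or corrupts it, which is exactly the kind of guesswork a dedicated library handles with more care.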

Obscure Python libraries for data science
