Obscure Python libraries for data science

14 points by jhibbets over 6 years ago

1 comment

cosmie over 6 years ago
If you're doing anything with text, ftfy[1][2] is an oasis of sanity in a world filled with torment. No data source is safe from having its encoding butchered as it gets tossed through dozens of hops, from databases to applications to data pipelines and eventually to you. Each hop makes whatever default encoding assumptions are most convenient or backwards compatible for itself, unintended data mutations be damned.

Once you've zapped as much mojibake[3] from your data as possible, follow it with a pass through csvclean[4] so you know your data is delimited and escaped *exactly* how you want/expect it to be and can be processed and ingested with confidence. Then, when you need to cram it back into a legacy system that only supports ASCII, unidecode[5] for the win. And every now and then transliterate[6] comes to the rescue for the odd need.

[1] https://ftfy.readthedocs.io/en/latest/

[2] The main fix_text() function is by far the crown jewel. But there are quite a few handy helper functions in the library that don't get wrapped into fix_text() and have to be called independently when desired.

[3] https://en.wikipedia.org/wiki/Mojibake

[4] https://csvkit.readthedocs.io/en/1.0.3/scripts/csvclean.html

[5] https://github.com/iki/unidecode/

[6] https://pypi.org/project/transliterate/
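
For anyone who hasn't run into mojibake before, here is a rough stdlib-only sketch of the problem and of the two cleanup steps mentioned above. To be clear, this is not how ftfy or unidecode work internally: the function names (make_mojibake, repair_mojibake, ascii_fold) are made up for illustration, the repair only handles the single case of UTF-8 bytes that were wrongly decoded as Windows-1252, and the ASCII step uses unicodedata as a crude stand-in that drops characters it cannot decompose rather than transliterating them.

```python
import unicodedata


def make_mojibake(text: str) -> str:
    """Simulate one common corruption: UTF-8 bytes decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(text: str) -> str:
    """Undo the corruption above; return the input unchanged if it does not apply.

    Hypothetical helper: it covers exactly one wrong-codec round trip,
    whereas a real tool has to detect many different ones.
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text was not UTF-8-read-as-cp1252, so leave it alone.
        return text


def ascii_fold(text: str) -> str:
    """Crude ASCII fallback for legacy systems (assumption: losing data is OK).

    NFKD splits accented letters into base letter + combining mark;
    encoding with errors="ignore" then drops everything non-ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


broken = make_mojibake("naïve café")   # "naÃ¯ve cafÃ©"
fixed = repair_mojibake(broken)        # back to "naïve café"
legacy = ascii_fold(fixed)             # "naive cafe"
```

In real pipelines the wrong codec is rarely known up front and corruption is often layered, which is the guessing work a library like ftfy takes off your hands.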