    After that it removes (or replaces) Unicode punctuation, performs a SHA1 hashing, and uses the first 8 bytes for deduplication comparisons (paragraph level)
Is taking the first few bytes really that much faster than comparing the entire hash? Or is something else going on? That is one of those performance optimizations I would go back and forth on endlessly, wondering if I lost something by trying to shave off a few cycles.
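For concreteness, here is a minimal sketch of how I read that step (the punctuation handling and storing the 8-byte prefix as a 64-bit int are my own assumptions, not necessarily what the pipeline does):

    import hashlib
    import unicodedata

    def dedup_key(paragraph: str) -> int:
        # One plausible reading of "removes (or replaces) Unicode punctuation":
        # drop characters whose Unicode category starts with 'P'.
        cleaned = "".join(
            ch for ch in paragraph
            if not unicodedata.category(ch).startswith("P")
        )
        digest = hashlib.sha1(cleaned.encode("utf-8")).digest()
        # Keep only the first 8 of the 20 digest bytes: the key fits in a
        # single machine word, so set membership is a one-word compare and
        # the dedup table takes far less memory than full digests would.
        return int.from_bytes(digest[:8], "big")

    seen = set()

    def is_duplicate(paragraph: str) -> bool:
        key = dedup_key(paragraph)
        if key in seen:
            return True
        seen.add(key)
        return False

If that reading is right, the cost of truncating is collision risk rather than speed: with 64 bits the birthday bound is on the order of a few billion paragraphs, which web-scale corpora can start to approach.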
Does anyone know of a maintained alternative to fasttext? It is mentioned here for language identification, but clicking through to the GitHub project, it looks to have been archived recently.

I usually use a BERT model for text classification these days, but would like to have a less CPU-heavy alternative like fasttext at hand for high-volume use cases.
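For context, the kind of call I'd want to keep cheap looks roughly like this (a minimal sketch; lid.176.bin is fasttext's published language-ID model, and the path is just whatever local copy you have):

    import fasttext

    # fasttext's off-the-shelf language-ID model covers ~176 languages.
    model = fasttext.load_model("lid.176.bin")

    labels, probs = model.predict("Ceci est une phrase en français.", k=1)
    print(labels[0], probs[0])  # e.g. __label__fr with a high probability

The appeal is that this runs in microseconds per document on a single CPU core, which is hard to match with a transformer.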