TechEcho

Nicely written, thanks for posting! I was reminded of recent LLM wins coming from training data improvements (e.g. FineWeb).

<pre><code> After that it removes (or replaces) Unicode punctuation, performs a SHA1 hashing, and uses the first 8 bytes for deduplication comparisons (paragraph level) </code></pre> Is taking the first few bytes that much faster than comparing the entire hash? Or something else? That is one of those performance optimizations I would go back and forth on endlessly wondering if I lost something by trying to shave off a few cycles.
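For anyone who wants to poke at the trade-off, here is a minimal sketch of the quoted step, not the pipeline's actual code. The function names (`paragraph_key`, `dedup_paragraphs`) and the whitespace collapsing are my own assumptions; only "strip Unicode punctuation, SHA1, keep the first 8 bytes" comes from the quote.

```python
import hashlib
import unicodedata


def paragraph_key(paragraph: str, num_bytes: int = 8) -> bytes:
    """Return a truncated SHA-1 digest used as a deduplication key.

    Hypothetical helper: the normalization details are assumptions.
    """
    # Drop every character whose Unicode category starts with "P" (punctuation).
    stripped = "".join(
        ch for ch in paragraph if not unicodedata.category(ch).startswith("P")
    )
    # Assumption: collapse whitespace so removed punctuation leaves no gaps.
    normalized = " ".join(stripped.split())
    # A full SHA-1 digest is 20 bytes; keep only the first num_bytes.
    return hashlib.sha1(normalized.encode("utf-8")).digest()[:num_bytes]


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, compared by truncated hash."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        key = paragraph_key(paragraph)
        if key not in seen:
            seen.add(key)
            kept.append(paragraph)
    return kept
```

In a sketch like this the truncation does little for comparison speed. What it changes is the size of each stored key (8 bytes instead of 20), which could matter when the `seen` set holds billions of paragraphs. The cost is a higher collision rate: by the birthday bound, n distinct paragraphs collide with probability of roughly n² / 2^65 under a 64-bit key.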

Does anyone know of a maintained alternative to fasttext? It is mentioned here for language identification, but clicking through to the GitHub project, it looks to have been recently archived. I usually use a BERT model for text classification these days, but would like to have an alternative at hand that is less CPU-heavy, like fasttext, for high-volume use cases.

The section on deduplication was very useful, thanks for posting.

This is a great blog btw.

(2023) Still very useful, but it should probably have a date in the title.

Large language model data pipelines and Common Crawl

6 comments
