Large language model data pipelines and Common Crawl

139 points by sonabinu 11 months ago

6 comments

alhaad 11 months ago
Nicely written, thanks for posting!

I was reminded of recent LLM wins coming from training data improvements (e.g. FineWeb).
fbdab103 11 months ago
> After that it removes (or replaces) Unicode punctuation, performs a SHA1 hashing, and uses the first 8 bytes for deduplication comparisons (paragraph level)

Is taking the first few bytes that much faster than comparing the entire hash? Or something else? That is one of those performance optimizations I would go back and forth on endlessly, wondering if I lost something by trying to shave off a few cycles.
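For reference, the quoted step can be sketched roughly as below. This is a minimal illustration, not the pipeline's actual code: the function names (`normalize`, `paragraph_key`, `dedup_paragraphs`) are hypothetical, and the normalization (lowercasing, dropping Unicode punctuation, collapsing whitespace) is an assumption based only on the quote. Note that a full SHA1 digest is 20 bytes, so keeping 8 bytes mostly shrinks what has to be stored per paragraph; any speed difference in the comparison itself is probably small.

```python
import hashlib
import unicodedata


def normalize(paragraph: str) -> str:
    """Lowercase, drop Unicode punctuation, collapse whitespace (assumed normalization)."""
    lowered = paragraph.lower()
    # Unicode general categories starting with "P" are punctuation.
    kept = (ch for ch in lowered if not unicodedata.category(ch).startswith("P"))
    return " ".join("".join(kept).split())


def paragraph_key(paragraph: str) -> bytes:
    """Return the first 8 bytes of the SHA1 digest of the normalized paragraph."""
    digest = hashlib.sha1(normalize(paragraph).encode("utf-8")).digest()
    return digest[:8]


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each paragraph, compared by truncated hash."""
    seen = set()
    unique = []
    for p in paragraphs:
        key = paragraph_key(p)
        if key in seen:
            continue
        seen.add(key)
        unique.append(p)
    return unique
```

With this sketch, `dedup_paragraphs(["Hello, world!", "hello world", "Something else."])` keeps only the first and third paragraphs, since the first two normalize to the same string.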
hobofan 11 months ago
Does anyone know of a maintained alternative to fastText? It is mentioned here for language identification, but clicking through to the GitHub project, it looks to have been recently archived.

I usually use a BERT model for text classification these days, but would like to have an alternative at hand that is less CPU-heavy, like fastText, for high-volume use cases.
msp26 11 months ago
The section on deduplication was very useful, thanks for posting.
barrenko 11 months ago
This is a great blog, btw.
spott 11 months ago
(2023)

Still very useful, but it should probably have a date in the title.