Large language model data pipelines and Common Crawl

139 points by sonabinu, 11 months ago

6 comments

alhaad, 11 months ago
Nicely written, thanks for posting! I was reminded about recent LLM wins coming from training data improvements (e.g. fineweb).
fbdab103, 11 months ago
    After that it removes (or replaces) Unicode punctuation, performs a SHA1 hashing, and uses the first 8 bytes for deduplication comparisons (paragraph level)
Is taking the first few bytes that much faster than comparing the entire hash? Or something else? That is one of those performance optimizations I would go back and forth on endlessly, wondering if I lost something by trying to shave off a few cycles.
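For concreteness, here is a minimal Python sketch of the paragraph-level scheme the quoted passage describes; the function names and exact normalization are illustrative, not the article's. Each paragraph is reduced to the first 8 bytes of its SHA-1 digest, and those 64-bit keys are what get stored and compared, so presumably the win is mostly the memory footprint of the dedup set (8 vs 20 bytes per paragraph at web scale) rather than per-comparison speed.

```python
import hashlib
import unicodedata


def paragraph_key(paragraph: str) -> bytes:
    """Build a compact dedup key: normalize the text, SHA-1 it,
    and keep only the first 8 bytes of the 20-byte digest."""
    # Drop Unicode punctuation (categories starting with "P"), roughly
    # mirroring the normalization step in the quoted passage.
    cleaned = "".join(
        ch for ch in paragraph if not unicodedata.category(ch).startswith("P")
    )
    return hashlib.sha1(cleaned.encode("utf-8")).digest()[:8]


def dedup_paragraphs(paragraphs):
    """Keep the first occurrence of each 8-byte fingerprint."""
    seen = set()
    kept = []
    for p in paragraphs:
        key = paragraph_key(p)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept
```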
hobofan, 11 months ago
Does anyone know of a maintained alternative to fasttext? It is mentioned here for language identification, but clicking through to the GitHub project, it looks to have been archived recently. I usually use a BERT model for text classification these days, but would like to have an alternative that is less CPU-heavy, like fasttext, at hand for high-volume use cases.
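For reference, this is roughly what the fasttext-based language identification mentioned above looks like in practice; it assumes fastText's published lid.176.ftz model has been downloaded locally, and the input text is illustrative.

```python
import fasttext

# Assumes the published language-identification model lid.176.ftz has
# been downloaded from the fastText site into the working directory.
model = fasttext.load_model("lid.176.ftz")

# predict() returns parallel sequences of labels and probabilities;
# k=1 keeps only the top guess. Input must be a single line of text.
labels, probs = model.predict("Is this paragraph written in English?", k=1)
print(labels[0], float(probs[0]))  # e.g. __label__en 0.98
```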
msp26, 11 months ago
The section on deduplication was very useful, thanks for posting.
barrenko, 11 months ago
This is a great blog btw.
spott, 11 months ago
(2023). Still very useful, but it should probably have a date in the title.