Models all the way down

114 points by jdkee, about 1 year ago

8 comments

astariul, about 1 year ago
Very interesting read. The format of the article is weird at first, but we get used to it.

These days we rely more and more on bigger and bigger datasets, and the sheer size / apparent quality of machine-curated datasets makes them an attractive alternative to smaller, human-curated datasets.

This article is eye-opening about the shortcomings of such approaches.
seanmcdirmid, about 1 year ago
Would it be ethical to train a model to identify child porn so it could be excluded from training sets, or even the internet, automatically? It seems like an ideal, very useful application for CV AI, but you might have to train it on CP to make it effective, so…
Rant423, about 1 year ago
My conclusion is that LAION is built using model upon model upon model, so each error propagates and the resulting dataset is kinda shitty.

But "we" made amazing stuff out of LAION.

So: the next person who can curate a high-quality big dataset is gonna make a fortune.
mdrzn, about 1 year ago
"If your full-time, eight-hours-a-day, five-days-a-week job were to look at each image in the dataset for just one second, it would take you 781 years."

So if we take 781 people to do this job, it'd only take a year? With about 1500 workers it would only take 6 months? Seems important enough to do.
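The arithmetic in this comment can be checked with a minimal sketch. It assumes a 52-week working year with no holidays, which the quoted sentence does not state, and all function names are made up for illustration. Under that assumption, 781 person-years at one second per image implies roughly 5.85 billion images, and 1500 reviewers would need a little over six months.

```python
# Back-of-the-envelope check of the "781 years" figure.
# Assumption (not stated in the quote): 52 working weeks per year, no holidays.

SECONDS_PER_IMAGE = 1
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52  # assumed


def work_seconds_per_year() -> int:
    """Seconds one full-time reviewer spends looking at images per year."""
    return HOURS_PER_DAY * 3600 * DAYS_PER_WEEK * WEEKS_PER_YEAR


def implied_images(person_years: float) -> float:
    """Number of images that would take `person_years` of work to view."""
    return person_years * work_seconds_per_year() / SECONDS_PER_IMAGE


def calendar_years(person_years: float, reviewers: int) -> float:
    """Calendar time if the work is split evenly across `reviewers`."""
    return person_years / reviewers


if __name__ == "__main__":
    print(f"{implied_images(781):,.0f} images implied by 781 person-years")
    for team in (781, 1500):
        years = calendar_years(781, team)
        print(f"{team} reviewers: {years:.2f} years ({years * 12:.1f} months)")
```

This covers only a single one-second pass per image; it says nothing about review quality, second opinions, or the cost of the labor.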
croemer, about 1 year ago
I think there's a typo in the section on arbitrary thresholds, specifically in the sentence about 16% being within 0.1 of the cutoff. I have a feeling this should be 0.01, as mentioned later.
GaggiX, about 1 year ago
How does the author know that Midjourney was partially trained on LAION-5B?
fenomas, about 1 year ago
Reading this felt like a huge waste of my time.

TFA picks out 4-5 realistically unavoidable features of AI training, like that a certain quality threshold value was chosen arbitrarily, or that there's more training data for English than for other languages, and then hand-wavingly suggests that each feature could have huge implications about... something. Then it moves on to the next topic without making any argument for why the thing just discussed is important, or what implications it might actually have.
awesomeideas, about 1 year ago
Interesting, but I'm not a fan of ergodic literature when the form of interaction is "scroll forever".