TechEcho

8 comments

astariul, about 1 year ago

Very interesting read. The format of the article is weird at first, but we get used to it. These days we rely more and more on ever-bigger datasets, and the sheer size and apparent quality of machine-curated datasets make them an attractive alternative to smaller, human-curated ones. This article is eye-opening about the shortcomings of such approaches.

seanmcdirmid, about 1 year ago

Would it be ethical to train a model to identify child porn so it could be excluded from training sets, or even the internet, automatically? It seems like an ideal, very useful application for CV AI, but you might have to train it on CP to make it effective, so…

Rant423, about 1 year ago

My conclusion is that LAION is built using model upon model upon model, so each error propagates and the resulting dataset is kinda shitty. But "we" made amazing stuff out of LAION. So: the next person who can curate a high-quality big dataset is gonna make a fortune.

mdrzn, about 1 year ago

"If your full-time, eight-hours-a-day, five-days-a-week job were to look at each image in the dataset for just one second, it would take you 781 years." So if we take 781 people to do this job, it'd only take a year? With about 1500 workers it would only take 6 months? Seems important enough to do.
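
A quick sanity check of that arithmetic, as a rough sketch only. The image count of about 5.85 billion is an assumption (it is not stated anywhere in this thread), as are the 52 working weeks per year with no holidays; the function names are made up for illustration.

```python
# Back-of-the-envelope check of the "781 years" quote and the worker estimates.
# ASSUMPTION: the dataset holds about 5.85 billion images (not stated in the thread).
IMAGES = 5_850_000_000
SECONDS_PER_IMAGE = 1
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52  # ASSUMPTION: no holidays or leave


def person_years(images: int = IMAGES) -> float:
    """Working years one person needs to look at every image once."""
    work_seconds_per_year = HOURS_PER_DAY * 3600 * DAYS_PER_WEEK * WEEKS_PER_YEAR
    return images * SECONDS_PER_IMAGE / work_seconds_per_year


def calendar_years(workers: int, images: int = IMAGES) -> float:
    """Calendar years needed if the work splits evenly across `workers`."""
    return person_years(images) / workers


if __name__ == "__main__":
    print(f"one person: {person_years():.2f} years")
    for workers in (781, 1500):
        print(f"{workers} workers: {calendar_years(workers):.2f} years")
```

Under those assumptions one person needs about 781 working years, 781 people need about one year, and 1500 people need a little over six months. This counts a single one-second pass per image and nothing else, so no review, disagreement handling, or breaks.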

croemer, about 1 year ago

I think there's a typo in the section on arbitrary thresholds. Specifically in the sentence about 16% being within 0.1 of the cutoff. I have a feeling this should be 0.01 as mentioned later.

GaggiX, about 1 year ago

How does the author know that Midjourney was partially trained on LAION-5B?

fenomas, about 1 year ago

Reading this felt like a huge waste of my time. TFA picks out 4-5 realistically unavoidable features of AI training - like that a certain quality threshold value was chosen arbitrarily, or that there's more training data for English than for other languages - and then they hand-wavingly suggest that each feature could have huge implications about... something. And then they move on to the next topic, without making any argument for why the thing they just discussed is important, or what implications it might actually have.

awesomeideas, about 1 year ago

Interesting, but I'm not a fan of ergodic literature when the form of interaction is "scroll forever"

Models all the way down
