The report (Identifying and Eliminating CSAM in Generative ML Training Data and Models)[0] that this guy is very slowly summarizing (and seems to largely agree with, despite the title) was discussed 3 days ago (38 points, 30 comments)[1]<p>[0]: <a href="https://purl.stanford.edu/kh752sm9123" rel="nofollow noreferrer">https://purl.stanford.edu/kh752sm9123</a>
[1]: <a href="https://news.ycombinator.com/item?id=38711135">https://news.ycombinator.com/item?id=38711135</a>
Tangential, but why didn't the OpenAssistant team (led by the author of the video) release the OpenAssistant dataset? As far as I know, the project was shut down, and only an initial, highly filtered version of the data got released. This dataset could be very valuable for the community that created it.
It's honestly pretty sad that the authors of this paper never bothered to contact LAION to remove the links and work together to develop better filters. It's also pretty interesting that one of the authors, David Thiel, calls himself the "AI censorship death star". Yannic is probably right that they aren't particularly interested in bettering open source diffusion models and are more in the walled garden camp.
Then why does IBM spend money producing this one?<p><a href="https://www.youtube.com/watch?v=y9k-U9AuDeM" rel="nofollow noreferrer">https://www.youtube.com/watch?v=y9k-U9AuDeM</a>
Open source advocates: "With enough eyes, all bugs are shallow."<p>These researchers: "I see your project includes a non-zero amount of CSAM."<p>Open source advocates: "How dare you point out an issue? This is a hit piece!"