Stack Overflow has stopped publishing data dumps to the Internet Archive

101 points by JasonPunyon 10 months ago

7 comments

Noble6 10 months ago
The rationale obviously points to Stack Exchange blocking AI from training on their content via archive.org. They go on to demand adherence to “socially responsible” AI training, which requires cash flow between AI companies and the data sources they train on.

First, and most obviously, Stack Exchange does NOT own the forum content. It has been provided for FREE by the larger developer community, and that same community regularly makes use of the AI tools that will be inhibited by this policy change. Second, Stack Exchange is questioning the integrity of archive.org by hiding the data.

Developers are the real victims here, and the audacity of Stack Exchange to demand money for work they DIDN’T do, while continuing NOT to pay their forum contributors, is peak irony.
binarymax 10 months ago
I see where they’re coming from, but they need to sort out the license confusion.

Stack Exchange data really is the world's best open Q&A dataset. Far cleaner and more reliable than anything else.

But LLM trainers are going to use it no matter what. It’s not like they care about copyright or licenses.
JasonPunyon 10 months ago
You may remember a carbon copy of this event from a year ago: https://meta.stackexchange.com/questions/389922/june-2023-data-dump-is-missing

Discussion from then: https://news.ycombinator.com/item?id=36257523
swatcoder 10 months ago
Paraphrased: "Now that OpenAI is paying us for your freely contributed Creative Commons content, we share an interest in constructing their moat by making it harder for others to access both mechanically and legally"
PreInternet01 10 months ago
Well, SO is now (possibly was?) owned[1] by the same group of companies[2] that failed to secure their own TLDs[3] for *purely technical* reasons, so, before nefarious intent, please also consider plain incompetence....

[1] https://techcrunch.com/2021/06/02/stack-overflow-acquired-by-prosus-for-a-reported-1-8-billion
[2] https://www.google.com/search?q=prosus+multichoice
[3] e.g. https://www.icann.org/en/registry-agreements/terminated/multichoice
precommunicator 10 months ago
I wonder if archives downloaded by two different people have different checksums? That would mean they have hidden a paper town (a fake entry/signature) somewhere. I would be surprised if that's not already the case, or won't be soon.
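
A minimal sketch of the comparison described above, assuming two people each have a local copy of the same dump file. The file names in the usage note and the helper names `sha256_of_file` and `same_archive` are hypothetical; nothing here reflects how Stack Overflow actually packages or distributes its dumps.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_archive(path_a: Path, path_b: Path) -> bool:
    """True when both files hash to the same SHA-256 digest,
    i.e. they are byte-for-byte identical for practical purposes."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)
```

Each person could run `sha256_of_file(Path("dump.7z"))` (hypothetical file name) on their own copy and compare the printed digests. Matching digests would indicate identical files. A mismatch alone would not prove per-user watermarking, since the copies might come from different dump dates or a corrupted download.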
luke-stanley 10 months ago
"Stack Overflow is no longer uploading the data dump to archive.org." "We would really rather users do not upload the file to archive.org or similar data pile sites." They have no way to stop people from doing that under the license. Only kind words. Since they've made it deliberately hard for people to train on, I'd be really surprised if people didn't put it on Archive.org and HuggingFace Datasets. So long as it's under the license, it should be fine, right? I am not a lawyer. What they said about access speed issues makes little sense to me; I torrented their dumps before just fine and was very happy to seed them.