With the rise of AI, web crawlers are suddenly controversial

90 points by leephillips, about 1 year ago

11 comments

throwup238, about 1 year ago
> For decades, robots.txt governed the behavior of web crawlers. But as unscrupulous AI companies seek out more and more data, the basic social contract of the web is falling apart.

The basic social contract of the web fell apart long ago when almost everyone decided that Google was the only search engine worth serving and started aggressively blocking other crawlers.
linkjuice4all, about 1 year ago
"With the rise of AI, photos of the exterior of your business are suddenly controversial"

Many revenue-based websites tried to have it both ways with web crawlers: they wanted to block automated access or repeat viewers while letting first-time viewers get a free taste. Others have noted that Google basically gets a free pass for all the traffic it brings in, but everyone else has to respect robots declarations.

It seems like a no-brainer: if your web server is configured to reply to GET requests with a 200 status and some content, then they get to do pretty much whatever they want with it.

Don't want to give access to everyone? Stop sending your content for free and get them to agree to some contract and authorize/license their access to your stuff.
calibas, about 1 year ago
> For decades, robots.txt governed the behavior of web crawlers.

It never governed anything; web crawlers were never under any obligation to follow robots.txt.

This article seems like they took an existing controversy, rebranded it as something new, then blamed it on AI.
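An illustrative aside, not part of the original comment: the sketch below shows why honoring robots.txt is voluntary. A well-behaved crawler has to parse the file and consult it itself before each request, and nothing in the file can stop a client that skips the check. The sample rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical; only Python's standard `urllib.robotparser` module is used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url.

    This is a check the crawler chooses to run on its own side;
    the server does not enforce the result.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/public/page",
                "https://example.com/private/page"):
        verdict = "allowed" if is_allowed(ROBOTS_TXT, "ExampleBot", url) else "disallowed"
        print(f"{url}: {verdict}")
```

A crawler that never calls a check like `is_allowed` will still receive whatever the server sends in response to its requests, which is the sense in which the file advises rather than governs.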
aaronrobinson, about 1 year ago
Drama. Crawlers have always been controversial.
andybak, about 1 year ago
> But as unscrupulous AI companies seek out more and more data

I'm not sure I'm ready to concede the fundamental value judgement being made here. At least, I refuse to accept it as a given rather than the core issue to be decided.
micromacrofoot, about 1 year ago
Many crawlers have always ignored robots.txt. If you're monitoring any moderately visited site, you're bound to see random spikes of bots hammering your server no matter what text file or headers you set.
elpocko, about 1 year ago
robots.txt is relevant and effective, as is my DNT header.
amelius, about 1 year ago
When did robots.txt get a legal status?

Or did it ever?
naiv, about 1 year ago
Proxy companies are a big winner now
lewhoo, about 1 year ago
I don't get it. The crux of it all seems to be that Google isn't competing with owners of data it crawls using the very same data. The crawl part isn't as much of a controversy as the usage, is it? The mentioned eBay v. Bidder's Edge (2000) seems to be a dispute over usage.
mediumsmart, about 1 year ago
The web comes in 2 versions. One of them has a basic social contract. <i>maybe</i>