TechEcho
Ask HN: Why is ChatGPT allowed to scrape other sites via prompts?

28 points by jbryu 12 months ago
The fact that I can give ChatGPT any URL and extract html content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this in the first place for for-profit projects?

11 comments

bicx 12 months ago
Google scrapes like a maniac. And for profit. Many others do the same.
A website can put up a TOS prohibiting such use, but my understanding is that is essentially unenforceable if the site is publicly accessible.
The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against-web-scraping-firm-bright-data-that-sold-millions-of-instagram-records/
If you’re trying to prevent scraping of your data, your best option is to not make it public.
Nextgrid 12 months ago
If you can paste the URL in a browser and copy-paste the text, why is it bad that a third-party agent can do the same? It's no different than a remotely-hosted browser you control via natural language, or asking a *human* assistant to do it and email you the result.
persedes 12 months ago
I've encountered a couple of robots.txt files that specifically block popular LLMs for certain areas. Example:
https://www.sigmaaldrich.com/robots.txt
icedchai 12 months ago
My understanding is scraping public sites is legal. It's no different from a search engine crawling your site.
brianjking 12 months ago
You can opt out.
https://platform.openai.com/docs/gptbot
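For anyone wondering what such an opt-out looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents and the `example.com` URL are made up for illustration, and `is_allowed` is a hypothetical helper, not anything from the linked docs. It only shows how a well-behaved crawler would interpret the rules; nothing forces a client to honor them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block the GPTBot crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/page"  # placeholder URL
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

The same check works against a live site by calling `set_url()` and `read()` on the parser instead of `parse()`.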
tripplyons 12 months ago
Scraping and violating TOS are not illegal to do, but they can get you blocked.
xcasperx 12 months ago
I believe this is current precedent around scraping:
https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
brudgers 12 months ago
Terms of service enforcement is a matter of civil law.
Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.
mensetmanusman 12 months ago
Preventing scraping also entrenches Google for eternity.
rl3 12 months ago
The web agent's system prompt is simply informed that Scarlett Johansson's voice is at the URL it's about to visit.
8note 12 months ago
Why? It's another user agent. curl does the same thing, as do Chrome and Firefox.
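To make the "another user agent" point concrete, here is a rough sketch with Python's standard `urllib.request`. It builds two GET requests without sending them; `build_request`, the `example.com` URL, and both User-Agent strings are placeholders chosen for illustration. From the server's side, the only built-in difference between such clients is the self-reported header.

```python
from urllib.request import Request


def build_request(url: str, user_agent: str) -> Request:
    """Build (but do not send) a GET request that identifies as user_agent."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL and User-Agent strings, for illustration only.
curl_like = build_request("https://example.com/", "curl/8.0.0")
agent_like = build_request("https://example.com/", "ExampleAgent/1.0")

if __name__ == "__main__":
    for req in (curl_like, agent_like):
        # Request stores header names capitalized, hence "User-agent".
        print(req.get_method(), req.full_url, req.get_header("User-agent"))
```

Passing either object to `urllib.request.urlopen()` would perform the actual fetch.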