The fact that I can give ChatGPT any URL and extract HTML content from it feels like a big TOS breach for most sites. Am I misunderstanding something about the legality of scraping? Aren't developers discouraged from scraping like this for for-profit projects in the first place?
Google scrapes like a maniac. And for profit. Many others do the same.<p>A website can put up a TOS prohibiting such use, but my understanding is that it is essentially unenforceable if the site is publicly accessible.<p>The recent Meta v Bright Data case highlights how extreme it can get without being technically illegal. <a href="https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against-web-scraping-firm-bright-data-that-sold-millions-of-instagram-records/" rel="nofollow">https://techcrunch.com/2024/02/26/meta-drops-lawsuit-against...</a><p>If you’re trying to prevent scraping of your data, your best option is to not make it public.
If you can paste the URL in a browser and copy and paste the text, why is it bad that a third-party agent can do the same? It's no different than a remotely-hosted browser you control via natural language, or asking a <i>human</i> assistant to do it and email you the result.
I've encountered a couple of robots.txt files that specifically block popular LLM crawlers from certain areas. Example:<p><a href="https://www.sigmaaldrich.com/robots.txt" rel="nofollow">https://www.sigmaaldrich.com/robots.txt</a>
You can opt out.<p><a href="https://platform.openai.com/docs/gptbot" rel="nofollow">https://platform.openai.com/docs/gptbot</a>
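For anyone wondering what that opt-out looks like in practice, here is a minimal sketch using Python's standard-library robots.txt parser. The `GPTBot` user-agent token is the one OpenAI documents; the robots.txt content, the `example.com` URLs, and the `is_allowed` helper are made up for illustration. Note that robots.txt is only a request that well-behaved crawlers honor, not an access control.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep GPTBot out of /private/, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/private/page", "/public"):
            url = "https://example.com" + path
            print(agent, path, is_allowed(ROBOTS_TXT, agent, url))
```

To check a live site instead, `RobotFileParser` also has `set_url()` and `read()`, which fetch the file over the network before you call `can_fetch()`.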
I believe this is current precedent around scraping:<p><a href="https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn" rel="nofollow">https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn</a>
Terms of service enforcement is a matter of civil law.<p>Your legal wherewithal relative to those who abuse them is what gives your terms of service teeth. Or leaves you toothless.