TechEcho

I read on Twitter that a lot of founders are blocking GPTBot (OpenAI's web crawler) from accessing their websites or startups. Are you? Why?

You can block OpenAI, but you should really be blocking all bots and bad actors who may be scraping your site, scanning for vulns, or the new kid in town: using your site as training data. Robots.txt is not enough, and whilst the major players (Google, Bing etc) honor robots.txt, it can be completely ignored by other actors.
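For anyone who wants to see what the robots.txt route looks like in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT`, the `is_allowed` helper, and the `example.com` URL are illustrative assumptions, not anything from a real site. It only shows how a crawler that honors robots.txt would read the file; as noted above, a crawler that ignores robots.txt is not stopped by any of this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turn away OpenAI's GPTBot, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a robots.txt-compliant crawler may fetch the URL."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/blog/post"))
```

Blocking crawlers that do not identify themselves honestly would need something at the server or CDN level instead, which this sketch does not cover.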

I encourage it. I'd be happy for anything I do to make it into the repertoire of a language model. What is the downside? I'm sharing it on the internet anyway. The people who get worried about crawling are trying to protect dying business models (ads, thankfully) or are upset that someone else found a use for their data and want to retrospectively rent-seek.

I'd block it on all of my sites, except in the limited cases where it's advantageous for me to let them scrape it. So, blog posts, things like that: no. Things like technical documentation, which users of ChatGPT might find useful, and which would benefit me if those users can access it because it's included there: sure.

Training LLMs on data that you don't own the rights to is copyright infringement. Why should I continue to feed a machine that already violated my rights?

No, not intentionally at least. It's very possible they get stopped by Cloudflare, though.


Are you blocking OpenAI access to your website?

5 comments
