You can block OpenAI, but you should really be blocking all bots and bad actors who may be scraping your site, scanning for vulns, or the new kid in town: using your site as training data. Robots.txt is not enough, and whilst the major players (Google, Bing, etc.) honor robots.txt, other actors can ignore it entirely.
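To go beyond robots.txt, here's a minimal sketch of blocking at the application layer by User-Agent, using only Python's standard-library wsgiref. The UA substrings (GPTBot, CCBot, Bytespider) are examples rather than an exhaustive list, and a determined scraper can spoof its User-Agent anyway, so treat this as a first filter, not a guarantee.

```python
# Minimal sketch: refuse requests whose User-Agent matches known crawler tokens.
# Uses only the standard library; the UA list below is illustrative, not complete.
from wsgiref.simple_server import make_server

BLOCKED_UA_SUBSTRINGS = (
    "GPTBot",      # OpenAI's crawler
    "CCBot",       # Common Crawl
    "Bytespider",  # ByteDance
)

def app(environ, start_response):
    ua = environ.get("HTTP_USER_AGENT", "")
    if any(token.lower() in ua.lower() for token in BLOCKED_UA_SUBSTRINGS):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Forbidden\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]

if __name__ == "__main__":
    with make_server("", 8000, app) as httpd:
        httpd.serve_forever()
```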
I encourage it; I'd be happy for anything I do to make it into the repertoire of a language model. What's the downside? I'm sharing it on the internet anyway. The people who get worried about crawling are the ones trying to protect dying business models (ads, thankfully) or who are upset that someone else found a use for their data and want to retroactively rent-seek.
I'd block it on all of my sites, *except* in the limited cases where it's advantageous *for me* to let them scrape it.

So, blog posts, things like that: no.

Things like technical documentation, which users of ChatGPT might find useful, and which would benefit me if those users can reach it through ChatGPT: sure.
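If you want that split, a rough sketch of the corresponding robots.txt, written out from Python. The paths (/docs/) are placeholders for wherever your documentation actually lives, and this only constrains crawlers that honor robots.txt (OpenAI documents that GPTBot does).

```python
# Sketch: ask GPTBot to skip everything except the documentation tree.
# Per RFC 9309, the most specific (longest) matching rule wins, so /docs/
# stays crawlable while the rest of the site is disallowed for GPTBot.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /docs/
Disallow: /
"""

with open("robots.txt", "w") as f:
    f.write(ROBOTS_TXT)
```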
Training LLMs on data that you don't own the rights to is copyright infringement.
Why should I continue to feed a machine that already violated my rights?