I Have Blocked OpenAI

18 points by https443 almost 2 years ago

17 comments

SOLAR_FIELDS almost 2 years ago
This is such a technopurist take. People who use LLMs already know they can give wrong information. Your documentation won’t be able to cover every single possible contextual scenario that an LLM can help with. I think there are valid reasons not to allow OpenAI to spider you, but this one is just really silly and feels pretty egotistical. People aren’t going to go to this guy saying “well, OpenAI said your software works this way and it doesn’t”. It’s an entirely contrived scenario that doesn’t exist in reality.
superkuh almost 2 years ago
For the last two weeks my little webserver has been getting 200+ hits a day from bots with the useragent of anthropic-ai. At first it was what you'd expect, mirroring all the pdfs and such. But the last week it's *just* /robots.txt. 200+ times per day from amazon-ec2, so I have no way of knowing if it's actually anthropic-ai.

I was happy that they'd be including documents on topics I found interesting and things I wrote in the word adjacency training of their foundational model. That'd mean the model would be more useful to me. But the robots.txt stuff is weird. Maybe it's because I've had,

    User-agent: Killer AI
    Disallow: /~superkuh/

in there for the last 10 years? /s
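(For anyone wanting to do the blocking this thread describes: OpenAI documents GPTBot as its crawler's user agent, and the comment above names anthropic-ai. A minimal robots.txt sketch, assuming those user-agent strings match what each crawler actually sends, would be:

    # OpenAI's documented crawler user agent
    User-agent: GPTBot
    Disallow: /

    # The anthropic-ai user agent reported above
    User-agent: anthropic-ai
    Disallow: /

Note that robots.txt is purely advisory; it only works if the crawler chooses to honor it.)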
sigilis almost 2 years ago
You should take down the documentation entirely if you want to prevent incorrect interpretations of things. The LLMs won’t be the ones emailing you; the people who would get things wrong if the LLM provided some kind of confident wrong answer would probably simply not read your documentation, as the vast majority of users do not. You’re just shifting some, but not all, misunderstandings into totally uninformed questions that will mean an additional email pointing them to RTFM.

All of these “we’re not letting bots crawl our site!” posts make me feel like I’ve travelled back in time to when having web spiders crawl your site was a big deal. You can’t really prevent people from using tools wrong, and it is odd that enough people care about this futile attempt to insulate yourself from stupid users that I managed to see it on the front page of HN.

The worst part is, if an LLM has already read in your docs and the interaction you fear your users having with LLMs comes to pass, they will have misapprehensions about the old version of your docs, which will be even more wrong.

Allow me to prepare you for the future now, before you have to hear it from someone else: you will be getting email spam about LLM Algorithm Optimization soon. LLMAO firms are probably already organizing around the bend of time; we’re just a little before they become visible.
ljoshua almost 2 years ago
I agree that LLMs are almost more likely than not to answer documentation questions wrong, to hallucinate methods that don’t exist, or just be silly. But the value I see in allowing LLMs to train on documentation is in the *glue code* that an LLM could (potentially!) generate.

Documentation, even good docs, usually only answers the question “What does this method/class/general idea do?” Really good docs will come with some examples of connecting A and B. But they will often *not* include examples of connecting A to E when you have to transform via P because of business requirements, and almost never tell you how to incorporate third-party libraries X, Y, and Z.

As an engineer, I can read the docs and figure out the bits, but having an LLM suggest some of the intermediary or glue steps, even if wrong sometimes, is a benefit I don’t get from good documentation alone.
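(To make “glue code” concrete, here is a tiny hypothetical sketch in Python; every name in it is invented for illustration and stands in for A, P, and E, not a real API:

    from dataclasses import dataclass
    from typing import Iterable, List

    @dataclass
    class Record:
        customer_id: str
        amount_cents: int

    def transform_p(record: Record) -> Record:
        # Stand-in for business-requirement transform P, e.g. a 5% surcharge.
        return Record(record.customer_id, int(record.amount_cents * 1.05))

    def glue_a_to_e(records_from_a: Iterable[Record]) -> List[Record]:
        # Connect source A to sink E by applying P to each record.
        return [transform_p(r) for r in records_from_a]

The docs describe each system on its own; the A-to-E-via-P wiring is exactly the part an LLM can help draft.)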
Racing0461 almost 2 years ago
Unpopular opinion: LLM responses being wrong is still valuable to me, since it gives me a better jumping-off point for exploring than nothing at all, especially with something like coding, where errors quickly propagate back to you because something doesn't compile or doesn't work as intended. It could be harmful in other areas, though.
input_sh almost 2 years ago
> Despite the volume of documentation, my documentation would still be just a tiny blip in the amount of information in the LLM, and it will still pull in information from elsewhere to answer questions.

I sympathise. I've recently discovered that apparently I have enough Internet clout that ChatGPT knows about me. As in, I can carefully construct a prompt and it will unmistakably reference me in particular. I don't even need to provide my name in the prompt.

Except *every fucking detail of what it "knows" about me is 100% false*, and there's nothing I can do to correct it. I'm from the wrong country, I did things in my career that I absolutely didn't, etc.

Needless to say, I also blocked its crawler.
speedgoose almost 2 years ago
I understand that some people don’t want their work to train AI. Personally, I like that the work I publish is not completely useless, since it is at least used to train LLMs.
RecycledEle almost 2 years ago
We are all myopic in our own ways.

The guy who posted about blocking OpenAI so it will not answer questions about his software wrong (meaning not completely) ignores that his documentation is inaccessible to many less technically literate people. LLM AIs help bridge the gap and get newbies using software before they can understand the manuals.
kristianp almost 2 years ago
IIRC, LLMs also use Common Crawl data for training. Are they also blocking Common Crawl?

Another thing is that ChatGPT-4 can do live retrieval of websites in response to users' questions. That is a different crawler doing that, I imagine. Are they going to block that too?
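(If so: Common Crawl documents CCBot as its crawler's user agent, and OpenAI documents ChatGPT-User for the browsing feature's live fetches. Assuming those strings are still current, blocking both is two more robots.txt entries:

    # Common Crawl's documented crawler user agent
    User-agent: CCBot
    Disallow: /

    # OpenAI's documented user agent for live browsing fetches
    User-agent: ChatGPT-User
    Disallow: /
)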
b800h almost 2 years ago
I bet information about his software is around elsewhere, and now ChatGPT will make up even more. I don't know how this is fixed. Structured queryable data, I guess.
Cantinflas almost 2 years ago
> But here's the problem: it will answer them wrong.

There is no way to know that, and even if it ends up being true, blocking OpenAI will likely make the problem worse; e.g., the AI's answers will be worse without access to the documentation.
pleoxy almost 2 years ago
Adding friction to the use of information about your product seems like a disservice to the users/customers.

Not having that information in the system at all will only degrade the answers, not change who is asking.
dutchbrit almost 2 years ago
Just a thought I have: wouldn't it be better to block all robots and only whitelist a select few? More AI bots are scraping now, and more will be in the future…
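(A deny-by-default robots.txt along those lines is a small sketch; Googlebot here is just a placeholder for whichever crawlers you decide to allow. A crawler uses the group that best matches its user agent, so the specific group wins over the wildcard:

    # Whitelisted crawler: an empty Disallow permits everything
    User-agent: Googlebot
    Disallow:

    # Everyone else is blocked by default
    User-agent: *
    Disallow: /
)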
zmnd almost 2 years ago
Can you give an example of a question someone asked GPT-4 and was misled by? And how was that question better answered by one of your tutorials?
worrycue almost 2 years ago
I wonder if that would even help. If an LLM knows nothing at all about the software, it might just make up complete bullshit anyway.
orbit7 almost 2 years ago
Does OpenAI also scan the Wayback Machine? If it does and you are on there, you may also wish to remove yourself.
gavinhoward almost 2 years ago
Dupe: https://news.ycombinator.com/item?id=37182366