
People are just as bad as my LLMs

203 points by Wilsoniumite, 2 months ago

25 comments

rainsford, 2 months ago
> ...a lot of the safeguards and policy we have to manage humans own unreliability may serve us well in managing the unreliability of AI systems too.

It seems like an incredibly bad outcome if we accept "AI" that's fundamentally flawed in a way similar to, if not worse than, humans and try to work around it, rather than relegating it to unimportant tasks while we work towards the standard of intelligence we'd otherwise expect from a computer.

LLMs certainly appear to be the closest to real AI that we've gotten so far. But I think a lot of that is due to the human bias that language is a sign of intelligence, and our measuring stick is unsuited to evaluating software specifically designed to mimic the human ability to string words together. We now have the unreliability of human language processes without most of the benefits that come from actual human-level intelligence. Managing that unreliability with systems designed for humans bakes in all the downsides without further pursuing the potential upsides of legitimate computer intelligence.
tehsauce, 2 months ago
There has been some good research published on how RLHF (i.e. aligning to human preferences) easily introduces mode collapse and bias into models. For example, with a prompt like "Choose a random number", the base pretrained model can give relatively random answers, but after fine-tuning to produce responses humans like, models become heavily biased towards answering with numbers like "7" or "42".
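A quick way to see that mode collapse for yourself is to sample the same prompt many times and tally the answers. A minimal sketch, where `ask_model` is a hypothetical stand-in for whatever chat API you use (not any specific library call):

```python
from collections import Counter

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to an instruction-tuned model
    at non-zero temperature and return its text reply."""
    raise NotImplementedError("wire this to your chat API of choice")

def number_histogram(samples: int = 200) -> Counter:
    """Tally which numbers the model picks when asked for a 'random' one.

    A base (pretrained) model tends to spread its answers out; an
    RLHF-tuned model typically piles up on a few favourites like 7 or 42.
    """
    counts: Counter = Counter()
    prompt = "Choose a random number between 1 and 100. Reply with the number only."
    for _ in range(samples):
        reply = ask_model(prompt)
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits:
            counts[int(digits)] += 1
    return counts

if __name__ == "__main__":
    for number, count in number_histogram().most_common(10):
        print(f"{number:>3}: {count}")
```

Run against both a base and a tuned model, the tuned histogram usually concentrates most of its mass on a handful of "favourite" numbers.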
lxe, 2 months ago
It's almost as if we trained LLMs on text produced by people.
smallnix, 2 months ago
Is my understanding wrong that LLMs are trained to emulate the human behavior observed in their training data?

From that it follows that LLMs are fit to reproduce all kinds of human biases, like preferring the first choice out of many, or the last out of many (primacy and recency biases). Funnily enough, the LLM might replicate the biases slightly wrong and, by doing so, produce new derived biases.
henlobenlo, 2 months ago
This is the "anyone can be a mathematician" meme. People who hang around elite circles have no idea how dumb the average human is. The average human hallucinates constantly.
bawolff, 2 months ago
So if you give a bunch of people a boring task and pay them the same regardless of whether they treat it seriously, the end result is that they do a bad job!

Hardly a shocker. I think this says more about the experimental design than it does about AI & humans.
markbergz, 2 months ago
For anyone interested in these LLM pairwise sorting problems, check out this paper: https://arxiv.org/abs/2306.17563

The authors discuss the person 1 / doc 1 bias and the need to always evaluate each pair of items twice.

If you want to play around with this method there is a nice Python tool here: https://github.com/vagos/llm-sort
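The core idea (pairwise comparisons with order-swapped queries feeding an ordinary sort) fits in a few lines. A minimal sketch, assuming a hypothetical `judge` helper rather than llm-sort's actual interface:

```python
import functools

def judge(prompt: str) -> str:
    """Hypothetical stand-in: send `prompt` to a chat model, return its short reply."""
    raise NotImplementedError("wire this to your chat API of choice")

PROMPT = "Which comment is better?\n\nA: {first}\n\nB: {second}\n\nAnswer with A or B only."

def votes_for(a: str, b: str) -> int:
    """Count votes for `a` (0-2), asking with both label orders so that a
    'first position' bias hits both items equally."""
    votes = 0
    if judge(PROMPT.format(first=a, second=b)).strip().upper().startswith("A"):
        votes += 1
    if judge(PROMPT.format(first=b, second=a)).strip().upper().startswith("B"):
        votes += 1
    return votes

def llm_compare(a: str, b: str) -> int:
    """Comparator for sorted(): negative puts `a` first, positive puts `b` first."""
    return 1 - votes_for(a, b)  # 2 votes for a -> -1, 1 vote -> 0 (tie), 0 votes -> +1

def rank(comments: list[str]) -> list[str]:
    """Best-first ordering via pairwise LLM comparisons."""
    return sorted(comments, key=functools.cmp_to_key(llm_compare))
```

Each comparison costs two model calls, so ranking n items with a comparison sort is on the order of 2·n·log n calls.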
jayd16, 2 months ago
If the question inherently allows "no preference" as a valid answer, but that isn't an option, then you've left it to the person or LLM to deal with that. If a human is not allowed to specify no preference, why would you expect uniform results when you don't even ask for them? You only asked them to pick the best. Even if they picked perfectly, nothing in the task says ties have to be broken randomly.
velcrovan, 2 months ago
Interleaving a bunch of people's comments and then asking the LLM to sort them out and rank them… seems like a poor method. The whole premise seems silly, actually. I don't think there's any lesson to draw here other than that you need to understand the problem domain in order to get good results from an LLM.
isaacremuant, 2 months ago
So many articles like this on HN have a catchy title and then a short write-up that doesn't really support the title.

The experiment itself is so fundamentally flawed it's hard to know where to begin criticizing it. HN comments as a predictor of good hiring material are just as valid as social media profile artifacts or sleep patterns.

Just because you produce something with statistics (with or without LLMs) and have nice visuals and narratives doesn't mean it's valid or rigorous or "better than nothing" for decision making.

Articles like this keep making it to the top of HN because HN is behaving like Reddit, where the article is read by few and the gist of the title debated by many.
le-mark, 2 months ago
Human-level artificial intelligence has never had much appeal to me; there are enough idiots in the world, so why do we need artificial ones? I.e., what if average machine intelligence mirrored the human IQ distribution?
devit, 2 months ago
The "person one" vs "person two" bias seems trivially solvable by running each pair evaluation twice, once with each possible labelling, and then averaging the scores.

Although of course that behavior may be a sign that the model is mostly guessing randomly rather than actually producing a signal.
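That two-pass averaging is easy to write down. A minimal sketch, assuming a hypothetical `score_pair` call (not any particular library's API) that returns how much better the model thinks "person one" is than "person two" on a -1 to 1 scale:

```python
def score_pair(person_one: str, person_two: str) -> float:
    """Hypothetical stand-in: ask the model how much better person_one is
    than person_two, returning a score in [-1, 1]."""
    raise NotImplementedError("wire this to your chat API of choice")

def debiased_preference(a: str, b: str) -> float:
    """Evaluate the pair once per labelling and combine the runs.

    If the model adds a roughly constant bonus to whoever is labelled
    'person one', subtracting the swapped run cancels that bonus out.
    """
    forward = score_pair(a, b)    # a is "person one"
    backward = score_pair(b, a)   # b is "person one"
    return (forward - backward) / 2.0

def position_bias(a: str, b: str) -> float:
    """The other half of the sum estimates the label bonus itself; a large
    value here is the 'mostly guessing' signal mentioned above."""
    return (score_pair(a, b) + score_pair(b, a)) / 2.0
```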
jopsen, 2 months ago
But an LLM can't be held accountable... neither can most employees, but we often forget that :)
satisfice, 2 months ago
People are alive. They have rights and responsibilities. They can be held accountable. They are not "just as bad" as your LLMs.
andrewmcwatters, 2 months ago
I don't understand the significance of performing tests like these.

To me it's literally the same as testing one Markov chain against another.
megadata, 2 months ago
At least LLMs are very often ready to acknowledge they might be wrong.

It can be incredibly hard to get a person to acknowledge that they might be remotely wrong on a topic they really care about.

Or, for some people, the thought that they might be wrong about anything at all is just like blasphemy to them.
oldherl, 2 months ago
It's just because, in many scientific studies, people tend to put the "original" result in first place and the "improved" result in second place. LLMs and humans learn that and assume the second one is the better one.
K0balt, 2 months ago
I know this is only adjacent to OP's point, but I do find it somewhat ironic that it is easy to find people who are just as unreliable and incompetent at answering questions correctly as a 7B model, but also a lot less knowledgeable.

Also, often less capable of carrying on a decent conversation.

I've noticed a preconscious urge when talking to people to judge them against various models and quants, or to decide they are truly SOTA.

I need to touch grass a bit more, I think.
soared, 2 months ago
Wouldn't the same outcome be achieved much more simply by giving LLMs two choices (colors, numbers, whatever), asking them to "pick one", and assessing the results in the same way?
vivzkestrel, 2 months ago
Should have started naming them from person 4579 and seen whether it still exhibits the bias.
djaouen, 2 months ago
Yes, but a *consensus* of people beats LLMs every time. For now, at least.
bxguff, 2 months ago
Kind of an odd metric to try to base this process on. Are more comments inherently better? Is it responding to buzzwords? It makes sense given the talk about hiring algos / resume scanners in part one, and if anything this elucidates some of the trouble with them.
th0ma5, 2 months ago
No, they are not randomly wrong or right without perspective, unless they have some kind of brain injury. So that's against the title, but the rest of their point is interesting!
raincole, 2 months ago
What a clickbait title.

TL;DR: the author found a very, *very* specific bias that is prevalent in both humans and LLMs. That is it.
mdp2021, 2 months ago
Very nice article. But the title, and the idea, is the very frequent "racist" form of the more accurate "People [can be] just as bad as my LLMs".

Now: some people can't count. Some people hum between words. Some people set fire to national monuments. Reply: "Yes, we knew", and "No, it's not necessary".

And: if people could lift tons, we would not have invented cranes.

Very, very often in these pages I meet people repeating "how bad people are". That should be "how bad people can be", and "we would have guessed these pages are especially visited by engineers, who must already be aware of the importance of technical boosts" - so, besides the point that the median does not represent the whole set, there is the other point that tools are not measured by whether they reach mediocre results.