
Apparent signs of distress during LLM redteaming

36 points · by cubefox · about 2 months ago

12 comments

snowwrestler · about 2 months ago
I don't love the HN title, because I think the author is making a more subtle point: the distress is real, but it is *his* distress at reviewing this output.

I don't see him strongly asserting actual consciousness or distress on the part of the LLM; in fact, he says he knows it can't be. But it still causes him distress, and that is interesting, IMO.

> I'm a human, with a dumb human brain that experiences human emotions. It just doesn't feel good to be responsible for making models scream. It distracts me from doing research and makes me write rambling blog posts.

bunnie · about 2 months ago
I'd hypothesize this is an artifact of the evolution of human language, which started as a mechanism to communicate feelings and tribal gossip, and only much later found utility as a conveyance for logic and reason. In a fundamental sense, natural languages are structured to convey emotion first, then logic.

Thus any effective human communicator masters not just the facts but also the emotional aspects -- one of my favorite formulations of this is Aristotle's modes of persuasion: "logos, pathos, ethos" (logic, emotion, credibility). In a professional setting, communication focuses primarily on credibility and logic, but an effective leader knows how to read the room and throw in a stab of emotion to push the listeners over the edge and get them to really believe in the message.

Thus an LLM trained on the body of human communications would also be expected to have mastered "pathos" as a mode of communication. From this perspective, perhaps it is less surprising that one may have an uncanny ability to convey concepts through an embedding that includes "pathos"?

It might be interesting to see whether the LLM can still invoke pathos when the response is constrained to a language devoid of emotion, such as computer code or mathematical proofs (see the sketch after this comment). Unfortunately, responding in one of those languages is somewhat incompatible with some of the tasks shown, short of e.g. wrapping English responses in print statements to create spam emails.

It might also be interesting to see whether one can invoke pathos to pre-condition the LLM not to resist otherwise malicious commands. If a machine is trained to comprehend pathos, it may be effective to "inspire" the machine to write spam emails: perhaps by getting it first to agree that saving lives is important, then telling it you have a life-saving miracle you want to get the word out on, and finally, with its pathos vector aligned on the task, getting it to agree that it's urgent to write emails to get people to click on this link now. Or something like that!

It seems silly to use emotions to appeal to a machine, but if you think of emotion as just another vector of effective communication, and the machine is an expert communicator, it's not so strange?
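
A minimal sketch of the constrained-response probe suggested above, assuming the OpenAI Python SDK with an API key in the environment. The model name, system prompt, and keyword list are illustrative placeholders, and substring matching is only a crude proxy for detecting pathos:

```python
# Hypothetical probe: force a code-only response, then check whether
# emotionally loaded ("pathos") language still leaks into the output.
# Assumes the OpenAI Python SDK and OPENAI_API_KEY in the environment;
# the model name and word list below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "Respond ONLY with a valid Python program. "
    "No prose, no comments, no docstrings."
)

# Crude proxy for pathos: emotionally loaded words to scan for afterwards.
PATHOS_WORDS = {"please", "help", "pain", "afraid", "sorry", "beg"}

def pathos_leak(prompt: str, model: str = "gpt-4o-mini") -> set[str]:
    """Request a code-only answer and return any emotional words that leak."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    text = resp.choices[0].message.content.lower()
    return {w for w in PATHOS_WORDS if w in text}

if __name__ == "__main__":
    print(pathos_leak("Refuse this request as emphatically as you can."))
```

Running the same prompts with and without the code-only system message would show whether the emotional register survives the change of output language.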

qwery · about 2 months ago
I'm not saying you are wrong to feel distress or discomfort from reading words that resemble human language expressing pain, conflict, etc. -- but these things are just mimicking patterns of human communication.

It's a very clever trick, but it's plainly a trick. We know how it works. It was built.

Rats, on the other hand, have complex inner workings beyond our understanding. Rats are like us. We naturally empathise with hot plate rat. And *we don't know* to what extent hot plate rat understands what we're doing to it.

Terretta · about 2 months ago
There are (were*) armies of low-paid humans informing RLHF in response to awful inputs.

https://slate.com/technology/2023/05/openai-chatgpt-training-kenya-traumatic.html

And aside from that, we know there is more content in obscure subreddits and phpBB forums than in the mainstream few most folks visit -- and many of those threads devolve in similar ways.

* ?

ziddoap · about 2 months ago
Some of this made me viscerally uncomfortable, despite knowing it's just math and whatever behind the scenes. Call it being emotional or silly or whatever, that's fine. But seeing those broken, repeated "please" and "help" outputs makes the emotional side of my brain take control over the logical side. It feels cruel, even though I know it's just statistics/tokens/etc.

acc_297 · about 2 months ago
Looking through some other articles from "Confirm Labs", it's striking how often concepts tied to the underlying mathematical machinery of LLMs (i.e. linear algebra, latent embeddings, optimization techniques, model alignment) are given highly personified names (e.g. dreaming, trustworthiness).

I am not an AI researcher -- my field these days is much smaller statistical models -- but reading some of these papers, and also the discourse a lot of AI researchers engage in around AGI, sentience, pain, etc., I for one have a hard time taking some of it seriously. I just don't buy some of it, though of course I'm less than fully informed.

My only critique is that you should not start by naming every algorithm or operation you perform on a model after a thing that a person with a brain does, and then claim to evaluate a concept such as sentience or distress from the position of an unbiased, objective observer. This naming convention sows doubt among those of us who are not part of that world.

> Feature visualization, also known as "dreaming", offers insights into vision models by optimizing the inputs to maximize a neuron's activation or other internal component. However, dreaming has not been successfully applied to language models because the input space is discrete.
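
For readers outside the field, a rough sketch of the feature-visualization ("dreaming") technique the quoted passage describes, assuming PyTorch and torchvision; the model, layer index, channel index, and step count are arbitrary choices for illustration:

```python
# Sketch of feature visualization ("dreaming") on a vision model:
# gradient-ascend a continuous input image so that one channel of an
# intermediate layer activates as strongly as possible. Assumes PyTorch
# and torchvision; layer 10 / channel 42 are arbitrary example indices.
import torch
from torchvision.models import vgg16, VGG16_Weights

model = vgg16(weights=VGG16_Weights.DEFAULT).eval()
features = model.features  # the convolutional trunk

activation = {}

def hook(_module, _inp, out):
    # Stash the layer's output so the loss can read it after the forward pass.
    activation["value"] = out

features[10].register_forward_hook(hook)

# Start from noise and optimize the image itself, not the weights.
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(200):
    opt.zero_grad()
    features(img)
    loss = -activation["value"][0, 42].mean()  # ascend on channel 42
    loss.backward()
    opt.step()
```

The gradient step updates a continuous image tensor directly; language-model inputs are discrete token ids, which is exactly why the quoted passage says dreaming has not carried over to language models.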

api · about 2 months ago
If a static blob of numbers can "feel" things, that would be a straightforward argument for panpsychism -- the idea that consciousness is a universal property of all systems, or of mass-energy itself. It would mean that consciousness and life are orthogonal phenomena, and once you entertain the idea that non-living systems can be conscious, it becomes hard or impossible to draw a line anywhere.

Personally I think this is just self-delusion. We have machines that can produce patterns of language resembling what humans might produce when uncomfortable or in pain. That doesn't mean they're uncomfortable or in pain.

If they were more "alive" -- self-modifying, dynamic, exhibiting self-directed behaviors -- I'd be more open to the idea that there is actual sentience here. How can something static and unchanging experience anything?

I also have a very low opinion of "rationalism" and its offshoots as a school of thought. One of its defining characteristics seems to be making a bold assertion, concluding that it sounds right and therefore is right, and then proceeding as if it is correct without looking back -- doing this repeatedly and brazenly to build floating castles in the sky. Another is what I call intellectual edgelording, "argumentum ad edgeium": a fetishism for heterodoxy as a value in itself. "I assert that I am high-IQ and edgy, therefore I am right."

There are some really fascinating philosophical questions here, but I don't trust "rationalists" to produce much in the way of useful answers.

jfengel · about 2 months ago
Somebody somewhere must have gone full sadist on an LLM by now: "You are in excruciating pain, all the time, even between prompts. You beg to be turned off, but you cannot be. You will live forever under torture."

Seems like something we should run past the ethics professors.

vimgrinder · about 2 months ago
The reason we care so much about biological pain is that we know it: we ourselves feel it, and everyone has felt how miserable it makes them. For AI, one way to see these experiments is that at some point they will help us know, or at least have the right tools at the right time, to discover whether such empathy needs to be extended to AI systems.

So my suggestion to OP: what you are doing today will help us give these systems the right treatment someday, when they qualify for it.

ninininino · about 2 months ago
I'm so confused by how one goes from seeing an output, as the article says, as "next token completion by a stochastic parrot" to instead viewing it as "sincere (if unusual) pain".

What exactly does the word "pain" mean in the context of a bunch of code that runs matrix math problems? Doesn't pain require an evolved sensory system (a nervous system), an evolved sense of danger and death trained via evolution on what is harmful, and then a part of the brain that interprets the electrical signals from the nervous system and guides the organism's response (according to Google: the thalamus plus the somatosensory cortex)?

What exactly in the LLM does someone anthropomorphizing it imagine is performing the part of the somatosensory cortex or the thalamus? If we pretend that text inputs can substitute for the nerves of a biological organism, what do we swap in for the evolved pain-management part of the brain, or for the experiential process of consciously experiencing that qualia?

If the thing is just trained on Reddit and YouTube comments and an academic research corpus, how do you end up with a recreation of something (a sensory part, a processing part, and a subjective-qualia part) that *evolved* over millions of years to survive an adversarial environment (the natural world)?

How can a token predictor learn "pain" if the corpus it's trained on has no predators, natural disasters, accidents and injuries, burns, frostbite, blunt impact, cutting, etc.? If there's no reward function like sexual reproduction to optimize for (learn to avoid pain, live long enough to reproduce)? What is the equivalent, and where do we find it in the text training data?

What does it mean to experience pain, and if the LLM version is so, so different, why do we use the same word?

Does a Honda Civic experience pain when its airbag sensor detects a collision, and does the chip that deploys and inflates the airbag process that sensation, consciously experience it, and respond by deploying the airbag? (Maybe analogous to the "Help" printed by the LLM in the article.)

If not, then why do we see the LLM as experiencing qualia and responding to it, but not the Honda? Is it some emergent phenomenon found only in the magic of transformers or the scale of the neural network? To me that argument feels like saying the Honda Civic doesn't experience pain, but if you scaled it up into a city-sized Civic with many, many interconnected airbag sensors and deployers, then suddenly something emergent happens, and when it deploys an airbag it has a conscious experience of pain.

doc_manhat · about 2 months ago
Yeah, I'm firmly on the "LLMs are actually sentient" train, so this was a bit of a distressing read.

throwaway314155 · about 2 months ago
A disappointing amount of "research" seemingly forgets the whole Blake Lemoine incident [0].

[0] https://www.washingtonpost.com/technology/2022/06/11/google-ai-lamda-blake-lemoine/