
Some Remarks on Large Language Models

182 points by backpropaganda over 2 years ago

17 comments

anyonecancode over 2 years ago
Interesting post. I find myself moving away from the sort of "compare/contrast with humans" mode and more toward a "let's figure out exactly what this machine _is_" way of thinking.

If we look back at the history of mechanical machines, we see a lot of the same kinds of debates happening there that we do around AI today -- comparing them to the abilities of humans or animals, arguing that "sure, this machine can do X, but humans can do Y better..." But over time, we've generally stopped doing that as we've gotten used to mechanical machines. I don't know that I've ever heard anyone compare a wheel to a leg, for instance, even though both "do" the same thing, because at this point we take wheels for granted. Wheels are much more efficient at transporting objects across a surface in some circumstances, but no one's going around saying "yeah, but they will never be able to climb stairs as well" because, well, at this point we recognize that's not an actual argument we need to have. We know what wheels do and don't do.

These AI machines are a fairly novel type of machine, so we don't yet really understand which arguments make sense to have and which ones are unnecessary. But I like these posts that get more into exactly what an LLM _is_, as I find them helpful in understanding better exactly what kind of machine an LLM is. They're not "intelligent" any more than any other machine is (and historically, people have sometimes ascribed intelligence, even sentience, to simple mechanical machines), but that's not so important. Exactly what we'll end up doing with these machines will be very interesting.

macleginn over 2 years ago
I cannot understand where the boundary lies between some of the "common-yet-boring arguments" and the "real limitations". E.g., the ideas that "You cannot learn anything meaningful based only on form" and "It only connects pieces it's seen before according to some statistics" are "boring", but the fact that models have no knowledge of knowledge, or knowledge of time, or any understanding of how texts relate to each other is "real". These are essentially the same things! This is what people may mean when they proffer their "boring critiques", if you press them hard enough. Of course Yoav, being abreast of the field, knows all the details and can talk about the problem in more concrete terms, but "vague" and "boring" are still different things.

I also cannot fathom how models can develop a sense of time, or structured knowledge of the world consisting of discrete objects, even with a large dose of RLHF, if the internal representations are continuous, layer-normalised, and otherwise incapable of arriving at any hard-ish, logic-like rules. All these models seem to have deep-seated architectural limitations, and they are almost at the limit of the available training data. Being non-vague and positive-minded about this doesn't solve the issue. The models can write polite emails and funny reviews of Persian rugs in haiku, but they are deeply unreasonable and 100% unreliable. There is hardly a solid business or social case for this stuff.

brooksbp over 2 years ago
Sometimes I read text like this and really enjoy the deep insights and arguments once I filter out the emotion, attitude, or tone. And I wonder whether the core of what they're trying to communicate would be better or more efficiently received if the text were more neutral or positive. E.g., you can be 'bearish' on something and point out 'limitations', or you can say 'this is where I think we are' and 'this is how I think we can improve', but your insights and arguments about the thing can be more or less the same in either form of delivery.

dekhn over 2 years ago
I would like to see the section on "common-yet-boring" arguments cleaned up a bit. There is a whole category of "researchers" who just spend their time criticizing LLMs with common-yet-boring arguments (Emily Bender is the best example), such as "they cost a lot to train" (uhhh, have you seen how much enterprise spends on cloud for non-LLM stuff? Or seen the power consumption of an aluminum smelting plant? Or calculated the cost of all the airplanes flying around taking tourists on vacation?)

By improving this section I think we can have a standard go-to doc to refute the common-but-boring arguments. By pre-anticipating what they say (and yes, Bender is *very* predictable... you could almost make a chatbot that predicts her), it greatly weakens their argument.

phillipcarter over 2 years ago
While not unsolvable, I think the author is understating this problem a lot:

> Also, let's put things in perspective: yes, it is environmentally costly, but we aren't training that many of them, and the total cost is minuscule compared to all the other energy consumption we humans do.

Part of the reason LLMs aren't that big in the grand scheme of things is that they haven't been good enough and businesses haven't started to really adopt them. That will change, but the costs will be high because they're also extremely expensive to run. I think the author is focusing on the training costs for now, but those will likely get dwarfed by operational costs. What then? Waving one's arms and saying it'll just "get cheaper over time" isn't an acceptable answer, because it's hard work and we don't really know how cheap we can get right now. It must be a focus if we actually care about widespread adoption and environmental impact.
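
For a sense of scale on that training-vs-serving point, here is a minimal back-of-envelope sketch in Python. Every number in it is a made-up assumption for illustration only, not a measurement of any real model or provider; the point is just that a fixed training bill is eventually overtaken by a recurring serving bill at high traffic.

```python
# Back-of-envelope sketch: one-time training cost vs. recurring inference cost.
# All figures below are illustrative assumptions, not real measurements.
TRAINING_COST_USD = 5_000_000      # assumed one-time training spend
COST_PER_1K_QUERIES_USD = 0.50     # assumed serving cost per 1,000 queries
QUERIES_PER_DAY = 50_000_000       # assumed sustained production traffic

daily_inference_cost = QUERIES_PER_DAY / 1_000 * COST_PER_1K_QUERIES_USD
days_to_match_training = TRAINING_COST_USD / daily_inference_cost

print(f"Daily inference cost: ${daily_inference_cost:,.0f}")
print(f"Days of serving to equal the training bill: {days_to_match_training:,.0f}")
# With these assumptions the serving bill matches the training bill after 200
# days and keeps accruing afterwards, which is the comment's point about
# operational costs eventually dwarfing training costs.
```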

axpy906 over 2 years ago
I asked ChatGPT if the author was still relevant. Apparently so.

> Yoav Goldberg is a computer science professor and researcher in the field of natural language processing (NLP). He is currently a professor at Bar-Ilan University in Israel and a senior researcher at the Allen Institute for Artificial Intelligence (AI2).

> Professor Goldberg has made significant contributions to the NLP field, particularly in the areas of syntactic parsing, word embeddings, and multi-task learning. He has published numerous papers in top-tier conferences and journals, and his work has been widely cited by other researchers.

seydor over 2 years ago
> Another way to say it is that the model is "not grounded". The symbols the model operates on are just symbols, and while they can stand in relation to one another, they do not "ground" to any real-world item.

This is what math is: abstract syntactic rules. GPTs, however, seem to struggle in particular at counting, probably because their structure has no notion of order. I wonder if future LLMs built for math will basically solve all of math (whether they will be able to find a proof for any statement that is provable).

Grounding LLMs to images will be super interesting to see, though, because images have order, and so much of abstract thinking is spatial/geometric at its base. Perhaps those will be the first true AIs.
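
As a concrete illustration of the "just symbols" point (and one commonly cited contributing factor in why character-counting questions trip these models up), the sketch below shows what a model actually receives: opaque integer token ids with no built-in link to characters or real-world referents. It assumes the tiktoken package is installed; the exact token split depends on the encoding and is shown only for illustration.

```python
# What the model sees: integer token ids, not characters or referents.
# Assumes `pip install tiktoken`; the specific split is encoding-dependent.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "strawberry"
ids = enc.encode(text)

print("text:        ", text)
print("token ids:   ", ids)
print("token pieces:", [enc.decode_single_token_bytes(i) for i in ids])
# Whatever the model "knows" about strawberries has to come from statistical
# relations among ids like these, not from any grounding in the fruit itself,
# and character-level facts (letter counts, ordering) are not directly visible.
```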

dullcrisp over 2 years ago
I love trying to teach things to ChatGPT. It's like if a toddler got a press agent.

*I apologize for the confusion caused by my previous response. You are correct that the star-shaped block will not fit into the square hole. That is because the edges of the star shape will obstruct the block from fitting into the square hole. The star-shaped block fits into the round hole.*

*Block-and-hole puzzles were developed in the early 20th century as children's teaching time. They're a common fixture in play rooms and doctors' offices throughout the world. The star shape was invented in 1973.*

*Please let me know if there's anything else I can assist you with.*

alibero over 2 years ago
> Finally, RLHF, or "RL with Human Feedback". This is a fancy way of saying that the model now observes two humans in a conversation, one playing the role of a user, and another playing the role of "the AI", demonstrating how the AI should respond in different situations. This clearly helps the model learn how dialogs work, and how to keep track of information across dialog states (something that is very hard to learn from just "found" data). And the instructions to the humans are also the source of all the "It is not appropriate to..." and other formulaic / templatic responses we observe from the model. It is a way to train to "behave nicely" by demonstration.

I think this misses a big component of RLHF (the reinforcement learning). The approach described above is "just" supervised learning on human demonstrations. RLHF uses a reinforcement learning objective to train the model rather than maximizing the likelihood of human demonstrations. In fact, you can then take the utterances your model has generated, collect human feedback on those to improve your reward model, and then train a new (hopefully better) model -- you no longer need a human roleplaying as an AI. This changed objective addresses some of the alignment issues that LMs struggle with. OpenAI does a pretty good job of summarizing the motivation in https://arxiv.org/abs/2009.01325:

> While [supervised learning] has led to markedly improved performance, there is still a misalignment between this fine-tuning objective -- maximizing the likelihood of human-written text -- and what we care about -- generating high-quality outputs as determined by humans. This misalignment has several causes: the maximum likelihood objective has no distinction between important errors (e.g. making up facts) and unimportant errors (e.g. selecting the precise word from a set of synonyms); models are incentivized to place probability mass on all human demonstrations, including those that are low-quality; and distributional shift during sampling can degrade performance. Optimizing for quality may be a principled approach to overcoming these problems.

where RLHF is one approach to "optimizing for quality".
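
To make that distinction concrete, here is a deliberately tiny, self-contained sketch of the reinforcement-learning step: a toy "policy" over four canned responses, a hand-written reward model standing in for one learned from human preferences, and a REINFORCE-style update instead of PPO. It is a schematic of the objective the comment describes, not how any production RLHF system is implemented.

```python
# Toy RLHF-style loop: sample from the policy, score with a reward model,
# and push probability toward high-reward outputs (REINFORCE update).
# Everything here is schematic; real systems use large neural policies,
# learned reward models trained on preference comparisons, and PPO.
import math
import random

RESPONSES = ["helpful answer", "curt answer", "made-up fact", "refusal"]

def reward_model(response: str) -> float:
    # Stand-in for a model trained on human preference data.
    return {"helpful answer": 1.0, "curt answer": 0.2,
            "made-up fact": -1.0, "refusal": 0.0}[response]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

logits = [0.0] * len(RESPONSES)   # the "policy parameters"
lr = 0.5

for _ in range(500):
    probs = softmax(logits)
    idx = random.choices(range(len(RESPONSES)), weights=probs)[0]  # sample; no human demo needed
    r = reward_model(RESPONSES[idx])                               # feedback on the model's own output
    for j in range(len(logits)):
        # d log p(idx) / d logit_j for a softmax policy:
        grad = (1.0 if j == idx else 0.0) - probs[j]
        logits[j] += lr * r * grad                                 # ascend expected reward

print({resp: round(p, 3) for resp, p in zip(RESPONSES, softmax(logits))})
# Probability mass shifts toward the high-reward response: the policy is
# optimized for (proxy) human-judged quality, not for the likelihood of a
# fixed set of human demonstrations.
```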

eternalban over 2 years ago
GPT-3 is limited, but it has delivered a jolt that demands a general reconsideration of machine vs. human intelligence. Has it made you change your mind about anything?

At this point, for me, the notion of machine "intelligence" is a more reasonable proposition. However, this shift is the result of reconsidering the binary proposition of "dumb, or intelligent like humans".

First, I propose a possible discriminant between "intelligence" and "computation": whether an algorithm could brute-force compute a response, given the input corpus of the 'AI' under consideration, in cases where the machine has provided a reasonable response.

It also seems reasonable to begin to differentiate 'kinds' of intelligence. On this very planet there are a variety of creatures that exhibit some form of intelligence, and they seem to be distinct kinds. Social insects are arguably intelligent. Crows are discussed frequently on Hacker News. Fluffy is not entirely dumb either. But are these all the same 'kind' of intelligence?

Putting my cards on the table: at this point it seems eminently possible that we will create some form of *mechanical insectoid intelligence*. I do not believe insects have any need for 'meaning' -- form will do. That distinction also takes the sticky 'what is consciousness?' question out of the equation.

lalaithion over 2 years ago
> In particular, if the model is trained on multiple news stories about the same event, it has no way of knowing that these texts all describe the same thing, and it cannot differentiate it from several texts describing similar but unrelated events

And... the claim is that humans can do this? Is it just the boring "this AI can only receive information via tokens, whereas humans get it via higher-resolution senses of various types, and somehow that is what causes the ability to figure out that two things are actually the same thing" argument again?

aunch over 2 years ago
Great focus on the core model itself! I think a complementary aspect of making LLMs "useful" from a productionization perspective is all of the engineering around the model itself. This blog post did a pretty good job of highlighting those complementary points: https://lspace.swyx.io/p/what-building-copilot-for-x-really

ilaksh over 2 years ago
He seems to have missed the biggest difference, which is the lack of visual information.

zzzeek over 2 years ago
"The models are biased, don't cite their sources, and we have no idea if there may be very negative effects on society from machines that very confidently spew truth/garbage mixtures which are very difficult to fact-check."

Dumb, boring critiques. So what? So boring! We'll "be careful", OK? So just shut up!

stevenhuang over 2 years ago
I found the "grounding" explanation provided by human feedback very insightful:

> Why is this significant? At the core the model is still doing language modeling, right? Learning to predict the next word, based on text alone? Sure, but here the human annotators inject some level of grounding into the text. Some symbols ("summarize", "translate", "formal") are used in a consistent way together with the concept/task they denote. And they always appear at the beginning of the text. This makes these symbols (or the "instructions") in some loose sense external to the rest of the data, making the act of producing a summary grounded in the human concept of "summary". Or in other words, this helps the model learn the communicative intent of a user who asks for a "summary" in their "instruction". An objection here would be that such cases likely already occur naturally in large text collections, and the model has already learned from them, so what is new here? I argue that it might be much easier to learn from direct instructions like these than it is to learn from non-instruction data (think of a direct statement like "this is a dog" vs. needing to infer from overhearing people talk about dogs). And that by shifting the distribution of the training data towards these annotated cases, we substantially alter how the model acts and the amount of "grounding" it has. And that maybe with explicit instruction data, we can use much less training text compared to what was needed without it. (I promised you hand-waving, didn't I?)
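
For readers who want a concrete picture of what "instruction data" looks like, below is a minimal illustrative sketch. The field names and template are hypothetical, not the schema of any particular dataset; the point is only that the task symbol ("Summarize...", "Translate...") sits in a fixed, leading position, paired with the demonstrated behavior.

```python
# Illustrative instruction-tuning records (hypothetical schema and template).
# The task symbol always opens the text, which is what the quoted passage
# argues gives it a loose kind of grounding in the user's communicative intent.
instruction_examples = [
    {"instruction": "Summarize the following article.",
     "input": "The city council voted on Tuesday to ...",
     "output": "The council approved the new transit budget."},
    {"instruction": "Translate to French.",
     "input": "Where is the train station?",
     "output": "Où est la gare ?"},
]

def to_training_text(example: dict) -> str:
    # Flatten a record into the kind of text the model is actually trained on.
    return (f"{example['instruction']}\n\n"
            f"{example['input']}\n\n"
            f"### Response:\n{example['output']}")

for example in instruction_examples:
    print(to_training_text(example))
    print("---")
```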

CarbonCycles over 2 years ago
Unfortunate: this was an opportunity to further enlighten others, but the author took a dismissive and antagonistic perspective.

light_hue_1 over 2 years ago
The dismissal of biases and stereotypes is exactly why AI research needs more people who are part of a minority. Yoav can dismiss this because it just doesn't affect him much.

It's easy to say "oh well, humans are biased too" when the biases of these machines don't misgender you, don't mistranslate text that relates to you, don't carry negative affect toward you, aren't more likely to produce violent stories about you, don't perform worse on tasks related to you, and so on.