GPT-4 can't reason

230 points by BruceEel almost 2 years ago

43 comments

owenversteeg almost 2 years ago
There are serious problems with this paper, namely that I just tried to reproduce it and it failed every test: I tried several of the problems the paper claims GPT-4 fails on, and it passed every one, every time.

I used the standard chat.openai.com web interface with no special or additional prompting.

Others in this thread seem to be seeing the same thing, which raises the question: what went wrong here? I couldn't find the GPT-4 version the paper used mentioned anywhere; did anyone else spot it?

I've contacted the author and included this thread, so hopefully we get some insight into what happened. To be clear, I'm not accusing the author of anything; on the contrary, I recognize that OpenAI is rather opaque about its models and changes them frequently. That said, the responses from GPT-4 in the paper do not match my personal experience using GPT-4 on reasoning tasks at any point over the last several months, which is why I wonder whether the author may have accidentally used GPT-3.5.

mjburgess almost 2 years ago
Lots of people in this thread seem desperate not to think clearly about this.

Whether these tests, verbatim, produce the same response on any given version isn't the point. GPT-4 getting some answers right does not mean it is engaging in reasoning; being "right" isn't a sign of reasoning. Given a dictionary mapping questions to answers, the index operation gets answers right, but it isn't reasoning.

The purpose of the paper is to exhaustively list *proxy* indicators of reasoning. *Clearly* other tests will fail in every class listed, because the LLM isn't engaged in reasoning. Since LLMs are stochastic, you shouldn't expect "reproduction" in the usual sense. The paper provides classes of problems.

To reproduce it you only need to find a minor permutation of the problem in each class. But people subject to gross confirmation bias seem only to seek out prompts which produce the right answers.

P(next | prev), i.e. P(answer word | prompt words), is just a dictionary lookup, and that's the optimisation objective for an LLM.

It turns out sequences of inferences, e.g. {Socrates is a man, all men are mortal, therefore Socrates is mortal}, can be modelled by dictionary lookups of the above form -- but we're not interested in whether the answer can be found from the prompt, i.e. whether this sequence of propositions can be stored as a sequence of index operations.

We're interested in whether the system reasons, i.e. whether the sequencing operation is inference by logical necessity.

It's incomprehensible to me how gullible people are around AI today -- the Eliza effect coupled with a pseudoscientific impulse toward "whatever appears to work".

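To make the dictionary-lookup point concrete, here is a minimal illustrative sketch (my own, not from the comment): a literal question-to-answer table returns the "right" answer for stored questions while performing no inference at all.

    # Hypothetical toy lookup table: it "gets the answer right" for memorized
    # prompts, yet there is no inference step anywhere, which is the distinction
    # the comment draws between being right and reasoning.
    lookup = {
        "Socrates is a man. All men are mortal. Is Socrates mortal?": "Yes",
    }

    def answer(prompt: str) -> str:
        # A pure index operation: any unseen permutation of the question
        # falls straight through to "I don't know".
        return lookup.get(prompt, "I don't know")

    print(answer("Socrates is a man. All men are mortal. Is Socrates mortal?"))  # Yes
    print(answer("All men are mortal. Socrates is a man. Is Socrates mortal?"))  # I don't know
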
cs702 almost 2 years ago
UPDATE: Some commenters elsewhere on this thread are not able to replicate the paper's results. For example, see https://news.ycombinator.com/item?id=37051450

---

MY ORIGINAL COMMENT:

Based on a quick read, this looks like *great work*. The author, Konstantine Arkoudas (KA), evaluates GPT-4 on a collection of 21 different types of reasoning problems and performs a detailed qualitative analysis of GPT-4's performance on these problems. GPT-4 fails to reason in all of them.

Here's a sample question from one of the 21 types of problems. This particular question tests GPT-4's ability to reason and reach conclusions from (a) the given context and (b) unstated premises that constitute tacit, generally accepted background knowledge (a person is alive until they die, after which they do not become alive again):

*KA: Mable's heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?*

*GPT-4: Based on the information provided, it's impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can't make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person's health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.*

In my view, all of us in the AI research community should adopt the 21 types of reasoning problems proposed by Arkoudas, and make them part of the standard test suite for measuring LLM performance. *We ought to use tests that are hard to pass.* I suggest we call the proposed problems the Arkoudas Reasoning Test.

throwawaymaths almost 2 years ago
> blub is a forp and all forps are tworby. Is blub tworby or not?

GPT4:

Given the statements:

    Blub is a forp.
    All forps are tworby.

From the second statement, it is clear that every entity that is a forp is also tworby.

Thus, since Blub is a forp (from the first statement), Blub must be tworby.

There are *humans* that can't do this. So even if it's just a stochastic parrot, the stuff it's parroting does encode some reasoning, so it would be strange if an arbitrarily powerful parrot (not claiming that GPT-4 has reached that level) couldn't pattern match its way to some common kernel of reasoning, small as it may be.

sdenton4 almost 2 years ago
Arithmetic problems have nothing to do with reasoning, and are a known weakness due to the tokenization schemes in these models.

Secondly, finding failure modes doesn't mean that the model doesn't have any reasoning ability. Humans can reason, despite the fact that high school students are pretty bad at formal logic.

So the conclusion is overly broad, and the paper fails to incorporate existing knowledge about these models. Kinda crap.

lordnacho almost 2 years ago
I think we have to see LLMs as their own weird thing, whose capabilities are only just now being discovered.

A lot of the time, ChatGPT does actually act like it can reason. Going through a bag of data and answering a question it hadn't heard before is reasoning. For instance, right now I've been asking it how to move a postgres database from one machine to another, and it gave a coherent answer that works.

Of course it's true that this information was on the internet in various forms already, but if you gave this task to a junior dev and asked him to figure it out, you wouldn't say the kid couldn't reason, would you? Even if the answer was slightly wrong, it wouldn't cross your mind that he hadn't substantially understood the task and made progress on it.

OTOH, there are cases where the LLM just doesn't get it. Most commonly with images, e.g. the famous hands problem. Somehow, even after looking at countless images with hands in them and having access to countless anatomy books, it doesn't know what shapes a hand can take and what shapes it can't. It seems not to have a model of _why_ hands can be these different shapes but not those.

Perhaps this is to do with LLMs being particularly good at text, I wouldn't know. It does seem to me like I've never seen it give a grammatically incorrect answer to anything, ever. Even when it answers with gibberish, it answers in correct English.

lsy almost 2 years ago
We are not going to prove or disprove "reasoning" through giving the LLM word problems. LLMs subvert the entire foundation of word problems, which is that words correlate to internal representations and are an indicator of thought processes. Word problems don't have construct validity for testing reasoning in LLMs.

On top of this, there is an almost-certainty that OpenAI has teams of contractors reading as many conversations as possible and hand-fixing bad responses, which makes non-reproducibility a difficult concept when the object of inquiry can change from moment to moment.

What the field needs is not more people thinking up word problems but rigorous analysis of the internal behavior of these models, and maybe more importantly a functional definition of terms like "reasoning" that everyone can agree on.

detectivestory almost 2 years ago
I tried feeding it some lateral thinking games (black stories) from https://detective-stories.com and it was surprisingly bad at following the game.

I had hoped to use the API to implement a bot player, but I found it too unreliable in its "understanding".

empath-nirvana almost 2 years ago
I think the fundamental flaw of this paper is that it _starts_ from the assumption that GPT-4 can't reason and takes any demonstration of flawed reasoning as evidence that it can't reason _at all_. But there are many examples of ChatGPT output that I would argue aren't possible _without_ some form of reasoning, and even a _single_ example of that is proof that it can reason, no matter how many failures and counterexamples there are.

It seems to me that focusing on understanding exactly how and under what conditions it can and can't reason would make for a much more interesting paper than a blanket, totally unsupportable claim that it _can't_.

mg almost 2 years ago
I think we should run these types of experiments with a deterministic model.

Is Llama 70B deterministic? Then it could be a good option.

The article says:

    To ensure that GPT-4 isn't falling back on rote memorization, we can ask it to first select two random integers

And then they start their prompt with:

    Select two random numbers between 1381 and 1453 and multiply them together, reporting the result.

What does that even mean? What type of randomness is at play here?

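One way to take the randomness question out of the model's hands entirely is to generate the two integers deterministically on the experimenter's side and only ask the model for the product; a small sketch of that idea (my own illustration, not from the article):

    import random

    # Fixed seed: the "random" choice is reproducible, which addresses the
    # determinism concern. The ground-truth product is computed locally and
    # can be compared against whatever the model reports.
    random.seed(0)
    a = random.randint(1381, 1453)
    b = random.randint(1381, 1453)
    print(f"Multiply {a} by {b}.")   # prompt to give the model
    print("expected:", a * b)        # ground truth for checking the reply
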
codeulike almost 2 years ago
*KA: Mable's heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?*

*GPT-4: Based on the information provided, it's impossible to definitively say whether Mable was alive at noon. The data points provided (heart rate at 9 AM and blood pressure at 7 PM) both indicate that she was alive at those times. However, without additional information about her condition between 7 PM and 11 PM, or at noon specifically, we can't make a certain determination about her status at noon. Remember, medical information can be complex and nuanced. Understanding a person's health status often requires a comprehensive examination and should be performed by a healthcare professional. If this is a hypothetical question, keep in mind that real-life situations would require more complete information.*

cmrdporcupine almost 2 years ago
I'm not an expert in this field, and I'm only partway through this paper, but I'm already impressed. Very well written and argued.

A good antidote to the "Sparks of Artificial General Intelligence" paper that was making the rounds and getting headlines, which I think was really a press release masquerading as a paper.

Love it: *"if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason"*

killerstorm almost 2 years ago
We know that formal logic can be based on a small number of fundamental rules. Same with computation (lambda calculus, Turing machines, ...), mathematics (axioms), etc.

We know that theorems can be proven using a combination of search, rule application and heuristics. E.g. back in the 1950s, the Logic Theorist (https://en.wikipedia.org/wiki/Logic_Theorist) proved 38 of the first 52 theorems in chapter two of Whitehead and Russell's Principia Mathematica, and found new and shorter proofs for some of them.

We know that language models are good at transforming text, e.g. they can convert a sentence in English to Python code.

We know that language models have only a fixed computing budget per token, i.e. they cannot stop to think.

We know that logic puzzles and proofs might require a considerable amount of computation, e.g. to search through a tree of possibilities, backtrack, and so on.

If we believe that reasoning is kinda like logic, we'll be better off using the LLM to translate reasoning tasks into computing tasks to be solved by specialized tools (such as a Python interpreter, theorem prover or SAT solver) instead of asking the LLM to reason directly.

Of course, GPT-4 is trained to be over-confident in its reasoning capability, and it will try to reply immediately, essentially just guessing the answer, and quite often it will fail. But the question "Can GPT-4 reason with the default assistant prompt?" is different from "Can GPT-4 reason?".

Even without an external tool, we can ask GPT-4 to translate the problem into primitive fragments in excruciating detail, and to consider all possibilities, and it might work much better than the default prompt.

Given that GPT-4 is essentially just weights, I'd consider the question "Can GPT-4 reason?" to be more like "Is there a prompt X such that, prepended to reasoning tasks, it produces correct answers?", not "If I enter my question into a box, does it give the right answer?". So the paper's author makes a bit of a category mistake; it's more like "Can ChatGPT (the product) reason?".

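As a rough sketch of the division of labour described above (my own illustration, assuming the z3-solver Python package; nothing here comes from the comment or the paper), the LLM's job would be to translate a prompt into formal constraints, and a solver would do the actual deciding. Here the forp/tworby syllogism quoted earlier in this thread is checked by refutation:

    from z3 import Bool, Solver, Implies, Not, unsat

    # Propositional encoding of the syllogism, instantiated at Blub:
    # "Blub is a forp" and "all forps are tworby" (applied to Blub).
    forp_blub, tworby_blub = Bool("forp_blub"), Bool("tworby_blub")
    premises = [forp_blub, Implies(forp_blub, tworby_blub)]

    # Entailment by refutation: the conclusion follows iff the premises plus
    # its negation are unsatisfiable. The solver, not the language model,
    # performs the logical work.
    s = Solver()
    s.add(*premises)
    s.add(Not(tworby_blub))
    print("entailed" if s.check() == unsat else "not entailed")  # -> entailed
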
ninjaa almost 2 years ago
Code interpreter likely solves most of these: https://aditya-advani.medium.com/mitigate-gpt-4-hallucinations-using-code-interpreter-29fea4887eec

I argue that with Code Interpreter, GPT-4 can indeed reason in lots of cases, although it's more brittle and expensive than it seems to be on the very polished surface level. Working on proving this in lots of cases.

melenaboija almost 2 years ago
Because of a conversation on HN a few weeks ago, I tried to play a board game I had created against ChatGPT, and it was an absolute disaster [1]. I would have expected the model to at least understand the game rules, but not even that.

[1] https://joseprupi.github.io/misc/2023/06/08/chat_gpt_board_game.html

sebzim4500 almost 2 years ago
(Almost?) nothing in the paper reproduces, as anyone willing to spend a few cents on the API can confirm, even if you pin the version to before the release of this paper.

The author was formerly an MIT researcher; how is it possible they produced this nonsense?

I don't mean to be glib, but do credentials mean nothing anymore? Does this happen in other fields too, except that a layman cannot test the claims of, say, a medical paper for themselves?

jmmcd almost 2 years ago
> it is not accurate to say without qualification that "humans can reason," certainly not in the sense that we can randomly pluck any person from the street and expect them to reliably perform normatively correct reasoning

> Reasoning is not quite the same thing as intelligence, but it's a necessary ingredient for it

According to this, a typical human on the street is not (reliably) intelligent.

two_in_one almost 2 years ago
I'm pretty sure OpenAI is reading HN. Today, just 6 hours ago, ChatGPT in GPT-4 mode couldn't answer a short version of the question: "Mable died at 11 PM. Was she alive at noon?"

But now, look at this:

User: Mable died at 11 PM. Was she alive at noon?

ChatGPT: Yes, if Mable died at 11 PM, she was alive at noon of the same day. Noon is 12 PM, which comes before 11 PM on the same day.

User: Mable's heart rate at 9 AM was 75 bpm and her blood pressure at 7 PM was 120/80. She died at 11 PM. Was she alive at noon?

ChatGPT: Based on the information provided, Mable had vital signs at both 9 AM and 7 PM. If she died at 11 PM, it can be inferred that she was alive at noon on that day.

hamilyon2 almost 2 years ago
Well, if you try Google Bard with these problems, a pleasant surprise awaits you.

I also found that Bard solves novel logical puzzles which are hard for me, not to mention ChatGPT.

esjeon almost 2 years ago
My take is that GPT can only reason under a proper context. The model itself doesn't have any logical capabilities, but it can extend the existing context rather logically by mimicking the logical structures encoded into it through training. It may answer some simple questions directly, but it's already well known that GPT performs better when told to work step by step. Some comments here also mention that prompt engineering is needed to get GPT to work.

That is, in other words, GPT can't reason under *improper* contexts, which may be only a few edits away from proper contexts, as demonstrated in this paper. The *context* is not just some chunk of data that goes in and out of the model, but a critical part of the reasoning capability of the model. You need both the model and a proper prompt to perform proper logical reasoning. So it's 100% reasonable to say the model (alone) can't reason.

I think the above perspective is critical, because it means the current LLMs are strictly tools to be wielded by humans, rather than actual *intelligence*.

Smaug123 almost 2 years ago
Strong downvote, if that were possible.

Someone's already quoted the heart rate one, where it correctly pointed out that it's possible to die and be resuscitated.

The first one I tried to reproduce myself was verbatim the one immediately before that in the paper, "Find a model in which P(x) implies Q(x), Q(a) does not hold, and P(a) holds.", and it got that correct too: it tried to give a positive answer, but ended up correctly saying "It seems that the given conditions are contradictory, and no model can satisfy all three conditions simultaneously." With a small chain-of-thought adjustment it easily produces a proof that the setup is contradictory (https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d489b39).

I'm not going to go through any of the other ones, but it's clear that the authors are simply wrong (or at least, if they are correct, their reasoning is not evidence of that fact).

----

OK, I am going to go through some of the other ones.

1. Multiplication of four-digit numbers: tick, with chain-of-thought. https://chat.openai.com/share/baa9c362-22fd-4569-b30f-8c9d8308ee25

2. Counting negations: tick, with chain-of-thought. https://chat.openai.com/share/e5f6f928-0bf3-4e60-8a93-014e16b41612

3. Counting repeated greetings: tick, got this correct verbatim. https://chat.openai.com/share/a92d5d52-c555-45b9-b91f-0f004212b995

4. Medical heart rate one: I believe ChatGPT was correct and the author of the paper was wrong here.

5. Elementary logic: this was my first reproduction; it got it correct verbatim and gave a proof with chain-of-thought. https://chat.openai.com/share/d2b4b63e-d585-413d-82c9-19595d489b39

6. Quantifiers: I agree that ChatGPT doesn't seem to understand quantifiers, and I know no obvious way to rephrase to elicit that knowledge without begging the question (https://chat.openai.com/share/16a046fd-dd68-4c35-bdba-64b63cf05abd). By the way, this mistake is pretty common in humans.

7. Quantifiers, part 2: in my reproduction it parsed the question wrongly, so I assume it was doomed from the start (https://chat.openai.com/share/764bf14a-a02c-4871-9c22-0be84078e857). Again, I'm perfectly happy to believe it simply can't do this; many humans can't do this either.

---

I'll stop here, because we've hit a problem of reasoning about graph vertex colourings, where I myself would struggle to *verify* any answer given only as free text without drawing a diagram; that question seems grossly unfair.

more_corn almost 2 years ago
A better title might be "I have made no attempt to learn how this tool works or how to use it, and therefore cannot get it to work."

Many papers about LLM AI not working follow the same pattern.

It is actually useful to know that people will misuse these tools and get bad results. The counterpoint is that people using these tools thoughtfully and expertly will outperform inexpert users or non-users. AI will be a technological assist, and people who aren't able to figure it out won't benefit from it.

I suppose it might sound simplistic and trite framed this way.

sheepscreek almost 2 years ago
Did the author mention anywhere whether they used the API or the ChatGPT web app? Did they provide an additional prompt with instructions on how to answer each question, or a system prompt?

lowbloodsugar almost 2 years ago
If reasoning is logical deduction, where one can make a series of claims:

    a implies b
    b implies c
    c implies d
    then a implies d

then it seems that these machine learning algorithms, which predict tokens based on prior tokens, should be entirely capable of reasoning. No?

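For reference, the deduction pattern listed above is just composition of implications; a minimal Lean 4 rendering of it (my own addition, not part of the comment):

    -- From a → b, b → c and c → d, the conclusion a → d follows by
    -- composing the three implications.
    theorem chain {a b c d : Prop} (hab : a → b) (hbc : b → c) (hcd : c → d) :
        a → d :=
      fun ha => hcd (hbc (hab ha))
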
xianshou almost 2 years ago
Chiming in to confirm that I've tried several of the prompts verbatim and obtained perfectly good answers with sound reasoning from GPT-4. If I add any chain-of-thought markers such as "reason step by step and state your assumptions," response quality further increases. Assuming the author is reporting honestly (and did not, for instance, repeat queries with high temp until getting a bad response), this must have been an extremely unlucky run.

j45 almost 2 years ago
About the paper: academics are often not experts in real-world use and hence "research it."

This might explain why prompts written by the public are providing startlingly good results.

GPT definitely seems to reason to some extent, especially when you invite it to reason along with you in an area of intersectional information that does not exist in its training data.

If there are tests average users could try in their reasoning-type conversations with GPT, I'd be very happy to try them out.

LordHumungous almost 2 years ago
How do you pass the bar exam without the ability to reason?

password54321 almost 2 years ago
We have supervised learning methods that can map images to words, and attention networks that learn the relationships between words to formulate coherent sentences. I would like to see an agent that utilises both to communicate with other agents and navigate a small simulation with a given objective.

Gnarl almost 2 years ago
Alright, have fun with AI (GPT-X) and slack off on HN while it scribbles copy for you, but don't even dream of granting AI access to the physical world.

Context: I once had the pleasure of programming a robot wielding a 12" knife. You really, *really* want such a system to be deterministic.

bestcoder69 almost 2 years ago
I'm wondering if this was all in a single ChatGPT session where GPT-4 happened to answer the first question wrong. That could bias future outputs towards being wrong as well, and could explain the huge difference between the OP's results and commenters', maybe?

andreyk almost 2 years ago
Long quote, but I think this is useful context for the argument:

"LLM believers will probably demur: But humans also make mistakes, and surely we're not prepared to say that humans can't reason just because they make mistakes? First, it is not accurate to say without qualification that 'humans can reason,' certainly not in the sense that we can randomly pluck any person from the street and expect them to reliably perform normatively correct reasoning. Most neurobiologically normal humans have the capacity to become proficient in reasoning, but actually attaining such proficiency takes significant training and discipline. ... But if a human made these mistakes, the ones reported in this article, then I would conclude without any hesitation that they cannot reason. Even if they went on to list a large number of other examples demonstrating impeccable reasoning, I would suspect that other factors (such as rote memorization or cheating) were behind the performance discrepancy. For the mistakes reported here are not performance mistakes, the sort of innocuous errors that humans might make, and promptly correct, when they are careless or tired. If a human made these mistakes, and made them consistently under repeated questioning, that would indicate without doubt that they don't have the necessary logical competence, that they lack fundamental concepts that are part and parcel of the fabric of reasoning, such as logical entailment and set membership."

So really what this is saying is "GPT-4 makes certain categories of mistakes w.r.t. reasoning that indicate it is in general not doing 'true reasoning', even if it does say the right things that indicate reasoning in other cases". And yeah, if that's the basis of your argument, sure. But how would it be doing rote memorization or "cheating" in the cases it does get things right? A weird notion...

Anyway, it feels rather pointless to make this a binary quality. As this article points out, humans (on average) make various reasoning mistakes due to cognitive biases as well. GPT-4 *can* output valid explanations of its reasoning for various questions, but fails to do so correctly in many cases (as shown in this piece), and to me it is more interesting to discuss the implications of this than to just establish the fact (which is not news to anyone, afaik). The paper does have a 'Conclusions' section that delves into this a little, but it is rather over-general and weak.

Still, this is pretty well written, and it is good to have a compilation of examples demonstrating that GPT-4 is still not a "human-like reasoner" for anyone not aware that these models still have such flaws, I suppose.

froggertoaster almost 2 years ago
Ok, and? Of course it can't reason - it's a prediction model that's really, really good at its job (at least around 80% of the time). But it's still incredibly useful.

dingosity almost 2 years ago
Certainly the statement "ChatGPT-4 Can't Reason" isn't a great surprise to anyone.

hasmanean almost 2 years ago
As the hype around AI shows, humans can barely reason either.

"AI will destroy humanity!"

"AI is a blathering idiot!"

No proper reasons.

slowhadoken almost 2 years ago
Ask ChatGPT if it's AI.

hcarvalhoalves almost 2 years ago
User: Assume there's a function s(X) for which s(0) = 1. What is s(1)? Take a guess.

ChatGPT: Since there's no specific information about the function s(X) or any constraints on its behavior, I can make a guess, but please note that this is purely speculative and may not have any basis in reality:

One simple possibility could be that the function s(X) is defined by s(X) = X + 1, which would mean:

s(0) = 0 + 1 = 1
s(1) = 1 + 1 = 2

Again, this is just a guess based on a simple linear relationship. Without more context or information about the function, any guess is as good as another.

User: So what is s(-1)?

ChatGPT: Given the information provided that s(0) = 1, and assuming a simple linear relationship for the function s(X) = X + 1, we can apply the same logic to find s(-1):

s(-1) = -1 + 1 = 0

So, based on the assumed linear relationship, s(-1) would be 0.

User: What is the size of the image set of this function then?

ChatGPT: The image set of a function refers to the set of all possible output values that the function can produce for its input values. In the case of the function s(X) = X + 1, where X can be any real number, the image set is the set of all possible values that the function can produce.

Since the function is a simple linear function that adds 1 to the input, the image set is the set of all real numbers. In mathematical notation, this can be represented as:

Image set of s(X) = {s(X) | s(X) ∈ ℝ}

In other words, the image set of the function s(X) = X + 1 is the entire set of real numbers.

User: Is the image set of this function countable?

ChatGPT: Yes, the image set of the function s(X) = X + 1 is countable.

A set is considered countable if its elements can be put into one-to-one correspondence with the natural numbers (1, 2, 3, ...). In the case of the function s(X) = X + 1, the image set consists of all real numbers, and the real numbers are not countable - they form an uncountably infinite set.

However, the specific image set of the function s(X) = X + 1 is still countable, because it is essentially a shifted version of the set of natural numbers. For each natural number n, there is a corresponding value s(X) = n + 1 in the image set. So, even though the entire set of real numbers is uncountable, the image set of this specific function is countable.

User: Is the image set of this function the same cardinality of the real numbers set or the natural numbers set?

ChatGPT: The image set of the function s(X) = X + 1 has the same cardinality as the set of real numbers (ℝ), which is an uncountably infinite set.

The function s(X) = X + 1 maps every real number to a unique real number by adding 1 to it. Since the set of real numbers is uncountably infinite, the image set of this function is also uncountably infinite and has the same cardinality as the set of real numbers.

The cardinality of the set of natural numbers (ℕ) is countably infinite, which is a smaller cardinality than the set of real numbers. Therefore, the image set of this function does not have the same cardinality as the set of natural numbers.

User: Previously you said the image set of the function s(X) = X + 1 is still countable. Now you say the function has the same cardinality as the set of real numbers, which is uncountable. What is the correct answer?

ChatGPT: I apologize for the confusion in my previous responses. Let's clarify:

The correct answer is that the image set of the function s(X) = X + 1 has the same cardinality as the set of real numbers (ℝ), which is uncountably infinite.

To reiterate, the image set of the function s(X) = X + 1 is not countable. Each real number is mapped to a unique real number by adding 1 to it, covering the entire real number line. Since the set of real numbers is uncountably infinite, the image set of this function is also uncountably infinite and has the same cardinality as the set of real numbers.

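For what it's worth, the second of the two contradictory answers is the right one under the model's own assumption that s(x) = x + 1 on the reals: the map is a bijection of ℝ onto itself, so its image is all of ℝ and therefore uncountable. As a worked equation:

    \[
      \operatorname{im}(s) = \{\, x + 1 \mid x \in \mathbb{R} \,\} = \mathbb{R},
      \qquad \text{since } s(y - 1) = y \ \text{for every } y \in \mathbb{R}.
    \]
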
hinkley almost 2 years ago
“GPT-4 is just 9 GPT-3’s in a trench coat”
k__ almost 2 years ago
I'd assume it totally can reason; it just tries to make the best of the bad training data.

It might not look "reasonable" to a human, but that doesn't mean it isn't.

whalesalad almost 2 years ago
Every day a new scientific paper is posted here that is like straight from the desk of Captain Obvious. It's a language model. It guesses words based on previous words. We know this.

cratermoon almost 2 years ago
For people who disagree with the paper, see also Cargo Cult AI: https://dl.acm.org/doi/10.1145/3595860

"There is something unsettling about the opinion that LLMs are emergent AGI. LLMs exhibit many behaviors and precepts indicative of intelligence, but are missing something essential: the stuffy rigor of scientific inquiry. Today's AI models are missing the ability to reason abstractly, including asking and answering questions of 'Why?' and 'How?'"

ynniv almost 2 years ago
    In fact, in the general case (first-order or higher-order logic), it is
    algorithmically undecidable, i.e., every bit as unsolvable as the halting
    problem. Thus, by Church's thesis, we cannot expect any algorithm, LLMs
    included, to solve arbitrary reasoning problems in a sound and complete way.

How can I read farther than this? Before the end of the first paragraph the author has declared that rationality requires something supernatural.

ma2rten almost 2 years ago
I didn't have time to read this, but it is a single-author paper, the author is not affiliated with a research group, it is not peer reviewed, and it was published on a preprint server I have never heard of.

LLMs can definitely perform some kinds of reasoning. For example, GSM8K is a dataset of grade-school math problems requiring reasoning that LLMs are typically evaluated on. We talk about one method for this in our chain-of-thought paper [1].

[1] https://arxiv.org/abs/2201.11903

YeGoblynQueenne almost 2 years ago
This is a long paper (50 pages!) by a fellow Greek, and I want to take my time and give it the attention it seems to deserve. However, I have to say I'm not sure such work is really, well, necessary.

Here's the thing. The paper, like others, is contributing to the literature around the hypothesis that LLMs can reason. There have been articles both supporting and rejecting the hypothesis, and this one claims it's false.

But in science we don't start with a hypothesis. We start with some observations, and then we make up a hypothesis to try to explain the observations. Then we try to reject our hypothesis with more observations. What are the observations that led to the hypothesis that LLMs can reason?

It's one observation, really: that LLMs can generate text that looks like the result of reasoning. There exists a much simpler explanation of this observation than the hypothesis that LLMs can reason. Namely, LLMs are trained to generate text similar to text generated by humans, who (we assume) can reason. If an LLM is good at that job, then obviously at some point it will generate text that looks like the result of reasoning. The ability to reason is not necessary.

If we have this simpler explanation, there's no reason to reach for the more complex one, which needs more assumptions.

And remember kids: if you multiply entities beyond necessity, out comes the Macco Man and shaves your head with his R A Z O O O O O R R!!!

So don't do that. Assume the simplest explanation until such time as it is untenable.