
The Second Half

90 points · by Philpax · about 1 month ago

12 comments

armchairhacker · about 1 month ago
RL doesn't completely "work" yet; it still has a scalability problem. Claude can write a small project, but as the project grows, Claude gets confused and starts making mistakes.

I used to think the problem was that models can't learn over time like humans, but maybe that can be worked around. Today's models have context windows large enough to fit a medium-sized project's complete code and documentation, and tomorrow's may be larger; good-enough world knowledge can be maintained by re-training every few months. The real problem is that even models with large context windows struggle with complexity more than humans do: they miss crucial details, then become very confused when trying to correct their mistakes and/or miss other crucial details (whereas humans sometimes miss crucial details, but are usually able to spot and fix them without breaking something else).

Reliability is another issue, but I think it's related to scalability: an LLM that cannot make reliable inferences from a small input cannot grow it into a larger output without introducing cascading hallucinations.

EDIT: creative control is also superseded by reliability and scalability. You could generate any image imaginable with a reliable diffusion model by first generating something vague, then repeatedly refining it (specifying which details to change and which to keep), each refinement closer to what you're imagining. Except even GPT-4o isn't nearly reliable enough for this technique: while it can handle a couple of refinements, it too starts losing details (changing unrelated things).
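The refinement workflow this comment describes can be sketched as a loop. The following is a toy, self-contained simulation, not a real model: an "image" is just a dict of named details, and both refiner functions are hypothetical stand-ins invented for illustration. `reliable_refine` applies exactly the requested change; `lossy_refine` also silently drifts one unrelated detail per call, showing how small per-step unreliability compounds across refinements.

```python
# Toy sketch of the iterative-refinement loop (no real diffusion model):
# an "image" is a dict of named details, and each refinement step asks for
# one detail to change while keeping the rest.

def reliable_refine(image, change, step):
    """Apply the requested change and keep every other detail intact."""
    out = dict(image)
    out.update(change)
    return out

def lossy_refine(image, change, step):
    """Apply the requested change, but also perturb one unrelated detail."""
    out = dict(image)
    out.update(change)
    unrelated = sorted(k for k in out if k not in change)
    if unrelated:
        key = unrelated[step % len(unrelated)]
        out[key] = out[key] + "?"  # a detail the user wanted kept has drifted
    return out

def apply_refinements(image, refinements, refine):
    """Start vague, then refine step by step toward the imagined image."""
    for step, change in enumerate(refinements):
        image = refine(image, change, step)
    return image

vague = {"subject": "dog", "background": "blurry", "lighting": "flat"}
wanted = [{"subject": "golden retriever"},
          {"background": "beach at sunset"},
          {"lighting": "warm"}]

good = apply_refinements(vague, wanted, reliable_refine)
bad = apply_refinements(vague, wanted, lossy_refine)
```

With a perfectly reliable refiner the final dict matches the target exactly; with the lossy one, details touched early have drifted by the end, which is the cascading failure the comment describes.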
thetrustworthy · about 1 month ago
For those who are knowledgeable about the field but not yet familiar with the author of this post: it is worth mentioning that Shunyu Yao has played a huge role in the development of LLM-based AI agents, including being an author / contributor to:

- ReAct

- Reflexion

- SWE-bench

- OpenAI Deep Research

- OpenAI Operator
wavemode · about 1 month ago
> AI has beat world champions at chess and Go, surpassed most humans on SAT and bar exams, and reached gold medal level on IOI and IMO. But the world hasn't changed much, at least judged by economics and GDP.

> I call this the utility problem, and deem it the most important problem for AI.

> Perhaps we will solve the utility problem pretty soon, perhaps not. Either way, the root cause of this problem might be deceptively simple: our evaluation setups are different from real-world setups in many basic ways.

LLMs are reaching the same stage that most exciting technologies reach. They have quickly attracted lots of investor money, but that is going to have to start turning into actual revenue. Many research papers are being written, but people are going to start wanting to see actual improvements, not just theoretical gains on benchmarks.
mplanchard · about 1 month ago
Meta request to authors: please define your acronyms at least once!

Even in scientific domains where a high level of background knowledge is expected, it is standard practice to define each acronym prior to using it in the rest of the paper, for example: "using three-letter acronyms (TLAs) without first defining them is a hindrance to readability."
GiorgioG · about 1 month ago
More AI hype from an AI "expert". AI in software development is still a junior developer that memorized "everything" and can learn nothing beyond that; it will happily lie to you, and it will never tell you the most important thing a developer can be comfortable saying: "I don't know".
conartist6 · about 1 month ago
"Solving" Dota is a huge, huge, HUGE overstatement of the kind you are pointing out.

The players it played against had never faced something that behaved so *weirdly*. It had lightning reflexes and it clearly wasn't human. It was playing a toy game mode requiring about 5% of the skills needed for a full match. In other words, they engineered it to look good at the toy task, and it did. But they didn't give the pros any time at all to learn their opponent -- after all, they might have figured out how to play against it!
jarbus · about 1 month ago
I largely agree, and this is actually something I've been thinking for a while. The problem was never the algorithm; it's the game the algorithm is trying to solve. It's not clear to me to what extent we can push this beyond math and coding. Robotics should be ripe for this, though.
cadamsdotcom · about 1 month ago
Benchmark saturation will keep happening.

Which is great! There's room in the world for new benchmarks that test for more diverse things!

It's highly likely that at least one of the new benchmarks will eventually test for all the criteria being mentioned.
yapyap · about 1 month ago
> Instead of just asking, "Can we train a model to solve X?", we're asking, "What should we be training AI to do, and how do we measure real progress?"

To say we are at a point where AI can do anything reliably is laughable: it can do much, but it will give you any answer, right or wrong, with full confidence. Trusting such a technology with the big no-human decisions, as we seem to want to, is a fool's errand.
nottorp · about 1 month ago
Is it just me, or are they proposing making LLMs play text adventures?
ma_s79miskl6 · about 1 month ago
he was going to poke some kids
m0llusk · about 1 month ago
Um, what is RL?