TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


AIs ranked by IQ; AI passes 100 IQ for first time, with release of Claude-3

31 points by nopinsight, over 1 year ago

12 comments

Xcelerate, over 1 year ago
Shane Legg gave a really neat talk in 2010 about devising a good measure of "machine intelligence": https://youtu.be/0ghzG14dT-w?si=OPvVqre0WqsnSUum

Of course, he is well known for his paper with Marcus Hutter providing a mathematical definition of universal general intelligence. I'm not sure we've made much progress since then at turning this highly theoretical notion into some sort of practical "AI IQ", though.

Personally, I would argue that the already widely used cross-entropy loss for sequence prediction, applied to datasets containing highly diverse types of data generated or collected by humans, is a pretty darn good approximation. Much better than attempting to use IQ tests.

The only problem with this approach is that an AI can converge on higher intelligence in a lopsided fashion, depending on how much weight is given to the different problem domains represented in the dataset; suppose our sequence predictor performs well on the subsets of the training data that relate to photographs, but not on those that relate to mathematical proofs.

For an optimal machine intelligence, the weights don't really matter (it will perform as well as possible across all problem domains), but from the perspective of how we want to steer improvements to the sequence predictor, we need to specify these weights manually; otherwise they will be determined implicitly by the number of samples in the dataset representing each problem domain.

I suppose the selection of these weights is an optimization problem in its own right: if the eventual goal is minimizing total loss across all problem domains relevant to humans (i.e., not a random sample of distinct problem instances of a formal language), then the optimal selection of weights is the one that leads to the fastest improvement in our development of sequence predictors.

Highly weighting human language seems to be having outsized returns at the moment, but I imagine that more heavily weighting problems related to abstract mathematics will lead to better returns in the future.
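The weighting issue above can be sketched in a few lines. Everything here is illustrative: the domain names, the per-domain loss numbers, and both weight vectors are made up purely to show how implicit (sample-count) versus manually chosen weights change the aggregate loss a developer would optimize against:

```python
# Hypothetical per-domain cross-entropy losses (nats per token) for a
# sequence predictor. The domain names and all numbers are made up for
# illustration; only the aggregation logic matters.
domain_loss = {"prose": 2.1, "photographs": 1.4, "math_proofs": 3.8}

def weighted_loss(losses, weights):
    """Aggregate per-domain losses under an explicit weighting."""
    total = sum(weights.values())
    return sum(weights[d] * losses[d] for d in losses) / total

# Implicit weights, as determined by how many samples of each domain
# the training set happens to contain (mostly prose here):
by_sample_count = {"prose": 0.80, "photographs": 0.15, "math_proofs": 0.05}

# Manually chosen weights that deliberately emphasize abstract math:
by_choice = {"prose": 0.40, "photographs": 0.10, "math_proofs": 0.50}

print(weighted_loss(domain_loss, by_sample_count))  # ~2.08, prose-dominated
print(weighted_loss(domain_loss, by_choice))        # ~2.88, proof-dominated
```

Steering progress then amounts to choosing which of these aggregates to drive down.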
TrackerFF, over 1 year ago
IQ tests for models seem... somewhat flawed.

For example, most (if not all) IQ tests will test you on working memory: you'll be given a string of characters and numbers, and then you'll have to repeat them back in some ordered fashion. That is completely trivial for a machine and will produce a skewed, near-maximal score.

Same with detecting differences. A typical task is to be shown two different pictures and find the difference between them. Again, a totally trivial task for a machine.

Or the vocabulary test. Quite trivial for language models.

The final IQ score is some weighted and scaled score composed of all those different parts. When I took the WAIS-IV, that's how it worked.

On the other hand, excluding those (trivial-for-machine) parts would give a score which may not mirror human intelligence, as far as scoring/testing goes.
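As a concrete illustration of why such subtests are trivial for a machine, here is a sketch of two WAIS-style working-memory items. The function names and item formats are mine, not from any actual test battery; the point is that what loads human working memory is a one-liner for a machine, which skews the aggregate score upward:

```python
# Two WAIS-style working-memory items, solved mechanically. The function
# names and input formats are hypothetical approximations of the tasks.

def digit_span_backwards(seq):
    """Digit span (backwards): repeat the heard sequence in reverse."""
    return seq[::-1]

def letter_number_sequencing(item):
    """Letter-number sequencing: restate digits ascending, then
    letters in alphabetical order (e.g. 'T9A3' -> '39AT')."""
    digits = sorted(c for c in item if c.isdigit())
    letters = sorted(c for c in item if c.isalpha())
    return "".join(digits + letters)

print(digit_span_backwards([7, 2, 8, 5, 4]))  # [4, 5, 8, 2, 7]
print(letter_number_sequencing("T9A3"))       # 39AT
```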
nopinsight, over 1 year ago
I am ambivalent about how accurate the test is for an LLM, but it's interesting nonetheless and can be used as a complementary metric for LLM capabilities.

Unlike the Chatbot Arena leaderboard and standard benchmark datasets, visuospatial IQ tests are largely knowledge-free and focused on measuring pattern-matching and reasoning capabilities.
silveraxe93, over 1 year ago
All ~models~ measures are wrong, some are useful.

I think this result is really cool and is another way to measure progress in AI capabilities. I don't think it says much about the absolute position of how "smart" AIs are, but it definitely has value in showing how far they are progressing.
lysecret, over 1 year ago
Reminds me of the talk where they measured an LLM's performance by how well it could draw a unicorn and modify it using SVG.

All measures are wrong, but some are useful.
mauvia, over 1 year ago
How is an AI passing the visual reasoning questions?

Edit:

> But if I translate the image to this (it's tedious to read for us, who are used to processing such things visually):

If you translate the visual questions, they're no longer visual questions; wouldn't this massage the results? Especially given that AIs are really bad at context.
ggm, over 1 year ago
Does this not strongly suggest IQ tests are too crude?
hiq, over 1 year ago
I'm not sure we can deduce much from this without knowing how many of the questions (and answers) were part of the training data.
MrBuddyCasino, over 1 year ago
Wouldn't you expect that an AI would eventually approach the average IQ of its training data?
jug, over 1 year ago
Holy crap, ChatGPT 3.5 fared terribly on that one. I'm usually skeptical of these kinds of tests and rather rely on the blind test at the leaderboard on Hugging Face, but this one was special in producing unusual results that still make "sense".

It looks like a particularly punishing test, but one that still adheres to the trend of LLM advances, so it's not completely BS either.

I actually agree with the test regarding the free Bing Copilot in Creative Mode vs. Gemini Pro 1.0 (called "Gemini (normal)" here). Copilot has been my favorite free way of getting near-GPT-4 quality, and it's clearly been better at coding for me than Gemini. I think these tables will turn soon, though, with the coming public launch of Gemini Pro 1.5.
p0w3n3d, over 1 year ago
Today: AI passes a 100 IQ test. Tomorrow: "Thou shalt not make a machine in the likeness of a human mind", human navigators, and harvesting spice.
Lockal, over 1 year ago
tl;dr: last week the author demonstrated that the "AI" is a random guesser.

Now, instead of feeding it the actual questions, the author inputs:

    3 - 1 - 2
    2 - 3 - 1
    1 - 2 - ?

and the "AI" responds that the answer is 3, with high probability.
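The textual version of that puzzle is a tiny constraint problem: each row contains 1, 2, 3 exactly once, so the missing cell is forced. A short sketch makes that explicit (the grid encoding follows the comment; the solver itself is illustrative and is not the article's method):

```python
# Textual encoding of the 3x3 puzzle from the comment. Each row should
# contain the symbols 1, 2, 3 exactly once, so the blank is determined.
grid = [[3, 1, 2],
        [2, 3, 1],
        [1, 2, None]]

def fill_missing(g):
    """Fill each empty cell with the symbol its row still lacks."""
    symbols = {1, 2, 3}
    for row in g:
        for i, v in enumerate(row):
            if v is None:
                row[i] = (symbols - {x for x in row if x is not None}).pop()
    return g

print(fill_missing(grid)[2][2])  # 3 -- the same answer the model gave
```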