TechEcho

A tech news platform built with Next.js, providing global tech news and discussions.


AIs ranked by IQ; AI passes 100 IQ for first time, with release of Claude-3

31 points by nopinsight, over 1 year ago

12 comments

Xcelerate, over 1 year ago
Shane Legg gave a really neat talk in 2010 about devising a good measure of "machine intelligence": https://youtu.be/0ghzG14dT-w?si=OPvVqre0WqsnSUum

Of course, he is well known for his paper with Marcus Hutter providing a mathematical definition of universal general intelligence. I'm not sure we've made much progress since then at turning this highly theoretical notion into some sort of practical "AI IQ", though.

Personally, I would argue that the already widely used cross-entropy loss for sequence prediction, applied to datasets containing highly diverse types of data generated or collected by humans, is a pretty darn good approximation. Much better than attempting to use IQ tests.

The only problem with this approach is that an AI can converge on higher intelligence in a lopsided fashion, depending on how much weight is given to the different problem domains represented in the dataset; suppose our sequence predictor performs well on the subsets of the training data that relate to photographs, but not on those that relate to mathematical proofs.

For an optimal machine intelligence, the weights don't really matter (it will perform as well as possible across all problem domains), but from the perspective of how we want to steer improvements to the sequence predictor, we need to specify these weights manually; otherwise they will be determined implicitly by the number of samples in the dataset representing each problem domain.

I suppose the selection of these weights is an optimization problem in its own right: if the eventual goal is minimizing total loss across all problem domains relevant to humans (i.e., not a random sample of distinct problem instances of a formal language), then the optimal selection of weights is the one that leads to the fastest improvement in our development of sequence predictors.

Highly weighting human language seems to be having outsized returns at the moment, but I imagine that more heavily weighting problems related to abstract mathematics will lead to better returns in the future.
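The weighting issue above can be sketched in a few lines. Everything here is illustrative: the domain names, the per-domain loss numbers, and both weight vectors are made up purely to show how implicit (sample-count) versus manually chosen weights change the aggregate loss a developer would optimize against:

```python
# Hypothetical per-domain cross-entropy losses (nats per token) for a
# sequence predictor. The domain names and all numbers are made up for
# illustration; only the aggregation logic matters.
domain_loss = {"prose": 2.1, "photographs": 1.4, "math_proofs": 3.8}

def weighted_loss(losses, weights):
    """Aggregate per-domain losses under an explicit weighting."""
    total = sum(weights.values())
    return sum(weights[d] * losses[d] for d in losses) / total

# Implicit weights, as determined by how many samples of each domain
# the training set happens to contain (mostly prose here):
by_sample_count = {"prose": 0.80, "photographs": 0.15, "math_proofs": 0.05}

# Manually chosen weights that deliberately emphasize abstract math:
by_choice = {"prose": 0.40, "photographs": 0.10, "math_proofs": 0.50}

print(weighted_loss(domain_loss, by_sample_count))  # ~2.08, prose-dominated
print(weighted_loss(domain_loss, by_choice))        # ~2.88, proof-dominated
```

Steering progress then amounts to choosing which of these aggregates to drive down.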
TrackerFF, over 1 year ago
IQ tests for models seem... somewhat flawed.

For example, most (if not all) IQ tests will test you on working memory: you'll be given a string of characters and numbers, and then you'll have to repeat them back in some ordered fashion. That is completely trivial for a machine and will produce a skewed, near-maximal score.

Same with detecting differences. A typical task is to be shown two different pictures and find the difference between them. Again, a totally trivial task for a machine.

Or the vocabulary test. Quite trivial for language models.

The final IQ score is some weighted and scaled score composed of all those different parts. When I took the WAIS-IV, that's how it worked.

On the other hand, excluding those (trivial-for-machine) parts would give a score which may not mirror human intelligence, as far as scoring/testing goes.
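As a concrete illustration of why such subtests are trivial for a machine, here is a sketch of two WAIS-style working-memory items. The function names and item formats are mine, not from any actual test battery; the point is that what loads human working memory is a one-liner for a machine, which skews the aggregate score upward:

```python
# Two WAIS-style working-memory items, solved mechanically. The function
# names and input formats are hypothetical approximations of the tasks.

def digit_span_backwards(seq):
    """Digit span (backwards): repeat the heard sequence in reverse."""
    return seq[::-1]

def letter_number_sequencing(item):
    """Letter-number sequencing: restate digits ascending, then
    letters in alphabetical order (e.g. 'T9A3' -> '39AT')."""
    digits = sorted(c for c in item if c.isdigit())
    letters = sorted(c for c in item if c.isalpha())
    return "".join(digits + letters)

print(digit_span_backwards([7, 2, 8, 5, 4]))  # [4, 5, 8, 2, 7]
print(letter_number_sequencing("T9A3"))       # 39AT
```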
nopinsight, over 1 year ago
I am ambivalent about how accurate the test is for an LLM, but it's interesting nonetheless and can be used as a complementary metric for LLM capabilities.

Unlike the Chatbot Arena leaderboard and standard benchmark datasets, visuospatial IQ tests are largely knowledge-free and focused on measuring pattern-matching and reasoning capabilities.
silveraxe93, over 1 year ago
All ~models~ measures are wrong, some are useful.

I think this result is really cool and is another way to measure progress in AI capabilities. I don't think it says much about the absolute position of how "smart" AIs are, but it definitely has value in showing how far they are progressing.
lysecret, over 1 year ago
Reminds me of the talk where they measured an LLM's performance by how well it could draw a unicorn and modify it using SVG.

All measures are wrong, but some are useful.
mauvia, over 1 year ago
How is an AI passing the visual reasoning questions?

Edit:

> But if I translate the image to this (it's tedious to read for us, who are used to processing such things visually):

If you translate the visual questions, they're no longer visual questions; wouldn't this massage the results? Especially given that AIs are really bad at context.
ggm, over 1 year ago
Does this not strongly suggest IQ tests are too crude?
hiq, over 1 year ago
I'm not sure we can deduce much from this without knowing how many of the questions (and answers) were part of the training data.
MrBuddyCasino, over 1 year ago
Wouldn't you expect that an AI would eventually approach the average IQ of its training data?
jug, over 1 year ago
Holy crap, ChatGPT 3.5 fared terribly on that one. I'm usually skeptical of these kinds of tests and rather rely on the blind test at the leaderboard on Hugging Face, but this one was special in producing unusual results that still make "sense".

It looks like a particularly punishing test, but one that still adheres to the trend of LLM advances, so it's not completely BS either.

I actually agree with the test regarding the free Bing Copilot in Creative Mode vs. Gemini Pro 1.0 (called "Gemini (normal)" here). Copilot has been my favorite free way of getting near-GPT-4 quality, and it's clearly been better at coding for me than Gemini. I think these tables will turn soon, though, with the coming public launch of Gemini Pro 1.5.
p0w3n3d, over 1 year ago
Today: AI passes a 100 IQ test. Tomorrow: "Thou shalt not make a machine in the likeness of a human mind", human navigators, and harvesting spice.
Lockal, over 1 year ago
tl;dr: last week the author demonstrated that the "AI" is a random guesser.

Now, instead of feeding it the actual questions, the author inputs:

    3 - 1 - 2
    2 - 3 - 1
    1 - 2 - ?

and the "AI" responds that the answer is 3, with high probability.
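The textual version of that puzzle is a tiny constraint problem: each row contains 1, 2, 3 exactly once, so the missing cell is forced. A short sketch makes that explicit (the grid encoding follows the comment; the solver itself is illustrative and is not the article's method):

```python
# Textual encoding of the 3x3 puzzle from the comment. Each row should
# contain the symbols 1, 2, 3 exactly once, so the blank is determined.
grid = [[3, 1, 2],
        [2, 3, 1],
        [1, 2, None]]

def fill_missing(g):
    """Fill each empty cell with the symbol its row still lacks."""
    symbols = {1, 2, 3}
    for row in g:
        for i, v in enumerate(row):
            if v is None:
                row[i] = (symbols - {x for x in row if x is not None}).pop()
    return g

print(fill_missing(grid)[2][2])  # 3 -- the same answer the model gave
```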