Related ongoing thread:<p><i>Show HN: LLM plays Pokémon (open sourced)</i> - <a href="https://news.ycombinator.com/item?id=43187231">https://news.ycombinator.com/item?id=43187231</a>
This is truly tremendous to watch. Eleven years on from TPP, and we're watching the current best-in-class AI try its hand at the same challenge. Who'll get there first, the historical gestalt of Twitch users or the just-shy-of-10^26 FLOPS [0] AI model?<p>Now here's a concept for anyone with more money than sense: ClaudePlaysTwitchPlaysPokemon, where it's TPP but every participant is Claude. Would hivemind AI consensus perform better than a single AI? Anthropic's certainly looking into it! [1]<p>[0]: <a href="https://www.oneusefulthing.org/p/a-new-generation-of-ais-claude-37" rel="nofollow">https://www.oneusefulthing.org/p/a-new-generation-of-ais-cla...</a><p>[1]: <a href="https://www.anthropic.com/news/visible-extended-thinking" rel="nofollow">https://www.anthropic.com/news/visible-extended-thinking</a>
This is neat, but watching a reasoning model stop to consider "I have read half of a dialogue block, time to press A to get the rest of the text" gets old really quickly. I think I'd rather watch a model play Pokémon against human opponents on a simulator like Pokémon Showdown (which I understand is a bit deeper into an IP-rights grey area than emulating a 30-year-old game). In that case you'd get to see how it handles unknown information and updates its reasoning based on the success or failure of its predictions.
It's run by Anthropic! <a href="https://x.com/AnthropicAI/status/1894419011569344978" rel="nofollow">https://x.com/AnthropicAI/status/1894419011569344978</a>
For anyone interested in watching lots of reinforcement learning agents playing Pokémon Red at once: we have a website that streams hundreds of concurrent games from multiple people's training runs to a shared map in real time!<p><a href="https://pwhiddy.github.io/pokerl-map-viz/" rel="nofollow">https://pwhiddy.github.io/pokerl-map-viz/</a><p>(works best on desktop)
Watching the moment-to-moment play is pretty boring, but it might be interesting if someone puts together highlights of notable events and moments. The screenshot where Claude asks for the game to restart is absolutely charming.
I can't look at the current state of this without wondering if it's tokenizer dyslexia. I wonder how much AI performance growth has been borrowed from overfitting: pruning the tokenizer of invalid sequences and leaking the entire corpus into training, a cardinal sin against making valid predictions.
This would be a really cool category of speed-running. "How fast can a model beat a game that it's never played before?"<p>First get the model to beat a game, then work on better decision-making, then try to speed up the decision-making. Then repeat when better models come out.