Debates over AI benchmarking have reached Pokémon

Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google’s latest Gemini model surpassed Anthropic’s flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer’s Twitch stream; Claude was stuck at Mount Moon as of late February.

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x

— Jush (@Jush21e8) April 10, 2025

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify “tiles” in the game like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.

Now, Pokémon is a semi-serious AI benchmark at best — few would argue it’s a very informative test of a model’s capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model’s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a “custom scaffold” that Anthropic developed.

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. In other words, comparing models is unlikely to get any easier as new ones are released.
