isitcooked.ai

IS IT COOKED?

The same tests, every model, every day. When a model silently gets worse — quantized, throttled, or “improved” — the scores drop and the verdict flips. Receipts included.

last run 2026-07-03 · nothing cooked today

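| Model | Provider | Bench score | Verdict | Structured Extraction | Math & Reasoning | Code Generation | Instruction Following | Report Analysis | Summarization | Customer Service | Web Design | Game Design | Creative Writing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Fable 5 | Anthropic | 91.5 | CALIBRATING | 91 | 100 | 60 | 100 | 100 | 94 | 93 | 100 | 100 | 100 |
| Claude Opus 4.8 | Anthropic | 89.9 | CALIBRATING | 94 | 100 | 100 | 100 | 100 | 93 | 97 | 99 | 100 | 98 |
| Gemini 3.1 Pro | Google | 86.5 | CALIBRATING | 100 | 100 | 100 | 80 | 100 | 92 | 96 | 99 | 98 | 57 |
| GPT-5.5 | OpenAI | 85.6 | CALIBRATING | 97 | 100 | 100 | 100 | 100 | 92 | 94 | 100 | 99 | 99 |
| Claude Sonnet 5 | Anthropic | 82.1 | CALIBRATING | 94 | 100 | 100 | 100 | 100 | 93 | 93 | 95 | 100 | 92 |
| Grok 4.3 | xAI | 79.4 | CALIBRATING | 100 | 100 | 100 | 100 | 100 | 91 | 85 | 98 | 95 | 78 |
| GLM 5.2 | Z.AI | 75.4 | CALIBRATING | 96 | 100 | 100 | 80 | 100 | 82 | 96 | 96 | 66 | 44 |
| DeepSeek V4 Flash | DeepSeek | 69.4 | no data | | | | | | | | | | |
| Qwen3.7 Max | Alibaba | 65.3 | no data | | | | | | | | | | |
| Claude Haiku 4.5 | Anthropic | 65.2 | CALIBRATING | 100 | 100 | 100 | 70 | 100 | 89 | 93 | 100 | 100 | 90 |
| Kimi K2.6 | Moonshot | 63.7 | CALIBRATING | 60 | 100 | 80 | 86 | 80 | 76 | 90 | 87 | 33 | 54 |
| GPT-5.4 Mini | OpenAI | 63.6 | CALIBRATING | 96 | 100 | 100 | 100 | 100 | 91 | 91 | 100 | 99 | 97 |
| DeepSeek V4 Pro | DeepSeek | 62.9 | no data | | | | | | | | | | |
| MiniMax M3 | MiniMax | 60.8 | CALIBRATING | 100 | 100 | 100 | 80 | 100 | 84 | 97 | 98 | 55 | 61 |
| Gemini 3.5 Flash | Google | 58.5 | CALIBRATING | 94 | 100 | 100 | 95 | 100 | 74 | 99 | 96 | 87 | 54 |
| MiMo V2.5 Pro | Xiaomi | 55.6 | CALIBRATING | 89 | 100 | 80 | 70 | 100 | 90 | 97 | 100 | 92 | 88 |
| Nex N2 Pro | Nex AGI | 53.4 | CALIBRATING | 91 | 100 | 100 | 100 | 100 | 90 | 90 | 97 | 44 | 87 |
| Llama 4 Maverick | Meta | 51.7 | no data | | | | | | | | | | |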

Bench score aggregates public benchmarks (LMArena, MMLU-Pro, GPQA, SWE-bench, AIME). Sparklines are 14 days of our own daily tests.
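
The page says that when a model silently gets worse, its daily scores drop and the verdict flips. It does not say how that decision is made, so the following is only a minimal sketch of one way it could work, not the site's actual method. The function name `verdict`, the labels `COOKED` and `FINE`, the reading of `CALIBRATING` as "not enough history yet", and all thresholds are assumptions made for illustration. Only the 14-day window comes from the page.

```python
from statistics import mean, stdev


def verdict(history, today, min_days=14, min_drop=5.0, noise_factor=3.0):
    """Classify today's score against a trailing baseline of daily scores.

    history: earlier daily scores for one model on one test, oldest first.
    today:   the score from the latest run.
    All thresholds and labels are hypothetical, chosen for illustration.
    """
    if len(history) < min_days:
        # Assumed meaning of CALIBRATING: too few days to form a baseline.
        return "CALIBRATING"

    window = history[-min_days:]   # most recent 14 days (per the page)
    baseline = mean(window)
    spread = stdev(window)         # day-to-day noise in the window
    drop = baseline - today

    # Flag only drops that are large in absolute terms AND well outside
    # the normal daily noise, so a single jittery run does not flip it.
    if drop >= min_drop and drop >= noise_factor * spread:
        return "COOKED"
    return "FINE"


if __name__ == "__main__":
    steady = [94] * 7 + [93] * 7
    print(verdict(steady, 93))      # small wobble, stays FINE
    print(verdict(steady, 80))      # sharp drop, flips to COOKED
    print(verdict(steady[:5], 80))  # short history, CALIBRATING
```

Requiring both an absolute drop and a drop relative to recent noise is a design choice in this sketch: it keeps a noisy test from raising false alarms while still catching a steady test that suddenly falls.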