IS IT COOKED?
The same tests, every model, every day. When a model silently gets worse — quantized, throttled, or “improved” — the scores drop and the verdict flips. Receipts included.
last run 2026-07-03 · nothing cooked today
| Model | Bench score | Verdict | Structured Extraction | Math & Reasoning | Code Generation | Instruction Following | Report Analysis | Summarization | Customer Service | Web Design | Game Design | Creative Writing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude Fable 5 (Anthropic) | 91.5 | CALIBRATING | 91 | 100 | 60 | 100 | 100 | 94 | 93 | 100 | 100 | 100 |
| Claude Opus 4.8 (Anthropic) | 89.9 | CALIBRATING | 94 | 100 | 100 | 100 | 100 | 93 | 97 | 99 | 100 | 98 |
| Gemini 3.1 Pro (Google) | 86.5 | CALIBRATING | 100 | 100 | 100 | 80 | 100 | 92 | 96 | 99 | 98 | 57 |
| GPT-5.5 (OpenAI) | 85.6 | CALIBRATING | 97 | 100 | 100 | 100 | 100 | 92 | 94 | 100 | 99 | 99 |
| Claude Sonnet 5 (Anthropic) | 82.1 | CALIBRATING | 94 | 100 | 100 | 100 | 100 | 93 | 93 | 95 | 100 | 92 |
| Grok 4.3 (xAI) | 79.4 | CALIBRATING | 100 | 100 | 100 | 100 | 100 | 91 | 85 | 98 | 95 | 78 |
| GLM 5.2 (Z.AI) | 75.4 | CALIBRATING | 96 | 100 | 100 | 80 | 100 | 82 | 96 | 96 | 66 | 44 |
| DeepSeek V4 Flash (DeepSeek) | 69.4 | no data | — | — | — | — | — | — | — | — | — | — |
| Qwen3.7 Max (Alibaba) | 65.3 | no data | — | — | — | — | — | — | — | — | — | — |
| Claude Haiku 4.5 (Anthropic) | 65.2 | CALIBRATING | 100 | 100 | 100 | 70 | 100 | 89 | 93 | 100 | 100 | 90 |
| Kimi K2.6 (Moonshot) | 63.7 | CALIBRATING | 60 | 100 | 80 | 86 | 80 | 76 | 90 | 87 | 33 | 54 |
| GPT-5.4 Mini (OpenAI) | 63.6 | CALIBRATING | 96 | 100 | 100 | 100 | 100 | 91 | 91 | 100 | 99 | 97 |
| DeepSeek V4 Pro (DeepSeek) | 62.9 | no data | — | — | — | — | — | — | — | — | — | — |
| MiniMax M3 (MiniMax) | 60.8 | CALIBRATING | 100 | 100 | 100 | 80 | 100 | 84 | 97 | 98 | 55 | 61 |
| Gemini 3.5 Flash (Google) | 58.5 | CALIBRATING | 94 | 100 | 100 | 95 | 100 | 74 | 99 | 96 | 87 | 54 |
| MiMo V2.5 Pro (Xiaomi) | 55.6 | CALIBRATING | 89 | 100 | 80 | 70 | 100 | 90 | 97 | 100 | 92 | 88 |
| Nex N2 Pro (Nex AGI) | 53.4 | CALIBRATING | 91 | 100 | 100 | 100 | 100 | 90 | 90 | 97 | 44 | 87 |
| Llama 4 Maverick (Meta) | 51.7 | no data | — | — | — | — | — | — | — | — | — | — |
Bench score aggregates public benchmarks (LMArena, MMLU-Pro, GPQA, SWE-bench, AIME). Sparklines are 14 days of our own daily tests.
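
For readers wondering how a verdict could flip when a score drops, here is a rough sketch of one possible rule. It is not the site's actual methodology: the 14-day window comes from the note above, but the function name `verdict`, the thresholds, the minimum-history rule, and the `COOKED` and `OK` labels are all assumptions made up for illustration.

```python
from statistics import mean, stdev

WINDOW_DAYS = 14   # from the page: sparklines cover 14 days of daily tests
MIN_HISTORY = 7    # assumption: fewer prior days than this means "CALIBRATING"
DROP_SIGMAS = 3.0  # assumption: flag drops larger than 3 standard deviations
MIN_DROP = 5.0     # assumption: never flag drops smaller than 5 points


def verdict(history, today):
    """Return a hypothetical verdict for one model.

    history: prior daily scores (0-100), oldest first.
    today:   today's score, or None if the run produced nothing.
    """
    if today is None:
        return "no data"

    window = history[-WINDOW_DAYS:]
    if len(window) < MIN_HISTORY:
        # Not enough history to know what "normal" looks like yet.
        return "CALIBRATING"

    baseline = mean(window)
    # Tolerate the model's usual day-to-day noise, with a fixed floor.
    threshold = max(MIN_DROP, DROP_SIGMAS * stdev(window))

    if baseline - today > threshold:
        return "COOKED"  # hypothetical label for a significant drop
    return "OK"          # hypothetical label for "nothing cooked"
```

Under these assumptions, a model with only a few days of history stays in `CALIBRATING`, which is one plausible reading of why every scored row above still shows that verdict.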