AI benchmarks
Every frontier AI model on every major benchmark
Frontier AI benchmark scores as of May 27, 2026: on ARC-AGI-2, Gemini 3.1 Pro leads at 77.1%; on GPQA Diamond, Qwen3.7-Max leads at 92.4%; on SWE-Bench Pro, GPT-5.5 leads at 64.2%. Every score below cites the lab’s announcement post or an independent re-runner.
Last verified: May 27, 2026.
How to read this page
● 87.2 — lab-claimed score. Sourced from the lab’s own announcement post or model card. Click the number for the citation. Closed-weights labs report what they choose to report; treat as the upper bound.
□ 86.4 — independent re-run. Sourced from Epoch AI, Artificial Analysis, Aider, or the benchmark’s own public leaderboard. Click the number for the evaluator’s page.
⚠ diverge — lab and independent scores differ by more than 5 percentage points. Often signals a methodology gap (extended thinking enabled vs. not, tools on vs. off, different subset, leaked-test contamination).
— — the lab didn’t publish this score and no independent re-run has landed yet. Honest gap, not zero.
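The legend above can be sketched as a small cell-rendering routine. This is a minimal illustration, not the page's actual implementation: the function names `diverges` and `render_cell` are hypothetical, and the only value taken from the legend is the 5-percentage-point divergence threshold.

```python
from typing import Optional

# Threshold from the legend: a gap of MORE than 5 percentage points is flagged.
DIVERGENCE_THRESHOLD = 5.0


def diverges(lab: Optional[float], independent: Optional[float]) -> bool:
    """Return True when both scores exist and differ by more than the threshold."""
    if lab is None or independent is None:
        return False
    return abs(lab - independent) > DIVERGENCE_THRESHOLD


def render_cell(lab: Optional[float], independent: Optional[float]) -> str:
    """Format one matrix cell using the legend's symbols (hypothetical helper)."""
    parts = []
    if lab is not None:
        parts.append(f"● {lab:.1f}")          # lab-claimed score
    if independent is not None:
        parts.append(f"□ {independent:.1f}")  # independent re-run
    if not parts:
        return "\u2014"                        # the legend's dash: no score, not zero
    if diverges(lab, independent):
        parts.append("⚠ diverge")
    return "  ".join(parts)
```

Under these assumptions, `render_cell(87.2, 86.4)` yields `● 87.2  □ 86.4`, while a missing pair of scores renders as the dash rather than as 0, which keeps an unpublished score from reading as a failing one.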
Headline matrix
Each row is a current frontier flagship from one lab; each column is a major benchmark, ordered from most-discriminating to most-saturated. Click any score for its primary-source citation; click any column header to jump to that benchmark’s section below.
By benchmark
Ordered by how much each benchmark currently discriminates between frontier labs. Discriminating benchmarks separate models by capability; approaching-saturation benchmarks separate by points within a tight band.
ARC-AGI-2
Reasoning · Discriminating
Successor to the original ARC-AGI prize. Visual-pattern abstraction puzzles designed to resist memorization. The most discriminating frontier benchmark in 2026: frontier models span the 30–80% band, with most below human levels.
Saturation: Frontier scores still span a wide band — this benchmark separates the labs.
Humanity's Last Exam
Knowledge · Discriminating
Crowdsourced 3,000-question expert-level exam across 100+ subjects, designed to be the last academic benchmark needed before frontier models match expert humans. Reported with and without tools.
Saturation: Frontier scores still span a wide band — this benchmark separates the labs.
SWE-Bench Pro
Coding · Discriminating
Contamination-resistant, four-language successor to SWE-bench Verified. Real GitHub issues from production codebases that the model must patch end-to-end. The headline software-engineering benchmark on every frontier release in 2026.
Saturation: Frontier scores still span a wide band — this benchmark separates the labs.
AIME 2025
Math · Approaching saturation
American Invitational Mathematics Examination, 2025 edition. 15 integer-answer problems; widely used as the frontier math benchmark because the problems are public after release but the answer space rewards reasoning over recall.
Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.
GPQA Diamond
Knowledge · Approaching saturation
Graduate-level physics, chemistry, and biology multiple-choice questions written by domain experts and validated to be “google-proof”. The Diamond subset (198 questions) is the hardest tier.
Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.
MMMU
Multimodal · Approaching saturation
Massive Multi-discipline Multimodal Understanding & Reasoning: 11.5K college-exam-level questions across 30 subjects mixing text with diagrams, charts, and images. The canonical multimodal benchmark.
Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.
SWE-Bench Verified
Coding · Approaching saturation
The 500-issue human-verified subset of SWE-bench. The canonical ‘can the model do real software work’ benchmark from 2024–2025; partially saturated in 2026 but still cited because most frontier models report it.
Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.
About this page
Cross-family comparison page in the /ai/ section. The roster is the current frontier flagship from every major lab on this site — Claude, GPT, Gemini, Grok, Llama / Muse, DeepSeek, Mistral, Qwen — matched to /ai/models/ so each row links back to the per-family version page for the full lineage.
Lab-claimed vs. independent. Each cell can carry two values. The lab-claimed score (filled circle) is what the lab published in its announcement post, system card, or model card — the lab chooses the configuration (extended thinking, tool use, eval subset). The independent re-run (open square) is what Epoch AI, Artificial Analysis, Aider, or the benchmark’s own public leaderboard reports under their own protocol. When the two diverge by more than 5 percentage points, the page flags it — the gap is the editorial signal that matters here. Where no independent re-run exists yet, the cell shows the lab number alone; closed-weights labs are harder to re-run, so independent coverage is concentrated on the open-weights side.
Benchmark selection. Seven benchmarks covering reasoning, knowledge, coding, math, and multimodal capability. Picked for citation volume (every frontier launch reports these), discrimination (the score band is wide enough to separate labs), and primary-source availability (the benchmark author publishes a leaderboard or the eval protocol is public). Vendor-only proprietary benchmarks that no other lab reports are excluded. LMArena’s Elo is widely cited but is a different measurement type (human preference voting, not standardized eval); the page omits it but the LMArena leaderboard covers that signal.
Saturation framing. A benchmark is treated as discriminating when frontier scores span at least 20 percentage points, approaching saturation when the band tightens below that, and saturated when all frontier models cluster within a few points of the ceiling. Saturated benchmarks (HumanEval, MMLU, HellaSwag, GSM8K) are intentionally omitted from the v1 matrix; they no longer separate labs by capability. The saturation labels are re-evaluated on every refresh.
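The saturation rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the page's actual code: the 20-point span comes from the text, but "a few points of the ceiling" is not quantified there, so the 5-point `near_ceiling` default is an assumption, as are the function names and the choice of score spread as the column-ordering key.

```python
from typing import Dict, List, Sequence

DISCRIMINATING_SPAN = 20.0  # from the text: span of at least 20 percentage points


def spread(scores: Sequence[float]) -> float:
    """Width of the band covered by the frontier scores."""
    return max(scores) - min(scores)


def saturation_label(
    scores: Sequence[float],
    ceiling: float = 100.0,
    near_ceiling: float = 5.0,  # ASSUMPTION: stands in for "a few points"
) -> str:
    """Classify a benchmark from the frontier scores reported on it."""
    if not scores:
        raise ValueError("at least one score is required")
    if spread(scores) >= DISCRIMINATING_SPAN:
        return "Discriminating"
    if min(scores) >= ceiling - near_ceiling:
        return "Saturated"
    return "Approaching saturation"


def order_columns(matrix: Dict[str, List[float]]) -> List[str]:
    """Order benchmarks from widest to narrowest band (assumed ordering key)."""
    return sorted(matrix, key=lambda name: spread(matrix[name]), reverse=True)
```

With hypothetical inputs, `saturation_label([30.0, 77.1])` returns `"Discriminating"` and `saturation_label([96.0, 98.5])` returns `"Saturated"`. Because the label depends only on the current score band, re-running it on every refresh is enough to re-evaluate a benchmark's status.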
Sources. Primary lab announcements: Anthropic at anthropic.com/news, OpenAI at openai.com/index, Google at blog.google/technology/google-deepmind, xAI at x.ai/news, Meta at ai.meta.com/blog, DeepSeek at api-docs.deepseek.com/news, Mistral at mistral.ai/news, Alibaba at qwenlm.github.io/blog. Independent re-runners: Epoch AI, Artificial Analysis, Aider polyglot, ARC Prize Foundation.
Refreshed on every major model launch and at least monthly between launches. The page’s job is to stay current within a release cycle; the worst failure mode is showing a stale lab number after that lab has shipped a newer flagship.
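The refresh policy above lends itself to a simple staleness check. The sketch below is hypothetical: `needs_refresh` is an illustrative name, and treating "monthly" as 31 days is an assumption rather than a stated rule.

```python
from datetime import date, timedelta
from typing import Optional


def needs_refresh(
    last_verified: date,
    today: date,
    latest_launch: Optional[date] = None,
    max_age: timedelta = timedelta(days=31),  # ASSUMPTION: "monthly" as 31 days
) -> bool:
    """Return True when the page should be re-verified.

    A refresh is due if a major model launched after the last verification,
    or if more than max_age has passed since the last verification.
    """
    if latest_launch is not None and latest_launch > last_verified:
        return True
    return today - last_verified > max_age
```

For example, with `last_verified=date(2026, 5, 27)`, a launch on `date(2026, 6, 1)` makes the page due for a refresh immediately, even though the monthly window has not elapsed.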
Last verified: May 27, 2026. 8 frontier models · 7 benchmarks · 8 labs.