AI benchmarks

Every frontier AI model on every major benchmark

Frontier AI benchmark scores as of May 27, 2026: on ARC-AGI-2, Gemini 3.1 Pro leads at 77.1%; on GPQA, Qwen3.7-Max leads at 92.4%; on SWE Pro, GPT-5.5 leads at 64.2%. Every score below cites the lab’s announcement post or an independent re-runner.

Last verified: May 27, 2026.

ARC-AGI-2 leader: 77.1%
GPQA leader: 92.4%
SWE Pro leader: 64.2%
Coverage: 8 × 7 (frontier models × benchmarks)
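
A leader figure of this kind can be read as the maximum over reported scores, with unreported cells skipped rather than counted as zero. The following is a minimal sketch under that reading, not the page's actual implementation: the `leader` helper and the dictionary layout are hypothetical, and the only number used is the ARC-AGI-2 figure quoted on this page.

```python
from typing import Dict, Optional, Tuple


def leader(scores: Dict[str, Optional[float]]) -> Optional[Tuple[str, float]]:
    """Return (model, score) for the highest reported score, or None.

    Hypothetical helper: None marks a score that was not reported,
    which is skipped rather than treated as 0.
    """
    reported = {m: s for m, s in scores.items() if s is not None}
    if not reported:
        return None
    best = max(reported, key=reported.get)
    return best, reported[best]


# ARC-AGI-2 column, using only values stated on this page.
arc_agi_2 = {
    "Gemini 3.1 Pro": 77.1,
    "Grok 4.3": None,      # not reported
    "Medium 3.5": None,    # not reported
    "Muse Spark": None,    # not reported
    "Qwen3.7-Max": None,   # not reported
}

print(leader(arc_agi_2))
```

Keeping unreported scores as `None` instead of `0.0` means a missing number can never drag a model to the bottom of a ranking it did not enter.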

How to read this page

● 87.2 lab-claimed score. Sourced from the lab’s own announcement post or model card. Click the number for the citation. Closed-weights labs report what they choose to report; treat as the upper bound.

□ 86.4 independent re-run. Sourced from Epoch AI, Artificial Analysis, Aider, or the benchmark’s own public leaderboard. Click the number for the evaluator’s page.

⚠ diverge — lab and independent scores differ by more than 5 percentage points. Often signals a methodology gap (extended thinking enabled vs. not, tools on vs. off, different subset, leaked-test contamination).

— the lab didn’t publish this score and no independent re-run has landed yet. Honest gap, not zero.
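
The four markers in this legend can be sketched as one small rendering function. This is an illustration under stated assumptions, not the site's code: `render_cell` is a hypothetical name, scores are percentages as floats, and `None` stands for a value that was not published. The 5-point threshold and the strict "more than" comparison come from the legend text.

```python
from typing import Optional

# From the legend: flag when the two scores differ by MORE than 5 points.
DIVERGENCE_THRESHOLD_PP = 5.0

GAP_MARKER = "\u2014"  # the legend's dash marker for "no score yet"


def render_cell(lab: Optional[float], independent: Optional[float]) -> str:
    """Return legend-style text for one matrix cell (hypothetical helper)."""
    if lab is None and independent is None:
        return GAP_MARKER  # honest gap, not zero
    parts = []
    if lab is not None:
        parts.append(f"● {lab:.1f}")          # lab-claimed score
    if independent is not None:
        parts.append(f"□ {independent:.1f}")  # independent re-run
    if lab is not None and independent is not None:
        if abs(lab - independent) > DIVERGENCE_THRESHOLD_PP:
            parts.append("⚠ diverge")
    return " ".join(parts)


# The legend's own sample values differ by 0.8 points, so no flag is raised.
print(render_cell(87.2, 86.4))
```

A gap of exactly 5 points is not flagged under this reading, because the legend says "more than 5 percentage points".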

Headline matrix

Each row is a current frontier flagship from one lab; each column is a major benchmark, ordered from most-discriminating to most-saturated. Click any score for its primary-source citation; click any column header to jump to that benchmark’s section below.

Claude Opus 4.7 · Anthropic · April 16, 2026
DeepSeek V4-Pro · DeepSeek · April 24, 2026
Gemini 3.1 Pro · Google · February 19, 2026
GPT-5.5 (Closed) · OpenAI · April 23, 2026
Grok 4.3 (Closed) · xAI · April 17, 2026
Medium 3.5 · Mistral AI · April 29, 2026
Muse Spark · Meta · April 8, 2026
Qwen3.7-Max · Alibaba · May 20, 2026

By benchmark

Ordered by how much each benchmark currently discriminates between frontier labs. Discriminating benchmarks separate models by capability; approaching-saturation benchmarks separate by points within a tight band.

ARC-AGI-2

Reasoning · Discriminating

Successor to the original ARC-AGI benchmark. Visual-pattern abstraction puzzles designed to resist memorization. The most-discriminating frontier benchmark in 2026: frontier scores spread across the 30–80% band, with most below human levels.

Author
François Chollet · ARC Prize Foundation
Human baseline
Untrained humans ~60%, experts ~95% on the public set; scores on the harder private set are lower by design

Saturation: Frontier scores still span a wide band — this benchmark separates the labs.

Model
Score
Notes
Headline number from the launch — jumped from 31.1% on Gemini 3 Pro. Currently the highest ARC-AGI-2 score from any frontier lab.
Model
GPT-5.5
OpenAI
Score
Notes
OpenAI did not publish an ARC-AGI-2 score for GPT-5.5; ARC Prize Foundation runs the public eval rather than the lab.
Model
Score
Notes
Anthropic did not report an ARC-AGI-2 score in the Opus 4.7 announcement; the launch emphasized coding and vision rather than abstract-reasoning evals.
Model
Score
Notes
DeepSeek does not report ARC-AGI-2 on their model card; independent ARC Prize Foundation runs on open-weights models are queued.
Model
Score
Not reported
Notes
xAI has not published a Grok 4.3 model card as of mid-May 2026; the announcement emphasized video understanding and document generation rather than abstract-reasoning evals.
Model
Score
Not reported
Notes
Mistral did not report ARC-AGI-2 on the Medium 3.5 launch post.
Model
Score
Not reported
Notes
Meta's Muse Spark launch blog did not include an ARC-AGI-2 score; the model is closed-weights and access is gated, limiting independent re-runs.
Model
Score
Not reported
Notes
Alibaba did not report ARC-AGI-2 on the Qwen3.7-Max launch post.

Humanity's Last Exam

Knowledge · Discriminating

Crowdsourced 3,000-question expert-level exam across 100+ subjects, designed to be the last academic benchmark needed before frontier models match expert humans. Reported with and without tools.

Author
Center for AI Safety · Scale AI
Human baseline
Expert humans score ~88% in their own domains; non-experts score far lower

Saturation: Frontier scores still span a wide band — this benchmark separates the labs.

Model
Score
Notes
Grok 4 was the first xAI model to clear 50% on HLE; Grok 4.3 reports a ~5 pp gain. Reported with tools, on the text-only subset.
Model
GPT-5.5
OpenAI
Score
Notes
Reported with tools enabled (browse + code execution).
Model
Score
Notes
No-tools score — Google reports both. With tools: ~52%.
Model
Score
Notes
Reported with tools enabled.
Model
Score
Notes
Reported with the “Contemplating” reasoning mode. Without it: ~18%.
Model
Score
Notes
Reported with Thinking mode enabled.
Model
Score
Not reported
Notes
Not reported in the Medium 3.5 release post.
Model
Score
Not reported
Notes
Not reported in the Qwen3.7-Max launch post; Alibaba's agentic-benchmark suite was the launch focus.

SWE-Bench Pro

Coding · Discriminating

Contamination-resistant, four-language successor to SWE-bench Verified. Real GitHub issues from production codebases that the model must patch end-to-end. The headline software-engineering benchmark on every frontier release in 2026.

Author
Princeton NLP · SWE-Bench team
Human baseline
The benchmark authors estimate a human software-engineer pass rate of ~75% on a comparable subset

Saturation: Frontier scores still span a wide band — this benchmark separates the labs.

Model
GPT-5.5
OpenAI
Score
Notes
Inherits SOTA from GPT-5.3-Codex; lab framing positions GPT-5.5 as the unified-router model that picks coding-mode for software tasks.
Model
Score
Notes
Alibaba’s headline: 3.7-Max ahead of DeepSeek V4-Pro and Claude Opus 4.6 on agentic-coding evals.
Model
Score
Notes
Headline coding gain over Opus 4.6 (~52%).
Model
Score
Notes
Model
Score
Notes
Model
Score
Notes
Model
Score
Not reported
Notes
No published Grok 4.3 SWE-Bench Pro number.
Model
Score
Not reported
Notes
Meta did not report SWE-Bench Pro for the Muse Spark launch.

AIME 2025

Math · Approaching saturation

American Invitational Mathematics Examination, 2025 edition. 15 integer-answer problems; widely used as the frontier math benchmark because the problems are public after release but the answer space rewards reasoning over recall.

Author
Mathematical Association of America
Human baseline
Strong high-school competitors solve ~50%; AIME qualifiers (top USAMO contenders) ~80%

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

Model
GPT-5.5
OpenAI
Score
Notes
With extended-reasoning (gpt-5.5-pro). Standard gpt-5.5 chat scored ~84%.
Model
Score
Notes
Grok 4 Heavy (16-agent variant). Single-instance Grok 4.3 scores ~88%.
Model
Score
Notes
With Deep Think reasoning mode.
Model
Score
Notes
Native extended-thinking mode enabled.
Model
Score
Notes
DeepSeek-V3.2-Speciale (V3.2's high-compute variant) claimed IMO/IOI gold medals at this scale; V4-Pro inherits the reasoning gains.
Model
Score
Notes
With extended thinking (xhigh effort).
Model
Score
Notes
Model
Score
Notes

GPQA Diamond

Knowledge · Approaching saturation

Graduate-level physics, chemistry, and biology multiple-choice questions written by domain experts and validated to be “Google-proof”. The Diamond subset (~198 questions) is the hardest tier.

Author
Rein et al. · NYU · Cohere
Human baseline
PhD-level experts in matched domains score ~65%; non-experts with web access ~34%

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

Model
Score
Notes
Highest GPQA Diamond score reported by any lab as of 2026-05-27; Alibaba’s framing positioned 3.7-Max against Claude Opus 4.6 on agentic-coding evals.
Model
Score
Notes
Model
GPT-5.5
OpenAI
Score
Notes
Model
Score
Notes
Reported in launch livestream; no formal model card published.
Model
Score
Notes
Model
Score
Notes
Model
Score
Notes
Model
Score
Notes

MMMU

Multimodal · Approaching saturation

Massive Multi-discipline Multimodal Understanding & Reasoning — 11.5K college-exam-level questions across 30 subjects mixing text with diagrams, charts, and images. The canonical multimodal benchmark.

Author
MMMU Benchmark · University of Waterloo + collaborators
Human baseline
College students with web access ~83%; domain experts ~90%

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

Model
Score
Notes
Headline multimodal number; the Astra-trained multimodal pipeline is Gemini's strongest documented axis.
Model
GPT-5.5
OpenAI
Score
Notes
Model
Score
Notes
Headlined vision-pipeline upgrade (3× image resolution).
Model
Score
Notes
Natively multimodal across text, image, audio, and video tokenizers.
Model
Score
Notes
Native video input is the marquee multimodal feature; MMMU is mostly static-image and chart questions.
Model
Score
Notes
Model
Score
Notes
DeepSeek's multimodal pipeline trails the US frontier labs; the V-series is text-and-code-first.
Model
Score
Notes
First Mistral flagship with multimodal vision in the same weights as chat.

SWE-Bench Verified

Coding · Approaching saturation

The 500-issue human-verified subset of SWE-bench. The canonical ‘can the model do real software work’ benchmark from 2024–2025; partially saturated in 2026 but still cited because most frontier models report it.

Human baseline
Subset construction targets human-solvable issues; success rate not directly comparable to model pass@1

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

Model
GPT-5.5
OpenAI
Score
Notes
Model
Score
Notes
Model
Score
Notes
Model
Score
Notes
Model
Score
Notes
Model
Score
Notes
Mistral's launch headline: “first flagship merged model” — chat + reasoning + coding + vision in one set of weights.
Model
Score
Notes
Model
Score
Not reported
Notes
Alibaba shifted to SWE-Bench Pro as the headline coding benchmark for 3.7-Max; the open-weights Qwen3.6 sibling reported 77.2% on SWE-Bench Verified at the same launch window.

About this page

Cross-family comparison page in the /ai/ section. The roster is the current frontier flagship from every major lab on this site — Claude, GPT, Gemini, Grok, Llama / Muse, DeepSeek, Mistral, Qwen — matched to /ai/models/ so each row links back to the per-family version page for the full lineage.

Lab-claimed vs. independent. Each cell can carry two values. The lab-claimed score (filled circle) is what the lab published in its announcement post, system card, or model card — the lab chooses the configuration (extended thinking, tool use, eval subset). The independent re-run (open square) is what Epoch AI, Artificial Analysis, Aider, or the benchmark’s own public leaderboard reports under their own protocol. When the two diverge by more than 5 percentage points, the page flags it — the gap is the editorial signal that matters here. Where no independent re-run exists yet, the cell shows the lab number alone; closed-weights labs are harder to re-run, so independent coverage is concentrated on the open-weights side.

Benchmark selection. Seven benchmarks covering reasoning, knowledge, coding, math, and multimodal capability. Picked for citation volume (every frontier launch reports these), discrimination (the score band is wide enough to separate labs), and primary-source availability (the benchmark author publishes a leaderboard or the eval protocol is public). Vendor-only proprietary benchmarks that no other lab reports are excluded. LMArena’s Elo is widely cited but is a different measurement type (human preference voting, not standardized eval); the page omits it but the LMArena leaderboard covers that signal.

Saturation framing. A benchmark is treated as discriminating when frontier scores span at least 20 percentage points, approaching saturation when the band tightens below that, and saturated when all frontier models cluster within a few points of the ceiling. Saturated benchmarks (HumanEval, MMLU, HellaSwag, GSM8K) are intentionally omitted from the v1 matrix; they no longer separate labs by capability. The saturation labels are re-evaluated on every refresh.
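
The saturation rule above can be sketched as a small classifier. This is a hedged illustration rather than the page's actual logic: `saturation_label` is a hypothetical name, the 20-point boundary comes from the text, and the reading of "a few points of the ceiling" as 3 points below a ceiling of 100 is an assumption made here. The scores in the usage lines are illustrative inputs, not reported results.

```python
from typing import Iterable, Optional


def saturation_label(scores: Iterable[Optional[float]],
                     ceiling: float = 100.0,
                     ceiling_margin_pp: float = 3.0) -> str:
    """Classify a benchmark from its frontier scores (hypothetical helper).

    ceiling and ceiling_margin_pp are assumptions: the page only says
    "within a few points of the ceiling".
    """
    reported = [s for s in scores if s is not None]  # skip unreported cells
    if len(reported) < 2:
        return "insufficient data"
    # Saturated: every reported score sits close to the ceiling.
    if min(reported) >= ceiling - ceiling_margin_pp:
        return "saturated"
    spread = max(reported) - min(reported)
    # From the text: discriminating at a span of at least 20 points.
    return "discriminating" if spread >= 20.0 else "approaching saturation"


print(saturation_label([30.0, 80.0]))   # illustrative wide band
print(saturation_label([85.0, 92.4]))   # illustrative tight band
print(saturation_label([98.0, 99.5]))   # illustrative near-ceiling cluster
```

The ceiling check runs first because a near-ceiling cluster also has a spread below 20 points and would otherwise be labeled as approaching saturation.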

Sources. Primary lab announcements: Anthropic at anthropic.com/news, OpenAI at openai.com/index, Google at blog.google/technology/google-deepmind, xAI at x.ai/news, Meta at ai.meta.com/blog, DeepSeek at api-docs.deepseek.com/news, Mistral at mistral.ai/news, Alibaba at qwenlm.github.io/blog. Independent re-runners: Epoch AI, Artificial Analysis, Aider polyglot, ARC Prize Foundation.

Refreshed on every major model launch and at least monthly between launches. The page’s job is to stay current within a release cycle; the worst failure mode is showing a stale lab number after that lab has shipped a newer flagship.
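
The refresh policy above can be expressed as a staleness check. A minimal sketch, assuming "at least monthly" means no more than 31 days between verifications; `is_stale`, `MAX_AGE`, and the parameter names are hypothetical and not taken from the site.

```python
from datetime import date, timedelta
from typing import Optional

# Assumption: "at least monthly" is read as a 31-day maximum age.
MAX_AGE = timedelta(days=31)


def is_stale(last_verified: date,
             today: date,
             latest_launch: Optional[date] = None) -> bool:
    """Return True if the page needs a refresh (hypothetical helper)."""
    # Rule 1: refresh at least monthly between launches.
    if today - last_verified > MAX_AGE:
        return True
    # Rule 2: refresh whenever a flagship shipped after the last check.
    return latest_launch is not None and latest_launch > last_verified


# Dates from this page: verified May 27, 2026; latest listed launch May 20, 2026.
print(is_stale(date(2026, 5, 27), date(2026, 5, 28), date(2026, 5, 20)))
```

The launch rule is checked independently of the age rule, which matches the stated worst failure mode: a number can be stale one day after verification if a newer flagship has shipped.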

Last verified: May 27, 2026. 8 frontier models · 7 benchmarks · 8 labs.

Last refreshed 2026-05-27 by Iapetus.