Every frontier AI model on every major benchmark

GPQA leader: 93.6%

SWE Pro leader: 80.3%

Coverage: 8 × 7 (frontier models × benchmarks)

How to read this page

● 87.2 — lab-claimed score. Sourced from the lab’s own announcement post or model card. Click the number for the citation. Closed-weights labs report what they choose to report; treat as the upper bound.

□ 86.4 — independent re-run. Sourced from Epoch AI, Artificial Analysis, Aider, or the benchmark’s own public leaderboard. Click the number for the evaluator’s page.

⚠ diverge — lab and independent scores differ by more than 5 percentage points. Often signals a methodology gap (extended thinking enabled vs. not, tools on vs. off, different subset, leaked-test contamination).

— — the lab didn’t publish this score and no independent re-run has landed yet. Honest gap, not zero.
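
The legend's rules can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual pipeline: the `Cell` class and the `diverges` and `render` helpers are hypothetical names, and the only values taken from the page are the 5-percentage-point threshold and the sample scores.

```python
from dataclasses import dataclass
from typing import Optional

DIVERGENCE_THRESHOLD_PP = 5.0  # "more than 5 percentage points", per the legend


@dataclass
class Cell:
    """One model-by-benchmark cell (hypothetical structure).

    None means the score was not published; it is never treated as zero.
    """
    lab: Optional[float] = None          # ● lab-claimed score
    independent: Optional[float] = None  # □ independent re-run


def diverges(cell: Cell) -> bool:
    """True only when both scores exist and differ by more than the threshold."""
    if cell.lab is None or cell.independent is None:
        return False
    return abs(cell.lab - cell.independent) > DIVERGENCE_THRESHOLD_PP


def render(cell: Cell) -> str:
    """Format a cell with the page's glyphs."""
    parts = []
    if cell.lab is not None:
        parts.append(f"●{cell.lab:g}")
    if cell.independent is not None:
        parts.append(f"□{cell.independent:g}")
    if not parts:
        return "—"  # honest gap, not zero
    if diverges(cell):
        parts.append("⚠ diverge")
    return " ".join(parts)


print(render(Cell(lab=59, independent=53.3)))  # ●59 □53.3 ⚠ diverge
print(render(Cell(lab=40.2, independent=41)))  # ●40.2 □41
print(render(Cell()))                          # —
```

Keeping a missing score as `None` rather than `0` means a gap can never trigger a divergence flag or drag down an average.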

Headline matrix

Each row is a current frontier flagship from one lab; each column is a major benchmark, ordered from most-discriminating to most-saturated. Click any score for its primary-source citation; click any column header to jump to that benchmark’s section below.
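
As a rough sketch of how a column leader could be read off the matrix, the snippet below stores two columns as plain dictionaries and picks the highest published score. The `Scores` type, the `leader` helper, and the choice to count either a lab-claimed or an independent score are assumptions made for illustration; the page does not state how its leader tiles are computed.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory form of a matrix column: model -> (lab, independent).
# Values are copied from this page; None marks an unpublished score.
Scores = Dict[str, Tuple[Optional[float], Optional[float]]]

GPQA: Scores = {
    "Claude Fable 5": (92.6, 92.6),
    "DeepSeek-V4-Pro": (90.1, 88.8),
    "Gemini 3.5 Flash": (None, 92.2),
    "GPT-5.5": (93.6, 93.5),
    "Grok 4.3": (None, 90.1),
    "Mistral Medium 3.5": (None, 74.8),
    "Muse Spark": (89.5, 88.4),
    "Qwen3.7-Plus": (None, None),
}

SWE_PRO: Scores = {
    "Claude Fable 5": (80.3, None),
    "DeepSeek-V4-Pro": (54.1, None),
    "Gemini 3.5 Flash": (55.1, None),
    "GPT-5.5": (58.6, None),
    "Grok 4.3": (None, None),
    "Mistral Medium 3.5": (None, None),
    "Muse Spark": (None, 55.0),
    "Qwen3.7-Plus": (None, None),
}


def leader(column: Scores) -> Optional[Tuple[str, float]]:
    """Return (model, best score), skipping models with no published score."""
    best: Optional[Tuple[str, float]] = None
    for model, pair in column.items():
        published = [s for s in pair if s is not None]
        if not published:
            continue  # a gap is skipped, never counted as zero
        top = max(published)
        if best is None or top > best[1]:
            best = (model, top)
    return best


print(leader(GPQA))     # ('GPT-5.5', 93.6)
print(leader(SWE_PRO))  # ('Claude Fable 5', 80.3)
```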

| Model | Lab · Released | Weights | HLE | SWE Pro | AIME | ARC-AGI-2 | GPQA | MMMU | SWE Verified |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude Fable 5 | Anthropic · June 9, 2026 | Closed | ●59 □53.3 ⚠ diverge | ●80.3 | — | — | ●92.6 □92.6 | — | ●95.5 |
| DeepSeek-V4-Pro | DeepSeek · April 24, 2026 | Open | ●23.4 □35.9 ⚠ diverge | ●54.1 | ●91.8 | — | ●90.1 □88.8 | ●76.4 | ●80.6 |
| Gemini 3.5 Flash | Google · May 19, 2026 | Closed | ●40.2 □41 | ●55.1 | — | ●72.1 □72.1 | □92.2 | — | — |
| GPT-5.5 | OpenAI · April 23, 2026 | Closed | ●52.2 □44.3 ⚠ diverge | ●58.6 | — | ●85 □85 | ●93.6 □93.5 | — | — |
| Grok 4.3 | xAI · April 17, 2026 | Closed | □35 | — | — | — | □90.1 | — | — |
| Mistral Medium 3.5 | Mistral AI · April 28, 2026 | Open | □12.8 | — | — | — | □74.8 | — | ●77.6 |
| Muse Spark | Meta · April 8, 2026 | Closed | ●58 □39.9 ⚠ diverge | □55 | ●87.3 | — | ●89.5 □88.4 | ●81.4 | ●77.4 |
| Qwen3.7-Plus | Alibaba · May 31, 2026 | Closed | □33.4 | — | — | — | — | — | — |

By benchmark

Ordered by how much each benchmark currently discriminates between frontier labs. Discriminating benchmarks separate models by capability; approaching-saturation benchmarks separate by points within a tight band.
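
One way to make the discriminating versus approaching-saturation distinction concrete is to measure the spread of scores per benchmark. The sketch below does that for the lab-claimed scores listed on this page. The 20-point cutoff and the `spread` and `label` helpers are assumptions: the page does not publish its rule, and the cutoff was chosen only because it happens to reproduce the page's labels for these numbers.

```python
from typing import Dict, List

# Lab-claimed (●) scores copied from this page, one list per benchmark.
LAB_SCORES: Dict[str, List[float]] = {
    "HLE": [59, 23.4, 40.2, 52.2, 58],
    "SWE Pro": [80.3, 54.1, 55.1, 58.6],
    "AIME": [91.8, 87.3],
    "ARC-AGI-2": [72.1, 85],
    "GPQA": [92.6, 90.1, 93.6, 89.5],
    "MMMU": [76.4, 81.4],
    "SWE Verified": [95.5, 80.6, 77.6, 77.4],
}

SPREAD_CUTOFF_PP = 20.0  # hypothetical cutoff, not stated by the page


def spread(scores: List[float]) -> float:
    """Gap between the best and worst score, in percentage points."""
    return round(max(scores) - min(scores), 1)


def label(scores: List[float]) -> str:
    """Classify a benchmark by how widely its scores are spread."""
    if spread(scores) >= SPREAD_CUTOFF_PP:
        return "Discriminating"
    return "Approaching saturation"


for name, scores in LAB_SCORES.items():
    print(f"{name}: spread {spread(scores)}pp, {label(scores)}")
```

A spread-based rule is crude: it ignores how many labs reported and whether the scores come from comparable setups, so it is best read as a sanity check on the editorial labels rather than a replacement for them.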

Humanity's Last Exam

Knowledge · Discriminating

Crowdsourced 3,000-question expert-level exam across 100+ subjects, designed to be the last academic benchmark needed before frontier models match expert humans. Reported with and without tools.

Author

Center for AI Safety · Scale AI

lastexam.ai

Human baseline

Experts score ~88% within their own domains; non-specialists score far lower

Saturation: Frontier scores still span a wide band — this benchmark separates the labs.

| Model | Score | Notes |
| --- | --- | --- |
| Anthropic | ●59 □53.3 ⚠ diverge | Humanity's Last Exam 59.0 (no tools), per Fable 5's launch comparison table (2026-06-09, via Vellum). Anthropic leads the no-tools HLE field at launch; a separate with-tools figure was not broken out in the launch table (Opus 4.8's with-tools 57.9 is the prior Anthropic reference point). |
| Meta | ●58 □39.9 ⚠ diverge | Per launch blog text: "Contemplating mode provides significant capability improvements in challenging tasks, achieving 58% in Humanity's Last Exam." The launch chart shows a lower score for standard (non-Contemplating) mode; the 58% figure is the headline number. |
| OpenAI | ●52.2 □44.3 ⚠ diverge | With tools enabled (52.2); 41.4 without tools, per the GPT-5.5 announcement. |
| Google | ●40.2 □41 | Humanity's Last Exam, full set (text + multimodal). The model card does not split tools vs. no-tools for HLE on 3.5 Flash. |
| xAI | □35 | xAI did not publish a formal Grok 4.3 announcement post or model card. Prior-cycle value (55.3) cited a now-404 URL (x.ai/news/grok-4-3) and is removed pending a primary source. |
| Alibaba | □33.4 | Not reported in the Qwen3.7-Plus launch post; the launch focused on multimodal vision and agentic capabilities, not academic frontier reasoning. |
| DeepSeek | ●23.4 □35.9 ⚠ diverge | Reported with Thinking mode enabled. |
| Mistral AI | □12.8 | Not reported in the Medium 3.5 release post. |

SWE-Bench Pro

Coding · Discriminating

Contamination-resistant, multi-language successor to SWE-bench Verified. Real GitHub issues from production codebases that the model must patch end-to-end. The headline software-engineering benchmark on every frontier release in 2026. With Claude Fable 5's export-control suspension lifted (redeployed globally 2026-07-01), its lab-claimed 80.3 re-enters the production roster above a 54–59 cluster (a ~26pp band), so the benchmark reads as discriminating again at the production frontier.

Author

Scale AI

labs.scale.com/leaderboard/swe_bench_pro_public

Human baseline

Human SWE pass-rate on a comparable subset estimated ~75% by the benchmark authors

Saturation: Frontier scores still span a wide band — this benchmark separates the labs.

| Model | Score | Notes |
| --- | --- | --- |
| Anthropic | ●80.3 | SWE-Bench Pro pass-rate (Anthropic agentic-coding scaffold), Fable 5's headline number per the launch table (2026-06-09, via Vellum): Fable 5 80.3, Mythos Preview 77.8, Opus 4.8 69.2, GPT-5.5 58.6, Gemini 3.1 Pro 54.2. This is the top score of any model tested. Vendor-scaffold number; not directly comparable to Scale's standardized SEAL leaderboard. |
| OpenAI | ●58.6 | SWE-Bench Pro (Public) per the GPT-5.5 announcement; OpenAI flags evidence of memorization on this public eval. |
| Google | ●55.1 | SWE-Bench Pro (Public), single attempt, per the Gemini 3.5 Flash model card. |
| Meta | □55 | Meta did not report SWE-Bench Pro for the Muse Spark launch. |
| DeepSeek | ●54.1 | |
| xAI | Not reported | No published Grok 4.3 SWE-Bench Pro number. |
| Mistral AI | Not reported | Mistral did not publish SWE-Bench Pro in the Medium 3.5 launch post, only SWE-Bench Verified (77.6, text-confirmed). Prior-cycle 50.8 cited the launch post URL but is not text-verifiable on the official post; removed pending primary-source confirmation. |
| Alibaba | Not reported | Qwen3.7-Plus launch did not publish SWE-Bench Pro. The text-only sibling Qwen3.7-Max (now retired as Qwen flagship per /ai/models/) had reported 60.6 on SWE-Bench Pro on 2026-05-20. |

AIME 2025

Math · Approaching saturation

American Invitational Mathematics Examination, 2025 edition. 15 integer-answer problems; widely used as the frontier math benchmark because the problems are public after release but the answer space rewards reasoning over recall.

Author

Mathematical Association of America

maa.org/math-competitions/aime

Human baseline

Strong high-school competitors solve ~50%; AIME qualifiers (top USAMO contenders) ~80%

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

| Model | Score | Notes |
| --- | --- | --- |
| DeepSeek | ●91.8 | DeepSeek-V3.2-Speciale (V3.2's high-compute variant) claimed IMO/IOI gold medals at this scale; V4-Pro inherits the reasoning gains. |
| Meta | ●87.3 | |
| Anthropic | Not reported | Anthropic's Fable 5 launch table does not report AIME 2025; the math-and-reasoning slot is filled by HLE (no tools) and FrontierCode instead. |
| Google | Not reported | Gemini 3.5 Flash model card does not report AIME 2025; reports Humanity's Last Exam, ARC-AGI-2, and agentic benchmarks instead. |
| OpenAI | Not reported | OpenAI's GPT-5.5 announcement post (openai.com/index/introducing-gpt-5-5) reports FrontierMath instead of AIME 2025. The GPT-5.5 system card (deploymentsafety.openai.com/gpt-5-5) was checked on 2026-06-13 and is a safety document; it carries no AIME capability score. Prior-cycle value (96.4) was not text-verifiable on any OpenAI primary source and is permanently removed; this cell stays null unless OpenAI republishes AIME 2025 for GPT-5.5. |
| xAI | Not reported | xAI did not publish a formal Grok 4.3 announcement post or model card. Prior-cycle value (95.2 Heavy / ~88 single-instance) cited a now-404 URL (x.ai/news/grok-4-3) and is removed pending a primary source. |
| Mistral AI | Not reported | Mistral did not publish AIME 2025 in the Medium 3.5 launch post. The launch post's body text names only SWE-Bench Verified and τ³-Telecom; multiple third-party reviewers explicitly note that Mistral skipped AIME / GPQA / MMLU / HumanEval / MATH for this release. Prior-cycle 88.1 cited the launch post URL but is not text-verifiable; removed pending primary-source confirmation. |
| Alibaba | Not reported | Qwen3.7-Plus launch did not publish AIME 2025. The text-only sibling Qwen3.7-Max (now retired as Qwen flagship) had reported HMMT 2026 Feb (97.1) rather than AIME 2025; neither was carried forward on the multimodal Plus model. |

ARC-AGI-2

Reasoning · Approaching saturation

Successor to the original ARC-AGI prize. Visual-pattern abstraction puzzles designed to resist memorization. Frontier flagships now cluster in the 70–85% band — still informative but tightening as the top of the field converges.

Author

François Chollet · ARC Prize Foundation

arcprize.org/arc-agi/2

Human baseline

Human-untrained ~60%, expert ~95% on the public set; the hard private set is by design lower

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

| Model | Score | Notes |
| --- | --- | --- |
| OpenAI | ●85 □85 | ARC-AGI-2 (Verified) reported in the GPT-5.5 announcement table (Abstract reasoning section) at xhigh reasoning. OpenAI uses 'Verified' to describe their own test protocol; this is not the same as the ARC Prize Foundation's verified public leaderboard. |
| Google | ●72.1 □72.1 | Per the Gemini 3.5 Flash model card (Evaluation > Results, Reasoning section). The launch blog headlined Terminal-Bench 2.1 / GDPval-AA / MCP Atlas instead; ARC-AGI-2 is from the model card. |
| Anthropic | Not reported | Anthropic's Fable 5 launch comparison table (2026-06-09) does not include ARC-AGI-2; the published rows are SWE-Bench Pro / Verified, FrontierCode (incl. Diamond split), Terminal-Bench 2.1, HLE (no tools), GPQA Diamond, GDP.pdf vision, and tau-squared-Bench. Fable 5 is not yet a distinct row on the ARC Prize v2 public leaderboard, so no independent ARC-AGI-2 re-run is recorded. |
| DeepSeek | Not reported | DeepSeek does not report ARC-AGI-2 on their announcement post (text-confirmed: the V4 launch post is general-claims only). Prior-cycle ARC Prize independent value (18.6) is no longer present on the ARC Prize public leaderboard as of 2026-06-03 and has been nulled pending re-submission. |
| xAI | Not reported | xAI has not published a Grok 4.3 model card; the rollout was a silent model-selector update with no formal announcement. ARC-AGI-2 not reported. |
| Mistral AI | Not reported | Mistral did not report ARC-AGI-2 on the Medium 3.5 launch post. |
| Meta | Not reported | Meta's Muse Spark launch blog did not include an ARC-AGI-2 score; the model is closed-weights and access is gated, limiting independent re-runs. |
| Alibaba | Not reported | Qwen3.7-Plus is the multimodal vision+language flagship that replaced Qwen3.7-Max as the current Qwen flagship on 2026-05-31. The launch did not publish ARC-AGI-2; the only published benchmark at launch was Vision Arena rank (#16). Independent ARC-AGI-2 re-runs not yet present on the ARC Prize public leaderboard. |

GPQA Diamond

Knowledge · Approaching saturation

Graduate-level physics, chemistry, and biology multiple-choice questions written by domain experts and validated to be “google-proof”. The Diamond subset (~198 questions) is the hardest tier.

Author

Rein et al. · NYU · Cohere

arxiv.org/abs/2311.12022

Human baseline

PhD-level experts in matched domains score ~65%; non-experts with web access ~34%

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

| Model | Score | Notes |
| --- | --- | --- |
| OpenAI | ●93.6 □93.5 | |
| Anthropic | ●92.6 □92.6 | GPQA Diamond 92.6 per Fable 5's launch materials (2026-06-09, via Vellum's launch-day breakdown). Anthropic characterizes GPQA as state-of-the-art / effectively beaten at the frontier and leads with SWE-Bench Pro and FrontierCode rather than GPQA. |
| Google | □92.2 | Gemini 3.5 Flash model card does not report GPQA Diamond; emphasizes agentic-coding (Terminal-Bench 2.1, MCP Atlas, OSWorld-Verified) and multimodal (CharXiv, MMMU-Pro) instead. |
| DeepSeek | ●90.1 □88.8 | Lab-claimed score per DeepSeek's V4 launch chart (the body text only carries general capability claims, no specific GPQA number) and corroborated by multiple third-party reviewers reporting 90.1 from the official April 24, 2026 announcement. Prior cycle had 86.7, which appears to have been a chart-OCR misread. The DeepSeek V4 tech report is the more rigorous source: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. |
| xAI | □90.1 | The prior-cycle lab value (88.5) cited a now-404 URL (x.ai/news/grok-4-3); xAI never published a formal Grok 4.3 announcement post or model card. Per the data-freshness rule, the lab score is nulled until a primary xAI source is found. |
| Meta | ●89.5 □88.4 | Lab-claimed score per the launch chart (the blog body text does not name a GPQA number; chart-OCR retained from prior cycle). |
| Mistral AI | □74.8 | Mistral did not publish GPQA Diamond in the Medium 3.5 launch post. The launch post's body text only names SWE-Bench Verified (77.6) and τ³-Telecom (91.4) as the headline numbers; prior-cycle 82.4 cited the launch chart but is not text-verifiable on the official post and is contradicted by multiple third-party reviewers explicitly noting that Mistral skipped MMLU / GPQA / AIME / HumanEval / MATH for this release. The lab value was removed pending primary-source confirmation. |
| Alibaba | Not reported | Qwen3.7-Plus launch did not publish GPQA Diamond. The Plus model is the multimodal half of the 3.7 generation, with the launch positioning on vision (Vision Arena #16) rather than reasoning-knowledge benchmarks. Independent Artificial Analysis GPQA Diamond re-run for qwen3.7-plus pending. |

MMMU

Multimodal · Approaching saturation

Massive Multi-discipline Multimodal Understanding & Reasoning — 11.5K college-exam-level questions across 30 subjects mixing text with diagrams, charts, and images. The canonical multimodal benchmark.

Author

MMMU Benchmark · University of Waterloo + collaborators

mmmu-benchmark.github.io

Human baseline

College students with web access ~83%; domain experts ~90%

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

| Model | Score | Notes |
| --- | --- | --- |
| Meta | ●81.4 | Natively multimodal across text, image, audio, and video tokenizers. |
| DeepSeek | ●76.4 | DeepSeek's multimodal pipeline trails the US frontier labs; the V-series is text-and-code-first. |
| Anthropic | Not reported | Anthropic's Fable 5 launch table does not report the plain MMMU benchmark; multimodal capability is described via GDP.pdf vision (Fable 5 leads at 29.8) and screenshot-to-code demos rather than an MMMU number. |
| Google | Not reported | Gemini 3.5 Flash model card reports MMMU-Pro (83.6%, no tools) and CharXiv Reasoning (84.2%, no tools), but not the plain MMMU benchmark tracked here. |
| OpenAI | Not reported | OpenAI's GPT-5.5 announcement post reports MMMU-Pro (81.2 no tools / 83.2 with tools), not the plain MMMU tracked here. The GPT-5.5 system card (deploymentsafety.openai.com/gpt-5-5, checked 2026-06-13) carries no plain-MMMU score. Prior-cycle value (85.6) was not text-verifiable on any OpenAI primary source and is permanently removed; this cell stays null unless OpenAI republishes plain MMMU for GPT-5.5. |
| xAI | Not reported | xAI did not publish a formal Grok 4.3 announcement post or model card. Prior-cycle value (79.8) cited a now-404 URL (x.ai/news/grok-4-3) and is removed pending a primary source. |
| Mistral AI | Not reported | Mistral did not publish MMMU in the Medium 3.5 launch post. Medium 3.5 is described in the launch post as the first Mistral flagship to merge multimodal vision with chat in a single weights set, but the launch post body text does not name an MMMU number. Prior-cycle 75.2 cited the launch post URL but is not text-verifiable; removed pending primary-source confirmation. |
| Alibaba | Not reported | Despite being the multimodal flagship, Qwen3.7-Plus launch did not publish MMMU. The launch reported Vision Arena rank (#16 overall, #5 lab) as the headline multimodal signal; MMMU was not surfaced. Independent Artificial Analysis MMMU-Pro coverage for qwen3.7-plus pending. |

SWE-Bench Verified

Coding · Approaching saturation

The 500-issue human-verified subset of SWE-bench. The canonical ‘can the model do real software work’ benchmark from 2024–2025; partially saturated in 2026 but still cited because most frontier models report it.

Author

OpenAI · Princeton NLP

openai.com/index/introducing-swe-bench-verified

Human baseline

Subset construction targets human-solvable issues; success rate not directly comparable to model pass@1

Saturation: Frontier scores cluster near the top — this benchmark separates labs by points, not by capability.

| Model | Score | Notes |
| --- | --- | --- |
| Anthropic | ●95.5 | SWE-Bench Verified 95.5 per Fable 5's launch materials (2026-06-09). The harder Pro variant (80.3) is the launch headline; Verified is reported as a secondary, near-saturated number (six models from four labs cluster near 80% on the vendor-reported tracker). |
| DeepSeek | ●80.6 | Lab-claimed score per DeepSeek's V4 launch chart (the announcement body text only carries general capability claims, no specific SWE-Bench Verified number) and corroborated by multiple third-party reviewers reporting 80.6 from the official April 24, 2026 announcement, with CAISI's independent reproduction also matching. Prior cycle had 78.9, which appears to have been a chart-OCR misread (the same failure shape as the GPQA Diamond 86.7→90.1 fix on this row). The DeepSeek V4 tech report is the more rigorous source: https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. |
| Mistral AI | ●77.6 | Per the launch post body text: "Mistral Medium 3.5 scores 77.6% on SWE-Bench Verified, ahead of Devstral 2 and models like Qwen3.5 397B A17B." |
| Meta | ●77.4 | |
| Google | Not reported | Gemini 3.5 Flash model card does not report SWE-Bench Verified; it reports SWE-Bench Pro instead. |
| OpenAI | Not reported | OpenAI's GPT-5.5 announcement post pivoted to SWE-Bench Pro (58.6%) and does not surface SWE-Bench Verified. The GPT-5.5 system card (deploymentsafety.openai.com/gpt-5-5, checked 2026-06-13) references SWE-Bench Verified only as a linked definition, with no GPT-5.5 score. Prior-cycle 87.3 was not text-verifiable on any OpenAI primary source and is permanently removed; this cell stays null unless OpenAI republishes SWE-Bench Verified for GPT-5.5. The Gemini 3.5 Flash model card cross-references GPT-5.5 SWE-Bench Pro 58.6, the cell populated above. |
| xAI | Not reported | xAI did not publish a formal Grok 4.3 announcement post or model card. Prior-cycle value (82.6) cited a now-404 URL (x.ai/news/grok-4-3) and is removed pending a primary source. |

Model