Head-to-Head · Frontier Models

GPT-5.5 vs Claude Opus 4.7: Our Verdict

OpenAI took back the headline benchmarks in April. Anthropic answered the same week with a focused upgrade for agents. We tested both to decide which frontier model most teams should actually build on.

By Theodore Pruitt, Senior Reviewer, Assistants & Code June 12, 2026 6 rounds judged

GPT-5.5

OpenAI

3 rounds won

Claude Opus 4.7

Anthropic

3 rounds won

The Verdict ✓ Winner: GPT-5.5 GPT-5.5

GPT-5.5 wins on raw capability and on long-context work, and takes our recommendation as the default frontier model for teams whose work spans research, coding, and computer use in equal measure. Claude Opus 4.7 remains the right pick for production agents and multi-tool orchestration where reliability across a long, asynchronous loop matters more than the top-of-leaderboard score.

These two models shipped eight days apart in April 2026, and they now occupy the top of the frontier tier. GPT-5.5 is OpenAI's smartest publicly available model, pitched at agentic coding, computer use, knowledge work, and early scientific research. Claude Opus 4.7 is Anthropic's focused successor to Opus 4.6, built around long-running asynchronous agents, multi-step tool calling, and high-resolution vision.

Both are expensive. Both ship with a one-million-token context window. Both are sold as the model you reach for when the task actually matters. We tested them on the same work and judged them round by round. Each round names a winner and states the procedure we used to decide it.

The Rounds

Coding

Round toGPT-5.5

On the headline coding evaluations, GPT-5.5 takes the leaderboard. Claude Opus 4.7 lifts SWE-bench Verified from 80.8% to 87.6% and SWE-bench Pro from 53.4% to 64.3%, real and well-targeted gains. But GPT-5.5 reaches 82.7% on Terminal-Bench 2.0, ahead of Claude Opus 4.7's 69.4% by more than thirteen points, and that gap held up on our own command-line tasks. For the day-to-day coding agent loop (plan, run a command, read the output, decide what to do next), GPT-5.5 is the more reliable executor today.

How we tested itWe ran each model against the same fixed set of coding tasks, then compared vendor-reported scores on the public benchmarks both companies cite (SWE-bench Verified, SWE-bench Pro, Terminal-Bench 2.0) against our own smaller harness of real PRs.

Agentic Tool Use

Round toClaude Opus 4.7

Opus 4.7 posts the strongest result we measured for chained tool calls. It reached 77.3% on MCP-Atlas, the closest public benchmark to a real production agent, and on our orchestration tasks it routed across tools with fewer dropped steps than GPT-5.5. Anthropic's design choices (adaptive thinking, the new memory tool, automatic cleanup of older tool results) show up as steadier behavior over long loops, even when the per-task quality is a half-step behind.

How we tested itWe gave each model the same five multi-step workflows that require chained tool calls (a financial model, a research-and-summarize task, a data analysis with a code interpreter, a customer-service routing simulation, and a mixed-tool orchestration), measured how many steps each completed unaided, and cross-checked against MCP-Atlas and Tau2-bench results.

Long-Context Reading

Round toGPT-5.5

GPT-5.5 is the first model in either family where the full one-million-token context window is genuinely usable end to end. On MRCR v2 at 512K–1M tokens it scores 74.0%, a 37-point jump over GPT-5.4, and at 128K–256K it reaches 87.5% against Claude's 59.2%. On our own large-corpus tests Claude Opus 4.7 was competent up to about 256K tokens; past that it began missing facts that GPT-5.5 retrieved cleanly.

How we tested itWe loaded each model with the same 300K- and 900K-token corpora (a full codebase, a deposition transcript, and a year of board minutes), asked twenty fact-retrieval and synthesis questions per corpus, and compared against vendor MRCR v2 results at 128K–1M.

Computer Use & Vision

Round toGPT-5.5

This round is closer than the table suggests. GPT-5.5 reaches 78.7% on OSWorld-Verified against Claude Opus 4.7's 78.0%, and on our flows the two were within noise on routine clicking and typing. Claude wins the vision sub-test on dense imagery: Opus 4.7 raised image resolution to roughly 3.75 megapixels, and one early-access partner measured visual acuity jumping from 54.5% on Opus 4.6 to 98.5% on 4.7. But OpenAI's model edges the round on overall task completion.

How we tested itWe ran each model on the same set of browser and desktop tasks (filling forms, navigating multi-step web flows, reading dense screenshots and technical diagrams) and compared OSWorld-Verified scores alongside our own pass-rate measurements.

Reliability on Long Agentic Runs

Round toClaude Opus 4.7

Opus 4.7 is built for this category, and it shows. It held coherence over the multi-hour refactor and the patch loop more reliably than GPT-5.5, drifting less and producing fewer wrapper-function fillers; one partner's TBench harness reported it landing fixes, including a race condition, that previous Claude models had missed. GPT-5.5 is competent over long horizons but is more likely to over-write or restate intermediate plans when a run stretches past a few hours.

How we tested itWe assigned each model the same three asynchronous, multi-hour jobs (a multi-file refactor on a 50K-line repo, an end-to-end research write-up with citations, and a vulnerability patch loop) and recorded how often each went off course before completing or asking for help.

Pricing & Practical Cost

Round toClaude Opus 4.7

On the rate card, Claude Opus 4.7 is the cheaper model: $5 per million input tokens and $25 per million output against GPT-5.5's $5 input and $30 output, with both offering a one-million-token context window. The caveat is real. Opus 4.7's new tokenizer can produce up to 35% more tokens for the same input text compared to Opus 4.6, so effective bills can rise even though the rate card did not. Even after that adjustment, Opus 4.7 lands at or below GPT-5.5 on most workloads, and prompt caching at up to 90% off cached input is the most reliable way to claw the difference back.

How we tested itWe priced a month of representative production traffic on each model's standard API rates, then re-priced the same traffic accounting for the Opus 4.7 tokenizer change Anthropic itself documents.

Where the verdict turned

Two rounds decided this, and they pulled in opposite directions. GPT-5.5’s 82.7% on Terminal-Bench 2.0 against Claude Opus 4.7’s 69.4% is a 13-point lead, and Terminal-Bench measures real command-line workflows: planning, iteration, and tool coordination in a sandboxed terminal. That is the closest public proxy for the work a coding agent actually does in a developer’s terminal, and it is where GPT-5.5 separates itself.

The countervailing result is long-running agentic reliability. Opus 4.7 is built for long-running, asynchronous agents, extending Opus 4.6’s coding and agentic strengths, and it is especially effective for pipelines where tasks unfold over time: large codebases, multi-stage debugging, and end-to-end project orchestration. When the loop is hours long, the model that drifts least wins. That is Opus 4.7.

A team building a sit-in-front-of-the-developer coding assistant should weigh the first finding more heavily. A team building an unattended agent that runs overnight against a production codebase should weigh the second.

What changed in April 2026

Both models were released within eight days of each other. Claude Opus 4.7 shipped on April 16, 2026 from Anthropic. GPT-5.5 followed on April 23, 2026 from OpenAI. The releases were designed against each other in everything but name.

OpenAI’s emphasis is breadth. GPT-5.5 is positioned as OpenAI’s smartest model yet, built for real work across agentic coding, computer use, knowledge work, and early scientific research. On GDPval, which tests an agent’s ability to produce well-specified knowledge work across 44 occupations, it scores 84.9%. On OSWorld-Verified, which measures whether a model can operate real computer environments on its own, it reaches 78.7%. On Tau2-bench Telecom, which tests complex customer-service workflows, it reaches 98.0% without prompt tuning.

Anthropic’s emphasis is depth in a narrower band. Hex reported that Claude Opus 4.7 is the strongest model it has evaluated, correctly reporting when data is missing instead of producing plausible-but-wrong fallbacks, and resisting dissonant-data traps that even Opus 4.6 falls for; low-effort Opus 4.7 is roughly equivalent to medium-effort Opus 4.6. Anthropic shipped a focused upgrade, not a broad sweep. The strongest results land in the places that break production agents, and SWE-bench Pro jumping from 53.4% to 64.3% means Opus 4.7 can handle the harder multi-language engineering tasks that Opus 4.6 regularly stumbled on.

The tokenizer footnote you cannot ignore

Anyone budgeting against Opus 4.7 needs to know one thing that is not on the rate card. The Claude Opus 4.7 tokenizer produces up to 35% more tokens from the same input text compared to Opus 4.6. Anthropic confirms this in its migration documentation, where the multiplier ranges from 1.0x to 1.35x depending on content type, so a request that cost $0.10 on Opus 4.6 could cost $0.10 to $0.135 on Opus 4.7 for the exact same prompt doing the exact same work. The list price did not move. The invoice can. The public estimate is a 1.0x to 1.35x multiplier, with the upper end showing up most often on code, structured data, and non-English text.

There is also a quieter migration cost. For Claude Opus 4.7, setting temperature, top_p, or top_k to any non-default value in the Messages API returns a 400 error. If you have production code that depends on those controls, this is not a minor footnote, it is a migration task. Anthropic also removed extended thinking budgets for Opus 4.7; adaptive thinking is now the supported path, disabled by default unless you opt in explicitly. Teams on a tight rollout schedule should test before they migrate.

GPT-5.5’s economics are simpler today: list price at $5.00 per million input tokens and $30.00 per million output tokens via OpenAI, with the same one-million-token context window. Even so, on most production workloads we priced, Opus 4.7’s lower output rate still leaves it at or below GPT-5.5 once realistic tokenizer inflation is factored in.

Who should buy which

Choose GPT-5.5 if your team’s work mixes coding, research, and computer use in roughly equal parts, if long-context reading matters, or if you want one model to default to across the company. It’s the stronger generalist by the numbers we ran, the long-context leader by a wide margin, and the better executor inside a command-line agent.

Choose Claude Opus 4.7 if you’re building unattended, long-running agents that orchestrate across multiple tools, if you do heavy work on dense imagery or technical diagrams, or if you need a model that resists confabulation when data is missing. For the computer-use work at the heart of XBOW’s autonomous penetration testing, the new Claude Opus 4.7 is a step change: 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6. That’s the kind of jump that decides a build-versus-buy meeting.

A pragmatic combination is also defensible: route synchronous developer-facing coding work to GPT-5.5 and unattended overnight refactor and patch loops to Opus 4.7. But if forced to one frontier model, our recommendation for working teams today is GPT-5.5. For everyone whose product is the agent itself, Opus 4.7.

Sources