OpenAI shipped GPT-5.5 as a model built for work, not just chat.
That is the useful way to read this release. GPT-5.5 is not interesting because it gives slightly cleaner chatbot answers. It is interesting because OpenAI is pushing it as a workstation model: plan the task, use tools, move across files and software, check the result, and keep going until the job is done.
That changes the question. The question is not simply whether GPT-5.5 is smarter than GPT-5.4. The question is whether it can carry more of the actual work inside Codex, OpenClaw, and other agent harnesses.
My read after testing it: yes, mostly. GPT-5.5 looks like a real jump for agentic work. It is stronger when the task has files, tools, tests, logs, or a clear target state. It is less clearly dominant when the work depends on product taste, frontend judgment, or vague creative direction.
That distinction matters.
What OpenAI actually released
OpenAI released GPT-5.5 on April 23, 2026. The base model rolled out to Plus, Pro, Business, and Enterprise users in ChatGPT and Codex. GPT-5.5 Pro rolled out to Pro, Business, and Enterprise users in ChatGPT.
The launch-day API story changed fast. OpenAI updated the launch post on April 24 to say GPT-5.5 and GPT-5.5 Pro were available in the API. The API changelog now lists GPT-5.5 across Chat Completions, Responses, and Batch. GPT-5.5 Pro is available through Responses for harder problems that benefit from more compute.
That makes the release more than a ChatGPT feature drop. It is now a migration story. If you are wiring GPT-5.5 into a real workflow, verify the exact endpoint, auth mode, context behavior, caching support, and tool path before swapping defaults.
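Here is the kind of pre-flight check I mean, sketched with the official Python SDK against the Responses endpoint. The model id and the tiny prompt are assumptions; confirm the real id on your account before trusting any of it, and repeat the same check against Chat Completions or Batch if that is the path you actually use.

```python
# Minimal pre-flight check before switching a workflow's default model.
# Assumptions: the model id "gpt-5.5" (the live id may differ) and the
# official openai Python SDK with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

MODEL = "gpt-5.5"  # hypothetical id; confirm against the live model list

# 1. Confirm the model is actually visible to this API key.
available = {m.id for m in client.models.list()}
assert MODEL in available, f"{MODEL} not available on this key"

# 2. Make one small Responses call and inspect token usage before
#    wiring the model into anything that spends real money.
resp = client.responses.create(
    model=MODEL,
    input="Reply with the single word: ok",
)
print(resp.output_text)
print(resp.usage.input_tokens, resp.usage.output_tokens)
```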
OpenAI's positioning is direct. GPT-5.5 is strongest in agentic coding, computer use, knowledge work, and early scientific research. The examples are coding, debugging, online research, data analysis, documents, spreadsheets, software operation, and longer tool-using tasks.
That is not a better answer box. That is a better worker inside a harness.
The benchmark story is strong, but not clean
OpenAI's headline numbers are good.
The company reports 82.7 percent on Terminal-Bench 2.0, up from 75.1 percent for GPT-5.4. That benchmark matters here because it tests command-line workflows that require planning, iteration, and tool coordination. In plain terms, it maps pretty closely to what people actually want from Codex-style agents.
OpenAI also reports strong results on OSWorld-Verified, Toolathlon, BrowseComp, FrontierMath, GDPval, and CyberGym. That is a serious launch table.
It is not a clean sweep, though. The Decoder pointed out that Claude Opus 4.7 still leads GPT-5.5 on SWE-Bench Pro. Gemini 3.1 Pro leads the base GPT-5.5 model on BrowseComp. GDPval only moves modestly from GPT-5.4.
That does not make GPT-5.5 weak. It makes the launch specific.
GPT-5.5 looks strongest where the work is operational, tool-heavy, and multi-step. It does not automatically beat every competing model in every category. Honestly, that is more useful than another vague "new best model" headline.
Codex is the real story
The most interesting GPT-5.5 claims are not in the generic ChatGPT framing. They are in Codex.
OpenAI Developers described GPT-5.5 as the company's strongest agentic coding model to date. They said it can carry coding tasks further end to end: understand a codebase, make changes, debug, test, and validate. They also said GPT-5.5 is more token efficient than GPT-5.4 in Codex for most users.
That is the claim I care about. Not whether it can win a one-off prompt. The real test is whether it can stay useful through the full engineering loop.
Simon Willison said he had previewed GPT-5.5 in Codex for weeks and had especially good results using it for security reviews against code written by other models. Dan Shipper and the Every team were more bullish overall, especially on coding, knowledge work, and long sessions.
Their caveats are also worth keeping. Every still found Opus 4.7 stronger for some planning, detail work, frontend, product design, and underspecified vibe-coding tasks.
That matches my own read. GPT-5.5 may be the better default workhorse. That does not make it the best taste model.
My local OpenClaw tests matched the workstation thesis
I also ran GPT-5.5 through a small local gauntlet inside OpenClaw. This was not a public benchmark. It was a practical test set for Codex-style work: broken ops scenarios, frontend implementation, security audit, and system design.
The results were stronger than I expected.
In the NovaPay reconciliation outage, GPT-5.5 found five config and permission faults, fixed them, produced a clean postmortem, and ignored stale noise from old OOM, TLS, and MongoDB errors.
In the DataForge silent pipeline failure, it treated the problem as stale output instead of a crash. That was the right instinct. It found the FIFO log trap, empty worker count, config path mismatch, missing table, and stale cache.
In the frontend build test, it produced a single-file React TypeScript data table with sorting, filtering, pagination, selection, theme toggle, keyboard behavior, ARIA support, responsive layout, and a clean TypeScript compile.
That is good engineering execution. It is not the same as product taste. The full version of this post includes screenshots, but the short version is simple: GPT-5.5 can build a competent interface from a spec. I would still want a human pass or a taste-focused model before calling the design finished.
In the security audit test, it found all 17 planted issues in a vulnerable Express app, with line numbers, severity estimates, exploitability notes, impact, and fixes.
In the system design test, it covered a 50,000 events per second log aggregation system with sizing math, shard counts, retention, alert routing, failure modes, rollout plan, and a cost model that stayed under the given cap.
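For context on what "sizing math" means here, a minimal sketch of the arithmetic involved. Every input below is an assumption I picked for illustration, not a number from the model's actual answer.

```python
# Back-of-the-envelope sizing for a 50,000 events/sec log aggregation system.
# All inputs below are illustrative assumptions.
EVENTS_PER_SEC = 50_000
AVG_EVENT_BYTES = 800        # assumed average log event size
REPLICATION = 3              # assumed replication factor
RETENTION_DAYS = 30          # assumed hot retention window

ingest_bytes_per_day = EVENTS_PER_SEC * AVG_EVENT_BYTES * 86_400
stored_bytes = ingest_bytes_per_day * RETENTION_DAYS * REPLICATION

print(f"ingest: {ingest_bytes_per_day / 1e12:.2f} TB/day")          # ~3.46 TB/day
print(f"hot storage, {RETENTION_DAYS}d x{REPLICATION}: {stored_bytes / 1e12:.0f} TB")

# Shard count from a per-shard write budget (assumed 5k events/sec per shard).
PER_SHARD_EVENTS_PER_SEC = 5_000
shards = -(-EVENTS_PER_SEC // PER_SHARD_EVENTS_PER_SEC)  # ceiling division
print(f"write shards needed: {shards}")
```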
Across that local set, the model scored 120 out of 120 on my rubric. Again, that is my own working test, not a universal benchmark. But it lines up with the workstation framing better than the chatbot framing.
The operational runs are the part I trust most. GPT-5.5 traced the incident shape, separated current faults from stale noise, and validated the full path afterward. That is what I want from a work model.
Why this matters for OpenClaw and third-party harnesses
This release matters more if you run agents outside a model lab's first-party app.
OpenClaw can route GPT-5.5 several ways: direct OpenAI API billing, Codex OAuth through an OpenAI Codex route, or native Codex app-server behavior depending on the runtime setup. Those route labels matter. They affect auth, cost, context, and what the harness can actually do.
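To make those auth and billing differences concrete, here is a purely illustrative sketch of that routing decision. It is not OpenClaw's actual configuration schema, and the Codex OAuth variable name is made up; only the shape of the choice matters.

```python
# Illustrative only: a sketch of the routing decision described above,
# not OpenClaw's real config or code.
import os

def pick_route() -> dict:
    # Direct API billing: pay per token, full parameter and tool control.
    if os.getenv("OPENAI_API_KEY"):
        return {"route": "openai-api", "auth": "api_key", "billing": "per-token"}
    # Codex OAuth: rides a ChatGPT subscription entitlement instead of API spend.
    if os.getenv("CODEX_OAUTH_TOKEN"):  # hypothetical variable name
        return {"route": "codex-oauth", "auth": "oauth", "billing": "subscription"}
    # Native Codex app-server: the harness defers to a local Codex runtime.
    return {"route": "codex-app-server", "auth": "local", "billing": "subscription"}
```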
The bigger point is ecosystem independence.
A lot of third-party agent workflows have been boxed in by first-party gravity. Claude Code is excellent, but the Claude ecosystem can be awkward when you need specific auth paths, Max or Team entitlements, API-key routes, tool support, or non-Anthropic harness behavior.
GPT-5.5 gives OpenClaw and similar systems a serious non-Claude work model with a supported subscription path and API path. That matters if you do not want your agent stack tied to one vendor's first-party harness.
This is where GPT-5.5 feels strategically important. If it is good at long-running coding, computer use, and tool work, third-party harnesses can build credible workflows through OpenAI, Codex OAuth, OpenCode, Kilo Gateway, Vercel AI Gateway, and other provider catalogs. They do not have to wait for one company to bless every workflow.
That portability is not glamorous. It is useful.
The price story is still annoying
The official API pricing confirms the launch-week reports.
GPT-5.5 short-context pricing is $5 per million input tokens and $30 per million output tokens. Long-context GPT-5.5 is $10 per million input tokens and $45 per million output tokens. GPT-5.5 Pro is much more expensive, but matches the GPT-5.4 Pro pricing tier.
That means GPT-5.5 is meaningfully more expensive per token than GPT-5.4.
OpenAI's argument is that GPT-5.5 can use fewer tokens in Codex by finishing work with fewer retries and less wasted looping. That is plausible for hard tasks. If a model saves an hour of back-and-forth or avoids three failed implementation attempts, the completed-task cost can still make sense.
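You can sanity-check that argument with launch pricing and some assumed per-task numbers. The token counts, retry counts, and the cheaper comparison rate below are all illustrative; only the $5 and $30 figures come from the published short-context pricing.

```python
# Completed-task cost sketch. The $5 / $30 per million token rates come from
# the published short-context pricing; every other number is an assumption.
def task_cost(input_tokens: int, output_tokens: int,
              price_in: float, price_out: float, attempts: int = 1) -> float:
    """Dollar cost of a task that takes `attempts` full passes to land."""
    per_pass = input_tokens * price_in + output_tokens * price_out
    return attempts * per_pass

PER_M = 1 / 1_000_000

# Hypothetical refactor: ~120k input tokens and ~15k output tokens per pass.
one_pass = task_cost(120_000, 15_000, 5 * PER_M, 30 * PER_M, attempts=1)

# Same task on an assumed cheaper model ($2.50 / $15) that needs three tries.
three_tries = task_cost(120_000, 15_000, 2.5 * PER_M, 15 * PER_M, attempts=3)

print(f"one pass at launch pricing: ${one_pass:.2f}")    # ~$1.05
print(f"three tries at half price:  ${three_tries:.2f}")  # ~$1.58
```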
But the sticker price matters. Teams that already understand their token usage are going to model this carefully. A smarter model can still lose some workflows if the cost curve feels wrong.
Theo Browne had a fair skeptical read: GPT-5.5 is smart, but weird, hard to wrangle, and expensive.
That tension is real. The model can be better and still not be the right default for every task.
Safety is part of the product now
The system card matters because GPT-5.5 improves cyber and bio-relevant work, not only safe office tasks.
OpenAI evaluated GPT-5.5 under its Preparedness Framework, with targeted cybersecurity and biology red-teaming and feedback from nearly 200 early-access partners. The system card rates biological and chemical capability as High. Cybersecurity is also High, but below Critical. AI self-improvement remains below High.
For defenders, that is a serious line. Stronger cyber capability is useful if you are auditing code, triaging logs, testing controls, or writing detections. It is also easier to misuse.
OpenAI says it is using stricter classifiers for higher-risk cyber activity, monitoring for impermissible use, and Trusted Access for Cyber so verified defenders can use sharper capabilities with fewer pointless refusals.
There is also a caveat worth saying plainly. The system card says UK AISI found a universal jailbreak during testing. OpenAI updated its safeguards afterward, but UK AISI could not fully verify the final fix because of a configuration issue in the retest version.
That does not make the release reckless. It does mean the safety story is still an engineering problem, not a solved checkbox.
The early reaction is split
The positive camp is not just saying GPT-5.5 has higher benchmark numbers. They are describing a model that feels better inside a work harness. The pattern is implementation, refactors, debugging, testing, validation, and longer repository work.
That matches my experience. GPT-5.5 has been strong at orchestration, tool calls, failure recovery, and fixing itself after verification catches something. The improvement is not that it sounds smarter. It keeps the work loop intact longer.
The skeptical camp is also not wrong. Developers are already debating cost, usage limits, rollout friction, model personality, and whether it actually follows through when the task gets messy.
That split is the story. People using GPT-5.5 for real multi-step work seem more impressed than people sampling it like a chatbot. The model looks best when it has files, tools, tests, and a clear target state. It looks less special when the task is vague, taste-heavy, or blocked by quota.
My take
GPT-5.5 looks like OpenAI's clearest answer yet to Claude's work-model advantage.
GPT-5.4 made OpenAI competitive again for a lot of agentic coding work. GPT-5.5 sharpens the pitch: stronger inside Codex, better at carrying context across tools, and more practical for real workflows than a pure reasoning monster that burns time and budget.
But I would not flatten this into "OpenAI wins."
The better read is narrower. GPT-5.5 may become the default workhorse for people who live inside Codex-style systems. Opus may still be better when the work needs product taste, careful planning, frontend judgment, or a more opinionated collaborator. Gemini still has lanes where long-context research and web work remain competitive.
The winner depends on the harness, the task, the budget, and how much human steering you want in the loop.
For builders, the practical advice is simple.
Use GPT-5.5 where persistence matters: refactors, testing loops, security review, operational docs, research synthesis, spreadsheet work, document work, and agentic tasks with a clear target state.
Be more cautious where taste matters: frontend design, product direction, ambiguous prototypes, and writing that needs a sharper voice instead of smooth structure.
And do not treat launch-week model docs as frozen. GPT-5.5 went from "coming very soon" to live API in one day. Verify the route, pricing, and auth mode before wiring production spend.
That last part is boring. It is also how you avoid building your plan on vibes.
Full version with screenshots, source links, and the complete notes:
https://solomonneas.dev/blog/gpt55-openai-workstation-model/