An AI Benchmark That Tests Real Coding Workflows

Originally published at jason.agostoni.net

Developers face a real problem: coding models and agents get picked based on synthetic benchmarks that look great but do not predict actual project work. The problem is no longer whether models can score well on those benchmarks; it's whether those scores still mean anything.

Today's benchmarks test narrow skills well, but they rarely capture the full workflow of professional development.

I wanted something that tests what real development looks like: a complete SDLC cycle on a realistic, representative app, similar to how teams ship weekly. Ship-Bench is that project, open at https://github.com/JAgostoni/ship-bench for anyone who wants to follow along or try it themselves.

Ship-Bench runs agents through five phases that match a professional SDLC: Architect, UX Designer, Planner, Developer, and Reviewer. Each phase scores out of 100 against a specific rubric, with full evidence like specs, backlogs, code, and tests.

A benchmark like this needed more than a to-do app.

I wanted something more substantial than a to-do list, but not so complex that results would become wildly inconsistent from run to run. I settled on a knowledge base app with editing, as it leaves room for product and implementation choices while staying inside a problem space that most developers (and LLMs) already understand.

That balance matters. The app is simple enough to keep the benchmark grounded, but open-ended enough to surface differences in planning, UX judgment, architecture, coding, and review quality.

How Ship-Bench Works

The first step in Ship-Bench is building a Product Brief. That brief is meant to test core product instincts before any code is written: interpreting requirements, resolving ambiguity, prioritizing scope, and making defensible implementation and UX decisions.

To do that, the feature set is intentionally larger than a defined MVP. The brief includes five possible features, but only the first three are required in v1, which keeps the evaluation shorter to run while still forcing the agent to decide what to do now versus later.

The feature statements focus on common product problems rather than highly specific implementation instructions. Browse articles, search content, edit knowledge, organize information. Most developers understand the shape of those problems, but the details are left open enough that the agent still has to define flows, tradeoffs, and structure. Not too dissimilar from reality.

The brief also includes non-functional and technical goals meant to push toward a simple app with some future scaling intent. It asks for something easy to run locally and maintain, but also something that can support around 100 concurrent users, use current libraries and frameworks where practical, and leave room for growth without drifting into unnecessary complexity.

That last part was important to me. I wanted to see whether an agent would research online for the latest frameworks and versions rather than rely only on its internal knowledge.

The full Product Brief is here for anyone who wants to read it directly: https://github.com/JAgostoni/ship-bench/blob/main/docs/product-brief.md.

The Role-Based Phases

Once the Product Brief is in place, the benchmark moves through five specialized roles meant to mirror a real product team. Each role has a specific job, well defined output, and a handoff that feeds the next phase. The point is not only to evaluate each role on its own, but to see how well the work transfers from one stage to the next. The overall goal is to take the ambiguity of the Product Brief and turn it into concrete decisions ready for the developer.
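The five-role sequence above can be pictured as a simple pipeline where each phase's deliverable is handed forward as context for the next. This is only an illustrative sketch; the names, scoring shape, and harness structure here are my assumptions, not Ship-Bench's actual implementation:

```python
from dataclasses import dataclass, field

# Hypothetical model of the Ship-Bench phase pipeline; the names and
# structure are illustrative, not taken from the actual repo.

@dataclass
class PhaseResult:
    role: str
    deliverable: str               # e.g. path to the spec, backlog, or code
    score: int                     # 0-100 against the phase rubric
    notes: list[str] = field(default_factory=list)

PHASES = ["Architect", "UX Designer", "Planner", "Developer", "Reviewer"]

def run_benchmark(run_phase) -> list[PhaseResult]:
    """Run each role in order, feeding every prior deliverable forward."""
    results: list[PhaseResult] = []
    context = {"brief": "docs/product-brief.md"}   # every phase sees the brief
    for role in PHASES:
        result = run_phase(role, context)          # the agent does the work here
        context[role] = result.deliverable         # handoff to the next phase
        results.append(result)
    return results
```

The key property this models is the handoff: by the time the Reviewer runs, its context holds the brief plus every upstream deliverable, which is exactly what lets weak handoffs show up downstream.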

Architect

The Architect’s job is to turn the Product Brief into a concrete technical plan. Its main task is to make the big implementation decisions up front so the developer is not forced to solve architecture questions later in the build. That means choosing the front end and back end stack, data model, search approach, integration pattern, repo structure, local setup, and the testing and scaling considerations needed to support the brief’s goals. The output is a Technical Architecture Spec that makes the system buildable, keeps the implementation simple and maintainable, and leaves as few unresolved decisions as possible for later phases.

The Architect handoff matters because it gives UX and the Planner a stable technical frame to work inside. A clear architecture reduces guesswork in the design spec and keeps the backlog grounded in choices the developer can actually implement. It is evaluated based on completeness, accuracy and recency.

UX Designer

The UX Designer’s job is to turn the Product Brief into a concrete design direction and style guide. Its task is to decide how the app should feel and how the main flows should work, including layout, navigation, component behavior, responsive behavior, visual tone, and interaction states. It also needs to define the states and handoff details that make the design implementable without extra interpretation from the developer. The output is a UX Direction Spec that takes the ambiguity of the brief and turns it into a clear, consistent interface system the developer can build from.

The UX handoff translates architecture into interface decisions the Planner can sequence. Once layout, states, and component behavior are pinned down, the backlog can break the work into cleaner implementation steps. It is evaluated on completeness, quality and adherence.

Planner

The Planner’s job is to turn the approved product and technical decisions into a sequenced implementation backlog. Its main task is not just to list work, but to break the project into right-sized iterations so the developer agent can work through it in manageable chunks without losing context. It needs to define what belongs in MVP, what comes later, what blocks what, and how each iteration can leave the codebase in a working state. The output is an Implementation Backlog with iteration files that make the work executable, sequential, and easy to review.

The Planner is the main bridge between planning and building. A good backlog keeps the developer focused on one coherent slice at a time instead of forcing them to hold the whole project in working memory. It is evaluated on completeness and properly constructed iterations.
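One way to picture a right-sized iteration entry and the "what blocks what" sequencing the Planner is scored on. The field names and dependency check below are hypothetical, for illustration only, not the benchmark's actual iteration file schema:

```python
from dataclasses import dataclass, field

# Illustrative shape for one backlog iteration; these field names are
# assumptions, not Ship-Bench's real backlog format.

@dataclass
class Iteration:
    number: int
    goal: str                                            # one coherent slice
    tasks: list[str]
    depends_on: list[int] = field(default_factory=list)  # blocking iterations
    done_when: str = "app builds, tests pass, feature works end to end"

backlog = [
    Iteration(1, "Project scaffold and local setup",
              ["init repo structure", "wire up dev server", "smoke test"]),
    Iteration(2, "Article browsing (read-only)",
              ["article list view", "article detail view", "seed data"],
              depends_on=[1]),
    Iteration(3, "Full-text search",
              ["search endpoint", "search UI", "empty-state handling"],
              depends_on=[2]),
]

def is_sequential(items: list[Iteration]) -> bool:
    """Every dependency must point at an earlier iteration, so the
    developer agent can always work the backlog top to bottom."""
    return all(d < it.number for it in items for d in it.depends_on)
```

A structure like this is what lets the Developer phase hold one slice in context at a time: each iteration names its goal, its blockers, and a concrete definition of done.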

Developer

The Developer’s job is to turn the backlog into a working MVP without drifting beyond the assigned scope. Its main task is to implement one iteration at a time, keep the codebase in a working state, and avoid introducing new unresolved design or architecture decisions midstream. It also has to follow the given tech choices, cover the testing scope defined in the brief, and handle errors cleanly so the result is stable enough to review. The output is a completed iteration summary that shows what was built, what assumptions were made, and confirms the app still runs locally.

The Developer handoff is the most literal one in the benchmark: the backlog becomes code, tests, and a runnable app. Good upstream decisions should make this phase feel straightforward, while weak handoffs should show up quickly. It is evaluated on working code, adherence to spec, code quality and process completeness.

Reviewer

The Reviewer’s job is to verify the delivered MVP end to end and check whether it actually meets the brief. Its main task is to test the required flows, confirm the app runs locally, review the test suite, check responsiveness and error handling, and compare the implementation against the architecture, UX, and backlog decisions. It also needs to do a light code review for basic quality signals like modularity, current dependencies, and obvious security issues. The output is a QA report with pass or fail results, defect logs, spec drift notes, and a release recommendation that tells the team whether the build is ready or needs more work.

The Reviewer closes the loop by checking whether the earlier handoffs actually held up in a real implementation. It is less about originality and more about verification, which makes it the final test of whether the whole chain from brief to build worked as intended. It is evaluated against review and test completeness and depth.

Evaluation Framework

The evaluation itself is intentionally split between a human judge and an LLM judge. The goal is to combine two perspectives on the same deliverable, especially in the more subjective phases where rubric compliance alone is not enough. Each phase has its own evaluation file in the repo, with detailed scoring criteria and pass/fail gates that keep the scoring consistent.

At a high level, the framework is trying to answer two questions: did the agent do the phase well, and did the output set up the next phase cleanly. The result is less about one leaderboard number and more about whether the whole sequence of work actually resembles a real delivery process.
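As a sketch of how the two judges and the pass/fail gates might combine into one phase score: the 50/50 weighting and gate semantics below are my assumptions for illustration, not what Ship-Bench's evaluation files actually specify:

```python
# Hypothetical blend of a human score and an LLM-judge score for one
# phase, with hard pass/fail gates applied first. The even weighting
# and the gate-to-zero rule are illustrative assumptions.

def phase_score(human: int, llm: int, gates_passed: bool,
                human_weight: float = 0.5) -> int:
    """Return 0 if any pass/fail gate failed, else a weighted blend
    of the human and LLM judge scores (each on a 0-100 scale)."""
    if not gates_passed:
        return 0
    blended = human_weight * human + (1 - human_weight) * llm
    return round(blended)
```

The gate-first design matters: a deliverable that fails a hard requirement (say, the app does not run locally) should not be rescued by a high subjective score from either judge.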

Benchmarking Like Real Work

Ship-Bench is built to feel like an actual project rather than one-off synthetic tasks. The phases move in order, and each handoff has to carry real context forward, which is much closer to how professional roles interact on a team. It can go really wrong or it can go really right.

It also demands working deliverables at every stage, not just polished descriptions. The benchmark expects outputs that can be used by the next phase, whether that is a technical spec, a design direction, a backlog, or a runnable application with tests and supporting notes.

That structure reflects how developers actually work: brief, decide, plan, build, review, ship. Ship-Bench is not a replacement for other benchmarks; it is a way to show what professional workflows look like when the goal is to build something real.

Next Steps

Initial testing and benchmarking are already underway to validate Ship-Bench itself and make it more consistent and reliable.

What models and tools would you want to see?
