Same model. Different results. — AgentKit Benchmark + OpenCode Integration


Ajay_dev · posted Apr 12 · originally published at dev.to

We open-sourced AgentKit two weeks ago with zero guarantees anyone would care.

400+ clones later, we're shipping our biggest update yet, and we have benchmark data to back it up.

Quick note: AgentKit Preview is our closed, in-development intelligence layer. The fully open-source AgentKit is live and ready to use today at github.com/Ajaysable123/AgentKit. Running npx agentkit-ai@latest init gets you started in seconds.

Live Benchmark — Gemma 4 31b · Same Model · Same Task

Both runs used Gemma 4 31b via OpenCode. The only variable was AgentKit Preview's workflow enforcement, skill injection, and plan gates.

Metric	Vanilla OpenCode	+ AgentKit Preview
Structured planning before coding	0%	100%
Plan approved before first edit	N/A	✅ Yes (40.6 s review)
Task interruptions	1	0
Task completion	20% (scaffolding only)	80% (DER parser implemented)
Hard problem solved	❌ No	✅ Yes

Without AgentKit: Gemma 4 31b gave up on the hard part and shipped placeholder strings ([ASN.1 Decoding Required]). It produced no plan, ran no verification, and was interrupted once.

With AgentKit: the same Gemma 4 31b implemented a real custom ASN.1 DER parser, handled both UTCTime and GeneralizedTime, and built the expiration logic. It solved the hard part of the task instead of stubbing it out.

The model didn't get smarter. AgentKit's workflow gates changed its behavior:

Plan gate forced it to think through the DER parsing approach before writing code
Approval step made it commit to solving the hard problem instead of sidestepping it
State machine kept it accountable through RESEARCH → PLAN → EXECUTE → REVIEW
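
The three gates above can be sketched as a small state machine. This is an illustrative Python model under stated assumptions, not AgentKit's actual implementation: the Workflow class, its method names, and GateError are all hypothetical. Only the phase names and their order come from the post.

```python
from enum import Enum


class Phase(Enum):
    RESEARCH = "RESEARCH"
    PLAN = "PLAN"
    EXECUTE = "EXECUTE"
    REVIEW = "REVIEW"


# Each phase may only advance to the next one in the chain.
NEXT_PHASE = {
    Phase.RESEARCH: Phase.PLAN,
    Phase.PLAN: Phase.EXECUTE,
    Phase.EXECUTE: Phase.REVIEW,
}


class GateError(RuntimeError):
    """Raised when an action is attempted before its gate is satisfied."""


class Workflow:
    """Hypothetical model of a gated RESEARCH -> PLAN -> EXECUTE -> REVIEW flow."""

    def __init__(self) -> None:
        self.phase = Phase.RESEARCH
        self.plan_approved = False

    def approve_plan(self) -> None:
        # Approval step: only meaningful once a plan exists.
        if self.phase is not Phase.PLAN:
            raise GateError("a plan can only be approved during PLAN")
        self.plan_approved = True

    def transition(self, target: Phase) -> None:
        # State machine: no skipping phases and no going backwards.
        if NEXT_PHASE.get(self.phase) is not target:
            raise GateError(f"cannot move from {self.phase.value} to {target.value}")
        # Plan gate: EXECUTE stays locked until the plan is approved.
        if target is Phase.EXECUTE and not self.plan_approved:
            raise GateError("plan must be approved before EXECUTE")
        self.phase = target

    def edit_file(self, path: str) -> str:
        # Code edits are only allowed while executing an approved plan.
        if self.phase is not Phase.EXECUTE:
            raise GateError(f"edits are blocked during {self.phase.value}")
        return f"edited {path}"
```

In this model the agent cannot reach edit_file without first passing through PLAN and an explicit approval, which is the behavior change the benchmark attributes to the gates.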

What else just landed

Native OpenCode Integration

AgentKit now ships a native TUI plugin for OpenCode that lives inside the terminal UI — not just in the system prompt.

Select the agentkit agent from the agent switcher and you get:

Pre-loaded skills injected automatically
Workflow gates (RESEARCH → PLAN → EXECUTE → REVIEW → SHIP)
Mandatory approval dialogs before any code edit
Memory context from previous sessions

Works With Any Model

The skill router, workflow engine, and marketplace run entirely via CLI — no Claude API required. Tested on Gemma 4 31b, MiniMax M2.5, and Claude.

# Works with any model in OpenCode
agentkit workflow transition RESEARCH
agentkit workflow approve
agentkit workflow transition EXECUTE

Get started

Open-source AgentKit (free — stable & ready to use):

npx agentkit-ai@latest init

github.com/Ajaysable123/AgentKit

AgentKit Preview (closed beta — in active development)

To everyone who cloned, starred, or tried AgentKit — thank you. This is just getting started.


1 Comment

DuchessCodes · Answer 1 · 2026-04-12T15:30:04+0000

Same model, same task, and a big difference just from enforcing structure. This really highlights that most agent failures are process failures, not model failures.
