Same model. Different results. — AgentKit Benchmark + OpenCode Integration

Originally published at dev.to

We open-sourced AgentKit two weeks ago with zero guarantees anyone would care.

More than 400 clones later, we're shipping our biggest update yet, and we have benchmark data to back it up.

Quick note: AgentKit Preview is our closed, in-development intelligence layer. The fully open-source AgentKit is live and ready to use today at github.com/Ajaysable123/AgentKit. Running npx agentkit-ai@latest init gets you started in seconds.


Live Benchmark — Gemma 4 31b · Same Model · Same Task

Both runs used Gemma 4 31b via OpenCode. The only variable was AgentKit Preview's workflow enforcement, skill injection, and plan gates.

| Benchmark | Vanilla OpenCode | + AgentKit Preview |
|---|---|---|
| Structured planning before coding | 0% | 100% |
| Plan approved before first edit | ❌ No | ✅ Yes (40.6s review) |
| Task interruptions | 1x | 0x |
| Task completion | 20% (scaffolding only) | 80% (DER parser implemented) |
| Hard problem solved | ❌ No | ✅ Yes |

Without AgentKit: Gemma 4 31b gave up on the hard part and shipped placeholder strings ([ASN.1 Decoding Required]). It made no plan, ran no verification, and was interrupted once.

With AgentKit: the same Gemma 4 31b implemented a real custom ASN.1 DER parser, handled both UTCTime and GeneralizedTime, and built the expiration logic. It solved the hard part instead of stubbing it out.
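For readers wondering what "the hard part" involves, here is a minimal sketch of DER time decoding and an expiry check. It is not the parser the model produced in the benchmark (that code is not shown in this post); the function names read_tlv, parse_der_time, and is_expired are hypothetical, and the sketch assumes the X.509 convention where both time types end in "Z" and include seconds.

```python
from datetime import datetime, timezone

UTC_TIME_TAG = 0x17          # ASN.1 UTCTime: YYMMDDHHMMSSZ
GENERALIZED_TIME_TAG = 0x18  # ASN.1 GeneralizedTime: YYYYMMDDHHMMSSZ


def read_tlv(data: bytes, offset: int = 0):
    """Read one DER tag-length-value triple; return (tag, value, next_offset)."""
    tag = data[offset]
    length = data[offset + 1]
    offset += 2
    if length & 0x80:  # long form: low 7 bits give the count of length bytes
        num_bytes = length & 0x7F
        length = int.from_bytes(data[offset:offset + num_bytes], "big")
        offset += num_bytes
    value = data[offset:offset + length]
    if len(value) != length:
        raise ValueError("truncated DER value")
    return tag, value, offset + length


def parse_der_time(tag: int, value: bytes) -> datetime:
    """Decode a UTCTime or GeneralizedTime value into an aware datetime."""
    text = value.decode("ascii")
    if not text.endswith("Z") or not text[:-1].isdigit():
        raise ValueError("expected digits followed by 'Z'")
    digits = text[:-1]
    if tag == UTC_TIME_TAG and len(digits) == 12:
        # RFC 5280: two-digit years >= 50 mean 19YY, otherwise 20YY
        yy = int(digits[:2])
        year = 1900 + yy if yy >= 50 else 2000 + yy
        rest = digits[2:]
    elif tag == GENERALIZED_TIME_TAG and len(digits) == 14:
        year = int(digits[:4])
        rest = digits[4:]
    else:
        raise ValueError(f"unsupported time encoding (tag 0x{tag:02x})")
    month, day, hour, minute, second = (
        int(rest[i:i + 2]) for i in range(0, 10, 2)
    )
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)


def is_expired(not_after_der: bytes, now: datetime) -> bool:
    """Return True if `now` is later than the DER-encoded notAfter time."""
    tag, value, _ = read_tlv(not_after_der)
    return now > parse_der_time(tag, value)


if __name__ == "__main__":
    # A hand-built UTCTime element: tag 0x17, length 13, then the ASCII time.
    not_after = bytes([0x17, 0x0D]) + b"300101000000Z"
    print(is_expired(not_after, datetime.now(timezone.utc)))
```

The two time types differ only in how the year is written, which is why a placeholder string is tempting and a real parser has to branch on the tag.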

The model didn't get smarter. AgentKit's workflow gates changed its behavior:

  • Plan gate forced it to think through the DER parsing approach before writing code
  • Approval step made it commit to solving the hard problem instead of sidestepping it
  • State machine kept it accountable through RESEARCH → PLAN → EXECUTE → REVIEW
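To make the gating idea concrete, here is a small sketch of a state machine that refuses edits until a plan has been approved. It illustrates the concept only and is not AgentKit's implementation; the Workflow class, GateError, and every method name are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    RESEARCH = "RESEARCH"
    PLAN = "PLAN"
    EXECUTE = "EXECUTE"
    REVIEW = "REVIEW"


# Stages may only be visited in this order, one step at a time.
ORDER = [Stage.RESEARCH, Stage.PLAN, Stage.EXECUTE, Stage.REVIEW]


class GateError(Exception):
    """Raised when a transition or edit violates a workflow gate."""


class Workflow:
    def __init__(self) -> None:
        self.stage = Stage.RESEARCH
        self.plan_approved = False

    def approve_plan(self) -> None:
        """Approval step: only valid while a plan is on the table."""
        if self.stage is not Stage.PLAN:
            raise GateError("a plan can only be approved during PLAN")
        self.plan_approved = True

    def transition(self, target: Stage) -> None:
        """Move to the next stage, enforcing order and the plan gate."""
        if ORDER.index(target) != ORDER.index(self.stage) + 1:
            raise GateError(
                f"cannot move from {self.stage.name} to {target.name}"
            )
        if target is Stage.EXECUTE and not self.plan_approved:
            raise GateError("plan must be approved before EXECUTE")
        self.stage = target

    def edit_file(self, path: str) -> str:
        """Stand-in for a code edit; blocked outside EXECUTE."""
        if self.stage is not Stage.EXECUTE:
            raise GateError(f"edits are blocked during {self.stage.name}")
        return f"edited {path}"


if __name__ == "__main__":
    wf = Workflow()
    wf.transition(Stage.PLAN)
    wf.approve_plan()
    wf.transition(Stage.EXECUTE)
    print(wf.edit_file("der_parser.py"))
```

The point of the design is that the model cannot reach the editing step by skipping ahead: the only path to EXECUTE runs through an approved plan.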

What else just landed

Native OpenCode Integration

AgentKit now ships a native TUI plugin for OpenCode that lives inside the terminal UI — not just in the system prompt.

Select the agentkit agent from the agent switcher and you get:

  • Pre-loaded skills injected automatically
  • Workflow gates (RESEARCH → PLAN → EXECUTE → REVIEW → SHIP)
  • Mandatory approval dialogs before any code edit
  • Memory context from previous sessions

Works With Any Model

The skill router, workflow engine, and marketplace run entirely via CLI — no Claude API required. Tested on Gemma 4 31b, MiniMax M2.5, and Claude.

# Works with any model in OpenCode
agentkit workflow transition RESEARCH
agentkit workflow approve
agentkit workflow transition EXECUTE

Get started

Open-source AgentKit (free — stable & ready to use):

npx agentkit-ai@latest init

github.com/Ajaysable123/AgentKit

AgentKit Preview (closed beta — in active development)


To everyone who cloned, starred, or tried AgentKit — thank you. This is just getting started.

