8 years writing tests by hand. Then I built an MCP server so AI could do it with me.

May 20 • 9 min read

— Originally published at dev.to

TL;DR — I'm a QA engineer. I've been writing tests for 8 years, building mobile apps on the side, and watching every wave of "AI for testing" tools collapse the same way. Eventually I stopped waiting and built mk-qa-master — an open-source MCP server that lets Claude / Cursor / Codex / Gemini actually drive a real test suite. This is the story of how I got here.

A moment that stuck

A few years into my QA career, I was on a release call for a feature I'd written ~40 tests for. The CI report came back green. The product owner asked: "Are we good to ship?"

I looked at the suite. 40 tests, all green. But I knew — because I'd watched the tests run, because I'd written them — that three of them were testing the same trivial happy path with slight variations. And nothing tested the discount-coupon edge case I'd flagged in standup the day before, because nobody had written it yet, and the AI tools we'd just adopted hadn't picked up on it from the issue description.

We shipped. Two days later, the coupon bug hit production.

That bug wasn't a model problem. The AI was capable. It was a context problem — the AI couldn't see my last 30 production incidents, my historical regression points, or which assertions actually mattered to the business. It saw the code. So it wrote tests that matched the code. Of course they all passed.

That memory followed me through every subsequent AI-for-testing tool I touched. None of them got better at the context problem. They got better at the code.

Eight years into QA, I stopped waiting for someone else to fix it.

How I got into QA

I came in through the side door, like a lot of QA engineers. I wrote code first — backend services, then mobile, eventually full-stack — and got pulled into testing because the team needed someone who could actually maintain a Playwright suite and stop pretending Cypress would automate itself.

The tools weren't the problem. pytest is fine. Jest is fine. Cypress, Go's httptest, Maestro for mobile — all fine. The frameworks have been good for years.

The problem was always the judgment layer. Which test should I write next? Which existing test is most likely to break first when this PR ships? Of these 12 failures, which are real bugs and which are flakes? Those questions don't show up in the framework documentation. You learn them by being the QA engineer who's been on the team for two years and remembers every release.

I got good at that layer. So good that other engineers would ping me to ask, "Is this test useful, or am I just performing testing?" I learned to recognize the patterns: over-mocked assertions, brittle selectors, tests that assert "page loaded" and call it covered.

That pattern recognition is what AI tools were supposed to bring to teams that didn't have a senior QA engineer in the room.

They couldn't. Not because the models weren't smart enough, but because they were locked in a chat box. They could read code. They couldn't see anything else.

My side projects taught me the same lesson, harder

While I was doing QA in my day job, I was also building two iOS/Android apps in my evenings:

chichitie — a location-based food discovery app
nokou — a driver dashboard / speed-camera companion

Solo dev, both platforms, no QA team. Just me, my test suite, and a release cadence I was trying to maintain after a full-time job.

Every shortcut you can imagine in a solo project, I took. AI-generated tests? Tried them. Got back code that compiled but used selectors that didn't exist in my app. I spent more time fixing the # TODO placeholders and the hallucinated component names than I would have spent writing the tests by hand.

Tried again with better prompts. Same problem. Tried a different model. Same problem. Tried prompt-chaining to surface my app's structure first. The AI still made things up, because eventually it had to guess at something — usually a selector, sometimes an assertion, sometimes a whole user flow.

It clicked one night around 2 AM, after rage-deleting the third generated test in a row: the AI cannot see my actual app. It can read my code. It cannot see the live screen, the actual element tree, the last 50 crashes in Firebase Crashlytics, the production session recordings.

The fix isn't smarter prompts. The fix is giving the AI the access it's currently guessing about.

But how? I didn't want to write a new IDE plugin. I didn't want to be locked to one AI vendor. I needed a primitive that worked across Claude, Cursor, Codex, whatever I happened to be using that week.

When MCP showed up

Anthropic released the Model Context Protocol in late 2024. I almost didn't read the launch post — the name sounded like another vendor-specific protocol that would be deprecated in eighteen months.

Then I noticed something: it was an open standard. Any client could implement it. Any server could expose tools to it. The AI client made the calls; the server did the work. Clean separation. Same protocol whether you were on Claude Desktop, Cursor, Codex CLI, or anything that came later.

For my testing problem, that was the missing piece. I could write one MCP server that:

Knew how to drive my test runner
Knew how to probe the live DOM of my app
Knew how to read my last N runs and classify flaky vs broken
Knew how to expose all of that as tools the AI client could call

The AI client wouldn't have to be retrained for my use case. It wouldn't have to be vendor-locked. It would just discover the tools and use them.
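
A rough sketch of that separation, not mk-qa-master's actual code: a server-side table that advertises tools by name and description, plus a dispatch function that a client's call would land on. It uses plain Python rather than an MCP SDK, and `ToolRegistry`, `run_tests`, and the returned fields are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ToolRegistry:
    """Hypothetical stand-in for a server's tool table."""

    _tools: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    _docs: Dict[str, str] = field(default_factory=dict)

    def tool(self, fn: Callable[..., Any]) -> Callable[..., Any]:
        # Register a function under its own name; the docstring is its description.
        self._tools[fn.__name__] = fn
        self._docs[fn.__name__] = (fn.__doc__ or "").strip()
        return fn

    def list_tools(self) -> List[Dict[str, str]]:
        # What a client would see when it discovers the server's capabilities.
        return [{"name": n, "description": d} for n, d in sorted(self._docs.items())]

    def call(self, name: str, **arguments: Any) -> Any:
        # The client picks the tool and arguments; the server does the work.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**arguments)


registry = ToolRegistry()


@registry.tool
def run_tests(path: str) -> Dict[str, Any]:
    """Run the test suite under `path` and return a summary."""
    # A real server would invoke the test runner here; this stub only echoes.
    return {"path": path, "status": "not-run (sketch)"}


print(registry.list_tools())
print(registry.call("run_tests", path="tests/"))
```

The point of the shape is that nothing in it depends on which AI client is calling: discovery and dispatch are the whole contract.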

I spent a weekend reading the spec. Started prototyping the following Monday.

v0.1 — a thin layer over pytest

The first version was bare-bones. One runner (pytest). Six tools. No optimizer. No history archive. No mobile support.

I dogfooded it on one of my own projects within a week. Three things happened immediately:

Claude stopped guessing at selectors. I added analyze_url, which probed the live DOM and returned actual selectors. The AI had real data to work with.
Tests that used to be # TODO-laden stubs started compiling. Not because Claude got smarter — because it had access to information it'd been guessing about.
I noticed Claude kept asking me the same question. "Which of these failed tests is flaky vs broken?" It couldn't answer that itself, because it didn't have run history.
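
To make the selector point concrete, here is a minimal sketch of the idea behind a tool like analyze_url: hand the AI selectors that exist on the page instead of letting it guess. It is an assumption-heavy illustration, not the tool's implementation. It parses an HTML string with the standard library, whereas a real probe would read the rendered DOM from a browser. `SelectorCollector` and `extract_selectors` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class SelectorCollector(HTMLParser):
    """Collects CSS selectors for interactive elements present in the markup."""

    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self) -> None:
        super().__init__()
        self.selectors: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.INTERACTIVE:
            return
        attr = dict(attrs)
        # Prefer stable hooks: data-testid, then id, then name.
        if attr.get("data-testid"):
            self.selectors.append(f'[data-testid="{attr["data-testid"]}"]')
        elif attr.get("id"):
            self.selectors.append(f'#{attr["id"]}')
        elif attr.get("name"):
            self.selectors.append(f'{tag}[name="{attr["name"]}"]')


def extract_selectors(html: str) -> List[str]:
    collector = SelectorCollector()
    collector.feed(html)
    return collector.selectors


page = '<form><input name="coupon"><button id="apply">Apply</button><div id="x"></div></form>'
print(extract_selectors(page))  # ['input[name="coupon"]', '#apply']
```

A test generator that is only allowed to use selectors from this list cannot invent a component name, which is the failure the stubs kept hitting.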

So v0.2 added more runners (Jest, Cypress, Go test). v0.3 added Maestro for mobile. v0.4 added the optimizer — a coach layer that reads the last N runs, classifies broken vs flaky vs slow-regression, and writes a markdown action plan.
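
The broken / flaky / slow-regression split can be sketched with a few lines of history analysis. The rules below are assumptions made for illustration, not the optimizer's real heuristics: a test that fails in every recorded run is "broken", mixed outcomes are "flaky", and an always-passing test whose latest duration exceeds a multiple of its earlier average is a "slow-regression". The `slow_factor` threshold and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

# One run maps test name -> (passed, duration in seconds). Oldest run first.
Run = Dict[str, Tuple[bool, float]]


def classify(history: List[Run], test: str, slow_factor: float = 2.0) -> str:
    results = [run[test] for run in history if test in run]
    if not results:
        return "unknown"
    outcomes = [passed for passed, _ in results]
    if not any(outcomes):
        return "broken"  # failed in every recorded run
    if not all(outcomes):
        return "flaky"  # mixed passes and failures
    durations = [seconds for _, seconds in results]
    if len(durations) >= 2:
        earlier = durations[:-1]
        baseline = sum(earlier) / len(earlier)
        if baseline > 0 and durations[-1] > slow_factor * baseline:
            return "slow-regression"
    return "healthy"


history: List[Run] = [
    {"test_login": (True, 1.0), "test_coupon": (False, 0.5),
     "test_cart": (True, 0.4), "test_search": (True, 1.0)},
    {"test_login": (True, 1.1), "test_coupon": (False, 0.5),
     "test_cart": (False, 0.4), "test_search": (True, 1.0)},
    {"test_login": (True, 0.9), "test_coupon": (False, 0.6),
     "test_cart": (True, 0.4), "test_search": (True, 3.5)},
]

for name in ("test_login", "test_coupon", "test_cart", "test_search"):
    print(name, classify(history, name))
```

With this sample history the four tests come out as healthy, broken, flaky, and slow-regression respectively, which is the kind of labeled list a markdown action plan could be built from.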

That fourth release was when the project clicked for me. The earlier versions were "AI-driven test runner." v0.4 was "AI-driven QA team member." The optimizer plan is what a senior engineer would write after looking at the dashboard. The AI couldn't do that on its own — but with the right tools, it could.

v0.5 — admitting this was a family, not a tool

I'd been collecting use cases that didn't quite fit in mk-qa-master. "What if the AI could also triage my backlog and rank ideas?" "What if it could turn a Linear ticket into testable scenarios automatically?"

Different problems. Same architectural pattern. Each one wanted its own MCP server, its own runner abstraction, its own optimizer.

I rebranded mcp-test-runner → mk-qa-master, then immediately started two more:

mk-spec-master — reads specs from Linear / JIRA / GitHub Issues / Notion / Figma / Markdown, extracts acceptance criteria, maintains a spec ↔ test coverage matrix
mk-plan-master — RICE-scores product initiatives, ranks backlog, emits spec drafts that hand straight to mk-spec-master.parse_spec
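
For readers unfamiliar with RICE, the score is reach × impact × confidence ÷ effort. The sketch below shows that arithmetic and a ranking step; it is not mk-plan-master's code, and the `Initiative` type, the `rank` helper, and the sample numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Initiative:
    name: str
    reach: float       # people or events affected per period
    impact: float      # relative weight, for example 0.25 (minimal) to 3 (massive)
    confidence: float  # 0.0 to 1.0
    effort: float      # person-months, must be greater than zero

    @property
    def rice(self) -> float:
        return self.reach * self.impact * self.confidence / self.effort


def rank(items: List[Initiative]) -> List[Initiative]:
    # Highest score first.
    return sorted(items, key=lambda item: item.rice, reverse=True)


backlog = [
    Initiative("dark mode", reach=2000, impact=0.5, confidence=0.5, effort=2),
    Initiative("coupon edge cases", reach=500, impact=2, confidence=0.8, effort=1),
]

for item in rank(backlog):
    print(f"{item.name}: {item.rice:.0f}")
```

Here the narrower but higher-impact, cheaper item outranks the broad one (roughly 800 against 250), which is the trade-off the formula is meant to surface.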

Together they form the AI dev pipeline:

Idea → Plan    → Spec    → Code (your IDE) → Test  → Coverage → Coach
       mk-plan   mk-spec   your IDE          mk-qa   mk-spec    both

The family wraps the rails; code-writing stays in your IDE. The MCP servers don't try to replace Claude Code or Cursor. They live alongside them.

v0.6 and v0.7 — the day the API testing arc collapsed

A user asked: "Does mk-qa-master test APIs?"

The honest answer in v0.5 was "sort of, if you have API tests inside your pytest suite." The runner didn't differentiate UI tests from API tests; the optimizer treated them the same. Not exactly wrong, but not exactly yes.

I spent one day shipping two API runners:

v0.6.0 — Schemathesis (OpenAPI / Swagger fuzz testing, property-based)
v0.6.1 — Newman (Postman collection runner)

Same MCP tool surface. Same optimizer pipeline. ~150 lines of Python per runner. They inherited the existing history / flake / coach loop because the runner abstraction was already correct.
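
Why a new runner can be that small is easier to see in code. This is a sketch of one plausible runner abstraction, assuming the shared behavior lives in a base class and each tool only contributes its command line. `Runner`, `RunResult`, and the two subclasses are hypothetical names, not the project's actual classes; the commands assume `pytest` and `newman` are installed and on the PATH.

```python
import subprocess
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class RunResult:
    runner: str
    exit_code: int
    output: str

    @property
    def passed(self) -> bool:
        return self.exit_code == 0


class Runner(ABC):
    name: str

    @abstractmethod
    def command(self, target: str) -> List[str]:
        """Return the argv that runs this tool against `target`."""

    def run(self, target: str) -> RunResult:
        # Shared by every runner: execute, capture, normalize into one result type.
        proc = subprocess.run(self.command(target), capture_output=True, text=True)
        return RunResult(self.name, proc.returncode, proc.stdout + proc.stderr)


class PytestRunner(Runner):
    name = "pytest"

    def command(self, target: str) -> List[str]:
        return ["pytest", "-q", target]


class NewmanRunner(Runner):
    name = "newman"

    def command(self, target: str) -> List[str]:
        return ["newman", "run", target]


print(PytestRunner().command("tests/"))
print(NewmanRunner().command("collection.json"))
```

Anything downstream that consumes `RunResult` (history, flake detection, coaching) never needs to know which tool produced it, so adding a runner means adding one small subclass.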

Then a CAPTCHA question came in. "I'm testing my client's staging site, hit a reCAPTCHA, the test stalls. What do I do?"

The methodology layer had nothing on CAPTCHA. The optimizer had no classification for it. I shipped v0.6.3 — a knowledge-layer release that documented a Tier 1 / 2 / 3 decision flow:

Tier 1: bypass via Google's official test keys
Tier 2: degrade gracefully (mark external_dependency, skip)
Tier 3: AI visual judgment (forward-pointer to a tool that didn't exist yet)
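
The three tiers read naturally as a small decision function. The criteria used to pick a tier below are assumptions for illustration, since the conditions are not spelled out here: whether you control the site's CAPTCHA configuration, whether the flow under test has to get past the challenge, and whether a visual tool is available. `CaptchaAction` and `choose_captcha_action` are hypothetical names.

```python
from enum import Enum


class CaptchaAction(Enum):
    USE_TEST_KEYS = "tier1_test_keys"
    SKIP_EXTERNAL = "tier2_external_dependency"
    VISUAL_JUDGMENT = "tier3_visual_judgment"


def choose_captcha_action(
    can_configure_keys: bool,
    must_pass_challenge: bool,
    visual_tool_available: bool,
) -> CaptchaAction:
    # Tier 1: if the environment is yours to configure, use the official test keys.
    if can_configure_keys:
        return CaptchaAction.USE_TEST_KEYS
    # Tier 2: otherwise degrade gracefully unless the flow truly needs the
    # challenge solved and a visual tool exists to attempt it.
    if not must_pass_challenge or not visual_tool_available:
        return CaptchaAction.SKIP_EXTERNAL
    # Tier 3: let the multimodal AI client look at the challenge.
    return CaptchaAction.VISUAL_JUDGMENT


print(choose_captcha_action(True, True, True).value)
print(choose_captcha_action(False, False, True).value)
print(choose_captcha_action(False, True, True).value)
```

Keeping the cheapest option first means the visual path is only reached when the simpler tiers cannot apply.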

Two weeks later, that forward pointer became inspect_visual_challenge + solve_visual_challenge in v0.7.0. The MCP server screenshots the CAPTCHA, the AI client (already multimodal: Claude, Cursor, and Gemini all have vision) looks at the image, and the MCP server executes the clicks the AI indicates.

mk-qa-master does not contain a vision model. It doesn't need one. The intelligence is already in the room.

That's the architectural insight the whole project is built around: the MCP wraps tools; the AI client brings the reasoning; neither of them tries to be the other.

What I'd tell my past self

If I were starting again, I'd tell year-1-of-QA me three things:

The framework isn't the problem. Stop looking for the "perfect" test framework. They're all fine. The thing that's broken is the layer between the framework and the human judgment about which tests to write and what they mean. That's where the real work is.
Build tools, not heroics. The years I spent being the team's QA expert were valuable, but they didn't scale. The MCP server scales. The runner abstraction scales. The methodology layer scales. The pattern recognition I built up over eight years should be encodable, not bottlenecked through me.
Open source what you wish existed. I spent five years looking for a tool that did what mk-qa-master does. It didn't exist. The market for it is real — I see QA engineers in my network hit the same wall I hit. If you're sitting on a problem you've been solving by hand for years, the right move might be to ship the abstraction.

What I'm doing now

mk-qa-master is at v0.7.0. 18 tools. 7 runners (pytest / Jest / Cypress / Go / Maestro / Schemathesis / Newman). Bilingual built-in knowledge layer (EN + zh-TW). AAA on Glama. MIT.

I'm not done. Pact contract testing is on the roadmap. hCaptcha follows reCAPTCHA in v0.7.1. The family's next member — probably either an audit / perf MCP or an a11y MCP — depends on whether real users tell me which they need first.

What I want, and the reason I'm writing this, is for other QA engineers to take a look at the project, find the gaps I haven't seen, and contribute, fork, or rebuild it differently. The market for AI-driven QA is going to be enormous. It deserves more than one opinionated open-source tool.

If you've been doing QA for years and you've felt the same frustrations — let me know. The GitHub repo is open, the issue tracker is open, and I read every reply.

Links

mk-qa-master: https://github.com/kao273183/mk-qa-master
PyPI: https://pypi.org/project/mk-qa-master/
Family site: https://mcp.chenjundigital.com
Family: mk-qa-master · mk-spec-master · mk-plan-master

If this resonates and you know a QA engineer who's been muttering at AI tools for the last two years — share this with them. The first few hundred users are the hardest. After that the project starts surfacing on its own.

— Jack Kao, QA engineer, building solo.

2 Comments

VGR · Answer 1 · 2026-05-22T00:12:55+0000

VGR • May 21

Really cool idea. AI-assisted testing feels way more practical than a lot of the hype projects out there.

MiniKao • May 21

@VGR Thank you for your support. I genuinely hope to contribute something meaningful to the QA field and help more people understand the true value of software testing.
