TL;DR — I'm a QA engineer. I've been writing tests for 8 years, building mobile apps on the side, and watching every wave of "AI for testing" tools collapse the same way. Eventually I stopped waiting and built mk-qa-master — an open-source MCP server that lets Claude / Cursor / Codex / Gemini actually drive a real test suite. This is the story of how I got here.
A moment that stuck
A few years into my QA career, I was on a release call for a feature I'd written ~40 tests for. The CI report came back green. The product owner asked: "Are we good to ship?"
I looked at the suite. 40 tests, all green. But I knew — because I'd watched the tests run, because I'd written them — that three of them were testing the same trivial happy path with slight variations. And nothing tested the discount-coupon edge case I'd flagged in standup the day before, because nobody had written it yet, and the AI tools we'd just adopted hadn't picked up on it from the issue description.
We shipped. Two days later, the coupon bug hit production.
That bug wasn't a model problem. The AI was capable. It was a context problem — the AI couldn't see my last 30 production incidents, my historical regression points, or which assertions actually mattered to the business. It saw the code. So it wrote tests that matched the code. Of course they all passed.
That memory followed me through every subsequent AI-for-testing tool I touched. None of them got better at the context problem. They got better at the code.
Eight years into QA, I stopped waiting for someone else to fix it.
How I got into QA
I came in through the side door, like a lot of QA engineers. I wrote code first — backend services, then mobile, eventually full-stack — and got pulled into testing because the team needed someone who could actually maintain a Playwright suite and stop pretending Cypress would automate itself.
The tools weren't the problem. pytest is fine. Jest is fine. Cypress, Go's httptest, Maestro for mobile — all fine. The frameworks have been good for years.
The problem was always the judgment layer. Which test should I write next? Which existing test is most likely to break first when this PR ships? Of these 12 failures, which are real bugs and which are flakes? Those questions don't show up in the framework documentation. You learn them by being the QA engineer who's been on the team for two years and remembers every release.
I got good at that layer. Good enough that other engineers would ping me with questions like "is this test useful or am I just performing testing?" I learned to recognize the patterns: over-mocked assertions, brittle selectors, tests that assert "page loaded" and call it covered.
That pattern recognition is what AI tools were supposed to bring to teams that didn't have a senior QA engineer in the room.
They couldn't. Not because the models weren't smart enough, but because they were locked in a chat box. They could read code. They couldn't see anything else.
My side projects taught me the same lesson, harder
While I was doing QA in my day job, I was also building two iOS/Android apps in my evenings:
- chichitie — a location-based food discovery app
- nokou — a driver dashboard / speed-camera companion
Solo dev, both platforms, no QA team. Just me, my test suite, and a release cadence I was trying to maintain after a full-time job.
Every shortcut you can imagine in a solo project, I took. AI-generated tests? Tried them. Got back code that compiled and used selectors that didn't exist in my app. Spent more time fixing the # TODO placeholders and the hallucinated component names than I would have writing the tests by hand.
Tried again with better prompts. Same problem. Tried a different model. Same problem. Tried prompt-chaining to surface my app's structure first. The AI still made things up, because eventually it had to guess at something — usually a selector, sometimes an assertion, sometimes a whole user flow.
It clicked one night around 2 AM, after rage-deleting the third generated test in a row: the AI cannot see my actual app. It can read my code. It cannot see the live screen, the actual element tree, the last 50 crashes in Firebase Crashlytics, the production session recordings.
The fix isn't smarter prompts. The fix is giving the AI the access it's currently guessing about.
But how? I didn't want to write a new IDE plugin. I didn't want to be locked to one AI vendor. I needed a primitive that worked across Claude, Cursor, Codex, whatever I happened to be using that week.
When MCP showed up
Anthropic released the Model Context Protocol in late 2024. I almost didn't read the launch post — the name sounded like another vendor-specific protocol that would be deprecated in eighteen months.
Then I noticed something: it was an open standard. Any client could implement it. Any server could expose tools to it. The AI client made the calls; the server did the work. Clean separation. Same protocol whether you were on Claude Desktop, Cursor, Codex CLI, or anything that came later.
For my testing problem, that was the missing piece. I could write one MCP server that:
- Knew how to drive my test runner
- Knew how to probe the live DOM of my app
- Knew how to read my last N runs and classify flaky vs broken
- Knew how to expose all of that as tools the AI client could call
The AI client wouldn't have to be retrained for my use case. It wouldn't have to be vendor-locked. It would just discover the tools and use them.
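To make the discover-then-call shape concrete: MCP runs over JSON-RPC 2.0, and a client lists tools with a `tools/list` request and invokes one with `tools/call`. Below is a minimal standard-library sketch of that dispatch loop. It is not the mk-qa-master code and not the official MCP SDK; the tool names `run_tests` and `classify_failures`, their arguments, and their stubbed return values are placeholders assumed for illustration, and the response shapes are simplified.

```python
import json
from typing import Any, Callable

# Registry of tool name -> (description, handler). All tool names are hypothetical.
TOOLS: dict[str, tuple[str, Callable[..., Any]]] = {}


def tool(name: str, description: str):
    """Register a function as a callable tool."""
    def decorator(fn):
        TOOLS[name] = (description, fn)
        return fn
    return decorator


@tool("run_tests", "Run the test suite and return a summary (stubbed here).")
def run_tests(path: str = "tests") -> dict:
    # A real server would shell out to the test runner; this stub returns fixed data.
    return {"path": path, "passed": 0, "failed": 0}


@tool("classify_failures", "Label failing tests as flaky or broken (stubbed here).")
def classify_failures(last_n: int = 10) -> dict:
    return {"window": last_n, "flaky": [], "broken": []}


def handle(request: dict) -> dict:
    """Dispatch one JSON-RPC 2.0 request in the style of tools/list and tools/call."""
    req_id = request.get("id")
    method = request.get("method")
    params = request.get("params", {})
    if method == "tools/list":
        result = {"tools": [{"name": n, "description": d} for n, (d, _) in TOOLS.items()]}
    elif method == "tools/call":
        name = params.get("name")
        if name not in TOOLS:
            return {"jsonrpc": "2.0", "id": req_id,
                    "error": {"code": -32602, "message": f"Unknown tool: {name}"}}
        _, fn = TOOLS[name]
        payload = fn(**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": json.dumps(payload)}]}
    else:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": f"Method not found: {method}"}}
    return {"jsonrpc": "2.0", "id": req_id, "result": result}


# The client discovers what exists, then calls a tool by name.
listing = handle({"jsonrpc": "2.0", "id": 1, "method": "tools/list"})
print([t["name"] for t in listing["result"]["tools"]])
```

The point of the pattern is that the client needs no prior knowledge of the server: everything it can do is whatever the listing returns.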
I spent a weekend reading the spec. Started prototyping the following Monday.
v0.1 — a thin layer over pytest
The first version was barebones. One runner (pytest). Six tools. No optimizer. No history archive. No mobile support.
I dogfooded it on one of my own projects within a week. Three things happened immediately:
- Claude stopped guessing at selectors. I added analyze_url, which probed the live DOM and returned actual selectors. The AI had real data to work with.
- Tests that used to be # TODO-laden stubs started compiling. Not because Claude got smarter, but because it had access to information it'd been guessing about.
- I noticed Claude kept asking me the same question. "Which of these failed tests is flaky vs broken?" It couldn't answer that itself, because it didn't have run history.
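The selector point is the easiest one to illustrate. The real analyze_url probes a live DOM; the sketch below only parses a static HTML string with the standard library, so treat it as an illustration of "return selectors that actually exist" rather than the tool's implementation. The `extract_selectors` function, the list of interactive tags, and the attribute priority (data-testid, then id, then name) are all assumptions.

```python
from html.parser import HTMLParser


class SelectorCollector(HTMLParser):
    """Collect CSS selectors for interactive elements that carry a stable attribute."""

    INTERACTIVE = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.selectors: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.INTERACTIVE:
            return
        attr = dict(attrs)
        # Prefer attributes that tend to survive refactors (an assumed priority order).
        if attr.get("data-testid"):
            self.selectors.append(f'[data-testid="{attr["data-testid"]}"]')
        elif attr.get("id"):
            self.selectors.append(f'#{attr["id"]}')
        elif attr.get("name"):
            self.selectors.append(f'{tag}[name="{attr["name"]}"]')


def extract_selectors(html: str) -> list[str]:
    """Return selectors in document order; elements with no usable attribute are skipped."""
    parser = SelectorCollector()
    parser.feed(html)
    parser.close()
    return parser.selectors


page = '<form><input name="coupon"><button id="apply">Apply</button></form>'
print(extract_selectors(page))
```

Handing the model a list like this is what replaces the guessing: it can only pick from selectors that were observed on the page.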
So v0.2 added more runners (Jest, Cypress, Go test). v0.3 added Maestro for mobile. v0.4 added the optimizer — a coach layer that reads the last N runs, classifies broken vs flaky vs slow-regression, and writes a markdown action plan.
That fourth release was when the project clicked for me. The earlier versions were "AI-driven test runner." v0.4 was "AI-driven QA team member." The optimizer plan is what a senior engineer would write after looking at the dashboard. The AI couldn't do that on its own — but with the right tools, it could.
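For a sense of what "classifies broken vs flaky vs slow-regression" can mean in practice, here is a minimal sketch over one test's recent runs. The rules and thresholds are illustrative assumptions, not the optimizer's actual logic: failing on every recorded run counts as broken, a mix of passes and failures counts as flaky, and a test that always passes but whose recent median duration is at least `slow_factor` times its earlier median counts as a slow regression. The `Run` type and `classify` function are hypothetical names.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    passed: bool
    duration_s: float


def classify(history: list[Run], slow_factor: float = 2.0) -> str:
    """Label one test from its recent runs, ordered oldest to newest.

    The rules here are illustrative assumptions, not the project's real ones.
    """
    if not history:
        return "unknown"
    outcomes = [r.passed for r in history]
    if not any(outcomes):
        return "broken"  # failed on every recorded run
    if not all(outcomes):
        return "flaky"  # mix of passes and failures
    if len(history) >= 4:
        half = len(history) // 2
        earlier = median(r.duration_s for r in history[:half])
        recent = median(r.duration_s for r in history[half:])
        if earlier > 0 and recent >= slow_factor * earlier:
            return "slow-regression"  # always passes, but recent runs are much slower
    return "healthy"


runs = [Run(True, 1.0), Run(False, 1.2), Run(True, 0.9)]
print(classify(runs))
```

None of this is sophisticated. What matters is that it needs run history as input, which is exactly what a chat-box model never had.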
I'd been collecting use cases that didn't quite fit in mk-qa-master. "What if the AI could also triage my backlog and rank ideas?" "What if it could turn a Linear ticket into testable scenarios automatically?"
Different problems. Same architectural pattern. Each one wanted its own MCP server, its own runner abstraction, its own optimizer.
I rebranded mcp-test-runner → mk-qa-master, then immediately started two more:
- mk-spec-master: reads specs from Linear / JIRA / GitHub Issues / Notion / Figma / Markdown, extracts acceptance criteria, maintains a spec ↔ test coverage matrix
- mk-plan-master: RICE-scores product initiatives, ranks backlog, emits spec drafts that hand straight to mk-spec-master.parse_spec
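For readers who haven't met RICE: the score is Reach × Impact × Confidence ÷ Effort, and the backlog is sorted by it. A minimal sketch of that arithmetic follows. The `Initiative` type, the `rank` helper, and the sample numbers are made up for illustration and say nothing about how mk-plan-master stores or weights its inputs.

```python
from dataclasses import dataclass


@dataclass
class Initiative:
    name: str
    reach: float       # people or events affected per time period
    impact: float      # relative benefit per person reached
    confidence: float  # 0.0 to 1.0
    effort: float      # e.g. person-months, must be greater than zero

    @property
    def rice(self) -> float:
        # RICE = Reach x Impact x Confidence / Effort
        return self.reach * self.impact * self.confidence / self.effort


def rank(backlog: list[Initiative]) -> list[Initiative]:
    """Return the backlog ordered by RICE score, highest first."""
    return sorted(backlog, key=lambda item: item.rice, reverse=True)


# Hypothetical backlog entries.
backlog = [
    Initiative("offline mode", reach=500, impact=2, confidence=0.5, effort=5),
    Initiative("coupon edge cases", reach=2000, impact=1, confidence=0.8, effort=1),
]
for item in rank(backlog):
    print(item.name, item.rice)
```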
Together they form the AI dev pipeline:
Idea → Plan (mk-plan) → Spec (mk-spec) → Code (your IDE) → Test (mk-qa) → Coverage (mk-spec) → Coach (both)
The family wraps the rails; code-writing stays in your IDE. The MCP doesn't try to replace Claude Code or Cursor — it lives alongside them.
v0.6 and v0.7 — the day the API testing arc collapsed
A user asked: "Does mk-qa-master test APIs?"
The honest answer in v0.5 was "sort of, if you have API tests inside your pytest suite." The runner didn't differentiate UI tests from API tests; the optimizer treated them the same. Not exactly wrong, but not exactly yes.
I spent one day shipping two API runners:
- v0.6.0 — Schemathesis (OpenAPI / Swagger fuzz testing, property-based)
- v0.6.1 — Newman (Postman collection runner)
Same MCP tool surface. Same optimizer pipeline. ~150 lines of Python per runner. They inherited the existing history / flake / coach loop because the runner abstraction was already correct.
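Here is roughly why a new runner can stay small. In the sketch below, each runner only knows how to build its command line, and one shared parser turns a JUnit-style XML report into common result records that a history or flake pipeline could consume. This is an assumed design, not the project's actual classes: `Runner`, `CaseResult`, `PytestRunner`, and `NewmanRunner` are hypothetical names, and it assumes both tools are asked for JUnit XML (pytest via `--junitxml`, Newman via its junit reporter).

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CaseResult:
    name: str
    status: str  # "passed", "failed", or "skipped"
    duration_s: float


class Runner(ABC):
    """Each runner supplies its command; report parsing is shared."""

    @abstractmethod
    def command(self, target: str, report_path: str) -> list[str]:
        """Return the argv that runs `target` and writes JUnit XML to `report_path`."""

    def parse(self, junit_xml: str) -> list[CaseResult]:
        results = []
        for case in ET.fromstring(junit_xml).iter("testcase"):
            if case.find("failure") is not None or case.find("error") is not None:
                status = "failed"
            elif case.find("skipped") is not None:
                status = "skipped"
            else:
                status = "passed"
            results.append(CaseResult(case.get("name", ""), status, float(case.get("time", 0))))
        return results


class PytestRunner(Runner):
    def command(self, target, report_path):
        return ["pytest", target, f"--junitxml={report_path}"]


class NewmanRunner(Runner):
    def command(self, target, report_path):
        return ["newman", "run", target, "--reporters", "junit",
                "--reporter-junit-export", report_path]


report = '<testsuite><testcase name="login" time="0.5"/></testsuite>'
print(PytestRunner().parse(report))
```

With this shape, adding a runner means writing one `command` method; everything downstream of `parse` is inherited unchanged.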
Then a CAPTCHA question came in. "I'm testing my client's staging site, hit a reCAPTCHA, the test stalls. What do I do?"
The methodology layer had nothing on CAPTCHA. The optimizer had no classification for it. I shipped v0.6.3 — a knowledge-layer release that documented a Tier 1 / 2 / 3 decision flow:
- Tier 1: bypass via Google's official test keys
- Tier 2: degrade gracefully (mark external_dependency, skip)
- Tier 3: AI visual judgment (forward-pointer to a tool that didn't exist yet)
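As a rough sketch, the three tiers can be read as a "cheapest option first" decision. The tier meanings come from the list above, but the two inputs used to choose between them (whether you can configure the site's reCAPTCHA keys, and whether the CAPTCHA-guarded flow is itself what you are testing) are my assumptions for illustration, as are the `Tier` and `choose_tier` names.

```python
from enum import Enum


class Tier(Enum):
    TEST_KEYS = 1        # bypass via Google's official test keys
    DEGRADE = 2          # mark external_dependency and skip
    VISUAL_JUDGMENT = 3  # hand a screenshot to the multimodal AI client


def choose_tier(can_configure_keys: bool, flow_is_under_test: bool) -> Tier:
    """Pick the lowest tier that applies. Both inputs are assumed decision criteria."""
    if can_configure_keys:
        return Tier.TEST_KEYS
    if not flow_is_under_test:
        return Tier.DEGRADE
    return Tier.VISUAL_JUDGMENT


# A client's staging site where the keys are out of your hands and the guarded flow must be tested.
print(choose_tier(can_configure_keys=False, flow_is_under_test=True))
```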
Two weeks later that forward-pointer became inspect_visual_challenge + solve_visual_challenge — v0.7.0. The MCP screenshots the CAPTCHA, the AI client (which is already multimodal — Claude, Cursor, Gemini all have vision) looks at the image, the MCP executes the clicks the AI indicates.
mk-qa-master does not contain a vision model. It doesn't need one. The intelligence is already in the room.
That's the architectural insight the whole project is built around: the MCP wraps tools; the AI client brings the reasoning; neither of them tries to be the other.
What I'd tell my past self
If I were starting again, I'd tell year-1-of-QA me three things:
- The framework isn't the problem. Stop looking for the "perfect" test framework. They're all fine. The thing that's broken is the layer between the framework and the human judgment about which tests to write and what they mean. That's where the real work is.
- Build tools, not heroics. The years I spent being the team's QA expert were valuable, but they didn't scale. The MCP server scales. The runner abstraction scales. The methodology layer scales. The pattern recognition I built up over eight years should be encodable, not bottlenecked through me.
- Open source what you wish existed. I spent five years looking for a tool that did what mk-qa-master does. It didn't exist. The market for it is real — I see QA engineers in my network hit the same wall I hit. If you're sitting on a problem you've been solving by hand for years, the right move might be to ship the abstraction.
What I'm doing now
mk-qa-master is at v0.7.0. 18 tools. 7 runners (pytest / Jest / Cypress / Go / Maestro / Schemathesis / Newman). Bilingual built-in knowledge layer (EN + zh-TW). AAA on Glama. MIT.
I'm not done. Pact contract testing is on the roadmap. hCaptcha follows reCAPTCHA in v0.7.1. The family's next member — probably either an audit / perf MCP or an a11y MCP — depends on whether real users tell me which they need first.
What I want — and the reason I'm writing this — is for other QA engineers to take a look at the project, find the gaps I haven't seen, and either contribute or fork or rebuild it differently. The market for AI-driven QA is going to be enormous. It deserves more than one opinionated open-source tool.
If you've been doing QA for years and you've felt the same frustrations — let me know. The GitHub repo is open, the issue tracker is open, and I read every reply.
Links
If this resonates and you know a QA engineer who's been muttering at AI tools for the last two years — share this with them. The first few hundred users are the hardest. After that the project starts surfacing on its own.
— Jack Kao, QA engineer, building solo.