What Nobody Tells You About Running an MCP Server in Production (72 Tools, 1,161 ops/s)
There are a thousand tutorials that show you how to stand up a Model Context Protocol server. Almost all of them stop at the same place: a "hello world" with two or three tools, a screenshot of a model calling one of them, and a closing line about how easy it all was.
It is easy — for three tools.
I run an MCP server in production with 72 tools, sustaining 1,161 operations per second. Everything interesting I've learned about MCP happened in the gap between three tools and seventy-two. That gap is where the tutorials end and the engineering begins, and it's the part nobody writes about. So here it is — the lessons, the failure modes, and the principles that actually matter once an MCP server stops being a demo and starts being infrastructure.
I'm going to keep this at the level of principle rather than dumping our internal architecture. Partly because the specifics are our product, and partly because the principles are the genuinely transferable part. If you're building an MCP server you intend to run, not just show, this is what's waiting for you.
The first wall: the model can't tell your tools apart
The single biggest thing that changes between three tools and seventy is not performance. It's comprehension: the model gets confused.
With three tools, the names can be sloppy and it doesn't matter. get_data, send_thing, do_action — the model picks correctly because there's nothing to confuse them with. At seventy tools, you will inevitably have several that are near neighbors in meaning: tools that read overlapping data, tools that act on the same resource in slightly different ways, tools whose descriptions, if written carelessly, are nearly interchangeable.
When that happens, the model starts picking the wrong tool. Not randomly — systematically. It will reach for the tool whose description best matches the surface form of the request, even when a different tool is the correct one. And because the call looks plausible, you often don't catch it until you're staring at a wrong result wondering why.
The fix is not more tools or smarter models. It's treating tool descriptions as a disambiguation problem, not a documentation problem. Every tool description has to answer, implicitly, "why would I pick this one instead of the similar-looking one next to it?" That means descriptions that state not just what a tool does, but its boundaries — when not to use it, what it is not for. The discipline that keeps a 72-tool server usable is closer to writing a good taxonomy than writing good docs. Your tools have to occupy clearly separated semantic territory, and the descriptions are how you draw the borders.
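One way to make "draw the borders" concrete is to store each tool's boundary as its own field and to flag pairs of tools whose descriptions overlap heavily. The Python sketch below is illustrative only: the `ToolSpec` fields, the tool names, the word-overlap measure, and the 0.5 threshold are all assumptions, not part of MCP or of the server described here. A crude word-overlap score is a stand-in for whatever similarity check suits your toolset.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class ToolSpec:
    name: str
    use_when: str   # what the tool is for
    not_for: str    # the boundary: what it is not for, and which tool to use instead

    def description(self) -> str:
        # The text the model sees states the purpose and the boundary together.
        return f"{self.use_when} Do not use for: {self.not_for}"


def _tokens(text: str) -> set[str]:
    # Lowercase words with punctuation stripped; very short words are ignored.
    words = (w.strip(".,:;()").lower() for w in text.split())
    return {w for w in words if len(w) > 3}


def overlap(a: ToolSpec, b: ToolSpec) -> float:
    """Jaccard similarity between the 'use_when' texts of two tools."""
    ta, tb = _tokens(a.use_when), _tokens(b.use_when)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def near_neighbors(tools: list[ToolSpec], threshold: float = 0.5):
    """Return (name, name, score) for every pair at or above the threshold."""
    return [
        (a.name, b.name, round(overlap(a, b), 2))
        for a, b in combinations(tools, 2)
        if overlap(a, b) >= threshold
    ]


# Hypothetical tools, for illustration only.
TOOLS = [
    ToolSpec("get_invoice", "Fetch one invoice by invoice id.",
             "searching or listing; use list_invoices."),
    ToolSpec("list_invoices", "Fetch every invoice for a customer.",
             "reading a single known invoice; use get_invoice."),
    ToolSpec("send_email", "Send an email message to a customer.",
             "internal notifications."),
]

for pair in near_neighbors(TOOLS):
    print(pair)
```

A flagged pair is not necessarily a bug. It marks the spot where the two descriptions most need an explicit "not for" clause.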
This is the lesson that surprised me most. I expected the hard problems to be infrastructural. The hardest one was linguistic.
The second wall: surface area is attack surface
Three tools is a toy. Seventy-two tools is a system that can read, write, and act across a lot of resources — and every one of those tools is a thing the model can be talked into calling.
Here's the uncomfortable reality of a capable MCP server: the model driving it is, by design, taking instructions from somewhere. If any of those instructions can come from untrusted content — a document the model reads, a webpage it fetches, a record returned by one of your own tools — then you have a prompt-injection surface, and every tool you expose is a lever an attacker might try to pull.
This is where most production MCP discussions get hand-wavy, so let me be concrete about the principle. You cannot trust the model to police itself, and you cannot trust the input to be clean. The only place you can enforce anything reliably is between the model deciding to call a tool and the tool actually executing. That gap — after intent, before effect — is the one place a guarantee can live.
In practice that means every tool call passes through a checkpoint before it runs. Is this caller allowed to invoke this tool? Is this specific invocation within policy — the right scope, the right limits, not a destructive action triggered by text that came from an untrusted source? Is it being logged so there's an audit trail afterward? The model proposes; the checkpoint disposes. The tool itself never runs on the model's say-so alone.
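To show the shape of such a checkpoint, here is a minimal Python sketch. It is not the checkpoint this server uses. Every name in it (`Call`, `Checkpoint`, `grants`, `destructive`, `tainted`) is hypothetical, and it assumes that some upstream component can already tell when a call was influenced by untrusted content, which is a hard problem in its own right.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Call:
    caller: str
    tool: str
    args: dict[str, Any]
    tainted: bool = False  # assumption: set upstream when untrusted content influenced the call


@dataclass
class Checkpoint:
    grants: dict[str, set[str]]           # caller -> tools it may invoke
    destructive: set[str]                 # tools with effects that are hard to undo
    tools: dict[str, Callable[..., Any]]  # tool name -> implementation
    audit: list[dict[str, Any]] = field(default_factory=list)

    def decide(self, call: Call) -> str:
        # Authorization: may this caller invoke this tool at all?
        if call.tool not in self.grants.get(call.caller, set()):
            return "deny:not_authorized"
        # Policy: no destructive action driven by untrusted text.
        if call.tool in self.destructive and call.tainted:
            return "deny:untrusted_origin"
        return "allow"

    def execute(self, call: Call) -> Any:
        decision = self.decide(call)
        # Every attempt is recorded before anything runs, allowed or not.
        self.audit.append({"caller": call.caller, "tool": call.tool,
                           "args": call.args, "decision": decision})
        if decision != "allow":
            raise PermissionError(decision)
        # The tool runs only after the checkpoint has allowed the call.
        return self.tools[call.tool](**call.args)
```

The property that matters is structural: `execute` is the only path to a tool, so a tool never runs without a recorded decision.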
I won't detail how our checkpoint is built — that's the part that's ours. But the principle is non-negotiable for anyone running MCP seriously: an MCP server without an inline authorization and policy layer between decision and execution is not a production system. It's a production incident that hasn't happened yet. The convenience of MCP — that a model can fluidly call your tools — is exactly the thing that makes an ungoverned MCP server dangerous. The fluidity cuts both ways.
The third wall: throughput is a state problem, not a speed problem
Let's talk about the 1,161 ops/s, because the number itself is less interesting than what it took to make it stable.
Getting high throughput out of a single endpoint is, by now, a fairly well-understood engineering exercise — connection handling, concurrency, not doing dumb things on the hot path. That part is just craft. The part that bites you in an MCP server specifically is state and ordering across tool calls.
MCP interactions are rarely one-and-done. A model strings tool calls together: it reads something, decides based on the result, writes something, reads again to confirm. Under load, with many of these chains interleaving, the failure mode isn't slowness — it's incoherence. Two chains touch the same resource in an order neither expects. A read returns a value that a concurrent write has already made stale. The model, acting on stale perception, makes a decision that was correct for a world that no longer exists.
The lesson here mirrors the first one in a strange way. With three tools and one user, sequencing takes care of itself. At production throughput with many concurrent chains, you have to design for the fact that the model's view of the world and the actual state of the world drift apart under load, and the tighter you can close that gap, the less your agents do confidently wrong things. Last-write-wins is a decision, not a default — make it on purpose. Idempotency on the tools that can tolerate it is worth more than raw speed. And the operations that genuinely must be ordered have to be made un-interleavable, even at a throughput cost, because an incoherent result at 1,161 ops/s is worse than a correct result at 900.
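As a rough illustration of two of these ideas, the sketch below combines a per-resource lock, so that read-then-write sequences on one resource cannot interleave, with an idempotency key, so that a retried request returns the first result instead of running twice. It is a single-process Python sketch under stated assumptions: `OrderedExecutor`, the resource name, and the request keys are hypothetical, the result cache is unbounded, and a real deployment spanning several processes would need a shared store and lock instead.

```python
import asyncio
from collections import defaultdict
from typing import Any, Awaitable, Callable


class OrderedExecutor:
    """Serializes operations per resource and deduplicates retries by key."""

    def __init__(self) -> None:
        self._locks: dict[str, asyncio.Lock] = defaultdict(asyncio.Lock)
        self._results: dict[str, Any] = {}  # idempotency key -> first result (never evicted here)

    async def run(self, resource: str, key: str,
                  op: Callable[[], Awaitable[Any]]) -> Any:
        # Only one operation per resource runs at a time.
        async with self._locks[resource]:
            if key in self._results:
                return self._results[key]  # a retry: do not run the effect again
            result = await op()
            self._results[key] = result
            return result


async def demo() -> int:
    ex = OrderedExecutor()
    state = {"balance": 0}

    async def deposit() -> int:
        seen = state["balance"]       # read
        await asyncio.sleep(0)        # yield control, as a real I/O call would
        state["balance"] = seen + 10  # write based on the earlier read
        return state["balance"]

    # Three distinct requests plus one retry of "req-0".
    keys = ["req-0", "req-1", "req-2", "req-0"]
    await asyncio.gather(*(ex.run("account:1", k, deposit) for k in keys))
    return state["balance"]


if __name__ == "__main__":
    print(asyncio.run(demo()))  # prints 30
```

Without the lock, all four deposits would read a balance of 0 and each write 10. With it, the three distinct requests apply in turn and the retry is absorbed, at the cost of serializing work on that one resource.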
Throughput, in other words, is not how fast you can answer. It's how many concurrent agents can act through your server before they start tripping over each other's reality.
The fourth wall: observability is the whole game
When something goes wrong with three tools, you read the logs and you see it. When something goes wrong across 72 tools and more than a thousand operations a second, "read the logs" is not a plan.
The thing that makes a large MCP server operable is that you can answer, after the fact, what happened and why. Which tool was called, by whom, with what arguments, in what order, with what result, and — critically — what the policy layer decided about it. If you can't reconstruct the chain of a misbehaving agent's actions, you can't fix anything; you can only guess.
This is why the audit trail isn't a compliance afterthought bolted on at the end. It's the primary debugging instrument of a production MCP server. The same checkpoint that enforces policy before execution is the natural place to record what was attempted and what was allowed. Governance and observability turn out to be the same capability viewed from two angles: one is "stop the bad call before it runs," the other is "explain every call after it ran." Build one and you mostly get the other.
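For illustration, an audit record that answers those questions can be quite small. The Python sketch below assumes one JSON line per tool call and a `chain_id` that groups the calls of one agent run. The field names and the `reconstruct` helper are hypothetical, not a format defined by MCP or used by this server.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class AuditRecord:
    seq: int              # order in which the call reached the checkpoint
    chain_id: str         # groups the calls made by one agent run
    caller: str           # who made the call
    tool: str             # which tool was called
    args: dict[str, Any]  # with what arguments
    decision: str         # what the policy layer decided
    result: str           # short outcome summary, e.g. "ok" or "not_run"


def to_line(rec: AuditRecord) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(rec), sort_keys=True)


def reconstruct(lines: list[str], chain_id: str) -> list[AuditRecord]:
    """Rebuild one agent's chain of calls, in order, from the raw log lines."""
    recs = (AuditRecord(**json.loads(line)) for line in lines)
    return sorted((r for r in recs if r.chain_id == chain_id),
                  key=lambda r: r.seq)
```

With records like these, "what did this agent do, and what was it allowed to do?" becomes a filter and a sort over the log.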
What this all adds up to
If I compress everything I've learned running MCP at this scale into a single idea, it's this: MCP is easy to start and hard to run, and the difficulty is concentrated in places the tutorials don't look. The hard parts aren't the protocol. They're:
1. Keeping a large toolset semantically distinct so the model picks correctly.
2. Putting an enforceable checkpoint between every decision and every effect, because the model's fluency is also its vulnerability.
3. Designing for state coherence under concurrency, not just speed.
4. Making every action reconstructable after the fact.
None of those show up when you have three tools and one user. All of them show up by the time you have seventy-two tools and a thousand operations a second. The protocol gets you to the demo. These four walls are what stand between the demo and production.
MCP is genuinely a good thing — it's the cleanest standard we've had for letting models act through real tools, and I'm glad it exists. But "a model can call your tools" is a capability, and like every capability it's exactly as safe as the layer you put around it. The servers that survive contact with production are the ones whose builders understood that early, ideally before tool number forty started getting confused with tool number forty-one.
If you're past three tools and heading for thirty, you'll hit these walls in roughly this order. Now at least you know they're coming.
Fabio Bastos builds and runs production AI infrastructure, including an MCP server with 72 tools sustaining 1,161 operations per second. He writes about what breaks when AI systems leave the demo and meet the real world.