
BackerLeader · 4 min read

The Infrastructure Challenge Holding Back AI Agents in Production

AI agents need more than Kubernetes can offer. Here's what's missing and how teams are solving it.

Most AI pilots never make it to production. Recent studies show failure rates between 88% and 95%. The reasons vary, but infrastructure gaps play a major role.

Kubernetes works well for traditional applications. But AI agents have different needs. They require context about users, tools, and other agents. Traditional container orchestration treats workloads as black boxes. That approach breaks down with agentic AI.

The Kubernetes Gap

Standard Kubernetes networking assumes simple request-response patterns. AI agents operate differently. They need to communicate with multiple tools, other agents, and various language models. These interactions follow protocols like Model Context Protocol (MCP) and Agent-to-Agent (A2A) that standard service meshes don't understand.
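To make the mismatch concrete, here is a minimal sketch (tool and argument names are hypothetical) of what an MCP tool call looks like on the wire. MCP messages are JSON-RPC 2.0, so a proxy that only understands HTTP request-response semantics sees an opaque body, while an agent-aware gateway can parse the method and params to route, authorize, and trace the call.

```python
import json

def mcp_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build an MCP tools/call request (MCP messages are JSON-RPC 2.0)."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

# A conventional service mesh treats this payload as opaque bytes; an
# agent-native data plane can inspect the method and tool name.
msg = mcp_tool_call(1, "search_orders", {"customer_id": "c-42"})
print(msg)
```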

Security presents another challenge. Traditional applications run with fixed permissions. AI agents act on behalf of users with dynamic authorization needs. A customer service agent might need different permissions depending on which user it's helping.

Observability becomes complex, too. When an agent fails, teams need to trace the entire interaction chain. Which tools did it call? How did other agents respond? What decisions led to the failure? Standard monitoring tools can't provide this level of visibility.
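As a rough illustration of what such a trace must capture, here is a self-contained sketch (the span names are hypothetical, and this is not a real APM API) that records an agent's interaction chain as a tree of spans, so a failed tool call can be read in the context of the decisions that led to it:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str                 # e.g. "agent:support", "tool:billing.lookup"
    children: list = field(default_factory=list)

    def child(self, name: str) -> "Span":
        """Record a nested step (an LLM call, tool call, or sub-agent)."""
        s = Span(name)
        self.children.append(s)
        return s

def render(span: Span, depth: int = 0) -> list:
    """Flatten the span tree into an indented trace for inspection."""
    lines = ["  " * depth + span.name]
    for c in span.children:
        lines += render(c, depth + 1)
    return lines

# Reconstructing "which tools did it call, and what led to the failure?"
root = Span("agent:customer-support")
root.child("llm:plan")
tool = root.child("tool:billing.lookup")
tool.child("error:timeout")
print("\n".join(render(root)))
```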

"Navigating the path to production with AI agents is hard and requires critical gaps in the Kubernetes foundation to be filled," says Idit Levine, CEO and Founder of Solo.io. Her company recently announced Kagent Enterprise, a platform designed to address these specific infrastructure challenges.

What AI Agents Actually Need

AI applications require infrastructure that understands context. This means knowing who initiated a request, which agent is handling it, and what tools are involved.

Identity and Authorization: Agents need dynamic identity management. A single agent might act with different permissions based on the user it represents. Traditional RBAC systems can't handle this complexity.
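A minimal sketch of the idea, with hypothetical permission names: the agent's effective permissions are computed per request as the intersection of its own capabilities and the grants of the user it represents, rather than being fixed at deploy time as in static RBAC.

```python
# An agent's deploy-time capability ceiling (hypothetical permission names).
AGENT_CAPABILITIES = {"orders:read", "orders:refund", "tickets:write"}

# Per-user grants, looked up at request time.
USER_GRANTS = {
    "alice": {"orders:read", "orders:refund"},
    "bob":   {"orders:read"},
}

def effective_permissions(agent_caps: set, user: str) -> set:
    # The agent may never exceed what the user it represents could do.
    return agent_caps & USER_GRANTS.get(user, set())

def authorize(user: str, action: str) -> bool:
    return action in effective_permissions(AGENT_CAPABILITIES, user)

print(authorize("alice", "orders:refund"))  # True: alice may refund
print(authorize("bob", "orders:refund"))    # False: bob is read-only
```

The same agent binary thus acts with different permissions on every request, which is exactly what a static role binding cannot express.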

Protocol Support: Agent communication uses specialized protocols. MCP enables tool interactions. A2A handles agent-to-agent communication. Standard load balancers and proxies don't support these protocols natively.

Advanced Observability: Teams need to trace complete agent workflows. This includes LLM calls, tool interactions, and decision points. Standard APM tools miss these agent-specific interactions.

Lifecycle Management: Agents have complex lifecycles. They maintain memory between interactions. They can spawn sub-agents for specific tasks. Traditional deployment models don't account for these patterns.
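A toy sketch of those two lifecycle traits (names and message format are invented for illustration): state that persists across interactions, and sub-agents spawned on demand for scoped tasks, neither of which a stateless Deployment models.

```python
class Agent:
    def __init__(self, name: str):
        self.name = name
        self.memory: list = []     # persists between interactions
        self.children: list = []   # sub-agents spawned for scoped tasks

    def handle(self, message: str) -> str:
        self.memory.append(message)
        if message.startswith("research:"):
            # Spawn a short-lived sub-agent for a delegated task.
            sub = Agent(f"{self.name}/researcher-{len(self.children)}")
            self.children.append(sub)
            return sub.handle(message.removeprefix("research:"))
        return f"{self.name} handled {message!r} (memory={len(self.memory)})"

agent = Agent("planner")
agent.handle("hello")
print(agent.handle("research:latest pricing"))
print(len(agent.children))  # 1 sub-agent was spawned
```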

Emerging Solutions

Several approaches are emerging to address these gaps. Some teams build custom solutions on top of existing service meshes. Others adopt specialized platforms designed for agentic workloads.

Solo.io's Kagent Enterprise represents the specialized platform approach. It extends Kubernetes with three layers of context awareness:

Context-aware networking: The platform includes agentgateway, an agent-native data plane that Solo.io contributed to the Linux Foundation. It supports MCP, A2A, and leading LLM provider protocols natively.

Context-aware runtime: Kagent Enterprise introduces a new runtime layer that extends Kubernetes to become context-aware. It handles agent identity, advanced failover, memory management, and deeper observability instrumentation.

Context-aware platform: The AgentOps dashboard provides centralized visibility with agent graphs and end-to-end tracing. Policy and lifecycle management come built-in, with declarative APIs and UI controls.

The custom approach gives teams full control but requires significant engineering effort. Teams need to implement agent-aware networking, build observability tools, and create governance frameworks. This works for organizations with large platform engineering teams.

Real-World Implementation

Early adopters are seeing success with both approaches. A financial services company built a custom agent platform on Istio. They added agent identity management and custom observability. The project took eight months but now supports dozens of production agents.

Organizations using specialized platforms like Kagent Enterprise report faster time-to-production. The platform handles protocol translation and observability automatically. Teams can focus on agent logic instead of infrastructure concerns.

The choice depends on team capabilities and requirements. Custom solutions offer flexibility but demand significant investment. Specialized platforms accelerate deployment but may limit customization options.

The open source kagent project, which launched in March 2025 and became a CNCF project, has grown to 800+ community members and 100+ contributors. This community backing provides a foundation for teams building agentic infrastructure.

Looking Forward

The infrastructure landscape for AI agents is still evolving. New protocols are emerging. Observability requirements are becoming clearer. Security models are maturing.

Several trends are worth watching. Protocol standardization is advancing through initiatives like the Linux Foundation's agentgateway project. Observability vendors are adding agent-specific features. Cloud providers are exploring managed agent platforms.

The goal remains the same: make Kubernetes truly AI-ready. This means extending container orchestration with agent-aware capabilities. Teams need infrastructure that understands the unique requirements of agentic applications.

Getting Started

Organizations can start preparing now. Begin by mapping current AI pilot requirements against existing infrastructure capabilities. Identify specific gaps in networking, security, and observability.

Experiment with agent protocols in development environments. Test MCP implementations with tool servers. Try A2A communication between simple agents. This hands-on experience reveals infrastructure needs before production deployment.
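As a starting point for that experimentation, here is a deliberately simplified MCP-style tool dispatcher (not the official SDK; a real MCP server also implements the initialize handshake, tools/list, and the spec's message framing, which the official SDKs handle):

```python
import json

# Registered tools: name -> Python callable (the "add" tool is a demo stub).
TOOLS = {
    "add": lambda args: args["a"] + args["b"],
}

def handle(request: dict) -> dict:
    """Dispatch a JSON-RPC 2.0 tools/call request to a registered tool."""
    if request.get("method") != "tools/call":
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "method not found"}}
    params = request["params"]
    result = TOOLS[params["name"]](params["arguments"])
    return {"jsonrpc": "2.0", "id": request["id"], "result": result}

req = {"jsonrpc": "2.0", "id": 1, "method": "tools/call",
       "params": {"name": "add", "arguments": {"a": 2, "b": 3}}}
print(json.dumps(handle(req)))
```

Wiring this loop to stdin/stdout or an HTTP endpoint and pointing an MCP client at it is a quick way to surface the networking and authorization gaps discussed above before anything reaches production.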

For teams looking to accelerate their journey, platforms like Kagent Enterprise provide a ready-made foundation. The open source kagent project offers a starting point for teams wanting to build custom solutions.

The infrastructure foundation matters. AI agents need more than traditional cloud native stacks provide. Teams that address these gaps early will have an advantage in scaling agentic applications.

The future of AI in production depends on infrastructure that truly understands agents.
