
Stop Building Agent Frameworks, Start Building Agent Infrastructure

The AI agent ecosystem has dozens of orchestration frameworks and almost no infrastructure. That distinction explains why 88% of AI proof-of-concepts never reach production.

Zaher Khateeb
7 min read

There are now dozens of agent frameworks. Almost none of them help you run agents in production.

LangChain, CrewAI, Microsoft Agent Framework, Haystack, Pydantic AI, smolagents — the list grows every month. Each one helps you build agents. Define tools. Compose prompts. Wire multi-agent conversations. The building part is increasingly well-served.

What's not well-served is everything that happens after the demo.

In our last post, we argued that the gap between AI pilots and production is an infrastructure gap, not an intelligence gap. This post makes that argument concrete: what "infrastructure" actually means for agents, why frameworks can't provide it, and what needs to exist instead.

The Framework Explosion

The agent framework market is booming. LangChain gets nearly 200 million downloads per month. CrewAI's CEO claims 450 million agentic workflows a month. Microsoft consolidated AutoGen and Semantic Kernel into a unified Agent Framework, positioning both predecessors in maintenance mode. Gartner reported a 1,445% surge in multi-agent systems inquiries from Q1 2024 to Q2 2025.

Frameworks solve a genuine problem. Before LangChain, connecting an LLM to a tool required custom code for every integration. Before CrewAI, building a multi-agent team meant implementing your own delegation logic from scratch. These are real contributions that made agent development accessible to thousands of teams.

But every framework shares the same gap. They solve the building problem. They don't solve the running problem.

Take a typical agent setup: define your agents, give them roles and tools, configure a team, and run it. It works. In a notebook. On your laptop. With one user. With a single API key. What happens when you need to:

  • Enforce a per-agent token budget so one runaway doesn't consume your monthly allocation?
  • Canary deploy a new agent version to 5% of traffic before going wide?
  • Detect reasoning quality degradation before users start complaining?
  • Isolate a misbehaving agent without killing the entire workflow?
  • Attribute cost to individual agents across a multi-agent chain?
  • Failover to a backup provider automatically when your primary goes down at 2 AM?

These aren't exotic requirements. They're table stakes for any production system. And no major framework ships all of them.

The closest the ecosystem has come is LangSmith, which provides solid tracing, evaluation, and monitoring for LangChain workflows. It's a genuine step forward — and it covers observability well, with some cost tracking capabilities. Reliability, security, and deployment still require assembling your own stack from separate tools, each with its own integration surface and failure modes.

This isn't a criticism — frameworks solve a real problem. But "building agents" and "running agents" are different problems, and the industry has overwhelmingly invested in the former.

What Infrastructure Actually Means

The distinction becomes clear if you look at web development.

Frameworks help you build web applications: Rails, Django, Express, Next.js. They handle routing, templates, database queries, authentication patterns. They're indispensable for development.

Infrastructure helps you run those applications at scale: Kubernetes orchestrates deployment. Nginx handles load balancing. Datadog provides observability. Vault manages secrets. Terraform provisions resources.

Nobody ships a Rails app straight to production without infrastructure. You'd have no monitoring, no load balancing, no secret management, no deployment pipeline. The app would work on your laptop and break under real traffic.

Yet this is exactly how most agent systems ship today. Framework to build. Nothing to run.

For AI agents, infrastructure means five things:

Observability. Not just logging — distributed tracing across multi-agent chains. When Agent C produces a bad output, you need to trace back through Agent B's reasoning and Agent A's initial planning to find the root cause. Generic observability tools don't understand LLM-specific concerns: token consumption per step, prompt/completion pairs, reasoning chain visualization, or quality degradation over time.

Reliability. Not "add try/except" — circuit breakers that detect failing providers and route around them. Retries with exponential backoff that respect rate limits. Timeouts that prevent one slow agent from blocking an entire workflow. Automatic failover to alternative providers when your primary goes down.
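The pattern above can be sketched without assuming any particular provider SDK: each provider gets its own circuit breaker, calls retry with capped exponential backoff, and an open breaker routes traffic to the next provider in line. Thresholds and delays are illustrative, not recommendations.

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; skips the provider until a cooldown passes."""
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            self.opened_at = None  # half-open: allow a probe call through
            self.failures = 0
            return True
        return False

    def record(self, ok: bool):
        if ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()

def call_with_failover(providers, request):
    """Try providers in order, skipping any whose breaker is open,
    retrying each with capped exponential backoff before failing over."""
    for name, call, breaker in providers:
        if not breaker.available():
            continue
        for attempt in range(3):
            try:
                result = call(request)
                breaker.record(ok=True)
                return name, result
            except Exception:
                breaker.record(ok=False)
                time.sleep(min(2 ** attempt * 0.1, 2.0))  # 0.1s, 0.2s, 0.4s, capped
    raise RuntimeError("all providers unavailable")
```

In production the backoff would also respect the provider's rate-limit headers, and a timeout would wrap each `call` so one slow agent can't block the workflow; both are omitted to keep the sketch short.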

Cost control. LLM calls aren't free, and agents make a lot of them. Per-agent token budgets prevent a single runaway agent from consuming your entire monthly allocation overnight. Cost-aware routing sends simple tasks to cheaper models and reserves expensive ones for complex reasoning. Surge management throttles gracefully instead of failing hard.
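Both mechanisms are simple to sketch. A per-agent budget is a hard counter that trips before the monthly allocation does, and cost-aware routing is a policy that maps task size to model tier. The model names and threshold below are hypothetical.

```python
class TokenBudget:
    """Hard per-agent cap: a runaway agent trips its own budget, not the shared one."""
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def charge(self, tokens: int):
        if self.used + tokens > self.limit:
            raise RuntimeError(f"token budget exceeded: {self.used + tokens}/{self.limit}")
        self.used += tokens

def route_by_complexity(task_tokens: int, threshold: int = 1000) -> str:
    """Cost-aware routing: small tasks go to a cheap model, large ones to an
    expensive model (hypothetical tier names)."""
    return "small-model" if task_tokens < threshold else "large-model"

budget = TokenBudget(limit=50_000)
budget.charge(2_000)  # a normal reasoning step fits comfortably
model = route_by_complexity(task_tokens=300)
```

The overnight five-figure-bill scenario described later in this post is exactly what the `charge` check prevents: the loop dies with an exception at the agent's own limit instead of running until the invoice arrives.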

Security. Key rotation that works across multiple providers without downtime. Vault integration for secrets that never touch environment variables. Prompt injection defense at every entry point — not just the user-facing one, but between agents too. Sandboxed execution that limits what agents can access and do.

Deployment. Canary deploys that roll out new agent versions to 5% of traffic before going wide. Automated rollback when error rates spike. Health checks that verify agents are producing quality outputs, not just responding. SLA enforcement that guarantees response times and availability.
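The 5% canary split can be sketched with stable hash-based assignment (so a given user always hits the same version, making the canary cohort observable) plus a rollback trigger that compares the canary's error rate to baseline. Version names and the 2x tolerance are illustrative.

```python
import hashlib

def canary_route(user_id: str, canary_percent: int = 5) -> str:
    """Stable assignment: hash the user id into 100 buckets; the lowest
    `canary_percent` buckets get the new version (hypothetical version names)."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "v2-canary" if bucket < canary_percent else "v1-stable"

def should_rollback(canary_errors: int, canary_total: int,
                    baseline_rate: float, tolerance: float = 2.0) -> bool:
    """Automated rollback trigger: roll back when the canary's error rate
    exceeds `tolerance` times the stable baseline."""
    if canary_total == 0:
        return False  # no traffic yet, nothing to judge
    return (canary_errors / canary_total) > baseline_rate * tolerance
```

A production version would gate on quality metrics as well as raw errors, per the health-check point above, but the routing and trigger shape stays the same.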

Each of these is a solved problem in traditional infrastructure. None are comprehensively solved for agents — point solutions exist for individual pieces, but nobody has unified them into a single infrastructure layer.

Why Most Agent Systems Fail in Production

We cited the numbers in our first post: IDC found that 88% of AI proof-of-concepts fail to reach production, and MIT NANDA found only 5% achieved meaningful revenue impact. These aren't just model problems. Organizational issues play a role — misaligned spending, poor integration, the learning curve. But infrastructure is a consistent thread: the gap between what works in a pilot and what survives in production.

The pattern repeats. Team builds an impressive demo. Stakeholders approve production rollout. Then the team discovers what it actually takes to go live:

  • An observability pipeline
  • A reliability layer with retries and failover
  • A cost management system
  • A security layer with key rotation and injection defense
  • A deployment pipeline with rollback

For teams without existing infrastructure, this adds up to months of work before a single agent serves a real user. Some teams push through it. Most don't — the project loses momentum, the budget runs out, or the stakeholders move on to the next initiative.

The infrastructure work is invisible in demos. Nobody asks "how do you rotate API keys?" during a proof-of-concept. But it dominates real deployments. The failure modes are specific and predictable:

No observability — Agent chain fails intermittently. Logs show the final error but not the upstream cause. Team spends days reproducing issues they can't trace.

No reliability — One provider timeout cascades through the entire workflow. Users see failures that look random but are actually correlated to a single flaky dependency.

No cost control — A reasoning loop runs unchecked overnight on a single workflow. The team discovers a five-figure token bill the next morning.

No key rotation — Rate limits hit, the system falls over, and there's no automated recovery path. Someone manually swaps keys at 2 AM.

These aren't hypothetical. They're the specific failure modes that kill agent projects between demo and production.

The Kubernetes Parallel

Before Kubernetes launched in 2014, every team deploying microservices built their own container orchestration. Their own service discovery. Their own health checks. Their own scaling logic. It was a massive duplication of effort — infrastructure that had nothing to do with the team's actual product.

Kubernetes didn't make microservices possible. Microservices existed before it. What Kubernetes did was make microservices practical at scale by providing a standard infrastructure layer. Teams stopped building orchestration plumbing and started building services.

We're in the "before Kubernetes" era for AI agents.

Every team that gets serious about production agents builds their own observability pipeline. Their own retry logic. Their own cost tracking. Their own key rotation. Their own deployment scripts. It's the same duplication of effort, the same infrastructure that has nothing to do with agent intelligence.

The infrastructure layer that lets teams focus on agent logic instead of operational plumbing doesn't widely exist yet. That's the gap. That's what AgentiCraft is building — the observability, reliability, cost control, security, and deployment layer that every production agent system needs and almost none currently have.

What Good Agent Infrastructure Looks Like

Not a feature checklist — a design philosophy.

Production-first. Not "add monitoring later." Observability, security, and reliability from the first line of agent code. If you have to bolt infrastructure on after the fact, you'll skip it under deadline pressure — and you'll pay for it in production.

Agent-aware. Generic infrastructure doesn't understand LLM-specific concerns. Datadog can track HTTP latency but not token consumption per reasoning step. PagerDuty can alert on error rates but not on reasoning quality degradation. Agent infrastructure must understand prompts, completions, context windows, token budgets, and the non-deterministic nature of LLM outputs.

Unified, not assembled. The current path to production agent infrastructure is stitching together five or six separate tools — a tracing service, a secrets manager, a cost dashboard, a deployment platform — each with its own integration surface, authentication model, and failure modes. The individual tools exist. The integration doesn't. Infrastructure should be a single layer, not a supply chain.

Opinionated defaults, flexible overrides. Ship with sane defaults — circuit breaker thresholds, retry policies, cost limits, security rules — that work for 80% of use cases. Let teams customize everything for the other 20%. Zero-configuration to start, full control when you need it.
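One way the defaults-plus-overrides shape might look in code: a frozen config object whose fields all carry sane defaults, where a team overrides only what differs. The field names and values below are illustrative, not a real product API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class InfraDefaults:
    """Zero-config defaults that cover the common case (illustrative values)."""
    breaker_max_failures: int = 3
    retry_attempts: int = 3
    retry_base_delay_s: float = 0.1
    per_agent_token_budget: int = 50_000
    canary_percent: int = 5

# Zero configuration to start:
config = InfraDefaults()

# Full control when you need it: override only what differs.
high_volume = replace(config, per_agent_token_budget=500_000, canary_percent=1)
```

The frozen dataclass makes the trade-off explicit: defaults are immutable and shared, and every deviation is a named, visible override rather than a mutated global.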

When Frameworks Are Enough

If you're prototyping, exploring, or building a single-agent tool, a framework is all you need.

We said in our first post that LangChain has us beat on tool breadth and CrewAI is genuinely easier to start with. That hasn't changed. Microsoft's Agent Framework gives Azure and .NET teams a native path into agent development. These are the right tools for their use cases.

The infrastructure gap only matters when you need agents to run reliably, at scale, with real SLAs and real cost constraints. Most teams won't need infrastructure until they're past the prototype stage. That's fine.

The mistake isn't using a framework. The mistake is reaching production without infrastructure and then trying to bolt it on after the fact. That's where the 88% failure rate lives — in the gap between "it works in a notebook" and "it works at 2 AM on a Saturday when the primary provider goes down."

What Needs to Exist

The agent ecosystem's bottleneck isn't intelligence. The models are good enough. The frameworks are mature enough. What's missing is the operational layer — the infrastructure that makes agent systems production-grade.

The teams that succeed won't be the ones with better prompts or more creative agent architectures. They'll be the ones with better observability, reliability, cost control, and deployment. The ones who solved the boring infrastructure problems before they became 2 AM emergencies.

Next post: the coordination patterns that make multi-agent systems work at scale — and why picking the wrong one is the most expensive architectural decision you'll make.

If this resonates, join the waitlist or follow along.

The 51st framework won't close the production gap. Infrastructure will.

Zaher Khateeb, Founder & CTO at AgentiCraft

Building the infrastructure layer between AI agent logic and production. Distributed systems, multi-agent coordination, and making unreliable components work together reliably at scale.
