Your LLM Calls Are One Outage Away from Silence
14 providers, one interface. agenticraft-llm adds cost-aware routing, health-based key rotation, and failure-type-aware circuit breakers between your agents and the LLM APIs they depend on.
Your agent works perfectly. Then OpenAI returns a 429 for sixty seconds and your entire pipeline goes silent.
The fix isn't switching providers. It's building the layer between your agent code and the providers — the layer that handles key rotation, cost optimization, failure recovery, and rate limiting so your agents never have to think about it.
Today we're releasing agenticraft-llm, a standalone Python package for production-grade LLM provider management. It's the provider layer we built for AgentiCraft, extracted into a package anyone can use.
uv add agenticraft-llm
Or with pip:
pip install agenticraft-llm
Python 3.10+. 14 providers. Zero platform dependencies. GitHub · Docs
The Problem Nobody Talks About
Every LLM tutorial shows the same pattern: import the SDK, pass your API key, call complete(). It works in notebooks. It works in demos. Then you deploy it.
In production, three things happen that tutorials never cover:
Your key hits a rate limit. OpenAI enforces per-minute and per-day limits — separate counters, separate ceilings. Anthropic has request-per-minute and token-per-minute limits that interact in non-obvious ways. One rate-limited key stalls an entire multi-agent pipeline within seconds, because retry logic hammers the same key harder.
Your provider goes down. Not often, but when it does, your agents don't degrade gracefully — they crash. There's no automatic failover, no circuit breaker, no health monitoring. You find out from your users.
Your costs spiral. Your frontier provider costs 100x more per token than your cheapest option. Most requests don't need the top tier. But without intelligent routing, every request goes to the same provider — either the most expensive one (burning money) or the cheapest one (shipping garbage).
These aren't agent problems. They're infrastructure problems. And they need an infrastructure solution.
What's Inside
agenticraft-llm is a reliability layer between your application and your LLM providers. One interface, four components, each solving a specific production failure mode.
14 Providers, One Interface
Every provider implements the same base class. Same method signatures. Same response types. Same exception hierarchy. Switch providers by changing a string, not rewriting your code.
from agenticraft_llm import OpenAIProvider, AnthropicProvider
# Same interface, any provider
openai = OpenAIProvider(api_key="sk-...")
anthropic = AnthropicProvider(api_key="sk-ant-...")
# Identical call pattern
response = await openai.complete(
messages=[{"role": "user", "content": "Hello"}],
model="gpt-5.4",
)
# Same response type — content, usage, cost, latency
print(response.content) # "Hello! How can I help?"
print(response.usage) # TokenUsage(prompt_tokens=8, completion_tokens=12)
print(response.latency_ms) # 423.5
Supported providers: OpenAI, Anthropic, Google, Mistral, Azure OpenAI, Ollama, Fireworks, Cerebras, DeepSeek, xAI, Cohere, OpenRouter, Perplexity, and Nebius. Each ships with built-in pricing tables for automatic cost calculation, capability declarations, and provider-specific SDK exception mapping.
Optional Dependencies
Provider SDKs are optional extras. Install only what you use: uv add agenticraft-llm[openai,anthropic] for two providers, or agenticraft-llm[all] for everything.
Cost-Aware Routing
The cost gap across providers is orders of magnitude — not a rounding error. A request on your frontier provider might cost $0.20. The same request on a self-hosted model costs nothing beyond infrastructure. For many tasks, the quality difference is negligible.
The router scores every eligible provider on three dimensions and picks the highest-scoring one that passes all hard constraints:
from agenticraft_llm import CostAwareRouter, CostRouterConfig
router = CostAwareRouter(CostRouterConfig(
cost_weight=0.5, # Cheaper providers score higher
quality_weight=0.35, # Observed accuracy per task type
latency_weight=0.15, # Response time matters
explore_mode=True, # Enable Thompson sampling
))
await router.initialize()
decision = await router.select_with_constraints(
providers=["openai", "anthropic", "google"],
input_tokens=500,
output_tokens=500,
max_cost=0.10,
quality_threshold=0.8,
)
print(decision.selected_provider) # "anthropic"
print(decision.estimated_cost) # 0.0054
print(decision.alternatives) # [{provider: "openai", ...}, ...]
Quality scores aren't static. The router maintains a Bayesian posterior for each provider — a Beta distribution that updates after every response. Good results push the posterior up. Failures push it down. Over time, the estimate converges to reality. When a provider's quality drifts, the router adapts automatically.
This means new providers get tested without manual intervention. A new provider starts with a uniform prior. The system allocates a small fraction of traffic until the quality estimate stabilizes. If it performs well, traffic shifts toward it. If not, traffic stays with proven providers.
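The Beta-posterior mechanics are simple enough to sketch on their own. The class below is an illustrative model of the idea, not agenticraft-llm's internal code — every name in it is made up:

```python
import random

class ProviderPosterior:
    """Beta(alpha, beta) posterior over a provider's quality.

    Starts from a uniform prior Beta(1, 1); each observed outcome
    shifts the distribution toward the provider's true quality.
    """

    def __init__(self) -> None:
        self.alpha = 1.0  # pseudo-count of good outcomes
        self.beta = 1.0   # pseudo-count of bad outcomes

    def update(self, success: bool) -> None:
        if success:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        # Posterior mean quality estimate
        return self.alpha / (self.alpha + self.beta)

    def sample(self) -> float:
        # Thompson sampling: draw a plausible quality value; providers
        # with wide (uncertain) posteriors sometimes draw high and get explored
        return random.betavariate(self.alpha, self.beta)

def thompson_select(posteriors: dict[str, ProviderPosterior]) -> str:
    # Route to whichever provider drew the highest sampled quality
    return max(posteriors, key=lambda name: posteriors[name].sample())
```

This is why a new provider with a uniform prior gets a trickle of traffic automatically: its wide posterior occasionally samples above the incumbents, and each result narrows the estimate.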
Key Pool Management
A single API key is a single point of failure. agenticraft-llm manages pools of keys per provider with five rotation strategies — from simple round-robin to adaptive health-based selection.
from agenticraft_llm import OpenAIProvider
# Three keys, adaptive rotation
provider = OpenAIProvider(
api_keys=["sk-key1", "sk-key2", "sk-key3"],
key_rotation_strategy="adaptive",
)
# Each call automatically rotates to the healthiest key
for task in tasks:
    response = await provider.complete(messages=task.messages)
The adaptive strategy tracks four signals per key: success rate (40% weight), consecutive errors (30%), rate limit recency (20%), and latency (10%). When a key degrades, it receives less traffic automatically. When it recovers, traffic gradually returns.
Keys cycle through five states — ACTIVE, RATE_LIMITED, COOLING_DOWN, DISABLED, EXHAUSTED — with automatic transitions based on observed behavior. A key that hits three consecutive errors enters cooldown. A rate-limited key waits for the retry-after window. An exhausted key triggers provider-level fallback.
Circuit Breakers
Not all failures are equal. A rate limit (429) is not a crash (500), which is not an auth error (401). A generic circuit breaker counts all three the same. agenticraft-llm's circuit breakers understand the difference.
from agenticraft_llm import CircuitBreaker, CircuitBreakerConfig
breaker = CircuitBreaker(
name="openai",
config=CircuitBreakerConfig(
failure_threshold=5, # Open after 5 server errors
recovery_timeout=60.0, # Probe after 60 seconds
success_threshold=2, # Close after 2 successful probes
),
)
# Wraps any async call with automatic state tracking
response = await breaker.call(provider.complete, messages=messages, model="gpt-5.4")
Rate limits don't increment the failure counter — the provider is healthy, you're just sending too much traffic. Auth errors trigger immediate key rotation instead of retries. Only server errors and connection failures count toward the threshold.
Different providers get different defaults. Ollama runs locally, so failures likely mean the host is down — low threshold, short recovery. Anthropic gets a longer request timeout because complex reasoning tasks routinely take longer. These are starting defaults; tune them based on your observed traffic.
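The failure-type-aware state machine is worth seeing in miniature. This sketch uses the classic CLOSED/OPEN/HALF_OPEN states and HTTP status codes to stand in for the exception hierarchy — the class name and internals are illustrative, not the package's:

```python
from enum import Enum

class State(Enum):
    CLOSED = "closed"        # normal traffic
    OPEN = "open"            # rejecting calls until the recovery timeout elapses
    HALF_OPEN = "half_open"  # letting trial probes through

class SketchBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, success_threshold=2):
        self.state = State.CLOSED
        self.failures = 0
        self.probe_successes = 0
        self.opened_at = 0.0
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.success_threshold = success_threshold

    def allow(self, now: float) -> bool:
        if self.state is State.OPEN and now - self.opened_at >= self.recovery_timeout:
            self.state = State.HALF_OPEN  # time to probe the provider again
            self.probe_successes = 0
        return self.state is not State.OPEN

    def record(self, status: int, now: float) -> None:
        if status in (429, 401):
            return  # rate limits and auth errors belong to key rotation, not the breaker
        if status >= 500:
            self.failures += 1
            if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = State.OPEN
                self.opened_at = now
            return
        if self.state is State.HALF_OPEN:  # a successful probe
            self.probe_successes += 1
            if self.probe_successes >= self.success_threshold:
                self.state = State.CLOSED
                self.failures = 0
        else:
            self.failures = 0  # any success in CLOSED resets the error streak
```

The key move is the first branch of record(): a 429 or 401 returns without touching the counter, so only genuine server-side failures can open the circuit.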
Rate Limiting
Token bucket rate limiting stays below provider quotas proactively, instead of reacting to 429s after the fact.
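The token-bucket model behind this fits in a few lines. The class below is a simplified illustration of the algorithm, not the package's limiter:

```python
class TokenBucket:
    """Refills at `rate` tokens per second, capped at `capacity` (the burst size)."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full: an initial burst is allowed
        self.last = 0.0

    def try_acquire(self, now: float) -> bool:
        # Refill proportionally to elapsed time, never past capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # wait and retry rather than send a request destined for a 429
```

A config of 200 calls per 60 seconds with a burst of 20 corresponds to a bucket refilling at roughly 3.33 tokens per second with capacity 20.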
from agenticraft_llm import OpenAIProvider
from agenticraft_llm.resilience import RateLimitConfig
provider = OpenAIProvider(
api_key="sk-...",
enable_rate_limiter=True,
rate_limiter_config=RateLimitConfig(
max_calls=200, # 200 calls per window
time_window=60.0, # per 60 seconds
burst_size=20, # Allow bursts of 20
),
)
How It Composes
These four components aren't independent features. They compose into a single resilience stack with a feedback loop:
- Router selects the optimal provider based on cost, quality, and health
- Circuit breaker checks if that provider is healthy — if OPEN, the router picks the next best
- Key pool selects the healthiest key for the chosen provider
- Rate limiter ensures the request stays within quota before it's sent
- After the response, the outcome feeds back into the router's quality posterior, the circuit breaker's state machine, and the key pool's health scores
The feedback loop is the critical piece. Every LLM call generates a signal. That signal improves the next routing decision. Without it, you're flying blind. With it, the system self-heals — unhealthy providers get less traffic, which helps them recover, which the system detects, which restores traffic. No manual intervention. No 2 AM pager alerts.
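Read as pseudocode, one request's path through the stack looks like this. Every interface below is hypothetical — a sketch of the flow, not agenticraft-llm's actual API:

```python
def send_request(router, breakers, key_pools, limiters, providers, request):
    """One pass through the stack: route, gate, pick a key, throttle, call, learn."""
    for name in router.rank(request):           # 1. best provider first
        if not breakers[name].allow():          # 2. skip providers with an open breaker
            continue
        key = key_pools[name].healthiest()      # 3. healthiest key in the pool
        limiters[name].acquire()                # 4. stay under quota before sending
        try:
            response = providers[name].complete(request, key=key)
        except Exception:
            breakers[name].record_failure()     # 5a. feedback: breaker and key health
            key_pools[name].penalize(key)
            continue                            #     fall through to the next provider
        router.observe(name, response)          # 5b. feedback: quality posterior
        breakers[name].record_success()
        key_pools[name].reward(key)
        return response
    raise RuntimeError("all providers unavailable")
```

Steps 5a and 5b are the feedback loop: the same call that serves the user also updates every component's picture of provider health.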
A Practical Example
Let's wire it up end-to-end. Three providers, automatic failover, cost optimization:
from agenticraft_llm import (
CostAwareRouter, CostRouterConfig,
OpenAIProvider, AnthropicProvider, GoogleProvider,
CircuitBreaker, CircuitBreakerConfig,
)
# Initialize providers with key pools and circuit breakers
openai = OpenAIProvider(
api_keys=["sk-key1", "sk-key2"],
key_rotation_strategy="adaptive",
enable_circuit_breaker=True,
)
anthropic = AnthropicProvider(
api_key="sk-ant-...",
enable_circuit_breaker=True,
)
google = GoogleProvider(
api_key="...",
enable_circuit_breaker=True,
)
# Cost-aware routing across all three
router = CostAwareRouter(CostRouterConfig(
cost_weight=0.5,
quality_weight=0.35,
latency_weight=0.15,
explore_mode=True,
fallback_provider="anthropic",
))
await router.initialize()
# Route a request
decision = await router.select_with_constraints(
providers=["openai", "anthropic", "google"],
input_tokens=500,
output_tokens=1000,
quality_threshold=0.85,
)
# Execute against the selected provider
provider = {"openai": openai, "anthropic": anthropic, "google": google}[
decision.selected_provider
]
response = await provider.complete(
messages=[{"role": "user", "content": "Analyze this dataset..."}],
)
# Update the router with the result
router.update_posterior(
decision.selected_provider,
success=True,
quality_score=0.92,
)
If OpenAI's circuit breaker is OPEN, the router automatically excludes it and selects the next best provider. If all of OpenAI's keys are rate-limited, the key pool reports exhaustion and the circuit breaker records the failure. The system adapts in real time, request by request.
The OpenAI-Compatible Gateway
If you want a drop-in replacement for the OpenAI API that routes across multiple providers, agenticraft-llm ships with a FastAPI gateway:
from agenticraft_llm.gateway import create_gateway_app
app = create_gateway_app(
providers={"openai": openai, "anthropic": anthropic},
api_key="your-gateway-key",
)
# Start: uvicorn app:app --port 8000
# Use: point any OpenAI SDK client at http://localhost:8000
The gateway exposes POST /v1/chat/completions and GET /v1/models — the same endpoints the OpenAI SDK expects. Behind the scenes, it routes through the full reliability stack. Your existing code doesn't change. Your reliability does.
What the Tutorials Don't Tell You
Cost attribution matters more than aggregate tracking. Knowing you spent $50,000 on LLM calls last month tells you nothing actionable. Knowing that Agent X's reasoning step costs $0.12 per request and accounts for 60% of your total spend tells you exactly where to optimize. agenticraft-llm calculates cost per response automatically from built-in provider pricing tables.
Key exhaustion cascades are real. When one key dies, the remaining keys absorb its traffic. If you had five keys each handling 20% of load, losing one puts 25% on each survivor. Lose two, and the remaining three each handle 33% — pushing them past their own limits. The adaptive strategy prevents this by monitoring pool utilization, not just individual key health.
Quality varies by task type. A provider that scores 0.82 on average might score 0.95 on extraction tasks and 0.60 on multi-step reasoning. The router's Bayesian posteriors track this at the provider level; future updates will support per-task-type quality tracking for even more precise routing.
Provider pricing changes frequently. Major providers adjust pricing multiple times per year. agenticraft-llm ships with built-in pricing tables that update with each release, and supports custom overrides for enterprise agreements.
Where This Fits
agenticraft-llm is standalone. Install it, use it with any agent framework — LangChain, CrewAI, AutoGen, or your own. It runs wherever Python runs.
Inside the AgentiCraft platform, the package powers the LLM service layer — every agent call flows through the router, circuit breakers, and key pools. But you don't need the platform to use the reliability stack.
The package ships with 89+ tests across the full API surface. The documentation covers every module with API reference and guides — including deep dives on cost-aware routing, multi-key rotation, and failure-type-aware circuit breakers.
The hardest production failures in LLM-powered systems aren't model failures. They're infrastructure failures — rate limits, provider outages, cost spikes, key exhaustion. agenticraft-llm puts a reliability layer between your agents and the APIs they depend on, so a provider having a bad day doesn't mean your system has one too.
The GitHub repo has everything you need to get started. If you're building production multi-agent systems, join the waitlist for early access to the full AgentiCraft platform.