Production Deployment

Checklists, monitoring, scaling strategies, and operational best practices for production AgentiCraft deployments.

Pre-Deployment Checklist

Before deploying to production, verify:

Provider Configuration

  • API keys stored in a secrets manager (not environment variables or code)
  • Key pool configured with multiple keys per provider for redundancy
  • Circuit breaker enabled on all providers with appropriate thresholds
  • Rate limits set below provider quotas to avoid hard rejections
  • Fallback providers configured for critical paths

Resilience

  • Circuit breaker thresholds tuned to your traffic patterns
  • Retry policies configured with exponential backoff and jitter
  • Timeout values set on all LLM calls (recommended: 30s for completions, 60s for long-form)
  • Fallback routing tested — verify traffic shifts when a provider goes down
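The retry item above can be sketched as exponential backoff with full jitter. The `retry_with_backoff` helper and its parameters are illustrative, not part of the agenticraft-llm API:

```python
import random
import time

def retry_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=30.0):
    """Retry `call` with exponential backoff and full jitter (illustrative helper)."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Full jitter: sleep a random amount up to the capped exponential delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(random.uniform(0, delay))

# Example: a call that fails twice, then succeeds.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("provider timeout")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # -> ok
```

Full jitter (random delay in `[0, cap]`) spreads retries out so a burst of failures does not retry in lockstep against an already-struggling provider.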

Monitoring

  • Request latency tracked per provider
  • Error rates tracked per provider and error type
  • Token usage tracked for cost management
  • Circuit breaker state changes logged
  • Key rotation events logged
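The metrics above can be kept in a small per-provider tracker. This in-process sketch is illustrative; a production deployment would export the same counters to a metrics backend such as Prometheus or OpenTelemetry:

```python
from collections import defaultdict

class ProviderMetrics:
    """Minimal in-process metrics sketch (illustrative, not agenticraft-llm's API)."""
    def __init__(self):
        self.latencies = defaultdict(list)                   # provider -> [seconds]
        self.errors = defaultdict(lambda: defaultdict(int))  # provider -> error type -> count
        self.tokens = defaultdict(int)                       # provider -> total tokens

    def record_success(self, provider, latency_s, tokens):
        self.latencies[provider].append(latency_s)
        self.tokens[provider] += tokens

    def record_error(self, provider, error_type):
        self.errors[provider][error_type] += 1

    def p50_latency(self, provider):
        samples = sorted(self.latencies[provider])
        return samples[len(samples) // 2] if samples else None

metrics = ProviderMetrics()
metrics.record_success("openai", 0.42, tokens=1200)
metrics.record_success("openai", 0.38, tokens=900)
metrics.record_error("anthropic", "rate_limit")
print(metrics.p50_latency("openai"), metrics.tokens["openai"])  # -> 0.42 2100
```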

Multi-Provider Resilience

Circuit Breaker Configuration

from agenticraft_llm import CircuitBreaker
 
breaker = CircuitBreaker(
    failure_threshold=5,     # Open after 5 failures
    recovery_timeout=30.0,   # Try recovery after 30s
    half_open_max=2,         # Allow 2 test requests in half-open
)
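To see how these three parameters interact, here is a minimal closed → open → half-open state machine. It illustrates the semantics only; it is not agenticraft-llm's implementation:

```python
import time

class SimpleBreaker:
    """Illustrative circuit-breaker state machine."""
    def __init__(self, failure_threshold=5, recovery_timeout=30.0, half_open_max=2):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0
        self.half_open_inflight = 0

    def allow_request(self):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half_open"  # Timeout elapsed: start probing for recovery
                self.half_open_inflight = 0
            else:
                return False              # Still open: fail fast
        if self.state == "half_open":
            if self.half_open_inflight >= self.half_open_max:
                return False              # Only half_open_max probes at a time
            self.half_open_inflight += 1
        return True

    def record_failure(self):
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"           # Trip (or re-trip after a failed probe)
            self.opened_at = time.monotonic()
            self.failures = 0

    def record_success(self):
        if self.state == "half_open":
            self.state = "closed"         # Probe succeeded: resume normal traffic
        self.failures = 0
```

While open, requests fail fast instead of piling onto a struggling provider; after `recovery_timeout`, a few probe requests decide whether to close again.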

Fallback Chain

Configure providers in priority order. Traffic automatically shifts on failure:

from agenticraft_llm import CostAwareRouter, LLMProviderConfig
 
router = CostAwareRouter(
    providers=[
        LLMProviderConfig(provider="openai", model="gpt-5.4"),
        LLMProviderConfig(provider="anthropic", model="claude-sonnet-4-6"),
        LLMProviderConfig(provider="google", model="gemini-3.1-pro-preview"),
    ],
    circuit_breaker=breaker,
)
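The failover behavior amounts to walking the priority list until one provider answers. The helper below is an illustrative sketch of that loop, not the router's internals:

```python
class AllProvidersFailed(Exception):
    pass

def complete_with_fallback(providers, call):
    """Try providers in priority order; fall through to the next on failure.
    `providers` is a list of names and `call(provider)` does the real request
    (both illustrative)."""
    errors = {}
    for provider in providers:
        try:
            return call(provider)
        except Exception as exc:
            errors[provider] = exc  # Record the failure and try the next provider
    raise AllProvidersFailed(errors)

def fake_call(provider):
    if provider == "openai":
        raise TimeoutError("simulated outage")
    return f"{provider}: response"

print(complete_with_fallback(["openai", "anthropic", "google"], fake_call))
# -> anthropic: response
```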

Cost Management

Token Budget Tracking

Monitor token usage per provider to control costs:

# Check usage after each call (the $0.00001/token rate below is illustrative;
# real pricing varies by model and by input vs. output tokens)
response = await router.complete(messages=messages)
print(f"Tokens: {response.usage.total_tokens}")
print(f"Estimated cost: ${response.usage.total_tokens * 0.00001:.4f}")
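To enforce a hard limit rather than just observe usage, a simple accumulator can be layered on top of these calls. The `TokenBudget` class and flat per-token rate are illustrative, not part of agenticraft-llm:

```python
class BudgetExceeded(Exception):
    pass

class TokenBudget:
    """Illustrative token budget; real deployments would persist the counter
    and price each model's input/output tokens separately."""
    def __init__(self, max_tokens, usd_per_token=0.00001):
        self.max_tokens = max_tokens
        self.usd_per_token = usd_per_token
        self.used = 0

    def charge(self, tokens):
        if self.used + tokens > self.max_tokens:
            raise BudgetExceeded(f"{self.used + tokens} > {self.max_tokens} tokens")
        self.used += tokens

    @property
    def spent_usd(self):
        return self.used * self.usd_per_token

budget = TokenBudget(max_tokens=1_000_000)
budget.charge(12_500)              # e.g. response.usage.total_tokens
print(f"${budget.spent_usd:.4f}")  # -> $0.1250
```

Charging before dispatching a request (using an estimate) turns the budget into a circuit of its own: once exhausted, calls fail fast instead of silently accruing cost.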

Cost-Aware Routing

The router uses Thompson sampling to balance cost, latency, and quality. Configure the relative weights:

router = CostAwareRouter(
    providers=[...],
    cost_weight=0.7,      # Prioritize cost savings
    latency_weight=0.2,   # Some latency sensitivity
    quality_weight=0.1,   # Minimal quality differentiation
)
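The idea behind Thompson sampling here: keep a Beta posterior over each provider's success rate, draw a sample from each, and scale the draw by a weighted cost/latency/quality score. The sketch below is a simplified illustration of that idea; the names, score fields, and scoring function are assumptions, not the router's actual implementation:

```python
import random

def thompson_pick(stats, weights, rng=random):
    """Pick a provider by sampling Beta(successes+1, failures+1) per provider
    and scaling by a weighted composite score. Illustrative only."""
    best, best_score = None, -1.0
    for name, s in stats.items():
        # Posterior draw: uncertain providers occasionally win, so the
        # router keeps exploring instead of locking onto one choice.
        reliability = rng.betavariate(s["successes"] + 1, s["failures"] + 1)
        composite = (weights["cost"] * s["cheapness"]
                     + weights["latency"] * s["speed"]
                     + weights["quality"] * s["quality"])
        score = reliability * composite
        if score > best_score:
            best, best_score = name, score
    return best

stats = {
    "openai":    {"successes": 90, "failures": 10, "cheapness": 0.4, "speed": 0.7, "quality": 0.9},
    "anthropic": {"successes": 95, "failures": 5,  "cheapness": 0.5, "speed": 0.6, "quality": 0.9},
}
weights = {"cost": 0.7, "latency": 0.2, "quality": 0.1}
print(thompson_pick(stats, weights))
```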

Scaling Considerations

Key Pool Sizing

Rule of thumb: configure at least 3 keys per provider for production workloads. This provides:

  • Redundancy if a key is revoked or rate-limited
  • Better rate limit distribution across keys
  • Zero-downtime key rotation
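The benefit of a multi-key pool shows up in a minimal rotation sketch: disabling one key leaves the others serving traffic. The `KeyPool` class is illustrative; agenticraft-llm manages key pools internally:

```python
class KeyPool:
    """Round-robin over active keys (illustrative sketch)."""
    def __init__(self, keys):
        self.active = list(keys)

    def disable(self, key):
        self.active.remove(key)  # e.g. on a 401 or sustained 429s
        if not self.active:
            raise RuntimeError("no usable keys left")

    def next_key(self):
        key = self.active.pop(0)  # Rotate: take from the front, re-queue at the back
        self.active.append(key)
        return key

pool = KeyPool(["sk-key-1", "sk-key-2", "sk-key-3"])
print(pool.next_key())  # -> sk-key-1
pool.disable("sk-key-2")
print(pool.next_key(), pool.next_key())  # remaining keys keep rotating
```

With three or more keys, rotation is also zero-downtime: add the new key, wait for it to take traffic, then disable the old one.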

Connection Pooling

agenticraft-llm uses httpx with connection pooling by default. For high-throughput scenarios:

config = LLMProviderConfig(
    provider="openai",
    model="gpt-5-mini",
    max_connections=20,    # Default: 10
    timeout=30.0,          # Default: 60.0
)

Next Steps