Production Deployment
Checklists, monitoring, scaling strategies, and operational best practices for production AgentiCraft deployments.
Pre-Deployment Checklist
Before deploying to production, verify:
Provider Configuration
- API keys stored in a secrets manager (not environment variables or code)
- Key pool configured with multiple keys per provider for redundancy
- Circuit breaker enabled on all providers with appropriate thresholds
- Rate limits set below provider quotas to avoid hard rejections
- Fallback providers configured for critical paths
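The first two checklist items can be enforced at startup. A minimal sketch that validates a secrets-manager payload holding per-provider key lists and rejects any provider without redundancy — the payload shape and function name are illustrative assumptions, not an AgentiCraft contract:

```python
import json

def load_provider_keys(secret_string):
    """Validate a secrets-manager payload of API keys per provider.

    Assumed payload shape (illustrative only):
        {"openai": ["sk-a", "sk-b"], "anthropic": ["sk-c", "sk-d"]}
    """
    keys = json.loads(secret_string)
    for provider, provider_keys in keys.items():
        # Enforce the redundancy item from the checklist above
        if len(provider_keys) < 2:
            raise ValueError(f"{provider}: need at least 2 keys for redundancy")
    return keys

payload = '{"openai": ["sk-a", "sk-b", "sk-c"], "anthropic": ["sk-d", "sk-e"]}'
keys = load_provider_keys(payload)
print(sorted(keys))  # ['anthropic', 'openai']
```

Failing fast here is deliberate: a deployment missing its redundant keys should refuse to boot rather than degrade silently later.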
Resilience
- Circuit breaker thresholds tuned to your traffic patterns
- Retry policies configured with exponential backoff and jitter
- Timeout values set on all LLM calls (recommended: 30s for completions, 60s for long-form)
- Fallback routing tested — verify traffic shifts when a provider goes down
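The retry item above is easy to get wrong. A minimal sketch of exponential backoff with full jitter — the function name and defaults are illustrative, not the agenticraft-llm retry API:

```python
import random

def backoff_delays(attempts=5, base=0.5, cap=30.0, rng=random.random):
    """Exponential backoff with full jitter:
    delay_n = uniform(0, min(cap, base * 2**n))."""
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * (2 ** n))
        delays.append(rng() * ceiling)
    return delays

# Jittered delays desynchronize clients after a shared provider outage,
# avoiding a synchronized retry stampede.
print(backoff_delays(attempts=3))
```

Full jitter (uniform over the whole window) spreads retries better than fixed exponential delays, at the cost of occasionally retrying sooner than the nominal backoff.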
Monitoring
- Request latency tracked per provider
- Error rates tracked per provider and error type
- Token usage tracked for cost management
- Circuit breaker state changes logged
- Key rotation events logged
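One way to cover the per-provider tracking items above in-process, before exporting to a metrics backend — a sketch; the class and method names are hypothetical, not part of agenticraft-llm:

```python
from collections import defaultdict

class ProviderMetrics:
    """Track latency, error rates, and token usage per provider (sketch)."""
    def __init__(self):
        self.requests = defaultdict(int)
        self.latencies = defaultdict(list)            # seconds, per provider
        self.errors = defaultdict(lambda: defaultdict(int))  # provider -> error type -> count
        self.tokens = defaultdict(int)

    def record(self, provider, latency_s, tokens=0, error=None):
        self.requests[provider] += 1
        self.latencies[provider].append(latency_s)
        self.tokens[provider] += tokens
        if error:
            self.errors[provider][error] += 1

    def error_rate(self, provider):
        if self.requests[provider] == 0:
            return 0.0
        return sum(self.errors[provider].values()) / self.requests[provider]

m = ProviderMetrics()
m.record("openai", latency_s=0.8, tokens=1200)
m.record("openai", latency_s=2.1, error="timeout")
print(f"openai error rate: {m.error_rate('openai'):.0%}")  # 50%
```

Keeping errors keyed by type matters operationally: a spike in `timeout` errors points at a slow provider, while a spike in auth errors points at key rotation gone wrong.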
Multi-Provider Resilience
Circuit Breaker Configuration
from agenticraft_llm import CircuitBreaker

breaker = CircuitBreaker(
    failure_threshold=5,    # Open after 5 consecutive failures
    recovery_timeout=30.0,  # Try recovery after 30s
    half_open_max=2,        # Allow 2 test requests in half-open state
)

Fallback Chain
Configure providers in priority order. Traffic automatically shifts on failure:
from agenticraft_llm import CostAwareRouter, LLMProviderConfig

router = CostAwareRouter(
    providers=[
        LLMProviderConfig(provider="openai", model="gpt-5.4"),
        LLMProviderConfig(provider="anthropic", model="claude-sonnet-4-6"),
        LLMProviderConfig(provider="google", model="gemini-3.1-pro-preview"),
    ],
    circuit_breaker=breaker,
)

Cost Management
Token Budget Tracking
Monitor token usage per provider to control costs:
# Check usage after each call
response = await router.complete(messages=messages)
print(f"Tokens: {response.usage.total_tokens}")
# Placeholder flat rate — substitute your provider's actual per-token pricing
print(f"Estimated cost: ${response.usage.total_tokens * 0.00001:.4f}")

Cost-Aware Routing
The router uses Thompson sampling to balance cost vs. quality. Configure cost weights:
router = CostAwareRouter(
    providers=[...],
    cost_weight=0.7,     # Prioritize cost savings
    latency_weight=0.2,  # Some latency sensitivity
    quality_weight=0.1,  # Minimal quality differentiation
)

Scaling Considerations
Key Pool Sizing
Rule of thumb: configure at least 3 keys per provider for production workloads. This provides:
- Redundancy if a key is revoked or rate-limited
- Better rate limit distribution across keys
- Zero-downtime key rotation
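A round-robin pool with in-place rotation illustrates why three keys enable zero-downtime rotation: the remaining keys keep serving while one is swapped. This is a sketch, not the shipped agenticraft-llm key pool:

```python
class KeyPool:
    """Minimal round-robin key pool (illustrative implementation)."""
    def __init__(self, keys):
        if len(keys) < 3:
            raise ValueError("rule of thumb: at least 3 keys in production")
        self._keys = list(keys)
        self._i = 0

    def next_key(self):
        key = self._keys[self._i % len(self._keys)]
        self._i += 1
        return key

    def rotate(self, old_key, new_key):
        # Swap one key in place; the other keys keep serving traffic
        self._keys[self._keys.index(old_key)] = new_key

pool = KeyPool(["sk-a", "sk-b", "sk-c"])
pool.rotate("sk-b", "sk-b2")  # zero-downtime swap
print([pool.next_key() for _ in range(3)])  # ['sk-a', 'sk-b2', 'sk-c']
```

Round-robin also gives the rate-limit distribution from the list above: each key carries roughly 1/N of the traffic, so per-key quotas are hit later.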
Connection Pooling
agenticraft-llm uses httpx with connection pooling by default. For high-throughput scenarios, raise the connection limit and tighten the timeout:
config = LLMProviderConfig(
    provider="openai",
    model="gpt-5-mini",
    max_connections=20,  # Default: 10
    timeout=30.0,        # Default: 60.0
)

Next Steps
- Security Best Practices — secure your deployment
- LLM API Reference — full provider configuration options
- Architecture Overview — system design