
T3: Production Architecture Patterns

Duration: 60–90 minutes | Level: Strategic | Part of: 🍎 FROOT Transformation Layer | Prerequisites: O4 (Azure AI Platform), O5 (AI Infrastructure) | Last Updated: March 2026


T3.1 Production AI is Different

Your POC worked beautifully in the demo. Now you need to serve 10,000 users concurrently, handle API rate limits, manage costs, respond in under 2 seconds, never hallucinate on financial data, and stay available 99.9% of the time.

Welcome to production AI.


T3.2 The AI Application Architecture Stack

Every production AI system is built from the same set of layers, covered in the sections that follow: hosting, an AI gateway, latency and cost controls, multi-agent orchestration, monitoring, and resilience. Missing any one of them is a production incident waiting to happen.


T3.3 Hosting Patterns: Where Agents Live

Pattern Comparison

Decision Matrix

| Criterion | Container Apps | AKS | App Service | Functions | Copilot Studio |
|---|---|---|---|---|---|
| Complexity | Low-Medium | High | Low | Low | Very Low |
| Scaling | Auto (0→N) | Auto (custom) | Manual/Auto | Auto (0→N) | Managed |
| GPU Support | ✅ Preview | ✅ Full | | | N/A |
| Long-running | | | | ⚠️ (max 10 min) | |
| WebSocket/SSE | | | | | |
| Dapr sidecar | ✅ Built-in | ✅ Add-on | | | N/A |
| Cost at scale | 💰💰 | 💰💰💰 | 💰💰 | 💰 | 💰💰 |
| Best for | AI APIs, agents | ML serving, multi-model | Simple APIs | Event-driven AI | Business users |

Container Apps — The Sweet Spot for Most AI Workloads


T3.4 API Gateway for AI

Azure API Management (APIM) becomes critical for production AI — it's the control plane for all AI traffic.

AI Gateway Capabilities

| Capability | What It Does | Why It Matters |
|---|---|---|
| Semantic Caching | Cache similar queries, not just identical ones | 30-50% cost reduction on repeated patterns |
| Token Rate Limiting | Limit tokens/minute per user or app | Prevent runaway costs |
| Load Balancing | Distribute across multiple Azure OpenAI instances | Handle rate limits, improve availability |
| Circuit Breaking | Stop calling failing endpoints | Protect against cascading failures |
| Token Metering | Track token consumption per user/team | Cost allocation and chargeback |
| Content Safety | Pre-screen requests before they hit models | Prevent policy violations |
| Prompt Injection Detection | Detect and block injection attempts | Security guardrail |
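APIM ships token rate limiting as a managed gateway policy; the idea underneath is an ordinary token bucket keyed per user. A minimal sketch of that idea (class and method names are illustrative, not APIM's API):

```python
import time
from collections import defaultdict

class TokenRateLimiter:
    """Illustrative per-user token budget: the bucket refills continuously
    at tokens_per_minute; requests that exceed the remaining budget are
    rejected (an API gateway would return HTTP 429)."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = float(tokens_per_minute)
        self.rate = tokens_per_minute / 60.0   # tokens refilled per second
        self.buckets = defaultdict(
            lambda: {"tokens": self.capacity, "last": time.monotonic()}
        )

    def allow(self, user_id: str, requested_tokens: int) -> bool:
        b = self.buckets[user_id]
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        b["tokens"] = min(self.capacity, b["tokens"] + (now - b["last"]) * self.rate)
        b["last"] = now
        if requested_tokens <= b["tokens"]:
            b["tokens"] -= requested_tokens
            return True
        return False
```

Because the budget is counted in tokens rather than requests, one user sending huge prompts is throttled just as effectively as one sending many small ones.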

Multi-Region AI Gateway


T3.5 Latency Optimization Patterns

Where Latency Hides

| Component | Typical Latency | Optimization |
|---|---|---|
| Network to AOAI | 10-50ms | Private endpoints, regional affinity |
| Token generation | 20-80ms per token | Smaller model, shorter output, PTU |
| Embedding generation | 50-200ms | Batch, cache frequently used |
| Vector search | 10-50ms | HNSW index, filter before search |
| Reranking | 100-500ms | Limit to top-20 candidates |
| Total RAG pipeline | 500ms-3s | Parallel retrieval, streaming |

Streaming for Perceived Performance

Instead of waiting for the full response, stream tokens to the user as they're generated:

Without streaming:  [---- 3 seconds of nothing ----] Full response appears
With streaming: H-e-l-l-o-,- -h-e-r-e-'-s- -y-o-u-r- -a-n-s-w-e-r-... (progressive)

TTFT (Time To First Token) drops from 3s to ~200ms. The user sees progress immediately.
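The effect is easy to see in code. Below is a sketch that consumes a token stream and records TTFT, with a plain iterator standing in for an SDK's streaming response (a real handler would flush each token to the client, e.g. over SSE, instead of collecting it):

```python
import time
from typing import Iterable, Iterator, Tuple

def stream_tokens(source: Iterable[str]) -> Iterator[str]:
    """Stand-in for a streaming model response: yields tokens as they arrive."""
    for token in source:
        yield token

def consume_with_ttft(stream: Iterator[str]) -> Tuple[str, float]:
    """Forward tokens as they arrive, recording Time To First Token."""
    start = time.monotonic()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # the user sees output from here on
        parts.append(token)                  # a real handler would flush here
    return "".join(parts), ttft
```

The full response takes just as long to finish either way; what changes is that the user's wait for *first* output shrinks from total generation time to roughly one network round trip plus one token.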

Caching Strategies

| Cache Type | What It Caches | Hit Rate | Savings |
|---|---|---|---|
| Exact cache | Identical queries | 5-10% | 100% per hit |
| Semantic cache | Similar queries (embedding similarity) | 20-40% | 100% per hit |
| Embedding cache | Document embeddings | 80%+ | Avoid re-embedding |
| Context cache | RAG retrieval results | 30-50% | Skip retrieval step |
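A semantic cache is the most interesting of the four: it returns a stored answer whenever a new query's embedding is close enough to a previously answered one. A minimal sketch, using a toy character-frequency "embedding" in place of a real embedding model (the `SemanticCache` class and the threshold value are illustrative):

```python
import math

def embed(text: str) -> list:
    """Toy embedding (letter-frequency vector) standing in for a real
    embedding model. Production systems would call an embedding API."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Return a cached answer when a new query is close enough to a
    previously answered one; the similarity threshold is tunable."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))

    def get(self, query: str):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best is not None and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None
```

The threshold is the key tuning knob: too low and users get answers to the wrong question; too high and the cache degrades into an exact-match cache.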

T3.6 Cost Control Architecture

Token Economics

Cost per request = (input_tokens × input_rate) + (output_tokens × output_rate)

Example (GPT-4o, March 2026):
System message: 800 tokens × $2.50/1M = $0.002
User message: 200 tokens × $2.50/1M = $0.0005
RAG context: 2,000 tokens × $2.50/1M = $0.005
Output: 500 tokens × $10.00/1M = $0.005
─────────────────────────────────────────────────────
Total per request: $0.0125

At 100K requests/day = $1,250/day = $37,500/month
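The arithmetic above can be captured in a small function; the default rates are the illustrative GPT-4o figures used in the example, not a live price sheet:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_m: float = 2.50,
                 output_rate_per_m: float = 10.00) -> float:
    """Cost per request = (input_tokens x input_rate) + (output_tokens x output_rate),
    with rates quoted in dollars per 1M tokens."""
    return (input_tokens * input_rate_per_m
            + output_tokens * output_rate_per_m) / 1_000_000

# The worked example: 800 + 200 + 2,000 input tokens, 500 output tokens.
cost = request_cost(800 + 200 + 2000, 500)       # $0.0125 per request
monthly = cost * 100_000 * 30                    # 100K requests/day for 30 days
```

Wiring a function like this into request logging is what makes per-user cost dashboards and chargeback possible later.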

Cost Optimization Decision Tree


T3.7 Multi-Agent Production Patterns

Pattern 1: Supervisor Agent

When: Clear domain boundaries, need routing intelligence, want centralized control.
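The supervisor's job reduces to a routing decision plus a dispatch. A minimal sketch, using keyword matching as a stand-in for the LLM classification call a production supervisor would typically make (agent names and keywords are illustrative):

```python
def route(query: str) -> str:
    """Supervisor routing stub: pick the specialist agent for a query.
    A production supervisor would usually use an LLM classification call
    or a small fine-tuned classifier instead of keywords."""
    q = query.lower()
    if any(w in q for w in ("invoice", "refund", "charge", "payment")):
        return "billing"
    if any(w in q for w in ("error", "crash", "not working", "bug")):
        return "tech_support"
    return "product"  # default: general product questions

def supervise(query: str, agents: dict) -> str:
    """Dispatch to the chosen agent; the supervisor owns the final answer."""
    return agents[route(query)](query)
```

Centralizing the decision in one place is what makes this pattern easy to monitor and govern: every conversation passes through a single, loggable routing step.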

Pattern 2: Pipeline (Sequential Handoff)

When: Document processing, data pipelines, workflows with clear sequential steps.

Pattern 3: Swarm (Peer-to-Peer)

When: Creative tasks, complex reasoning, research — agents negotiate and collaborate without a central controller.

Hosting Multi-Agent: The Microservices Approach

Agent 1 (Supervisor)   → Container App (scale 2-10)
Agent 2 (Billing) → Container App (scale 0-5)
Agent 3 (Tech Support) → Container App (scale 0-5)
Agent 4 (Product) → Container App (scale 0-5)

Communication: Dapr pub/sub (async) or HTTP (sync)
State: Cosmos DB (conversation memory)
Observability: Application Insights (distributed tracing)

T3.8 Monitoring & Observability for AI

The AI Observability Stack

| Layer | What to Monitor | Tool |
|---|---|---|
| Infrastructure | CPU, memory, GPU, network | Azure Monitor, Container Insights |
| API | Latency, throughput, errors, rate limits | APIM Analytics, App Insights |
| Model | Token usage, TTFT, quality scores | Custom metrics in App Insights |
| Quality | Hallucination rate, groundedness, relevance | LLM evaluation pipeline |
| Cost | Token consumption, cost per request, per user | Cost Management + custom dashboards |
| Safety | Content filter triggers, injection attempts | Azure AI Content Safety logs |

Key Metrics Dashboard

| Metric | Formula | Alert Threshold |
|---|---|---|
| TTFT P95 | 95th percentile Time To First Token | >1 second |
| Total Latency P95 | 95th percentile end-to-end | >5 seconds |
| Token Cost/Request | Total tokens × rate / request count | >$0.05/request |
| Error Rate | Failed requests / total requests | >2% |
| Rate Limit Hits | 429 responses / total requests | >5% |
| Groundedness Score | Avg score from evaluation pipeline | <0.85 |
| User Satisfaction | Thumbs up / (thumbs up + thumbs down) | <80% |
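Most of these formulas are one-liners over request logs. A minimal sketch, assuming each request record carries `ttft_ms`, an HTTP `status`, a `tokens` count, and a per-token `rate` (field names are illustrative, not an App Insights schema):

```python
def percentile(values, p):
    """Nearest-rank percentile; sufficient for dashboard-style metrics."""
    s = sorted(values)
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

def dashboard(requests):
    """Compute the key dashboard metrics from a list of request records."""
    total = len(requests)
    return {
        "ttft_p95_ms": percentile([r["ttft_ms"] for r in requests], 95),
        "error_rate": sum(r["status"] >= 500 for r in requests) / total,
        "rate_limit_rate": sum(r["status"] == 429 for r in requests) / total,
        "cost_per_request": sum(r["tokens"] * r["rate"] for r in requests) / total,
    }
```

In practice these would be computed by the monitoring platform (e.g. a KQL query over App Insights telemetry) rather than in application code, but the formulas are the same.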

T3.9 Resilience Patterns

Circuit Breaker for AI Endpoints

Fallback Chain

Primary:    Azure OpenAI (East US) GPT-4o
↓ (429 or 500)
Fallback 1: Azure OpenAI (West US) GPT-4o
↓ (429 or 500)
Fallback 2: Azure OpenAI (East US) GPT-4o-mini (degraded quality, lower cost)
↓ (both down)
Fallback 3: Cached response for similar queries
↓ (no cache hit)
Fallback 4: "I'm experiencing high demand. Please try again shortly."
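The chain above is ordinary control flow, with retry and exponential backoff folded in before each handoff to the next tier. A minimal sketch (function names, attempt counts, and backoff constants are illustrative):

```python
import time

def with_fallbacks(query: str, backends, cache=None, max_attempts_each: int = 2):
    """Walk the fallback chain in order. Each backend is a callable that
    may raise (e.g. on a 429 or 500); we retry with exponential backoff
    before moving to the next tier."""
    for backend in backends:
        for attempt in range(max_attempts_each):
            try:
                return backend(query)
            except Exception:
                time.sleep(min(2 ** attempt * 0.1, 2.0))  # 0.1s, 0.2s, ...
    # All model tiers failed: try the cache, then degrade gracefully.
    if cache is not None:
        hit = cache.get(query)
        if hit is not None:
            return hit  # Fallback 3: cached response for a similar query
    return "I'm experiencing high demand. Please try again shortly."  # Fallback 4
```

The key design point is that every tier is allowed to fail: the user always gets *some* response, even if it is only the honest "try again shortly" message.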

T3.10 The Production Readiness Checklist

Before going live, verify every item:

Architecture & Infrastructure

  • Hosting platform selected (Container Apps / AKS / App Service)
  • Private endpoints for all AI services
  • Managed Identity (no API keys in code)
  • Multi-region deployment (if SLA requires >99.9%)
  • Auto-scaling configured with appropriate min/max
  • Load testing completed at 2x expected peak

Security

  • Prompt injection detection enabled
  • Content Safety filters configured
  • Input validation on all user inputs
  • PII detection and redaction
  • RBAC configured with least privilege
  • Audit logging for all AI interactions

Quality & Reliability

  • Evaluation pipeline running (offline + online)
  • Hallucination rate measured and <5%
  • Groundedness score >0.85
  • Circuit breaker and fallback chain configured
  • Rate limiting per user/application
  • Retry with exponential backoff for transient errors

Cost Management

  • Token budget per user/team configured
  • Cost alerts at 50%, 80%, 100% of budget
  • Semantic caching for common queries
  • Model tier optimization (mini for simple, full for complex)
  • Cost dashboard accessible to stakeholders

Monitoring & Operations

  • Application Insights with custom AI metrics
  • Distributed tracing across agent interactions
  • Alerts for latency, errors, cost, quality degradation
  • Runbook for common incidents (rate limits, model outage)
  • On-call rotation for AI-specific issues

Key Takeaways

The Five Rules of Production AI Architecture
  1. AI is an API problem, not a magic problem. Apply the same engineering rigor you'd use for any production API — rate limiting, caching, circuit breaking, monitoring.
  2. Container Apps is the sweet spot. For most AI agent workloads, Container Apps gives you auto-scaling (including zero), Dapr integration, and managed infrastructure without Kubernetes complexity.
  3. APIM is your AI control plane. Semantic caching, token metering, load balancing across regions, and prompt injection detection — all in one gateway.
  4. Stream everything. Users don't mind waiting 3 seconds for an answer if they see tokens appearing after 200ms. Streaming is non-negotiable for interactive AI.
  5. Cost is the silent killer. A runaway AI agent can burn through thousands of dollars in hours. Token budgets, semantic caching, and model tier optimization are as important as the AI logic itself.

FrootAI T3: Production AI is where engineering meets intelligence. Build it like infrastructure. Monitor it like a service. Budget it like a business.
