07 - AI Gateway
Azure OpenAI integration, token limiting, semantic caching, and load balancing
🎯 AI Gateway Use Cases
| Use Case | APIM Capability |
|---|---|
| Token Rate Limiting | llm-token-limit policy |
| Semantic Caching | Redis + embeddings |
| Load Balancing | Backend pools with priority/weight |
| Circuit Breaker | Automatic failover |
| Cost Tracking | Token metering per subscription |
| Streaming | Server-Sent Events (SSE) |
🏗️ AI Gateway Architecture
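Client apps call a single APIM endpoint. Inbound policies apply token rate limiting, semantic cache lookup, and managed identity authentication; the request is then routed to a load-balanced pool of Azure OpenAI backends (PTU primary, pay-as-you-go fallback) with circuit breaking and retries. Outbound policies store responses in the semantic cache, expose token usage headers, and log consumption to Event Hub.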
⚡ Token Rate Limiting
LLM Token Limit Policy
<inbound>
<base />
<!-- Token-based rate limiting for Azure OpenAI -->
<llm-token-limit
tokens-per-minute="10000"
counter-key="@(context.Subscription.Id)"
estimate-prompt-tokens="true"
tokens-consumed-header-name="x-tokens-consumed"
remaining-tokens-header-name="x-tokens-remaining" />
</inbound>
<outbound>
<base />
<!-- Add token usage to response headers -->
<set-header name="x-prompt-tokens" exists-action="override">
<value>@(context.Response.Headers.GetValueOrDefault("x-ms-prompt-tokens", "0"))</value>
</set-header>
<set-header name="x-completion-tokens" exists-action="override">
<value>@(context.Response.Headers.GetValueOrDefault("x-ms-completion-tokens", "0"))</value>
</set-header>
</outbound>
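When a caller exhausts its allowance, the policy rejects the request with 429 Too Many Requests before it ever reaches Azure OpenAI. Setting estimate-prompt-tokens="true" makes APIM estimate prompt size up front rather than waiting for the backend's usage report, at the cost of slightly less precise counting.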
Token Quota per Subscription
<inbound>
<!-- Per-minute token limit for this subscription tier -->
<llm-token-limit
tokens-per-minute="50000"
counter-key="@(context.Subscription.Id)"
estimate-prompt-tokens="true"
retry-after-header-name="Retry-After" />
<!-- Alternative: custom monthly quota tracking via external cache -->
<cache-lookup-value
key="@("token-usage-" + context.Subscription.Id)"
variable-name="tokenUsage"
caching-type="external" />
</inbound>
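The cache-lookup-value above only reads the running total; the write side is not shown. A minimal sketch of it, assuming an outbound section in the same API, a variable named newTokenUsage, a 30-day cache duration, and the x-ms-total-tokens response header used later in this document, could look like this:
<outbound>
    <base />
    <!-- Sketch: add this call's total tokens to the per-subscription running counter -->
    <set-variable name="newTokenUsage" value="@{
        var current = context.Variables.GetValueOrDefault<string>("tokenUsage", "0");
        var total = context.Response.Headers.GetValueOrDefault("x-ms-total-tokens", "0");
        return (int.Parse(current) + int.Parse(total)).ToString();
    }" />
    <!-- Persist the counter in the external (Redis) cache for roughly one month -->
    <cache-store-value
        key="@("token-usage-" + context.Subscription.Id)"
        value="@((string)context.Variables["newTokenUsage"])"
        duration="2592000"
        caching-type="external" />
</outbound>
A scheduled job or a policy check against this counter can then enforce the monthly quota and reset it at the start of each billing period.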
💾 Semantic Caching
How Semantic Caching Works
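Instead of matching prompts verbatim, the gateway converts each incoming prompt to an embedding via the configured embeddings backend and searches Redis for previously cached prompts with a similar embedding. If the similarity meets the score-threshold, the cached completion is returned immediately and the Azure OpenAI backend is never called; otherwise the request proceeds and the response is stored for future lookups.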
Semantic Cache Policy
<inbound>
<base />
<!-- Semantic cache with Azure Managed Redis -->
<llm-semantic-cache-lookup
score-threshold="0.95"
embeddings-backend-id="embeddings-backend"
embeddings-backend-auth="system-assigned" />
</inbound>
<outbound>
<base />
<!-- Store response in semantic cache -->
<llm-semantic-cache-store duration="3600" />
</outbound>
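score-threshold controls how close a new prompt must be to a cached one before the cached completion is served, and duration (3600 seconds here) bounds how long cached completions stay valid. Tune both per workload: a looser threshold raises the hit rate but risks answering a subtly different question with a mismatched completion.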
Redis Configuration for Semantic Cache
resource redis 'Microsoft.Cache/redisEnterprise@2023-11-01' = {
name: 'redis-apim-ai-cache'
location: location
sku: {
name: 'Enterprise_E10'
capacity: 2
}
properties: {
minimumTlsVersion: '1.2'
}
}
resource redisDatabase 'Microsoft.Cache/redisEnterprise/databases@2023-11-01' = {
parent: redis
name: 'default'
properties: {
clientProtocol: 'Encrypted'
evictionPolicy: 'NoEviction'
modules: [
{
name: 'RediSearch'
}
{
name: 'RedisJSON'
}
]
}
}
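The RediSearch module enables the vector similarity search that semantic caching depends on. After deployment, register this cache as the APIM instance's external cache so the llm-semantic-cache-* policies can reach it.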
⚖️ Load Balancing OpenAI Instances
Backend Pool Configuration
resource openaiBackend 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
name: 'openai-backend-pool'
parent: apim
properties: {
type: 'Pool'
pool: {
services: [
{
id: '/backends/openai-ptu-eastus'
priority: 1 // PTU first (reserved capacity)
weight: 1
}
{
id: '/backends/openai-payg-westus'
priority: 2 // PAYG as fallback
weight: 3
}
{
id: '/backends/openai-payg-northeu'
priority: 2
weight: 1
}
]
}
}
}
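Priority groups fail over in order (lower number wins), and weight distributes traffic within a group. Here all requests go to the PTU deployment while it is healthy; if it becomes unavailable or its circuit breaker trips, the two pay-as-you-go deployments share the overflow in a 3:1 ratio.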
Individual Backend Definitions
resource openaiPtuBackend 'Microsoft.ApiManagement/service/backends@2023-05-01-preview' = {
name: 'openai-ptu-eastus'
parent: apim
properties: {
url: 'https://aoai-ptu-eastus.openai.azure.com/openai'
protocol: 'http'
circuitBreaker: {
rules: [
{
name: 'openai-circuit-breaker'
failureCondition: {
count: 5
interval: 'PT1M'
statusCodeRanges: [
{ min: 429, max: 429 }
{ min: 500, max: 599 }
]
}
tripDuration: 'PT1M'
}
]
}
}
}
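With this rule, five 429 or 5xx responses within one minute trip the breaker; the backend is then skipped for the tripDuration (one minute), during which the pool routes traffic to the remaining backends, and is retried automatically afterwards.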
Load Balancing Policy
<inbound>
<base />
<!-- Route to backend pool with automatic failover -->
<set-backend-service backend-id="openai-backend-pool" />
<!-- Authentication via managed identity -->
<authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
<backend>
<!-- Retry on transient failures -->
<retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
count="3"
interval="1"
delta="1"
max-interval="10"
first-fast-retry="true">
<forward-request buffer-request-body="true" />
</retry>
</backend>
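buffer-request-body="true" keeps the request body in memory so it can be resent intact on each retry attempt; without it, a retry against another backend in the pool would forward an empty body.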
🔄 Streaming Support (SSE)
Streaming Configuration
<inbound>
<base />
<!-- Request an event-stream response from the backend -->
<set-header name="Accept" exists-action="override">
<value>text/event-stream</value>
</set-header>
</inbound>
<backend>
<!-- Forward with streaming enabled -->
<forward-request timeout="120" buffer-request-body="true" buffer-response="false" />
</backend>
<outbound>
<base />
<!-- Ensure streaming headers -->
<set-header name="Content-Type" exists-action="override">
<value>text/event-stream</value>
</set-header>
<set-header name="Cache-Control" exists-action="override">
<value>no-cache</value>
</set-header>
</outbound>
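buffer-response="false" is what actually enables streaming: APIM relays each SSE chunk to the caller as soon as it arrives from Azure OpenAI instead of buffering the full completion. The client must still opt in by sending "stream": true in the chat completions request body.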
📊 Token Usage Tracking
Log Token Usage to Event Hub
<outbound>
<base />
<!-- Extract token usage from response -->
<set-variable name="promptTokens"
value="@(context.Response.Headers.GetValueOrDefault("x-ms-prompt-tokens", "0"))" />
<set-variable name="completionTokens"
value="@(context.Response.Headers.GetValueOrDefault("x-ms-completion-tokens", "0"))" />
<!-- Log to Event Hub for billing/analytics -->
<log-to-eventhub logger-id="token-usage-logger">@{
return new JObject(
new JProperty("timestamp", DateTime.UtcNow.ToString("o")),
new JProperty("subscriptionId", context.Subscription?.Id),
new JProperty("model", context.Request.Headers.GetValueOrDefault("x-model", "unknown")),
new JProperty("promptTokens", context.Variables["promptTokens"]),
new JProperty("completionTokens", context.Variables["completionTokens"]),
new JProperty("totalTokens",
int.Parse((string)context.Variables["promptTokens"]) +
int.Parse((string)context.Variables["completionTokens"]))
).ToString();
}</log-to-eventhub>
</outbound>
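log-to-eventhub only works if an APIM logger named token-usage-logger has been created against your Event Hub; the policy references it by logger-id but does not create it. Events are buffered and sent asynchronously, so logging adds minimal latency to the completion call.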
🔐 AI Gateway Security
Content Filtering
<inbound>
<base />
<!-- Block prompt injection patterns -->
<choose>
<!-- preserveContent: true keeps the body readable here while still forwardable to the backend -->
<when condition="@(context.Request.Body.As<JObject>(preserveContent: true)["messages"]
    .Any(m => m["content"].ToString().ToLowerInvariant().Contains("ignore previous instructions")))">
<return-response>
<set-status code="400" reason="Bad Request" />
<set-body>{"error": "Request contains prohibited content"}</set-body>
</return-response>
</when>
</choose>
</inbound>
Rate Limiting by Model
<inbound>
<base />
<choose>
<!-- GPT-4 - stricter limits -->
<when condition="@(context.Request.MatchedParameters["deployment-id"] == "gpt-4")">
<llm-token-limit tokens-per-minute="5000" counter-key="@(context.Subscription.Id)" />
</when>
<!-- GPT-3.5 - higher limits -->
<when condition="@(context.Request.MatchedParameters["deployment-id"] == "gpt-35-turbo")">
<llm-token-limit tokens-per-minute="50000" counter-key="@(context.Subscription.Id)" />
</when>
</choose>
</inbound>
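context.Request.MatchedParameters resolves URL template parameters, so this pattern requires the operation's URL template to include {deployment-id} (as the standard Azure OpenAI path /deployments/{deployment-id}/chat/completions does). Requests to deployments not matched by any branch fall through with no token limit, so add a default branch if that is not intended.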
📋 Complete AI Gateway Policy
<policies>
<inbound>
<base />
<!-- 1. Token Rate Limiting -->
<llm-token-limit
tokens-per-minute="10000"
counter-key="@(context.Subscription.Id)"
estimate-prompt-tokens="true"
tokens-consumed-header-name="x-tokens-consumed"
remaining-tokens-header-name="x-tokens-remaining" />
<!-- 2. Semantic Cache Lookup -->
<llm-semantic-cache-lookup
score-threshold="0.95"
embeddings-backend-id="embeddings-backend"
embeddings-backend-auth="system-assigned" />
<!-- 3. Set Backend Pool -->
<set-backend-service backend-id="openai-backend-pool" />
<!-- 4. Authenticate with Managed Identity -->
<authentication-managed-identity resource="https://cognitiveservices.azure.com" />
</inbound>
<backend>
<retry condition="@(context.Response.StatusCode == 429 || context.Response.StatusCode >= 500)"
count="3"
interval="1"
first-fast-retry="true">
<forward-request buffer-request-body="true" />
</retry>
</backend>
<outbound>
<base />
<!-- Store in Semantic Cache -->
<llm-semantic-cache-store duration="3600" />
<!-- Add Usage Headers -->
<set-header name="x-tokens-used" exists-action="override">
<value>@(context.Response.Headers.GetValueOrDefault("x-ms-total-tokens", "0"))</value>
</set-header>
</outbound>
<on-error>
<base />
<set-header name="Content-Type" exists-action="override">
<value>application/json</value>
</set-header>
<return-response>
<set-status code="500" reason="Internal Server Error" />
<set-body>@{
return new JObject(
new JProperty("error", "AI Gateway Error"),
new JProperty("requestId", context.RequestId.ToString())
).ToString();
}</set-body>
</return-response>
</on-error>
</policies>
✅ AI Gateway Checklist
- Token rate limiting configured per tier
- Backend pool with PTU primary, PAYG fallback
- Circuit breaker for 429/5xx responses
- Managed identity for OpenAI authentication
- Semantic caching with Redis Enterprise
- Streaming support for chat completions
- Token usage logging to Event Hub
- Content filtering for prompt injection
🔗 Related Documents
| Document | Description |
|---|---|
| 04-Policies | General policy patterns |
| 06-Monitoring | Token usage monitoring |
| 09-Cost-Optimization | AI cost management |
Next: 08-Self-Hosted-Gateway - Kubernetes and hybrid deployment