Skip to main content

02 - Reliability

Building resilient API Management deployments with zone redundancy, multi-region, and disaster recovery

WAF Pillar


📋 WAF Workload Design Checklist

Based on Azure Well-Architected Framework - Reliability Checklist

#RecommendationStatus
(Service) Evaluate gateway capabilities for reliability - determine tier and features needed
(Service) Enable zone redundancy in Premium tier with minimum 2 units
(Service) Enable automatic scale-out based on traffic demands
(Service) Use multiregion topology for complete regional failure resiliency
(Service) Isolate unrelated APIs with workspaces or additional instances
(Service) Review observability capabilities (Azure Monitor, App Insights)
(Service) Factor gateway SLA into your workload's SLOs
(API) Plan for both expected and unexpected backend faults
(API) Implement quotas, rate limits, retry, circuit breakers, load balancing
(API) Thoroughly test policy expressions with resilient error handling
(Service) Use Azure Load Testing from within your network
(Service) Plan DR with backup/restore and APIOps/IaC solutions

🎯 Reliability Goals

MetricTarget
Availability99.99% (Premium tier with zones)
RTO< 1 hour for regional failover
RPONear-zero with active-active
RecoveryAutomated failover where possible

🏗️ High Availability Patterns

Zone Redundancy (Single Region)

Requirements:

  • Premium tier only
  • Minimum 2 units (recommended 3 for full zone coverage)
  • All zones active-active

Multi-Region Deployment


📋 Reliability Configuration Checklist

Zone Redundancy Setup

ConfigurationRecommendationBenefit
TierPremiumRequired for zone redundancy
UnitsMinimum 2, recommended 3Full zone coverage
DistributionAutomatic across zonesNo manual zone selection needed
Load BalancingBuilt-inActive-active across all zones

Multi-Region Setup

ConfigurationRecommendationBenefit
Regions2+ in different geographiesGeographic resilience
Traffic ManagerPriority or Weighted routingIntelligent failover
Backend SyncReplicate to all regionsData consistency
DNS TTLLow (60 seconds)Fast failover

🏢 Workload Isolation with Workspaces

WAF Recommendation: Isolate critical workloads using dedicated gateways

Isolation ApproachUse CaseBenefit
WorkspacesMulti-team, same instanceCost-effective isolation
Dedicated GatewayCritical APIsCompute isolation
Separate InstanceStrict boundariesFull isolation

📊 Reliability Metrics

WAF Recommendation: Collect reliability metrics for deviation detection

Key Metrics to Monitor

MetricDescriptionAlert Threshold
CapacityGateway resource utilization> 80%
RequestsTotal requests per periodBaseline deviation > 50%
Failed Requests4xx and 5xx responses> 5% of total
Backend DurationTime to backend response> 5 seconds
Gateway ErrorsGateway-originated errors> 1%
Rate Limit ViolationsQuota exceeded responsesAny occurrence
Health Check FailuresDependency healthAny failure

🔄 Scaling Strategies

Capacity Metric

The Capacity metric is the key indicator for scaling:

Capacity %Action
< 40%Consider scaling in
40-70%Optimal range
70-80%Plan to scale out
> 80%Scale out immediately

Autoscale Configuration (Bicep)

resource autoscale 'Microsoft.Insights/autoscalesettings@2022-10-01' = {
name: 'apim-autoscale'
location: location
properties: {
targetResourceUri: apim.id
enabled: true
profiles: [
{
name: 'Auto Scale Profile'
capacity: {
minimum: '2'
maximum: '10'
default: '2'
}
rules: [
{
metricTrigger: {
metricName: 'Capacity'
metricResourceUri: apim.id
timeGrain: 'PT1M'
statistic: 'Average'
timeWindow: 'PT10M'
timeAggregation: 'Average'
operator: 'GreaterThan'
threshold: 70
}
scaleAction: {
direction: 'Increase'
type: 'ChangeCount'
value: '1'
cooldown: 'PT20M'
}
}
{
metricTrigger: {
metricName: 'Capacity'
metricResourceUri: apim.id
timeGrain: 'PT1M'
statistic: 'Average'
timeWindow: 'PT30M'
timeAggregation: 'Average'
operator: 'LessThan'
threshold: 40
}
scaleAction: {
direction: 'Decrease'
type: 'ChangeCount'
value: '1'
cooldown: 'PT60M'
}
}
]
}
]
}
}

🛡️ Resilience Policies

Circuit Breaker Pattern

<backend id="backend-service">
<circuit-breaker
trip-duration="PT1M"
accept-retriable-5xx="false">
<failure-condition count="10" interval="PT10S" />
</circuit-breaker>
</backend>

Retry Policy

<retry condition="@(context.Response.StatusCode >= 500)" 
count="3"
interval="1"
delta="1"
max-interval="10"
first-fast-retry="true">
<forward-request buffer-request-body="true" />
</retry>

Backend Load Balancing

<backend id="load-balanced-backend">
<backend-pool>
<backend url="https://backend1.example.com" priority="1" weight="3"/>
<backend url="https://backend2.example.com" priority="1" weight="1"/>
<backend url="https://backend3.example.com" priority="2" weight="1"/>
</backend-pool>
</backend>

💾 Disaster Recovery

Backup Strategy

Built-in Backup (Azure CLI)

# Create backup
az apim backup \
--name "apim-westeu-prod-mesh" \
--resource-group "rg-apim-prod" \
--backup-name "backup-20251215" \
--storage-account-name "stbackup001" \
--storage-account-container "apim-backups" \
--storage-account-key "<key>"

# Restore from backup
az apim restore \
--name "apim-westeu-prod-mesh" \
--resource-group "rg-apim-prod" \
--backup-name "backup-20251215" \
--storage-account-name "stbackup001" \
--storage-account-container "apim-backups" \
--storage-account-key "<key>"

What's Backed Up

IncludedNot Included
✅ API definitions❌ User subscriptions (keys)
✅ Products❌ Custom domain certificates
✅ Policies❌ Managed identity assignments
✅ Named values❌ VNet configuration
✅ Developer portal content❌ Diagnostic settings
✅ Certificates (public)❌ Azure AD B2C config

📊 Reliability Metrics

Key Metrics to Monitor

MetricDescriptionAlert Threshold
CapacityGateway resource utilization> 80%
RequestsTotal requests per periodBaseline deviation > 50%
Failed Requests4xx and 5xx responses> 5% of total
Backend DurationTime to backend response> 5 seconds
Gateway ErrorsGateway-originated errors> 1%

Alert Configuration (Bicep)

resource capacityAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'apim-capacity-alert'
location: 'global'
properties: {
description: 'APIM Capacity exceeded 80%'
severity: 2
enabled: true
scopes: [apim.id]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
name: 'CapacityThreshold'
metricName: 'Capacity'
operator: 'GreaterThan'
threshold: 80
timeAggregation: 'Average'
}
]
}
actions: [
{
actionGroupId: actionGroup.id
}
]
}
}

✅ Reliability Checklist

Design Phase

  • Selected Premium tier for zone redundancy
  • Planned minimum 2 scale units
  • Identified multi-region requirements
  • Designed backend failover strategy
  • Documented RTO/RPO requirements

Implementation Phase

  • Enabled zone redundancy
  • Configured autoscaling rules
  • Implemented circuit breakers
  • Set up retry policies
  • Configured backend load balancing

Operations Phase

  • Configured capacity alerts
  • Set up automated backups
  • Documented DR runbooks
  • Tested failover procedures
  • Scheduled regular DR drills

DocumentDescription
03-SecurityNetwork isolation and protection
06-MonitoringMetrics, alerts, and diagnostics
09-Cost-OptimizationScaling cost considerations

Next: 03-Security - OWASP protection and network isolation

📖Learn