
17 - Troubleshooting Guide

Common issues, diagnostics, and resolution patterns for Azure API Management


🔍 Diagnostic Tools Overview


🚨 Common Issues & Solutions

1. 502 Bad Gateway

Symptoms: Clients receive 502 errors intermittently or consistently.

Causes & Solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Backend timeout | Check BackendDuration metric | Increase timeout or optimize backend |
| Backend unreachable | Network Status blade | Fix NSG, firewall, private endpoint |
| SSL/TLS handshake failure | Check backend TLS version | Align TLS versions |
| Invalid backend certificate | Certificate validation error | Fix certificate chain |

Diagnostic Query:

ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where ResponseCode == 502
| project TimeGenerated, ApiId, OperationId, BackendUrl,
    BackendResponseCode, LastErrorMessage, TotalTime
| order by TimeGenerated desc

Policy Fix (Retry):

<backend>
    <retry condition="@(context.Response.StatusCode == 502)"
           count="3"
           interval="1"
           delta="2"
           max-interval="10"
           first-fast-retry="true">
        <forward-request buffer-request-body="true" timeout="60" />
    </retry>
</backend>

2. 401 Unauthorized

Symptoms: JWT validation failing, subscription key rejected.

Causes & Solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Invalid JWT | Check token claims | Verify issuer, audience |
| Expired token | Check exp claim | Token refresh logic |
| Wrong subscription key | Key mismatch | Use correct key |
| Subscription suspended | Check subscription state | Reactivate |

Diagnostic Query:

ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where ResponseCode == 401
| project TimeGenerated, ApiId, CallerIpAddress,
    LastErrorReason, LastErrorMessage
| summarize Count=count() by LastErrorReason

Debug JWT Issues:

<inbound>
    <!-- Debug: Log JWT claims (DEV ONLY - remove in prod) -->
    <validate-jwt header-name="Authorization"
                  output-token-variable-name="jwt"
                  failed-validation-httpcode="401">
        <openid-config url="{{openid-config-url}}"/>
    </validate-jwt>

    <trace source="JWT Debug">
        <message>@{
            var jwt = (Jwt)context.Variables["jwt"];
            return $"Subject: {jwt.Subject}, Issuer: {jwt.Issuer}, Audience: {string.Join(",", jwt.Audiences)}";
        }</message>
    </trace>
</inbound>
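
The claims named in the table above (issuer, audience, exp) can also be inspected on the client before a request ever reaches the gateway. The following is a minimal Python sketch, assuming a standard three-part JWT. `decode_jwt_claims` and `diagnose` are hypothetical helper names, the expected issuer and audience are values you supply, and the signature is deliberately not verified, so this is for diagnosis only.

```python
import base64
import json
import time


def decode_jwt_claims(token):
    """Return the payload claims of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def diagnose(claims, issuer, audience, now=None):
    """List likely reasons a gateway would reject these claims (hypothetical helper)."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("issuer mismatch")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # "aud" may be a string or a list
    if audience not in audiences:
        problems.append("audience mismatch")
    if "exp" in claims and claims["exp"] <= now:
        problems.append("token expired")
    return problems
```

An empty list from `diagnose` suggests the 401 is more likely caused by the subscription key or by signature validation than by the claims themselves.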

3. 429 Too Many Requests

Symptoms: Rate limit or quota exceeded.

Causes & Solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Rate limit hit | Check Retry-After header | Implement backoff |
| Quota exhausted | Check quota remaining | Upgrade tier or wait |
| Misconfigured limits | Review policy | Adjust limits |

Diagnostic Query:

ApiManagementGatewayLogs
| where TimeGenerated > ago(24h)
| where ResponseCode == 429
| summarize RateLimitHits=count() by SubscriptionId, bin(TimeGenerated, 1h)
| render timechart

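Response Headers to Check (Retry-After is returned by default; the X-RateLimit-* headers appear only if the policy or backend is configured to emit them):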

Retry-After: 30
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1702743600
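
On the client side, "implement backoff" usually means honoring Retry-After when it is present and falling back to exponential backoff otherwise. The following is a minimal sketch under those assumptions. `retry_delay`, `base` and `cap` are hypothetical names, only the delta-seconds form of Retry-After is handled, and `headers` is assumed to be a plain dict keyed exactly as shown above.

```python
import random


def retry_delay(headers, attempt, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (1-based) after a 429."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            # Server hint in seconds; trust it as-is, but never return a negative wait.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # Fallback: exponential backoff with full jitter, bounded by `cap`.
    return random.uniform(0, min(cap, base * 2 ** (attempt - 1)))
```

Full jitter spreads retries from many clients over the whole window, which helps avoid hitting the limit again in lockstep.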

4. High Latency

Symptoms: Slow API responses, P95 > SLA target.

Causes & Solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Slow backend | BackendDuration high | Optimize backend |
| Policy processing | TotalTime - BackendTime | Simplify policies |
| Network latency | Geographic distance | Deploy closer |
| No caching | Cache miss | Implement caching |

Diagnostic Query:

ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| summarize
    P50=percentile(TotalTime, 50),
    P95=percentile(TotalTime, 95),
    P99=percentile(TotalTime, 99),
    AvgBackend=avg(BackendTime)
    by ApiId, bin(TimeGenerated, 5m)
| order by P95 desc

Latency Breakdown:

TotalTime = GatewayTime + BackendTime
GatewayTime = Policy Processing + Network Overhead
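
The breakdown above can be applied to exported log rows to see which APIs spend the most time in the gateway rather than the backend. The following is a minimal sketch, assuming each row is a dict carrying the `ApiId`, `TotalTime` and `BackendTime` columns in milliseconds. `gateway_overhead_by_api` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean


def gateway_overhead_by_api(rows):
    """Average (TotalTime - BackendTime) in ms per ApiId."""
    per_api = defaultdict(list)
    for row in rows:
        # Everything not spent waiting on the backend is attributed to the gateway.
        per_api[row["ApiId"]].append(row["TotalTime"] - row["BackendTime"])
    return {api: mean(values) for api, values in per_api.items()}
```

A high value for one API while others stay low points at that API's policies rather than at network latency.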

5. Capacity Issues

Symptoms: High capacity metric, slow responses, timeouts.

Causes & Solutions:

| Cause | Diagnosis | Solution |
|---|---|---|
| Traffic spike | Request count spike | Autoscale or scale out |
| Insufficient units | Capacity > 80% | Add units |
| Complex policies | CPU usage high | Optimize policies |
| Large payloads | Memory pressure | Limit payload size |

Diagnostic Query:

AzureMetrics
| where ResourceProvider == "MICROSOFT.APIMANAGEMENT"
| where MetricName == "Capacity"
| summarize AvgCapacity=avg(Average), MaxCapacity=max(Maximum)
    by bin(TimeGenerated, 5m)
| render timechart

Scale Out Command:

az apim update --name $APIM_NAME --resource-group $RG \
    --sku-capacity 3
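
Before adding units, it helps to separate a sustained breach of the 80% threshold from a single spike. The following is a minimal sketch, assuming `samples` are consecutive Capacity values such as the 5-minute averages from the query above. `needs_scale_out` and `min_consecutive` are hypothetical names, and three consecutive samples is an arbitrary choice to tune.

```python
def needs_scale_out(samples, threshold=80.0, min_consecutive=3):
    """True if capacity stays above `threshold` for `min_consecutive` samples in a row."""
    run = 0
    for value in samples:
        run = run + 1 if value > threshold else 0  # reset the streak on any dip
        if run >= min_consecutive:
            return True
    return False
```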

6. Network Connectivity Issues

Symptoms: Cannot reach backend, DNS resolution failures.

Diagnostic Steps:

  1. Check Network Status blade in Azure Portal
  2. Verify NSG rules allow required ports
  3. Check Private DNS zones are linked to VNet
  4. Verify Private Endpoints are approved

Required NSG Rules:

| Direction | Port | Source | Purpose |
|---|---|---|---|
| Inbound | 443 | Your sources | API traffic |
| Inbound | 3443 | ApiManagement | Management |
| Inbound | 6390 | AzureLoadBalancer | Health probe |
| Outbound | 443 | VirtualNetwork | Azure services |
| Outbound | 1433 | VirtualNetwork | Azure SQL |
| Outbound | 5671-5672 | VirtualNetwork | Event Hub |

DNS Resolution Test:

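<!-- Policy to test DNS resolution -->
<!-- Call the backend's regular hostname; with a linked Private DNS zone it resolves to the
     private endpoint, and the TLS certificate does not cover the privatelink name. -->
<inbound>
    <send-request mode="new" response-variable-name="dnstest" timeout="10">
        <set-url>https://your-backend.azurewebsites.net/health</set-url>
        <set-method>GET</set-method>
    </send-request>
    <return-response response-variable-name="dnstest" />
</inbound>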
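
A similar check can be run from a VM or container inside the same VNet, which separates DNS problems from NSG or firewall problems. The following is a minimal sketch using only the Python standard library. `resolve` and `is_private` are hypothetical helper names, and the result is only meaningful when run from a host that uses the VNet's DNS configuration.

```python
import ipaddress
import socket


def resolve(hostname, port=443):
    """Return the sorted set of IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def is_private(address):
    """True if the address falls in a private range (e.g. a private endpoint IP)."""
    # Drop any IPv6 scope id such as "%eth0" before parsing.
    return ipaddress.ip_address(address.split("%")[0]).is_private
```

If `resolve("your-backend.azurewebsites.net")` returns only public addresses, the Private DNS zone is probably not linked to the VNet. If it returns a private address and requests still fail, look at NSG rules and endpoint approval instead.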

7. Certificate Issues

Symptoms: SSL handshake failures, certificate warnings.

Common Issues:

| Issue | Diagnosis | Solution |
|---|---|---|
| Expired certificate | Check expiry date | Renew certificate |
| Chain incomplete | Missing intermediate CA | Upload full chain |
| CN mismatch | Certificate vs hostname | Use correct certificate |
| Key Vault access denied | APIM managed identity | Grant access |

Check Certificate:

# Check certificate in Key Vault
az keyvault certificate show --vault-name $KV_NAME \
    --name $CERT_NAME \
    --query '{expires:attributes.expires, thumbprint:x509ThumbprintHex}'

# Check APIM custom domain
az apim show --name $APIM_NAME --resource-group $RG \
    --query 'hostnameConfigurations[].{host:hostName, cert:certificateSource}'
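
The expiry timestamp returned by the Key Vault query can be turned into a days-remaining figure for alerting. The following is a minimal sketch, assuming the timestamp is an ISO 8601 string and treating values without an offset as UTC. `days_until_expiry` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def days_until_expiry(expires_iso, now=None):
    """Whole days until the certificate expires; negative once it has expired."""
    expires = datetime.fromisoformat(expires_iso.replace("Z", "+00:00"))
    if expires.tzinfo is None:
        expires = expires.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```

For example, a scheduled job could raise an alert when the value drops below 30, leaving time to renew before handshakes start failing.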

8. Developer Portal Issues

Symptoms: Portal not loading, authentication failures.

Common Issues:

| Issue | Solution |
|---|---|
| Blank page | Publish portal after changes |
| CORS errors | Configure CORS in portal settings |
| Auth loop | Check Entra ID app registration |
| API not visible | Check product visibility |

Publish Portal:

az apim portal publish --resource-group $RG --name $APIM_NAME

📊 Essential KQL Queries

Error Analysis

// Top errors by API
ApiManagementGatewayLogs
| where TimeGenerated > ago(24h)
| where ResponseCode >= 400
| summarize Count=count() by ApiId, ResponseCode
| order by Count desc
| take 20

Performance Analysis

// Slow operations
ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| where TotalTime > 5000 // > 5 seconds
| project TimeGenerated, ApiId, OperationId, TotalTime, BackendTime
| order by TotalTime desc
| take 50

Traffic Patterns

// Requests per minute by API
ApiManagementGatewayLogs
| where TimeGenerated > ago(1h)
| summarize Requests=count() by ApiId, bin(TimeGenerated, 1m)
| render timechart

Security Events

// Failed authentication attempts
ApiManagementGatewayLogs
| where TimeGenerated > ago(24h)
| where ResponseCode == 401 or ResponseCode == 403
| summarize Attempts=count() by CallerIpAddress, bin(TimeGenerated, 1h)
| where Attempts > 100
| order by Attempts desc

🔧 API Tracing (Development Only)

⚠️ Never enable in production - exposes sensitive data

Enable Tracing

# Send a traced request (development only).
# The key must belong to a subscription that allows tracing.
# $API_PATH is a placeholder for the API suffix and operation path;
# replace the host if the gateway uses a custom domain.
curl -i "https://$APIM_NAME.azure-api.net/$API_PATH" \
    -H "Ocp-Apim-Trace: true" \
    -H "Ocp-Apim-Subscription-Key: $KEY"

Trace Output Analysis

{
  "traceId": "...",
  "traces": [
    {
      "source": "inbound",
      "timestamp": "...",
      "elapsed": "00:00:00.0234567",
      "data": {
        "message": "Policy executed successfully"
      }
    }
  ]
}
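
When a trace contains many entries, sorting them by their elapsed value makes the slow steps easier to spot. The following is a minimal sketch, assuming the trace has the shape shown above (a top-level `traces` list whose items carry `source` and `elapsed`). Real trace output may be structured differently, and whether `elapsed` is cumulative or per step is not stated here. `parse_elapsed` and `slowest_steps` are hypothetical helper names.

```python
import json


def parse_elapsed(value):
    """Convert an 'HH:MM:SS.fffffff' string to seconds."""
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def slowest_steps(trace_json, top=5):
    """Return up to `top` (source, seconds) pairs, largest elapsed value first."""
    traces = json.loads(trace_json)["traces"]
    timed = [(t["source"], parse_elapsed(t["elapsed"])) for t in traces]
    return sorted(timed, key=lambda item: item[1], reverse=True)[:top]
```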

📋 Troubleshooting Checklist

Quick Diagnosis Flow


| Document | Description |
|---|---|
| 06-Monitoring | Monitoring setup |
| 02-Reliability | HA and failover |
| 03-Security | Security configuration |

Next: 18-Capacity-Planning - Sizing and capacity guidance
