Unified Monitoring Solution - Operations Runbook

Version: 1.0
Last Updated: January 2026 (Workshop Edition)
Purpose: Day-2 operations, alerting procedures, KQL queries, and troubleshooting guides

Feature Availability Legend: ✅ GA | ⚠️ Preview | 📍 Planned

📖 How to Use This Document

This Operations Runbook provides the "How" (practical implementation, KQL queries, troubleshooting).
The Architecture document provides the "What & Why" (theory, design decisions, patterns).

Practice ↔ Theory Cross-Reference

Operations Topic	Architecture Counterpart	Description
1. Operations Overview	3. Federated Model	Day-2 responsibilities ← Governance model
2. Alert Response	5. Landing Zone Alerting	Alert response ← Alert design
3. KQL Query Library	—	Ready-to-use queries (no architecture counterpart)
4. Federated Visibility	3.4 Federated Visibility	Cross-subscription visibility ← Federated architecture
5. Maintenance Windows	6. AMBA	Suppression ← Baseline alerts
6. Troubleshooting Guide	4. Core Components	Troubleshooting ← Component design
7. RBAC Operations	8. Security & Access	Access management ← RBAC design
8. Cost Monitoring	Advanced Topics: Cost	Quick reference → Deep-dive in Advanced

1. Operations Overview
2. Alert Response Procedures
3. KQL Query Library
4. Federated Visibility Operations
5. Maintenance Window Management
6. Troubleshooting Guide
7. RBAC Operations
8. Cost Monitoring & Optimization
9. Runbook Templates

1. Operations Overview

1.1 Operations Model

The Unified Monitoring Solution follows a shared responsibility model aligned with the federated architecture. Section 1.2 breaks the responsibilities down by team.

1.2 Responsibility Matrix (RACI)

Activity	Central Team	Landing Zone	Security	FinOps
LAW Infrastructure	R/A	I	C	I
DCR Baseline	R/A	C	C	I
Custom DCRs	C	R/A	C	I
AMBA Policies	R/A	I	C	I
Custom Alerts	C	R/A	I	I
Incident Response	C	R/A	C	I
Cost Optimization	C	C	I	R/A
Access Reviews	C	R	A	I

Legend: R = Responsible, A = Accountable, C = Consulted, I = Informed

1.3 Operational Cadence

2. Alert Response Procedures

2.1 Alert Severity Classification

Severity	Response Time	Escalation Path	Examples
Sev 0 - Critical	15 min	On-call → Manager → Director	Service outage, Data loss
Sev 1 - High	1 hour	On-call → Team Lead	Performance degradation, Availability < 99%
Sev 2 - Medium	4 hours	Team queue	Threshold warnings, Capacity alerts
Sev 3 - Low	24 hours	Backlog	Informational, Optimization opportunities
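
The severity table can be encoded as a small lookup so that tooling computes response deadlines consistently. The following is a minimal Python sketch, not part of any Azure SDK; the names `SEVERITY_POLICY`, `response_deadline`, and `escalation_path` are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Response targets and escalation paths copied from the severity table above.
SEVERITY_POLICY = {
    0: (timedelta(minutes=15), ["On-call", "Manager", "Director"]),
    1: (timedelta(hours=1), ["On-call", "Team Lead"]),
    2: (timedelta(hours=4), ["Team queue"]),
    3: (timedelta(hours=24), ["Backlog"]),
}


def response_deadline(severity: int, fired_at: datetime) -> datetime:
    """Return the latest time by which a responder should act on the alert."""
    if severity not in SEVERITY_POLICY:
        raise ValueError(f"Unknown severity: {severity}")
    response_time, _ = SEVERITY_POLICY[severity]
    return fired_at + response_time


def escalation_path(severity: int) -> list[str]:
    """Return the ordered escalation chain for a severity level."""
    if severity not in SEVERITY_POLICY:
        raise ValueError(f"Unknown severity: {severity}")
    return list(SEVERITY_POLICY[severity][1])


if __name__ == "__main__":
    fired = datetime(2026, 1, 10, 9, 0, tzinfo=timezone.utc)
    print(response_deadline(0, fired).isoformat())
    print(" -> ".join(escalation_path(1)))
```

Keeping the table in one structure means a change to a response target only has to be made in one place.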

2.2 Alert Response Workflow

2.3 Common Alert Response Actions

VM Availability Alert

# Check VM status
az vm get-instance-view --name <vm-name> --resource-group <rg-name> --query "instanceView.statuses[1].displayStatus"

# Check recent boot diagnostics
az vm boot-diagnostics get-boot-log --name <vm-name> --resource-group <rg-name>

# Restart VM if needed
az vm restart --name <vm-name> --resource-group <rg-name>

High CPU Alert

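# Get host-level CPU metrics for the last hour (PowerShell; timestamps in UTC)
# Note: "Percentage CPU" is a platform metric and does not show per-process detail
$start = (Get-Date).ToUniversalTime().AddHours(-1).ToString("yyyy-MM-ddTHH:mm:ssZ")
$end = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")
az monitor metrics list --resource <vm-resource-id> `
    --metric "Percentage CPU" `
    --start-time $start `
    --end-time $end `
    --interval PT5M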

Log Analytics Ingestion Alert

# Check workspace state and configuration (this command does not report ingestion latency)
az monitor log-analytics workspace show --workspace-name <law-name> --resource-group <rg-name>

# Query ingestion health
# Use KQL from Section 3.4

2.4 Actionable Alerts - Including Remediation Steps

Best Practice: "Make every notification actionable - alert contains sufficient information to act"

Embedding Remediation Guidance in Alerts

Use alert rule descriptions and custom properties to include remediation steps directly in notifications:

resource actionableAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-vm-cpu-critical'
  location: location
  properties: {
    displayName: 'VM CPU Critical - Actionable'
    description: '''
    ## Alert: VM CPU Critical
    **Impact**: Application performance degradation
    
    ### Immediate Actions:
    1. Check process consuming CPU: `Get-Process | Sort-Object CPU -Descending | Select -First 10`
    2. Review recent deployments in the last 24 hours
    3. Check for runaway processes or memory leaks
    
    ### Escalation:
    - If not resolved in 15 min → Escalate to App Team Lead
    - If widespread → Engage Platform Team
    
    ### Runbook: [VM CPU Troubleshooting](https://wiki.contoso.com/runbooks/vm-cpu)
    '''
    severity: 1
    enabled: true
    evaluationFrequency: 'PT5M'
    scopes: [lawId]
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          query: '''
            Perf
            | where ObjectName == "Processor" and CounterName == "% Processor Time"
            // Average all samples; pre-filtering on CounterValue would bias the average upward
            | summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
            | where AvgCPU > 95
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
        }
      ]
    }
    actions: {
      actionGroups: [actionGroupId]
      customProperties: {
        RunbookURL: 'https://wiki.contoso.com/runbooks/vm-cpu'
        EscalationPath: 'AppTeam → PlatformTeam → OnCall'
        ExpectedRTO: '15 minutes'
        Severity: 'P1'
      }
    }
  }
}

Alert Template with Remediation Context

Alert Field	Purpose	Example
displayName	Clear, actionable title	"VM CPU Critical - Scale Up Required"
description	Step-by-step remediation	Markdown with numbered steps
customProperties.RunbookURL	Link to detailed runbook	Wiki/Confluence link
customProperties.EscalationPath	Who to contact if unresolved	"L1 → L2 → Manager"
customProperties.ExpectedRTO	Resolution timeframe	"15 minutes"
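
A pre-deployment check can confirm that an alert definition carries every field in the table above. The sketch below is a minimal Python illustration that assumes the rule has been flattened into a plain dictionary; that shape and the function name `missing_remediation_fields` are hypothetical and do not correspond to the ARM schema or any Azure SDK.

```python
# Custom properties required by the alert template table above.
REQUIRED_CUSTOM_PROPERTIES = ("RunbookURL", "EscalationPath", "ExpectedRTO")


def missing_remediation_fields(alert_rule: dict) -> list[str]:
    """List the remediation fields that are absent or empty in an alert definition."""
    missing = []
    for field in ("displayName", "description"):
        if not str(alert_rule.get(field, "")).strip():
            missing.append(field)
    custom = alert_rule.get("customProperties", {})
    for key in REQUIRED_CUSTOM_PROPERTIES:
        if not str(custom.get(key, "")).strip():
            missing.append(f"customProperties.{key}")
    return missing


if __name__ == "__main__":
    rule = {
        "displayName": "VM CPU Critical - Scale Up Required",
        "description": "1. Check top processes\n2. Review recent deployments",
        "customProperties": {
            "RunbookURL": "https://wiki.contoso.com/runbooks/vm-cpu",
            "EscalationPath": "L1 -> L2 -> Manager",
        },
    }
    print(missing_remediation_fields(rule))
```

A check like this could run in a pipeline before deployment so that alerts without remediation context are caught early.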

Action Group with Rich Notifications

resource richActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: 'ag-rich-notifications'
  location: 'global'
  properties: {
    groupShortName: 'RichNotify'
    enabled: true
    emailReceivers: [
      {
        name: 'ops-team'
        emailAddress: 'ops@contoso.com'
        useCommonAlertSchema: true  // Enables rich formatting
      }
    ]
    webhookReceivers: [
      {
        name: 'teams-webhook'
        serviceUri: teamsWebhookUrl
        useCommonAlertSchema: true
        useAadAuth: false
      }
    ]
    logicAppReceivers: [
      {
        name: 'enrichment-logic-app'
        resourceId: enrichmentLogicAppId
        callbackUrl: enrichmentLogicAppCallbackUrl
        useCommonAlertSchema: true
      }
    ]
  }
}

2.5 Alert Suppression During Maintenance

Use Alert Processing Rules to suppress alerts during planned maintenance:

// Suppress notifications for the target resource group during a planned maintenance window
resource maintenanceSuppressionRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'suppress-${maintenanceWindow.name}'
  location: 'Global'
  properties: {
    scopes: [
      resourceGroup().id
    ]
    conditions: [
      {
        field: 'TargetResourceGroup'
        operator: 'Equals'
        values: [maintenanceWindow.targetResourceGroup]
      }
    ]
    schedule: {
      effectiveFrom: maintenanceWindow.startTime
      effectiveUntil: maintenanceWindow.endTime
      timeZone: 'UTC'
    }
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    enabled: true
  }
}

2.6 Alert Suppression & Correlation

Common Challenge: "Avoiding alert storms, correlating related alerts"

Alert Storm Prevention

Technique	Description	Implementation
Aggregation	Group similar alerts into single notification	Use `muteActionsDuration` in alert rules
Throttling	Limit notification frequency	Set `autoMitigate: false` + longer evaluation window
Dependency-Based	Suppress child alerts when parent is down	Use Alert Processing Rules with conditions
Smart Grouping	Group by resource, severity, or alert type	Configure in Action Group settings

Alert Aggregation Example

resource aggregatedAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-multiple-vm-cpu'
  location: location
  properties: {
    displayName: 'Multiple VMs High CPU'
    severity: 2
    enabled: true
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    scopes: [lawId]
    criteria: {
      allOf: [
        {
          query: '''
            Perf
            | where ObjectName == "Processor" and CounterName == "% Processor Time"
            | where CounterValue > 90
            | summarize AvgCPU = avg(CounterValue), AffectedVMs = dcount(Computer) by bin(TimeGenerated, 5m)
            | where AffectedVMs >= 3  // Only alert if 3+ VMs affected
          '''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
        }
      ]
    }
    autoMitigate: false  // Muting and auto-resolve cannot be combined on a log alert rule
    muteActionsDuration: 'PT30M'  // Mute for 30 min after firing
    actions: {
      actionGroups: [actionGroupId]
    }
  }
}
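
To reason about how a mute duration reduces notification volume, the effect can be simulated offline. The Python sketch below uses a deliberately simplified model (notify only when no notification went out within the mute period); it is an assumption for illustration, not a description of how Azure Monitor implements muting, and `notifications_sent` is a hypothetical name.

```python
from datetime import datetime, timedelta


def notifications_sent(fired_times: list[datetime], mute: timedelta) -> list[datetime]:
    """Return the evaluation times that would notify under a simple mute model."""
    sent: list[datetime] = []
    for fired_at in sorted(fired_times):
        # Notify only if the previous notification is at least `mute` old.
        if not sent or fired_at - sent[-1] >= mute:
            sent.append(fired_at)
    return sent


if __name__ == "__main__":
    start = datetime(2026, 1, 1, 0, 0)
    # The rule condition is met at every 5-minute evaluation for one hour.
    fired = [start + timedelta(minutes=5 * i) for i in range(12)]
    sent = notifications_sent(fired, timedelta(minutes=30))
    print(f"{len(fired)} evaluations fired, {len(sent)} notifications sent")
```

Under this model, a condition that stays true for an hour produces two notifications with a 30-minute mute instead of twelve.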

Dependency-Based Suppression

Suppress child resource alerts when parent is unhealthy:

// Suppress VM alerts when host is down
resource dependencySuppression 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'suppress-vm-when-host-down'
  location: 'global'
  properties: {
    scopes: [subscription().id]
    conditions: [
      {
        field: 'AlertRuleName'
        operator: 'Contains'
        values: ['vm-']  // All VM alerts
      }
    ]
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    // Only active when host alert is firing (managed via automation)
    enabled: false  // Toggled by Logic App when host alert fires
  }
}

Alert Correlation Query

Identify related alerts for root cause analysis:

// Find correlated alerts within time window
let timeWindow = 15m;
let primaryAlert = "alert-network-latency";
Alerts
| where TimeGenerated > ago(1h)
| where AlertName == primaryAlert
| project PrimaryTime = TimeGenerated, PrimaryResource = ResourceId
| join kind=inner (
    Alerts
    | where TimeGenerated > ago(1h)
    | project AlertName, AlertTime = TimeGenerated, Resource = ResourceId, Severity
) on $left.PrimaryResource == $right.Resource
| where AlertTime between (PrimaryTime - timeWindow .. PrimaryTime + timeWindow)
| where AlertName != primaryAlert
| summarize CorrelatedAlerts = make_set(AlertName), Count = count() by PrimaryResource
| order by Count desc
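
The join logic in the query above can be checked against sample data before it is used in an investigation. This Python sketch mirrors the same rule (same resource, within the time window, excluding the primary alert itself) on in-memory records; the record keys `name`, `resource`, and `time` and the function name `correlate` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(alerts: list[dict], primary_name: str, window: timedelta) -> dict[str, set[str]]:
    """Map each resource to the other alerts that fired near the primary alert."""
    primaries = [a for a in alerts if a["name"] == primary_name]
    related: dict[str, set[str]] = defaultdict(set)
    for primary in primaries:
        for alert in alerts:
            same_resource = alert["resource"] == primary["resource"]
            # Inclusive on both ends, like the KQL `between` operator.
            in_window = abs(alert["time"] - primary["time"]) <= window
            if same_resource and in_window and alert["name"] != primary_name:
                related[primary["resource"]].add(alert["name"])
    return dict(related)


if __name__ == "__main__":
    t0 = datetime(2026, 1, 1, 12, 0)
    sample = [
        {"name": "alert-network-latency", "resource": "vm1", "time": t0},
        {"name": "alert-vm-cpu", "resource": "vm1", "time": t0 + timedelta(minutes=10)},
        {"name": "alert-disk", "resource": "vm1", "time": t0 + timedelta(minutes=40)},
        {"name": "alert-vm-cpu", "resource": "vm2", "time": t0},
    ]
    print(correlate(sample, "alert-network-latency", timedelta(minutes=15)))
```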

Smart Grouping Configuration

Grouping Strategy	Use Case	Configuration
By Resource	One notification per affected resource	`groupByFields: ['resourceId']`
By Alert Type	One notification per alert rule	`groupByFields: ['alertRule']`
By Severity	Critical alerts separate from warnings	`groupByFields: ['severity']`
Combined	Sophisticated grouping	`groupByFields: ['resourceGroup', 'severity']`

3. KQL Query Library

3.1 Application Insights Alert Patterns

Reference: Production patterns from enterprise implementations (Fortune 500 deployments)

These alert patterns target Function Apps and web applications. Treat the thresholds as starting points and tune them for each workload.

HTTP 500 Internal Server Error Alert

// Pattern: Alert on any HTTP 500 errors
resource http500Alert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-http-500-${serviceName}'
  location: 'global'
  properties: {
    description: '${serviceName} Internal server error, http code 500'
    severity: 2
    enabled: true
    scopes: [appInsightsId]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    autoMitigate: true
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'http500'
          metricNamespace: 'microsoft.insights/components'
          metricName: 'requests/count'
          operator: 'GreaterThan'
          threshold: 0
          timeAggregation: 'Count'
          dimensions: [
            {
              name: 'request/resultCode'
              operator: 'Include'
              values: ['500']
            }
          ]
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}

Dependency Throttling (HTTP 429) Alert

// Pattern: Alert on dependency throttling
resource throttlingAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-dependency-throttling-${serviceName}'
  location: 'global'
  properties: {
    description: '${serviceName} dependency call returned http code 429 (throttled)'
    severity: 2
    enabled: true
    scopes: [appInsightsId]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    autoMitigate: true
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'throttling'
          metricNamespace: 'microsoft.insights/components'
          metricName: 'dependencies/failed'
          operator: 'GreaterThan'
          threshold: 0
          timeAggregation: 'Count'
          dimensions: [
            {
              name: 'dependency/type'
              operator: 'Exclude'
              values: ['INPROC']  // Exclude in-process calls
            }
            {
              name: 'dependency/resultCode'
              operator: 'Include'
              values: ['429']
            }
          ]
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}

Server Response Time Exceeded Alert

// Pattern: Alert when response time exceeds threshold
resource responseTimeAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-response-time-${serviceName}'
  location: 'global'
  properties: {
    description: '${serviceName} server response time exceeded 20 seconds'
    severity: 2
    enabled: true
    scopes: [appInsightsId]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    autoMitigate: true
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'responseTime'
          metricNamespace: 'microsoft.insights/components'
          metricName: 'requests/duration'
          operator: 'GreaterThan'
          threshold: 20000  // 20 seconds in milliseconds
          timeAggregation: 'Maximum'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}

Server Exceptions Alert

// Pattern: Alert on unhandled exceptions
resource exceptionsAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-exceptions-${serviceName}'
  location: 'global'
  properties: {
    description: '${serviceName} server exceptions detected'
    severity: 2
    enabled: true
    scopes: [appInsightsId]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    autoMitigate: true
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'exceptions'
          metricNamespace: 'microsoft.insights/components'
          metricName: 'exceptions/server'
          operator: 'GreaterThan'
          threshold: 0
          timeAggregation: 'Count'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}

Dead Lettering Alert (EventGrid)

// Pattern: Alert on EventGrid dead-lettered messages
resource deadLetterAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'alert-deadletter-${systemTopicName}'
  location: 'global'
  properties: {
    description: 'Dead Lettering Events detected on ${systemTopicName}'
    severity: 2
    enabled: true
    scopes: [eventGridSystemTopicId]
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    autoMitigate: false  // Don't auto-mitigate - needs investigation
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          criterionType: 'StaticThresholdCriterion'
          name: 'deadLetter'
          metricNamespace: 'Microsoft.EventGrid/systemTopics'
          metricName: 'DeadLetteredCount'
          operator: 'GreaterThan'
          threshold: 0
          timeAggregation: 'Total'
        }
      ]
    }
    actions: [
      {
        actionGroupId: actionGroupId
      }
    ]
  }
}

Alert Patterns Summary Table

Alert Pattern	Metric	Threshold	AutoMitigate	Severity
HTTP 500 Errors	requests/count (resultCode=500)	> 0	Yes	2
Dependency Throttling (429)	dependencies/failed	> 0	Yes	2
Response Time	requests/duration	> 20000ms	Yes	2
Server Exceptions	exceptions/server	> 0	Yes	2
Dead Lettering	DeadLetteredCount	> 0	No	2
Storage Timeout	Transactions (ServerTimeoutError)	> 5	No	2

3.2 Resource Health Queries

All Resource Health Events (Last 24h)

AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue == "ResourceHealth"
| project TimeGenerated, ResourceGroup, Resource, Level, 
          Status = Properties_d.statusCode, 
          Message = Properties_d.statusMessage
| order by TimeGenerated desc

VM Availability Summary

Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer, ResourceGroup
| extend Status = iff(LastHeartbeat < ago(5m), "Offline", "Online")
| summarize 
    TotalVMs = count(),
    OnlineVMs = countif(Status == "Online"),
    OfflineVMs = countif(Status == "Offline")
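
The Online/Offline rule in the query above is a threshold on the most recent heartbeat. As a minimal Python sketch of the same classification (the function name `summarize_availability` and the input shape are hypothetical), this can be used to sanity-check the 5-minute cutoff against exported data:

```python
from datetime import datetime, timedelta, timezone


def summarize_availability(
    last_heartbeats: dict[str, datetime],
    now: datetime,
    offline_after: timedelta = timedelta(minutes=5),
) -> dict[str, int]:
    """Count machines as Online or Offline from their most recent heartbeat."""
    # A machine is Offline when its last heartbeat is older than the cutoff.
    offline = sum(1 for ts in last_heartbeats.values() if now - ts > offline_after)
    total = len(last_heartbeats)
    return {"TotalVMs": total, "OnlineVMs": total - offline, "OfflineVMs": offline}


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
    beats = {
        "vm-a": now - timedelta(minutes=1),
        "vm-b": now - timedelta(minutes=20),
    }
    print(summarize_availability(beats, now))
```

Like the query, this only sees machines that reported at least once in the input; a machine with no heartbeat at all in the lookback period does not appear in either count.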

Resource Availability by Type

AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName == "Availability"
| summarize 
    AvgAvailability = avg(Average),
    MinAvailability = min(Minimum)
    by ResourceProvider, Resource, bin(TimeGenerated, 1h)
| where AvgAvailability < 100
| order by AvgAvailability asc

3.2 Performance Queries

Top 10 CPU Consumers

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer
| top 10 by AvgCPU desc

Memory Pressure Analysis

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Memory" 
| where CounterName in ("% Committed Bytes In Use", "Available MBytes")
| summarize Value = avg(CounterValue) by Computer, CounterName, bin(TimeGenerated, 5m)
| evaluate pivot(CounterName, take_any(Value))
| project TimeGenerated, Computer, 
          MemoryUsedPct = ['% Committed Bytes In Use'],
          AvailableMB = ['Available MBytes']
| where MemoryUsedPct > 80

Disk I/O Performance

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "LogicalDisk"
| where CounterName in ("Disk Reads/sec", "Disk Writes/sec", "Avg. Disk sec/Read", "Avg. Disk sec/Write")
| where InstanceName != "_Total"
| summarize Value = avg(CounterValue) by Computer, InstanceName, CounterName, bin(TimeGenerated, 5m)
| evaluate pivot(CounterName, take_any(Value))

3.3 Security Queries

Failed Sign-In Attempts

SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != "0"  // Non-successful
| summarize 
    FailedAttempts = count(),
    DistinctIPs = dcount(IPAddress),
    LastAttempt = max(TimeGenerated)
    by UserPrincipalName, ResultDescription
| where FailedAttempts > 5
| order by FailedAttempts desc

Security Alert Summary

SecurityAlert
| where TimeGenerated > ago(7d)
| summarize 
    AlertCount = count(),
    Providers = make_set(ProviderName)
    by AlertSeverity, AlertName
| order by case(
    AlertSeverity == "High", 1,
    AlertSeverity == "Medium", 2,
    AlertSeverity == "Low", 3,
    4
) asc  // order by defaults to descending; ascending lists High severity first

Privileged Operations

AzureActivity
| where TimeGenerated > ago(24h)
| where tostring(Authorization_d.action) has_any ("Microsoft.Authorization/roleAssignments", "Microsoft.Authorization/roleDefinitions")
| project TimeGenerated, Caller, OperationName, 
          Action = Authorization_d.action,
          ResourceGroup, Resource
| order by TimeGenerated desc

3.4 Ingestion Health Queries

Data Ingestion Volume by Table

Usage
| where TimeGenerated > ago(1d)
| summarize 
    IngestionGB = sum(Quantity) / 1024,
    BillableGB = sumif(Quantity, IsBillable) / 1024
    by DataType
| order by IngestionGB desc
| take 20

Ingestion Latency Analysis

// Check time between event generation (TimeGenerated) and ingestion (ingestion_time())
AzureActivity
| where TimeGenerated > ago(1h)
| extend IngestionDelay = ingestion_time() - TimeGenerated
| summarize 
    AvgLatency = avg(IngestionDelay),
    P95Latency = percentile(IngestionDelay, 95),
    MaxLatency = max(IngestionDelay)
    by bin(TimeGenerated, 5m)

Missing Data Detection

// Detect gaps in heartbeat data
Heartbeat
| where TimeGenerated > ago(24h)
| summarize HeartbeatCount = count() by Computer, bin(TimeGenerated, 15m)
| where HeartbeatCount < 3  // Agents normally send about one heartbeat per minute (~15 per 15 min); fewer than 3 signals a gap
| order by TimeGenerated desc

3.5 Cost Analysis Queries

Daily Ingestion Cost Estimate

// Note: Pricing varies by region. Check https://azure.microsoft.com/pricing/details/monitor/
let pricePerGB = 2.30;  // Example: West Europe Pay-As-You-Go (verify current pricing)
Usage
| where TimeGenerated > startofday(ago(30d))
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1d)
| extend EstimatedCost = DailyGB * pricePerGB
| order by TimeGenerated desc
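
The arithmetic behind the estimate is a unit conversion followed by a multiplication: the `Usage` table reports `Quantity` in MB, so the query divides by 1024 before applying the per-GB price. A worked Python sketch follows, using the example price from the query as a placeholder rather than a quoted rate; `estimated_cost` is a hypothetical helper name.

```python
def estimated_cost(quantity_mb: float, price_per_gb: float = 2.30) -> float:
    """Convert a billable volume in MB to GB and apply a per-GB price."""
    volume_gb = quantity_mb / 1024
    return round(volume_gb * price_per_gb, 2)


if __name__ == "__main__":
    # 51,200 MB of billable ingestion in one day is 50 GB.
    daily = estimated_cost(51_200)
    print(f"Daily estimate: {daily}")
    # Rough monthly projection assuming the same volume every day for 30 days.
    print(f"30-day projection: {round(daily * 30, 2)}")
```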

Top Data Contributors

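let pricePerGB = 2.30;  // Example price, matching the daily estimate query (verify current pricing)
Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024 by DataType
| order by TotalGB desc
| take 10
| extend CostEstimate = TotalGB * pricePerGB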

Retention vs Ingestion Analysis

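// Note: Usage records ingested volume only and carries no retention data; this query
// compares billable with non-billable volume per table. 'LAW-Name' is a placeholder.
Usage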
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024 by DataType
| join kind=inner (
    workspace('LAW-Name').Usage
    | where TimeGenerated > ago(30d)
    | where IsBillable == false
    | summarize RetainedGB = sum(Quantity) / 1024 by DataType
) on DataType
| project DataType, TotalGB, RetainedGB, 
          RetentionRatio = RetainedGB / TotalGB

3.6 Alert Queries (For Custom Alerts)

Service Health Impact Assessment

ServiceHealth
| where TimeGenerated > ago(7d)
| where Status == "Active"
| project TimeGenerated, Service, Region, 
          ImpactType, Title, Summary,
          AffectedResources = Properties_d.affectedResources
| order by TimeGenerated desc

Resource Creation/Deletion Tracking

AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue has_any ("Microsoft.Resources/deployments", "delete")
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, OperationName, 
          ResourceGroup, Resource, SubscriptionId
| order by TimeGenerated desc

3.7 Data Enrichment Logic

Blueprint Requirement: "Define enrichment fields including business context, ownership, criticality. Implement enrichment logic via lookup tables, functions. Automate tagging with tag inheritance and policy-based tagging."

3.7.1 Business Context Enrichment Overview

Data enrichment adds business context to raw telemetry data, enabling better filtering, routing, and analysis.

3.7.2 Resource Ownership Lookup Table

Create a lookup table for resource-to-owner mapping:

// Create externaldata lookup for resource ownership
let ResourceOwners = externaldata(
    ResourceId: string,
    SubscriptionName: string,
    LandingZone: string,
    CostCenter: string,
    BusinessUnit: string,
    ApplicationName: string,
    Owner: string,
    OwnerEmail: string,
    Criticality: string,
    Environment: string
) [
    h@"https://yourstorageaccount.blob.core.windows.net/lookups/resource-owners.csv"
] with (format="csv", ignoreFirstRecord=true);

// Use in queries
AzureMetrics
| extend ResourceIdLower = tolower(_ResourceId)
| join kind=leftouter (
    ResourceOwners | extend ResourceIdLower = tolower(ResourceId)
) on ResourceIdLower
| project TimeGenerated, Resource, MetricName, Average,
          LandingZone, BusinessUnit, Owner, Criticality

3.7.3 Criticality Classification Matrix

Criticality Level	Definition	Alert Routing	Response SLA
P1 - Critical	Revenue-impacting, customer-facing, no redundancy	ServiceNow P1 + Phone	15 min
P2 - High	Business-critical internal, limited redundancy	ServiceNow P2 + Teams	30 min
P3 - Medium	Important but redundant, degraded performance OK	Email + Teams	4 hours
P4 - Low	Development, test, non-critical	Dashboard only	Next business day

// Criticality lookup function
// Uses case() because toscalar() cannot be evaluated per row inside a function
let GetCriticality = (resourceId: string) {
    let resourceLower = tolower(resourceId);
    case(
        resourceLower contains "prd" or resourceLower contains "prod", "P1-Critical",
        resourceLower contains "stg" or resourceLower contains "staging", "P2-High",
        resourceLower contains "uat", "P3-Medium",
        resourceLower contains "dev" or resourceLower contains "test" or resourceLower contains "sandbox", "P4-Low",
        ""  // No pattern matched
    )
};

3.7.4 Tag Inheritance via Azure Policy

Ensure resources inherit tags from their resource group for consistent enrichment:

// Azure Policy: Inherit tags from resource group
resource tagInheritancePolicy 'Microsoft.Authorization/policyDefinitions@2021-06-01' = {
  name: 'inherit-tags-from-rg'
  properties: {
    policyType: 'Custom'
    mode: 'Indexed'
    displayName: 'Inherit tags from resource group'
    parameters: {
      tagName: {
        type: 'String'
        metadata: {
          displayName: 'Tag Name'
          description: 'Name of the tag to inherit'
        }
      }
    }
    policyRule: {
      if: {
        allOf: [
          {
            field: '[concat(\'tags[\', parameters(\'tagName\'), \']\')]'
            exists: 'false'
          }
          {
            value: '[resourceGroup().tags[parameters(\'tagName\')]]'
            notEquals: ''
          }
        ]
      }
      then: {
        effect: 'modify'
        details: {
          roleDefinitionIds: [
            '/providers/microsoft.authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c'
          ]
          operations: [
            {
              operation: 'add'
              field: '[concat(\'tags[\', parameters(\'tagName\'), \']\')]'
              value: '[resourceGroup().tags[parameters(\'tagName\')]]'
            }
          ]
        }
      }
    }
  }
}

3.7.5 Required Tags for Enrichment

Tag Name	Purpose	Example Value	Mandatory
`Environment`	Env classification	`Production`, `Staging`, `Dev`	✅ Yes
`CostCenter`	Finance allocation	`CC-12345`	✅ Yes
`Owner`	Responsible team/person	`platform-team@company.com`	✅ Yes
`ApplicationName`	Application identifier	`data-mesh-api`	✅ Yes
`Criticality`	Business criticality	`P1-Critical`, `P2-High`	✅ Yes
`LandingZone`	ALZ identifier	`lz-aiml-prod`	✅ Yes
`DataClassification`	Data sensitivity	`Confidential`, `Internal`	⚠️ If applicable

3.7.6 DCR Transformation for Enrichment at Ingestion

// Add enrichment fields at ingestion time using DCR transformation
dataFlows: [
  {
    streams: ['Microsoft-Perf']
    destinations: ['centralWorkspace']
    transformKql: '''
      source
      | extend 
          // Custom columns added to a built-in table such as Perf must end in _CF
          // Extract environment from computer name convention
          Environment_CF = case(
            Computer startswith "prd-" or Computer contains "-prod-", "Production",
            Computer startswith "stg-" or Computer contains "-staging-", "Staging",
            Computer startswith "uat-" or Computer contains "-uat-", "UAT",
            Computer startswith "dev-" or Computer contains "-dev-", "Development",
            "Unknown"
          ),
          // Derive criticality from environment
          Criticality_CF = case(
            Computer startswith "prd-" or Computer contains "-prod-", "P1-Critical",
            Computer startswith "stg-" or Computer contains "-staging-", "P2-High",
            Computer startswith "uat-", "P3-Medium",
            "P4-Low"
          ),
          // Extract landing zone from resource group naming
          LandingZone_CF = extract(@"rg-([a-z]+)-", 1, _ResourceId)
    '''
    outputStream: 'Microsoft-Perf'
  }
]
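
Because a transformation runs at ingestion time, mistakes in the naming rules are easiest to catch before deployment. The Python sketch below mirrors the two `case()` expressions above so they can be exercised against a list of real computer names; `classify_computer` is a hypothetical helper, and names are lowercased because the KQL `startswith` and `contains` operators are case-insensitive.

```python
def classify_computer(computer: str) -> tuple[str, str]:
    """Return (environment, criticality) using the naming rules from the transformation."""
    name = computer.lower()

    # Environment: prefix or infix match, first rule wins.
    if name.startswith("prd-") or "-prod-" in name:
        environment = "Production"
    elif name.startswith("stg-") or "-staging-" in name:
        environment = "Staging"
    elif name.startswith("uat-") or "-uat-" in name:
        environment = "UAT"
    elif name.startswith("dev-") or "-dev-" in name:
        environment = "Development"
    else:
        environment = "Unknown"

    # Criticality: same order, but the UAT rule checks the prefix only.
    if name.startswith("prd-") or "-prod-" in name:
        criticality = "P1-Critical"
    elif name.startswith("stg-") or "-staging-" in name:
        criticality = "P2-High"
    elif name.startswith("uat-"):
        criticality = "P3-Medium"
    else:
        criticality = "P4-Low"

    return environment, criticality


if __name__ == "__main__":
    for host in ("prd-web-01", "app-staging-02", "app-uat-01", "sql01"):
        print(host, classify_computer(host))
```

Note that a name such as `app-uat-01` is classified as UAT but falls through to P4-Low, because the criticality rule checks only the `uat-` prefix. It is worth confirming whether that asymmetry is intended.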

4. Federated Visibility Operations

This section covers operational procedures for the 60/40 federated visibility model where the platform team has visibility into both centralized and landing zone-dedicated workspaces.

📖 Architecture Reference: See 01-architecture-overview.md, Section 3.4 for detailed architecture diagrams.

4.1 Accessing Landing Zone Dashboards

The platform team can access LZ-dedicated workbooks and dashboards through shared access.

Method 1: Shared Workbook Gallery

# List workbooks shared with the platform team
az monitor app-insights workbook list \
  --resource-group "rg-monitoring-platform" \
  --category "workbook" \
  --source-id "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<law>"

Method 2: Cross-Workspace KQL Queries

Query data from LZ-dedicated workspaces without copying data:

// Query alerts across all Landing Zone workspaces
let lz1_alerts = workspace("law-lz-team1-prod").Alert;
let lz2_alerts = workspace("law-lz-team2-prod").Alert;
let lz3_alerts = workspace("law-lz-team3-prod").Alert;

union lz1_alerts, lz2_alerts, lz3_alerts
| where TimeGenerated > ago(24h)
| summarize AlertCount = count() by AlertName, bin(TimeGenerated, 1h)
| order by AlertCount desc
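
As the number of landing zones grows, maintaining one `let` statement per workspace by hand becomes error-prone. A small generator can build the same union from a list of workspace names. This is a minimal Python sketch; `build_cross_workspace_query` is a hypothetical helper that only produces query text and does not call any Azure API.

```python
def build_cross_workspace_query(
    workspaces: list[str], table: str = "Alert", lookback: str = "24h"
) -> str:
    """Build a KQL union over the same table in several workspaces."""
    if not workspaces:
        raise ValueError("At least one workspace name is required")
    sources = ", ".join(f'workspace("{name}").{table}' for name in workspaces)
    return (
        f"union {sources}\n"
        f"| where TimeGenerated > ago({lookback})\n"
        "| summarize AlertCount = count() by AlertName, bin(TimeGenerated, 1h)\n"
        "| order by AlertCount desc"
    )


if __name__ == "__main__":
    names = ["law-lz-team1-prod", "law-lz-team2-prod", "law-lz-team3-prod"]
    print(build_cross_workspace_query(names))
```

The workspace list could come from the Resource Graph query in Method 3, which keeps the query in step with the landing zones that actually exist.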

Method 3: Azure Resource Graph for Metadata

// List all Landing Zone LAWs and their properties
resources
| where type == "microsoft.operationalinsights/workspaces"
| where resourceGroup contains "lz-"
| project 
    name, 
    resourceGroup, 
    subscriptionId,
    sku = properties.sku.name,
    retentionInDays = properties.retentionInDays,
    dailyQuotaGb = properties.workspaceCapping.dailyQuotaGb

// Dashboard with shared access for Platform Team
resource sharedDashboard 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
  name: 'dash-lz-team1-shared'
  location: location
  properties: {
    lenses: [
      // Dashboard tiles
    ]
  }
}

// Role assignment for Platform Team
resource dashboardAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(sharedDashboard.id, platformTeamGroupId, 'Reader')
  scope: sharedDashboard
  properties: {
    principalId: platformTeamGroupId
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'acdd72a7-3385-48ef-bd42-f606fba81ae7') // Reader
    principalType: 'Group'
  }
}

For Platform Team (Access Shared Dashboards)

Access Method	RBAC Role	Scope
View LZ Dashboard	`Reader`	Dashboard resource
Query LZ LAW	`Log Analytics Reader`	LAW resource
View LZ Workbooks	`Workbook Reader`	Workbook resource

4.3 Monitoring Landing Zone Health (Without Ownership)

The platform team monitors LZ health through aggregated views, not individual alert management.

Health Overview Workbook Query

// LZ Health Score by Team
// Resource IDs are lowercased before matching, so the pattern uses lowercase "resourcegroups"
let resourceHealth = 
    AzureActivity
    | where CategoryValue == "ResourceHealth"
    | extend LandingZone = extract(@"/subscriptions/[^/]+/resourcegroups/rg-([^-]+)-", 1, tolower(_ResourceId))
    | summarize 
        HealthyCount = countif(ActivityStatusValue == "Succeeded"),
        UnhealthyCount = countif(ActivityStatusValue != "Succeeded")
        by LandingZone;

// AlertsManagementResources is an Azure Resource Graph table; arg("") queries it from Log Analytics
let alertCount =
    arg("").AlertsManagementResources
    | where type == "microsoft.alertsmanagement/alerts"
    | extend LandingZone = extract(@"/subscriptions/[^/]+/resourcegroups/rg-([^-]+)-", 1, tolower(tostring(properties.essentials.targetResource)))
    | where properties.essentials.monitorCondition == "Fired"
    | summarize ActiveAlerts = count() by LandingZone;

resourceHealth
| join kind=leftouter alertCount on LandingZone
| extend HealthScore = round(100.0 * HealthyCount / (HealthyCount + UnhealthyCount), 1)
| project LandingZone, HealthScore, HealthyCount, UnhealthyCount, ActiveAlerts
| order by HealthScore asc
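
The health score in the query above is the share of healthy events, expressed as a percentage and rounded to one decimal. A minimal Python sketch of the same formula follows; the helper names are hypothetical, and returning `None` for a landing zone with no events is an assumption added here to avoid dividing by zero.

```python
from typing import Optional


def health_score(healthy: int, unhealthy: int) -> Optional[float]:
    """Return the percentage of healthy events, or None when there are no events."""
    total = healthy + unhealthy
    if total == 0:
        return None
    return round(100.0 * healthy / total, 1)


def rank_landing_zones(counts: dict[str, tuple[int, int]]) -> list[tuple[str, float]]:
    """Sort landing zones by health score, lowest first, skipping those without data."""
    scored = []
    for zone, (healthy, unhealthy) in counts.items():
        score = health_score(healthy, unhealthy)
        if score is not None:
            scored.append((zone, score))
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    sample = {"team1": (45, 5), "team2": (2, 1), "team3": (0, 0)}
    print(rank_landing_zones(sample))
```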

4.4 Escalation from LZ Dashboards

When the platform team identifies issues through shared dashboards:

Observation	Action	Owner
High alert count on LZ dashboard	Contact LZ team owner	Platform Team
Resource health degraded	Check service health, notify LZ	Platform Team
Cost anomaly detected	Review with LZ team	Platform Team
Security event	Escalate to Security team	Platform Team

5. Maintenance Window Management

5.1 Planned Maintenance Process

5.2 Maintenance Window Bicep Template

@description('Maintenance window configuration')
param maintenanceConfig object = {
  name: 'scheduled-maintenance'
  // Schedule date-times carry no zone suffix; the zone comes from schedule.timeZone below
  startTime: '2024-01-15T22:00:00'
  endTime: '2024-01-16T02:00:00'
  targetScopes: [
    '/subscriptions/<sub-id>/resourceGroups/<rg-name>'
  ]
  suppressSeverities: ['Sev2', 'Sev3', 'Sev4']
}

resource maintenanceRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'maint-${maintenanceConfig.name}'
  location: 'Global'
  properties: {
    scopes: maintenanceConfig.targetScopes
    conditions: [
      {
        field: 'Severity'
        operator: 'Equals'
        values: maintenanceConfig.suppressSeverities
      }
    ]
    schedule: {
      effectiveFrom: maintenanceConfig.startTime
      effectiveUntil: maintenanceConfig.endTime
      timeZone: 'UTC'
    }
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    enabled: true
    description: 'Maintenance window: ${maintenanceConfig.name}'
  }
}
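
A quick pre-flight check can catch an inverted or unexpectedly long window before the rule is deployed. The sketch below assumes GNU `date` and reads both timestamps as UTC, matching `timeZone: 'UTC'` in the template. `to_epoch` and `window_hours` are hypothetical helper names, not part of any Azure tooling.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for a maintenance window (assumes GNU date).
# Accepts timestamps with or without a trailing Z and treats them as UTC.
to_epoch() {
  local ts
  ts=$(printf '%s' "${1%Z}" | tr 'T' ' ')   # "2024-01-15T22:00:00" -> "2024-01-15 22:00:00"
  date -u -d "$ts" +%s
}

# Prints the window length in whole hours; fails if the end is not after the start.
window_hours() {
  local start end
  start=$(to_epoch "$1") || return 2
  end=$(to_epoch "$2") || return 2
  if [ "$end" -le "$start" ]; then
    echo "error: end time must be after start time" >&2
    return 1
  fi
  echo $(( (end - start) / 3600 ))
}

window_hours '2024-01-15T22:00:00' '2024-01-16T02:00:00'   # prints 4
```

Running this against the values in `maintenanceConfig` before deployment gives a cheap sanity check that the suppression window is the length the change request approved.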

5.3 Emergency Maintenance (Ad-hoc)

For emergency maintenance windows:

# Create emergency suppression rule (4-hour window starting now)
$ruleName = "emergency-maint-$(Get-Date -Format 'yyyyMMddHHmm')"
$now = (Get-Date).ToUniversalTime()
$body = @{
    location   = "Global"
    properties = @{
        scopes   = @("/subscriptions/<sub-id>/resourceGroups/<target-rg>")
        schedule = @{
            # Date-times carry no zone suffix; the zone is given by timeZone
            effectiveFrom  = $now.ToString("yyyy-MM-ddTHH:mm:ss")
            effectiveUntil = $now.AddHours(4).ToString("yyyy-MM-ddTHH:mm:ss")
            timeZone       = "UTC"
        }
        actions  = @(@{ actionType = "RemoveAllActionGroups" })
        enabled  = $true
    }
}

# Write the request body that az rest reads below
$body | ConvertTo-Json -Depth 10 | Set-Content -Path emergency-rule.json

# Submit the rule through the ARM REST API
az rest --method PUT --uri "/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.AlertsManagement/actionRules/$($ruleName)?api-version=2023-05-01-preview" --body "@emergency-rule.json"

6. Troubleshooting Guide

6.1 Common Issues Decision Tree

6.2 Data Collection Troubleshooting

Check Azure Monitor Agent Status

# Windows VM - Check AMA service
Get-Service -Name 'AzureMonitorAgent' | Select-Object Name, Status, StartType

# Check AMA logs
Get-WinEvent -LogName 'Microsoft-AzureMonitor-Agent/Operational' -MaxEvents 50

# Linux VM - Check AMA service
systemctl status azuremonitoragent
journalctl -u azuremonitoragent --since "1 hour ago"

Verify DCR Assignment

# List DCR associations for a VM
az monitor data-collection rule association list --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>"

Test DCE Connectivity

# From the VM, test DCE endpoint
$dceEndpoint = "<dce-name>.<region>.ingest.monitor.azure.com"
Test-NetConnection -ComputerName $dceEndpoint -Port 443

# Check if private endpoint is resolving correctly
Resolve-DnsName $dceEndpoint

6.3 Alert Troubleshooting

Check Alert Rule Status

# Get alert rule details
az monitor scheduled-query show --name "<alert-name>" --resource-group "<rg-name>"

# Check alert evaluation history
az monitor activity-log list --resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Insights/scheduledqueryrules/<alert-name>" --offset 24h

Test Alert Query Manually

// Run the alert query directly in Log Analytics.
// If it returns rows, the alert should fire; if not, check the threshold and the lookback window.

// Example: copy your alert query and run it
Heartbeat
| where TimeGenerated > ago(1h)   // lookback must be longer than the 5-minute staleness threshold
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
// If this returns rows, alert should fire
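
The logic of the heartbeat check above (take the latest heartbeat per computer, then keep only those older than the threshold) can be reproduced offline against exported data. This is a minimal sketch, assuming input lines of the form `<computer> <epoch-seconds>`; `stale_computers` is a hypothetical name and the sample values are illustrative, not real telemetry.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "<computer> <heartbeat-epoch-seconds>" lines on stdin.
# $1 = current time in epoch seconds, $2 = max age in seconds (default 300 = 5 minutes).
stale_computers() {
  awk -v now="$1" -v max_age="${2:-300}" '
    # Equivalent of: summarize LastHeartbeat = max(TimeGenerated) by Computer
    { if (!($1 in last) || ($2 + 0) > last[$1]) last[$1] = $2 + 0 }
    # Equivalent of: where LastHeartbeat < ago(5m)
    END { for (c in last) if (now - last[c] > max_age) print c }
  ' | sort
}

# vm-a last reported 10 s before "now"; vm-b last reported 350 s before "now"
printf '%s\n' 'vm-a 1000' 'vm-a 1290' 'vm-b 900' 'vm-b 950' | stale_computers 1300
# prints vm-b
```

If the offline check flags a computer that the alert rule missed, the usual suspects are the rule's lookback window or evaluation frequency rather than the query logic.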

Check Action Group Notifications

# Get action group details
az monitor action-group show --name "<ag-name>" --resource-group "<rg-name>"

# Test action group (sends a sample notification to the listed receiver)
az monitor action-group test-notifications create --action-group "<ag-name>" --resource-group "<rg-name>" --alert-type "metricstaticthreshold" --add-action email "<receiver-name>" "<email-address>"

6.4 Log Analytics Workspace Issues

Check Workspace Health

// Check for ingestion issues
_LogOperation
| where TimeGenerated > ago(1h)
| where Level != "Info"
| project TimeGenerated, Operation, Detail, _ResourceId
| order by TimeGenerated desc

Verify Table Schema

// Get schema for a specific table
<TableName>
| getschema

Check Query Performance

// Identify slow queries
LAQueryLogs
| where TimeGenerated > ago(24h)
| where ResponseDurationMs > 30000  // > 30 seconds
| project TimeGenerated, RequestClientApp, QueryText, ResponseRowCount, ResponseDurationMs
| order by ResponseDurationMs desc

6.5 Private Link Troubleshooting

Private Link Connectivity Check

# Check that DNS resolution returns a private IP
Resolve-DnsName "<law-name>.ods.opinsights.azure.com"
# Should return a private IP from your VNet address space (for example 10.x.x.x)

# Check Private DNS Zone
az network private-dns record-set a list --zone-name "privatelink.ods.opinsights.azure.com" --resource-group "<dns-rg>"

# Verify Private Endpoint
az network private-endpoint show --name "<pe-name>" --resource-group "<rg-name>" --query "customDnsConfigs"
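
On Linux VMs the same DNS check can be scripted. The sketch below classifies a resolved IPv4 address as private or public using the RFC 1918 ranges. `is_private_ipv4` is a hypothetical helper, and the commented usage lines assume `getent` is available on the VM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the IPv4 address falls in an RFC 1918 private range.
is_private_ipv4() {
  case "$1" in
    10.*)                                    return 0 ;;  # 10.0.0.0/8
    172.1[6-9].*|172.2[0-9].*|172.3[01].*)   return 0 ;;  # 172.16.0.0/12
    192.168.*)                               return 0 ;;  # 192.168.0.0/16
    *)                                       return 1 ;;
  esac
}

# Usage sketch (run on the VM; replace the placeholder with the workspace name):
#   ip=$(getent hosts "<law-name>.ods.opinsights.azure.com" | awk '{print $1; exit}')
#   if is_private_ipv4 "$ip"; then echo "resolves to private endpoint"; else echo "resolves to public IP"; fi
```

A public address here usually points to a missing or unlinked Private DNS zone, which is what the two `az network` commands above help confirm.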

7. RBAC Operations

7.1 Access Review Process

7.2 Generate Access Report

# Export all role assignments for LAW, including those inherited from parent scopes
$lawId = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<law-name>"

# --all cannot be combined with --scope; --include-inherited adds assignments from parent scopes
az role assignment list --scope $lawId --include-inherited --output table > law-access-report.txt

# Get principals with the Log Analytics Reader role
az role assignment list --scope $lawId --role "Log Analytics Reader" --output json | ConvertFrom-Json | Select-Object principalName, principalType

7.3 Common RBAC Scenarios

Grant Landing Zone Team Access

// Reference: See knowledge-base/data-mesh-monitoring-reference.md for existing patterns

resource lawRbac 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(logAnalyticsWorkspace.id, landingZoneGroupId, 'Log Analytics Reader')
  scope: logAnalyticsWorkspace
  properties: {
    principalId: landingZoneGroupId
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '73c42c96-874c-492b-b04d-ab87d138a893') // Log Analytics Reader
    principalType: 'Group'
  }
}

Table-Level RBAC

# Set total retention for a specific table (this command does not grant access)
az monitor log-analytics workspace table update \
    --workspace-name "<law-name>" \
    --resource-group "<rg>" \
    --name "SecurityEvent" \
    --total-retention-time 90

# Note: Table-level RBAC requires custom role definitions
# See 01-architecture-overview.md Section 3.3 for custom role examples

8. Cost Monitoring & Optimization

8.1 Cost Dashboard Query

// Daily ingestion cost estimate with running total (last 90 days)
Usage
| where TimeGenerated > ago(90d)
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1d)  // Quantity is reported in MB
| extend CostUSD = DailyGB * 2.76  // Adjust for your region pricing
| order by TimeGenerated asc
| extend RunningTotal = row_cumsum(CostUSD)
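
The cost estimate in the query is a simple conversion from megabytes to gigabytes multiplied by a per-GB rate. The sketch below repeats that arithmetic for quick what-if checks. `estimate_cost_usd` is a hypothetical helper, and the default of 2.76 USD per GB is simply the placeholder rate from the query, not a quoted price.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimated ingestion cost in USD.
# $1 = ingested volume in MB (the unit of Usage.Quantity)
# $2 = price per GB in USD (defaults to the 2.76 placeholder used in the query)
estimate_cost_usd() {
  awk -v mb="$1" -v price="${2:-2.76}" 'BEGIN { printf "%.2f\n", (mb / 1024) * price }'
}

estimate_cost_usd 51200      # 50 GB at 2.76 USD/GB -> prints 138.00
estimate_cost_usd 1024 3     # 1 GB at 3 USD/GB     -> prints 3.00
```

Passing the regional rate as the second argument keeps the estimate aligned with whatever value replaces 2.76 in the dashboard query.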

8.2 Optimization Opportunities

// Find tables with low query activity but high ingestion
let queriedTables = LAQueryLogs
| where TimeGenerated > ago(30d)
| extend TableName = extract(@"([\w]+)\s*\|", 1, QueryText)
| summarize QueryCount = count() by TableName;

Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| join kind=leftouter queriedTables on $left.DataType == $right.TableName
| where isnull(QueryCount) or QueryCount < 10
| where IngestedGB > 1  // More than 1GB
| project DataType, IngestedGB, QueryCount = coalesce(QueryCount, 0)
| order by IngestedGB desc

8.3 Cost Reduction Actions

Scenario	Action	Estimated Savings
Unused tables	Disable collection	100% of table cost
Verbose logging	Apply sampling in DCR	50-90%
Long retention	Reduce to 30 days	Variable
Infrequent queries	Archive tier	60-80%
Duplicate data	Consolidate DCRs	Variable

8.4 DCR Transformation Quick Reference

Microsoft Best Practice: Use transformations to filter or modify incoming data before it's sent to Log Analytics to reduce ingestion costs.

📖 Full Deep-Dive: See 03-advanced-topics.md - Section 4. Cost Optimization for complete Bicep examples and implementation patterns.

Technique	Typical Savings	Quick Example
Filter Rows	50-90%	`source \| where SeverityLevel in ("err", "crit")`
Filter Columns	20-40%	`source \| project-away ParameterXml, UserData`
Aggregate Data	60-80%	`source \| summarize avg(CounterValue) by bin(TimeGenerated, 5m)`
Route to Basic Logs	80%	Route verbose logs to Basic Logs table
Mask PII	0% (compliance)	`replace_regex(Email, @"[a-z]+@", "***@")`
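
Before applying a row filter, it can help to estimate how many records it would drop. The sketch below mimics the `where SeverityLevel in ("err", "crit")` filter on a list of severity values and reports the percentage of rows removed. `filter_reduction_pct` is a hypothetical helper and the sample input is illustrative. Note that billing is by ingested volume, so the row percentage is only a rough proxy for cost savings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads one severity level per line on stdin and prints the
# percentage of rows that a filter keeping only err/crit would drop.
filter_reduction_pct() {
  awk '
    { total++ }
    /^(err|crit)$/ { kept++ }
    END {
      if (total == 0) { print 0; exit }
      printf "%d\n", (total - kept) * 100 / total
    }
  '
}

# Ten illustrative records, two of which match the filter
printf '%s\n' info info err debug crit info notice info warning info | filter_reduction_pct
# prints 80
```

Feeding the helper a real export of the severity column (for example from a short sample query) gives a data-driven figure to compare against the typical savings in the table.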

Reference: Microsoft Transformation Samples

9. Runbook Templates

9.1 New Landing Zone Onboarding

## Runbook: Onboard New Landing Zone to UMS

### Prerequisites
- [ ] Landing Zone deployed
- [ ] Service Principal created
- [ ] Network connectivity established

### Steps

1. **Register Landing Zone in Central LAW**
   ```powershell
   # Parameters
   $landingZoneName = "<lz-name>"
   $landingZoneSubId = "<sub-id>"
   $centralLawId = "<central-law-resource-id>"
   
   # Apply baseline DCR (a resource group is required for group-scoped deployments)
   az deployment group create --subscription $landingZoneSubId --resource-group "<rg-name>" --template-file dcr-baseline.bicep --parameters landingZone=$landingZoneName
   ```

2. **Assign RBAC**
   - Log Analytics Reader: LZ-Team-Group
   - Log Analytics Contributor: LZ-Admin-Group

3. **Deploy AMBA Policies**
   - Assign policy initiative to subscription
   - Set exemptions if needed

4. **Configure Action Group**
   - Add LZ email/webhook to action group
   - Test notification

5. **Validate**
   - Run test query
   - Trigger test alert
   - Confirm notification received

### Rollback

1. Remove DCR association
2. Revoke RBAC
3. Remove policy assignment

9.2 Monthly Operations Checklist

## Runbook: Monthly Operations Review

### Week 1: Access Review
- [ ] Generate RBAC report
- [ ] Review with team leads
- [ ] Revoke stale access
- [ ] Document approvals

### Week 2: Cost Analysis
- [ ] Run cost queries
- [ ] Identify optimization opportunities
- [ ] Create tickets for savings initiatives
- [ ] Update forecast

### Week 3: Capacity Planning
- [ ] Review ingestion trends
- [ ] Check commitment tier utilization
- [ ] Plan for new workloads
- [ ] Update capacity model

### Week 4: Alert Review
- [ ] Review false positive rate
- [ ] Tune noisy alerts
- [ ] Add missing coverage
- [ ] Update runbooks

9.3 Incident Response Template

## Incident Response: [INCIDENT-ID]

### Incident Details
- **Severity**: Sev [0/1/2/3]
- **Start Time**: YYYY-MM-DD HH:MM UTC
- **Impact**: [Description]
- **Affected Resources**: [List]

### Timeline
| Time | Action | By |
|------|--------|----|
| HH:MM | Alert received | System |
| HH:MM | Acknowledged | [Name] |
| HH:MM | Investigation started | [Name] |
| HH:MM | Root cause identified | [Name] |
| HH:MM | Fix applied | [Name] |
| HH:MM | Validated | [Name] |
| HH:MM | Closed | [Name] |

### Root Cause
[Description of root cause]

### Resolution
[Steps taken to resolve]

### Lessons Learned
- [ ] Update monitoring
- [ ] Update runbook
- [ ] Training needed
- [ ] Process improvement

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action] | [Name] | YYYY-MM-DD |

Repository References

Existing Module	Purpose	Path
Action Groups	Alert notification setup	modules/action-groups.bicep
Log Alerts	Alert rule definitions	modules/log-alerts.bicep
LAW	Workspace configuration	modules/log-analytic-workspace.bicep
Metric Alerts	Metric-based alerts	modules/metric-alerts.bicep
Storage Alerts	Storage account alerts	modules/storage-alerts.bicep
Diagnostic Settings	Storage logging	modules/storage-account-diagnostic-settings.bicep
Scheduled Query Rules	KQL-based alerts	modules/scheduled-query-rules.bicep

Quick Reference Commands

# Azure Monitor Agent
az vm extension show --vm-name <vm> --resource-group <rg> --name AzureMonitorWindowsAgent

# DCR
az monitor data-collection rule show --name <dcr-name> --resource-group <rg>

# Alert Rules
az monitor scheduled-query list --resource-group <rg>

# Action Groups
az monitor action-group list --resource-group <rg>

# LAW
az monitor log-analytics workspace show --workspace-name <law> --resource-group <rg>

Next Steps

File 3: 03-advanced-topics.md - Future roadmap: DR, audit logs, cost optimization deep-dive

Document Control
Maintained by: Central Platform Team
Review Cycle: Monthly
Classification: Internal
