
Unified Monitoring Solution - Operations Runbook

Version: 1.0
Last Updated: January 2026 (Workshop Edition)
Purpose: Day-2 operations, alerting procedures, KQL queries, and troubleshooting guides

Feature Availability Legend: ✅ GA | ⚠️ Preview | 📍 Planned


📖 How to Use This Document

This Operations Runbook provides the "How" (practical implementation, KQL queries, troubleshooting).
The Architecture document provides the "What & Why" (theory, design decisions, patterns).

Practice ↔ Theory Cross-Reference

| Operations Topic | Architecture Counterpart | Description |
| --- | --- | --- |
| 1. Operations Overview | 3. Federated Model | Day-2 responsibilities ← Governance model |
| 2. Alert Response | 5. Landing Zone Alerting | Alert response ← Alert design |
| 3. KQL Query Library | - | Ready-to-use queries (no architecture counterpart) |
| 4. Federated Visibility | 3.4 Federated Visibility | Cross-subscription visibility ← Federated architecture |
| 5. Maintenance Windows | 6. AMBA | Suppression ← Baseline alerts |
| 6. Troubleshooting Guide | 4. Core Components | Troubleshooting ← Component design |
| 7. RBAC Operations | 8. Security & Access | Access management ← RBAC design |
| 8. Cost Monitoring | Advanced Topics: Cost | Quick reference → Deep-dive in Advanced |

Table of Contents

  1. Operations Overview
  2. Alert Response Procedures
  3. KQL Query Library
  4. Federated Visibility Operations
  5. Maintenance Window Management
  6. Troubleshooting Guide
  7. RBAC Operations
  8. Cost Monitoring & Optimization
  9. Runbook Templates

1. Operations Overview

1.1 Operations Model

The Unified Monitoring Solution follows a shared responsibility model aligned with the federated architecture. Section 1.2 assigns each activity to the teams involved.

1.2 Responsibility Matrix (RACI)

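| Activity | Central Team | Landing Zone | Security | FinOps |
| --- | --- | --- | --- | --- |
| LAW Infrastructure | R/A | I | C | I |
| DCR Baseline | R/A | C | C | I |
| Custom DCRs | C | R/A | C | I |
| AMBA Policies | R/A | I | C | I |
| Custom Alerts | C | R/A | I | I |
| Incident Response | C | R/A | C | I |
| Cost Optimization | C | C | I | R/A |
| Access Reviews | C | R | A | I |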

Legend: R = Responsible, A = Accountable, C = Consulted, I = Informed

1.3 Operational Cadence


2. Alert Response Procedures

2.1 Alert Severity Classification

| Severity | Response Time | Escalation Path | Examples |
| --- | --- | --- | --- |
| Sev 0 - Critical | 15 min | On-call → Manager → Director | Service outage, Data loss |
| Sev 1 - High | 1 hour | On-call → Team Lead | Performance degradation, Availability < 99% |
| Sev 2 - Medium | 4 hours | Team queue | Threshold warnings, Capacity alerts |
| Sev 3 - Low | 24 hours | Backlog | Informational, Optimization opportunities |
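
The response targets in the table can be turned into a deadline check for on-call tooling. The following is a minimal sketch, not part of the solution itself: the function names `response_deadline` and `is_breached` are hypothetical, and it assumes the clock starts when the alert fires.

```python
from datetime import datetime, timedelta, timezone

# Response-time targets copied from the severity table above.
RESPONSE_TARGETS = {
    0: timedelta(minutes=15),
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(hours=24),
}


def response_deadline(severity: int, fired_at: datetime) -> datetime:
    """Return the latest time by which the alert should be picked up."""
    return fired_at + RESPONSE_TARGETS[severity]


def is_breached(severity: int, fired_at: datetime, now: datetime) -> bool:
    """True when the response target for this severity has already passed."""
    return now > response_deadline(severity, fired_at)


# Example: a Sev 1 alert fired at 12:00 UTC must be picked up by 13:00 UTC.
example_fired = datetime(2026, 1, 10, 12, 0, tzinfo=timezone.utc)
example_deadline = response_deadline(1, example_fired)
```

An unknown severity raises `KeyError`, which is deliberate here so that a misconfigured rule is noticed rather than silently given a default target.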

2.2 Alert Response Workflow

2.3 Common Alert Response Actions

VM Availability Alert

# Check VM status
az vm get-instance-view --name <vm-name> --resource-group <rg-name> --query "instanceView.statuses[1].displayStatus"

# Check recent boot diagnostics
az vm boot-diagnostics get-boot-log --name <vm-name> --resource-group <rg-name>

# Restart VM if needed
az vm restart --name <vm-name> --resource-group <rg-name>

High CPU Alert

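# Retrieve host-level CPU metrics for the last hour (PowerShell syntax).
# Per-process detail requires guest-level data, for example the Perf table.
$start = (Get-Date).ToUniversalTime().AddHours(-1).ToString("yyyy-MM-ddTHH:mm:ssZ")
$end = (Get-Date).ToUniversalTime().ToString("yyyy-MM-ddTHH:mm:ssZ")
az monitor metrics list --resource "<vm-resource-id>" `
--metric "Percentage CPU" `
--start-time $start `
--end-time $end `
--interval PT5M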

Log Analytics Ingestion Alert

# Show workspace configuration and provisioning state (this command does not report latency)
az monitor log-analytics workspace show --workspace-name <law-name> --resource-group <rg-name>

# Query ingestion health
# Use KQL from Section 3.4

2.4 Actionable Alerts - Including Remediation Steps

Best Practice: "Make every notification actionable - alert contains sufficient information to act"

Embedding Remediation Guidance in Alerts

Use alert rule descriptions and custom properties to include remediation steps directly in notifications:

resource actionableAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
name: 'alert-vm-cpu-critical'
location: location
properties: {
displayName: 'VM CPU Critical - Actionable'
description: '''
## Alert: VM CPU Critical
**Impact**: Application performance degradation

### Immediate Actions:
1. Check process consuming CPU: `Get-Process | Sort-Object CPU -Descending | Select -First 10`
2. Review recent deployments in the last 24 hours
3. Check for runaway processes or memory leaks

### Escalation:
- If not resolved in 15 min → Escalate to App Team Lead
- If widespread → Engage Platform Team

### Runbook: [VM CPU Troubleshooting](https://wiki.contoso.com/runbooks/vm-cpu)
'''
severity: 1
enabled: true
evaluationFrequency: 'PT5M'
scopes: [lawId]
windowSize: 'PT15M'
criteria: {
allOf: [
{
query: '''
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
// Average all samples per 5-minute bin, then apply the threshold
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)
| where AvgCPU > 95
'''
timeAggregation: 'Count'
operator: 'GreaterThan'
threshold: 0
}
]
}
actions: {
actionGroups: [actionGroupId]
customProperties: {
RunbookURL: 'https://wiki.contoso.com/runbooks/vm-cpu'
EscalationPath: 'AppTeam → PlatformTeam → OnCall'
ExpectedRTO: '15 minutes'
Severity: 'P1'
}
}
}
}

Alert Template with Remediation Context

| Alert Field | Purpose | Example |
| --- | --- | --- |
| displayName | Clear, actionable title | "VM CPU Critical - Scale Up Required" |
| description | Step-by-step remediation | Markdown with numbered steps |
| customProperties.RunbookURL | Link to detailed runbook | Wiki/Confluence link |
| customProperties.EscalationPath | Who to contact if unresolved | "L1 → L2 → Manager" |
| customProperties.ExpectedRTO | Resolution timeframe | "15 minutes" |
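
A pre-deployment check can flag alert rules that lack the fields listed in the table. This is a rough sketch under stated assumptions: the rule is available as a parsed dictionary shaped like the Bicep resource above, and the helper name `missing_remediation_fields` is hypothetical.

```python
# Custom properties that the template table above expects on every rule.
REQUIRED_CUSTOM_PROPERTIES = ("RunbookURL", "EscalationPath", "ExpectedRTO")


def missing_remediation_fields(rule: dict) -> list[str]:
    """List the remediation fields that a rule definition leaves empty or omits."""
    props = rule.get("properties", {})
    missing = [name for name in ("displayName", "description") if not props.get(name)]
    custom = props.get("actions", {}).get("customProperties", {})
    missing += [
        f"customProperties.{key}"
        for key in REQUIRED_CUSTOM_PROPERTIES
        if not custom.get(key)
    ]
    return missing


# Example: a rule that only links a runbook is missing two custom properties.
example_rule = {
    "properties": {
        "displayName": "VM CPU Critical - Actionable",
        "description": "Immediate actions and escalation steps",
        "actions": {
            "customProperties": {"RunbookURL": "https://wiki.contoso.com/runbooks/vm-cpu"}
        },
    }
}
example_missing = missing_remediation_fields(example_rule)
```

Running a check like this in a pipeline keeps the "every notification is actionable" practice enforceable rather than aspirational.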

Action Group with Rich Notifications

resource richActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
name: 'ag-rich-notifications'
location: 'global'
properties: {
groupShortName: 'RichNotify'
enabled: true
emailReceivers: [
{
name: 'ops-team'
emailAddress: 'ops@contoso.com'
useCommonAlertSchema: true // Enables rich formatting
}
]
webhookReceivers: [
{
name: 'teams-webhook'
serviceUri: teamsWebhookUrl
useCommonAlertSchema: true
useAadAuth: false
}
]
logicAppReceivers: [
{
name: 'enrichment-logic-app'
resourceId: enrichmentLogicAppId
callbackUrl: enrichmentLogicAppCallbackUrl
useCommonAlertSchema: true
}
]
}
}

2.5 Alert Suppression During Maintenance

Use Alert Processing Rules to suppress alerts during planned maintenance:

// Removes all action groups from alerts in the target resource group for the duration of the maintenance window
resource maintenanceSuppressionRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
name: 'suppress-${maintenanceWindow.name}'
location: 'Global'
properties: {
scopes: [
resourceGroup().id
]
conditions: [
{
field: 'TargetResourceGroup'
operator: 'Equals'
values: [maintenanceWindow.targetResourceGroup]
}
]
schedule: {
effectiveFrom: maintenanceWindow.startTime
effectiveUntil: maintenanceWindow.endTime
timeZone: 'UTC'
}
actions: [
{
actionType: 'RemoveAllActionGroups'
}
]
enabled: true
}
}

2.6 Alert Suppression & Correlation

Common Challenge: "Avoiding alert storms, correlating related alerts"

Alert Storm Prevention

| Technique | Description | Implementation |
| --- | --- | --- |
| Aggregation | Group similar alerts into single notification | Use muteActionsDuration in alert rules |
| Throttling | Limit notification frequency | Set autoMitigate: false + longer evaluation window |
| Dependency-Based | Suppress child alerts when parent is down | Use Alert Processing Rules with conditions |
| Smart Grouping | Group by resource, severity, or alert type | Configure in Action Group settings |

Alert Aggregation Example

resource aggregatedAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
name: 'alert-multiple-vm-cpu'
location: location
properties: {
displayName: 'Multiple VMs High CPU'
severity: 2
enabled: true
evaluationFrequency: 'PT5M'
windowSize: 'PT15M'
scopes: [lawId]
criteria: {
allOf: [
{
query: '''
Perf
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where CounterValue > 90
| summarize AvgCPU = avg(CounterValue), AffectedVMs = dcount(Computer) by bin(TimeGenerated, 5m)
| where AffectedVMs >= 3 // Only alert if 3+ VMs affected
'''
timeAggregation: 'Count'
operator: 'GreaterThan'
threshold: 0
}
]
}
autoMitigate: false // Muting cannot be combined with automatic resolution
muteActionsDuration: 'PT30M' // Mute for 30 min after firing
actions: {
actionGroups: [actionGroupId]
}
}
}
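
To reason about how a mute duration such as the 'PT30M' above reduces notification volume, the behaviour can be modelled offline. This is an illustrative sketch only: the function name `notifications_sent` is hypothetical, and the assumption that a firing exactly at the end of the mute window notifies again is a modelling choice, not documented service behaviour.

```python
from datetime import datetime, timedelta


def notifications_sent(
    fire_times: list[datetime], mute: timedelta = timedelta(minutes=30)
) -> list[datetime]:
    """Return the firing times that would notify, skipping those inside a mute window."""
    sent: list[datetime] = []
    muted_until = None
    for fired in sorted(fire_times):
        if muted_until is None or fired >= muted_until:
            sent.append(fired)
            muted_until = fired + mute
    return sent


# Example: five evaluations fire within 35 minutes, but only two notify.
base = datetime(2026, 1, 10, 12, 0)
example_fires = [base + timedelta(minutes=m) for m in (0, 5, 10, 30, 35)]
example_sent = notifications_sent(example_fires)
```

Replaying a past incident's firing times through a model like this gives a quick estimate of how many pages a given mute duration would have saved.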

Dependency-Based Suppression

Suppress child resource alerts when parent is unhealthy:

// Suppress VM alerts when host is down
resource dependencySuppression 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
name: 'suppress-vm-when-host-down'
location: 'global'
properties: {
scopes: [subscription().id]
conditions: [
{
field: 'AlertRuleName'
operator: 'Contains'
values: ['vm-'] // All VM alerts
}
]
actions: [
{
actionType: 'RemoveAllActionGroups'
}
]
// Only active when host alert is firing (managed via automation)
enabled: false // Toggled by Logic App when host alert fires
}
}

Alert Correlation Query

Identify related alerts for root cause analysis:

// Find correlated alerts within time window
let timeWindow = 15m;
let primaryAlert = "alert-network-latency";
Alert
| where TimeGenerated > ago(1h)
| where AlertName == primaryAlert
| project PrimaryTime = TimeGenerated, PrimaryResource = ResourceId
| join kind=inner (
Alert
| where TimeGenerated > ago(1h)
| project AlertName, AlertTime = TimeGenerated, Resource = ResourceId, AlertSeverity
) on $left.PrimaryResource == $right.Resource
| where AlertTime between (PrimaryTime - timeWindow .. PrimaryTime + timeWindow)
| where AlertName != primaryAlert
| summarize CorrelatedAlerts = make_set(AlertName), Count = count() by PrimaryResource
| order by Count desc
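
The same correlation logic can be prototyped on exported alert records before committing it to a workbook. The sketch below assumes each record is a dictionary with hypothetical keys `name`, `time` and `resource`; it mirrors the join-and-window approach of the query rather than any built-in correlation feature.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(
    alerts: list[dict], primary_name: str, window: timedelta = timedelta(minutes=15)
) -> dict[str, list[str]]:
    """Map each resource to the other alerts that fired within +/- window of the primary alert."""
    primaries = [a for a in alerts if a["name"] == primary_name]
    related: dict[str, set] = defaultdict(set)
    for primary in primaries:
        for alert in alerts:
            if alert["name"] == primary_name or alert["resource"] != primary["resource"]:
                continue
            if abs(alert["time"] - primary["time"]) <= window:
                related[primary["resource"]].add(alert["name"])
    return {resource: sorted(names) for resource, names in related.items()}


# Example: only the CPU alert on vm1 falls inside the 15-minute window.
t0 = datetime(2026, 1, 10, 12, 0)
example_alerts = [
    {"name": "alert-network-latency", "time": t0, "resource": "vm1"},
    {"name": "alert-vm-cpu", "time": t0 + timedelta(minutes=10), "resource": "vm1"},
    {"name": "alert-disk", "time": t0 + timedelta(minutes=20), "resource": "vm1"},
    {"name": "alert-vm-cpu", "time": t0, "resource": "vm2"},
]
example_result = correlate(example_alerts, "alert-network-latency")
```

Matching on the exact resource ID is the strictest choice; correlating by resource group or subscription would surface more candidates at the cost of more noise.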

Smart Grouping Configuration

| Grouping Strategy | Use Case | Configuration |
| --- | --- | --- |
| By Resource | One notification per affected resource | groupByFields: ['resourceId'] |
| By Alert Type | One notification per alert rule | groupByFields: ['alertRule'] |
| By Severity | Critical alerts separate from warnings | groupByFields: ['severity'] |
| Combined | Sophisticated grouping | groupByFields: ['resourceGroup', 'severity'] |

3. KQL Query Library

3.1 Application Insights Alert Patterns

Reference: Production patterns from enterprise implementations (Fortune 500 deployments)

These alert patterns are battle-tested in production for Function Apps and web applications.

HTTP 500 Internal Server Error Alert

// Pattern: Alert on any HTTP 500 errors
resource http500Alert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'alert-http-500-${serviceName}'
location: 'global'
properties: {
description: '${serviceName} Internal server error, http code 500'
severity: 2
enabled: true
scopes: [appInsightsId]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
autoMitigate: true
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
criterionType: 'StaticThresholdCriterion'
name: 'http500'
metricNamespace: 'microsoft.insights/components'
metricName: 'requests/count'
operator: 'GreaterThan'
threshold: 0
timeAggregation: 'Count'
dimensions: [
{
name: 'request/resultCode'
operator: 'Include'
values: ['500']
}
]
}
]
}
actions: [
{
actionGroupId: actionGroupId
}
]
}
}

Dependency Throttling (HTTP 429) Alert

// Pattern: Alert on dependency throttling
resource throttlingAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'alert-dependency-throttling-${serviceName}'
location: 'global'
properties: {
description: '${serviceName} dependency call returned http code 429 (throttled)'
severity: 2
enabled: true
scopes: [appInsightsId]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
autoMitigate: true
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
criterionType: 'StaticThresholdCriterion'
name: 'throttling'
metricNamespace: 'microsoft.insights/components'
metricName: 'dependencies/failed'
operator: 'GreaterThan'
threshold: 0
timeAggregation: 'Count'
dimensions: [
{
name: 'dependency/type'
operator: 'Exclude'
values: ['INPROC'] // Exclude in-process calls
}
{
name: 'dependency/resultCode'
operator: 'Include'
values: ['429']
}
]
}
]
}
actions: [
{
actionGroupId: actionGroupId
}
]
}
}

Server Response Time Exceeded Alert

// Pattern: Alert when response time exceeds threshold
resource responseTimeAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'alert-response-time-${serviceName}'
location: 'global'
properties: {
description: '${serviceName} server response time exceeded 20 seconds'
severity: 2
enabled: true
scopes: [appInsightsId]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
autoMitigate: true
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
criterionType: 'StaticThresholdCriterion'
name: 'responseTime'
metricNamespace: 'microsoft.insights/components'
metricName: 'requests/duration'
operator: 'GreaterThan'
threshold: 20000 // 20 seconds in milliseconds
timeAggregation: 'Maximum'
}
]
}
actions: [
{
actionGroupId: actionGroupId
}
]
}
}

Server Exceptions Alert

// Pattern: Alert on unhandled exceptions
resource exceptionsAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'alert-exceptions-${serviceName}'
location: 'global'
properties: {
description: '${serviceName} server exceptions detected'
severity: 2
enabled: true
scopes: [appInsightsId]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
autoMitigate: true
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
criterionType: 'StaticThresholdCriterion'
name: 'exceptions'
metricNamespace: 'microsoft.insights/components'
metricName: 'exceptions/server'
operator: 'GreaterThan'
threshold: 0
timeAggregation: 'Count'
}
]
}
actions: [
{
actionGroupId: actionGroupId
}
]
}
}

Dead Lettering Alert (EventGrid)

// Pattern: Alert on EventGrid dead-lettered messages
resource deadLetterAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'alert-deadletter-${systemTopicName}'
location: 'global'
properties: {
description: 'Dead Lettering Events detected on ${systemTopicName}'
severity: 2
enabled: true
scopes: [eventGridSystemTopicId]
evaluationFrequency: 'PT1M'
windowSize: 'PT5M'
autoMitigate: false // Don't auto-mitigate - needs investigation
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
allOf: [
{
criterionType: 'StaticThresholdCriterion'
name: 'deadLetter'
metricNamespace: 'Microsoft.EventGrid/systemTopics'
metricName: 'DeadLetteredCount'
operator: 'GreaterThan'
threshold: 0
timeAggregation: 'Total'
}
]
}
actions: [
{
actionGroupId: actionGroupId
}
]
}
}

Alert Patterns Summary Table

| Alert Pattern | Metric | Threshold | AutoMitigate | Severity |
| --- | --- | --- | --- | --- |
| HTTP 500 Errors | requests/count (resultCode=500) | > 0 | Yes | 2 |
| Dependency Throttling (429) | dependencies/failed | > 0 | Yes | 2 |
| Response Time | requests/duration | > 20000 ms | Yes | 2 |
| Server Exceptions | exceptions/server | > 0 | Yes | 2 |
| Dead Lettering | DeadLetteredCount | > 0 | No | 2 |
| Storage Timeout | Transactions (ServerTimeoutError) | > 5 | No | 2 |

3.2 Resource Health Queries

All Resource Health Events (Last 24h)

AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue == "ResourceHealth"
| project TimeGenerated, ResourceGroup, Resource, Level,
Status = Properties_d.statusCode,
Message = Properties_d.statusMessage
| order by TimeGenerated desc

VM Availability Summary

Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer, ResourceGroup
| extend Status = iff(LastHeartbeat < ago(5m), "Offline", "Online")
| summarize
TotalVMs = count(),
OnlineVMs = countif(Status == "Online"),
OfflineVMs = countif(Status == "Offline")
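
The classification step of this query is easy to unit test outside Log Analytics. A minimal sketch, assuming a hypothetical mapping of computer name to last heartbeat time and the same 5-minute offline threshold as the query:

```python
from datetime import datetime, timedelta


def availability_summary(
    last_heartbeats: dict[str, datetime],
    now: datetime,
    offline_after: timedelta = timedelta(minutes=5),
) -> dict[str, int]:
    """Count online and offline machines from a {computer: last_heartbeat} mapping."""
    cutoff = now - offline_after
    offline = sum(1 for seen in last_heartbeats.values() if seen < cutoff)
    total = len(last_heartbeats)
    return {"TotalVMs": total, "OnlineVMs": total - offline, "OfflineVMs": offline}


# Example: one of three machines has been silent for more than 5 minutes.
example_now = datetime(2026, 1, 10, 12, 0)
example_summary = availability_summary(
    {
        "vm1": example_now - timedelta(minutes=1),
        "vm2": example_now - timedelta(minutes=10),
        "vm3": example_now - timedelta(minutes=5),
    },
    example_now,
)
```

As in the query, machines with no heartbeat inside the lookback window never enter the mapping, so they are absent from the totals rather than counted as offline.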

Resource Availability by Type

AzureMetrics
| where TimeGenerated > ago(24h)
| where MetricName == "Availability"
| summarize
AvgAvailability = avg(Average),
MinAvailability = min(Minimum)
by ResourceProvider, Resource, bin(TimeGenerated, 1h)
| where AvgAvailability < 100
| order by AvgAvailability asc

3.2 Performance Queries

Top 10 CPU Consumers

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize AvgCPU = avg(CounterValue), MaxCPU = max(CounterValue) by Computer
| top 10 by AvgCPU desc

Memory Pressure Analysis

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "Memory"
| where CounterName in ("% Committed Bytes In Use", "Available MBytes")
| summarize Value = avg(CounterValue) by Computer, CounterName, bin(TimeGenerated, 5m)
| evaluate pivot(CounterName, take_any(Value))
| project TimeGenerated, Computer,
MemoryUsedPct = ['% Committed Bytes In Use'],
AvailableMB = ['Available MBytes']
| where MemoryUsedPct > 80

Disk I/O Performance

Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "LogicalDisk"
| where CounterName in ("Disk Reads/sec", "Disk Writes/sec", "Avg. Disk sec/Read", "Avg. Disk sec/Write")
| where InstanceName != "_Total"
| summarize Value = avg(CounterValue) by Computer, InstanceName, CounterName, bin(TimeGenerated, 5m)
| evaluate pivot(CounterName, take_any(Value))

3.3 Security Queries

Failed Sign-in Attempts

SigninLogs
| where TimeGenerated > ago(24h)
| where ResultType != "0" // Non-successful
| summarize
FailedAttempts = count(),
DistinctIPs = dcount(IPAddress),
LastAttempt = max(TimeGenerated)
by UserPrincipalName, ResultDescription
| where FailedAttempts > 5
| order by FailedAttempts desc

Security Alert Summary

SecurityAlert
| where TimeGenerated > ago(7d)
| summarize
AlertCount = count(),
Providers = make_set(ProviderName)
by AlertSeverity, AlertName
| order by case(
AlertSeverity == "High", 1,
AlertSeverity == "Medium", 2,
AlertSeverity == "Low", 3,
4
) asc // 'order by' defaults to descending; 'asc' lists High first

Privileged Operations

AzureActivity
| where TimeGenerated > ago(24h)
| where tostring(Authorization_d.action) has_any ("Microsoft.Authorization/roleAssignments", "Microsoft.Authorization/roleDefinitions")
| project TimeGenerated, Caller, OperationName,
Action = Authorization_d.action,
ResourceGroup, Resource
| order by TimeGenerated desc

3.4 Ingestion Health Queries

Data Ingestion Volume by Table

Usage
| where TimeGenerated > ago(1d)
| summarize
IngestionGB = sum(Quantity) / 1024,
BillableGB = sumif(Quantity, IsBillable) / 1024
by DataType
| order by IngestionGB desc
| take 20

Ingestion Latency Analysis

// Measure the delay between event generation (TimeGenerated) and ingestion (ingestion_time())
AzureActivity
| where TimeGenerated > ago(1h)
| extend IngestionDelay = ingestion_time() - TimeGenerated
| summarize
AvgLatency = avg(IngestionDelay),
P95Latency = percentile(IngestionDelay, 95),
MaxLatency = max(IngestionDelay)
by bin(TimeGenerated, 5m)

Missing Data Detection

// Detect gaps in heartbeat data
Heartbeat
| where TimeGenerated > ago(24h)
| summarize HeartbeatCount = count() by Computer, bin(TimeGenerated, 15m)
| where HeartbeatCount < 3 // Agents report about once a minute (~15 per 15 min); fewer than 3 indicates a significant gap
| order by TimeGenerated desc
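
One limitation of counting rows per bin is that a bin with no heartbeats at all produces no row, so a complete outage can go unreported. The sketch below shows the same gap detection with empty bins included; the function name `find_gaps` and the input shape (a list of heartbeat timestamps for one computer) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(
    heartbeats: list[datetime],
    start: datetime,
    end: datetime,
    bin_size: timedelta = timedelta(minutes=15),
    min_count: int = 3,
) -> list[datetime]:
    """Return the start time of every bin with fewer than min_count heartbeats, empty bins included."""
    n_bins = (end - start) // bin_size
    counts = [0] * n_bins
    for beat in heartbeats:
        if start <= beat < end:
            index = (beat - start) // bin_size
            if index < n_bins:
                counts[index] += 1
    return [start + i * bin_size for i, count in enumerate(counts) if count < min_count]


# Example: heartbeats stop between 12:15 and 12:45, leaving two empty bins.
example_start = datetime(2026, 1, 10, 12, 0)
example_end = example_start + timedelta(hours=1)
example_beats = [
    example_start + timedelta(minutes=m) for m in list(range(0, 15)) + list(range(45, 60))
]
example_gaps = find_gaps(example_beats, example_start, example_end)
```

In KQL, a comparable effect is usually achieved by generating the expected time bins and left-joining the heartbeat counts onto them.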

3.5 Cost Analysis Queries

Daily Ingestion Cost Estimate

// Note: Pricing varies by region. Check https://azure.microsoft.com/pricing/details/monitor/
let pricePerGB = 2.30; // Example: West Europe Pay-As-You-Go (verify current pricing)
Usage
| where TimeGenerated > startofday(ago(30d))
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1d)
| extend EstimatedCost = DailyGB * pricePerGB
| order by TimeGenerated desc
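
The arithmetic behind the estimate is simple enough to check by hand. The following sketch reproduces it, assuming the Usage table's Quantity is in MB (as the division by 1024 in the query implies) and using the query's example rate of 2.30 per GB, which is a placeholder rather than a quoted price. The function name `estimate_daily_cost` is hypothetical.

```python
def estimate_daily_cost(
    quantity_mb_by_day: dict[str, float], price_per_gb: float = 2.30
) -> dict[str, float]:
    """Convert billable MB per day to GB (dividing by 1024, as the query does) and price it."""
    return {
        day: round(mb / 1024 * price_per_gb, 2)
        for day, mb in quantity_mb_by_day.items()
    }


# Example: 10 GB on one day and 0.5 GB on the next, at the example rate.
example_costs = estimate_daily_cost({"2026-01-09": 10240, "2026-01-10": 512})
```

Keeping the rate as a single parameter makes it harder for different queries and reports to drift apart on the assumed price.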

Top Data Contributors

Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024 by DataType
| order by TotalGB desc
| take 10
| extend CostEstimate = TotalGB * 2.30 // Example rate, matching pricePerGB in the daily estimate above; verify current pricing

Retention vs Ingestion Analysis

Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024 by DataType
| join kind=inner (
workspace('LAW-Name').Usage
| where TimeGenerated > ago(30d)
| where IsBillable == false
| summarize RetainedGB = sum(Quantity) / 1024 by DataType
) on DataType
| project DataType, TotalGB, RetainedGB,
RetentionRatio = RetainedGB / TotalGB

3.6 Alert Queries (For Custom Alerts)

Service Health Impact Assessment

ServiceHealth
| where TimeGenerated > ago(7d)
| where Status == "Active"
| project TimeGenerated, Service, Region,
ImpactType, Title, Summary,
AffectedResources = Properties_d.affectedResources
| order by TimeGenerated desc

Resource Creation/Deletion Tracking

AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue has_any ("Microsoft.Resources/deployments", "delete")
| where ActivityStatusValue == "Success"
| project TimeGenerated, Caller, OperationName,
ResourceGroup, Resource, SubscriptionId
| order by TimeGenerated desc

3.7 Data Enrichment Logic

Blueprint Requirement: "Define enrichment fields including business context, ownership, criticality. Implement enrichment logic via lookup tables, functions. Automate tagging with tag inheritance and policy-based tagging."

3.7.1 Business Context Enrichment Overview

Data enrichment adds business context to raw telemetry data, enabling better filtering, routing, and analysis.

3.7.2 Resource Ownership Lookup Table

Create a lookup table for resource-to-owner mapping:

// Create externaldata lookup for resource ownership
let ResourceOwners = externaldata(
ResourceId: string,
SubscriptionName: string,
LandingZone: string,
CostCenter: string,
BusinessUnit: string,
ApplicationName: string,
Owner: string,
OwnerEmail: string,
Criticality: string,
Environment: string
) [
h@"https://yourstorageaccount.blob.core.windows.net/lookups/resource-owners.csv"
] with (format="csv", ignoreFirstRecord=true);

// Use in queries
AzureMetrics
| extend ResourceIdLower = tolower(_ResourceId)
| join kind=leftouter (
ResourceOwners | extend ResourceIdLower = tolower(ResourceId)
) on ResourceIdLower
| project TimeGenerated, Resource, MetricName, Average,
LandingZone, BusinessUnit, Owner, Criticality

3.7.3 Criticality Classification Matrix

| Criticality Level | Definition | Alert Routing | Response SLA |
| --- | --- | --- | --- |
| P1 - Critical | Revenue-impacting, customer-facing, no redundancy | ServiceNow P1 + Phone | 15 min |
| P2 - High | Business-critical internal, limited redundancy | ServiceNow P2 + Teams | 30 min |
| P3 - Medium | Important but redundant, degraded performance OK | Email + Teams | 4 hours |
| P4 - Low | Development, test, non-critical | Dashboard only | Next business day |

// Criticality lookup function
let GetCriticality = (resourceId: string) {
    // case() is evaluated per row and returns the first matching branch,
    // so the function can be called inside 'extend'. 'contains' is case-insensitive.
    case(
        resourceId contains "prd" or resourceId contains "prod", "P1-Critical",
        resourceId contains "stg" or resourceId contains "staging", "P2-High",
        resourceId contains "uat", "P3-Medium",
        resourceId contains "dev" or resourceId contains "test" or resourceId contains "sandbox", "P4-Low",
        "" // No pattern matched
    )
};
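
The name patterns used above can also be applied outside KQL, for example when generating the ownership lookup CSV. This is a hedged sketch: the function name `get_criticality` and the "Unclassified" default are assumptions, and the first-match-wins ordering is a design choice made explicit here.

```python
# Ordered (pattern, criticality) pairs; the first match wins.
CRITICALITY_PATTERNS = [
    ("prd", "P1-Critical"),
    ("prod", "P1-Critical"),
    ("stg", "P2-High"),
    ("staging", "P2-High"),
    ("uat", "P3-Medium"),
    ("dev", "P4-Low"),
    ("test", "P4-Low"),
    ("sandbox", "P4-Low"),
]


def get_criticality(resource_id: str, default: str = "Unclassified") -> str:
    """Return the criticality of the first pattern found in the resource ID."""
    lowered = resource_id.lower()
    for pattern, criticality in CRITICALITY_PATTERNS:
        if pattern in lowered:
            return criticality
    return default


# Example: a production resource group maps to P1-Critical.
example_criticality = get_criticality(
    "/subscriptions/0000/resourceGroups/rg-shop-prd/providers/Microsoft.Compute/virtualMachines/vm1"
)
```

Plain substring matching is fragile: a name such as "latest" contains "test" and "product" contains "prod". Matching on delimited segments (for example "-prd-" or a trailing "-prd") is a safer convention where naming standards allow it.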

3.7.4 Tag Inheritance via Azure Policy

Ensure resources inherit tags from their resource group for consistent enrichment:

// Azure Policy: Inherit tags from resource group
resource tagInheritancePolicy 'Microsoft.Authorization/policyDefinitions@2021-06-01' = {
name: 'inherit-tags-from-rg'
properties: {
policyType: 'Custom'
mode: 'Indexed'
displayName: 'Inherit tags from resource group'
parameters: {
tagName: {
type: 'String'
metadata: {
displayName: 'Tag Name'
description: 'Name of the tag to inherit'
}
}
}
policyRule: {
if: {
allOf: [
{
field: '[concat(\'tags[\', parameters(\'tagName\'), \']\')]'
exists: 'false'
}
{
value: '[resourceGroup().tags[parameters(\'tagName\')]]'
notEquals: ''
}
]
}
then: {
effect: 'modify'
details: {
roleDefinitionIds: [
'/providers/microsoft.authorization/roleDefinitions/b24988ac-6180-42a0-ab88-20f7382dd24c'
]
operations: [
{
operation: 'add'
field: '[concat(\'tags[\', parameters(\'tagName\'), \']\')]'
value: '[resourceGroup().tags[parameters(\'tagName\')]]'
}
]
}
}
}
}
}

3.7.5 Required Tags for Enrichment

| Tag Name | Purpose | Example Value | Mandatory |
| --- | --- | --- | --- |
| Environment | Environment classification | Production, Staging, Dev | ✅ Yes |
| CostCenter | Finance allocation | CC-12345 | ✅ Yes |
| Owner | Responsible team/person | platform-team@company.com | ✅ Yes |
| ApplicationName | Application identifier | data-mesh-api | ✅ Yes |
| Criticality | Business criticality | P1-Critical, P2-High | ✅ Yes |
| LandingZone | ALZ identifier | lz-aiml-prod | ✅ Yes |
| DataClassification | Data sensitivity | Confidential, Internal | ⚠️ If applicable |

3.7.6 DCR Transformation for Enrichment at Ingestion

// Add enrichment fields at ingestion time using DCR transformation.
// Columns added to a built-in table such as Perf must use the _CF suffix
// and must exist in the table schema before the DCR is deployed.
dataFlows: [
{
streams: ['Microsoft-Perf']
destinations: ['centralWorkspace']
transformKql: '''
source
| extend
// Extract environment from computer name convention
Environment_CF = case(
Computer startswith "prd-" or Computer contains "-prod-", "Production",
Computer startswith "stg-" or Computer contains "-staging-", "Staging",
Computer startswith "uat-" or Computer contains "-uat-", "UAT",
Computer startswith "dev-" or Computer contains "-dev-", "Development",
"Unknown"
),
// Derive criticality from environment
Criticality_CF = case(
Computer startswith "prd-" or Computer contains "-prod-", "P1-Critical",
Computer startswith "stg-" or Computer contains "-staging-", "P2-High",
Computer startswith "uat-" or Computer contains "-uat-", "P3-Medium",
"P4-Low"
),
// Extract landing zone from resource group naming
LandingZone_CF = extract(@"rg-([a-z]+)-", 1, _ResourceId)
'''
outputStream: 'Microsoft-Perf'
}
]

4. Federated Visibility Operations

This section covers operational procedures for the 60/40 federated visibility model where the platform team has visibility into both centralized and landing zone-dedicated workspaces.

📖 Architecture Reference: See 01-architecture-overview.md, Section 3.4 for detailed architecture diagrams.

4.1 Accessing Landing Zone Dashboards

Method 1: Shared Workbooks and Dashboards

The platform team can access LZ-dedicated workbooks and dashboards through shared access.

# List workbooks shared with the platform team
az monitor app-insights workbook list \
--resource-group "rg-monitoring-platform" \
--category "workbook" \
--source-id "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<law>"

Method 2: Cross-Workspace KQL Queries

Query data from LZ-dedicated workspaces without copying data:

// Query alerts across all Landing Zone workspaces
let lz1_alerts = workspace("law-lz-team1-prod").Alert;
let lz2_alerts = workspace("law-lz-team2-prod").Alert;
let lz3_alerts = workspace("law-lz-team3-prod").Alert;

union lz1_alerts, lz2_alerts, lz3_alerts
| where TimeGenerated > ago(24h)
| summarize AlertCount = count() by AlertName, bin(TimeGenerated, 1h)
| order by AlertCount desc

Method 3: Azure Resource Graph for Metadata

// List all Landing Zone LAWs and their properties
resources
| where type == "microsoft.operationalinsights/workspaces"
| where resourceGroup contains "lz-"
| project
name,
resourceGroup,
subscriptionId,
sku = properties.sku.name,
retentionInDays = properties.retentionInDays,
dailyQuotaGb = properties.workspaceCapping.dailyQuotaGb

4.2 Setting Up Dashboard Sharing

For Landing Zone Teams (Share Dashboards)

// Dashboard with shared access for Platform Team
resource sharedDashboard 'Microsoft.Portal/dashboards@2020-09-01-preview' = {
name: 'dash-lz-team1-shared'
location: location
properties: {
lenses: [
// Dashboard tiles
]
}
}

// Role assignment for Platform Team
resource dashboardAccess 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
name: guid(sharedDashboard.id, platformTeamGroupId, 'Reader')
scope: sharedDashboard
properties: {
principalId: platformTeamGroupId
roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'acdd72a7-3385-48ef-bd42-f606fba81ae7') // Reader
principalType: 'Group'
}
}

For Platform Team (Access Shared Dashboards)

| Access Method | RBAC Role | Scope |
| --- | --- | --- |
| View LZ Dashboard | Reader | Dashboard resource |
| Query LZ LAW | Log Analytics Reader | LAW resource |
| View LZ Workbooks | Workbook Reader | Workbook resource |

4.3 Monitoring Landing Zone Health (Without Ownership)

The platform team monitors LZ health through aggregated views, not individual alert management.

Health Overview Workbook Query

// LZ Health Score by Team
let resourceHealth =
AzureActivity
| where CategoryValue == "ResourceHealth"
// (?i) makes the match case-insensitive, because resource IDs are often stored in lower case
| extend LandingZone = tolower(extract(@"(?i)/subscriptions/[^/]+/resourceGroups/rg-([^-]+)-", 1, _ResourceId))
| summarize
HealthyCount = countif(ActivityStatusValue == "Succeeded"),
UnhealthyCount = countif(ActivityStatusValue != "Succeeded")
by LandingZone;

let alertCount =
// arg("") queries Azure Resource Graph from within a Log Analytics query
arg("").AlertsManagementResources
| where type == "microsoft.alertsmanagement/alerts"
| extend LandingZone = extract(@"/subscriptions/[^/]+/resourcegroups/rg-([^-]+)-", 1, tolower(tostring(properties.essentials.targetResource)))
| where tostring(properties.essentials.monitorCondition) == "Fired"
| summarize ActiveAlerts = count() by LandingZone;

resourceHealth
| join kind=leftouter alertCount on LandingZone
| extend HealthScore = round(100.0 * HealthyCount / (HealthyCount + UnhealthyCount), 1)
| project LandingZone, HealthScore, HealthyCount, UnhealthyCount, ActiveAlerts
| order by HealthScore asc
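
The score itself is a simple ratio, and one edge case is worth handling: a landing zone with no health events at all. The sketch below is illustrative, and the function name `health_score` is hypothetical. It returns `None` for that case instead of dividing by zero.

```python
from typing import Optional


def health_score(healthy: int, unhealthy: int) -> Optional[float]:
    """Percentage of healthy events, rounded to one decimal; None when there are no events."""
    total = healthy + unhealthy
    if total == 0:
        return None
    return round(100.0 * healthy / total, 1)


# Example: 9 healthy events out of 10 gives a score of 90.0.
example_score = health_score(9, 1)
```

Treating "no data" as distinct from "0% healthy" avoids ranking a quiet landing zone as the worst performer on the dashboard.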

4.4 Escalation from LZ Dashboards

When the platform team identifies issues through shared dashboards:

| Observation | Action | Owner |
| --- | --- | --- |
| High alert count on LZ dashboard | Contact LZ team owner | Platform Team |
| Resource health degraded | Check service health, notify LZ | Platform Team |
| Cost anomaly detected | Review with LZ team | Platform Team |
| Security event | Escalate to Security team | Platform Team |

5. Maintenance Window Management

5.1 Planned Maintenance Process

5.2 Maintenance Window Bicep Template

@description('Maintenance window configuration')
param maintenanceConfig object = {
name: 'scheduled-maintenance'
startTime: '2024-01-15T22:00:00' // No 'Z' suffix; the time zone is set by schedule.timeZone
endTime: '2024-01-16T02:00:00'
targetScopes: [
'/subscriptions/<sub-id>/resourceGroups/<rg-name>'
]
suppressSeverities: ['Sev2', 'Sev3', 'Sev4']
}

resource maintenanceRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
name: 'maint-${maintenanceConfig.name}'
location: 'Global'
properties: {
scopes: maintenanceConfig.targetScopes
conditions: [
{
field: 'Severity'
operator: 'Equals'
values: maintenanceConfig.suppressSeverities
}
]
schedule: {
effectiveFrom: maintenanceConfig.startTime
effectiveUntil: maintenanceConfig.endTime
timeZone: 'UTC'
}
actions: [
{
actionType: 'RemoveAllActionGroups'
}
]
enabled: true
description: 'Maintenance window: ${maintenanceConfig.name}'
}
}
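
Before a window opens, it can help to confirm which alerts the rule would actually silence. The following sketch models the template's three filters (scope, severity, schedule) in plain Python. The dictionary keys and the function name `is_suppressed` are hypothetical, and inclusive window boundaries are an assumption.

```python
from datetime import datetime


def _in_scope(resource_id: str, scope: str) -> bool:
    """True when the resource is the scope itself or sits beneath it (case-insensitive)."""
    rid, prefix = resource_id.lower(), scope.lower().rstrip("/")
    return rid == prefix or rid.startswith(prefix + "/")


def is_suppressed(alert: dict, window: dict) -> bool:
    """True when the alert matches the window's scope, severity list and schedule."""
    return (
        any(_in_scope(alert["resource_id"], scope) for scope in window["target_scopes"])
        and alert["severity"] in window["suppress_severities"]
        and window["start"] <= alert["fired_at"] <= window["end"]
    )


# Example: a Sev2 alert inside the window is suppressed, while a Sev0 alert is not.
example_window = {
    "target_scopes": ["/subscriptions/s1/resourceGroups/rg-app"],
    "suppress_severities": ["Sev2", "Sev3", "Sev4"],
    "start": datetime(2024, 1, 15, 22, 0),
    "end": datetime(2024, 1, 16, 2, 0),
}
example_alert = {
    "resource_id": "/subscriptions/s1/resourcegroups/rg-app/providers/Microsoft.Compute/virtualMachines/vm1",
    "severity": "Sev2",
    "fired_at": datetime(2024, 1, 15, 23, 0),
}
example_suppressed = is_suppressed(example_alert, example_window)
```

With the default severities in the template, Sev0 and Sev1 alerts still notify during maintenance, which is usually the intent for genuine outages.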

5.3 Emergency Maintenance (Ad-hoc)

For emergency maintenance windows:

# Create emergency suppression rule
$ruleName = "emergency-maint-$(Get-Date -Format 'yyyyMMddHHmm')"
$now = (Get-Date).ToUniversalTime()
$body = @{
    location = "Global"
    properties = @{
        scopes = @("/subscriptions/<sub-id>/resourceGroups/<target-rg>")
        schedule = @{
            effectiveFrom = $now.ToString("yyyy-MM-ddTHH:mm:ss")
            effectiveUntil = $now.AddHours(4).ToString("yyyy-MM-ddTHH:mm:ss")
            timeZone = "UTC"
        }
        actions = @(@{ actionType = "RemoveAllActionGroups" })
        enabled = $true
    }
}

# Write the request body to disk, then create the rule through the REST API
$body | ConvertTo-Json -Depth 5 | Set-Content -Path emergency-rule.json
az rest --method PUT --uri "/subscriptions/<sub-id>/resourceGroups/<rg-name>/providers/Microsoft.AlertsManagement/actionRules/${ruleName}?api-version=2023-05-01-preview" --body "@emergency-rule.json"

6. Troubleshooting Guide

6.1 Common Issues Decision Tree

6.2 Data Collection Troubleshooting

Check Azure Monitor Agent Status

# Windows VM - Check AMA service
Get-Service -Name 'AzureMonitorAgent' | Select-Object Name, Status, StartType

# Check AMA logs
Get-WinEvent -LogName 'Microsoft-AzureMonitor-Agent/Operational' -MaxEvents 50

# Linux VM - Check AMA service
systemctl status azuremonitoragent
journalctl -u azuremonitoragent --since "1 hour ago"

Verify DCR Assignment

# List DCR associations for a VM
az monitor data-collection rule association list --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name>"

Test DCE Connectivity

# From the VM, test DCE endpoint
$dceEndpoint = "<dce-name>.<region>.ingest.monitor.azure.com"
Test-NetConnection -ComputerName $dceEndpoint -Port 443

# Check if private endpoint is resolving correctly
Resolve-DnsName $dceEndpoint

6.3 Alert Troubleshooting

Check Alert Rule Status

# Get alert rule details
az monitor scheduled-query show --name "<alert-name>" --resource-group "<rg-name>"

# Check alert evaluation history
az monitor activity-log list --resource-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Insights/scheduledqueryrules/<alert-name>" --offset 24h

Test Alert Query Manually

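// Run the alert query directly in Log Analytics.
// If it returns rows, the alert condition is met and the alert should fire.
// If it returns nothing, check the threshold and the lookback window.

// Example: computers with no heartbeat in the last 5 minutes.
// The lookback (1h) must be longer than the threshold (5m); with a 5m lookback
// every remaining row is newer than the threshold and the query never returns rows.
Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(5m)
// If this returns rows, the alert should fire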
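
The missed-heartbeat logic can also be reasoned about outside KQL. The sketch below is an assumption-laden illustration in Python (the function name `stale_computers` and the sample data are hypothetical): it takes the heartbeat rows that fall inside the lookback window, keeps the latest timestamp per computer, and reports those older than the threshold.

```python
from datetime import datetime, timedelta, timezone


def stale_computers(heartbeats, now, threshold=timedelta(minutes=5)):
    """Return computers whose latest heartbeat is older than `threshold`.

    heartbeats: iterable of (computer, timestamp) pairs, standing in for
    the Heartbeat rows inside the query's lookback window.
    """
    latest = {}
    for computer, ts in heartbeats:
        # Equivalent of: summarize LastHeartbeat = max(TimeGenerated) by Computer
        if computer not in latest or ts > latest[computer]:
            latest[computer] = ts
    cutoff = now - threshold
    # Equivalent of: where LastHeartbeat < ago(5m)
    return sorted(c for c, ts in latest.items() if ts < cutoff)


# Hypothetical sample data
now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    ("vm-a", now - timedelta(minutes=2)),
    ("vm-b", now - timedelta(minutes=20)),
    ("vm-a", now - timedelta(minutes=30)),
]
print(stale_computers(rows, now))
```

A computer can only appear in the result if at least one of its rows is inside the lookback window, which is why the window has to reach further back than the threshold.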

Check Action Group Notifications

# Get action group details
az monitor action-group show --name "<ag-name>" --resource-group "<rg-name>"

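# Test action group (sends a sample notification of the given alert type to the listed receiver)
az monitor action-group test-notifications create --action-group "<ag-name>" --resource-group "<rg-name>" --alert-type "metricstaticthreshold" --add-action email "<receiver-name>" "<email-address>"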

6.4 Log Analytics Workspace Issues

Check Workspace Health

// Check for ingestion issues
_LogOperation
| where TimeGenerated > ago(1h)
| where Level != "Info"
| project TimeGenerated, Operation, Detail, _ResourceId
| order by TimeGenerated desc

Verify Table Schema

// Get schema for a specific table
<TableName>
| getschema

Check Query Performance

// Identify slow queries
LAQueryLogs
| where TimeGenerated > ago(24h)
| where ResponseDurationMs > 30000 // > 30 seconds
| project TimeGenerated, RequestClientApp, QueryText, ResponseRowCount, ResponseDurationMs
| order by ResponseDurationMs desc
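
Verify Private Link DNS Resolution

# Check that the workspace ingestion endpoint resolves to a private IP
# (the host name uses the workspace ID, a GUID, not the workspace name)
Resolve-DnsName "<workspace-id>.ods.opinsights.azure.com"
# Should return a private IP from your VNet range (for example, 10.x.x.x)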

# Check Private DNS Zone
az network private-dns record-set a list --zone-name "privatelink.ods.opinsights.azure.com" --resource-group "<dns-rg>"

# Verify Private Endpoint
az network private-endpoint show --name "<pe-name>" --resource-group "<rg-name>" --query "customDnsConfigs"

7. RBAC Operations

7.1 Access Review Process

7.2 Generate Access Report

# Export all role assignments for LAW
$lawId = "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<law-name>"

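# Note: --all cannot be combined with --scope; --include-inherited adds assignments from parent scopes
az role assignment list --scope $lawId --include-inherited --output table > law-access-report.txt

# Get principals (users, groups, service principals) with the Log Analytics Reader role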
az role assignment list --scope $lawId --role "Log Analytics Reader" --output json | ConvertFrom-Json | Select-Object principalName, principalType
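
For the monthly access review it can help to group the exported assignments by role. The following is a minimal Python sketch that parses the JSON produced by `az role assignment list --output json`; the function name `principals_by_role` and the sample payload are hypothetical, and the sketch relies only on the `roleDefinitionName`, `principalType`, and `principalName` fields of that output.

```python
import json
from collections import defaultdict


def principals_by_role(assignments_json: str) -> dict:
    """Group (principalType, principalName) pairs by role name."""
    grouped = defaultdict(list)
    for item in json.loads(assignments_json):
        role = item.get("roleDefinitionName", "<unknown>")
        grouped[role].append(
            (item.get("principalType", ""), item.get("principalName", ""))
        )
    return {role: sorted(members) for role, members in grouped.items()}


# Hypothetical sample of the CLI output, trimmed to the fields used above
sample = json.dumps([
    {"roleDefinitionName": "Log Analytics Reader",
     "principalType": "Group", "principalName": "LZ-Team-Group"},
    {"roleDefinitionName": "Log Analytics Contributor",
     "principalType": "Group", "principalName": "LZ-Admin-Group"},
    {"roleDefinitionName": "Log Analytics Reader",
     "principalType": "User", "principalName": "alice@example.com"},
])
report = principals_by_role(sample)
for role, members in sorted(report.items()):
    print(role, members)
```

Individual `User` entries on a shared workspace are often the first candidates to review, since the onboarding runbook grants access through groups.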

7.3 Common RBAC Scenarios

Grant Landing Zone Team Access

// Reference: See knowledge-base/data-mesh-monitoring-reference.md for existing patterns

resource lawRbac 'Microsoft.Authorization/roleAssignments@2022-04-01' = {
  name: guid(logAnalyticsWorkspace.id, landingZoneGroupId, 'Log Analytics Reader')
  scope: logAnalyticsWorkspace
  properties: {
    principalId: landingZoneGroupId
    roleDefinitionId: subscriptionResourceId('Microsoft.Authorization/roleDefinitions', '73c42c96-874c-492b-b04d-ab87d138a893') // Log Analytics Reader
    principalType: 'Group'
  }
}

Table-Level RBAC

# Set total retention for a specific table
# Note: this command changes retention only; it does not grant access
az monitor log-analytics workspace table update \
  --workspace-name "<law-name>" \
  --resource-group "<rg>" \
  --name "SecurityEvent" \
  --total-retention-time 90

# Table-level access is granted through role assignments that use custom role definitions
# See 01-architecture-overview.md Section 3.3 for custom role examples

8. Cost Monitoring & Optimization

8.1 Cost Dashboard Query

// Daily billable ingestion and estimated cost over 90 days, with a running total
Usage
| where TimeGenerated > ago(90d)
| where IsBillable == true
| summarize DailyGB = sum(Quantity) / 1000. by bin(TimeGenerated, 1d) // Quantity is reported in MB
| extend CostUSD = DailyGB * 2.76 // Example per-GB price; adjust for your region and pricing tier
| order by TimeGenerated asc
| extend RunningTotal = row_cumsum(CostUSD)
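
The arithmetic behind the query is simple enough to check by hand. The sketch below is an illustrative Python equivalent of the `extend CostUSD` and `row_cumsum` steps; the function name `cost_trend`, the sample volumes, and the price are assumptions chosen for readability, not real pricing.

```python
def cost_trend(daily_gb, price_per_gb):
    """Return (day, cost, running_total) tuples ordered by day.

    daily_gb: iterable of (iso_date, billable_gb) pairs.
    price_per_gb: assumed pay-as-you-go price for the region.
    """
    rows = []
    running_total = 0.0
    for day, gb in sorted(daily_gb):  # ISO dates sort chronologically
        cost = gb * price_per_gb
        running_total += cost  # same role as row_cumsum(CostUSD)
        rows.append((day, cost, running_total))
    return rows


# Hypothetical sample: 10 GB then 20 GB at an assumed 2.00 per GB
trend = cost_trend([("2026-01-02", 20.0), ("2026-01-01", 10.0)], price_per_gb=2.0)
for row in trend:
    print(row)
```

Sorting before accumulating matters for the same reason the KQL query orders by `TimeGenerated` before calling `row_cumsum`: a running total is only meaningful over ordered rows.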

8.2 Optimization Opportunities

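// Find tables with low query activity but high ingestion
// Note: the table name is extracted heuristically (first identifier followed by a pipe),
// so queries that start with let statements or unions may be attributed incorrectly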
let queriedTables = LAQueryLogs
| where TimeGenerated > ago(30d)
| extend TableName = extract(@"([\w]+)\s*\|", 1, QueryText)
| summarize QueryCount = count() by TableName;

Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| join kind=leftouter queriedTables on $left.DataType == $right.TableName
| where isnull(QueryCount) or QueryCount < 10
| where IngestedGB > 1 // More than 1GB
| project DataType, IngestedGB, QueryCount = coalesce(QueryCount, 0)
| order by IngestedGB desc
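
The selection rule in this query (left outer join, then filter on volume and query count) can be sketched in Python for quick what-if checks against exported numbers. The function name `low_value_tables` and the sample figures are hypothetical; the thresholds mirror the query's 1 GB and 10 queries.

```python
def low_value_tables(ingested_gb, query_counts, min_gb=1.0, max_queries=10):
    """Return (table, gb, queries) for tables that are large but rarely queried.

    ingested_gb: dict of table name -> GB ingested in the period.
    query_counts: dict of table name -> number of queries in the period.
    """
    candidates = []
    for table, gb in ingested_gb.items():
        # A table missing from query_counts was never queried (left outer join)
        queries = query_counts.get(table, 0)
        if gb > min_gb and queries < max_queries:
            candidates.append((table, gb, queries))
    # Largest ingestion first, as in: order by IngestedGB desc
    return sorted(candidates, key=lambda row: row[1], reverse=True)


# Hypothetical sample figures
ingested = {"Perf": 120.0, "Syslog": 40.0, "Heartbeat": 0.5, "SecurityEvent": 300.0}
queries = {"Perf": 3, "SecurityEvent": 250}
print(low_value_tables(ingested, queries))
```

Tables returned here are candidates for review, not for automatic removal: a table may be queried rarely but still be required for compliance or incident investigation.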

8.3 Cost Reduction Actions

| Scenario | Action | Estimated Savings |
|----------|--------|-------------------|
| Unused tables | Disable collection | 100% of table cost |
| Verbose logging | Apply sampling in DCR | 50-90% |
| Long retention | Reduce to 30 days | Variable |
| Infrequent queries | Archive tier | 60-80% |
| Duplicate data | Consolidate DCRs | Variable |

8.4 DCR Transformation Quick Reference

Microsoft Best Practice: Use transformations to filter or modify incoming data before it's sent to Log Analytics to reduce ingestion costs.

📖 Full Deep-Dive: See 03-advanced-topics.md - Section 4. Cost Optimization for complete Bicep examples and implementation patterns.

| Technique | Typical Savings | Quick Example |
|-----------|-----------------|---------------|
| Filter Rows | 50-90% | `source \| where SeverityLevel in ("err", "crit")` |
| Filter Columns | 20-40% | `source \| project-away ParameterXml, UserData` |
| Aggregate Data | 60-80% | `source \| summarize avg(CounterValue) by bin(TimeGenerated, 5m)` |
| Route to Basic Logs | 80% | Route verbose logs to Basic Logs table |
| Mask PII | 0% (compliance) | `replace_regex(Email, @"[a-z]+@", "***@")` |

Reference: Microsoft Transformation Samples
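
Masking patterns are worth testing against realistic values before they go into a DCR. The sketch below uses Python's `re.sub` as a stand-in for `replace_regex`, on the assumption that both engines treat these simple patterns the same way; the function names and sample addresses are hypothetical. It shows that the quick-reference pattern masks only the run of lowercase letters directly before the `@`, and compares it with a broader pattern that masks the whole local part.

```python
import re


def mask_quick_reference(email: str) -> str:
    """Apply the pattern from the quick-reference table."""
    return re.sub(r"[a-z]+@", "***@", email)


def mask_local_part(email: str) -> str:
    """Broader alternative: mask everything before the @ sign."""
    return re.sub(r"[^@\s]+@", "***@", email)


for address in ("alice@example.com", "john.doe@example.com"):
    print(address, "->", mask_quick_reference(address), "|", mask_local_part(address))
```

For `john.doe@example.com` the quick-reference pattern leaves `john.` visible, so addresses containing dots, digits, or uppercase letters need the broader pattern or an equivalent.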


9. Runbook Templates

9.1 New Landing Zone Onboarding

## Runbook: Onboard New Landing Zone to UMS

### Prerequisites
- [ ] Landing Zone deployed
- [ ] Service Principal created
- [ ] Network connectivity established

### Steps

1. **Register Landing Zone in Central LAW**
   ```powershell
   # Parameters
   $landingZoneName = "<lz-name>"
   $landingZoneSubId = "<sub-id>"
   $centralLawId = "<central-law-resource-id>"

   # Apply baseline DCR (a target resource group is required for a group deployment)
   az deployment group create --subscription $landingZoneSubId --resource-group "<rg-name>" --template-file dcr-baseline.bicep --parameters landingZone=$landingZoneName
   ```

2. **Assign RBAC**
   - Log Analytics Reader: LZ-Team-Group
   - Log Analytics Contributor: LZ-Admin-Group

3. **Deploy AMBA Policies**
   - Assign policy initiative to subscription
   - Set exemptions if needed

4. **Configure Action Group**
   - Add LZ email/webhook to action group
   - Test notification

5. **Validate**
   - Run test query
   - Trigger test alert
   - Confirm notification received

### Rollback

1. Remove DCR association
2. Revoke RBAC
3. Remove policy assignment

9.2 Monthly Operations Checklist

## Runbook: Monthly Operations Review

### Week 1: Access Review
- [ ] Generate RBAC report
- [ ] Review with team leads
- [ ] Revoke stale access
- [ ] Document approvals

### Week 2: Cost Analysis
- [ ] Run cost queries
- [ ] Identify optimization opportunities
- [ ] Create tickets for savings initiatives
- [ ] Update forecast

### Week 3: Capacity Planning
- [ ] Review ingestion trends
- [ ] Check commitment tier utilization
- [ ] Plan for new workloads
- [ ] Update capacity model

### Week 4: Alert Review
- [ ] Review false positive rate
- [ ] Tune noisy alerts
- [ ] Add missing coverage
- [ ] Update runbooks

9.3 Incident Response Template

## Incident Response: [INCIDENT-ID]

### Incident Details
- **Severity**: Sev [0/1/2/3]
- **Start Time**: YYYY-MM-DD HH:MM UTC
- **Impact**: [Description]
- **Affected Resources**: [List]

### Timeline
| Time | Action | By |
|------|--------|----|
| HH:MM | Alert received | System |
| HH:MM | Acknowledged | [Name] |
| HH:MM | Investigation started | [Name] |
| HH:MM | Root cause identified | [Name] |
| HH:MM | Fix applied | [Name] |
| HH:MM | Validated | [Name] |
| HH:MM | Closed | [Name] |

### Root Cause
[Description of root cause]

### Resolution
[Steps taken to resolve]

### Lessons Learned
- [ ] Update monitoring
- [ ] Update runbook
- [ ] Training needed
- [ ] Process improvement

### Action Items
| Action | Owner | Due Date |
|--------|-------|----------|
| [Action] | [Name] | YYYY-MM-DD |

Repository References

| Existing Module | Purpose | Path |
|-----------------|---------|------|
| Action Groups | Alert notification setup | modules/action-groups.bicep |
| Log Alerts | Alert rule definitions | modules/log-alerts.bicep |
| LAW | Workspace configuration | modules/log-analytic-workspace.bicep |
| Metric Alerts | Metric-based alerts | modules/metric-alerts.bicep |
| Storage Alerts | Storage account alerts | modules/storage-alerts.bicep |
| Diagnostic Settings | Storage logging | modules/storage-account-diagnostic-settings.bicep |
| Scheduled Query Rules | KQL-based alerts | modules/scheduled-query-rules.bicep |

Quick Reference Commands

# Azure Monitor Agent
az vm extension show --vm-name <vm> --resource-group <rg> --name AzureMonitorWindowsAgent

# DCR
az monitor data-collection rule show --name <dcr-name> --resource-group <rg>

# Alert Rules
az monitor scheduled-query list --resource-group <rg>

# Action Groups
az monitor action-group list --resource-group <rg>

# LAW
az monitor log-analytics workspace show --workspace-name <law> --resource-group <rg>


Document Control
Maintained by: Central Platform Team
Review Cycle: Monthly
Classification: Internal
