Skip to main content

Unified Monitoring Solution - Advanced Topics & Roadmap

Version: 1.0
Last Updated: January 2026 (Workshop Edition)
Purpose: Future enhancements, advanced scenarios, and roadmap items

Feature Availability Legend: ✅ GA | ⚠️ Preview | 📍 Planned

📚 Quick NavigationREADMEArchitectureOperations RunbookAdvanced Topics

Table of Contents

  1. Advanced Scenarios Overview
  2. Audit Logs Framework
  3. Disaster Recovery for Monitoring
  4. Cost Optimization Deep-Dive
  5. Multi-Cloud Monitoring
  6. AIOps & Intelligent Alerting
  7. Compliance & Governance Automation
  8. Workshop Quick Reference Card

1. Advanced Scenarios Overview

1.1 Maturity Model

The Unified Monitoring Solution follows a maturity progression:

1.2 Roadmap Timeline

PhaseTimelineFocus Areas
Phase 1NowFederated baseline, AMBA, DCR
Phase 2Q2Audit logs framework, Advanced RBAC
Phase 3Q3DR strategy, Multi-region
Phase 4Q4AIOps, Predictive analytics

2. Audit Logs Framework

Teaser from Knowledge Base: "Need comprehensive audit logging framework similar to Data Mesh patterns"

2.1 Audit Log Categories

2.2 Six Key Principles for Audit Logs

Reference: Microsoft Best Practices from Knowledge Base

#PrincipleDescription
1Centralized Logging ArchitectureConsolidate all platform, resource, and identity logs (Activity, Resource, Entra ID, Security) into central Log Analytics workspaces
2Common Logging StandardsDefine org-wide standardized schema, required log categories per resource type, minimum retention, classification (audit vs operational)
3Decoupling from Application LifecycleLogs must persist when applications are migrated, decommissioned, or refactored. Use central workspaces to ensure traceability
4Retention & Cost GovernanceImplement tiered retention: Hot (30-90d), Warm (1-2y), Cold/Archive (7+ years), Immutable (as required)
5Governance via Azure PolicyEnforce mandatory logging: required diagnostic settings, mandatory log categories, destination workspace requirements
6Enhanced Security & AnalyticsEnable advanced KQL analytics, Defender for Cloud, Microsoft Sentinel SIEM, cross-environment correlation

2.3 How Other Organizations Do Audit Logging

Industry Insight: "How is this done in other organizations?"

IndustryPatternKey Characteristics
Financial ServicesImmutable logs for 7+ yearsStrict schema enforcement, SOX/PCI-DSS compliance, dedicated compliance LAW
HealthcareHIPAA-compliant retentionAccess audit trails, PHI protection, 6+ year retention
ManufacturingOT/IT log correlationProcess mining for efficiency, hybrid on-prem/cloud
RetailPCI-DSS complianceTransaction audit trails, cardholder data protection

2.4 Audit Log Architecture

2.5 Audit Log Retention Strategy

// Audit-specific DCR with extended retention
resource auditDcr 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
name: 'dcr-audit-logs'
location: location
properties: {
dataSources: {
windowsEventLogs: [
{
name: 'securityEvents'
streams: ['Microsoft-SecurityEvent']
xPathQueries: [
'Security!*[System[(EventID=4624 or EventID=4625 or EventID=4648)]]' // Logon events
'Security!*[System[(EventID=4720 or EventID=4722 or EventID=4723)]]' // Account management
'Security!*[System[(EventID=4732 or EventID=4733 or EventID=4756)]]' // Group membership
]
}
]
}
destinations: {
logAnalytics: [
{
name: 'auditLaw'
workspaceResourceId: auditLawId
}
]
// Future: Add archive storage
// storageAccounts: [...]
}
dataFlows: [
{
streams: ['Microsoft-SecurityEvent']
destinations: ['auditLaw']
transformKql: 'source | where EventID in (4624, 4625, 4648, 4720, 4722, 4723, 4732, 4733, 4756)'
}
]
}
}

// Extended retention for audit table
resource auditTableRetention 'Microsoft.OperationalInsights/workspaces/tables@2022-10-01' = {
name: '${lawName}/SecurityEvent'
properties: {
totalRetentionInDays: 365 // 1 year for compliance
archiveRetentionInDays: 335 // 365 - 30 = 335 archive days
}
}

2.6 Compliance Reporting Query

// SOC 2 Type II - Access Review Report
let startDate = ago(90d);
let endDate = now();

// Sign-in activity
SigninLogs
| where TimeGenerated between (startDate .. endDate)
| summarize
TotalSignIns = count(),
SuccessfulSignIns = countif(ResultType == 0),
FailedSignIns = countif(ResultType != 0),
UniqueUsers = dcount(UserPrincipalName),
UniqueIPs = dcount(IPAddress)
| extend
FailureRate = round(FailedSignIns * 100.0 / TotalSignIns, 2),
ReportPeriod = strcat(format_datetime(startDate, 'yyyy-MM-dd'), ' to ', format_datetime(endDate, 'yyyy-MM-dd'))

// Join with privileged operations
union (
AuditLogs
| where TimeGenerated between (startDate .. endDate)
| where OperationName has_any ("Add member to role", "Remove member from role", "Add user", "Delete user")
| summarize PrivilegedOperations = count() by OperationName
)

3. Disaster Recovery for Monitoring

Common Question: "If we move central log to another region, is Zone-Redundant enough? Or should we consider Geo-replication?"

3.1 LAW Resilience Options ✅

Microsoft Official Answer: Azure does NOT provide native automatic geo-replication for Log Analytics Workspaces. You must configure one of the options below.

FeatureService ResilienceData BackupHigh AvailabilityScopeSetupCost
Workspace ReplicationCross-region protectionEnable replication + manual switchoverPaid (per replicated GB)
Availability Zones✅ (some regions)In-region datacenter protectionAutomatic in supported regionsFree
Continuous Data ExportRegional backup to GRS storageEnable per tableData export + Storage

Key Insight:

  • Availability Zones: Automatic, free, protects against datacenter failures within a region
  • Workspace Replication: Paid feature, replicates workspace and logs to another region, requires manual switchover
  • Neither provides automatic geo-failover - you must monitor primary workspace health and decide when to switch

3.2 Industry Patterns for DR

Industry Insight: "How do other organizations handle this?"

PatternArchitectureBest For
Pattern 1 (Recommended)Enable Workspace Replication to secondary regionOrganizations wanting Microsoft-supported cross-region DR
Pattern 2Single LAW with zone-redundancy + continuous export to GRS storageCost-conscious organizations with acceptable RPO
Pattern 3Dual DCR ingestion to two LAWs simultaneouslyMaximum redundancy, full control, double cost
Pattern 4Azure Data Explorer with follower databasesAdvanced analytics + DR combined

Enterprise Recommendations:

  • Zone-redundancy is baseline for all production workloads (free, automatic)
  • Workspace Replication for compliance-driven scenarios needing cross-region protection
  • Cost optimization through tiered approach: critical data uses replication, non-critical uses export

3.3 DR Strategies Diagram

3.4 Workspace Replication Pattern

// Primary workspace with data export to secondary
resource primaryLaw 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
name: 'law-ums-primary'
location: 'westeurope'
properties: {
sku: {
name: 'PerGB2018'
}
retentionInDays: 90
}
}

// Data export rule for critical tables
resource dataExportRule 'Microsoft.OperationalInsights/workspaces/dataExports@2020-08-01' = {
parent: primaryLaw
name: 'export-critical-to-secondary'
properties: {
destination: {
resourceId: eventHubNamespaceId // Event Hub for replication
}
tableNames: [
'Heartbeat'
'SecurityEvent'
'AzureActivity'
'Perf'
]
enable: true
}
}

// Secondary workspace ingests from Event Hub
resource secondaryLaw 'Microsoft.OperationalInsights/workspaces@2022-10-01' = {
name: 'law-ums-secondary'
location: 'northeurope'
properties: {
sku: {
name: 'PerGB2018'
}
}
}

3.5 DR Alert Configuration

// Duplicate critical alerts in secondary region
resource drAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
name: 'alert-heartbeat-dr'
location: 'northeurope'
properties: {
displayName: 'VM Heartbeat Missing (DR)'
severity: 1
enabled: true
evaluationFrequency: 'PT5M'
scopes: [
secondaryLawId
]
windowSize: 'PT15M'
criteria: {
allOf: [
{
query: '''
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(10m)
'''
timeAggregation: 'Count'
operator: 'GreaterThan'
threshold: 0
}
]
}
actions: {
actionGroups: [
drActionGroupId
]
}
}
}

3.6 RTO/RPO Considerations

ScenarioRPORTOSolution
LAW outage015 minCross-workspace query
Region outage5 min1 hourData export + secondary LAW
Total loss1 hour4 hoursStorage account backup

3.7 DR + Monitoring Integration Scenario

Enterprise Scenario: "Link DR and monitoring together → Stage disaster scenario → Observe failover → Track via dashboard → Receive notifications → Verify restoration"

End-to-End DR Monitoring Walkthrough

Step 1: DR Monitoring Dashboard Setup

// DR Health Dashboard - Primary vs Secondary LAW
let primaryLaw = "law-ums-primary";
let secondaryLaw = "law-ums-secondary";

// Check ingestion health in both workspaces
Usage
| where TimeGenerated > ago(1h)
| summarize
LastIngestion = max(TimeGenerated),
DataVolumeGB = sum(Quantity) / 1024
by DataType
| extend
IngestionLag = now() - LastIngestion,
Status = iff(IngestionLag < 5m, "Healthy", iff(IngestionLag < 15m, "Degraded", "Critical"))
| project DataType, LastIngestion, IngestionLag, DataVolumeGB, Status

Step 2: Replication Lag Monitoring

// Monitor data export / replication lag
let checkInterval = 5m;

// Compare latest records between primary and secondary
Heartbeat
| where TimeGenerated > ago(checkInterval)
| summarize
PrimaryLatest = max(TimeGenerated)
by Computer
| join kind=leftouter (
// Query secondary workspace
workspace("law-ums-secondary").Heartbeat
| where TimeGenerated > ago(checkInterval)
| summarize SecondaryLatest = max(TimeGenerated) by Computer
) on Computer
| extend
ReplicationLag = PrimaryLatest - SecondaryLatest,
LagStatus = case(
isnull(SecondaryLatest), "Not Replicated",
ReplicationLag < 1m, "Synced",
ReplicationLag < 5m, "Acceptable Lag",
"Critical Lag"
)

Step 3: Failover Alert Configuration

// Alert when failover is detected or required
resource failoverAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
name: 'alert-dr-failover-detected'
location: location
properties: {
displayName: 'DR Failover Required - Primary LAW Unhealthy'
description: '''
## DR ALERT: Primary LAW Health Degraded

### Immediate Actions:
1. Verify primary LAW status in Azure Portal
2. Check Azure Service Health for outages
3. If outage confirmed, initiate failover to secondary

### Failover Steps:
1. Update DCR destinations to secondary LAW
2. Update dashboard connections
3. Notify stakeholders via Teams channel

### Runbook: [DR Failover Procedure](https://wiki.contoso.com/dr-failover)
'''
severity: 0
enabled: true
evaluationFrequency: 'PT1M'
scopes: [lawId]
windowSize: 'PT5M'
criteria: {
allOf: [
{
query: '''
Heartbeat
| summarize LastHeartbeat = max(TimeGenerated)
| where LastHeartbeat < ago(5m)
'''
timeAggregation: 'Count'
operator: 'GreaterThan'
threshold: 0
}
]
}
actions: {
actionGroups: [drActionGroupId]
customProperties: {
FailoverRunbook: 'https://wiki.contoso.com/dr-failover'
SecondaryLAW: 'law-ums-secondary'
EscalationPath: 'Platform Team → On-Call → DR Commander'
}
}
}
}

Step 4: Post-Failover Verification Dashboard

// Verify successful failover and data integrity
let failoverTime = datetime(2026-01-07T12:00:00Z); // Set to actual failover time

Heartbeat
| where TimeGenerated > failoverTime
| summarize
VmsReporting = dcount(Computer),
FirstHeartbeat = min(TimeGenerated),
LatestHeartbeat = max(TimeGenerated)
| extend
RecoveryTime = FirstHeartbeat - failoverTime,
Status = iff(VmsReporting > 0, "Failover Successful", "Failover In Progress")
| project Status, VmsReporting, RecoveryTime, FirstHeartbeat, LatestHeartbeat

DR Monitoring Checklist

PhaseMonitoring TaskDashboard/Alert
Pre-FailoverMonitor replication lagReplication Lag Widget
Pre-FailoverTrack primary healthPrimary LAW Health Alert
During FailoverDetect failover triggerFailover Required Alert
During FailoverTrack failover progressFailover Status Dashboard
Post-FailoverVerify data continuityData Integrity Query
Post-FailoverConfirm all VMs reportingVM Heartbeat Count
Post-FailoverValidate alert routingTest Alert Notification

4. Cost Optimization Deep-Dive

4.1 Cost Optimization Framework

4.2 Basic Logs Implementation

Basic Logs tables reduce cost by 80% for low-query-frequency data:

// Convert high-volume, low-query tables to Basic Logs
resource basicLogTable 'Microsoft.OperationalInsights/workspaces/tables@2022-10-01' = {
parent: logAnalyticsWorkspace
name: 'ContainerLog' // High volume container logs
properties: {
plan: 'Basic' // ~67% cost reduction vs Analytics
retentionInDays: 30 // Basic logs: 30 days interactive (updated June 2024)
totalRetentionInDays: 365 // Can archive for up to 12 years
}
}

Basic Logs Candidates:

  • ContainerLog / ContainerLogV2
  • AppTraces (verbose application logs)
  • Custom high-volume tables

4.3 Transformation for Cost Reduction

// DCR with transformation to reduce ingestion
resource costOptimizedDcr 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
name: 'dcr-cost-optimized'
location: location
properties: {
dataSources: {
performanceCounters: [
{
name: 'perfCounters'
streams: ['Microsoft-Perf']
samplingFrequencyInSeconds: 300 // 5 min instead of 1 min = 80% reduction
counterSpecifiers: [
'\\Processor(_Total)\\% Processor Time'
'\\Memory\\Available MBytes'
]
}
]
}
dataFlows: [
{
streams: ['Microsoft-Perf']
destinations: ['lawDestination']
transformKql: '''
source
| where CounterValue > 0 // Filter zero values
| summarize Value = avg(CounterValue) by Computer, ObjectName, CounterName, bin(TimeGenerated, 5m)
''' // Aggregate to reduce rows
}
]
}
}

4.4 Commitment Tier Calculator

// Calculate optimal commitment tier
let dailyIngestionGB = toscalar(
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize TotalMB = sum(Quantity)
| extend TotalGB = TotalMB / 1024 / 30 // Daily average
);

let tiers = datatable(TierName:string, MinGB:real, PricePerGB:real) [
"Pay-As-You-Go", 0, 2.76,
"100 GB/day", 100, 2.30,
"200 GB/day", 200, 2.12,
"300 GB/day", 300, 2.07,
"400 GB/day", 400, 2.03,
"500 GB/day", 500, 2.00
];

tiers
| where MinGB <= dailyIngestionGB
| extend
DailyCost = dailyIngestionGB * PricePerGB,
MonthlyCost = dailyIngestionGB * PricePerGB * 30
| order by MinGB desc
| take 3

4.5 Cost Alerting

// Alert when daily ingestion exceeds threshold
resource costAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
name: 'alert-ingestion-cost-threshold'
location: location
properties: {
displayName: 'Daily Ingestion Exceeds Budget'
severity: 2
enabled: true
evaluationFrequency: 'PT1H'
windowSize: 'P1D'
scopes: [lawId]
criteria: {
allOf: [
{
query: '''
Usage
| where TimeGenerated > ago(1d)
| where IsBillable == true
| summarize TotalGB = sum(Quantity) / 1024
| where TotalGB > 100 // Daily budget threshold
'''
timeAggregation: 'Count'
operator: 'GreaterThan'
threshold: 0
}
]
}
}
}

4.6 Summary Rules for High-Volume Data ⚠️ Preview

Microsoft Best Practice: Summary Rules enable automated data summarization that significantly reduces storage costs for high-volume log data while maintaining analytical insights.

⚠️ Status: Summary Rules are currently in Public Preview (as of January 2026). Features may change before GA.

Summary Rules let you aggregate high-ingestion-rate streams, providing robust analysis while optimizing long-term retention expenses:

// Example: Create summary rule for performance data
// Summarize hourly instead of every minute = 60x reduction
.create table PerfSummary (
Computer: string,
CounterName: string,
AvgValue: real,
MaxValue: real,
MinValue: real,
SampleCount: int,
TimeGenerated: datetime
)

// Summary rule query
Perf
| summarize
AvgValue = avg(CounterValue),
MaxValue = max(CounterValue),
MinValue = min(CounterValue),
SampleCount = count()
by Computer, CounterName, bin(TimeGenerated, 1h)

4.7 Azure Advisor Cost Alerts

Microsoft Best Practice: Set up Azure Advisor alerts to proactively notify you of cost optimization opportunities.

Configure alerts for these Advisor recommendations:

  • Basic Logs eligible tables - Tables with >1GB/month that could use lower-cost plan
  • Pricing tier changes - Usage volume suggests commitment tier discount
  • Unused restored tables - Restored data incurring unnecessary charges
  • Data ingestion anomalies - Sudden ingestion spikes vs. baseline
// Create Azure Advisor alert for cost recommendations
resource advisorAlert 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
name: 'alert-advisor-cost-recommendations'
location: 'global'
properties: {
description: 'Alert when Azure Advisor has cost recommendations for LAW'
enabled: true
scopes: [subscription().id]
condition: {
allOf: [
{
field: 'category'
equals: 'Recommendation'
}
{
field: 'operationName'
equals: 'Microsoft.Advisor/recommendations/available/action'
}
]
}
actions: {
actionGroups: [
{
actionGroupId: actionGroupId
}
]
}
}
}

4.8 Free Alert Types (Cost Optimization) ✅

Microsoft Best Practice (Confirmed): Activity log alerts, Service Health alerts, and Resource Health alerts are free of charge. Use these whenever possible.

Alert TypeCostUse Case
Activity Log AlertsFREEAdministrative operations, policy changes
Service Health AlertsFREEAzure service outages, maintenance
Resource Health AlertsFREEIndividual resource availability
Metric AlertsPaidPerformance thresholds
Log Search AlertsPaidComplex KQL queries

Priority: Always check if Activity/Service/Resource Health alerts can meet your requirement before using paid alert types.

4.9 DCR Organization by Observability Scope

Microsoft Best Practice: Create DCRs specific to data source type inside defined observability scopes for better management and cost control.

DCR Organization Principles:

  1. One DCR per data source type - Separate performance from events
  2. Scope by application/workload - SQL, Web, Database, etc.
  3. Environment separation - Dev/Test/Prod can have different collection levels
  4. Destination-based - Separate DCRs for different compliance regions

4.10 Search Jobs for Historical Analysis

Microsoft Best Practice: Use Search Jobs for complex analytical queries on large datasets and long-term retention data without impacting real-time monitoring.

Search Jobs are asynchronous queries that:

  • Run on any data including archived/long-term retention data
  • Create new Analytics tables for further querying
  • Separate analytical workloads from operational monitoring
# Create a search job for historical security analysis
az monitor log-analytics workspace search-job create `
--workspace-name "law-central-prod" `
--resource-group "rg-monitoring" `
--name "SecurityAudit2025" `
--start-time "2025-01-01T00:00:00Z" `
--end-time "2025-12-31T23:59:59Z" `
--table-name "SecurityEvent" `
--query "SecurityEvent | where EventID in (4624, 4625, 4648)"

5. Multi-Cloud Monitoring

5.1 Multi-Cloud Architecture

5.2 Azure Arc for Multi-Cloud

// Azure Arc-enabled server monitoring
resource arcServer 'Microsoft.HybridCompute/machines@2023-10-03-preview' existing = {
name: 'aws-server-01'
}

// AMA extension for Arc-enabled servers
resource amaExtension 'Microsoft.HybridCompute/machines/extensions@2023-10-03-preview' = {
parent: arcServer
name: 'AzureMonitorLinuxAgent'
location: location
properties: {
publisher: 'Microsoft.Azure.Monitor'
type: 'AzureMonitorLinuxAgent'
typeHandlerVersion: '1.0'
autoUpgradeMinorVersion: true
}
}

// DCR association for Arc server
resource dcrAssociation 'Microsoft.Insights/dataCollectionRuleAssociations@2022-06-01' = {
name: 'arc-dcr-association'
scope: arcServer
properties: {
dataCollectionRuleId: dcrId
}
}

5.3 Container Insights Integration

Reference: Production pattern from enterprise AKS implementations (Fortune 500 deployments)

Container Insights provides monitoring for AKS clusters with optimized configuration for cost and performance.

Container Insights Architecture

AKS with Container Insights (Bicep)

// AKS cluster with Container Insights enabled
resource aksCluster 'Microsoft.ContainerService/managedClusters@2023-10-01' = {
name: aksClusterName
location: location
properties: {
kubernetesVersion: '1.28'
dnsPrefix: aksClusterName
agentPoolProfiles: [
{
name: 'system'
count: 3
vmSize: 'Standard_D4s_v5'
mode: 'System'
}
]
addonProfiles: {
omsagent: {
enabled: true
config: {
logAnalyticsWorkspaceResourceID: lawId
useAADAuth: 'true'
}
}
azurePolicy: {
enabled: true
}
}
}
}

// DCR for Container Insights (AMA-based)
resource containerInsightsDcr 'Microsoft.Insights/dataCollectionRules@2022-06-01' = {
name: 'dcr-container-insights-${aksClusterName}'
location: location
properties: {
dataSources: {
extensions: [
{
name: 'ContainerInsightsExtension'
streams: [
'Microsoft-ContainerLog'
'Microsoft-ContainerLogV2'
'Microsoft-KubeEvents'
'Microsoft-KubePodInventory'
'Microsoft-KubeNodeInventory'
'Microsoft-KubeServices'
'Microsoft-KubeMonAgentEvents'
'Microsoft-InsightsMetrics'
'Microsoft-ContainerInventory'
'Microsoft-ContainerNodeInventory'
'Microsoft-Perf'
]
extensionName: 'ContainerInsights'
extensionSettings: {
dataCollectionSettings: {
interval: '1m'
namespaceFilteringMode: 'Include'
namespaces: ['kube-system', 'default', 'app-namespace']
enableContainerLogV2: true
}
}
}
]
}
destinations: {
logAnalytics: [
{
workspaceResourceId: lawId
name: 'ciLaw'
}
]
}
dataFlows: [
{
streams: [
'Microsoft-ContainerLog'
'Microsoft-ContainerLogV2'
'Microsoft-KubeEvents'
'Microsoft-KubePodInventory'
]
destinations: ['ciLaw']
}
]
}
}

Container Insights ConfigMap (Cost Optimization)

# ConfigMap for Container Insights agent settings
# Reference: Enterprise cloud-native pattern - optimized for cost
kind: ConfigMap
apiVersion: v1
metadata:
name: container-azm-ms-agentconfig
namespace: kube-system
data:
schema-version: v1
config-version: v1

# Log collection settings - optimized for FluentBit/Loki alternative
log_collection_settings: |-
[log_collection_settings]
[log_collection_settings.stdout]
enabled = false # Disabled when using FluentBit → Loki
exclude_namespaces = ["kube-system"]
[log_collection_settings.stderr]
enabled = false # Disabled when using FluentBit → Loki
[log_collection_settings.env_var]
enabled = true
[log_collection_settings.enrich_container_logs]
enabled = true

# Alertable metrics thresholds
alertable_metrics_configuration_settings: |-
[alertable_metrics_configuration_settings.container_resource_utilization_thresholds]
container_cpu_threshold_percentage = 95.0
container_memory_rss_threshold_percentage = 95.0
container_memory_working_set_threshold_percentage = 95.0
[alertable_metrics_configuration_settings.pv_utilization_thresholds]
pv_usage_threshold_percentage = 60.0
[alertable_metrics_configuration_settings.job_completion_threshold]
job_completion_threshold_time_minutes = 360

Container Insights KQL Queries

// Pod restart analysis
KubePodInventory
| where TimeGenerated > ago(24h)
| where RestartCount > 0
| summarize TotalRestarts = sum(RestartCount) by Name, Namespace, ClusterName
| order by TotalRestarts desc
| take 20

// Container CPU/Memory pressure
Perf
| where TimeGenerated > ago(1h)
| where ObjectName == "K8SContainer"
| where CounterName in ("cpuUsageNanoCores", "memoryWorkingSetBytes")
| summarize AvgValue = avg(CounterValue) by CounterName, InstanceName
| extend Container = tostring(split(InstanceName, "/")[1])
| project Container, CounterName, AvgValue

// OOMKilled containers
KubeEvents
| where TimeGenerated > ago(24h)
| where Reason == "OOMKilled"
| project TimeGenerated, Name, Namespace, Message
| order by TimeGenerated desc

5.4 Hybrid Cloud-Native Observability Stack

Reference: Production pattern from enterprise AKS implementations (Fortune 500 deployments)

This section covers integrating cloud-native CNCF observability tools (Prometheus, Grafana, Loki, Tempo) with Azure-native monitoring for a hybrid approach.

Hybrid Architecture Overview

When to Use Hybrid Stack

ScenarioUse Azure-NativeUse Cloud-Native (CNCF)Use Hybrid
AKS-heavy workloads✅ Prometheus/Loki native
Mixed IaaS/PaaS✅ Azure Monitor✅ Best of both
Cost-sensitive log storage❌ LAW expensive✅ Loki + Blob cheaper
Tracing requirements⚠️ App Insights✅ Tempo/Jaeger
Multi-cluster Kubernetes✅ Prometheus federation
Azure PaaS monitoring✅ Metrics/Diagnostic

Grafana with Azure Monitor Datasource

# Grafana datasource configuration (enterprise pattern)
# Multi-environment Azure Monitor integration
datasources:
- name: Azure Monitor - PRD
type: grafana-azure-monitor-datasource
jsonData:
cloudName: azuremonitor
subscriptionId: ${SUBSCRIPTION_ID_PRD}
tenantId: ${TENANT_ID}
clientId: ${GRAFANA_CLIENT_ID}
logAnalyticsDefaultWorkspace: ${LAW_ID_PRD}
azureLogAnalyticsSameAs: true
secureJsonData:
clientSecret: ${GRAFANA_CLIENT_SECRET}

- name: Azure Monitor - DEV
type: grafana-azure-monitor-datasource
jsonData:
cloudName: azuremonitor
subscriptionId: ${SUBSCRIPTION_ID_DEV}
tenantId: ${TENANT_ID}
clientId: ${GRAFANA_CLIENT_ID}
logAnalyticsDefaultWorkspace: ${LAW_ID_DEV}

- name: Prometheus - PRD
type: prometheus
url: http://prometheus-server.monitoring:80

- name: Loki - PRD
type: loki
url: http://loki-gateway.monitoring:80
jsonData:
derivedFields:
- name: TraceID
matcherRegex: "traceID=(\\w+)"
url: "$${__value.raw}"
datasourceUid: tempo-prd

- name: Tempo - PRD
type: tempo
url: http://tempo-gateway.monitoring:80
uid: tempo-prd

FluentBit for Loki (Cost-Optimized Logging)

# FluentBit configuration for Loki integration
# Alternative to Container Insights stdout/stderr collection
config:
inputs: |
[INPUT]
Name tail
Path /var/log/containers/*.log
Tag kube.*
DB /var/log/flb_kube.db
multiline.parser docker, cri
Mem_Buf_Limit 256MB
storage.type filesystem

filters: |
[FILTER]
Name kubernetes
Match kube.*
Kube_URL https://kubernetes.default.svc:443
Merge_Log On
Merge_Log_Key log_processed
K8S-Logging.Parser On

outputs: |
[OUTPUT]
Name loki
Match *
Host loki-gateway.monitoring
Port 80
Labels env=${ENVIRONMENT}, namespace=$kubernetes['namespace_name'], pod=$kubernetes['pod_name']
auto_kubernetes_labels on
tenant_id ${SUBSCRIPTION_ID}

OpenTelemetry Collector for Traces

# OpenTelemetry Collector configuration for Tempo
# Deployment mode (not DaemonSet)
config:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318

processors:
memory_limiter:
limit_mib: 512
spike_limit_mib: 128
batch:
timeout: 10s
send_batch_size: 1024

exporters:
otlp:
endpoint: tempo-gateway.monitoring:4317
tls:
insecure: true
headers:
X-Scope-OrgID: ${SUBSCRIPTION_ID}

service:
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch]
exporters: [otlp]

Hybrid Stack Decision Matrix

ComponentAzure-Native OptionCloud-Native OptionRecommendation
MetricsAzure Monitor MetricsPrometheusPrometheus for K8s, Azure for PaaS
LogsLog AnalyticsLokiLoki for K8s (cost), LAW for compliance
TracesApplication InsightsTempo/JaegerTempo for K8s, App Insights for .NET
VisualizationAzure DashboardsGrafanaGrafana as single pane
AlertingAzure Monitor AlertsPrometheus AlertmanagerAzure Monitor (integration)

6. AIOps & Intelligent Alerting

6.1 AIOps Capabilities Roadmap

6.2 Dynamic Threshold Alerts

// Metric alert with dynamic threshold
resource dynamicAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: 'alert-cpu-dynamic'
location: 'global'
properties: {
severity: 2
enabled: true
scopes: [vmId]
evaluationFrequency: 'PT5M'
windowSize: 'PT5M'
criteria: {
'odata.type': 'Microsoft.Azure.Monitor.MultipleResourceMultipleMetricCriteria'
allOf: [
{
criterionType: 'DynamicThresholdCriterion'
name: 'cpuDynamic'
metricName: 'Percentage CPU'
metricNamespace: 'Microsoft.Compute/virtualMachines'
operator: 'GreaterThan'
alertSensitivity: 'Medium'
failingPeriods: {
numberOfEvaluationPeriods: 4
minFailingPeriodsToAlert: 3
}
timeAggregation: 'Average'
}
]
}
actions: [
{
actionGroupId: actionGroupId
}
]
}
}

6.3 Smart Alert Grouping

// Correlate related alerts
AlertsManagementResources
| where type == "microsoft.alertsmanagement/alerts"
| where properties.essentials.startDateTime > ago(24h)
| extend
AlertName = properties.essentials.alertRule,
Severity = properties.essentials.severity,
TargetResource = properties.essentials.targetResource,
StartTime = properties.essentials.startDateTime
| summarize
AlertCount = count(),
Alerts = make_set(AlertName),
Severities = make_set(Severity)
by TargetResource, bin(todatetime(StartTime), 5m)
| where AlertCount > 3 // Multiple alerts on same resource = potential cascade
| order by AlertCount desc

6.4 Anomaly Detection Query

// Detect anomalies in metric data using series_decompose_anomalies
let timeRange = 7d;
let anomalyThreshold = 3.0;

Perf
| where TimeGenerated > ago(timeRange)
| where ObjectName == "Processor" and CounterName == "% Processor Time"
| summarize AvgCPU = avg(CounterValue) by Computer, bin(TimeGenerated, 1h)
| order by Computer, TimeGenerated asc
| summarize TimeGenerated = make_list(TimeGenerated), CPU = make_list(AvgCPU) by Computer
| extend (anomalies, score, baseline) = series_decompose_anomalies(CPU, anomalyThreshold)
| mv-expand TimeGenerated, CPU, anomalies, score, baseline
| where toint(anomalies) != 0
| project TimeGenerated = todatetime(TimeGenerated), Computer,
ActualCPU = toreal(CPU),
Baseline = toreal(baseline),
AnomalyScore = toreal(score),
AnomalyType = iff(toint(anomalies) > 0, "Spike", "Dip")

6.5 Auto-Remediation Integration

Common Requirement: "Logic Apps for automated remediation, Azure Automation runbooks, ServiceNow ticket creation with context"

Auto-Remediation Architecture

Logic App for Automated Remediation

// Logic App Definition - VM Restart on Unresponsive
{
"definition": {
"$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
"triggers": {
"When_alert_fires": {
"type": "Request",
"kind": "Http",
"inputs": {
"schema": {
"type": "object",
"properties": {
"schemaId": { "type": "string" },
"data": {
"type": "object",
"properties": {
"essentials": { "type": "object" },
"alertContext": { "type": "object" }
}
}
}
}
}
}
},
"actions": {
"Parse_Alert": {
"type": "ParseJson",
"inputs": {
"content": "@triggerBody()",
"schema": { }
}
},
"Get_VM_Details": {
"type": "Http",
"inputs": {
"method": "GET",
"uri": "https://management.azure.com@{body('Parse_Alert')?['data']?['essentials']?['targetResourceId']}?api-version=2023-03-01",
"authentication": { "type": "ManagedServiceIdentity" }
}
},
"Check_If_Auto_Remediate": {
"type": "If",
"expression": {
"and": [
{ "equals": ["@body('Get_VM_Details')?['tags']?['AutoRemediate']", "true"] },
{ "lessOrEquals": ["@body('Parse_Alert')?['data']?['essentials']?['severity']", "Sev2"] }
]
},
"actions": {
"Restart_VM": {
"type": "Http",
"inputs": {
"method": "POST",
"uri": "https://management.azure.com@{body('Parse_Alert')?['data']?['essentials']?['targetResourceId']}/restart?api-version=2023-03-01",
"authentication": { "type": "ManagedServiceIdentity" }
}
},
"Log_Remediation": {
"type": "Http",
"inputs": {
"method": "POST",
"uri": "@parameters('LogAnalyticsDataCollectorUrl')",
"body": {
"AlertName": "@body('Parse_Alert')?['data']?['essentials']?['alertRule']",
"ResourceId": "@body('Parse_Alert')?['data']?['essentials']?['targetResourceId']",
"RemediationAction": "VM Restart",
"Status": "Initiated",
"Timestamp": "@utcNow()"
}
}
}
},
"else": {
"actions": {
"Create_ServiceNow_Ticket": {
"type": "ServiceNow",
"inputs": { }
}
}
}
}
}
}
}

Azure Automation Runbook

# Runbook: Remediate-HighCPU.ps1
param (
[Parameter(Mandatory=$true)]
[string]$ResourceId,

[Parameter(Mandatory=$true)]
[string]$AlertName,

[Parameter(Mandatory=$false)]
[int]$CPUThreshold = 90
)

# Connect using managed identity
Connect-AzAccount -Identity

# Parse resource details
$resourceParts = $ResourceId -split '/'
$subscriptionId = $resourceParts[2]
$resourceGroup = $resourceParts[4]
$vmName = $resourceParts[-1]

Set-AzContext -SubscriptionId $subscriptionId

# Get VM details
$vm = Get-AzVM -ResourceGroupName $resourceGroup -Name $vmName

# Check if auto-remediation is allowed
if ($vm.Tags['AutoRemediate'] -ne 'true') {
Write-Output "Auto-remediation not enabled for VM: $vmName"
return
}

# Attempt remediation based on issue type
switch ($AlertName) {
'alert-vm-cpu-critical' {
# For high CPU, try to identify and stop runaway process
$result = Invoke-AzVMRunCommand -ResourceGroupName $resourceGroup -VMName $vmName `
-CommandId 'RunPowerShellScript' `
-ScriptString 'Get-Process | Sort-Object CPU -Descending | Select-Object -First 5 | Format-Table Name, CPU, Id'

Write-Output "Top CPU processes: $($result.Value[0].Message)"
}

'alert-vm-disk-full' {
# Clear temp files
$result = Invoke-AzVMRunCommand -ResourceGroupName $resourceGroup -VMName $vmName `
-CommandId 'RunPowerShellScript' `
-ScriptString 'Remove-Item -Path "$env:TEMP\*" -Recurse -Force -ErrorAction SilentlyContinue; Get-PSDrive C | Select-Object Free'

Write-Output "Disk cleanup result: $($result.Value[0].Message)"
}

'alert-vm-unresponsive' {
# Restart VM
Restart-AzVM -ResourceGroupName $resourceGroup -Name $vmName -Force
Write-Output "VM $vmName restarted successfully"
}
}

# Log remediation action
$logEntry = @{
TimeGenerated = (Get-Date).ToUniversalTime().ToString("o")
ResourceId = $ResourceId
AlertName = $AlertName
RemediationAction = $AlertName
Status = "Completed"
VMName = $vmName
}

Write-Output "Remediation completed: $($logEntry | ConvertTo-Json)"

ServiceNow Integration with Context

// Action Group with ServiceNow webhook
resource serviceNowActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
name: 'ag-servicenow-integration'
location: 'global'
properties: {
groupShortName: 'SNOWInt'
enabled: true
webhookReceivers: [
{
name: 'servicenow-incident'
serviceUri: 'https://${serviceNowInstance}.service-now.com/api/now/table/incident'
useCommonAlertSchema: true
useAadAuth: true
objectId: serviceNowAppObjectId
identifierUri: 'api://${serviceNowAppId}'
tenantId: subscription().tenantId
}
]
}
}

ServiceNow Payload Transformation (via Logic App):

{
"short_description": "Azure Alert: @{body('Parse_Alert')?['data']?['essentials']?['alertRule']}",
"description": "## Alert Details\n\n**Severity**: @{body('Parse_Alert')?['data']?['essentials']?['severity']}\n**Resource**: @{body('Parse_Alert')?['data']?['essentials']?['targetResource']}\n**Time**: @{body('Parse_Alert')?['data']?['essentials']?['firedDateTime']}\n\n## Remediation Steps\n\n@{body('Parse_Alert')?['data']?['alertContext']?['SearchResults']}",
"urgency": "@{if(equals(body('Parse_Alert')?['data']?['essentials']?['severity'], 'Sev0'), '1', if(equals(body('Parse_Alert')?['data']?['essentials']?['severity'], 'Sev1'), '2', '3'))}",
"impact": "@{if(equals(body('Parse_Alert')?['data']?['essentials']?['severity'], 'Sev0'), '1', if(equals(body('Parse_Alert')?['data']?['essentials']?['severity'], 'Sev1'), '2', '3'))}",
"category": "Azure",
"subcategory": "Monitoring",
"assignment_group": "@{body('Get_Resource_Tags')?['SupportTeam']}",
"cmdb_ci": "@{body('Parse_Alert')?['data']?['essentials']?['targetResourceName']}",
"u_azure_resource_id": "@{body('Parse_Alert')?['data']?['essentials']?['targetResourceId']}",
"u_runbook_url": "@{body('Parse_Alert')?['data']?['alertContext']?['customProperties']?['RunbookURL']}"
}

Auto-Remediation Decision Matrix

Alert TypeAuto-Remediate?ActionConditions
VM UnresponsiveYesRestart VMTag: AutoRemediate=true, Severity ≤ Sev2
High CPUPartialIdentify process, logAlways log, escalate if persists
Disk FullYesClear temp filesTag: AutoRemediate=true
Memory PressureNoCreate ticketAlways manual review
Network IssueNoCreate ticketRequires network team
Security AlertNeverEscalate immediatelyAlways human review

6.6 Circuit Breaker Pattern

Reference: Production pattern from enterprise implementations (Fortune 500 deployments)

The circuit breaker pattern uses Azure Function receivers in Action Groups to enable automated remediation with severity-based routing.

Circuit Breaker Architecture

Action Group with Azure Function Receiver

// Action Group with Circuit Breaker Function (from Data Mesh pattern)
resource circuitBreakerActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
name: 'ag-circuit-breaker'
location: 'global'
properties: {
groupShortName: 'CircuitBrk'
enabled: true
emailReceivers: [
{
name: 'ops-team'
emailAddress: 'ops@contoso.com'
useCommonAlertSchema: true
}
]
azureFunctionReceivers: [
{
name: 'circuit-breaker-function'
functionAppResourceId: circuitBreakerFunctionApp.id
functionName: 'ProcessAlert'
httpTriggerUrl: 'https://${circuitBreakerFunctionApp.name}.azurewebsites.net/api/ProcessAlert?code=${functionKey}'
useCommonAlertSchema: true
}
]
}
}

// Standard Action Group (email only)
resource standardActionGroup 'Microsoft.Insights/actionGroups@2023-01-01' = {
name: 'ag-standard'
location: 'global'
properties: {
groupShortName: 'Standard'
enabled: true
emailReceivers: [
{
name: 'ops-team'
emailAddress: 'ops@contoso.com'
useCommonAlertSchema: true
}
]
}
}

Severity-Based Routing

// Route Severity 0-1 to Circuit Breaker, others to Standard
var targetActionGroupId = alert.severity <= 1
? circuitBreakerActionGroup.id
: standardActionGroup.id

resource metricAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
name: alert.name
location: 'global'
properties: {
severity: alert.severity
enabled: true
scopes: [resourceId]
evaluationFrequency: 'PT5M'
windowSize: 'PT5M'
criteria: alert.criteria
actions: [
{
actionGroupId: targetActionGroupId
}
]
}
}

Circuit Breaker Azure Function

// Azure Function: ProcessAlert.cs
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class ProcessAlert
{
[FunctionName("ProcessAlert")]
public static async Task Run(
[HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
ILogger log)
{
var requestBody = await new StreamReader(req.Body).ReadToEndAsync();
var alert = JsonConvert.DeserializeObject<CommonAlertSchema>(requestBody);

log.LogInformation($"Processing alert: {alert.Data.Essentials.AlertRule}");

// Check if auto-remediation is appropriate
if (CanAutoRemediate(alert))
{
await ExecuteRemediation(alert, log);
}
else
{
await CreateServiceNowTicket(alert, log);
}
}

private static bool CanAutoRemediate(CommonAlertSchema alert)
{
// Only auto-remediate for specific alert types and severity
var allowedAlerts = new[] { "vm-unresponsive", "disk-full", "service-restart" };
return alert.Data.Essentials.Severity <= 2 &&
allowedAlerts.Any(a => alert.Data.Essentials.AlertRule.Contains(a));
}
}

7. Compliance & Governance Automation

7.1 Azure Policy for Monitoring Compliance

// Policy definition: Require diagnostic settings
resource policyDefinition 'Microsoft.Authorization/policyDefinitions@2021-06-01' = {
name: 'require-diagnostic-settings'
properties: {
displayName: 'Require diagnostic settings for all resources'
policyType: 'Custom'
mode: 'All'
metadata: {
category: 'Monitoring'
version: '1.0.0'
}
policyRule: {
if: {
allOf: [
{
field: 'type'
in: [
'Microsoft.Compute/virtualMachines'
'Microsoft.Web/sites'
'Microsoft.Storage/storageAccounts'
'Microsoft.Sql/servers/databases'
'Microsoft.KeyVault/vaults'
]
}
]
}
then: {
effect: 'DeployIfNotExists'
details: {
type: 'Microsoft.Insights/diagnosticSettings'
existenceCondition: {
allOf: [
{
field: 'Microsoft.Insights/diagnosticSettings/workspaceId'
exists: 'true'
}
]
}
roleDefinitionIds: [
'/providers/Microsoft.Authorization/roleDefinitions/749f88d5-cbae-40b8-bcfc-e573ddc772fa' // Monitoring Contributor
]
deployment: {
properties: {
mode: 'Incremental'
template: {
// Diagnostic settings template
}
}
}
}
}
}
}
}

7.2 Governance Dashboard Query

// Compliance dashboard: Resources without monitoring
let monitoredResources =
resources
| where type in~ (
'microsoft.compute/virtualmachines',
'microsoft.web/sites',
'microsoft.storage/storageaccounts'
)
| project ResourceId = tolower(id), Name = name, Type = type, ResourceGroup = resourceGroup;

let resourcesWithDiagnostics =
resources
| where type == "microsoft.insights/diagnosticsettings"
| extend ParentResource = tolower(tostring(split(id, '/providers/microsoft.insights')[0]))
| distinct ParentResource;

monitoredResources
| join kind=leftanti resourcesWithDiagnostics on $left.ResourceId == $right.ParentResource
| summarize NonCompliantCount = count() by Type
| order by NonCompliantCount desc

8. Workshop Quick Reference Card

8.1 Key Azure Monitor URLs

ResourceURL
Azure Monitor Overviewportal.azure.com/#view/Microsoft_Azure_Monitoring
Log Analytics Workspacesportal.azure.com/#blade/HubsExtension/BrowseResource/resourceType/Microsoft.OperationalInsights%2Fworkspaces
Alert Rulesportal.azure.com/#view/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/~/alertsV2
Workbooks Galleryportal.azure.com/#blade/Microsoft_Azure_Monitoring/AzureMonitoringBrowseBlade/workbooks

8.2 Essential CLI Commands

# === Log Analytics Workspace ===
az monitor log-analytics workspace create --name <name> --resource-group <rg> --location <loc>
az monitor log-analytics workspace show --name <name> --resource-group <rg>
az monitor log-analytics workspace table list --workspace-name <name> --resource-group <rg>

# === Data Collection Rules ===
az monitor data-collection rule create --name <name> --resource-group <rg> --location <loc> --data-flows "[...]"
az monitor data-collection rule association create --name <name> --resource <vm-id> --rule-id <dcr-id>

# === Alerts ===
az monitor scheduled-query create --name <name> --resource-group <rg> --scopes <law-id> --condition "..."
az monitor action-group create --name <name> --resource-group <rg> --short-name <short> --action email admin admin@example.com

# === Diagnostic Settings ===
az monitor diagnostic-settings create --name <name> --resource <resource-id> --workspace <law-id> --logs "[...]"

8.3 Common KQL Patterns

// Pattern 1: Time-based aggregation
<Table>
| where TimeGenerated > ago(1h)
| summarize Count = count() by bin(TimeGenerated, 5m)

// Pattern 2: Top N analysis
<Table>
| summarize Count = count() by <GroupColumn>
| top 10 by Count desc

// Pattern 3: Join across tables
Table1
| join kind=inner Table2 on $left.Key == $right.Key

// Pattern 4: Anomaly detection
| make-series Value = avg(<Metric>) on TimeGenerated step 1h
| extend (anomalies, score) = series_decompose_anomalies(Value)

// Pattern 5: Resource graph correlation
| join kind=inner (arg("").resources | where type == "...") on $left.ResourceId == $right.id

8.4 Architecture Decision Matrix

RequirementSingle LAWMultiple LAWDedicated Cluster
< 100 GB/day
100-500 GB/day⚠️⚠️
> 500 GB/day⚠️
Multi-tenant isolation
CMK encryption
Cross-subscription⚠️

Legend: ✅ Recommended, ⚠️ Consider, ❌ Not Recommended

8.5 Naming Conventions

# Log Analytics Workspace
law-{workload}-{environment}-{region}-{instance}
Example: law-ums-prod-weu-001

# Data Collection Rule
dcr-{purpose}-{scope}-{environment}
Example: dcr-vm-perf-lz-prod

# Data Collection Endpoint
dce-{region}-{environment}
Example: dce-westeurope-prod

# Action Group
ag-{purpose}-{severity}
Example: ag-platform-sev1

# Alert Rule
alert-{resource}-{metric}-{severity}
Example: alert-vm-cpu-sev2

8.6 Cost Quick Reference

⚠️ Pricing Note: Prices vary by region and change over time. Always verify at Azure Monitor Pricing.

SKUApprox. PriceUse Case
Pay-As-You-Go~$2.30/GB< 100 GB/day
100 GB Commitment~$1.80/GB100-200 GB/day (up to 30% savings)
500 GB Commitment~$1.50/GB500+ GB/day
Basic Logs~$0.50/GBHigh-volume, low-query (30 days retention)
Auxiliary Logs~$0.15/GBCompliance/security logs ⚠️ Preview
Archive~$0.02/GBLong-term retention (up to 12 years)

DocumentPurposeLink
Architecture OverviewSolution design01-architecture-overview.md
Operations RunbookDay-2 operations02-operations-runbook.md
Platform Observability ScenariosCommon monitoring scenarios04-platform-observability-scenarios.md

Workshop Next Steps

After completing this workshop:

  1. Immediate (Week 1)

    • Deploy baseline DCR and LAW
    • Configure AMBA policy initiative
    • Set up action groups
  2. Short-term (Month 1)

    • Onboard pilot landing zones
    • Customize workbooks
    • Train landing zone teams
  3. Medium-term (Quarter 1)

    • Roll out to all landing zones
    • Implement cost optimization
    • Establish operational cadence
  4. Long-term (Year 1)

    • Implement audit logs framework
    • Configure DR strategy
    • Evaluate AIOps capabilities

Document Control
Maintained by: Central Platform Team
Review Cycle: Quarterly
Classification: Internal
Version: 1.0 - Workshop Edition

📖Learn