
Unified Monitoring Solution - Architecture Overview

Document Type: Solution Architecture
Version: 1.0
Last Updated: January 2026

Feature Availability Legend:
✅ GA (Generally Available) | ⚠️ Preview (may change) | 📍 Planned/Roadmap

📚 Quick Navigation: README · Architecture · Operations Runbook · Advanced Topics

📖 How to Use This Document

This Architecture document provides the "What & Why" (theory, design decisions, patterns).
The Operations Runbook provides the "How" (practical implementation, KQL queries, troubleshooting).

Theory ↔ Practice Cross-Reference

| Architecture Topic | Operations Counterpart | Description |
|---|---|---|
| 3. Federated Model | 1. Operations Overview | Governance model → Day-2 responsibilities |
| 4. Core Components (DCR/LAW) | 5. Troubleshooting Guide | Component design → Troubleshooting steps |
| 5. Landing Zone Alerting | 2. Alert Response | Alert design → Alert response actions |
| 6. AMBA | 4. Maintenance Windows | Baseline alerts → Suppression & maintenance |
| 8. Security & Access Control | 6. RBAC Operations | RBAC design → Access management tasks |
| — | 3. KQL Query Library | Ready-to-use queries |
| — | 7. Cost Monitoring | Cost operations (deep-dive in Advanced Topics) |

1. Executive Summary

The Unified Monitoring Solution (UMS) provides a federated, enterprise-scale observability framework for Azure Landing Zones. It delivers a centralized governance layer with decentralized execution, enabling platform teams to enforce baseline monitoring standards while empowering application teams with the flexibility to customize their observability stack.

1.1 Key Design Principles

| Principle | Description |
|---|---|
| Federated Model | Central baseline + decentralized customization |
| Policy-Driven | Azure Policy enforces monitoring standards |
| Actionable Alerts | Every notification has clear remediation steps |
| Landing Zone Aware | Alerts scoped to specific landing zones, not global broadcast |
| IaC-First | All monitoring configuration deployed via Bicep/Terraform |
| Cost-Optimized | Tiered retention, data filtering, smart ingestion |

2. High-Level Architecture


3. Federated Monitoring Model

The federated approach addresses a common enterprise challenge: "All Microsoft service health alerts forwarded to ALL landing zone owners" causing excessive, irrelevant notifications.

3.1 Model Overview

3.2 Responsibilities Matrix

| Responsibility | Central Platform Team | Landing Zone Team |
|---|---|---|
| Baseline DCRs | ✅ Define & Deploy | ❌ Read-only |
| AMBA Alerts | ✅ Deploy via Policy | 🔄 Can customize thresholds |
| Custom DCRs | — | ✅ Create & Manage |
| Custom Alerts | — | ✅ Create & Manage |
| Custom Metrics | — | ✅ Create & Manage |
| Custom Logs | — | ✅ Create & Manage |
| Custom Diagnostic Settings | — | ✅ Create & Manage |
| Central LAW Access | ✅ Full Access | 🔄 Resource-context access |
| LZ-Specific LAW | — | ✅ Full Access |
| Action Groups | ✅ Define defaults | ✅ Create LZ-specific |
| Azure Policy | ✅ Define & Assign | ❌ Read-only |
| Workbooks | ✅ Publish templates | ✅ Create custom |

3.3 Baseline vs. Custom Components

| Component Type | Baseline (Mandatory) | Custom (Optional) |
|---|---|---|
| Data Collection | OS performance counters, Security events, Activity logs | Application logs, Custom metrics, Business events |
| Alerts | Service Health, Resource Health, AMBA metrics | Application-specific, Business SLA alerts |
| Retention | 90 days operational, 2 years compliance | As needed per application |
| Notifications | Platform team email, ServiceNow | Team-specific channels (Slack, Teams, PagerDuty) |

3.4 Federated Visibility Architecture

The 60/40 split represents two distinct data flow patterns that work together to provide platform-wide visibility while respecting landing zone autonomy.

Scenario A: Centralized LAW (60% Baseline + Critical Workloads)

Platform services and critical landing zone workloads send telemetry to the Central Platform LAW for unified visibility.

What Flows to Central LAW:

| Source | Data Type | Purpose |
|---|---|---|
| Platform Services | All diagnostic logs | Full platform health visibility |
| LZ Baseline | Activity Logs | Audit trail across all LZs |
| LZ Baseline | Security Events | Compliance & threat detection |
| LZ Baseline | VM Performance | Capacity planning & health |
| LZ Critical | Application errors | Critical workload monitoring |

Scenario B: LZ-Dedicated LAW with Shared Dashboards (40% Custom)

Non-critical workloads stay in LZ-Dedicated LAWs, but the platform team gets visibility through shared Workbooks and Dashboards.

How Sharing Works:

| Method | Configuration | Platform Team Access |
|---|---|---|
| Workbook Sharing | LZ team publishes to shared gallery | Read-only view |
| Dashboard Sharing | Pin to shared dashboard | Read-only view |
| RBAC | Monitoring Reader role on LZ LAW | Query access |
| Cross-Workspace Queries | KQL workspace() function | Federated queries |
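
A minimal sketch of a federated query using the KQL `workspace()` function. The workspace names `law-central-prod` and `law-lz-a-prod` are placeholders; the platform team needs at least Reader access on the LZ workspace for this to resolve:

```kql
// Compare agent heartbeat coverage across the central and an LZ-dedicated workspace.
union
    (workspace("law-central-prod").Heartbeat | extend SourceLAW = "central"),
    (workspace("law-lz-a-prod").Heartbeat | extend SourceLAW = "lz-a")
| where TimeGenerated > ago(1h)
| summarize ReportingVMs = dcount(Computer) by SourceLAW
```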

Combined Architecture: 60/40 Split in Practice

Implementation Decision Matrix

| Workload Type | Data Destination | Rationale | Platform Visibility |
|---|---|---|---|
| Platform Services | Central LAW | Core infrastructure | Direct access |
| Production Critical | Central LAW | SLA-driven, compliance | Direct access |
| Production Standard | Both (dual-write) | Flexibility | Direct + Shared |
| Development | LZ-Dedicated LAW | Cost optimization | Shared dashboards |
| Test/Sandbox | LZ-Dedicated LAW | Isolation | Optional sharing |

3.5 Baseline Telemetry Contract Template

Blueprint Requirement: "End-to-end process of defining, negotiating, and formalizing Baseline Telemetry Contracts between the Monitoring Team and platform/service owners."

Each Landing Zone / Platform must complete a Telemetry Contract before onboarding. This ensures alignment on what data is collected, how it's processed, and who receives alerts.

Contract Template

| Contract Element | Description | Example Value | Owner |
|---|---|---|---|
| Platform/Service Name | Unique identifier for the platform | AI/ML Platform | LZ Team |
| Contract Version | Version control for changes | v1.2 | Both |
| Effective Date | When contract becomes active | 2026-02-01 | Both |

1. Telemetry Scope

| Data Type | Included | Specific Items | Justification |
|---|---|---|---|
| OS Performance Metrics | ✅ Yes | CPU, Memory, Disk, Network | Baseline requirement |
| Security Events | ✅ Yes | EventID 4624, 4625, 4648, 4672 | Compliance |
| Application Logs | ✅ Yes | /var/log/app/*.log | Troubleshooting |
| Custom Metrics | ❌ No | - | Not required |
| Traces | ❌ No | - | App Insights OOS |

2. Integration Method

| Component | Configuration | Details |
|---|---|---|
| Agent Type | Azure Monitor Agent (AMA) | Mandatory |
| DCR Name | dcr-aiml-platform-prod | Platform-specific |
| DCE Endpoint | dce-central-westeu-001 | Regional endpoint |
| Diagnostic Settings | Enabled for all resources | Via Azure Policy |

3. Retention Requirements

| Data Category | Hot Retention | Archive Retention | Compliance Driver |
|---|---|---|---|
| Performance Metrics | 30 days | 90 days | Operational |
| Security Events | 90 days | 2 years | SOX/GDPR |
| Application Logs | 14 days | 30 days | Cost optimization |
| Activity Logs | 90 days | 7 years | Audit |

4. Alerting Rules

| Alert Category | Alert Name | Threshold | Severity | Action |
|---|---|---|---|---|
| AMBA Baseline | VM Availability | < 99% | Sev 1 | ServiceNow P1 |
| AMBA Baseline | CPU Critical | > 95% for 5min | Sev 2 | Email + Ticket |
| Custom | Model Training Failed | Error count > 0 | Sev 2 | Teams + Email |
| Custom | GPU Utilization Low | < 10% for 1hr | Sev 4 | Dashboard only |

5. Visualization Requirements

| Dashboard Type | Purpose | Access Level | Refresh Rate |
|---|---|---|---|
| Platform Health | Overview for leadership | Read-only | 5 min |
| Operations | Detailed troubleshooting | Team access | 1 min |
| Cost Analysis | Resource consumption | Finance team | Daily |

6. Notification Preferences

| Severity | Channel | Recipients | Escalation |
|---|---|---|---|
| Sev 0 (Critical) | Phone + ServiceNow P1 | On-call + Team Lead | 15 min to Service Owner |
| Sev 1 (High) | ServiceNow P2 + Teams | On-call team | 30 min to Product Owner |
| Sev 2 (Medium) | Email + Teams | Team distribution list | Next business day |
| Sev 3-4 (Low) | Dashboard only | Self-service | None |

7. Governance & Review

| Aspect | Frequency | Owner | Deliverable |
|---|---|---|---|
| Contract Review | Quarterly | Both teams | Updated contract |
| Alert Tuning | Monthly | LZ Team | Noise reduction report |
| Cost Review | Monthly | Platform Team | Cost optimization recommendations |
| Compliance Audit | Annual | Security Team | Compliance attestation |

Contract Approval

| Role | Name | Signature | Date |
|---|---|---|---|
| Platform Owner | _____________ | _____________ | _____________ |
| Monitoring Lead | _____________ | _____________ | _____________ |
| Security Approver | _____________ | _____________ | _____________ |

4. Core Components Deep Dive

4.1 Data Collection Rules (DCR)

DCRs are the heart of the federated model for VM and container workloads, enabling granular control over what data is collected and where it's sent.

⚠️ Important Distinction: DCRs apply primarily to VMs, VMSS, and AKS via Azure Monitor Agent. For PaaS services, you still use Diagnostic Settings to collect resource logs. DCR-based metrics export for PaaS is in preview with limited service support.

DCR vs Diagnostic Settings Comparison

| Aspect | Data Collection Rules (DCR) | Diagnostic Settings |
|---|---|---|
| Primary Use | VMs, VMSS, AKS, Arc servers | PaaS resource logs |
| Agent Required | Yes - Azure Monitor Agent | No - native to resource |
| Transformations | ✅ KQL transformations supported | ❌ No transformations |
| Multi-destination | ✅ Multiple LAWs from single DCR | ⚠️ One setting per destination |
| PaaS Logs | ❌ Not supported | ✅ Primary method |
| PaaS Metrics | ⚠️ Preview - limited services | ✅ Supported |
| Custom Logs | ✅ Via Logs Ingestion API | ❌ Not supported |

DCR-Supported Resources (as of January 2026)

| Resource Type | DCR Support | Notes |
|---|---|---|
| Virtual Machines | ✅ GA | Via Azure Monitor Agent |
| VM Scale Sets | ✅ GA | Via Azure Monitor Agent |
| AKS Clusters | ✅ GA | Container Insights |
| Arc-enabled Servers | ✅ GA | Via Azure Monitor Agent |
| Metrics Export | ⚠️ Preview | |
| Storage Accounts | ⚠️ Preview | Limited regions |
| Key Vault | ⚠️ Preview | Limited regions |
| Redis Cache | ⚠️ Preview | Limited regions |
| SQL Server/DB | ⚠️ Preview | Limited regions |
| IoT Hub | ⚠️ Preview | Limited regions |
| Logs Ingestion API | ✅ GA | Custom apps via REST |

Note: For App Service, Azure Functions, Cosmos DB, Event Grid, Service Bus, API Management, and most other PaaS services, continue using Diagnostic Settings to send logs to Log Analytics.

DCR Best Practices (from Microsoft Documentation)

| Best Practice | Explanation |
|---|---|
| Separate DCRs by data source type | Don't mix performance counters and events in one DCR |
| Separate DCRs by destination | Compliance requirements may need specific destinations |
| Define observability scopes | Group by application, environment, or platform |
| Keep DCRs lean | Only collect what's needed for each scope |
| Limit DCR associations | Each resource can associate with multiple DCRs; avoid excessive associations |
| Use Diagnostic Settings for PaaS | DCRs don't support PaaS resource logs - use Diagnostic Settings |
| Consider Workspace Transformation DCR | Apply transformations to Diagnostic Settings data at workspace level |
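
As a sketch of what a workspace transformation DCR's `transformKql` can look like, the query below filters and trims rows at ingestion time. `source` is the incoming stream for the target table; the column names (`Level`, `Computer`, `Message`) are illustrative and must match the actual table schema:

```kql
// transformKql body: runs against every row before it is stored (and billed).
source
| where Level != "Verbose"                        // drop noisy rows at ingestion time
| project TimeGenerated, Computer, Level, Message // keep only the required columns
```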

4.2 Log Analytics Workspace Design

Workspace Design Decision Matrix

| Scenario | Recommended Strategy | Rationale |
|---|---|---|
| < 100 GB/day ingestion | Single workspace + Resource-context RBAC | Simpler management, cost-effective |
| ≥ 100 GB/day ingestion | Dedicated cluster + Single workspace | Commitment tier savings |
| Multi-tenant/compliance | Separate workspaces per tenant | Data sovereignty, isolation |
| Security + Operations | Separate Security (Sentinel) + Operations LAW | Different retention, access patterns |
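
To place a workspace in this matrix, measure current ingestion first. A sketch using the standard `Usage` table (where `Quantity` is reported in MB):

```kql
// Average daily billable ingestion (GB/day) over the last 30 days.
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024. by bin(TimeGenerated, 1d)
| summarize AvgDailyGB = avg(IngestedGB)
```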

4.3 Azure Monitor Agent (AMA)

The Azure Monitor Agent replaces legacy agents (MMA, OMS) and provides:

| Feature | Benefit |
|---|---|
| Centralized configuration | DCRs instead of workspace configuration |
| Multi-homing | Send data to multiple workspaces |
| Granular collection | Only collect what's defined in associated DCRs |
| Transformations | Filter and transform data before ingestion |
| Reduced cost | No agent cost, pay only for data ingestion |

5. Landing Zone Scoped Alerting

5.1 Problem Statement (Current State)

"A single logic app is responsible for forwarding all alerts via webhooks. If this app is deleted, the entire alerting system fails."

5.2 Target State Architecture

5.3 Alert Processing Rules Design

| Rule Type | Scope | Filter Criteria | Action |
|---|---|---|---|
| LZ Routing | Subscription | resourceGroup contains 'lz-name' | Route to LZ-specific action group |
| Severity Filtering | Resource Group | severity = 'Sev0' OR severity = 'Sev1' | Route to on-call team |
| Maintenance Window | Subscription | Schedule: Sundays 02:00-06:00 | Suppress notifications |
| Business Hours | Subscription | Schedule: Mon-Fri 09:00-17:00 | Route to primary team |

5.4 Application-Level Alert Containment

Meeting Requirement: "When one alert gets fired, containing it per application area - not firing alerts everywhere"

This pattern ensures alerts are contained within application boundaries and don't cascade across the entire Landing Zone or organization.

Containment Architecture

Tag-Based Application Scoping

Use Azure resource tags to define application boundaries:

| Tag | Purpose | Example Values |
|---|---|---|
| Application | Application identifier | OrderProcessing, CustomerPortal |
| ApplicationOwner | Team/email for notifications | orders-team@contoso.com |
| ApplicationTier | Component tier | Web, API, Data, Integration |
| DependsOn | Parent application dependency | CoreAPI, SharedDB |

Alert Processing Rule for Application Containment

```bicep
// Route alerts only to the application-specific team
resource appContainmentRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'apr-app-${applicationName}-containment'
  location: 'global'
  properties: {
    scopes: [subscription().id]
    conditions: [
      {
        field: 'TargetResourceTags'
        operator: 'Contains'
        values: ['Application:${applicationName}']
      }
    ]
    actions: [
      {
        actionType: 'AddActionGroups'
        actionGroupIds: [applicationActionGroup.id]
      }
    ]
    description: 'Route all ${applicationName} alerts to the application team only'
  }
}
```

Dependency Alert Correlation (Balanced Approach)

Design Choice: Instead of fully suppressing dependent alerts (RemoveAllActionGroups), we use a correlation pattern that:

  1. Still records the alert (visible in portal/logs)
  2. Routes to a reduced notification channel (e.g., dashboard-only or low-priority queue)
  3. Includes parent alert context for correlation

```bicep
// Option 1: Route dependent alerts to low-priority channel (RECOMMENDED)
// Alerts are NOT suppressed - just routed differently when parent is down
resource dependencyCorrelationRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'apr-correlate-dependent-${applicationName}'
  location: 'global'
  properties: {
    scopes: [subscription().id]
    conditions: [
      {
        field: 'TargetResourceTags'
        operator: 'Contains'
        values: ['DependsOn:${parentApplicationName}']
      }
    ]
    actions: [
      {
        actionType: 'AddActionGroups'
        actionGroupIds: [lowPriorityActionGroup.id] // Dashboard/logging only, no phone/SMS
      }
    ]
    description: 'Route dependent app alerts to low-priority channel during parent outage'
    enabled: false // Enabled dynamically when parent alert fires
  }
}
```

```bicep
// Option 2: Full suppression (USE WITH CAUTION - only for known cascading failures)
resource dependencySuppressionRule 'Microsoft.AlertsManagement/actionRules@2023-05-01-preview' = {
  name: 'apr-suppress-dependent-${applicationName}'
  location: 'global'
  properties: {
    scopes: [subscription().id]
    conditions: [
      {
        field: 'TargetResourceTags'
        operator: 'Contains'
        values: ['DependsOn:${parentApplicationName}']
      }
      {
        field: 'Severity'
        operator: 'Equals'
        values: ['Sev3', 'Sev4'] // Only suppress low-severity, keep Sev0-2 visible
      }
    ]
    actions: [
      {
        actionType: 'RemoveAllActionGroups'
      }
    ]
    description: 'Suppress LOW severity dependent alerts only during parent outage'
    enabled: false
  }
}
```

| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Correlation (Option 1) | Alerts visible, just quieter | Requires low-priority Action Group | Most scenarios |
| Suppression (Option 2) | Reduces noise significantly | Alerts hidden completely | Known cascade patterns only |
| Severity Filter | Critical alerts still reach team | More complex rule | Production environments |

Parent-Child Dependency Suppression

When a parent application component fails, suppress alerts from dependent child components to avoid alert storms:

Logic App for Dynamic Suppression:

How it works: This Logic App is triggered when a parent application alert fires. It automatically enables the suppression rule for dependent apps, waits for recovery, then re-enables their alerts.

┌─────────────────────────────────────────────────────────────────────────────┐
│ LOGIC APP WORKFLOW: Dynamic Dependency Suppression │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ 1️⃣ TRIGGER: Parent Alert Fires │
│ └─> Action Group calls this Logic App when Core API goes DOWN │
│ │
│ 2️⃣ STEP 1: Enable Suppression Rule │
│ └─> PATCH API call sets "enabled: true" on the suppression rule │
│ └─> Dependent apps (Order, Portal, Reporting) stop sending alerts │
│ │
│ 3️⃣ STEP 2: Wait for Recovery │
│ └─> Pause for 15 minutes (configurable) │
│ └─> Gives time for parent to recover before re-enabling alerts │
│ │
│ 4️⃣ STEP 3: Re-Enable Child Alerts │
│ └─> PATCH API call sets "enabled: false" on the suppression rule │
│ └─> Dependent apps resume normal alerting │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

```json
{
  "actions": {
    "When_Parent_Alert_Fires": {
      // STEP 1: Enable the suppression rule for dependent apps
      // This stops notifications from Order, Portal, Reporting services
      "type": "ApiConnection",
      "inputs": {
        "method": "PATCH",
        "uri": "https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{rg}/providers/Microsoft.AlertsManagement/actionRules/apr-suppress-dependent-{parentApp}?api-version=2023-05-01-preview",
        "body": {
          "properties": {
            "enabled": true
          }
        }
      }
    },
    "Wait_For_Parent_Recovery": {
      // STEP 2: Wait before re-enabling alerts
      // Adjust the interval based on typical recovery time
      "type": "Wait",
      "inputs": {
        "interval": {
          "count": 15,
          "unit": "Minute"
        }
      }
    },
    "Re_Enable_Child_Alerts": {
      // STEP 3: Disable the suppression rule (re-enable alerts)
      // Dependent apps resume normal alerting behavior
      "type": "ApiConnection",
      "inputs": {
        "method": "PATCH",
        "uri": "https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{rg}/providers/Microsoft.AlertsManagement/actionRules/apr-suppress-dependent-{parentApp}?api-version=2023-05-01-preview",
        "body": {
          "properties": {
            "enabled": false
          }
        }
      }
    }
  }
}
```

💡 Tip: For production, add a "Check_Parent_Health" step before re-enabling alerts to verify the parent is actually recovered, not just timed out.


Application Boundary KQL Query

Identify all resources within an application boundary:

```kql
// List all resources for a specific application (Azure Resource Graph)
Resources
| where tags.Application == "OrderProcessing"
| project
    ResourceName = name,
    ResourceType = type,
    Tier = tostring(tags.ApplicationTier),
    DependsOn = tostring(tags.DependsOn),
    Owner = tostring(tags.ApplicationOwner)
| order by Tier asc

// Find all applications that depend on a specific parent
Resources
| where tags.DependsOn == "CoreAPI"
| summarize
    DependentApps = make_set(tags.Application),
    ResourceCount = count()
```

Alert Containment Patterns

| Pattern | Description | Use Case |
|---|---|---|
| Application Isolation | Each app has its own action group | Default for all applications |
| Tier-Based Grouping | Web/API/Data tiers grouped separately | Large applications with specialized teams |
| Dependency Suppression | Parent failure suppresses child alerts | Microservices, API-dependent apps |
| Blast Radius Limiting | Cap max alerts per app per hour | Prevent alert storms during major outages |

Blast Radius Limiting

Prevent excessive alerts during major outages:

```bicep
// Alert with rate limiting
resource rateLimitedAlert 'Microsoft.Insights/scheduledQueryRules@2023-03-15-preview' = {
  name: 'alert-${appName}-ratelimited'
  location: location
  properties: {
    displayName: '${appName} - Rate Limited Alert'
    severity: 2
    enabled: true
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    scopes: [lawId]
    criteria: {
      allOf: [
        {
          // Bicep multi-line strings do not support ${} interpolation,
          // so the app name is injected with format().
          // Note: AlertsManagementResources is an Azure Resource Graph table; querying
          // it from Log Analytics requires the cross-service arg("") connector
          // (verify current table support before relying on this pattern).
          query: format('''
// Only alert if we haven't alerted in the last hour
let recentAlerts = toscalar(
    arg("").alertsmanagementresources
    | where properties.essentials.targetResource contains "{0}"
    | where properties.essentials.startDateTime > ago(1h)
    | summarize count());
Perf
| where ObjectName == "Processor"
| where CounterValue > 95
| where Computer has "{0}"
| where recentAlerts < 5 // Max 5 alerts per app per hour
''', appName)
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
        }
      ]
    }
    muteActionsDuration: 'PT1H' // Mute for 1 hour after firing
    actions: {
      actionGroups: [appActionGroupId]
    }
  }
}
```

5.5 Risk-Based Alerting

Risk-based alerting prioritizes alerts based on business impact and criticality rather than purely technical thresholds. This approach integrates with rule-based alerting to ensure critical business systems receive appropriate attention.

Business Criticality Classification

| Criticality Level | Definition | Example Systems | Response Target |
|---|---|---|---|
| Tier 0 - Critical | Core business systems, revenue-impacting | Payment processing, core APIs | < 5 min response |
| Tier 1 - High | Customer-facing services | Web portals, mobile backends | < 15 min response |
| Tier 2 - Medium | Internal business operations | Reporting, analytics | < 1 hour response |
| Tier 3 - Low | Development, testing environments | Dev/QA systems | Next business day |

Risk-Based Alert Severity Mapping

Logic: Criticality Tier + Threshold Breach = Resulting Severity

| Tier | Criticality | + Threshold Breach | = Severity | Action |
|---|---|---|---|---|
| Tier 0 | Critical | CPU > 90% | Sev 0 | Phone + P1 |
| Tier 1 | High | CPU > 90% | Sev 1 | Page + P2 |
| Tier 2 | Medium | CPU > 90% | Sev 2 | Email |
| Tier 3 | Low | CPU > 90% | Sev 3 | Dashboard |

Example: Same CPU > 90% threshold, different outcomes based on resource criticality.

Criticality-Based Threshold Adjustment

| Metric | Tier 0 Threshold | Tier 1 Threshold | Tier 2 Threshold | Tier 3 Threshold |
|---|---|---|---|---|
| CPU Utilization | > 80% for 2 min | > 85% for 5 min | > 90% for 10 min | > 95% for 15 min |
| Memory Usage | > 75% for 2 min | > 80% for 5 min | > 85% for 10 min | > 90% for 15 min |
| Error Rate | > 0.1% | > 0.5% | > 1% | > 5% |
| Response Time (P95) | > 500 ms | > 1 sec | > 3 sec | > 10 sec |
| Availability | < 99.99% | < 99.9% | < 99.5% | < 99% |
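
To check a VM fleet against these availability thresholds, a heartbeat-based KQL sketch (assumes the standard `Heartbeat` table with roughly one heartbeat per minute; the tier labels are illustrative):

```kql
// Approximate VM availability over the last 24h from agent heartbeats.
let windowMinutes = 24 * 60;
Heartbeat
| where TimeGenerated > ago(24h)
| summarize MinutesSeen = dcount(bin(TimeGenerated, 1m)) by Computer
| extend AvailabilityPct = round(100.0 * MinutesSeen / windowMinutes, 3)
| extend BreachedThreshold = case(
    AvailabilityPct < 99.0,  "Tier 3 (and all higher tiers)",
    AvailabilityPct < 99.5,  "Tier 2 (and higher)",
    AvailabilityPct < 99.9,  "Tier 1 (and Tier 0)",
    AvailabilityPct < 99.99, "Tier 0 only",
    "Within all targets")
```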

Risk Score Calculation

Combine technical severity with business criticality for a composite risk score:

Risk Score = Technical Severity × Criticality Weight × Impact Factor

Where:
- Technical Severity: 1 (Low) to 4 (Critical) based on threshold breach
- Criticality Weight: Tier 0=4, Tier 1=3, Tier 2=2, Tier 3=1
- Impact Factor: 1.0 (isolated) to 2.0 (widespread/cascading)

| Risk Score | Classification | Action |
|---|---|---|
| 24-32 | Critical | Immediate escalation, war room |
| 16-23 | High | Page on-call, 15 min response |
| 8-15 | Medium | Email notification, ticket creation |
| 1-7 | Low | Dashboard update, batch reporting |
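
The composite score can be worked through in KQL with a small illustrative `datatable`; the alert names and input values below are hypothetical:

```kql
// Worked example: Risk Score = Technical Severity x Criticality Weight x Impact Factor
datatable(AlertName:string, TechSeverity:int, CritWeight:int, ImpactFactor:real)
[
    "Payment API down",   4, 4, 2.0, // Tier 0, cascading  -> 32, Critical
    "Reporting job slow", 2, 2, 1.0, // Tier 2, isolated   ->  4, Low
    "Dev VM high CPU",    1, 1, 1.0  // Tier 3, isolated   ->  1, Low
]
| extend RiskScore = TechSeverity * CritWeight * ImpactFactor
| extend Classification = case(
    RiskScore >= 24, "Critical",
    RiskScore >= 16, "High",
    RiskScore >= 8,  "Medium",
    "Low")
```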

Implementing Risk-Based Alerting with Tags

Use Azure resource tags to drive risk-based alerting:

```bicep
// Tag-based alert processing rule for critical systems
resource criticalAlertRule 'Microsoft.AlertsManagement/actionRules@2021-08-08' = {
  name: 'apr-critical-systems'
  location: 'global'
  properties: {
    scopes: [subscription().id]
    conditions: [
      {
        field: 'TargetResourceTags'
        operator: 'Contains'
        values: ['BusinessCriticality:Tier0']
      }
    ]
    actions: [
      {
        actionType: 'AddActionGroups'
        actionGroupIds: [criticalActionGroup.id]
      }
    ]
  }
}
```

Integration: Rule-Based + Risk-Based

| Alert Type | Rule-Based Component | Risk-Based Modifier |
|---|---|---|
| VM CPU Alert | Threshold: > 90% for 5 min | Tier 0 → Sev 0, Tier 3 → Sev 3 |
| API Latency | P95 > 1 second | Payment API → Sev 0, Dev API → Sev 4 |
| Error Rate | > 1% error rate | Customer-facing → escalate immediately |
| Disk Space | < 10% free space | Database servers → Sev 1, test VMs → Sev 4 |

6. Azure Monitor Baseline Alerts (AMBA)

AMBA provides a policy-driven approach to deploying baseline alerts across Azure Landing Zones.

6.1 AMBA Deployment Architecture

6.2 AMBA Coverage

| Resource Type | Alert Categories |
|---|---|
| ExpressRoute | Circuit, Gateway, Connection metrics |
| Azure Firewall | Throughput, Health, SNAT port utilization |
| Virtual Network | Gateway, NSG flow, DDoS protection |
| Virtual WAN | Hub, VPN Gateway, ExpressRoute Gateway |
| Log Analytics Workspace | Ingestion rate, Query throttling |
| Key Vault | Availability, Latency, Saturation |
| Virtual Machine | CPU, Memory, Disk, Network |
| Storage Account | Availability, Latency, Capacity |

6.3 AMBA Deployment Options

| Option | Description | Pros | Cons |
|---|---|---|---|
| Azure Portal | Guided deployment via ALZ Accelerator | Easy, visual | Less repeatable |
| Bicep | IaC deployment with parameter files | Repeatable, version-controlled | Requires IaC knowledge |
| Terraform | IaC deployment with Terraform modules | Repeatable, state management | Terraform expertise needed |
| Azure DevOps/GitHub Actions | CI/CD pipeline deployment | Automated, consistent | Pipeline setup required |

6.4 AMBA Implementation Strategies

Common Question: "How is this done in other organizations - Centralized Rollout vs Golden Templates?"

Strategy Comparison

| Strategy | Description | Pros | Cons |
|---|---|---|---|
| A: Centralized Rollout | Platform team deploys baseline alerts for all LZs where resource type exists | Consistent coverage, central control, faster time-to-value | May create initial noise, less flexibility per team |
| B: Golden Templates | Recommend LZ teams auto-deploy baseline alerts when deploying resources via Golden Templates | Team ownership, contextual deployment, self-service | Adoption risk, potentially inconsistent coverage |
| C: Hybrid (Recommended) | Central mandatory baseline + optional enhanced via templates | Balance of governance and flexibility, progressive adoption | More complex to manage initially |

Industry Patterns

| Company Profile | Recommended Strategy | Rationale |
|---|---|---|
| Highly Regulated (Finance, Healthcare) | Centralized Rollout | Compliance requires consistent coverage, audit requirements |
| Tech-Savvy DevOps | Golden Templates | Teams capable of self-service, prefer autonomy |
| Mixed Maturity | Hybrid | Some teams need guidance, others want flexibility |
| Startup/Fast-Moving | Golden Templates | Speed over consistency, iterate quickly |

Decision Framework

Hybrid Implementation Pattern

Phase 1: Mandatory Baseline (Centralized)

  • Deploy AMBA at Management Group level
  • Cover: Service Health, Resource Health, Platform resources
  • Central team manages and tunes

Phase 2: Enhanced Alerts (Golden Templates)

  • Application-specific alerts in deployment templates
  • Teams can opt-in to enhanced monitoring
  • Self-service customization within guardrails

```bicep
// Golden Template pattern - include monitoring in resource deployment
module vmWithMonitoring './vm-monitored.bicep' = {
  name: 'vm-${vmName}'
  params: {
    vmName: vmName
    // Standard VM parameters...

    // Monitoring integration
    enableBaselineAlerts: true // Opt-in to enhanced alerts
    alertActionGroupId: actionGroupId
    customAlertThresholds: {
      cpuThreshold: 85 // Team-specific override
      memoryThreshold: 80
    }
  }
}
```

7. Observability Control Panel

7.1 Dashboard Architecture

Visualization Tool Selection Guide:

| Tool | Best For | Audience | Access Requirement |
|---|---|---|---|
| Azure Workbooks | Deep investigation, KQL-based analysis | Platform Team, SREs | Azure Portal access |
| Azure Dashboards | Quick operational overview, pinned tiles | Operations Team | Azure Portal access |
| Managed Grafana | Prometheus integration, multi-cloud views | DevOps, SREs | Grafana workspace access |
| Power BI | Executive dashboards, cross-org reporting | Leadership, Security, WinObs, Identity | Microsoft 365 license (no Azure Portal needed) |

💡 Power BI Use Case: For stakeholders who don't have Azure Portal access (WinObs team, Security, Identity, leadership), Power BI provides rich visualization through the Azure Monitor Logs connector. Data can flow via:

  • Direct Query: Live connection to Log Analytics Workspace
  • Data Export: Scheduled export to Storage Account → Power BI import
  • Azure Data Explorer: LAW → ADX cluster → Power BI for high-volume analytics

7.2 Workbook Templates

| Workbook | Purpose | Key Metrics |
|---|---|---|
| Platform Health | Overall ALZ health | Service Health, Resource Health, Alert counts |
| VM Performance | Virtual machine metrics | CPU, Memory, Disk, Network |
| Storage Analytics | Storage account health | Availability, Latency, Capacity |
| Network Insights | Network monitoring | ExpressRoute, Firewall, VPN |
| Cost Overview | Log ingestion costs | GB/day by workspace, resource type |
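
A minimal KQL sketch for the Cost Overview workbook's GB/day view, using the standard `Usage` table (`Quantity` is in MB):

```kql
// Billable ingestion (GB) per table per day over the last 7 days.
Usage
| where TimeGenerated > ago(7d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024. by DataType, bin(TimeGenerated, 1d)
| order by IngestedGB desc
```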

8. Security & Access Control

8.1 RBAC Model

8.2 Access Control Matrix

| Team | Workspace Access | Table Access | Resource Access |
|---|---|---|---|
| Platform Team | Full | All tables | All resources |
| Security Team | Full | SecurityEvent, AuditLogs | All resources |
| LZ Team A | Resource-context | Own resources | LZ-A resources only |
| LZ Team B | Resource-context | Own resources | LZ-B resources only |
| Auditors | Read-only | AuditLogs, ActivityLog | Read-only all |

9. Next Steps

| Document | Purpose |
|---|---|
| 02-operations-runbook.md | Operations, KQL queries, DCR patterns, troubleshooting |
| 03-advanced-topics.md | DR, audit logs, cost optimization, AI integration |

10. References


Document generated as part of the Unified Monitoring Solution workshop preparation.
