Azure Observability in Foundry Control Plane Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for AI + Machine Learning

Category

AI + Machine Learning

1. Introduction

Observability in Foundry Control Plane in Azure is about gaining reliable visibility into what the platform is doing when you build, configure, secure, and operate AI systems—especially the management-plane actions that create, update, deploy, and govern AI resources.

In simple terms: it helps you answer “what changed, who changed it, when, and what happened next?” for your AI platform setup. This includes tracking administrative operations, policy outcomes, service health signals, and operational logs that explain why an AI environment is healthy, degraded, or failing.

Technically, “Observability in Foundry Control Plane” is not typically a single standalone Azure resource with its own billing meter. Instead, it is best understood as the set of observability signals and integrations that cover Foundry-related control-plane operations—implemented through standard Azure observability building blocks such as:

  • Azure Monitor (metrics, alerts, workbooks)
  • Azure Monitor Logs / Log Analytics (central log store + KQL queries)
  • Azure Activity Log (subscription-level control-plane events)
  • Diagnostic settings (routing platform logs to Log Analytics / Storage / Event Hubs)
  • Optional integrations like Microsoft Sentinel (SIEM) and ITSM connectors

What problem does it solve? It reduces the risk and toil caused by “invisible” platform changes and failures—like unexpected access changes, deployments that don’t take effect, policy blocks, quota issues, or regional incidents—by giving you auditability, troubleshooting data, and actionable alerts for the AI platform control plane.

Naming note (verify in official docs): Microsoft’s AI platform branding has evolved (for example, Azure AI Studio and Azure AI Foundry naming). This tutorial treats “Observability in Foundry Control Plane” as the observability scope for Foundry’s management plane and shows how to implement it using current Azure Monitor capabilities. If your tenant uses different portal names, follow the equivalent resources and blades.


2. What is Observability in Foundry Control Plane?

Official purpose (practical definition aligned with Azure)

Observability in Foundry Control Plane is the practice and implementation of collecting, centralizing, analyzing, and alerting on control-plane signals related to Foundry-based AI platform resources in Azure.

Because control-plane operations in Azure are fundamentally governed by Azure Resource Manager (ARM), most “control plane observability” relies on:

  • Azure Activity Log for subscription-level events (create/update/delete, RBAC changes, policy actions)
  • Resource logs (when supported by specific resource types) routed using Diagnostic settings
  • Service Health / Resource Health for platform and regional incidents
  • Azure Monitor alerts and dashboards to detect, notify, and triage issues

Core capabilities

In a Foundry control-plane context, observability typically includes:

  • Audit trail of administrative actions
    • Who created/updated/deleted AI resources and configurations
    • Who changed access, keys, networking, or identity settings
  • Policy and governance visibility
    • Policy compliance results and “deny” outcomes
    • Drift detection for “approved” configurations
  • Operational troubleshooting
    • Correlating a deployment/configuration change to an outage
    • Explaining authorization failures (RBAC), networking blocks, quota failures, or region issues
  • Alerting and reporting
    • Alerts for suspicious or risky control-plane actions (deletions, public network enablement, key rotations)
    • Periodic reporting and dashboards for platform operations
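Once Activity Log data lands in Log Analytics (set up later in this tutorial), the audit-trail capability above reduces to a short KQL query. A sketch, assuming the standard AzureActivity schema produced by Activity Log export:

```kusto
// Who performed write/delete operations in the last 24 hours, and how often
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue endswith "/write" or OperationNameValue endswith "/delete"
| summarize Operations = count() by Caller, OperationNameValue
| order by Operations desc
```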

Major components (Azure building blocks)

The most common components used to implement Observability in Foundry Control Plane are:

  • Azure Activity Log (subscription scope)
  • Log Analytics workspace (central log store)
  • Diagnostic settings (routing control-plane logs to sinks)
  • Azure Monitor Alerts (metric alerts, log alerts)
  • Azure Monitor Workbooks (dashboards)
  • Optional:
    • Microsoft Sentinel (security analytics, incident management)
    • Event Hubs (stream logs to external platforms)
    • Storage accounts (long retention/archival)
    • Azure Managed Grafana (visualization, when appropriate)

Service type

Observability in Foundry Control Plane is best viewed as a solution pattern implemented using Azure’s native observability services. It is not usually purchased as a single SKU.

Scope: regional/global/subscription

  • Azure Activity Log is subscription-scoped and not tied to a single region.
  • Log Analytics workspaces are regional resources (you choose a region).
  • Service Health is global and tenant/subscription contextual.

How it fits into the Azure ecosystem (AI + Machine Learning)

Foundry-based AI systems frequently rely on a mix of services (for example: model endpoints, orchestration, data stores, networking, identity). Foundry control-plane observability connects those operations back to:

  • Identity (Microsoft Entra ID) and RBAC decisions
  • ARM deployments (Bicep/Terraform/Portal changes)
  • Policy enforcement (Azure Policy)
  • Operational governance (tagging, naming, budget alerts, resource locks)

This is especially important in AI + Machine Learning, where misconfiguration can create:

  • data exposure risks
  • runaway costs
  • model deployment failures
  • compliance gaps


3. Why use Observability in Foundry Control Plane?

Business reasons

  • Reduce downtime and incident duration: Faster root cause analysis when you can correlate outages with recent control-plane changes.
  • Lower operational risk: Catch risky actions early (e.g., public network enabled, diagnostic logs disabled, key vault access changed).
  • Improve audit readiness: Maintain traceability of changes for regulated workloads.

Technical reasons

  • Single source of truth for change events: Centralize control-plane events and resource logs into Log Analytics.
  • Correlation across services: Track changes across AI resources, networking, identity, and data services in one timeline.
  • Evidence-based troubleshooting: Replace guesswork with logs and structured events.

Operational reasons (SRE/Platform/DevOps)

  • Actionable alerting: Notify the right teams on destructive operations, policy denies, or repeated failures.
  • Operational dashboards: Workbooks for recurring operational questions (who changed what, what failed, what’s trending).
  • Change management integration: Stream audit logs to SIEM/ITSM tools.

Security/compliance reasons

  • Detect unauthorized or unexpected changes: RBAC, identity, and network posture changes are common sources of security incidents.
  • Support least privilege: Use logs to validate that roles and permissions are used as intended.
  • Retention controls: Store logs to meet regulatory retention requirements (often via Storage or Sentinel).

Scalability/performance reasons

Control-plane observability helps scaling indirectly:

  • When you scale AI systems, you create more resources, deployments, and changes; observability prevents that scale from turning into chaos.
  • Alerting on throttling/quota and policy issues helps prevent repeated failed rollouts.

When teams should choose it

Choose Observability in Foundry Control Plane when:

  • You operate AI environments in shared subscriptions or landing zones.
  • You need audit trails and governance evidence.
  • Multiple teams deploy models and services frequently.
  • You must respond to incidents quickly and consistently.

When teams should not choose it

You may not need a full control-plane observability implementation if:

  • You are running a short-lived prototype in a sandbox with no compliance requirements.
  • You have a single developer and minimal change frequency.
  • You do not retain resources beyond a few days.

Even then, enabling basic Activity Log routing is usually low-effort and pays off quickly.


4. Where is Observability in Foundry Control Plane used?

Industries

  • Finance and insurance (auditability, change control)
  • Healthcare and life sciences (compliance, access tracking)
  • Retail and e-commerce (availability + rapid releases)
  • Manufacturing (operational reliability, OT/IT boundaries)
  • Public sector (policy enforcement, retention requirements)
  • SaaS/ISVs building AI features (multi-tenant governance)

Team types

  • Platform engineering teams operating Azure landing zones
  • SRE/Operations teams managing incident response
  • Security engineering and SOC teams
  • AI/ML engineering teams deploying models at scale
  • DevOps teams managing CI/CD and infrastructure as code

Workloads and architectures

  • Hub-and-spoke networks with private endpoints
  • Multi-subscription environments with centralized logging
  • Production AI platforms with strict role separation
  • Regulated environments using Azure Policy and Sentinel

Production vs dev/test

  • Dev/test: Focus on rapid debugging, basic change tracking, cost guardrails.
  • Production: Add retention, SIEM integration, strict alerting, and governance reporting.

5. Top Use Cases and Scenarios

Below are realistic scenarios where Observability in Foundry Control Plane is directly useful.

1) Audit “who changed the model deployment configuration”

  • Problem: A model endpoint starts returning errors after a configuration change.
  • Why this fits: Control-plane logs reveal the change operation, identity, time, and target resource.
  • Example: An engineer updates a deployment SKU or networking setting; Activity Log shows the update event and the caller.

2) Alert on destructive actions (delete, purge, disable logging)

  • Problem: Critical AI resources are deleted or logging is turned off.
  • Why this fits: Log alerts can detect delete operations or diagnostic settings changes.
  • Example: Alert when a resource delete occurs under AI resource groups, page the on-call team.

3) Detect RBAC drift and privilege escalation

  • Problem: Unexpected access grants appear on AI resources or resource groups.
  • Why this fits: Activity Log captures role assignment changes.
  • Example: Notify security when “Owner” is assigned to a non-approved group.
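Role assignment changes surface in the Activity Log under the Microsoft.Authorization resource provider. A sketch of the detection query (operation names assume the standard Activity Log export schema; verify against events in your own workspace):

```kusto
// Role assignments created or removed in the last 24 hours
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue in~ ("Microsoft.Authorization/roleAssignments/write",
                                "Microsoft.Authorization/roleAssignments/delete")
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue, ResourceId
| order by TimeGenerated desc
```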

4) Troubleshoot policy denies that block deployments

  • Problem: A pipeline fails with a vague “forbidden” error.
  • Why this fits: Policy events and Activity Log entries help identify the policy assignment causing the deny.
  • Example: A policy requiring private endpoints blocks a deployment; logs show the policy name and assignment scope.
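Policy denies are recorded in the Activity Log’s Policy category, and the Properties column carries the policy assignment details. A starting-point query (exact operation names vary by effect and provider; confirm against your logs):

```kusto
// Recent policy deny events, with properties identifying the policy assignment
AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue == "Policy" and OperationNameValue has "deny"
| project TimeGenerated, Caller, OperationNameValue, ResourceId, Properties
| order by TimeGenerated desc
```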

5) Quota and capacity incident correlation

  • Problem: Deployments fail intermittently due to quota/capacity constraints.
  • Why this fits: Control-plane failure events plus service health context can guide remediation.
  • Example: Activity Log shows repeated “failed” create operations; correlate with region service health advisory.

6) Build an operational “AI platform change timeline”

  • Problem: Incident reviews require a consistent timeline of changes across resources.
  • Why this fits: Centralized logs let you query by time range and resource group tags.
  • Example: A workbook shows all create/update/delete operations in the last 24 hours for the AI platform.

7) Multi-team governance reporting (chargeback/showback support)

  • Problem: Leadership asks which teams are creating AI resources and whether they follow standards.
  • Why this fits: Control-plane logs + tags provide evidence for reporting.
  • Example: Report top resource creators per subscription and whether tagging policies were satisfied.
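A sketch of the “top creators” part of that report, built from the same AzureActivity data (column names per the standard export schema):

```kusto
// Callers who successfully modified the most distinct resources in 30 days
AzureActivity
| where TimeGenerated > ago(30d)
| where OperationNameValue endswith "/write" and ActivityStatusValue == "Succeeded"
| summarize ResourcesTouched = dcount(ResourceId) by Caller
| top 10 by ResourcesTouched
```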

8) Incident response automation with Sentinel

  • Problem: SOC needs to detect suspicious admin activity and open incidents.
  • Why this fits: Stream logs to Microsoft Sentinel for correlation and automated response.
  • Example: Sentinel rule triggers when multiple role changes happen outside business hours.

9) Validate infrastructure-as-code deployments

  • Problem: You want proof that a CI/CD pipeline applied the intended changes.
  • Why this fits: Activity Log shows deployment operations and outcomes.
  • Example: Confirm that a Bicep deployment updated diagnostic settings and network rules.

10) Prove compliance for regulated AI environments

  • Problem: Auditors require evidence of access control, retention, and change tracking.
  • Why this fits: Centralized logs + retention policies + audit trails support compliance.
  • Example: Provide evidence of RBAC changes, key rotations, and policy compliance over time.

6. Core Features

Because Observability in Foundry Control Plane is typically implemented using Azure Monitor primitives, the “features” are best described as the capabilities you enable.

Feature 1: Subscription-level control-plane event capture (Azure Activity Log)

  • What it does: Captures administrative events such as create/update/delete operations, RBAC changes, policy actions, and service health notifications at subscription scope.
  • Why it matters: Most critical AI platform incidents involve “what changed” in the control plane.
  • Practical benefit: A single timeline for changes across Foundry-related resources.
  • Caveat: Activity Log retention in the portal is limited; for longer retention you must export via diagnostic settings.

Feature 2: Diagnostic settings routing to Log Analytics / Storage / Event Hubs

  • What it does: Exports logs to one or more sinks for retention, analysis, or streaming.
  • Why it matters: Centralization is required for cross-resource correlation and alerting.
  • Practical benefit: Query across subscriptions/workloads; store long term; feed SIEM.
  • Caveat: Not every resource type exposes the same resource logs/metrics categories. Verify per resource in Azure portal.

Feature 3: Centralized log search and analytics (Log Analytics + KQL)

  • What it does: Stores logs and allows querying with Kusto Query Language (KQL).
  • Why it matters: Control-plane troubleshooting often needs filtering by caller, operation, resource, status, correlation ID.
  • Practical benefit: Fast investigations and reusable queries for SRE runbooks.
  • Caveat: Costs depend on ingestion and retention; implement filters/retention tiers carefully.
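For example, once one event’s CorrelationId is known, every record from the same logical operation can be pulled in a single query (substitute a real value from an earlier result):

```kusto
// All Activity Log records belonging to one logical operation
AzureActivity
| where CorrelationId == "00000000-0000-0000-0000-000000000000"  // substitute a real value
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceId
| order by TimeGenerated asc
```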

Feature 4: Alerting on risky or anomalous control-plane events (Azure Monitor Alerts)

  • What it does: Generates notifications/incidents when queries match conditions (log alerts) or metrics cross thresholds.
  • Why it matters: You shouldn’t learn about deletions, access changes, or policy denies from users.
  • Practical benefit: Proactive operations and security response.
  • Caveat: Poorly tuned alerts create noise. Start with a small set of high-signal detections.

Feature 5: Dashboards and reporting (Azure Monitor Workbooks)

  • What it does: Visualizes queries and metrics with parameterized dashboards.
  • Why it matters: Platform operations need repeatable “daily view” dashboards.
  • Practical benefit: Self-service visibility for engineers and stakeholders.
  • Caveat: Workbooks are only as good as the underlying log hygiene (tags, consistent scopes, routed logs).

Feature 6: Service Health / Resource Health integration

  • What it does: Provides Azure platform incident notifications and per-resource health signals.
  • Why it matters: Separates “our change broke it” from “Azure incident is impacting it.”
  • Practical benefit: Faster triage and clearer comms during outages.
  • Caveat: Health signals are not a substitute for your app/data-plane monitoring—use both.

Feature 7: Governance visibility (Policy + Activity Log + optional compliance reporting)

  • What it does: Shows what policies were evaluated, denied, or remediated.
  • Why it matters: Foundry control plane often must enforce private networking, encryption, tagging, and restricted SKUs.
  • Practical benefit: Clear evidence of enforcement and drift.
  • Caveat: Policy event coverage and details vary by resource provider and policy effect. Validate policy logging behavior.

Feature 8: Security analytics via SIEM (optional Microsoft Sentinel)

  • What it does: Correlates events, applies detections, and manages incidents.
  • Why it matters: AI platforms are high-value targets; admin actions are high-signal events.
  • Practical benefit: SOC-ready detections and incident workflows.
  • Caveat: Additional cost and operational ownership required; don’t forward everything without a plan.

7. Architecture and How It Works

High-level architecture

Observability in Foundry Control Plane follows a straightforward pattern:

  1. Control-plane events occur whenever someone or something (portal, CLI, IaC pipeline) performs ARM operations on Foundry-related resources.
  2. Azure emits:
     • Activity Log events at subscription scope
     • Optional resource logs/metrics for specific resources (where supported)
  3. Diagnostic settings export these signals to:
     • Log Analytics for query/alert/dashboard
     • Storage for archival/retention
     • Event Hubs for streaming to third-party tools
  4. Azure Monitor evaluates alert rules and triggers notifications/actions.

Request/data/control flow (what flows where)

  • Control flow: User/CI → ARM → Resource Provider (AI/ML services)
  • Telemetry flow:
    • ARM writes Activity Log events
    • The resource provider may emit resource logs/metrics
    • Diagnostic settings route telemetry to Log Analytics / Storage / Event Hubs

Integrations with related services

Common integrations in Azure AI + Machine Learning environments include:

  • Microsoft Entra ID: identity and authentication to Azure
  • Azure RBAC: authorization decisions
  • Azure Policy: governance controls; policy deny events affect deployments
  • Private Link / Private Endpoints: networking posture; changes are critical to observe
  • Key Vault: secrets and keys; access changes should be monitored
  • Azure DevOps / GitHub Actions: IaC pipelines generating control-plane events
  • Microsoft Sentinel: SIEM for high-value control-plane events

Dependency services

To implement this pattern you typically need:

  • A Log Analytics workspace
  • Azure Monitor alert rules
  • Diagnostic settings at subscription/resource scope

Security/authentication model

  • Authentication uses Microsoft Entra ID
  • Authorization uses Azure RBAC
  • Access to logs is governed by:
    • Log Analytics workspace RBAC (Log Analytics Reader/Contributor)
    • Azure Monitor roles (Monitoring Reader/Contributor)
    • Azure subscription/resource group RBAC

Networking model

  • Activity Log and Log Analytics are Azure services accessed over Azure’s public endpoints by default.
  • You can harden access with:
    • Private Link options (availability varies by service; verify in official docs)
    • Network restrictions and firewall rules where supported
    • Restricting who can read logs via RBAC, rather than relying only on network controls

Monitoring/logging/governance considerations

  • Decide which subscriptions and which resource groups represent Foundry platform boundaries.
  • Standardize:
    • naming conventions (to filter queries)
    • tagging (owner, env, cost center)
    • retention strategy (hot vs archive)
  • Treat “disable diagnostic settings” as a high-severity event: alert on it.
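That last detection can be expressed against the Microsoft.Insights provider operations. A sketch (verify the operation names against events in your workspace):

```kusto
// Changes to diagnostic settings are themselves a high-severity signal
AzureActivity
| where TimeGenerated > ago(1h)
| where OperationNameValue in~ ("Microsoft.Insights/diagnosticSettings/write",
                                "Microsoft.Insights/diagnosticSettings/delete")
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue, ResourceId
```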

Simple architecture diagram

flowchart LR
  U[Engineer / CI Pipeline] --> ARM[Azure Resource Manager]
  ARM --> RP[Foundry-related Resource Providers]
  ARM --> AL[Azure Activity Log]
  RP --> RL["Resource Logs / Metrics<br/>(when supported)"]

  AL --> DS[Diagnostic Settings]
  RL --> DS

  DS --> LAW[Log Analytics Workspace]
  LAW --> AM[Azure Monitor Alerts]
  LAW --> WB[Workbooks / Dashboards]
  AM --> N[Notifications / ITSM / Webhook]

Production-style architecture diagram

flowchart TB
  subgraph Management["Management & Governance"]
    AAD[Microsoft Entra ID]
    RBAC[Azure RBAC]
    POL[Azure Policy]
    SH[Service Health / Resource Health]
  end

  subgraph Platform["Foundry Platform Subscriptions"]
    CI[GitHub Actions / Azure DevOps]
    ARM[Azure Resource Manager]
    AI["AI + ML Resources<br/>(Foundry-related)"]
    KV[Key Vault]
    NET["Networking<br/>(VNet/Private Endpoints)"]
  end

  subgraph Observability["Central Observability Subscription"]
    DS["Diagnostic Settings<br/>(Subscription + Resource)"]
    LAW[Log Analytics Workspace]
    STO["Storage Account (Archive)"]
    EH["Event Hubs (Streaming)"]
    WB[Azure Monitor Workbooks]
    ALRT[Azure Monitor Alerts]
    SENT["Microsoft Sentinel (Optional)"]
  end

  CI --> ARM
  ARM --> AI
  ARM --> KV
  ARM --> NET

  AAD --> ARM
  RBAC --> ARM
  POL --> ARM

  ARM -->|Control-plane events| DS
  AI -->|Resource logs/metrics| DS
  SH --> DS

  DS --> LAW
  DS --> STO
  DS --> EH

  LAW --> WB
  LAW --> ALRT
  LAW --> SENT
  EH --> SENT

8. Prerequisites

Account/subscription requirements

  • An Azure subscription where you can:
    • Configure diagnostic settings at the subscription level, and/or
    • Configure diagnostic settings on Foundry-related resources

Permissions (IAM roles)

Typical minimum roles (scope varies by where you configure things):

  • To create a Log Analytics workspace: Contributor on a resource group (or higher)
  • To configure diagnostic settings:
    • Owner or Contributor at the subscription/resource scope is commonly required
    • Some environments use a dedicated role with monitoring permissions; verify your org’s RBAC model
  • To query logs: Log Analytics Reader
  • To create alerts: Monitoring Contributor

Billing requirements

  • A payment method enabled for Azure Monitor Logs ingestion/retention and alerting.

Tools

  • Azure Portal
  • Azure CLI (az)
    Install: https://learn.microsoft.com/cli/azure/install-azure-cli

Region availability

  • Log Analytics workspace is regional; choose a region consistent with your data residency requirements.
  • Activity Log is subscription-level and not tied to one region.

Quotas/limits (verify in official docs)

  • Log Analytics ingestion/retention constraints
  • Alert rules per subscription/workspace limits
  • Diagnostic settings per resource limits

Prerequisite services

  • Azure Monitor
  • Log Analytics workspace (recommended for the lab)

9. Pricing / Cost

Observability in Foundry Control Plane is priced through the Azure services you use to store, query, and act on telemetry—not usually as a standalone “Foundry observability” SKU.

Primary pricing dimensions (what you pay for)

  1. Azure Monitor Logs (Log Analytics)
     • Data ingestion (GB/day)
     • Retention (days stored in the workspace)
     • Optional archive and restore costs (where used)
  2. Alerting
     • Log alerts may incur charges depending on alert type and evaluation frequency (verify current Azure Monitor pricing details).
  3. Data export / streaming
     • Event Hubs throughput units and retention (if streaming)
     • Storage costs for archived logs (capacity + transactions)
  4. SIEM (optional)
     • Microsoft Sentinel charges (typically based on data ingestion/retention)

Official pricing:

  • Azure Monitor pricing: https://azure.microsoft.com/pricing/details/monitor/
  • Azure Pricing Calculator: https://azure.microsoft.com/pricing/calculator/

Free tier (what may be free)

Azure pricing changes over time. Some aspects that are commonly “included” or low-cost:

  • Viewing recent Activity Log entries in the portal (limited retention)
  • Some basic platform logs may not incur additional charges until exported/ingested

Verify in official docs for the current free allowances for Log Analytics ingestion and retention in your region.

Cost drivers (most important)

  • High-volume Activity Log export across many subscriptions
  • Verbose resource logs exported at high frequency
  • Long retention periods kept in hot storage
  • Unfiltered logs streamed to multiple sinks (Log Analytics + Event Hubs + Storage)
  • Noisy alerts evaluated too frequently

Hidden or indirect costs

  • Cross-team access: More users querying logs may increase operational load (not a direct cost, but real toil).
  • Data egress: Streaming to third-party tools may incur network charges depending on architecture.
  • Retention compliance: Long-term retention in hot tier can be expensive; storage archive patterns may be cheaper.

Network/data transfer implications

  • Exporting logs to Event Hubs and then to non-Azure tools can introduce egress charges.
  • Centralized logging across regions may create additional complexity. Prefer regionally aligned workspaces where required by policy.

How to optimize cost (without losing auditability)

  • Start with Activity Log export only; add resource logs selectively.
  • Use short hot retention in Log Analytics + archive to Storage for long retention (verify the recommended approach in current Azure docs).
  • Reduce alert frequency; use high-signal conditions.
  • Use KQL to focus on:
    • specific resource groups
    • specific operation names (delete, write, role assignments)
    • failures only (when appropriate)

Example low-cost starter estimate (model, not numbers)

A minimal setup for a small team:

  • 1 Log Analytics workspace
  • Subscription Activity Log routed to the workspace
  • 2–5 log alerts (delete operations, RBAC changes)

Main cost components:

  • Workspace ingestion from Activity Log volume
  • Retention days chosen
  • Alert evaluation frequency

Because the exact price per GB and alert charges vary by region and plan, use the Azure Pricing Calculator to estimate with your expected GB/day.
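The arithmetic itself is simple; the sketch below uses a deliberately made-up per-GB rate (not a real Azure price) just to show the shape of the estimate:

```shell
# Back-of-envelope monthly ingestion cost. PRICE_PER_GB is a placeholder,
# NOT a real Azure rate -- look up your region in the Pricing Calculator.
GB_PER_DAY=0.5          # expected Activity Log ingestion volume
PRICE_PER_GB=3.00       # hypothetical analytics-logs rate, USD/GB
DAYS_PER_MONTH=30
awk -v g="$GB_PER_DAY" -v p="$PRICE_PER_GB" -v d="$DAYS_PER_MONTH" \
  'BEGIN { printf "Estimated monthly ingestion cost: $%.2f\n", g * d * p }'
# prints: Estimated monthly ingestion cost: $45.00
```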

Example production cost considerations

In production, cost planning should include:

  • A central workspace per region or per landing zone
  • Storage archive for multi-year retention
  • Sentinel (if a SOC is required)
  • Event Hubs streaming to an enterprise SIEM
  • Multiple workbooks and alerts
  • Budget alerts and cost anomaly detection (FinOps)


10. Step-by-Step Hands-On Tutorial

Objective

Implement a practical baseline for Observability in Foundry Control Plane by:

  1. Creating a Log Analytics workspace
  2. Exporting Azure Activity Log to that workspace (subscription-level control-plane visibility)
  3. Running KQL queries to inspect Foundry-related control-plane events (by filtering to AI/ML resource providers)
  4. Creating a basic alert for a high-risk control-plane action
  5. Cleaning up to avoid ongoing cost

This lab is designed to be safe and low-cost. You will generate only a small number of control-plane events.

Lab Overview

You will:

  • Create a resource group and Log Analytics workspace.
  • Configure a subscription diagnostic setting to send Activity Logs to Log Analytics.
  • Generate a control-plane event by creating and deleting a small Azure resource (use a minimal AI/ML-related resource if available in your subscription; otherwise any resource will still validate the pipeline).
  • Query Activity Log data in Log Analytics.
  • Create an alert for delete operations.

Note: Foundry-specific resource types vary by tenant and by how your organization provisions AI services. The Activity Log approach still applies because it captures ARM operations across resource providers.


Step 1: Create a resource group

Action (Azure CLI):

az account show
az group create \
  --name rg-foundry-observability-lab \
  --location eastus

Expected outcome: – A resource group named rg-foundry-observability-lab exists.

Verification:

az group show --name rg-foundry-observability-lab --query "{name:name, location:location}" -o table

Step 2: Create a Log Analytics workspace

Action (Azure CLI):

az monitor log-analytics workspace create \
  --resource-group rg-foundry-observability-lab \
  --workspace-name law-foundry-obsv-lab \
  --location eastus

Expected outcome: – A Log Analytics workspace is created.

Verification:

az monitor log-analytics workspace show \
  --resource-group rg-foundry-observability-lab \
  --workspace-name law-foundry-obsv-lab \
  --query "{name:name, customerId:customerId, location:location}" -o table

Step 3: Export Azure Activity Log to Log Analytics (subscription diagnostic setting)

This is the key step for control-plane observability.

Option A (recommended): Azure Portal

  1. Go to Monitor in the Azure portal.
  2. Navigate to Activity log.
  3. Select Export Activity Logs (or Diagnostic settings depending on portal layout).
  4. Create a diagnostic setting:
     • Destination: Send to Log Analytics workspace
     • Select your workspace: law-foundry-obsv-lab
     • Categories to include (typical baseline):
       • Administrative
       • Policy
       • Security
       • ServiceHealth
       • ResourceHealth
       • Alert (if available)
       • Recommendation (if available)

Expected outcome: – A diagnostic setting exists for the subscription that exports Activity Logs to Log Analytics.

Option B: Azure CLI (if available in your environment)

Azure CLI support for subscription diagnostic settings can vary by CLI version/extension. If the following commands fail, use the portal.

1) Get your subscription ID:

SUB_ID=$(az account show --query id -o tsv)
echo $SUB_ID

2) Create the subscription diagnostic setting (command group may vary; verify in official docs if it differs):

LAW_ID=$(az monitor log-analytics workspace show \
  --resource-group rg-foundry-observability-lab \
  --workspace-name law-foundry-obsv-lab \
  --query id -o tsv)

az monitor diagnostic-settings subscription create \
  --name ds-activitylog-to-law \
  --subscription $SUB_ID \
  --location eastus \
  --workspace $LAW_ID \
  --logs '[
    {"category":"Administrative","enabled":true},
    {"category":"Policy","enabled":true},
    {"category":"Security","enabled":true},
    {"category":"ServiceHealth","enabled":true},
    {"category":"ResourceHealth","enabled":true}
  ]'

Expected outcome: – Activity Log events begin flowing to Log Analytics (may take a few minutes).

Verification (Portal): – Go to the workspace → Logs → run a query (next step).


Step 4: Generate a control-plane event

To validate end-to-end, create a small resource. If your subscription allows provisioning an AI/ML resource you normally use with Foundry, prefer that (because it will generate provider-specific events). If not, any Azure resource will still prove the control-plane logging pipeline.

Example (safe and generally available): Create and delete a Storage account

Storage is not “AI”, but the Activity Log pipeline is identical and confirms your setup.

ST_NAME=stfoundryobsv$RANDOM
az storage account create \
  --name $ST_NAME \
  --resource-group rg-foundry-observability-lab \
  --location eastus \
  --sku Standard_LRS

Wait ~1–3 minutes, then delete it:

az storage account delete --name $ST_NAME --resource-group rg-foundry-observability-lab --yes

Expected outcome: – You generated at least two Activity Log events: a create and a delete.


Step 5: Query Activity Log data in Log Analytics (KQL)

In the Azure portal:

  1. Open your Log Analytics workspace: law-foundry-obsv-lab
  2. Select Logs
  3. Run this query:

AzureActivity
| where TimeGenerated > ago(1h)
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceGroup, ResourceProviderValue, ResourceId
| order by TimeGenerated desc

Expected outcome: – You see recent control-plane operations, including your storage create/delete (or other resource operations).

Filter to AI/ML-related providers (examples)

Depending on what you use with Foundry, you might filter to providers like these. Use what matches your environment.

AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceProviderValue has_any ("Microsoft.MachineLearningServices", "Microsoft.CognitiveServices")
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceGroup, ResourceId
| order by TimeGenerated desc

Expected outcome:

  • If you have AI/ML resources and recent operations, you’ll see them here.
  • If not, you’ll get zero results, meaning you need to generate an AI/ML operation in your subscription to validate provider-specific coverage.


Step 6: Create a high-signal alert for delete operations

A practical baseline is to alert on any delete operation in your Foundry platform resource group(s).

Create a log alert (Portal method)

  1. Go to Monitor → Alerts → Create → Alert rule
  2. Scope: select your Log Analytics workspace
  3. Condition: Custom log search
  4. Use this query:

AzureActivity
| where TimeGenerated > ago(10m)
| where OperationNameValue endswith "/delete"
| where ResourceGroup == "rg-foundry-observability-lab"

  5. Set:
    • Evaluation frequency: e.g., 5 minutes
    • Lookback period: e.g., 10 minutes
    • Threshold: greater than 0
  6. Action group: email yourself (and/or webhook/ITSM connector)
  7. Name: alert-delete-ops-rg-foundry-observability-lab

Expected outcome: if a delete happens in the resource group, the alert fires.
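Role assignment changes are another high-signal condition worth a sibling alert rule. A query sketch, assuming the standard Microsoft.Authorization operation names (verify the exact names emitted in your tenant’s Activity Log):

```kql
// Role assignment creates and deletes in the evaluation window
AzureActivity
| where TimeGenerated > ago(10m)
| where OperationNameValue in~ ("Microsoft.Authorization/roleAssignments/write",
                                "Microsoft.Authorization/roleAssignments/delete")
| project TimeGenerated, Caller, OperationNameValue, ResourceId
```

The same portal flow applies: scope it to your workspace, evaluate every 5 minutes, and fire on a count greater than 0.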


Validation

Use this checklist:

  1. Activity Log export enabled – Portal: Monitor → Activity log → Export/Diagnostic settings shows your Log Analytics destination.
  2. Data arriving in Log Analytics – AzureActivity | where TimeGenerated > ago(1h) returns rows.
  3. Alert rule created and enabled – Monitor → Alerts shows the rule as enabled.
  4. Test alert – Delete a small resource in the lab resource group and confirm the alert triggers.
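For checklist item 2, a short sanity query confirms both volume and freshness in one result row (a sketch against the standard AzureActivity table):

```kql
// How many Activity Log events arrived recently, and when was the latest?
AzureActivity
| where TimeGenerated > ago(1h)
| summarize Events = count(), Latest = max(TimeGenerated)
```

If Events is 0 or Latest lags well behind the current time, revisit the diagnostic setting before debugging anything downstream.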

Troubleshooting

Issue: AzureActivity table has no data
  • Wait 5–15 minutes after enabling export.
  • Confirm the diagnostic setting is configured at the subscription level, not only on a resource.
  • Ensure you selected relevant categories (Administrative is essential).

Issue: Permission denied creating diagnostic settings
  • You likely need Owner or Contributor at subscription scope (or a role that includes Microsoft.Insights/diagnosticSettings/*).
  • In locked-down environments, request help from the platform team.

Issue: Alert never fires
  • Confirm your query is correct:
    • Use a larger time window temporarily (e.g., ago(1h)).
    • Remove the ResourceGroup filter to confirm delete operations appear.
  • Confirm the alert evaluation period/frequency matches your query window.

Issue: Too many alerts (noise)
  • Narrow by:
    • resource group(s) for the Foundry platform
    • specific operation names (role assignments, delete, write)
    • only failed operations (e.g., ActivityStatusValue != "Succeeded")
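Combining those narrowing filters into a single query might look like the sketch below. The resource group name is the lab value from earlier; status strings can vary between "Failed" and "Failure" depending on schema version, so check the distinct values in your own workspace first.

```kql
// High-signal only: deletes/writes in the platform RG that failed
AzureActivity
| where TimeGenerated > ago(1h)
| where ResourceGroup == "rg-foundry-observability-lab"
| where OperationNameValue endswith "/delete" or OperationNameValue endswith "/write"
| where ActivityStatusValue in ("Failed", "Failure")   // verify status values in your workspace
```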


Cleanup

To avoid ongoing charges, remove what you created.

1) Delete the lab resource group (deletes workspace and any remaining lab resources):

az group delete --name rg-foundry-observability-lab --yes --no-wait

2) Remove the subscription diagnostic setting (if you created one):
  • Portal: Monitor → Activity Log → Export/Diagnostic settings → delete the setting
  • Or use the CLI if available (command patterns vary; verify in official docs)

3) Remove alert rules created for the lab:
  • Portal: Monitor → Alerts → Alert rules → delete the lab rule


11. Best Practices

Architecture best practices

  • Centralize logs by landing zone or platform subscription to enable cross-resource correlation.
  • Use a tiered retention strategy:
    • Hot retention in Log Analytics for active investigations
    • Archive in Storage for long-term compliance (verify best practice in current Azure docs)

IAM/security best practices

  • Restrict who can:
    • change diagnostic settings,
    • delete workspaces,
    • disable alerts.
  • Use separation of duties:
    • Platform team owns export pipelines and workspaces
    • App/ML teams have reader access and create team-level workbooks (where appropriate)

Cost best practices

  • Export what you need:
    • Start with Activity Log categories: Administrative, Policy, Security
    • Add resource logs selectively
  • Tune alerts for signal, not completeness.
  • Use budgets and cost alerts for observability resources too (workspaces can grow unexpectedly).

Performance best practices

  • Prefer focused queries (time-bounded, filtered by resource group/provider).
  • Build “investigation queries” as saved queries or workbook components.

Reliability best practices

  • Alert if diagnostic settings are removed or modified (control-plane observability must be protected).
  • Use resource locks on critical logging resources (carefully—locks can block legitimate changes).
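A detection query for the first point can key on the Microsoft.Insights diagnostic settings operations. This is a sketch; confirm the exact operation names your tenant emits (KQL string operators such as startswith are case-insensitive, which helps with casing differences in Activity Log values).

```kql
// Diagnostic setting writes/deletes: changes to the logging pipeline itself
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue startswith "Microsoft.Insights/diagnosticSettings/"
| project TimeGenerated, Caller, OperationNameValue, ResourceId
```

Wire this into a log alert with a short evaluation window so pipeline tampering is flagged within minutes.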

Operations best practices

  • Maintain an on-call runbook:
    • where to look first (Activity Log timeline),
    • key KQL queries,
    • escalation paths (platform vs Azure incident).
  • Create a standard workbook for:
    • recent changes,
    • failed operations,
    • RBAC changes,
    • policy denies.
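The “recent changes” view can start from a single summarize. A sketch of a workbook or timechart query over the Administrative category (CategoryValue is the standard AzureActivity column; adjust the window and bin size to taste):

```kql
// Administrative change volume per hour, split by resource group
AzureActivity
| where TimeGenerated > ago(7d)
| where CategoryValue == "Administrative"
| summarize Changes = count() by bin(TimeGenerated, 1h), ResourceGroup
| render timechart
```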

Governance/tagging/naming best practices

  • Standardize tags like:
    • env (dev/test/prod)
    • owner
    • costCenter
    • dataClassification
  • Use consistent resource group naming for Foundry platform boundaries (makes queries and alerts precise).

12. Security Considerations

Identity and access model

  • Control-plane actions authenticate via Microsoft Entra ID.
  • Authorization is enforced by Azure RBAC (and sometimes resource-specific roles).
  • Log access is also RBAC-controlled:
    • Use least privilege for Log Analytics readers.
    • Restrict write permissions to avoid tampering.

Encryption

  • Azure services encrypt data at rest by default (verify specifics for Log Analytics and Storage in current docs).
  • For archives in Storage, consider encryption key options (Microsoft-managed vs customer-managed keys), if required by policy.

Network exposure

  • Treat observability endpoints as sensitive:
    • They can reveal resource names, IDs, and operational details.
    • Prefer RBAC restrictions as the primary control.
    • Where available/required, evaluate private connectivity options (verify support per service).

Secrets handling

  • Avoid embedding secrets in alert webhooks or automation scripts.
  • Store secrets in Azure Key Vault and use managed identities for automation.

Audit/logging

  • Your observability pipeline itself must be observable:
    • Alert when diagnostic settings are changed.
    • Alert when the workspace is deleted (activity log events).
  • Consider streaming to Sentinel for tamper-resistant security operations (with proper governance).

Compliance considerations

  • Define retention requirements (e.g., 90 days hot, 1–7 years archive) based on your regulatory obligations.
  • Ensure logs do not violate data residency rules—choose workspace region accordingly.

Common security mistakes

  • Allowing too many users to modify diagnostic settings
  • Storing logs only in short-retention default views
  • Not alerting on role assignment changes
  • Not separating production and non-production logging workspaces

Secure deployment recommendations

  • Use IaC (Bicep/Terraform) to define:
    • diagnostic settings
    • workspaces
    • alerts
    • action groups
  • Apply policy to require diagnostic settings on critical resource types (verify feasibility per resource provider).

13. Limitations and Gotchas

  • Not all resources emit the same logs: Some Foundry-related services may have limited resource logs. Always check the resource’s Diagnostic settings categories.
  • Activity Log is necessary but not sufficient: It shows control-plane operations, not application/data-plane telemetry (e.g., model inference latency inside your app).
  • Retention defaults can be short: Relying only on portal views risks losing critical evidence.
  • Alert noise is easy to create: Without filters, you’ll overwhelm responders with low-signal events.
  • CLI/portal differences: Some diagnostic setting operations are easier in the portal; CLI support can vary by version. Use the portal if commands don’t match your environment.
  • Costs can grow quietly: Log ingestion increases with organizational scale and change frequency. Implement budgets and periodic reviews.

14. Comparison with Alternatives

Observability in Foundry Control Plane is a control-plane-focused approach. Here’s how it compares with nearby options.

| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Observability in Foundry Control Plane (Azure Monitor + Activity Log + Log Analytics) | Auditing and operating Foundry management plane | Strong change tracking, native Azure integration, flexible KQL | Doesn’t automatically cover app/data-plane telemetry | When you need governance, audit trails, and control-plane alerting |
| Azure Monitor (general) | Broad monitoring across Azure | Standard platform for metrics/logs/alerts | Requires design to cover Foundry boundaries | When you want a unified monitoring strategy |
| Application Insights (app telemetry) | Application performance monitoring | Traces, dependencies, distributed tracing for apps | Not a control-plane audit trail | When you need app-level observability for AI apps (APIs, RAG services) |
| Microsoft Sentinel | Security operations and incident response | SIEM/SOAR, correlation, detections | Additional cost/ops overhead | When SOC needs detections for admin actions and suspicious changes |
| AWS CloudTrail + CloudWatch | AWS control-plane observability | Mature change/audit tracking | Different cloud; not Azure-native | If your AI platform runs on AWS |
| GCP Cloud Audit Logs + Cloud Monitoring | GCP control-plane observability | Strong audit logs | Different cloud; not Azure-native | If your AI platform runs on GCP |
| Datadog / Splunk (self-managed or SaaS) | Cross-cloud enterprise observability | Powerful search/correlation | Cost, integration complexity, data residency concerns | When you need a unified multi-cloud observability layer |
| Prometheus/Grafana (self-managed) | Metrics-focused observability | Open ecosystem | Control-plane audit coverage is not the focus | When you primarily need metrics and have platform maturity |

15. Real-World Example

Enterprise example (regulated)

  • Problem: A financial services company runs AI workloads with strict governance. Auditors require proof of change control for AI platform resources, and incidents must be triaged quickly.
  • Proposed architecture:
  • Subscription Activity Log exported to a central Log Analytics workspace
  • Resource logs enabled on critical AI, networking, and Key Vault resources (where supported)
  • Workbooks for “change timeline”, “RBAC changes”, “policy denies”
  • Alerts on delete operations, role assignment changes, and diagnostic setting changes
  • Optional Microsoft Sentinel for SOC detections and incident workflows
  • Why this service was chosen: Observability in Foundry Control Plane aligns with Azure-native governance and audit requirements and integrates with existing Azure Monitor and security tooling.
  • Expected outcomes:
  • Faster audits (repeatable evidence)
  • Lower MTTR through change correlation
  • Reduced risk of unauthorized configuration drift

Startup/small-team example

  • Problem: A small SaaS team ships AI features weekly. A few outages were caused by accidental config changes and lack of visibility.
  • Proposed architecture:
  • Single Log Analytics workspace
  • Activity Log export enabled
  • 3 log alerts: deletes, role assignment changes, repeated failed writes
  • One workbook showing last 7 days of changes
  • Why this service was chosen: Minimal setup effort, low operational overhead, and immediate value from change visibility.
  • Expected outcomes:
  • Quick identification of “what changed”
  • Better on-call experience with fewer blind spots
  • Cost-controlled logging with short retention

16. FAQ

1) What does “control plane” mean in Foundry Control Plane observability?

Control plane refers to management operations (create/update/delete/configure) executed through Azure Resource Manager. It is different from data plane traffic such as application requests to your AI endpoint.

2) Is Observability in Foundry Control Plane a standalone Azure product?

Usually no. It is commonly implemented using Azure Monitor, Activity Log, Log Analytics, diagnostic settings, and alerts. Verify your organization’s Foundry documentation for any Foundry-specific dashboards or integrations.

3) What is the first thing to enable?

Enable subscription Activity Log export to a Log Analytics workspace. It provides immediate, broad control-plane visibility.

4) Does this replace application monitoring for AI apps?

No. Control-plane observability explains platform changes. You still need application/data-plane monitoring (often with Application Insights and distributed tracing).

5) How long does Activity Log data take to appear in Log Analytics?

Typically minutes, but delays can occur. If you see no data after 15 minutes, re-check diagnostic settings and permissions.

6) Which events are most important to alert on?

Start with high-signal events:
  • delete operations
  • role assignment changes
  • policy denies affecting deployments
  • diagnostic setting modifications

7) Can I route logs to both Log Analytics and Storage?

Yes, diagnostic settings often support multiple sinks. This is common for “hot search” in Log Analytics plus long-term archive in Storage.

8) What’s the difference between Activity Log and resource logs?

Activity Log is subscription-level control-plane events. Resource logs are resource-specific telemetry exposed via diagnostic settings (varies by resource type).

9) How do I prove “who changed what” during an incident?

Use Activity Log records in Log Analytics, filtering by time range, resource group, and operation. The Caller field is commonly used to identify the actor.
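In practice, an incident timeline query is often as simple as the sketch below (scoped here to the tutorial’s lab resource group; adjust the scope and time range to the incident window):

```kql
// Chronological list of control-plane actions and their callers
AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceGroup == "rg-foundry-observability-lab"
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue, ResourceId
| order by TimeGenerated asc
```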

10) How do I protect observability from tampering?

Use RBAC to restrict modification of diagnostic settings and workspaces. Consider alerts when diagnostic settings change and use resource locks where appropriate.

11) Do I need Microsoft Sentinel?

Not always. Sentinel is beneficial when you need SOC workflows, correlation, and incident management. Many teams start with Azure Monitor and add Sentinel later.

12) Will this increase my Azure bill significantly?

It can, depending on log volume and retention. The main cost drivers are Log Analytics ingestion and retention. Start small, measure GB/day, and optimize.
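To measure GB/day, the workspace’s built-in Usage table is the usual starting point. A sketch; Quantity is reported in MB, so divide by 1,024, and verify the schema in your workspace before relying on it for budgeting.

```kql
// Billable ingestion per table over the last 30 days, in GB
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| order by IngestedGB desc
```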

13) Can I use this across multiple subscriptions?

Yes. Many organizations export logs from multiple subscriptions into central workspaces (or per-region workspaces) to support a platform view.

14) What KQL table should I query for control-plane events?

If you export Activity Log to Log Analytics, you’ll typically query the AzureActivity table.

15) What if Foundry resource logs aren’t available?

Rely on Activity Log (control plane) plus health signals and policy logs. For deeper telemetry, implement data-plane observability in your applications and AI services.


17. Top Online Resources to Learn Observability in Foundry Control Plane

| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Azure Monitor overview: https://learn.microsoft.com/azure/azure-monitor/overview | Foundation for metrics, logs, alerts, and visualization in Azure |
| Official documentation | Azure Activity log: https://learn.microsoft.com/azure/azure-monitor/essentials/activity-log | Core control-plane event source for subscriptions |
| Official documentation | Diagnostic settings: https://learn.microsoft.com/azure/azure-monitor/essentials/diagnostic-settings | How to route Activity Log and resource logs to Log Analytics/Storage/Event Hubs |
| Official documentation | Log Analytics workspace overview: https://learn.microsoft.com/azure/azure-monitor/logs/log-analytics-workspace-overview | How to design and operate a workspace |
| Official documentation | KQL query overview: https://learn.microsoft.com/azure/azure-monitor/logs/log-query-overview | How to query control-plane logs effectively |
| Official documentation | Azure Monitor alerts: https://learn.microsoft.com/azure/azure-monitor/alerts/alerts-overview | How to create actionable alerts from logs/metrics |
| Official documentation | Azure Monitor workbooks: https://learn.microsoft.com/azure/azure-monitor/visualize/workbooks-overview | How to build dashboards for operations |
| Official documentation | Azure Service Health: https://learn.microsoft.com/azure/service-health/overview | Platform incident visibility for triage |
| Official documentation | Microsoft Sentinel overview: https://learn.microsoft.com/azure/sentinel/overview | SIEM/SOAR option for security-driven observability |
| Official pricing | Azure Monitor pricing: https://azure.microsoft.com/pricing/details/monitor/ | Understand ingestion/retention/alerting cost model |
| Official tool | Azure Pricing Calculator: https://azure.microsoft.com/pricing/calculator/ | Build region-specific cost estimates |

Foundry-specific documentation links can change with product naming. If you cannot find “Foundry Control Plane” by that name, search Microsoft Learn for “Azure AI Foundry” + “monitoring” + “diagnostic settings” and use the relevant resource provider documentation.


18. Training and Certification Providers

| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, SREs, platform teams | Azure operations, monitoring, DevOps practices, CI/CD integration | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate engineers | DevOps fundamentals, tooling, process, and governance | Check website | https://www.scmgalaxy.com/ |
| CloudOpsNow.in | Cloud operations teams | Cloud operations patterns, monitoring, reliability | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, reliability engineers | SRE principles, alerting strategy, incident response | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops + AI teams | AIOps concepts, monitoring automation, operational analytics | Check website | https://www.aiopsschool.com/ |

19. Top Trainers

| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/Cloud training content (verify scope) | Beginners to intermediate | https://www.rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and coaching (verify offerings) | DevOps engineers, SREs | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps help/training (verify offerings) | Teams needing short-term support | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support and guidance (verify offerings) | Ops teams and engineers | https://www.devopssupport.in/ |

20. Top Consulting Companies

| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify exact services) | Platform engineering, operational readiness, monitoring foundations | Central logging design, alerting standards, IaC observability rollout | https://www.cotocus.com/ |
| DevOpsSchool.com | DevOps consulting and training | DevOps/SRE enablement, monitoring practices | Implement Azure Monitor baselines, dashboards, CI/CD guardrails | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify exact services) | DevOps processes, automation, operations | Activity log export rollout, RBAC governance, incident response runbooks | https://www.devopsconsulting.in/ |

21. Career and Learning Roadmap

What to learn before this service

  • Azure fundamentals: subscriptions, resource groups, regions
  • Microsoft Entra ID basics and Azure RBAC
  • Azure Resource Manager concepts (deployments, resource providers)
  • Azure Monitor basics (metrics vs logs, diagnostic settings)

What to learn after this service

  • Advanced KQL (joins, parsing, workbook parameters)
  • Microsoft Sentinel detections and incident workflows
  • IaC-based monitoring (Bicep/Terraform modules for diagnostic settings and alerts)
  • Data-plane observability for AI apps:
    • Application Insights
    • OpenTelemetry tracing patterns
    • SLOs/SLIs and error budgets

Job roles that use it

  • Cloud engineer / platform engineer
  • DevOps engineer
  • SRE
  • Security engineer / SOC analyst
  • Solutions architect (AI platform governance)
  • FinOps practitioner (logging cost governance)

Certification path (Azure)

There is no single “Foundry control plane observability” certification. Helpful Microsoft certifications (verify current names/availability):

  • Azure Administrator (operations and monitoring foundations)
  • Azure Security Engineer (security monitoring and governance)
  • Azure Solutions Architect (architecture and platform design)

Project ideas for practice

  • Build a “Foundry platform change timeline” workbook (last 7/30/90 days)
  • Create an alert pack:
    • role assignment changes
    • delete operations
    • diagnostic settings changes
    • repeated failed write operations
  • Implement a multi-subscription export pattern with standardized retention and tags

22. Glossary

  • Observability: The ability to understand a system’s internal state using outputs like logs, metrics, and traces.
  • Control plane: Management operations (create/update/delete/configure) performed via Azure Resource Manager.
  • Data plane: Runtime operations (e.g., application requests, model inference calls).
  • Azure Activity Log: Subscription-level log of control-plane events.
  • Diagnostic settings: Azure mechanism to route logs/metrics to Log Analytics, Storage, or Event Hubs.
  • Log Analytics workspace: Azure Monitor Logs store for querying and retention.
  • KQL (Kusto Query Language): Query language used for Azure Monitor Logs.
  • Azure Monitor: Azure’s platform for metrics, logs, alerts, and dashboards.
  • Workbook: Azure Monitor visualization artifact built from queries and parameters.
  • Alert rule: Condition that triggers notifications/actions based on logs or metrics.
  • Action group: Notification and automation targets for alert rules (email, webhook, ITSM, etc.).
  • RBAC: Role-Based Access Control for authorization in Azure.
  • Azure Policy: Governance service for enforcing rules and compliance.
  • Service Health: Azure service providing incident and maintenance notifications.
  • Resource Health: Health status for a specific Azure resource.
  • SIEM: Security Information and Event Management system (e.g., Microsoft Sentinel).
  • Retention: How long logs are stored and searchable.

23. Summary

Observability in Foundry Control Plane (Azure) is the discipline of capturing and operationalizing control-plane telemetry—especially Activity Logs, diagnostic exports, queries, dashboards, and alerts—so you can reliably answer what changed, who changed it, and how it impacted your AI platform.

It matters because AI + Machine Learning platforms are configuration-heavy and security-sensitive; control-plane visibility reduces outages, improves audit readiness, and strengthens governance. Cost is primarily driven by Log Analytics ingestion and retention, plus optional SIEM and streaming. Security hinges on RBAC, protecting diagnostic settings, and alerting on risky admin actions.

Use it when you operate Foundry-based AI environments beyond basic prototypes—especially in shared, regulated, or fast-changing production platforms. Next step: expand from baseline Activity Log export to targeted resource logs, workbooks, and high-signal alert packs, then integrate with Microsoft Sentinel if SOC workflows are required.