Azure Observability in Foundry Control Plane Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for AI + Machine Learning

Category

AI + Machine Learning

1. Introduction

Observability in Foundry Control Plane in Azure is about gaining reliable visibility into what the platform is doing when you build, configure, secure, and operate AI systems—especially the management-plane actions that create, update, deploy, and govern AI resources.

In simple terms: it helps you answer “what changed, who changed it, when, and what happened next?” for your AI platform setup. This includes tracking administrative operations, policy outcomes, service health signals, and operational logs that explain why an AI environment is healthy, degraded, or failing.

Technically, “Observability in Foundry Control Plane” is not typically a single standalone Azure resource with its own billing meter. Instead, it is best understood as the set of observability signals and integrations that cover Foundry-related control-plane operations—implemented through standard Azure observability building blocks such as:

  • Azure Monitor (metrics, alerts, workbooks)
  • Azure Monitor Logs / Log Analytics (central log store + KQL queries)
  • Azure Activity Log (subscription-level control-plane events)
  • Diagnostic settings (routing platform logs to Log Analytics / Storage / Event Hubs)
  • Optional integrations like Microsoft Sentinel (SIEM) and ITSM connectors

What problem does it solve? It reduces the risk and toil caused by “invisible” platform changes and failures—like unexpected access changes, deployments that don’t take effect, policy blocks, quota issues, or regional incidents—by giving you auditability, troubleshooting data, and actionable alerts for the AI platform control plane.

Naming note (verify in official docs): Microsoft’s AI platform branding has evolved (for example, Azure AI Studio and Azure AI Foundry naming). This tutorial treats “Observability in Foundry Control Plane” as the observability scope for Foundry’s management plane and shows how to implement it using current Azure Monitor capabilities. If your tenant uses different portal names, follow the equivalent resources and blades.


2. What is Observability in Foundry Control Plane?

Official purpose (practical definition aligned with Azure)

Observability in Foundry Control Plane is the practice and implementation of collecting, centralizing, analyzing, and alerting on control-plane signals related to Foundry-based AI platform resources in Azure.

Because control-plane operations in Azure are fundamentally governed by Azure Resource Manager (ARM), most “control plane observability” relies on:

  • Azure Activity Log for subscription-level events (create/update/delete, RBAC changes, policy actions)
  • Resource logs (when supported by specific resource types) routed using Diagnostic settings
  • Service Health / Resource Health for platform and regional incidents
  • Azure Monitor alerts and dashboards to detect, notify, and triage issues

Core capabilities

In a Foundry control-plane context, observability typically includes:

  • Audit trail of administrative actions
    • Who created/updated/deleted AI resources and configurations
    • Who changed access, keys, networking, or identity settings
  • Policy and governance visibility
    • Policy compliance results and “deny” outcomes
    • Drift detection for “approved” configurations
  • Operational troubleshooting
    • Correlating a deployment/configuration change to an outage
    • Explaining authorization failures (RBAC), networking blocks, quota failures, or region issues
  • Alerting and reporting
    • Alerts for suspicious or risky control-plane actions (deletions, public network enablement, key rotations)
    • Periodic reporting and dashboards for platform operations
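Once Activity Log data lands in Log Analytics (set up later in this tutorial), the audit-trail capability above reduces to a short KQL query. A sketch, assuming the standard AzureActivity schema produced by Activity Log export:

```kusto
// Who performed write/delete operations in the last 24 hours, and how often
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue endswith "/write" or OperationNameValue endswith "/delete"
| summarize Operations = count() by Caller, OperationNameValue
| order by Operations desc
```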

Major components (Azure building blocks)

The most common components used to implement Observability in Foundry Control Plane are:

  • Azure Activity Log (subscription scope)
  • Log Analytics workspace (central log store)
  • Diagnostic settings (routing control-plane logs to sinks)
  • Azure Monitor Alerts (metric alerts, log alerts)
  • Azure Monitor Workbooks (dashboards)
  • Optional:
    • Microsoft Sentinel (security analytics, incident management)
    • Event Hubs (stream logs to external platforms)
    • Storage accounts (long retention/archival)
    • Azure Managed Grafana (visualization, when appropriate)

Service type

Observability in Foundry Control Plane is best viewed as a solution pattern implemented using Azure’s native observability services. It is not usually purchased as a single SKU.

Scope: regional/global/subscription

  • Azure Activity Log is subscription-scoped and not tied to a single region.
  • Log Analytics workspaces are regional resources (you choose a region).
  • Service Health is global and tenant/subscription contextual.

How it fits into the Azure ecosystem (AI + Machine Learning)

Foundry-based AI systems frequently rely on a mix of services (for example: model endpoints, orchestration, data stores, networking, identity). Foundry control-plane observability connects those operations back to:

  • Identity (Microsoft Entra ID) and RBAC decisions
  • ARM deployments (Bicep/Terraform/Portal changes)
  • Policy enforcement (Azure Policy)
  • Operational governance (tagging, naming, budget alerts, resource locks)

This is especially important in AI + Machine Learning, where misconfiguration can create:

  • data exposure risks
  • runaway costs
  • model deployment failures
  • compliance gaps


3. Why use Observability in Foundry Control Plane?

Business reasons

  • Reduce downtime and incident duration: Faster root cause analysis when you can correlate outages with recent control-plane changes.
  • Lower operational risk: Catch risky actions early (e.g., public network enabled, diagnostic logs disabled, key vault access changed).
  • Improve audit readiness: Maintain traceability of changes for regulated workloads.

Technical reasons

  • Single source of truth for change events: Centralize control-plane events and resource logs into Log Analytics.
  • Correlation across services: Track changes across AI resources, networking, identity, and data services in one timeline.
  • Evidence-based troubleshooting: Replace guesswork with logs and structured events.

Operational reasons (SRE/Platform/DevOps)

  • Actionable alerting: Notify the right teams on destructive operations, policy denies, or repeated failures.
  • Operational dashboards: Workbooks for recurring operational questions (who changed what, what failed, what’s trending).
  • Change management integration: Stream audit logs to SIEM/ITSM tools.

Security/compliance reasons

  • Detect unauthorized or unexpected changes: RBAC, identity, and network posture changes are common sources of security incidents.
  • Support least privilege: Use logs to validate that roles and permissions are used as intended.
  • Retention controls: Store logs to meet regulatory retention requirements (often via Storage or Sentinel).

Scalability/performance reasons

Control-plane observability helps scaling indirectly:

  • When you scale AI systems, you create more resources, deployments, and changes; observability prevents that scale from turning into chaos.
  • Alerting on throttling/quota and policy issues helps prevent repeated failed rollouts.

When teams should choose it

Choose Observability in Foundry Control Plane when:

  • You operate AI environments in shared subscriptions or landing zones.
  • You need audit trails and governance evidence.
  • Multiple teams deploy models and services frequently.
  • You must respond to incidents quickly and consistently.

When teams should not choose it

You may not need a full control-plane observability implementation if:

  • You are running a short-lived prototype in a sandbox with no compliance requirements.
  • You have a single developer and minimal change frequency.
  • You do not retain resources beyond a few days.

Even then, enabling basic Activity Log routing is usually low-effort and pays off quickly.


4. Where is Observability in Foundry Control Plane used?

Industries

  • Finance and insurance (auditability, change control)
  • Healthcare and life sciences (compliance, access tracking)
  • Retail and e-commerce (availability + rapid releases)
  • Manufacturing (operational reliability, OT/IT boundaries)
  • Public sector (policy enforcement, retention requirements)
  • SaaS/ISVs building AI features (multi-tenant governance)

Team types

  • Platform engineering teams operating Azure landing zones
  • SRE/Operations teams managing incident response
  • Security engineering and SOC teams
  • AI/ML engineering teams deploying models at scale
  • DevOps teams managing CI/CD and infrastructure as code

Workloads and architectures

  • Hub-and-spoke networks with private endpoints
  • Multi-subscription environments with centralized logging
  • Production AI platforms with strict role separation
  • Regulated environments using Azure Policy and Sentinel

Production vs dev/test

  • Dev/test: Focus on rapid debugging, basic change tracking, cost guardrails.
  • Production: Add retention, SIEM integration, strict alerting, and governance reporting.

5. Top Use Cases and Scenarios

Below are realistic scenarios where Observability in Foundry Control Plane is directly useful.

1) Audit “who changed the model deployment configuration”

  • Problem: A model endpoint starts returning errors after a configuration change.
  • Why this fits: Control-plane logs reveal the change operation, identity, time, and target resource.
  • Example: An engineer updates a deployment SKU or networking setting; Activity Log shows the update event and the caller.

2) Alert on destructive actions (delete, purge, disable logging)

  • Problem: Critical AI resources are deleted or logging is turned off.
  • Why this fits: Log alerts can detect delete operations or diagnostic settings changes.
  • Example: Alert when a resource delete occurs under AI resource groups, page the on-call team.

3) Detect RBAC drift and privilege escalation

  • Problem: Unexpected access grants appear on AI resources or resource groups.
  • Why this fits: Activity Log captures role assignment changes.
  • Example: Notify security when “Owner” is assigned to a non-approved group.
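Role assignment changes surface in the Activity Log under the Microsoft.Authorization resource provider. A sketch of the detection query (operation names assume the standard Activity Log export schema; verify against events in your own workspace):

```kusto
// Role assignments created or removed in the last 24 hours
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue in~ ("Microsoft.Authorization/roleAssignments/write",
                                "Microsoft.Authorization/roleAssignments/delete")
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue, ResourceId
| order by TimeGenerated desc
```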

4) Troubleshoot policy denies that block deployments

  • Problem: A pipeline fails with a vague “forbidden” error.
  • Why this fits: Policy events and Activity Log entries help identify the policy assignment causing the deny.
  • Example: A policy requiring private endpoints blocks a deployment; logs show the policy name and assignment scope.
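Policy denies are recorded in the Activity Log’s Policy category, and the Properties column carries the policy assignment details. A starting-point query (exact operation names vary by effect and provider; confirm against your logs):

```kusto
// Recent policy deny events, with properties identifying the policy assignment
AzureActivity
| where TimeGenerated > ago(24h)
| where CategoryValue == "Policy" and OperationNameValue has "deny"
| project TimeGenerated, Caller, OperationNameValue, ResourceId, Properties
| order by TimeGenerated desc
```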

5) Quota and capacity incident correlation

  • Problem: Deployments fail intermittently due to quota/capacity constraints.
  • Why this fits: Control-plane failure events plus service health context can guide remediation.
  • Example: Activity Log shows repeated “failed” create operations; correlate with region service health advisory.

6) Build an operational “AI platform change timeline”

  • Problem: Incident reviews require a consistent timeline of changes across resources.
  • Why this fits: Centralized logs let you query by time range and resource group tags.
  • Example: A workbook shows all create/update/delete operations in the last 24 hours for the AI platform.

7) Multi-team governance reporting (chargeback/showback support)

  • Problem: Leadership asks which teams are creating AI resources and whether they follow standards.
  • Why this fits: Control-plane logs + tags provide evidence for reporting.
  • Example: Report top resource creators per subscription and whether tagging policies were satisfied.
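A sketch of the “top creators” part of that report, built from the same AzureActivity data (column names per the standard export schema):

```kusto
// Callers who successfully modified the most distinct resources in 30 days
AzureActivity
| where TimeGenerated > ago(30d)
| where OperationNameValue endswith "/write" and ActivityStatusValue == "Succeeded"
| summarize ResourcesTouched = dcount(ResourceId) by Caller
| top 10 by ResourcesTouched
```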

8) Incident response automation with Sentinel

  • Problem: SOC needs to detect suspicious admin activity and open incidents.
  • Why this fits: Stream logs to Microsoft Sentinel for correlation and automated response.
  • Example: Sentinel rule triggers when multiple role changes happen outside business hours.

9) Validate infrastructure-as-code deployments

  • Problem: You want proof that a CI/CD pipeline applied the intended changes.
  • Why this fits: Activity Log shows deployment operations and outcomes.
  • Example: Confirm that a Bicep deployment updated diagnostic settings and network rules.

10) Prove compliance for regulated AI environments

  • Problem: Auditors require evidence of access control, retention, and change tracking.
  • Why this fits: Centralized logs + retention policies + audit trails support compliance.
  • Example: Provide evidence of RBAC changes, key rotations, and policy compliance over time.

6. Core Features

Because Observability in Foundry Control Plane is typically implemented using Azure Monitor primitives, the “features” are best described as the capabilities you enable.

Feature 1: Subscription-level control-plane event capture (Azure Activity Log)

  • What it does: Captures administrative events such as create/update/delete operations, RBAC changes, policy actions, and service health notifications at subscription scope.
  • Why it matters: Most critical AI platform incidents involve “what changed” in the control plane.
  • Practical benefit: A single timeline for changes across Foundry-related resources.
  • Caveat: Activity Log retention in the portal is limited; for longer retention you must export via diagnostic settings.

Feature 2: Diagnostic settings routing to Log Analytics / Storage / Event Hubs

  • What it does: Exports logs to one or more sinks for retention, analysis, or streaming.
  • Why it matters: Centralization is required for cross-resource correlation and alerting.
  • Practical benefit: Query across subscriptions/workloads; store long term; feed SIEM.
  • Caveat: Not every resource type exposes the same resource logs/metrics categories. Verify per resource in Azure portal.

Feature 3: Centralized log search and analytics (Log Analytics + KQL)

  • What it does: Stores logs and allows querying with Kusto Query Language (KQL).
  • Why it matters: Control-plane troubleshooting often needs filtering by caller, operation, resource, status, correlation ID.
  • Practical benefit: Fast investigations and reusable queries for SRE runbooks.
  • Caveat: Costs depend on ingestion and retention; implement filters/retention tiers carefully.
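For example, once one event’s CorrelationId is known, every record from the same logical operation can be pulled in a single query (substitute a real value from an earlier result):

```kusto
// All Activity Log records belonging to one logical operation
AzureActivity
| where CorrelationId == "00000000-0000-0000-0000-000000000000"  // substitute a real value
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceId
| order by TimeGenerated asc
```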

Feature 4: Alerting on risky or anomalous control-plane events (Azure Monitor Alerts)

  • What it does: Generates notifications/incidents when queries match conditions (log alerts) or metrics cross thresholds.
  • Why it matters: You shouldn’t learn about deletions, access changes, or policy denies from users.
  • Practical benefit: Proactive operations and security response.
  • Caveat: Poorly tuned alerts create noise. Start with a small set of high-signal detections.

Feature 5: Dashboards and reporting (Azure Monitor Workbooks)

  • What it does: Visualizes queries and metrics with parameterized dashboards.
  • Why it matters: Platform operations need repeatable “daily view” dashboards.
  • Practical benefit: Self-service visibility for engineers and stakeholders.
  • Caveat: Workbooks are only as good as the underlying log hygiene (tags, consistent scopes, routed logs).

Feature 6: Service Health / Resource Health integration

  • What it does: Provides Azure platform incident notifications and per-resource health signals.
  • Why it matters: Separates “our change broke it” from “Azure incident is impacting it.”
  • Practical benefit: Faster triage and clearer comms during outages.
  • Caveat: Health signals are not a substitute for your app/data-plane monitoring—use both.

Feature 7: Governance visibility (Policy + Activity Log + optional compliance reporting)

  • What it does: Shows what policies were evaluated, denied, or remediated.
  • Why it matters: Foundry control plane often must enforce private networking, encryption, tagging, and restricted SKUs.
  • Practical benefit: Clear evidence of enforcement and drift.
  • Caveat: Policy event coverage and details vary by resource provider and policy effect. Validate policy logging behavior.

Feature 8: Security analytics via SIEM (optional Microsoft Sentinel)

  • What it does: Correlates events, applies detections, and manages incidents.
  • Why it matters: AI platforms are high-value targets; admin actions are high-signal events.
  • Practical benefit: SOC-ready detections and incident workflows.
  • Caveat: Additional cost and operational ownership required; don’t forward everything without a plan.

7. Architecture and How It Works

High-level architecture

Observability in Foundry Control Plane follows a straightforward pattern:

  1. Control-plane events occur whenever someone or something (portal, CLI, IaC pipeline) performs ARM operations on Foundry-related resources.
  2. Azure emits:
     • Activity Log events at subscription scope
     • Optional resource logs/metrics for specific resources (where supported)
  3. Diagnostic settings export these signals to:
     • Log Analytics for query/alert/dashboard
     • Storage for archival/retention
     • Event Hubs for streaming to third-party tools
  4. Azure Monitor evaluates alert rules and triggers notifications/actions.

Request/data/control flow (what flows where)

  • Control flow: User/CI → ARM → Resource Provider (AI/ML services)
  • Telemetry flow:
    • ARM writes Activity Log events
    • The resource provider may emit resource logs/metrics
    • Diagnostic settings route telemetry to Log Analytics / Storage / Event Hubs

Integrations with related services

Common integrations in Azure AI + Machine Learning environments include:

  • Microsoft Entra ID: identity and authentication to Azure
  • Azure RBAC: authorization decisions
  • Azure Policy: governance controls; policy deny events affect deployments
  • Private Link / Private Endpoints: networking posture; changes are critical to observe
  • Key Vault: secrets and keys; access changes should be monitored
  • Azure DevOps / GitHub Actions: IaC pipelines generating control-plane events
  • Microsoft Sentinel: SIEM for high-value control-plane events

Dependency services

To implement this pattern you typically need:

  • A Log Analytics workspace
  • Azure Monitor alert rules
  • Diagnostic settings at subscription/resource scope

Security/authentication model

  • Authentication uses Microsoft Entra ID
  • Authorization uses Azure RBAC
  • Access to logs is governed by:
    • Log Analytics workspace RBAC (Log Analytics Reader/Contributor)
    • Azure Monitor roles (Monitoring Reader/Contributor)
    • Azure subscription/resource group RBAC

Networking model

  • Activity Log and Log Analytics are Azure services accessed over Azure’s public endpoints by default.
  • You can harden access with:
    • Private Link options (availability varies by service; verify in official docs)
    • Network restrictions and firewall rules where supported
    • Restricting who can read logs via RBAC, rather than relying only on network controls

Monitoring/logging/governance considerations

  • Decide which subscriptions and which resource groups represent Foundry platform boundaries.
  • Standardize:
    • naming conventions (to filter queries)
    • tagging (owner, env, cost center)
    • retention strategy (hot vs archive)
  • Treat “disable diagnostic settings” as a high-severity event: alert on it.
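That last detection can be expressed against the Microsoft.Insights provider operations. A sketch (verify the operation names against events in your workspace):

```kusto
// Changes to diagnostic settings are themselves a high-severity signal
AzureActivity
| where TimeGenerated > ago(1h)
| where OperationNameValue in~ ("Microsoft.Insights/diagnosticSettings/write",
                                "Microsoft.Insights/diagnosticSettings/delete")
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue, ResourceId
```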

Simple architecture diagram

flowchart LR
  U[Engineer / CI Pipeline] --> ARM[Azure Resource Manager]
  ARM --> RP[Foundry-related Resource Providers]
  ARM --> AL[Azure Activity Log]
  RP --> RL["Resource Logs / Metrics<br/>(when supported)"]

  AL --> DS[Diagnostic Settings]
  RL --> DS

  DS --> LAW[Log Analytics Workspace]
  LAW --> AM[Azure Monitor Alerts]
  LAW --> WB[Workbooks / Dashboards]
  AM --> N[Notifications / ITSM / Webhook]

Production-style architecture diagram

flowchart TB
  subgraph Management["Management & Governance"]
    AAD[Microsoft Entra ID]
    RBAC[Azure RBAC]
    POL[Azure Policy]
    SH[Service Health / Resource Health]
  end

  subgraph Platform["Foundry Platform Subscriptions"]
    CI[GitHub Actions / Azure DevOps]
    ARM[Azure Resource Manager]
    AI["AI + ML Resources<br/>(Foundry-related)"]
    KV[Key Vault]
    NET["Networking<br/>(VNet/Private Endpoints)"]
  end

  subgraph Observability["Central Observability Subscription"]
    DS["Diagnostic Settings<br/>(Subscription + Resource)"]
    LAW[Log Analytics Workspace]
    STO["Storage Account (Archive)"]
    EH["Event Hubs (Streaming)"]
    WB[Azure Monitor Workbooks]
    ALRT[Azure Monitor Alerts]
    SENT["Microsoft Sentinel (Optional)"]
  end

  CI --> ARM
  ARM --> AI
  ARM --> KV
  ARM --> NET

  AAD --> ARM
  RBAC --> ARM
  POL --> ARM

  ARM -->|Control-plane events| DS
  AI -->|Resource logs/metrics| DS
  SH --> DS

  DS --> LAW
  DS --> STO
  DS --> EH

  LAW --> WB
  LAW --> ALRT
  LAW --> SENT
  EH --> SENT

8. Prerequisites

Account/subscription requirements

  • An Azure subscription where you can:
    • Configure diagnostic settings at the subscription level, and/or
    • Configure diagnostic settings on Foundry-related resources

Permissions (IAM roles)

Typical minimum roles (scope varies by where you configure things):

  • To create a Log Analytics workspace: Contributor on a resource group (or higher)
  • To configure diagnostic settings:
    • Owner or Contributor at the subscription/resource scope is commonly required
    • Some environments use a dedicated role with monitoring permissions; verify your org’s RBAC model
  • To query logs: Log Analytics Reader
  • To create alerts: Monitoring Contributor

Billing requirements

  • A payment method enabled for Azure Monitor Logs ingestion/retention and alerting.

Tools

  • Azure Portal
  • Azure CLI (az)
    Install: https://learn.microsoft.com/cli/azure/install-azure-cli

Region availability

  • Log Analytics workspace is regional; choose a region consistent with your data residency requirements.
  • Activity Log is subscription-level and not tied to one region.

Quotas/limits (verify in official docs)

  • Log Analytics ingestion/retention constraints
  • Alert rules per subscription/workspace limits
  • Diagnostic settings per resource limits

Prerequisite services

  • Azure Monitor
  • Log Analytics workspace (recommended for the lab)

9. Pricing / Cost

Observability in Foundry Control Plane is priced through the Azure services you use to store, query, and act on telemetry—not usually as a standalone “Foundry observability” SKU.

Primary pricing dimensions (what you pay for)

  1. Azure Monitor Logs (Log Analytics)
     • Data ingestion (GB/day)
     • Retention (days stored in the workspace)
     • Optional archive and restore costs (where used)
  2. Alerting
     • Log alerts may incur charges depending on alert type and evaluation frequency (verify current Azure Monitor pricing details).
  3. Data export / streaming
     • Event Hubs throughput units and retention (if streaming)
     • Storage costs for archived logs (capacity + transactions)
  4. SIEM (optional)
     • Microsoft Sentinel charges (typically based on data ingestion/retention)

Official pricing:

  • Azure Monitor pricing: https://azure.microsoft.com/pricing/details/monitor/
  • Azure Pricing Calculator: https://azure.microsoft.com/pricing/calculator/

Free tier (what may be free)

Azure pricing changes over time. Some aspects that are commonly “included” or low-cost:

  • Viewing recent Activity Log entries in the portal (limited retention)
  • Some basic platform logs may not incur additional charges until exported/ingested

Verify in official docs for the current free allowances for Log Analytics ingestion and retention in your region.

Cost drivers (most important)

  • High-volume Activity Log export across many subscriptions
  • Verbose resource logs exported at high frequency
  • Long retention periods kept in hot storage
  • Unfiltered logs streamed to multiple sinks (Log Analytics + Event Hubs + Storage)
  • Noisy alerts evaluated too frequently

Hidden or indirect costs

  • Cross-team access: More users querying logs may increase operational load (not a direct cost, but real toil).
  • Data egress: Streaming to third-party tools may incur network charges depending on architecture.
  • Retention compliance: Long-term retention in hot tier can be expensive; storage archive patterns may be cheaper.

Network/data transfer implications

  • Exporting logs to Event Hubs and then to non-Azure tools can introduce egress charges.
  • Centralized logging across regions may create additional complexity. Prefer regionally aligned workspaces where required by policy.

How to optimize cost (without losing auditability)

  • Start with Activity Log export only; add resource logs selectively.
  • Use short hot retention in Log Analytics + archive to Storage for long retention (verify the recommended approach in current Azure docs).
  • Reduce alert frequency; use high-signal conditions.
  • Use KQL to focus on:
    • specific resource groups
    • specific operation names (delete, write, role assignments)
    • failures only (when appropriate)

Example low-cost starter estimate (model, not numbers)

A minimal setup for a small team:

  • 1 Log Analytics workspace
  • Subscription Activity Log routed to the workspace
  • 2–5 log alerts (delete operations, RBAC changes)

Main cost components:

  • Workspace ingestion from Activity Log volume
  • Retention days chosen
  • Alert evaluation frequency

Because the exact price per GB and alert charges vary by region and plan, use the Azure Pricing Calculator to estimate with your expected GB/day.
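The arithmetic itself is simple; the sketch below uses a deliberately made-up per-GB rate (not a real Azure price) just to show the shape of the estimate:

```shell
# Back-of-envelope monthly ingestion cost. PRICE_PER_GB is a placeholder,
# NOT a real Azure rate -- look up your region in the Pricing Calculator.
GB_PER_DAY=0.5          # expected Activity Log ingestion volume
PRICE_PER_GB=3.00       # hypothetical analytics-logs rate, USD/GB
DAYS_PER_MONTH=30
awk -v g="$GB_PER_DAY" -v p="$PRICE_PER_GB" -v d="$DAYS_PER_MONTH" \
  'BEGIN { printf "Estimated monthly ingestion cost: $%.2f\n", g * d * p }'
# prints: Estimated monthly ingestion cost: $45.00
```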

Example production cost considerations

In production, cost planning should include:

  • A central workspace per region or per landing zone
  • Storage archive for multi-year retention
  • Sentinel (if a SOC is required)
  • Event Hubs streaming to an enterprise SIEM
  • Multiple workbooks and alerts
  • Budget alerts and cost anomaly detection (FinOps)


10. Step-by-Step Hands-On Tutorial

Objective

Implement a practical baseline for Observability in Foundry Control Plane by:

  1. Creating a Log Analytics workspace
  2. Exporting Azure Activity Log to that workspace (subscription-level control-plane visibility)
  3. Running KQL queries to inspect Foundry-related control-plane events (by filtering to AI/ML resource providers)
  4. Creating a basic alert for a high-risk control-plane action
  5. Cleaning up to avoid ongoing cost

This lab is designed to be safe and low-cost. You will generate only a small number of control-plane events.

Lab Overview

You will:

  • Create a resource group and Log Analytics workspace.
  • Configure a subscription diagnostic setting to send Activity Logs to Log Analytics.
  • Generate a control-plane event by creating and deleting a small Azure resource (use a minimal AI/ML-related resource if available in your subscription; otherwise any resource will still validate the pipeline).
  • Query Activity Log data in Log Analytics.
  • Create an alert for delete operations.

Note: Foundry-specific resource types vary by tenant and by how your organization provisions AI services. The Activity Log approach still applies because it captures ARM operations across resource providers.


Step 1: Create a resource group

Action (Azure CLI):

az account show
az group create \
  --name rg-foundry-observability-lab \
  --location eastus

Expected outcome: – A resource group named rg-foundry-observability-lab exists.

Verification:

az group show --name rg-foundry-observability-lab --query "{name:name, location:location}" -o table

Step 2: Create a Log Analytics workspace

Action (Azure CLI):

az monitor log-analytics workspace create \
  --resource-group rg-foundry-observability-lab \
  --workspace-name law-foundry-obsv-lab \
  --location eastus

Expected outcome: – A Log Analytics workspace is created.

Verification:

az monitor log-analytics workspace show \
  --resource-group rg-foundry-observability-lab \
  --workspace-name law-foundry-obsv-lab \
  --query "{name:name, customerId:customerId, location:location}" -o table

Step 3: Export Azure Activity Log to Log Analytics (subscription diagnostic setting)

This is the key step for control-plane observability.

Option A (recommended): Azure Portal

  1. Go to Monitor in the Azure portal.
  2. Navigate to Activity log.
  3. Select Export Activity Logs (or Diagnostic settings depending on portal layout).
  4. Create a diagnostic setting:
     • Destination: Send to Log Analytics workspace
     • Select your workspace: law-foundry-obsv-lab
     • Categories to include (typical baseline):
       • Administrative
       • Policy
       • Security
       • ServiceHealth
       • ResourceHealth
       • Alert (if available)
       • Recommendation (if available)

Expected outcome: – A diagnostic setting exists for the subscription that exports Activity Logs to Log Analytics.

Option B: Azure CLI (if available in your environment)

Azure CLI support for subscription diagnostic settings can vary by CLI version/extension. If the following commands fail, use the portal.

1) Get your subscription ID:

SUB_ID=$(az account show --query id -o tsv)
echo $SUB_ID

2) Create the subscription diagnostic setting (command group may vary; verify in official docs if it differs):

LAW_ID=$(az monitor log-analytics workspace show \
  --resource-group rg-foundry-observability-lab \
  --workspace-name law-foundry-obsv-lab \
  --query id -o tsv)

az monitor diagnostic-settings subscription create \
  --name ds-activitylog-to-law \
  --subscription $SUB_ID \
  --location eastus \
  --workspace $LAW_ID \
  --logs '[
    {"category":"Administrative","enabled":true},
    {"category":"Policy","enabled":true},
    {"category":"Security","enabled":true},
    {"category":"ServiceHealth","enabled":true},
    {"category":"ResourceHealth","enabled":true}
  ]'

Expected outcome: – Activity Log events begin flowing to Log Analytics (may take a few minutes).

Verification (Portal): – Go to the workspace → Logs → run a query (next step).


Step 4: Generate a control-plane event

To validate end-to-end, create a small resource. If your subscription allows provisioning an AI/ML resource you normally use with Foundry, prefer that (because it will generate provider-specific events). If not, any Azure resource will still prove the control-plane logging pipeline.

Example (safe and generally available): Create and delete a Storage account

Storage is not “AI”, but the Activity Log pipeline is identical and confirms your setup.

ST_NAME=stfoundryobsv$RANDOM
az storage account create \
  --name $ST_NAME \
  --resource-group rg-foundry-observability-lab \
  --location eastus \
  --sku Standard_LRS

Wait ~1–3 minutes, then delete it:

az storage account delete --name $ST_NAME --resource-group rg-foundry-observability-lab --yes

Expected outcome: – You generated at least two Activity Log events: a create and a delete.


Step 5: Query Activity Log data in Log Analytics (KQL)

In the Azure portal:

  1. Open your Log Analytics workspace: law-foundry-obsv-lab
  2. Select Logs
  3. Run this query:

AzureActivity
| where TimeGenerated > ago(1h)
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceGroup, ResourceProviderValue, ResourceId
| order by TimeGenerated desc

Expected outcome: – You see recent control-plane operations, including your storage create/delete (or other resource operations).

Filter to AI/ML-related providers (examples)

Depending on what you use with Foundry, you might filter to providers like these. Use what matches your environment.

AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceProviderValue has_any ("Microsoft.MachineLearningServices", "Microsoft.CognitiveServices")
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Caller, ResourceGroup, ResourceId
| order by TimeGenerated desc

Expected outcome:

  • If you have AI/ML resources and recent operations, you’ll see them here.
  • If not, you’ll get zero results, meaning you need to generate an AI/ML operation in your subscription to validate provider-specific coverage.


Step 6: Create a high-signal alert for delete operations

A practical baseline is to alert on any delete operation in your Foundry platform resource group(s).

Create a log alert (Portal method)

  1. Go to Monitor → Alerts → Create → Alert rule
  2. Scope: select your Log Analytics workspace
  3. Condition: Custom log search
  4. Use this query:

AzureActivity
| where TimeGenerated > ago(10m)
| where OperationNameValue endswith "/delete"
| where ResourceGroup == "rg-foundry-observability-lab"

  5. Set:
    • Evaluation frequency: e.g., 5 minutes
    • Lookback period: e.g., 10 minutes
    • Threshold: greater than 0
  6. Action group: email yourself (and/or webhook/ITSM connector)
  7. Name: alert-delete-ops-rg-foundry-observability-lab

Expected outcome: if a delete happens in the resource group, the alert fires.
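Role assignment changes are another high-signal condition worth a sibling alert rule. A query sketch, assuming the standard Microsoft.Authorization operation names (verify the exact names emitted in your tenant’s Activity Log):

```kql
// Role assignment creates and deletes in the evaluation window
AzureActivity
| where TimeGenerated > ago(10m)
| where OperationNameValue in~ ("Microsoft.Authorization/roleAssignments/write",
                                "Microsoft.Authorization/roleAssignments/delete")
| project TimeGenerated, Caller, OperationNameValue, ResourceId
```

The same portal flow applies: scope it to your workspace, evaluate every 5 minutes, and fire on a count greater than 0.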


Validation

Use this checklist:

  1. Activity Log export enabled – Portal: Monitor → Activity log → Export/Diagnostic settings shows your Log Analytics destination.
  2. Data arriving in Log Analytics – AzureActivity | where TimeGenerated > ago(1h) returns rows.
  3. Alert rule created and enabled – Monitor → Alerts shows the rule as enabled.
  4. Test alert – Delete a small resource in the lab resource group and confirm the alert triggers.
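For checklist item 2, a short sanity query confirms both volume and freshness in one result row (a sketch against the standard AzureActivity table):

```kql
// How many Activity Log events arrived recently, and when was the latest?
AzureActivity
| where TimeGenerated > ago(1h)
| summarize Events = count(), Latest = max(TimeGenerated)
```

If Events is 0 or Latest lags well behind the current time, revisit the diagnostic setting before debugging anything downstream.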

Troubleshooting

Issue: AzureActivity table has no data
  • Wait 5–15 minutes after enabling export.
  • Confirm the diagnostic setting is configured at the subscription level, not only on a resource.
  • Ensure you selected relevant categories (Administrative is essential).

Issue: Permission denied creating diagnostic settings
  • You likely need Owner or Contributor at subscription scope (or a role that includes Microsoft.Insights/diagnosticSettings/*).
  • In locked-down environments, request help from the platform team.

Issue: Alert never fires
  • Confirm your query is correct:
    • Use a larger time window temporarily (e.g., ago(1h)).
    • Remove the ResourceGroup filter to confirm delete operations appear.
  • Confirm the alert evaluation period/frequency matches your query window.

Issue: Too many alerts (noise)
  • Narrow by:
    • resource group(s) for the Foundry platform
    • specific operation names (role assignments, delete, write)
    • only failed operations (e.g., ActivityStatusValue != "Succeeded")
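Combining those narrowing filters into a single query might look like the sketch below. The resource group name is the lab value from earlier; status strings can vary between "Failed" and "Failure" depending on schema version, so check the distinct values in your own workspace first.

```kql
// High-signal only: deletes/writes in the platform RG that failed
AzureActivity
| where TimeGenerated > ago(1h)
| where ResourceGroup == "rg-foundry-observability-lab"
| where OperationNameValue endswith "/delete" or OperationNameValue endswith "/write"
| where ActivityStatusValue in ("Failed", "Failure")   // verify status values in your workspace
```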


Cleanup

To avoid ongoing charges, remove what you created.

1) Delete the lab resource group (deletes workspace and any remaining lab resources):

az group delete --name rg-foundry-observability-lab --yes --no-wait

2) Remove the subscription diagnostic setting (if you created one):
  • Portal: Monitor → Activity Log → Export/Diagnostic settings → delete the setting
  • Or use the CLI if available (command patterns vary; verify in official docs)

3) Remove alert rules created for the lab:
  • Portal: Monitor → Alerts → Alert rules → delete the lab rule


11. Best Practices

Architecture best practices

  • Centralize logs by landing zone or platform subscription to enable cross-resource correlation.
  • Use a tiered retention strategy:
    • Hot retention in Log Analytics for active investigations
    • Archive in Storage for long-term compliance (verify best practice in current Azure docs)

IAM/security best practices

  • Restrict who can:
    • change diagnostic settings,
    • delete workspaces,
    • disable alerts.
  • Use separation of duties:
    • Platform team owns export pipelines and workspaces
    • App/ML teams have reader access and create team-level workbooks (where appropriate)

Cost best practices

  • Export what you need:
    • Start with Activity Log categories: Administrative, Policy, Security
    • Add resource logs selectively
  • Tune alerts for signal, not completeness.
  • Use budgets and cost alerts for observability resources too (workspaces can grow unexpectedly).

Performance best practices

  • Prefer focused queries (time-bounded, filtered by resource group/provider).
  • Build “investigation queries” as saved queries or workbook components.

Reliability best practices

  • Alert if diagnostic settings are removed or modified (control-plane observability must be protected).
  • Use resource locks on critical logging resources (carefully—locks can block legitimate changes).
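A detection query for the first point can key on the Microsoft.Insights diagnostic settings operations. This is a sketch; confirm the exact operation names your tenant emits (KQL string operators such as startswith are case-insensitive, which helps with casing differences in Activity Log values).

```kql
// Diagnostic setting writes/deletes: changes to the logging pipeline itself
AzureActivity
| where TimeGenerated > ago(24h)
| where OperationNameValue startswith "Microsoft.Insights/diagnosticSettings/"
| project TimeGenerated, Caller, OperationNameValue, ResourceId
```

Wire this into a log alert with a short evaluation window so pipeline tampering is flagged within minutes.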

Operations best practices

  • Maintain an on-call runbook:
    • where to look first (Activity Log timeline),
    • key KQL queries,
    • escalation paths (platform vs Azure incident).
  • Create a standard workbook for:
    • recent changes,
    • failed operations,
    • RBAC changes,
    • policy denies.
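The “recent changes” view can start from a single summarize. A sketch of a workbook or timechart query over the Administrative category (CategoryValue is the standard AzureActivity column; adjust the window and bin size to taste):

```kql
// Administrative change volume per hour, split by resource group
AzureActivity
| where TimeGenerated > ago(7d)
| where CategoryValue == "Administrative"
| summarize Changes = count() by bin(TimeGenerated, 1h), ResourceGroup
| render timechart
```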

Governance/tagging/naming best practices

  • Standardize tags like:
    • env (dev/test/prod)
    • owner
    • costCenter
    • dataClassification
  • Use consistent resource group naming for Foundry platform boundaries (makes queries and alerts precise).

12. Security Considerations

Identity and access model

  • Control-plane actions authenticate via Microsoft Entra ID.
  • Authorization is enforced by Azure RBAC (and sometimes resource-specific roles).
  • Log access is also RBAC-controlled:
    • Use least privilege for Log Analytics readers.
    • Restrict write permissions to avoid tampering.

Encryption

  • Azure services encrypt data at rest by default (verify specifics for Log Analytics and Storage in current docs).
  • For archives in Storage, consider encryption key options (Microsoft-managed vs customer-managed keys), if required by policy.

Network exposure

  • Treat observability endpoints as sensitive:
    • They can reveal resource names, IDs, and operational details.
    • Prefer RBAC restrictions as the primary control.
    • Where available/required, evaluate private connectivity options (verify support per service).

Secrets handling

  • Avoid embedding secrets in alert webhooks or automation scripts.
  • Store secrets in Azure Key Vault and use managed identities for automation.

Audit/logging

  • Your observability pipeline itself must be observable:
    • Alert when diagnostic settings are changed.
    • Alert when the workspace is deleted (activity log events).
  • Consider streaming to Sentinel for tamper-resistant security operations (with proper governance).

Compliance considerations

  • Define retention requirements (e.g., 90 days hot, 1–7 years archive) based on your regulatory obligations.
  • Ensure logs do not violate data residency rules—choose workspace region accordingly.

Common security mistakes

  • Allowing too many users to modify diagnostic settings
  • Storing logs only in short-retention default views
  • Not alerting on role assignment changes
  • Not separating production and non-production logging workspaces

Secure deployment recommendations

  • Use IaC (Bicep/Terraform) to define:
    • diagnostic settings
    • workspaces
    • alerts
    • action groups
  • Apply policy to require diagnostic settings on critical resource types (verify feasibility per resource provider).

13. Limitations and Gotchas

  • Not all resources emit the same logs: Some Foundry-related services may have limited resource logs. Always check the resource’s Diagnostic settings categories.
  • Activity Log is necessary but not sufficient: It shows control-plane operations, not application/data-plane telemetry (e.g., model inference latency inside your app).
  • Retention defaults can be short: Relying only on portal views risks losing critical evidence.
  • Alert noise is easy to create: Without filters, you’ll overwhelm responders with low-signal events.
  • CLI/portal differences: Some diagnostic setting operations are easier in the portal; CLI support can vary by version. Use the portal if commands don’t match your environment.
  • Costs can grow quietly: Log ingestion increases with organizational scale and change frequency. Implement budgets and periodic reviews.

14. Comparison with Alternatives

Observability in Foundry Control Plane is a control-plane-focused approach. Here’s how it compares with nearby options.

| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Observability in Foundry Control Plane (Azure Monitor + Activity Log + Log Analytics) | Auditing and operating Foundry management plane | Strong change tracking, native Azure integration, flexible KQL | Doesn’t automatically cover app/data-plane telemetry | When you need governance, audit trails, and control-plane alerting |
| Azure Monitor (general) | Broad monitoring across Azure | Standard platform for metrics/logs/alerts | Requires design to cover Foundry boundaries | When you want a unified monitoring strategy |
| Application Insights (app telemetry) | Application performance monitoring | Traces, dependencies, distributed tracing for apps | Not a control-plane audit trail | When you need app-level observability for AI apps (APIs, RAG services) |
| Microsoft Sentinel | Security operations and incident response | SIEM/SOAR, correlation, detections | Additional cost/ops overhead | When SOC needs detections for admin actions and suspicious changes |
| AWS CloudTrail + CloudWatch | AWS control-plane observability | Mature change/audit tracking | Different cloud; not Azure-native | If your AI platform runs on AWS |
| GCP Cloud Audit Logs + Cloud Monitoring | GCP control-plane observability | Strong audit logs | Different cloud; not Azure-native | If your AI platform runs on GCP |
| Datadog / Splunk (self-managed or SaaS) | Cross-cloud enterprise observability | Powerful search/correlation | Cost, integration complexity, data residency concerns | When you need a unified multi-cloud observability layer |
| Prometheus/Grafana (self-managed) | Metrics-focused observability | Open ecosystem | Control-plane audit coverage is not the focus | When you primarily need metrics and have platform maturity |

15. Real-World Example

Enterprise example (regulated)

  • Problem: A financial services company runs AI workloads with strict governance. Auditors require proof of change control for AI platform resources, and incidents must be triaged quickly.
  • Proposed architecture:
  • Subscription Activity Log exported to a central Log Analytics workspace
  • Resource logs enabled on critical AI, networking, and Key Vault resources (where supported)
  • Workbooks for “change timeline”, “RBAC changes”, “policy denies”
  • Alerts on delete operations, role assignment changes, and diagnostic setting changes
  • Optional Microsoft Sentinel for SOC detections and incident workflows
  • Why this service was chosen: Observability in Foundry Control Plane aligns with Azure-native governance and audit requirements and integrates with existing Azure Monitor and security tooling.
  • Expected outcomes:
  • Faster audits (repeatable evidence)
  • Lower MTTR through change correlation
  • Reduced risk of unauthorized configuration drift

Startup/small-team example

  • Problem: A small SaaS team ships AI features weekly. A few outages were caused by accidental config changes and lack of visibility.
  • Proposed architecture:
  • Single Log Analytics workspace
  • Activity Log export enabled
  • 3 log alerts: deletes, role assignment changes, repeated failed writes
  • One workbook showing last 7 days of changes
  • Why this service was chosen: Minimal setup effort, low operational overhead, and immediate value from change visibility.
  • Expected outcomes:
  • Quick identification of “what changed”
  • Better on-call experience with fewer blind spots
  • Cost-controlled logging with short retention

16. FAQ

1) What does “control plane” mean in Foundry Control Plane observability?

Control plane refers to management operations (create/update/delete/configure) executed through Azure Resource Manager. It is different from data plane traffic such as application requests to your AI endpoint.

2) Is Observability in Foundry Control Plane a standalone Azure product?

Usually no. It is commonly implemented using Azure Monitor, Activity Log, Log Analytics, diagnostic settings, and alerts. Verify your organization’s Foundry documentation for any Foundry-specific dashboards or integrations.

3) What is the first thing to enable?

Enable subscription Activity Log export to a Log Analytics workspace. It provides immediate, broad control-plane visibility.

4) Does this replace application monitoring for AI apps?

No. Control-plane observability explains platform changes. You still need application/data-plane monitoring (often with Application Insights and distributed tracing).

5) How long does Activity Log data take to appear in Log Analytics?

Typically minutes, but delays can occur. If you see no data after 15 minutes, re-check diagnostic settings and permissions.

6) Which events are most important to alert on?

Start with high-signal events:
  • delete operations
  • role assignment changes
  • policy denies affecting deployments
  • diagnostic setting modifications

7) Can I route logs to both Log Analytics and Storage?

Yes, diagnostic settings often support multiple sinks. This is common for “hot search” in Log Analytics plus long-term archive in Storage.

8) What’s the difference between Activity Log and resource logs?

Activity Log is subscription-level control-plane events. Resource logs are resource-specific telemetry exposed via diagnostic settings (varies by resource type).

9) How do I prove “who changed what” during an incident?

Use Activity Log records in Log Analytics, filtering by time range, resource group, and operation. The Caller field is commonly used to identify the actor.
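In practice, an incident timeline query is often as simple as the sketch below (scoped here to the tutorial’s lab resource group; adjust the scope and time range to the incident window):

```kql
// Chronological list of control-plane actions and their callers
AzureActivity
| where TimeGenerated > ago(24h)
| where ResourceGroup == "rg-foundry-observability-lab"
| project TimeGenerated, Caller, OperationNameValue, ActivityStatusValue, ResourceId
| order by TimeGenerated asc
```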

10) How do I protect observability from tampering?

Use RBAC to restrict modification of diagnostic settings and workspaces. Consider alerts when diagnostic settings change and use resource locks where appropriate.

11) Do I need Microsoft Sentinel?

Not always. Sentinel is beneficial when you need SOC workflows, correlation, and incident management. Many teams start with Azure Monitor and add Sentinel later.

12) Will this increase my Azure bill significantly?

It can, depending on log volume and retention. The main cost drivers are Log Analytics ingestion and retention. Start small, measure GB/day, and optimize.
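To measure GB/day, the workspace’s built-in Usage table is the usual starting point. A sketch; Quantity is reported in MB, so divide by 1,024, and verify the schema in your workspace before relying on it for budgeting.

```kql
// Billable ingestion per table over the last 30 days, in GB
Usage
| where TimeGenerated > ago(30d)
| where IsBillable == true
| summarize IngestedGB = sum(Quantity) / 1024 by DataType
| order by IngestedGB desc
```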

13) Can I use this across multiple subscriptions?

Yes. Many organizations export logs from multiple subscriptions into central workspaces (or per-region workspaces) to support a platform view.

14) What KQL table should I query for control-plane events?

If you export Activity Log to Log Analytics, you’ll typically query the AzureActivity table.

15) What if Foundry resource logs aren’t available?

Rely on Activity Log (control plane) plus health signals and policy logs. For deeper telemetry, implement data-plane observability in your applications and AI services.


17. Top Online Resources to Learn Observability in Foundry Control Plane

| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Azure Monitor overview: https://learn.microsoft.com/azure/azure-monitor/overview | Foundation for metrics, logs, alerts, and visualization in Azure |
| Official documentation | Azure Activity log: https://learn.microsoft.com/azure/azure-monitor/essentials/activity-log | Core control-plane event source for subscriptions |
| Official documentation | Diagnostic settings: https://learn.microsoft.com/azure/azure-monitor/essentials/diagnostic-settings | How to route Activity Log and resource logs to Log Analytics/Storage/Event Hubs |
| Official documentation | Log Analytics workspace overview: https://learn.microsoft.com/azure/azure-monitor/logs/log-analytics-workspace-overview | How to design and operate a workspace |
| Official documentation | KQL query overview: https://learn.microsoft.com/azure/azure-monitor/logs/log-query-overview | How to query control-plane logs effectively |
| Official documentation | Azure Monitor alerts: https://learn.microsoft.com/azure/azure-monitor/alerts/alerts-overview | How to create actionable alerts from logs/metrics |
| Official documentation | Azure Monitor workbooks: https://learn.microsoft.com/azure/azure-monitor/visualize/workbooks-overview | How to build dashboards for operations |
| Official documentation | Azure Service Health: https://learn.microsoft.com/azure/service-health/overview | Platform incident visibility for triage |
| Official documentation | Microsoft Sentinel overview: https://learn.microsoft.com/azure/sentinel/overview | SIEM/SOAR option for security-driven observability |
| Official pricing | Azure Monitor pricing: https://azure.microsoft.com/pricing/details/monitor/ | Understand ingestion/retention/alerting cost model |
| Official tool | Azure Pricing Calculator: https://azure.microsoft.com/pricing/calculator/ | Build region-specific cost estimates |

Foundry-specific documentation links can change with product naming. If you cannot find “Foundry Control Plane” by that name, search Microsoft Learn for “Azure AI Foundry” + “monitoring” + “diagnostic settings” and use the relevant resource provider documentation.


18. Training and Certification Providers

| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, SREs, platform teams | Azure operations, monitoring, DevOps practices, CI/CD integration | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate engineers | DevOps fundamentals, tooling, process, and governance | Check website | https://www.scmgalaxy.com/ |
| CloudOpsNow.in | Cloud operations teams | Cloud operations patterns, monitoring, reliability | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, reliability engineers | SRE principles, alerting strategy, incident response | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops + AI teams | AIOps concepts, monitoring automation, operational analytics | Check website | https://www.aiopsschool.com/ |

19. Top Trainers

| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/Cloud training content (verify scope) | Beginners to intermediate | https://www.rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and coaching (verify offerings) | DevOps engineers, SREs | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps help/training (verify offerings) | Teams needing short-term support | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support and guidance (verify offerings) | Ops teams and engineers | https://www.devopssupport.in/ |

20. Top Consulting Companies

| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify exact services) | Platform engineering, operational readiness, monitoring foundations | Central logging design, alerting standards, IaC observability rollout | https://www.cotocus.com/ |
| DevOpsSchool.com | DevOps consulting and training | DevOps/SRE enablement, monitoring practices | Implement Azure Monitor baselines, dashboards, CI/CD guardrails | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify exact services) | DevOps processes, automation, operations | Activity log export rollout, RBAC governance, incident response runbooks | https://www.devopsconsulting.in/ |

21. Career and Learning Roadmap

What to learn before this service

  • Azure fundamentals: subscriptions, resource groups, regions
  • Microsoft Entra ID basics and Azure RBAC
  • Azure Resource Manager concepts (deployments, resource providers)
  • Azure Monitor basics (metrics vs logs, diagnostic settings)

What to learn after this service

  • Advanced KQL (joins, parsing, workbook parameters)
  • Microsoft Sentinel detections and incident workflows
  • IaC-based monitoring (Bicep/Terraform modules for diagnostic settings and alerts)
  • Data-plane observability for AI apps:
    • Application Insights
    • OpenTelemetry tracing patterns
    • SLOs/SLIs and error budgets

Job roles that use it

  • Cloud engineer / platform engineer
  • DevOps engineer
  • SRE
  • Security engineer / SOC analyst
  • Solutions architect (AI platform governance)
  • FinOps practitioner (logging cost governance)

Certification path (Azure)

There is no single “Foundry control plane observability” certification. Helpful Microsoft certifications (verify current names/availability):

  • Azure Administrator (operations and monitoring foundations)
  • Azure Security Engineer (security monitoring and governance)
  • Azure Solutions Architect (architecture and platform design)

Project ideas for practice

  • Build a “Foundry platform change timeline” workbook (last 7/30/90 days)
  • Create an alert pack:
    • role assignment changes
    • delete operations
    • diagnostic settings changes
    • repeated failed write operations
  • Implement a multi-subscription export pattern with standardized retention and tags

22. Glossary

  • Observability: The ability to understand a system’s internal state using outputs like logs, metrics, and traces.
  • Control plane: Management operations (create/update/delete/configure) performed via Azure Resource Manager.
  • Data plane: Runtime operations (e.g., application requests, model inference calls).
  • Azure Activity Log: Subscription-level log of control-plane events.
  • Diagnostic settings: Azure mechanism to route logs/metrics to Log Analytics, Storage, or Event Hubs.
  • Log Analytics workspace: Azure Monitor Logs store for querying and retention.
  • KQL (Kusto Query Language): Query language used for Azure Monitor Logs.
  • Azure Monitor: Azure’s platform for metrics, logs, alerts, and dashboards.
  • Workbook: Azure Monitor visualization artifact built from queries and parameters.
  • Alert rule: Condition that triggers notifications/actions based on logs or metrics.
  • Action group: Notification and automation targets for alert rules (email, webhook, ITSM, etc.).
  • RBAC: Role-Based Access Control for authorization in Azure.
  • Azure Policy: Governance service for enforcing rules and compliance.
  • Service Health: Azure service providing incident and maintenance notifications.
  • Resource Health: Health status for a specific Azure resource.
  • SIEM: Security Information and Event Management system (e.g., Microsoft Sentinel).
  • Retention: How long logs are stored and searchable.

23. Summary

Observability in Foundry Control Plane (Azure) is the discipline of capturing and operationalizing control-plane telemetry—especially Activity Logs, diagnostic exports, queries, dashboards, and alerts—so you can reliably answer what changed, who changed it, and how it impacted your AI platform.

It matters because AI + Machine Learning platforms are configuration-heavy and security-sensitive; control-plane visibility reduces outages, improves audit readiness, and strengthens governance. Cost is primarily driven by Log Analytics ingestion and retention, plus optional SIEM and streaming. Security hinges on RBAC, protecting diagnostic settings, and alerting on risky admin actions.

Use it when you operate Foundry-based AI environments beyond basic prototypes—especially in shared, regulated, or fast-changing production platforms. Next step: expand from baseline Activity Log export to targeted resource logs, workbooks, and high-signal alert packs, then integrate with Microsoft Sentinel if SOC workflows are required.