{"id":372,"date":"2026-04-13T20:15:02","date_gmt":"2026-04-13T20:15:02","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/azure-sre-agent-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-machine-learning\/"},"modified":"2026-04-13T20:15:02","modified_gmt":"2026-04-13T20:15:02","slug":"azure-sre-agent-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-machine-learning","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/azure-sre-agent-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-ai-machine-learning\/","title":{"rendered":"Azure SRE Agent Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for AI + Machine Learning"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">AI + Machine Learning<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What this service is<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure SRE Agent<\/strong> is best understood as an <strong>AI-powered Site Reliability Engineering (SRE) assistant implemented on Azure<\/strong> that helps teams detect, summarize, and respond to operational signals (alerts, incidents, changes, and logs) using Azure-native observability and automation services plus an LLM.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">One-paragraph simple explanation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can use Azure SRE Agent to turn noisy operational data\u2014like Azure Activity Log events, monitoring alerts, and logs\u2014into a readable \u201cwhat happened, why it matters, and what to do next\u201d briefing, and optionally trigger safe automations for common SRE workflows (triage, routing, status updates, and runbook guidance).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">One-paragraph technical explanation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, Azure SRE Agent is typically 
assembled from <strong>Azure AI (for LLM inference), Azure Monitor (for telemetry and alerts), and an execution layer (Azure Functions, Container Apps, Logic Apps, or Automation)<\/strong>. The agent ingests operational context (changes, incidents, metrics\/logs), applies guardrails (RBAC, prompt boundaries, tool allow-lists), generates structured outputs (incident summaries, suggested queries, remediation steps), and integrates with ITSM\/chat tools (Teams, ServiceNow, PagerDuty\u2014depending on what you connect).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What problem it solves<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Modern production systems generate more operational data than humans can reliably process in time. Azure SRE Agent addresses:\n&#8211; <strong>Signal overload<\/strong>: too many alerts, logs, and change events.\n&#8211; <strong>Slow triage<\/strong>: identifying what changed and what to check takes time.\n&#8211; <strong>Inconsistent incident response<\/strong>: response quality varies by engineer and shift.\n&#8211; <strong>Knowledge gaps<\/strong>: runbooks and tribal knowledge are hard to find when it matters.<\/p>\n\n\n\n<blockquote>\n<p>Important note about naming and official status: As of my last verified knowledge cutoff (2025-08), I cannot confirm that <strong>\u201cAzure SRE Agent\u201d<\/strong> is a standalone, generally available Azure product with a dedicated pricing page and canonical documentation entry in the Azure product catalog. It may be a <strong>new offering, preview capability, solution accelerator, or internal initiative<\/strong>. 
<strong>Verify in official docs<\/strong> if Microsoft has since released a first-class service by this exact name.<br\/>\nThis tutorial therefore treats <strong>Azure SRE Agent<\/strong> as a <strong>deployable Azure reference implementation\/pattern<\/strong> for an SRE-focused AI agent, built only from verifiable, current Azure services.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Azure SRE Agent?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose (practical, Azure-aligned)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure SRE Agent\u2019s purpose is to <strong>augment SRE and operations teams<\/strong> by using AI to:\n&#8211; Summarize operational events and telemetry into human-friendly narratives.\n&#8211; Correlate changes and symptoms (what changed vs. what broke).\n&#8211; Recommend next-step diagnostics (queries, dashboards, runbooks).\n&#8211; Produce consistent incident updates and post-incident artifacts.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">When implemented on Azure, Azure SRE Agent commonly provides:\n&#8211; <strong>Operational summarization<\/strong>: \u201cWhat happened in the last N hours?\u201d\n&#8211; <strong>Change intelligence<\/strong>: deployments\/config changes from Activity Log and CI\/CD.\n&#8211; <strong>Incident assistance<\/strong>: triage checklists, likely causes, safe next actions.\n&#8211; <strong>Query assistance<\/strong>: generate Kusto queries for Azure Monitor Logs (guarded).\n&#8211; <strong>Notification and collaboration<\/strong>: send briefings to Microsoft Teams\/email.\n&#8211; <strong>Automation hooks<\/strong>: call approved runbooks (human-in-the-loop recommended).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (typical implementation)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Because Azure SRE Agent is commonly a composition of services, the \u201ccomponents\u201d are usually:<\/p>\n\n\n\n<ol 
class=\"wp-block-list\">\n<li>\n<p><strong>LLM inference<\/strong>\n   &#8211; Most commonly <strong>Azure OpenAI Service<\/strong> (or another Azure-hosted model endpoint available in your tenant).\n   &#8211; Used for summarization, classification, and structured output generation.<\/p>\n<\/li>\n<li>\n<p><strong>Observability data sources<\/strong>\n   &#8211; <strong>Azure Monitor<\/strong> (metrics, alerts)\n   &#8211; <strong>Azure Monitor Logs \/ Log Analytics<\/strong> (KQL queries across logs)\n   &#8211; <strong>Azure Activity Log<\/strong> (control-plane changes)\n   &#8211; Optional: Application Insights (APM traces), Microsoft Sentinel (security signals)<\/p>\n<\/li>\n<li>\n<p><strong>Orchestration\/execution<\/strong>\n   &#8211; <strong>Azure Functions<\/strong> (simple, low-cost schedulers and webhooks)\n   &#8211; Or <strong>Logic Apps<\/strong> (workflow integrations)\n   &#8211; Or <strong>Azure Container Apps \/ AKS<\/strong> (more control, multi-tool agents)<\/p>\n<\/li>\n<li>\n<p><strong>Identity and secrets<\/strong>\n   &#8211; <strong>Managed Identity<\/strong> (recommended) for Azure API access\n   &#8211; <strong>Azure Key Vault<\/strong> for secrets (if any; prefer managed identity)<\/p>\n<\/li>\n<li>\n<p><strong>Outputs and integrations<\/strong>\n   &#8211; Microsoft Teams (incoming webhook), email, ITSM tools, dashboards<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If Microsoft has released a standalone service named \u201cAzure SRE Agent,\u201d treat it as a <strong>managed AI\/ops service<\/strong> and follow its official docs.<\/li>\n<li>In this tutorial, Azure SRE Agent is implemented as a <strong>solution pattern<\/strong>: an application you deploy into <strong>your subscription<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scope (regional\/global\/zonal\/subscription)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For the pattern described 
here:\n&#8211; <strong>Subscription-scoped deployment<\/strong>: you deploy resources into one subscription\/resource group.\n&#8211; <strong>Regional compute<\/strong>: Functions\/OpenAI resources are regional.\n&#8211; <strong>Data scope<\/strong>: depends on which subscriptions\/workspaces you allow it to read.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Azure ecosystem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure SRE Agent sits at the intersection of:\n&#8211; <strong>AI + Machine Learning<\/strong> (LLM inference, structured reasoning)\n&#8211; <strong>Operations<\/strong> (Azure Monitor, Service Health, Activity Log)\n&#8211; <strong>Automation<\/strong> (Functions\/Logic Apps\/Runbooks)\n&#8211; <strong>Security<\/strong> (RBAC, Key Vault, audit logging)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It should not replace Azure Monitor or your incident management system; it <strong>augments<\/strong> them.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Why use Azure SRE Agent?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Reduce mean time to acknowledge (MTTA)<\/strong> with faster incident understanding.<\/li>\n<li><strong>Reduce mean time to resolve (MTTR)<\/strong> by accelerating first-pass diagnostics.<\/li>\n<li><strong>Improve reliability outcomes<\/strong> through consistent triage and postmortems.<\/li>\n<li><strong>Lower on-call burden<\/strong> by summarizing noise into actionable signals.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure-native access to:<\/li>\n<li>Activity Log (changes)<\/li>\n<li>Monitor alerts\/metrics\/logs<\/li>\n<li>Resource Graph \/ ARM metadata (optional)<\/li>\n<li>LLMs can produce:<\/li>\n<li>Structured incident briefs<\/li>\n<li>Suggested next checks<\/li>\n<li>Standardized status updates<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational 
reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create a repeatable \u201cincident briefing\u201d workflow across teams.<\/li>\n<li>Standardize handoffs (follow-the-sun) with consistent summaries.<\/li>\n<li>Provide a single \u201cops context\u201d entrypoint for on-call.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>With <strong>Managed Identity + RBAC<\/strong>, you can limit access to only required resources.<\/li>\n<li>With <strong>private networking<\/strong> options (where supported) and <strong>data minimization<\/strong>, you can reduce data exposure.<\/li>\n<li>Can support auditability via Application Insights and Azure Monitor logs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Functions\/Container Apps scale out as needed.<\/li>\n<li>LLM throughput depends on model\/SKU and quotas (verify your Azure OpenAI quotas).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure SRE Agent is a good fit if you:\n&#8211; Already use <strong>Azure Monitor<\/strong> and want <strong>human-readable incident context<\/strong>.\n&#8211; Need to correlate <strong>control-plane changes<\/strong> with outages.\n&#8211; Want to automate repetitive SRE tasks with strict guardrails.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Avoid (or delay) Azure SRE Agent if:\n&#8211; You cannot approve an AI data flow for operational data (policy\/compliance).\n&#8211; You require deterministic outputs with zero hallucination risk.\n&#8211; You lack a mature incident process (AI will amplify chaos if inputs are messy).\n&#8211; Your environment has strict data residency constraints you cannot satisfy (verify regional availability and networking options for your model 
endpoint).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Where is Azure SRE Agent used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SaaS and technology<\/li>\n<li>Finance (with strict controls)<\/li>\n<li>Retail\/e-commerce<\/li>\n<li>Healthcare (with careful PHI handling)<\/li>\n<li>Manufacturing\/IoT (telemetry-heavy)<\/li>\n<li>Gaming and media streaming<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SRE teams<\/li>\n<li>DevOps\/platform engineering<\/li>\n<li>NOC\/operations<\/li>\n<li>Cloud center of excellence (CCoE)<\/li>\n<li>Security operations (when integrated with Sentinel; verify policy constraints)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microservices on AKS<\/li>\n<li>Azure App Service web apps\/APIs<\/li>\n<li>Event-driven systems (Functions, Service Bus, Event Hubs)<\/li>\n<li>Data platforms (Synapse, Databricks; verify what telemetry you expose)<\/li>\n<li>Hybrid with Azure Arc-enabled servers (for unified monitoring)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Hub-and-spoke networks with centralized logging<\/li>\n<li>Multi-subscription landing zones with centralized governance<\/li>\n<li>Regulated environments with private endpoints and strict RBAC<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>On-call briefing bots (daily\/weekly summaries)<\/li>\n<li>Incident channel assistants (Teams)<\/li>\n<li>Change review assistants (what changed in prod last 24 hours)<\/li>\n<li>Post-incident report drafting (human-reviewed)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Production vs dev\/test usage<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Dev\/test<\/strong>: start with Activity Log + a small subset 
of resources; avoid sensitive logs.<\/li>\n<li><strong>Production<\/strong>: add Log Analytics queries, alert correlation, strict RBAC, and human-in-the-loop approvals.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Below are realistic scenarios where Azure SRE Agent (as an Azure-implemented SRE AI agent) is commonly valuable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Daily production change briefing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Engineers don\u2019t know what changed overnight.<\/li>\n<li><strong>Why it fits<\/strong>: Activity Log provides a change feed; LLM summarizes.<\/li>\n<li><strong>Example<\/strong>: Every morning at 08:00, the agent posts \u201cTop changes in Prod RG\u201d to Teams with links to events.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Incident first-response summary<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: On-call wastes time reading scattered alerts and logs.<\/li>\n<li><strong>Why it fits<\/strong>: Agent turns raw signals into a single narrative and checklist.<\/li>\n<li><strong>Example<\/strong>: When a Sev2 alert triggers, the agent produces an incident brief with likely impacted services and recent deployments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Change-to-incident correlation (control-plane focus)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Outages often follow configuration changes.<\/li>\n<li><strong>Why it fits<\/strong>: Agent compares a window of Activity Log changes with incident start times.<\/li>\n<li><strong>Example<\/strong>: \u201cAt 02:14 UTC, NSG rule changed; at 02:16, 5xx spiked.\u201d<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Runbook recommendation assistant<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Runbooks exist but are hard to find in the 
moment.<\/li>\n<li><strong>Why it fits<\/strong>: Agent retrieves and suggests relevant runbooks (from a curated store).<\/li>\n<li><strong>Example<\/strong>: For \u201cAKS node not ready,\u201d it links the internal runbook and top commands.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Query assistant for Azure Monitor Logs (KQL)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Engineers struggle to write KQL under pressure.<\/li>\n<li><strong>Why it fits<\/strong>: Agent generates KQL templates with guardrails.<\/li>\n<li><strong>Example<\/strong>: \u201cShow exceptions by operationName in last 30m\u201d query, with workspace scoping.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Executive status update drafting<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Status updates are inconsistent and slow.<\/li>\n<li><strong>Why it fits<\/strong>: LLM drafts concise, non-technical summaries from incident notes.<\/li>\n<li><strong>Example<\/strong>: Drafts \u201ccustomer impact \/ mitigation \/ ETA\u201d every 30 minutes for review.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Post-incident report skeleton generation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Postmortems are delayed due to writing overhead.<\/li>\n<li><strong>Why it fits<\/strong>: Agent drafts timeline and action items from incident data.<\/li>\n<li><strong>Example<\/strong>: After closure, agent compiles a timeline from alerts, changes, and chat notes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Noisy alert triage and grouping<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Many alerts are duplicates or symptoms.<\/li>\n<li><strong>Why it fits<\/strong>: Agent clusters alerts and highlights probable root signal.<\/li>\n<li><strong>Example<\/strong>: Groups 30 \u201cHTTP 500\u201d alerts across services into one upstream dependency 
issue.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Compliance-friendly operational assistant (restricted scope)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Regulated teams can\u2019t send full logs to AI.<\/li>\n<li><strong>Why it fits<\/strong>: Agent can operate on <strong>metadata + summaries<\/strong> only (minimize data).<\/li>\n<li><strong>Example<\/strong>: Only uses Activity Log and high-level metrics, not raw payload logs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Multi-subscription operational overview (landing zone)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Platform team needs a consolidated view across subscriptions.<\/li>\n<li><strong>Why it fits<\/strong>: Agent runs with Reader access across approved scopes.<\/li>\n<li><strong>Example<\/strong>: Weekly platform report: top incidents, change hotspots, recurring alerts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) \u201cWhat changed?\u201d helper for failed deployments<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A deployment fails and the team needs quick diff context.<\/li>\n<li><strong>Why it fits<\/strong>: Agent uses CI\/CD metadata + Activity Log to summarize what was updated.<\/li>\n<li><strong>Example<\/strong>: \u201cKey Vault access policy changed; App Service identity rotated.\u201d<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Capacity and performance anomaly narrative<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Engineers see a spike but don\u2019t know what it means.<\/li>\n<li><strong>Why it fits<\/strong>: Agent narrates \u201cwhat changed in metrics\u201d and suggests next checks.<\/li>\n<li><strong>Example<\/strong>: \u201cCPU rose with request rate; DB DTU saturated; check connection pool.\u201d<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">6. 
Core Features<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because Azure SRE Agent is commonly implemented as a pattern, the \u201cfeatures\u201d below describe what you can build using supported Azure components. If Microsoft provides an official Azure SRE Agent service, <strong>verify the exact feature set in official docs<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 1: Change feed ingestion (Azure Activity Log)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Pulls control-plane events (write\/delete actions, policy changes, role assignments, deployments).<\/li>\n<li><strong>Why it matters<\/strong>: Many incidents are caused by changes; this provides ground truth.<\/li>\n<li><strong>Practical benefit<\/strong>: Rapid \u201cwhat changed\u201d answers without searching portals.<\/li>\n<li><strong>Caveats<\/strong>: Activity Log retention and export vary; verify retention and diagnostic export options in your tenant.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 2: Telemetry summarization (Azure Monitor metrics\/logs)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Summarizes key signals from metrics\/log queries.<\/li>\n<li><strong>Why it matters<\/strong>: Humans need synthesized context, not raw noise.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster triage; consistent briefings.<\/li>\n<li><strong>Caveats<\/strong>: Be careful with sensitive log contents; use minimization and redaction.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 3: Guardrailed LLM prompts for ops<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Uses strict prompt templates and structured outputs (e.g., JSON-like sections) to reduce ambiguity.<\/li>\n<li><strong>Why it matters<\/strong>: Operations requires clarity and repeatability.<\/li>\n<li><strong>Practical benefit<\/strong>: Standard format across incidents and 
teams.<\/li>\n<li><strong>Caveats<\/strong>: LLMs can hallucinate; treat output as suggestions, not truth.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 4: Tool-based actions (optional, allow-listed)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: The agent can call approved tools: query logs, fetch change lists, open tickets, trigger runbooks.<\/li>\n<li><strong>Why it matters<\/strong>: Moves from \u201cchat\u201d to \u201cassisted operations.\u201d<\/li>\n<li><strong>Practical benefit<\/strong>: Less copy\/paste; fewer context switches.<\/li>\n<li><strong>Caveats<\/strong>: Put humans in the loop for production-impacting actions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 5: Teams\/ChatOps integration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Posts summaries into Teams channels or incident chats.<\/li>\n<li><strong>Why it matters<\/strong>: Collaboration happens in chat; meet engineers where they work.<\/li>\n<li><strong>Practical benefit<\/strong>: Shared situational awareness.<\/li>\n<li><strong>Caveats<\/strong>: Protect webhook URLs; treat them as secrets.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 6: Identity-first access (Managed Identity + RBAC)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Uses Microsoft Entra ID (formerly Azure AD) authentication without embedding credentials.<\/li>\n<li><strong>Why it matters<\/strong>: Reduces secret sprawl and improves auditability.<\/li>\n<li><strong>Practical benefit<\/strong>: Safer automation.<\/li>\n<li><strong>Caveats<\/strong>: RBAC misconfiguration can over-permit access; scope roles tightly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 7: Observability of the agent itself<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Logs requests, failures, latency, token usage metadata (avoid logging sensitive prompts).<\/li>\n<li><strong>Why it 
matters<\/strong>: The agent becomes part of production; it must be operable.<\/li>\n<li><strong>Practical benefit<\/strong>: Troubleshoot and control costs.<\/li>\n<li><strong>Caveats<\/strong>: Avoid storing sensitive incident data in logs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 8: Cost controls<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Enforces token limits, time windows, sampling, and caching.<\/li>\n<li><strong>Why it matters<\/strong>: LLM inference can become a major cost driver.<\/li>\n<li><strong>Practical benefit<\/strong>: Predictable spend.<\/li>\n<li><strong>Caveats<\/strong>: Over-aggressive truncation can reduce quality.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level, Azure SRE Agent:\n1. Collects operational context (changes, alerts, logs, topology metadata).\n2. Applies policies and guardrails (allowed scopes, time windows, redaction).\n3. Calls an LLM endpoint for summarization\/recommendations.\n4. Delivers outputs to humans and\/or ticketing systems.\n5. 
Records telemetry about the agent execution.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow (typical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Trigger<\/strong>: schedule (timer), webhook (alert fired), or manual HTTP request.<\/li>\n<li><strong>Data retrieval<\/strong>: Activity Log + (optional) Log Analytics queries.<\/li>\n<li><strong>Enrichment<\/strong>: service metadata, ownership tags, runbook links.<\/li>\n<li><strong>LLM prompt<\/strong>: templated prompt with structured sections.<\/li>\n<li><strong>Output<\/strong>: Teams message + optional storage of summary.<\/li>\n<li><strong>Audit<\/strong>: Application Insights traces + Azure Monitor logs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Common Azure integrations include:\n&#8211; <strong>Azure Monitor<\/strong> (metrics\/alerts\/logs)\n&#8211; <strong>Log Analytics workspace<\/strong> (KQL queries)\n&#8211; <strong>Application Insights<\/strong> (APM for apps and for agent)\n&#8211; <strong>Azure OpenAI Service<\/strong> (LLM inference)\n&#8211; <strong>Azure Key Vault<\/strong> (secrets, if needed)\n&#8211; <strong>Microsoft Teams<\/strong> (webhook)\n&#8211; Optional: <strong>Microsoft Sentinel<\/strong> (security incidents), <strong>Azure Automation<\/strong> (runbooks)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The minimal working pattern usually depends on:\n&#8211; A compute\/orchestration layer (Functions\/Container Apps)\n&#8211; An identity (managed identity)\n&#8211; An LLM endpoint (Azure OpenAI or other Azure-hosted model)\n&#8211; An observability source (Activity Log is the simplest starting point)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure OpenAI<\/strong>: typically API key or Entra ID depending on supported 
auth mode (verify current supported authentication in official docs for your region\/resource).<\/li>\n<li><strong>Azure Management APIs<\/strong> (Activity Log): use <strong>Managed Identity<\/strong> + Azure RBAC.<\/li>\n<li><strong>Teams webhook<\/strong>: secret URL stored in Key Vault or app settings (treat as secret).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Simple: public endpoints with HTTPS.<\/li>\n<li>Production: prefer:<\/li>\n<li>Private endpoints where available (Azure OpenAI supports private networking in many scenarios; <strong>verify<\/strong> for your region\/model).<\/li>\n<li>VNet integration for Functions (Premium plan) if needed.<\/li>\n<li>Egress controls via firewall\/NAT where appropriate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable Application Insights for the agent runtime.<\/li>\n<li>Track:<\/li>\n<li>Execution success\/fail<\/li>\n<li>Latency of Azure API calls<\/li>\n<li>LLM call duration and failures<\/li>\n<li>Token usage (store metadata, avoid sensitive content)<\/li>\n<li>Apply tagging and naming standards.<\/li>\n<li>Use Azure Policy to enforce approved SKUs and networking.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  T[Timer\/HTTP Trigger] --&gt; F[Azure Function: Azure SRE Agent]\n  F --&gt; A[Azure Activity Log API]\n  F --&gt; L[Azure OpenAI Service]\n  F --&gt; M[Teams Incoming Webhook]\n  F --&gt; AI[Application Insights]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph Ops[\"Operations Signals\"]\n    AM[Azure Monitor Alerts]\n    AL[Azure Activity Log]\n    LA[Log Analytics Workspace]\n    SH[Azure Service 
Health]\n  end\n\n  subgraph Agent[\"Azure SRE Agent (your deployment)\"]\n    CA[Orchestrator: Functions \/ Container Apps]\n    KV[Azure Key Vault]\n    MI[Managed Identity]\n    AI[Application Insights]\n    Q[Guardrails: scope + redaction + token limits]\n  end\n\n  subgraph AIStack[\"AI + Machine Learning\"]\n    AOAI[\"Azure OpenAI Service (LLM endpoint)\"]\n  end\n\n  subgraph Outputs[\"Outputs\"]\n    TEAMS[Microsoft Teams Incident Channel]\n    ITSM[\"ITSM\/Ticketing (optional)\"]\n    KB[\"Runbooks\/Docs (SharePoint\/Wiki\/Git) (optional)\"]\n  end\n\n  AM --&gt; CA\n  CA --&gt;|read| AL\n  CA --&gt;|query| LA\n  CA --&gt;|check| SH\n  CA --&gt; Q --&gt; AOAI\n  CA --&gt;|secrets| KV\n  CA --&gt;|auth| MI\n  CA --&gt; AI\n  CA --&gt; TEAMS\n  CA --&gt; ITSM\n  KB --&gt; CA\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account\/subscription\/tenancy requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An active <strong>Azure subscription<\/strong> where you can create resources.<\/li>\n<li>Access to <strong>Azure OpenAI Service<\/strong> in your tenant (often requires approval). 
If you don\u2019t have access, you can still complete parts of this tutorial but cannot run the LLM call.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You\u2019ll typically need:\n&#8211; Resource creation: <strong>Contributor<\/strong> on a resource group (or equivalent).\n&#8211; To grant the Function identity permissions:\n  &#8211; <strong>Reader<\/strong> (minimum) on the scope you want to analyze (resource group\/subscription).\n  &#8211; <strong>Monitoring Reader<\/strong> may be useful if you later query Monitor data (verify your exact needs).\n&#8211; If using Key Vault: <strong>Key Vault Secrets Officer<\/strong> (or fine-grained RBAC) for secret management.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Billing enabled for:<\/li>\n<li>Azure Functions (Consumption is usually low-cost for small workloads)<\/li>\n<li>Azure OpenAI (usage-based)<\/li>\n<li>Application Insights \/ Log Analytics (ingestion-based)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">CLI\/SDK\/tools needed<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/learn.microsoft.com\/cli\/azure\/install-azure-cli\">Azure CLI<\/a><\/li>\n<li>Optional for local dev:<\/li>\n<li><a href=\"https:\/\/learn.microsoft.com\/azure\/azure-functions\/functions-run-local\">Azure Functions Core Tools<\/a><\/li>\n<li>Python 3.10+ (match Functions runtime support; <strong>verify current supported versions<\/strong>)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Choose a region where:<\/li>\n<li>Azure Functions is available (most regions).<\/li>\n<li>Azure OpenAI is available for your subscription and desired models (<strong>verify<\/strong>).<\/li>\n<li>Keep resources in the same region where possible to reduce latency and complexity.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure OpenAI quotas are model\/region\/subscription specific (<strong>verify in Azure OpenAI quota pages\/portal<\/strong>).<\/li>\n<li>Functions timeouts differ by plan (Consumption has limits; Premium has more flexibility).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For the lab in this article:\n&#8211; Azure Function App\n&#8211; Azure OpenAI resource + model deployment\n&#8211; Application Insights (often created automatically with Function App)\n&#8211; (Optional) Microsoft Teams channel + incoming webhook<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because Azure SRE Agent is (in this tutorial) a <strong>pattern<\/strong> rather than a single SKU, pricing is the sum of its parts. Always confirm current prices in official pages and the Azure Pricing Calculator.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (typical)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Azure OpenAI Service<\/strong>\n   &#8211; Billed by model and usage (tokens\/characters depending on model).\n   &#8211; Different models have different per-token rates and throughput limits.\n   &#8211; Official docs and pricing:  <\/p>\n<ul>\n<li>https:\/\/learn.microsoft.com\/azure\/ai-services\/openai\/  <\/li>\n<li>Pricing: https:\/\/azure.microsoft.com\/pricing\/details\/cognitive-services\/openai-service\/ (verify current URL\/structure)<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Azure Functions<\/strong>\n   &#8211; Consumption plan: pay per execution, GB-seconds, and other factors.\n   &#8211; Premium\/Dedicated: pay for provisioned compute.\n   &#8211; Pricing: https:\/\/azure.microsoft.com\/pricing\/details\/functions\/<\/p>\n<\/li>\n<li>\n<p><strong>Application Insights \/ Azure Monitor Logs<\/strong>\n   &#8211; Charged by data 
ingestion\/retention (workspace-based).\n   &#8211; Pricing: https:\/\/azure.microsoft.com\/pricing\/details\/monitor\/<\/p>\n<\/li>\n<li>\n<p><strong>Log Analytics (if used)<\/strong>\n   &#8211; Workspace ingestion and retention costs (often part of Azure Monitor pricing).<\/p>\n<\/li>\n<li>\n<p><strong>Key Vault (optional)<\/strong>\n   &#8211; Operations-based pricing (secrets reads\/writes) + premium features if used.\n   &#8211; Pricing: https:\/\/azure.microsoft.com\/pricing\/details\/key-vault\/<\/p>\n<\/li>\n<li>\n<p><strong>Networking<\/strong>\n   &#8211; Data egress (outbound) charges if sending data across regions or to internet destinations.\n   &#8211; Private endpoints may add cost (NICs, private DNS, etc.).<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier (if applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Functions has a generous free grant on Consumption in many subscriptions (verify current grants).<\/li>\n<li>Azure Monitor and Log Analytics sometimes include free allocations depending on offers; <strong>verify<\/strong>.<\/li>\n<li>Azure OpenAI generally does not have a broad \u201calways free\u201d tier; <strong>verify<\/strong> current promotions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>LLM usage: prompt size + response size + call frequency.<\/li>\n<li>High-cardinality logs: ingestion volume, retention.<\/li>\n<li>Agent verbosity: long summaries and repeated calls increase tokens.<\/li>\n<li>Frequent polling vs event-driven triggers.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden\/indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Storing summaries (Storage\/DB).<\/li>\n<li>Extra logs from the agent itself (Application Insights ingestion).<\/li>\n<li>Private networking (Premium Functions + private endpoints).<\/li>\n<li>Development\/test environments left running.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep agent compute, monitoring workspaces, and model endpoint in the same region when possible.<\/li>\n<li>Avoid sending large raw logs to the LLM; summarize locally first or restrict queries.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>time windows<\/strong> (last 1 hour \/ 24 hours).<\/li>\n<li>Use <strong>token limits<\/strong> and structured outputs.<\/li>\n<li>Prefer <strong>event-driven<\/strong> triggers (alert fired) over constant polling.<\/li>\n<li>Cache and deduplicate summaries (e.g., don\u2019t re-summarize the same incident repeatedly).<\/li>\n<li>Redact\/trim logs and pass only the minimum context needed.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (model-agnostic)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A starter \u201cdaily change briefing\u201d can be low-cost because:\n&#8211; Activity Log calls are not typically billed per request in the same way as data ingestion (verify).\n&#8211; One LLM call per day with a compact prompt can be small.\nYour main costs will likely be:\n&#8211; Azure OpenAI tokens for 1 call\/day\n&#8211; A small number of Function executions\n&#8211; Minimal Application Insights logs<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because prices vary by region\/model and change over time, <strong>use the Azure Pricing Calculator<\/strong> to estimate:\nhttps:\/\/azure.microsoft.com\/pricing\/calculator\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In production, expect cost growth from:\n&#8211; High-frequency incident updates (many LLM calls\/day)\n&#8211; Including Log Analytics queries (workspace ingestion + query)\n&#8211; Multiple teams\/subscriptions\n&#8211; Richer prompts (topology + runbooks + 
metrics + logs)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A common production approach is to set budgets\/alerts and measure:\n&#8211; LLM calls per incident\n&#8211; Tokens per call\n&#8211; $ per incident summary\n&#8211; Workspace ingestion by agent logs<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This lab builds a small but real Azure SRE Agent that:\n&#8211; Reads the <strong>Azure Activity Log<\/strong> for a resource group (last N hours)\n&#8211; Uses <strong>Azure OpenAI<\/strong> to generate a structured \u201cSRE change briefing\u201d\n&#8211; Optionally posts the briefing to <strong>Microsoft Teams<\/strong>\n&#8211; Exposes an HTTP endpoint so you can test on-demand<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Deploy a minimal <strong>Azure SRE Agent<\/strong> into your Azure subscription that produces an AI-generated change summary from the Azure Activity Log.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will:\n1. Create an Azure resource group and an Azure OpenAI model deployment (prereq).\n2. Create an Azure Function App with a system-assigned managed identity.\n3. Grant the Function identity Reader access to the target resource group.\n4. Deploy Python code for an HTTP-triggered \u201cbriefing\u201d endpoint.\n5. Test and validate the output.\n6. (Optional) Post to Microsoft Teams.\n7. 
Clean up resources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a resource group<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Pick a region that supports Azure OpenAI for your subscription (verify).<\/p>\n\n\n\n<pre><code class=\"language-bash\">az login\naz account show\naz account set --subscription \"&lt;YOUR_SUBSCRIPTION_ID&gt;\"\n\n# Variables\nRG=\"rg-azure-sre-agent-lab\"\nLOCATION=\"eastus\"   # change to your region\naz group create -n \"$RG\" -l \"$LOCATION\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; A resource group exists for the lab.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verify<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az group show -n \"$RG\" --query \"{name:name,location:location,provisioningState:properties.provisioningState}\"\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Prepare Azure OpenAI (model deployment)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You need:\n&#8211; An <strong>Azure OpenAI<\/strong> resource\n&#8211; A <strong>model deployment<\/strong> (a chat-capable model)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because model names, versions, and availability change, follow Microsoft\u2019s official guidance:\n&#8211; Docs: https:\/\/learn.microsoft.com\/azure\/ai-services\/openai\/\n&#8211; Quickstarts (Python): https:\/\/learn.microsoft.com\/azure\/ai-services\/openai\/quickstart<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What to do in the Azure portal<\/strong>\n1. Create <strong>Azure OpenAI<\/strong> resource in the same region (or a supported region).\n2. In Azure OpenAI Studio (or Azure AI Foundry UI\u2014naming may vary; verify), create a <strong>deployment<\/strong> for an available chat model.\n3. 
Record:\n   &#8211; <strong>Endpoint<\/strong> (e.g., <code>https:\/\/&lt;resource&gt;.openai.azure.com\/<\/code>)\n   &#8211; <strong>API key<\/strong> (if using key auth)\n   &#8211; <strong>Deployment name<\/strong> (you choose this)\n   &#8211; <strong>API version<\/strong> (per docs)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; You have working endpoint + credentials and a deployment name.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verify quickly (optional)<\/strong>\nUse the official quickstart to confirm you can call the model.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create a Function App (Consumption) and enable Managed Identity<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create a storage account (required by Functions) and the Function App.<\/p>\n\n\n\n<pre><code class=\"language-bash\"># Variables\nSA=\"sreagent$RANDOM$RANDOM\"   # must be globally unique, lowercase, 3-24 chars\nFUNC=\"func-azure-sre-agent-$RANDOM\"\nRUNTIME=\"python\"\nFUNC_VERSION=\"4\"\n\n# Create storage account\naz storage account create \\\n  -n \"$SA\" -g \"$RG\" -l \"$LOCATION\" \\\n  --sku Standard_LRS\n\n# Create Function App (Linux Consumption)\naz functionapp create \\\n  -n \"$FUNC\" -g \"$RG\" -s \"$SA\" \\\n  --consumption-plan-location \"$LOCATION\" \\\n  --runtime \"$RUNTIME\" --functions-version \"$FUNC_VERSION\" \\\n  --os-type Linux\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Enable system-assigned managed identity:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az functionapp identity assign -n \"$FUNC\" -g \"$RG\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Capture the principalId:<\/p>\n\n\n\n<pre><code class=\"language-bash\">PRINCIPAL_ID=$(az functionapp identity show -n \"$FUNC\" -g \"$RG\" --query principalId -o tsv)\necho \"$PRINCIPAL_ID\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; A Function App exists, with a managed 
identity.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verify<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az functionapp show -n \"$FUNC\" -g \"$RG\" --query \"{name:name,state:state,httpsOnly:httpsOnly,kind:kind}\"\naz functionapp identity show -n \"$FUNC\" -g \"$RG\" --query \"{type:type,principalId:principalId}\"\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Grant the Function identity access to read Activity Log scope<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For this lab, restrict scope to the lab resource group.<\/p>\n\n\n\n<pre><code class=\"language-bash\">RG_ID=$(az group show -n \"$RG\" --query id -o tsv)\n\naz role assignment create \\\n  --assignee-object-id \"$PRINCIPAL_ID\" \\\n  --assignee-principal-type ServicePrincipal \\\n  --role \"Reader\" \\\n  --scope \"$RG_ID\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; The function can query Azure Resource Manager at least at Reader level for the RG.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verify<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az role assignment list --assignee-object-id \"$PRINCIPAL_ID\" --scope \"$RG_ID\" -o table\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Configure Function App settings (Azure OpenAI + optional Teams)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Set app settings. 
Replace values with your Azure OpenAI details.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You will set:\n&#8211; <code>AZURE_OPENAI_ENDPOINT<\/code>\n&#8211; <code>AZURE_OPENAI_API_KEY<\/code> (if using key auth; if using Entra ID, verify supported approach)\n&#8211; <code>AZURE_OPENAI_DEPLOYMENT<\/code>\n&#8211; <code>AZURE_OPENAI_API_VERSION<\/code> (verify supported API version)\n&#8211; Optional: <code>TEAMS_WEBHOOK_URL<\/code><\/p>\n\n\n\n<pre><code class=\"language-bash\">AOAI_ENDPOINT=\"https:\/\/&lt;YOUR_AOAI_RESOURCE&gt;.openai.azure.com\/\"\nAOAI_KEY=\"&lt;YOUR_AOAI_API_KEY&gt;\"\nAOAI_DEPLOYMENT=\"&lt;YOUR_MODEL_DEPLOYMENT_NAME&gt;\"\nAOAI_API_VERSION=\"2024-xx-xx\"  # verify in official docs for your chosen model\n\naz functionapp config appsettings set -n \"$FUNC\" -g \"$RG\" --settings \\\n  \"AZURE_OPENAI_ENDPOINT=$AOAI_ENDPOINT\" \\\n  \"AZURE_OPENAI_API_KEY=$AOAI_KEY\" \\\n  \"AZURE_OPENAI_DEPLOYMENT=$AOAI_DEPLOYMENT\" \\\n  \"AZURE_OPENAI_API_VERSION=$AOAI_API_VERSION\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; Function App has required configuration settings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verify<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az functionapp config appsettings list -n \"$FUNC\" -g \"$RG\" --query \"[?name=='AZURE_OPENAI_ENDPOINT' || name=='AZURE_OPENAI_DEPLOYMENT' || name=='AZURE_OPENAI_API_VERSION']\" -o table\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Deploy the Azure SRE Agent function code<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This section uses Python and the Azure Functions programming model. 
There are multiple valid ways to deploy:\n&#8211; Using Functions Core Tools (<code>func azure functionapp publish<\/code>)\n&#8211; Zip deploy (<code>az functionapp deployment source config-zip<\/code>)\n&#8211; GitHub Actions<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Below is a <strong>zip deploy<\/strong> approach to keep it broadly reproducible.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">6.1 Create a local project structure<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">On your machine:<\/p>\n\n\n\n<pre><code class=\"language-bash\">mkdir azure-sre-agent-func\ncd azure-sre-agent-func\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>requirements.txt<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-txt\">azure-functions\nazure-identity\nrequests\nopenai\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Create the function folder and files:<\/p>\n\n\n\n<pre><code class=\"language-bash\">mkdir ChangeBriefing\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>ChangeBriefing\/function.json<\/code> (HTTP trigger):<\/p>\n\n\n\n<pre><code class=\"language-json\">{\n  \"bindings\": [\n    {\n      \"authLevel\": \"function\",\n      \"type\": \"httpTrigger\",\n      \"direction\": \"in\",\n      \"name\": \"req\",\n      \"methods\": [ \"get\", \"post\" ]\n    },\n    {\n      \"type\": \"http\",\n      \"direction\": \"out\",\n      \"name\": \"$return\"\n    }\n  ]\n}\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Create <code>ChangeBriefing\/__init__.py<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-python\">import os\nimport json\nimport datetime\nimport logging\nimport requests\n\nimport azure.functions as func\nfrom azure.identity import DefaultAzureCredential\n\nfrom openai import AzureOpenAI\n\ndef _get_env(name: str) -&gt; str:\n    v = os.getenv(name)\n    if not v:\n        raise ValueError(f\"Missing required setting: {name}\")\n    return v\n\ndef _get_activity_log_events(subscription_id: str, 
resource_group: str, hours: int = 24, top: int = 25):\n    \"\"\"\n    Queries Azure Activity Log events for a resource group over the last N hours.\n\n    Uses Azure Management REST API:\n    https:\/\/learn.microsoft.com\/rest\/api\/monitor\/activity-logs\/list\n\n    Note: API versions can change; verify in official docs if requests fail.\n    \"\"\"\n    credential = DefaultAzureCredential()\n    token = credential.get_token(\"https:\/\/management.azure.com\/.default\").token\n\n    # Use timezone-aware UTC (utcnow() is deprecated since Python 3.12) and\n    # whole-second timestamps so the OData filter stays simple.\n    ts_format = \"%Y-%m-%dT%H:%M:%SZ\"\n    end_time = datetime.datetime.now(datetime.timezone.utc)\n    start_time = end_time - datetime.timedelta(hours=hours)\n\n    # Activity Log filter\n    # We filter by eventTimestamp and resourceGroupName.\n    # Resource group filter uses 'resourceGroupName' field in Activity Logs.\n    # See REST docs for supported OData filters.\n    filter_str = (\n        f\"eventTimestamp ge '{start_time.strftime(ts_format)}' and \"\n        f\"eventTimestamp le '{end_time.strftime(ts_format)}' and \"\n        f\"resourceGroupName eq '{resource_group}'\"\n    )\n\n    api_version = \"2015-04-01\"  # commonly used for Activity Logs; verify if needed\n    url = f\"https:\/\/management.azure.com\/subscriptions\/{subscription_id}\/providers\/microsoft.insights\/eventtypes\/management\/values\"\n    params = {\n        \"api-version\": api_version,\n        \"$filter\": filter_str,\n        \"$top\": str(top)\n    }\n\n    headers = {\n        \"Authorization\": f\"Bearer {token}\"\n    }\n\n    r = requests.get(url, headers=headers, params=params, timeout=30)\n    r.raise_for_status()\n\n    data = r.json()\n    # Cap client-side as well, in case the service does not honor $top (verify in REST docs)\n    values = data.get(\"value\", [])[:top]\n\n    # Return a simplified event list to reduce prompt size\n    simplified = []\n    for ev in values:\n        simplified.append({\n            \"eventTimestamp\": ev.get(\"eventTimestamp\"),\n            \"operationName\": (ev.get(\"operationName\") or {}).get(\"value\"),\n            \"status\": (ev.get(\"status\") or {}).get(\"value\"),\n            \"caller\": ev.get(\"caller\"),\n  
          \"resourceGroupName\": ev.get(\"resourceGroupName\"),\n            \"resourceId\": ev.get(\"resourceId\"),\n            \"category\": (ev.get(\"category\") or {}).get(\"value\"),\n            \"subStatus\": (ev.get(\"subStatus\") or {}).get(\"value\"),\n            \"eventName\": (ev.get(\"eventName\") or {}).get(\"value\"),\n            \"correlationId\": ev.get(\"correlationId\"),\n        })\n\n    return simplified\n\ndef _summarize_with_azure_openai(events, hours: int):\n    endpoint = _get_env(\"AZURE_OPENAI_ENDPOINT\")\n    api_key = _get_env(\"AZURE_OPENAI_API_KEY\")\n    deployment = _get_env(\"AZURE_OPENAI_DEPLOYMENT\")\n    api_version = _get_env(\"AZURE_OPENAI_API_VERSION\")\n\n    client = AzureOpenAI(\n        azure_endpoint=endpoint,\n        api_key=api_key,\n        api_version=api_version\n    )\n\n    # Keep prompt compact and structured.\n    system = (\n        \"You are Azure SRE Agent. You write concise, accurate operational change briefings. \"\n        \"Use only the provided events. If you are unsure, say you are unsure. \"\n        \"Do not invent outages or causes. 
\"\n        \"Output in Markdown with these sections: Summary, Notable Changes, Risk Notes, Suggested Next Checks.\"\n    )\n\n    user = {\n        \"time_window_hours\": hours,\n        \"activity_log_events\": events\n    }\n\n    resp = client.chat.completions.create(\n        model=deployment,\n        messages=[\n            {\"role\": \"system\", \"content\": system},\n            {\"role\": \"user\", \"content\": json.dumps(user)}\n        ],\n        temperature=0.2,\n        max_tokens=700\n    )\n\n    return resp.choices[0].message.content\n\ndef main(req: func.HttpRequest) -&gt; func.HttpResponse:\n    try:\n        subscription_id = _get_env(\"AZURE_SUBSCRIPTION_ID\")\n        resource_group = _get_env(\"TARGET_RESOURCE_GROUP\")\n\n        hours = int(req.params.get(\"hours\", \"24\"))\n        top = int(req.params.get(\"top\", \"25\"))\n\n        events = _get_activity_log_events(subscription_id, resource_group, hours=hours, top=top)\n\n        if not events:\n            return func.HttpResponse(\n                f\"No Activity Log events found in resource group '{resource_group}' for the last {hours} hours.\",\n                status_code=200\n            )\n\n        briefing_md = _summarize_with_azure_openai(events, hours=hours)\n\n        # Optional: post to Teams webhook if configured\n        teams_url = os.getenv(\"TEAMS_WEBHOOK_URL\")\n        if teams_url:\n            # Teams incoming webhooks expect a JSON payload with \"text\" (simple format)\n            # Consider Adaptive Cards for richer formatting (beyond scope).\n            requests.post(\n                teams_url,\n                json={\"text\": briefing_md},\n                timeout=15\n            )\n\n        return func.HttpResponse(briefing_md, status_code=200, mimetype=\"text\/markdown\")\n\n    except Exception as e:\n        logging.exception(\"Azure SRE Agent failure\")\n        return func.HttpResponse(f\"Error: {str(e)}\", status_code=500)\n<\/code><\/pre>\n\n\n\n<p 
class=\"wp-block-paragraph\">Create <code>host.json<\/code> in the project root:<\/p>\n\n\n\n<pre><code class=\"language-json\">{\n  \"version\": \"2.0\",\n  \"logging\": {\n    \"applicationInsights\": {\n      \"samplingSettings\": {\n        \"isEnabled\": true\n      }\n    }\n  }\n}\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">6.2 Add required app settings for scope<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">The function code expects:\n&#8211; <code>AZURE_SUBSCRIPTION_ID<\/code>\n&#8211; <code>TARGET_RESOURCE_GROUP<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Set them:<\/p>\n\n\n\n<pre><code class=\"language-bash\">SUBSCRIPTION_ID=$(az account show --query id -o tsv)\n\naz functionapp config appsettings set -n \"$FUNC\" -g \"$RG\" --settings \\\n  \"AZURE_SUBSCRIPTION_ID=$SUBSCRIPTION_ID\" \\\n  \"TARGET_RESOURCE_GROUP=$RG\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; The function knows which subscription and resource group to query.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">6.3 Zip deploy<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">From the project directory (<code>azure-sre-agent-func<\/code>), zip the folder\u2019s contents (not the folder itself) so that <code>host.json<\/code> sits at the root of the archive:<\/p>\n\n\n\n<pre><code class=\"language-bash\"># Archive the contents of the current directory; host.json must be at the zip root\nzip -r ..\/azure-sre-agent-func.zip .\ncd ..\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Deploy:<\/p>\n\n\n\n<pre><code class=\"language-bash\"># --build-remote true asks the platform to install requirements.txt during deployment\n# (verify the flag against your Azure CLI version)\naz functionapp deployment source config-zip \\\n  -g \"$RG\" -n \"$FUNC\" \\\n  --src azure-sre-agent-func.zip \\\n  --build-remote true\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; The function code is deployed to Azure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verify<\/strong>\nList functions (may take a minute after deployment):<\/p>\n\n\n\n<pre><code class=\"language-bash\">az functionapp function list -g \"$RG\" -n \"$FUNC\" -o table\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Get the function key and call the 
endpoint<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Fetch the default host name:<\/p>\n\n\n\n<pre><code class=\"language-bash\">HOST=$(az functionapp show -g \"$RG\" -n \"$FUNC\" --query defaultHostName -o tsv)\necho \"$HOST\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Get a function key (for the function named <code>ChangeBriefing<\/code>):<\/p>\n\n\n\n<pre><code class=\"language-bash\"># List function keys\naz functionapp function keys list \\\n  -g \"$RG\" -n \"$FUNC\" \\\n  --function-name ChangeBriefing\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Copy the <code>default<\/code> key value. Call the endpoint:<\/p>\n\n\n\n<pre><code class=\"language-bash\">KEY=\"&lt;PASTE_FUNCTION_DEFAULT_KEY&gt;\"\ncurl -s \"https:\/\/$HOST\/api\/ChangeBriefing?code=$KEY&amp;hours=24&amp;top=25\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; You receive a Markdown change briefing generated by Azure OpenAI based on Activity Log events.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use these checks to confirm the lab is working:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>You see Activity Log events<\/strong>\n   &#8211; Make a change in the resource group (e.g., create a storage container or update a tag), then call the function again.<\/p>\n<\/li>\n<li>\n<p><strong>The function returns Markdown<\/strong>\n   &#8211; Output contains sections: Summary, Notable Changes, Risk Notes, Suggested Next Checks.<\/p>\n<\/li>\n<li>\n<p><strong>The agent does not invent facts<\/strong>\n   &#8211; If events are limited, the output should be correspondingly sparse.<\/p>\n<\/li>\n<li>\n<p><strong>(Optional) Teams post works<\/strong>\n   &#8211; If you set <code>TEAMS_WEBHOOK_URL<\/code>, the message appears in the channel.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: 
<code>Missing required setting: AZURE_OPENAI_*<\/code><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cause<\/strong>: One or more required app settings are missing or empty.<\/li>\n<li><strong>Fix<\/strong>: Re-run <code>az functionapp config appsettings set<\/code> and confirm the values with <code>az functionapp config appsettings list<\/code>.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: 500 error with <code>DefaultAzureCredential<\/code> failures<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cause<\/strong>: The managed identity is not enabled, or the code is running outside Azure where no managed identity is available.<\/li>\n<li><strong>Fix<\/strong>:<\/li>\n<li>Ensure identity is assigned: <code>az functionapp identity show<\/code><\/li>\n<li>Restart Function App: <code>az functionapp restart -g \"$RG\" -n \"$FUNC\"<\/code><\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: Activity Log API returns 403<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cause<\/strong>: Missing RBAC permissions.<\/li>\n<li><strong>Fix<\/strong>:<\/li>\n<li>Ensure the Reader role assignment exists at the resource group scope.<\/li>\n<li>Wait a few minutes for RBAC propagation.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: Activity Log API returns 400 (filter\/api-version issues)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cause<\/strong>: OData filter formatting or API version mismatch.<\/li>\n<li><strong>Fix<\/strong>:<\/li>\n<li>Check the Activity Log REST API docs and update <code>api_version<\/code> if needed:<br\/>\n    https:\/\/learn.microsoft.com\/rest\/api\/monitor\/activity-logs\/list<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: Azure OpenAI call fails (401\/404)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cause<\/strong>: Wrong endpoint, key, API version, or deployment name.<\/li>\n<li><strong>Fix<\/strong>:<\/li>\n<li>Confirm the endpoint format in the Azure OpenAI resource overview.<\/li>\n<li>Confirm the deployment name matches exactly.<\/li>\n<li>Verify API version supported by your model in 
docs.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: Function not found after deployment<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cause<\/strong>: Packaging layout issue.<\/li>\n<li><strong>Fix<\/strong>: Ensure the zip includes:<\/li>\n<li><code>host.json<\/code> at the root of the deployed content<\/li>\n<li><code>ChangeBriefing\/function.json<\/code><\/li>\n<li><code>ChangeBriefing\/__init__.py<\/code><\/li>\n<li><code>requirements.txt<\/code><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To delete everything created in this lab:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az group delete -n \"$RG\" --yes --no-wait\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Also delete or rotate:\n&#8211; Azure OpenAI API keys used in app settings (if you created them)\n&#8211; Teams webhook URL (recreate if exposed)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with <strong>read-only<\/strong> scenarios (summaries, recommendations).<\/li>\n<li>Add tool actions only with:<\/li>\n<li>Allow-listed tools<\/li>\n<li>Minimal permissions<\/li>\n<li>Human approval for impactful actions<\/li>\n<li>Separate environments:<\/li>\n<li>Dev agent for prompt iteration<\/li>\n<li>Prod agent with strict policies and change control<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer <strong>Managed Identity<\/strong> over secrets wherever possible.<\/li>\n<li>Scope RBAC to the minimum:<\/li>\n<li>Resource group, not subscription (unless required)<\/li>\n<li>Use separate identities for:<\/li>\n<li>Reading telemetry<\/li>\n<li>Writing tickets\/notifications<\/li>\n<li>Regularly review role assignments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best 
practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enforce:<\/li>\n<li>Max tokens<\/li>\n<li>Rate limits<\/li>\n<li>Budget alerts on Azure OpenAI and monitoring workspaces<\/li>\n<li>Minimize prompt size:<\/li>\n<li>Send only summarized events (like the lab does)<\/li>\n<li>Avoid feeding raw logs to the LLM.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use a time window and <code>top N<\/code> pattern for events.<\/li>\n<li>Cache summaries if the same query repeats.<\/li>\n<li>Prefer event-driven triggers from alerts over frequent polling.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Make the agent degrade gracefully:<\/li>\n<li>If Azure OpenAI is unavailable, return a non-AI fallback summary.<\/li>\n<li>Add retry policies with backoff for transient errors.<\/li>\n<li>Keep clear timeouts for Azure API calls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Monitor the agent:<\/li>\n<li>Error rate<\/li>\n<li>Latency<\/li>\n<li>LLM failures<\/li>\n<li>Add runbooks for the agent itself:<\/li>\n<li>Key rotation<\/li>\n<li>RBAC audit<\/li>\n<li>Prompt\/template changes<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use consistent tags:<\/li>\n<li><code>app=azure-sre-agent<\/code><\/li>\n<li><code>env=dev|prod<\/code><\/li>\n<li><code>owner=&lt;team&gt;<\/code><\/li>\n<li><code>costCenter=&lt;id&gt;<\/code><\/li>\n<li>Use naming conventions aligned with Azure CAF where applicable:<\/li>\n<li>https:\/\/learn.microsoft.com\/azure\/cloud-adoption-framework\/ready\/azure-best-practices\/resource-naming<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Entra ID (Azure AD) + RBAC<\/strong> to control what the agent can read.<\/li>\n<li>Use Managed Identity for Azure API access.<\/li>\n<li>For Azure OpenAI authentication:<\/li>\n<li>Many setups use API keys; treat them as secrets.<\/li>\n<li>If Entra ID auth is supported for your configuration, prefer it (verify current docs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data in transit: HTTPS\/TLS by default.<\/li>\n<li>Data at rest:<\/li>\n<li>Storage, Key Vault, and monitoring data are encrypted by Azure by default (verify your compliance requirements).<\/li>\n<li>Consider customer-managed keys (CMK) where required (service-dependent; verify).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Minimize public exposure:<\/li>\n<li>Restrict Function App access (IP restrictions, private endpoints where feasible).<\/li>\n<li>Prefer private networking for model endpoints if supported in your region.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Don\u2019t store webhook URLs and API keys in code.<\/li>\n<li>Use:<\/li>\n<li>Function App settings (minimum)<\/li>\n<li>Key Vault references for production<\/li>\n<li>Rotate secrets regularly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Log:<\/li>\n<li>Who invoked the agent (if applicable)<\/li>\n<li>What scopes were queried<\/li>\n<li>High-level metadata (counts, timestamps)<\/li>\n<li>Avoid logging:<\/li>\n<li>Full prompts containing sensitive data<\/li>\n<li>Full LLM responses if they include sensitive details<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Determine whether operational data is sensitive in your org (it often is).<\/li>\n<li>Ensure data residency alignment:<\/li>\n<li>Where your Azure OpenAI resource is hosted<\/li>\n<li>Where logs are stored<\/li>\n<li>Implement redaction and minimization by design.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Granting subscription-wide Reader when only one RG is needed.<\/li>\n<li>Posting sensitive summaries into broad Teams channels.<\/li>\n<li>Logging raw incident payloads into Application Insights.<\/li>\n<li>Letting the agent trigger remediations automatically without approvals.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use private endpoints and VNet integration where required.<\/li>\n<li>Apply Azure Policy to enforce:<\/li>\n<li>Approved regions<\/li>\n<li>Private networking<\/li>\n<li>Required tags<\/li>\n<li>Use separate subscriptions for dev\/prod.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">13. Limitations and Gotchas<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because this is an SRE AI agent pattern on Azure, limitations come from both AI behavior and cloud service constraints.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>LLM hallucinations<\/strong>: The agent can generate plausible but incorrect statements. 
Mitigate with strict prompts, citations, and \u201cuse only provided data\u201d rules.<\/li>\n<li><strong>Model access\/quota<\/strong>: Azure OpenAI access and quotas vary by tenant\/region\/model (verify).<\/li>\n<li><strong>Latency<\/strong>: LLM calls add seconds; don\u2019t put the agent on ultra-low-latency critical paths.<\/li>\n<li><strong>Sensitive data risk<\/strong>: Logs and incident notes may contain secrets or personal data.<\/li>\n<li><strong>RBAC propagation delay<\/strong>: New role assignments may take minutes before they work.<\/li>\n<li><strong>Activity Log filtering quirks<\/strong>: OData filters and API versions can be finicky; verify against REST docs if errors occur.<\/li>\n<li><strong>Costs can spike<\/strong>: Frequent summaries and long prompts increase token usage quickly.<\/li>\n<li><strong>Operational ownership<\/strong>: The agent itself becomes a service\u2014monitor it, version it, and document it.<\/li>\n<li><strong>Partial observability<\/strong>: If your logs aren\u2019t centralized or consistent, the agent\u2019s output will be incomplete.<\/li>\n<li><strong>Teams webhook governance<\/strong>: Webhooks can be disabled by org policy; confirm with your M365 administrators.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">14. 
Comparison with Alternatives<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Azure SRE Agent (as an Azure-implemented pattern) overlaps with\u2014but does not replace\u2014several products.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Azure SRE Agent (this pattern)<\/strong><\/td>\n<td>Teams wanting a customizable SRE AI assistant on Azure<\/td>\n<td>Tailored prompts\/guardrails; integrates with your Azure telemetry; flexible<\/td>\n<td>You must build\/operate it; AI risk; composite pricing<\/td>\n<td>When you need customization and Azure-native integration<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Monitor<\/strong><\/td>\n<td>Core observability (metrics, logs, alerts)<\/td>\n<td>Mature monitoring platform; alerting and dashboards<\/td>\n<td>Not an AI agent by itself<\/td>\n<td>Always\u2014this is the foundation for ops signals<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Automation (runbooks)<\/strong><\/td>\n<td>Deterministic automation<\/td>\n<td>Reliable, auditable automation<\/td>\n<td>Not intelligent; no natural language<\/td>\n<td>When you need safe remediation workflows<\/td>\n<\/tr>\n<tr>\n<td><strong>Microsoft Sentinel (with automation)<\/strong><\/td>\n<td>Security operations &amp; SIEM\/SOAR<\/td>\n<td>Strong security analytics; playbooks<\/td>\n<td>Security-focused, not SRE-first<\/td>\n<td>When incidents are security-centric or you need SIEM<\/td>\n<\/tr>\n<tr>\n<td><strong>Copilot for Azure<\/strong><\/td>\n<td>Assisted Azure management in the portal<\/td>\n<td>Integrated experience for Azure ops tasks<\/td>\n<td>Scope and capabilities vary; may not be customizable<\/td>\n<td>When your org standardizes on Microsoft Copilot experiences (verify features)<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS DevOps Guru<\/strong><\/td>\n<td>Managed 
anomaly detection on AWS<\/td>\n<td>Purpose-built AIOps for AWS<\/td>\n<td>AWS-only<\/td>\n<td>When workloads are on AWS and you want a managed AIOps service<\/td>\n<\/tr>\n<tr>\n<td><strong>Google Cloud Operations + AIOps<\/strong><\/td>\n<td>Ops signals on GCP<\/td>\n<td>Tight GCP integration<\/td>\n<td>GCP-only<\/td>\n<td>When workloads are primarily on GCP<\/td>\n<\/tr>\n<tr>\n<td><strong>Open-source (e.g., self-hosted agent + Prometheus\/Grafana + LLM)<\/strong><\/td>\n<td>Maximum control, self-managed<\/td>\n<td>Full customization; cloud-agnostic<\/td>\n<td>High ops burden; security and scaling complexity<\/td>\n<td>When you need on-prem\/self-host control and have platform capacity<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example (regulated, multi-team)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A financial services company runs dozens of Azure subscriptions with strict change control. 
Incidents often correlate with policy\/RBAC\/network changes, but triage is slow.<\/li>\n<li><strong>Proposed architecture<\/strong><ul>\n<li>Central Log Analytics workspaces per landing zone<\/li>\n<li>Azure Monitor alerts routed to an agent orchestrator (Functions\/Container Apps)<\/li>\n<li>Azure SRE Agent reads:<ul>\n<li>Activity Log across approved subscriptions<\/li>\n<li>Selected KQL queries (only metadata, no sensitive payloads)<\/li>\n<\/ul>\n<\/li>\n<li>Azure OpenAI in an approved region with private networking (verify feasibility)<\/li>\n<li>Outputs posted to a restricted Teams incident channel and ITSM ticket drafts<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why this service\/pattern was chosen<\/strong><ul>\n<li>Custom RBAC scoping and strong audit requirements<\/li>\n<li>Need for standardized incident briefs across many teams<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong><ul>\n<li>Faster \u201cwhat changed\u201d identification<\/li>\n<li>More consistent incident updates<\/li>\n<li>Reduced on-call fatigue with curated summaries<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example (speed-focused)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A startup runs an AKS cluster and App Service APIs. On-call is part-time; outages are painful and slow to diagnose.<\/li>\n<li><strong>Proposed architecture<\/strong><ul>\n<li>Azure Monitor alerts to a simple Azure Function<\/li>\n<li>Agent summarizes recent changes + top alert details<\/li>\n<li>Posts to a Teams channel and creates a lightweight incident note<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why this service\/pattern was chosen<\/strong><ul>\n<li>Low operational overhead (serverless)<\/li>\n<li>Quick time-to-value with minimal integrations<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong><ul>\n<li>Faster first response<\/li>\n<li>Shared context without needing a dedicated SRE team<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">16. 
FAQ<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Is Azure SRE Agent an official Azure product?<\/strong><br\/>\n   I cannot confirm from official sources (as of 2025-08 knowledge cutoff) that \u201cAzure SRE Agent\u201d is a standalone GA Azure product with a dedicated pricing page. <strong>Verify in official docs<\/strong>. This tutorial treats it as an Azure-deployed SRE AI agent pattern.<\/p>\n<\/li>\n<li>\n<p><strong>What Azure services do I need at minimum?<\/strong><br\/>\n   Typically: Azure Functions (or equivalent), Azure OpenAI (or another Azure-hosted model endpoint), and at least one data source like Azure Activity Log.<\/p>\n<\/li>\n<li>\n<p><strong>Can I build Azure SRE Agent without Azure OpenAI?<\/strong><br\/>\n   You can build the data collection and formatting parts, but you\u2019ll lose the LLM summarization. If you use another model provider, confirm compliance and networking constraints.<\/p>\n<\/li>\n<li>\n<p><strong>Does it replace Azure Monitor?<\/strong><br\/>\n   No. Azure Monitor is the system of record for telemetry; the agent is a consumer and summarizer.<\/p>\n<\/li>\n<li>\n<p><strong>How do I prevent hallucinations?<\/strong><br\/>\n   You can\u2019t eliminate them entirely. Reduce risk with strict prompts (\u201cuse only provided data\u201d), structured output, low temperature, citations\/links, and human review.<\/p>\n<\/li>\n<li>\n<p><strong>Should the agent run automatic remediation?<\/strong><br\/>\n   Usually start with read-only. If you add remediation, use allow-listed actions, approval gates, and safe rollbacks.<\/p>\n<\/li>\n<li>\n<p><strong>What data should I avoid sending to the model?<\/strong><br\/>\n   Secrets, credentials, personal data, and raw payload logs. Prefer aggregated metrics, event metadata, and redacted summaries.<\/p>\n<\/li>\n<li>\n<p><strong>How do I scope access safely?<\/strong><br\/>\n   Use Managed Identity and RBAC scoped to resource groups\/workspaces. 
Avoid subscription-wide roles unless necessary.<\/p>\n<\/li>\n<li>\n<p><strong>Can it summarize Log Analytics queries (KQL)?<\/strong><br\/>\n   Yes, but be careful: keep queries targeted and don\u2019t export sensitive log bodies to the prompt.<\/p>\n<\/li>\n<li>\n<p><strong>How do I control cost?<\/strong><br\/>\n   Limit calls, limit tokens, keep prompts small, use event triggers, and monitor usage. Put budgets on Azure OpenAI where possible.<\/p>\n<\/li>\n<li>\n<p><strong>What\u2019s the best trigger: polling or alerts?<\/strong><br\/>\n   Alerts\/event-driven triggers are usually better. Polling can be simpler but can waste cost and add noise.<\/p>\n<\/li>\n<li>\n<p><strong>How do I integrate with Teams securely?<\/strong><br\/>\n   Store the webhook URL in Key Vault or protected app settings, restrict channel membership, and avoid posting sensitive details.<\/p>\n<\/li>\n<li>\n<p><strong>Can it work across multiple subscriptions?<\/strong><br\/>\n   Yes, if RBAC is granted at each approved scope. Centralize logging\/monitoring to reduce complexity.<\/p>\n<\/li>\n<li>\n<p><strong>How do I test safely in production?<\/strong><br\/>\n   Use a \u201cshadow mode\u201d: generate summaries without posting broadly, compare to human triage, then gradually roll out.<\/p>\n<\/li>\n<li>\n<p><strong>What should I log for auditing?<\/strong><br\/>\n   Invocation time, scopes queried, counts of events, and success\/failure. Avoid logging sensitive prompts and responses.<\/p>\n<\/li>\n<li>\n<p><strong>Does this require Azure AI Studio \/ Azure AI Foundry?<\/strong><br\/>\n   Not strictly. You can call Azure OpenAI directly from code. UIs and naming can change; <strong>verify<\/strong> the current Azure AI product surface.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Azure SRE Agent<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because \u201cAzure SRE Agent\u201d may not have a canonical documentation page by that exact name, the most practical learning resources are the underlying official Azure services used to implement it.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Azure OpenAI Service docs \u2014 https:\/\/learn.microsoft.com\/azure\/ai-services\/openai\/<\/td>\n<td>Core reference for models, authentication, quotas, and API usage<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Azure OpenAI pricing \u2014 https:\/\/azure.microsoft.com\/pricing\/details\/cognitive-services\/openai-service\/<\/td>\n<td>Understand token-based billing and model\/SKU differences<\/td>\n<\/tr>\n<tr>\n<td>Official documentation<\/td>\n<td>Azure Functions docs \u2014 https:\/\/learn.microsoft.com\/azure\/azure-functions\/<\/td>\n<td>Build serverless orchestration for the agent<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Azure Functions pricing \u2014 https:\/\/azure.microsoft.com\/pricing\/details\/functions\/<\/td>\n<td>Estimate execution\/compute cost<\/td>\n<\/tr>\n<tr>\n<td>Official documentation<\/td>\n<td>Azure Monitor overview \u2014 https:\/\/learn.microsoft.com\/azure\/azure-monitor\/<\/td>\n<td>Foundation for metrics\/logs\/alerts used by the agent<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Azure Monitor pricing \u2014 https:\/\/azure.microsoft.com\/pricing\/details\/monitor\/<\/td>\n<td>Understand log ingestion and retention costs<\/td>\n<\/tr>\n<tr>\n<td>REST API reference<\/td>\n<td>Activity Logs REST API \u2014 https:\/\/learn.microsoft.com\/rest\/api\/monitor\/activity-logs\/list<\/td>\n<td>Authoritative reference for querying Azure Activity Log programmatically<\/td>\n<\/tr>\n<tr>\n<td>Official 
guidance<\/td>\n<td>Azure resource naming best practices \u2014 https:\/\/learn.microsoft.com\/azure\/cloud-adoption-framework\/ready\/azure-best-practices\/resource-naming<\/td>\n<td>Governance foundation for production deployments<\/td>\n<\/tr>\n<tr>\n<td>Official tool<\/td>\n<td>Azure Pricing Calculator \u2014 https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/td>\n<td>Build realistic environment cost estimates<\/td>\n<\/tr>\n<tr>\n<td>Official guidance<\/td>\n<td>Azure Architecture Center \u2014 https:\/\/learn.microsoft.com\/azure\/architecture\/<\/td>\n<td>Patterns for reliability, monitoring, and secure architecture on Azure<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps\/SRE engineers, platform teams<\/td>\n<td>DevOps, SRE practices, automation, cloud ops<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>DevOps beginners to intermediate<\/td>\n<td>SCM, CI\/CD, DevOps foundations<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud ops practitioners<\/td>\n<td>Cloud operations, monitoring, reliability<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs and operations teams<\/td>\n<td>SRE principles, incident response, reliability engineering<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops + AI practitioners<\/td>\n<td>AIOps concepts, applying ML\/LLMs to operations<\/td>\n<td>Check 
website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">19. Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/SRE coaching and guidance (verify offerings)<\/td>\n<td>Individuals and teams seeking hands-on mentoring<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training resources (verify course list)<\/td>\n<td>Beginners to working professionals<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps\/SRE services and training (verify)<\/td>\n<td>Teams needing short-term expert help<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support\/training resources (verify)<\/td>\n<td>Ops teams needing practical support<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company Name<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>DevOps\/SRE\/cloud consulting (verify service catalog)<\/td>\n<td>Platform engineering, automation, operational maturity<\/td>\n<td>Observability setup, CI\/CD improvements, SRE process design<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps\/SRE consulting &amp; enablement (verify)<\/td>\n<td>Training + implementation support<\/td>\n<td>Building monitoring strategy, incident response playbooks, agent implementation guidance<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify offerings)<\/td>\n<td>DevOps transformation, toolchain integration<\/td>\n<td>CI\/CD pipeline modernization, governance, reliability improvements<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure fundamentals:<ul>\n<li>Resource groups, subscriptions, ARM concepts<\/li>\n<li>RBAC and Managed Identity<\/li>\n<\/ul>\n<\/li>\n<li>Observability fundamentals:<ul>\n<li>Metrics vs logs vs traces<\/li>\n<li>Alerting concepts (SLOs, SLIs)<\/li>\n<\/ul>\n<\/li>\n<li>SRE foundations:<ul>\n<li>Incident management, postmortems, error budgets<\/li>\n<\/ul>\n<\/li>\n<li>Basic Python or scripting for automation<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Advanced Azure Monitor:<ul>\n<li>Log Analytics workspace design<\/li>\n<li>KQL proficiency<\/li>\n<li>Alert routing and action groups<\/li>\n<\/ul>\n<\/li>\n<li>Secure AI architecture:<ul>\n<li>Data minimization, redaction, governance<\/li>\n<li>Private networking patterns<\/li>\n<\/ul>\n<\/li>\n<li>Agentic operations patterns:<ul>\n<li>Tool allow-lists, approvals, policy-as-code for actions<\/li>\n<\/ul>\n<\/li>\n<li>Platform engineering:<ul>\n<li>Multi-subscription landing zones, centralized logging<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Site Reliability Engineer (SRE)<\/li>\n<li>DevOps Engineer<\/li>\n<li>Platform Engineer<\/li>\n<li>Cloud Operations Engineer<\/li>\n<li>Cloud Architect (reliability and governance)<\/li>\n<li>Security Engineer (if integrating ops + sec signals)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">There is no specific \u201cAzure SRE Agent\u201d certification known here. 
Practical related paths include:\n&#8211; Azure Administrator (AZ-104)\n&#8211; Azure Developer (AZ-204)\n&#8211; Azure Solutions Architect (AZ-305)\n&#8211; DevOps Engineer Expert (AZ-400)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">(Verify current certification codes and requirements in Microsoft Learn.)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An \u201cincident briefing\u201d bot triggered by an Azure Monitor alert.<\/li>\n<li>A \u201cchange risk\u201d reporter that highlights risky operations (NSG changes, role assignments).<\/li>\n<li>A postmortem draft generator fed by a curated incident timeline.<\/li>\n<li>A KQL assistant with strict query templates and workspace scoping.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>SRE (Site Reliability Engineering)<\/strong>: Discipline focused on reliability, automation, and operating services at scale.<\/li>\n<li><strong>MTTA\/MTTR<\/strong>: Mean time to acknowledge \/ mean time to resolve.<\/li>\n<li><strong>Azure Activity Log<\/strong>: Azure control-plane event log showing subscription\/resource operations.<\/li>\n<li><strong>Azure Monitor<\/strong>: Azure platform for metrics, logs, alerts, and monitoring.<\/li>\n<li><strong>Log Analytics<\/strong>: Workspace-based log store queried using KQL.<\/li>\n<li><strong>KQL (Kusto Query Language)<\/strong>: Query language for Azure Monitor Logs \/ Log Analytics.<\/li>\n<li><strong>Managed Identity<\/strong>: Azure identity for resources to access Azure services without secrets.<\/li>\n<li><strong>RBAC<\/strong>: Role-Based Access Control for scoping permissions.<\/li>\n<li><strong>LLM<\/strong>: Large Language Model used for summarization and natural language outputs.<\/li>\n<li><strong>Prompt<\/strong>: Input instructions\/context provided to an LLM.<\/li>\n<li><strong>Guardrails<\/strong>: Controls that restrict what an 
agent can access and do (scope, tools, policies).<\/li>\n<li><strong>Teams Incoming Webhook<\/strong>: Simple HTTPS endpoint for posting messages into a Teams channel.<\/li>\n<li><strong>Application Insights<\/strong>: APM and telemetry service for app monitoring (part of Azure Monitor).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Azure SRE Agent (as implemented in this guide) is an <strong>Azure-native AI + Machine Learning pattern<\/strong> for building an SRE assistant that <strong>summarizes operational signals, accelerates triage, and standardizes incident communications<\/strong>. It fits alongside\u2014rather than replacing\u2014<strong>Azure Monitor<\/strong>, using Azure\u2019s observability data plus an LLM endpoint (commonly Azure OpenAI) to produce actionable briefings.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Key takeaways:\n&#8211; <strong>Cost<\/strong> is mainly driven by LLM usage (tokens) and monitoring ingestion\u2014control it with short prompts, fewer calls, and strict limits.\n&#8211; <strong>Security<\/strong> depends on strict RBAC scoping, managed identities, careful secrets handling, and data minimization.\n&#8211; Use it when you need <strong>faster, more consistent operational understanding<\/strong> and can support AI governance.\n&#8211; Next step: expand from Activity Log briefings to <strong>alert-triggered incident briefs<\/strong> and <strong>guardrailed Log Analytics\/KQL enrichment<\/strong>, then introduce human-approved automation actions.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI + Machine 
Learning<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3,40,33],"tags":[],"class_list":["post-372","post","type-post","status-publish","format-standard","hentry","category-ai-machine-learning","category-azure","category-management-and-governance"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/372","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=372"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/372\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=372"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=372"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=372"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}