{"id":237,"date":"2026-04-13T07:49:16","date_gmt":"2026-04-13T07:49:16","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/aws-amazon-devops-guru-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-machine-learning-ml-and-artificial-intelligence-ai\/"},"modified":"2026-04-13T07:49:16","modified_gmt":"2026-04-13T07:49:16","slug":"aws-amazon-devops-guru-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-machine-learning-ml-and-artificial-intelligence-ai","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/aws-amazon-devops-guru-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-machine-learning-ml-and-artificial-intelligence-ai\/","title":{"rendered":"AWS Amazon DevOps Guru Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Machine Learning (ML) and Artificial Intelligence (AI)"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Machine Learning (ML) and Artificial Intelligence (AI)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What this service is<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru is an AWS managed operations service that uses Machine Learning (ML) to detect anomalous behavior in your AWS workloads, surface likely root causes, and recommend remediation actions. It is designed for DevOps and SRE teams who want earlier detection of issues and faster mean time to resolution (MTTR) without building a full in-house AIOps pipeline.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Simple explanation (one paragraph)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You enable Amazon DevOps Guru for the AWS resources that make up an application (for example, resources in one or more AWS CloudFormation stacks or resources tagged as part of an app). 
DevOps Guru then watches telemetry such as metrics (and optional integrations like logs\/traces where supported), detects unusual behavior, and generates \u201cinsights\u201d that tell you what\u2019s wrong and what to do next\u2014plus it can notify you through channels like Amazon SNS.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Technical explanation (one paragraph)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru applies ML models to operational data from supported AWS services to identify statistically significant deviations from learned baselines, correlate anomalies across related resources, and present findings as insights with context and recommendations. It is an opinionated, managed AIOps layer that sits on top of existing observability data sources (for example Amazon CloudWatch metrics), and it focuses on proactive anomaly detection and diagnosis rather than raw telemetry storage or visualization.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What problem it solves<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Teams often have plenty of telemetry (metrics, logs, traces) but still struggle with:\n&#8211; Alert fatigue from noisy threshold alarms\n&#8211; Slow correlation across multiple resources during incidents\n&#8211; Missed early warning signals that don\u2019t cross hard thresholds\n&#8211; Long time-to-triage because \u201cwhat changed?\u201d is unclear<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru addresses those gaps by detecting anomalies, correlating them, and generating actionable insights with recommendations\u2014reducing the manual effort of triage and speeding up operations response.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
What is Amazon DevOps Guru?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru is an AWS service that helps you improve application availability and operational performance by using ML to detect operational issues and provide recommendations for remediation. (For the most current positioning and supported integrations, verify in the official product documentation.)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru commonly provides these core capabilities:\n&#8211; <strong>ML-based anomaly detection<\/strong> across operational signals for supported AWS resources\n&#8211; <strong>Insights<\/strong> (summaries of detected issues) with context, impacted resources, and recommended actions\n&#8211; <strong>Correlation<\/strong> of related anomalies\/events to reduce the \u201cneedle in a haystack\u201d problem\n&#8211; <strong>Notifications<\/strong> through supported channels (commonly Amazon SNS) so insights reach responders quickly\n&#8211; <strong>Resource grouping<\/strong> so you can monitor an application\u2019s resources together (for example by CloudFormation stack membership or tags)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (conceptual model)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Resource collections<\/strong>: A logical grouping of AWS resources that represent an application or workload. Common ways include CloudFormation stacks or tag-based grouping (confirm current options in docs for your region).<\/li>\n<li><strong>Insights<\/strong>: The primary output. 
Insights typically describe what\u2019s happening, when it started, which resources are involved, and what to do next.<\/li>\n<li><strong>Anomalies \/ signals<\/strong>: The underlying unusual patterns that were detected (for example a spike in errors, latency, throttling, or resource pressure), which are correlated into insights.<\/li>\n<li><strong>Recommendations<\/strong>: Prescriptive guidance that points to likely remediations (configuration changes, scaling actions, best practices).<\/li>\n<li><strong>Notification channels<\/strong>: Mechanisms to push insights to humans or systems (often Amazon SNS; downstream integrations can include ChatOps or ticket creation via other AWS services).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Type<\/strong>: Fully managed AWS service (AIOps \/ operational intelligence) that consumes telemetry and emits insights and recommendations.<\/li>\n<li><strong>Operating model<\/strong>: You enable it and configure scope; AWS runs the detection and analysis.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scope (regional\/global\/account\/project)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru is a <strong>regional<\/strong> service that you enable <strong>per AWS account and per AWS Region<\/strong>; it monitors supported resources in that Region. 
Multi-account approaches (for example through AWS Organizations) may be available depending on current service features; verify the latest \u201cOrganizations\u201d support in the official documentation for your environment.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the AWS ecosystem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru complements, rather than replaces, core observability and operations services:\n&#8211; <strong>Amazon CloudWatch<\/strong>: metrics, logs, alarms, dashboards; DevOps Guru can analyze CloudWatch metrics and related signals.\n&#8211; <strong>AWS X-Ray<\/strong> (where integrated): distributed tracing data can help correlate app-level latency\/errors.\n&#8211; <strong>AWS Systems Manager<\/strong> (where integrated): operations workflows (for example OpsCenter and runbooks) can be used to operationalize remediation.\n&#8211; <strong>Amazon SNS<\/strong>: push insights to email, SMS, or HTTP endpoints, or fan them out to automation.\n&#8211; <strong>AWS CloudTrail \/ AWS Config<\/strong> (indirectly useful): change tracking for incident correlation and governance (exact ingestion sources for DevOps Guru can vary; verify in the docs).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Amazon DevOps Guru?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Reduce downtime and customer impact<\/strong>: Early anomaly detection can identify issues before they become full outages.<\/li>\n<li><strong>Lower operational cost<\/strong>: Less time spent manually correlating graphs, alarms, and changes.<\/li>\n<li><strong>Improve SLA\/SLO performance<\/strong>: Faster detection and diagnosis improves incident response outcomes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Baseline-driven detection<\/strong>: ML-driven baselining can catch \u201cweird\u201d behavior that never crosses fixed thresholds.<\/li>\n<li><strong>Cross-resource correlation<\/strong>: Helps connect the dots between symptoms (for example increased latency) and potential causes (for example saturation, throttling, downstream dependency issues).<\/li>\n<li><strong>Actionable recommendations<\/strong>: Provides suggested remediations rather than only raising alerts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster triage<\/strong>: Insights can quickly narrow the blast radius and identify likely culprits.<\/li>\n<li><strong>Less alert fatigue<\/strong>: Shifts from dozens of alarms to fewer, higher-signal insights (you still need alarms for hard limits and paging, but insights can reduce noise).<\/li>\n<li><strong>Standardization<\/strong>: Gives platform teams a consistent approach across applications that follow tagging or CloudFormation conventions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Centralized operational visibility<\/strong> can support governance (for example, consistent monitoring across critical workloads).<\/li>\n<li><strong>Auditable 
actions<\/strong>: Notifications and follow-on automation can be logged via CloudTrail, Systems Manager, and ticket systems (depending on your integration).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Works across modern architectures<\/strong>: Microservices, event-driven, and managed-service-heavy stacks can produce huge telemetry volume; DevOps Guru focuses on analysis rather than storage\/visualization.<\/li>\n<li><strong>Adaptive to varying workloads<\/strong>: Baselines can help for services with daily\/weekly patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Amazon DevOps Guru<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Choose Amazon DevOps Guru when:\n&#8211; You run production workloads on AWS and already rely on CloudWatch telemetry.\n&#8211; Your incident triage requires too much human correlation across multiple services.\n&#8211; You want an AWS-native AIOps signal without deploying and operating a separate AIOps platform.\n&#8211; You use CloudFormation stacks and\/or consistent resource tags so you can define \u201capplications\u201d cleanly.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru may not be the best fit when:\n&#8211; <strong>Your workloads are mostly off AWS<\/strong> (DevOps Guru focuses on AWS resources).\n&#8211; <strong>You need deep custom analytics over raw logs<\/strong> (that\u2019s typically a log analytics platform job; DevOps Guru is insight-oriented).\n&#8211; <strong>You need deterministic alerting only<\/strong> (CloudWatch alarms and Synthetics are straightforward for known thresholds and checks).\n&#8211; <strong>Your application resources are not well grouped<\/strong> (no CloudFormation, inconsistent tags). 
You can fix this, but without grouping the service is harder to operationalize.\n&#8211; <strong>You need a single-pane-of-glass across multiple clouds<\/strong> (consider third-party observability\/AIOps tools).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. Where is Amazon DevOps Guru used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru is commonly useful in any industry that runs customer-facing production systems on AWS, including:\n&#8211; SaaS and software platforms\n&#8211; E-commerce and digital marketplaces\n&#8211; Financial services (with careful compliance controls)\n&#8211; Media and streaming\n&#8211; Gaming\n&#8211; Healthcare and life sciences (especially for operational reliability)\n&#8211; Logistics, manufacturing, and IoT backends<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SRE and platform engineering teams<\/li>\n<li>DevOps teams<\/li>\n<li>Cloud operations \/ NOC teams<\/li>\n<li>Application owners (with shared responsibility for operations)<\/li>\n<li>Security and compliance teams (for monitoring consistency and auditability, not as a security detection tool)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microservices on containers or serverless<\/li>\n<li>Web apps with autoscaling and managed databases<\/li>\n<li>Event-driven architectures with queues\/streams<\/li>\n<li>Data processing pipelines with periodic spikes<\/li>\n<li>Multi-tier architectures where correlation is hard (app + cache + database + messaging)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>CloudFormation-managed stacks (common for clean application boundaries)<\/li>\n<li>Tagging-based application boundaries<\/li>\n<li>Multi-account landing zones (central ops teams monitoring 
multiple application accounts; verify best practice patterns for your organization)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production<\/strong>: Most valuable in production where patterns and baselines exist and where incident cost is high.<\/li>\n<li><strong>Dev\/test<\/strong>: Useful for validating operational readiness and spotting regressions, but insights may be less meaningful if traffic patterns are inconsistent or too low to establish baselines.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Below are realistic use cases. For each, focus on what DevOps Guru contributes: anomaly detection, correlation, and recommendations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Detect rising application error rates before paging thresholds<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: 5xx errors start climbing but remain below an alarm threshold; users start complaining before on-call is paged.<\/li>\n<li><strong>Why it fits<\/strong>: DevOps Guru can detect abnormal changes relative to baseline, not just fixed thresholds.<\/li>\n<li><strong>Scenario<\/strong>: A new deployment causes intermittent 502 errors; DevOps Guru detects the deviation and raises an insight.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Correlate latency spikes with downstream resource saturation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: API latency spikes occur; dashboards show many possible culprits.<\/li>\n<li><strong>Why it fits<\/strong>: Correlation across resources helps connect latency symptoms with underlying constraints.<\/li>\n<li><strong>Scenario<\/strong>: Increased p95 latency correlates with higher DB load and connection pressure; DevOps Guru points at the likely 
hotspot.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Spot throttling and concurrency pressure in serverless workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Lambda throttles increase, causing retries and user-visible slowdowns.<\/li>\n<li><strong>Why it fits<\/strong>: DevOps Guru can detect unusual throttle patterns and surface recommendations.<\/li>\n<li><strong>Scenario<\/strong>: A scheduled job overlaps with peak traffic; throttling jumps and DevOps Guru highlights concurrency as the issue.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Identify unhealthy scaling behavior in Auto Scaling groups<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Scaling oscillates (scale out\/in repeatedly), increasing cost and instability.<\/li>\n<li><strong>Why it fits<\/strong>: Baseline-based detection can catch unstable patterns and highlight related signals (CPU, request rate, errors).<\/li>\n<li><strong>Scenario<\/strong>: A misconfigured scaling policy causes thrash; DevOps Guru surfaces the anomaly and likely remediation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Detect database performance regressions (where supported integrations apply)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: DB response time degrades after a schema change or query regression.<\/li>\n<li><strong>Why it fits<\/strong>: DevOps Guru can integrate with supported AWS database performance signals (verify exact database coverage in docs).<\/li>\n<li><strong>Scenario<\/strong>: Aurora performance degrades due to a new query plan; DevOps Guru highlights DB pressure and suggests next steps.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Reduce MTTR during incidents by summarizing impact and timeline<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: During an incident, teams lose time assembling \u201cwhat happened when\u201d from many 
dashboards.<\/li>\n<li><strong>Why it fits<\/strong>: DevOps Guru insights provide a summary and related anomalies in one place.<\/li>\n<li><strong>Scenario<\/strong>: An availability incident spans API, queue backlog, and DB; DevOps Guru groups related anomalies into one narrative.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Support post-incident reviews with a consistent insight record<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Postmortems lack consistent telemetry and context.<\/li>\n<li><strong>Why it fits<\/strong>: Insights can be used as a structured input for incident timelines and contributing factors.<\/li>\n<li><strong>Scenario<\/strong>: The on-call uses the insight record to document start time, affected resources, and symptoms.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Monitor multi-service architectures where manual dashboards don\u2019t scale<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Each microservice has its own dashboard; teams can\u2019t keep up with cross-service dependencies.<\/li>\n<li><strong>Why it fits<\/strong>: DevOps Guru focuses on anomalies and correlation rather than per-service dashboarding.<\/li>\n<li><strong>Scenario<\/strong>: A downstream queue backlog drives upstream timeouts; DevOps Guru flags correlated anomalies across both components.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Improve on-call experience with routed notifications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Insights exist but responders don\u2019t see them quickly.<\/li>\n<li><strong>Why it fits<\/strong>: SNS notifications allow routing to email, ChatOps, or incident tooling.<\/li>\n<li><strong>Scenario<\/strong>: Insights are routed to an SNS topic, which triggers Lambda to create a ticket and notify Slack.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Detect configuration-change-related instability 
(indirectly)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A change (deployment or config update) introduces instability, but the link between the change and the symptoms is unclear.<\/li>\n<li><strong>Why it fits<\/strong>: DevOps Guru correlates anomalies and may surface related operational events (exact event sources vary; verify in the docs).<\/li>\n<li><strong>Scenario<\/strong>: After a change window, error rates climb; DevOps Guru highlights affected resources and the nature of the anomaly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) Standardize operational monitoring across teams via tagging\/CloudFormation boundaries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Monitoring coverage varies across teams; some apps are \u201cinvisible.\u201d<\/li>\n<li><strong>Why it fits<\/strong>: Resource collections let platform teams define a standard approach: \u201cEvery app must be tagged or CloudFormation-managed.\u201d<\/li>\n<li><strong>Scenario<\/strong>: A platform team enforces tagging rules and uses those tags to include workloads in DevOps Guru monitoring.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Early detection of cost-impacting performance issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Performance issues cause retries, scaling, and higher spend.<\/li>\n<li><strong>Why it fits<\/strong>: Detecting anomalies earlier can shorten the period of wasted spend.<\/li>\n<li><strong>Scenario<\/strong>: A sudden jump in retries increases request volume and compute usage; DevOps Guru flags the change for faster remediation.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Feature availability can vary by Region and by the AWS services you use. 
Always confirm in the official documentation for Amazon DevOps Guru.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">1) Resource collections (application grouping)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Lets you scope monitoring to the resources that represent an application\/workload, commonly using CloudFormation stack membership and\/or tags.<\/li>\n<li><strong>Why it matters<\/strong>: Clear boundaries reduce noise and make insights actionable to the owning team.<\/li>\n<li><strong>Practical benefit<\/strong>: \u201cApp A\u201d on-call sees insights about App A, not unrelated shared infrastructure.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>If your tagging is inconsistent or stacks don\u2019t represent app boundaries, insights may be less useful.<\/li>\n<li>Shared resources (like shared databases) can complicate ownership\u2014define conventions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) ML-based anomaly detection<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Learns baselines of \u201cnormal\u201d and detects deviations.<\/li>\n<li><strong>Why it matters<\/strong>: Many incidents begin as subtle deviations that don\u2019t exceed static thresholds.<\/li>\n<li><strong>Practical benefit<\/strong>: Detect slow regressions, unusual spikes, and emergent behavior.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>Needs sufficient signal history and meaningful traffic patterns to establish baselines.<\/li>\n<li>In dev\/test or low-traffic apps, anomaly detection may be less reliable.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Insights (operational findings)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Generates \u201cinsights\u201d that summarize anomalies, affected resources, and recommended actions.<\/li>\n<li><strong>Why it matters<\/strong>: Reduces time to triage by presenting a coherent 
operational story.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster \u201cwhat changed and where?\u201d during incidents.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>Recommendations are guidance, not guaranteed fixes. Validate against your context.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Correlation across signals and resources<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Associates related anomalies\/events to reduce noise and help identify root causes.<\/li>\n<li><strong>Why it matters<\/strong>: Complex systems fail in multi-symptom patterns.<\/li>\n<li><strong>Practical benefit<\/strong>: You investigate one insight rather than 20 separate alarms.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>Correlation quality depends on the telemetry and service integrations available for your stack.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Recommendations and operational guidance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Provides recommended actions aligned with common AWS operational best practices.<\/li>\n<li><strong>Why it matters<\/strong>: Less experienced teams get a \u201cnext step,\u201d and experienced teams triage faster.<\/li>\n<li><strong>Practical benefit<\/strong>: Shorter time from detection to remediation.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>Some recommendations may be generic; always validate and test changes safely.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Notifications via Amazon SNS (and downstream integrations)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Pushes insight notifications to an SNS topic; you can fan out to email\/SMS\/HTTP endpoints or automation.<\/li>\n<li><strong>Why it matters<\/strong>: Insights only help if responders see them quickly.<\/li>\n<li><strong>Practical benefit<\/strong>: Integrate with Slack\/MS Teams (commonly 
via AWS Chatbot), PagerDuty (via webhook\/bridge), ticketing, or custom workflows.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>SNS itself is reliable, but downstream delivery and message formatting are your responsibility.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Optional integrations with other telemetry sources (where supported)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: In some environments, DevOps Guru can use additional sources beyond metrics (for example, traces or database performance signals).<\/li>\n<li><strong>Why it matters<\/strong>: Richer signals improve correlation and diagnosis.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster identification of whether an issue is app-level, dependency-level, or infrastructure-level.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>Availability depends on your AWS services and Region. Verify current integrations in the docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Account health \/ resource collection health views<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Provides health summaries at the configured scope.<\/li>\n<li><strong>Why it matters<\/strong>: Gives operators a quick \u201care we OK?\u201d view.<\/li>\n<li><strong>Practical benefit<\/strong>: Operations teams can prioritize attention.<\/li>\n<li><strong>Limitations\/caveats<\/strong>:<\/li>\n<li>Health views complement SLO dashboards; they do not replace them.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level:\n1. You <strong>enable<\/strong> Amazon DevOps Guru and define the <strong>resources to monitor<\/strong> (resource collections).\n2. 
DevOps Guru <strong>consumes operational signals<\/strong> (commonly CloudWatch metrics; optional integrations may add more context).\n3. ML models <strong>learn baselines<\/strong> and detect <strong>anomalies<\/strong>.\n4. DevOps Guru correlates anomalies into <strong>insights<\/strong> and attaches <strong>recommendations<\/strong>.\n5. Insights are shown in the <strong>DevOps Guru console<\/strong> and can be pushed via <strong>notification channels<\/strong> (commonly SNS).\n6. Your team (and\/or automation) uses the insight to remediate, verify, and close the incident loop.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow (practical view)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Data plane<\/strong>: Telemetry from AWS services (metrics and possibly additional signals) is analyzed by DevOps Guru.<\/li>\n<li><strong>Control plane<\/strong>: You configure the monitored scope and notifications. IAM controls who can read insights and modify configuration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services (typical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Amazon CloudWatch<\/strong>: Metrics (and your existing alarms\/dashboards).<\/li>\n<li><strong>Amazon SNS<\/strong>: Insight notifications and routing.<\/li>\n<li><strong>AWS Chatbot<\/strong> (optional): Forward SNS notifications to Slack, Microsoft Teams, or Amazon Chime.<\/li>\n<li><strong>AWS Lambda \/ Amazon EventBridge<\/strong> (optional): Automation on insights (for example, creating tickets).<\/li>\n<li><strong>AWS Systems Manager<\/strong> (optional, where supported): Operational workflows (for example OpsCenter).<\/li>\n<li><strong>AWS Organizations<\/strong> (optional, verify support): Multi-account enablement\/visibility patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services (what you should plan for)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will usually need:\n&#8211; A consistent grouping mechanism 
(CloudFormation or tags)\n&#8211; CloudWatch metric coverage for key components (which most AWS managed services publish by default)\n&#8211; An SNS topic and subscriptions for notifications (email\/ChatOps\/automation)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>IAM<\/strong> controls:<\/li>\n<li>Who can enable\/configure DevOps Guru<\/li>\n<li>Who can view insights and recommendations<\/li>\n<li>Who can manage notification channels<\/li>\n<li>Use least privilege and separate \u201coperators who view\u201d from \u201cadmins who configure.\u201d<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru is an AWS managed service; you interact with it through:\n&#8211; AWS Management Console\n&#8211; AWS CLI \/ SDK (where available)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">No special VPC networking is typically required to <em>use<\/em> the service, but your notification consumers (webhooks, endpoints) may require network planning.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>AWS CloudTrail<\/strong> records DevOps Guru API actions (enablement\/config changes), so you can audit them.<\/li>\n<li><strong>Tagging governance<\/strong> is critical if you use tag-based resource collections.<\/li>\n<li><strong>Operational ownership<\/strong>: Define who triages insights and how they map to incident processes.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  A[AWS Resources&lt;br\/&gt;EC2, RDS, Lambda, ASG, etc.] 
--&gt; B[\"CloudWatch Metrics&lt;br\/&gt;(and other supported signals)\"]\n  B --&gt; C[Amazon DevOps Guru&lt;br\/&gt;ML anomaly detection]\n  C --&gt; D[Insights + Recommendations]\n  D --&gt; E[DevOps Guru Console]\n  D --&gt; F[Amazon SNS Topic]\n  F --&gt; G[Email \/ SMS \/ HTTP Subscribers]\n  F --&gt; H[\"AWS Chatbot -&gt; Slack\/Chime\"]\n  F --&gt; I[\"Automation (Lambda\/EventBridge)\"]\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph Org[AWS Organization \/ Multi-Account Landing Zone]\n    subgraph ProdAcct[Production Account]\n      App1[App Resources&lt;br\/&gt;CloudFormation stacks \/ tags]\n      CW1[CloudWatch metrics\/logs]\n    end\n    subgraph SharedOps[Operations \/ Tooling Account]\n      SNS[(SNS Topics)]\n      Chat[ChatOps&lt;br\/&gt;AWS Chatbot]\n      Ticket[\"ITSM \/ Ticketing&lt;br\/&gt;(via Lambda\/Webhook)\"]\n      Runbook[\"Systems Manager Automation&lt;br\/&gt;(optional)\"]\n    end\n  end\n\n  App1 --&gt; CW1 --&gt; Guru[Amazon DevOps Guru&lt;br\/&gt;Regional analysis]\n  Guru --&gt; Insights[Insights + Recommendations]\n  Insights --&gt; SNS\n  SNS --&gt; Chat\n  SNS --&gt; Ticket\n  SNS --&gt; Runbook\n  Insights --&gt; Console[DevOps Guru Console&lt;br\/&gt;Ops visibility]\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Notes:\n&#8211; Multi-account patterns vary. Confirm the recommended setup for AWS Organizations in the official docs.\n&#8211; Keep notification routing centralized where it helps operations, but maintain clear app ownership.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. 
Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An active <strong>AWS account<\/strong> with permissions to enable Amazon DevOps Guru.<\/li>\n<li>If you use a multi-account setup, ensure your governance model supports enabling services per account\/Region (and verify any AWS Organizations support).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At minimum you need:\n&#8211; Permissions to <strong>enable\/configure DevOps Guru<\/strong>, manage resource collections, and manage notification channels.\n&#8211; Permissions to create and manage:\n  &#8211; Amazon SNS topics and subscriptions\n  &#8211; (Optional) AWS Chatbot configuration\n  &#8211; (Optional) CloudFormation stacks used in the lab<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Best practice<\/strong>: Use a dedicated role for DevOps Guru administration and a separate read-only role for operators who only view insights.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A valid payment method attached to the AWS account.<\/li>\n<li>Cost visibility enabled (AWS Cost Explorer recommended).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">CLI\/SDK\/tools needed<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For this tutorial:\n&#8211; AWS Management Console access\n&#8211; (Optional) <strong>AWS CLI v2<\/strong> configured:\n  &#8211; <code>aws configure<\/code> with an IAM user\/role that has the required permissions\n&#8211; (Optional) <code>curl<\/code> or a simple load tool for generating requests to a sample endpoint<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Amazon DevOps Guru is not necessarily available in every AWS Region.<\/li>\n<li>Choose a Region where DevOps Guru is available and where you can deploy the tutorial 
resources.<\/li>\n<li>Verify current Region availability in official docs: https:\/\/docs.aws.amazon.com\/devops-guru\/latest\/userguide\/what-is-devops-guru.html (and the \u201cRegions\u201d section linked from there).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Amazon DevOps Guru has service quotas (for example number of monitored resources or resource collections).<\/li>\n<li>Check <strong>Service Quotas<\/strong> in the AWS Console for <strong>Amazon DevOps Guru<\/strong> and request increases if needed:<\/li>\n<li>AWS Console \u2192 Service Quotas \u2192 Amazon DevOps Guru<\/li>\n<li>Quotas change over time; verify current values in your account.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Amazon CloudWatch<\/strong> (metrics exist by default for many AWS resources)<\/li>\n<li><strong>Amazon SNS<\/strong> (for notifications in this lab)<\/li>\n<li><strong>AWS CloudFormation<\/strong> (we\u2019ll deploy a small sample stack)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru pricing is <strong>usage-based<\/strong> and can vary by Region. 
Do not estimate cost using assumptions\u2014use the official pricing page and, for production, validate with the AWS Pricing Calculator.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Official pricing: https:\/\/aws.amazon.com\/devops-guru\/pricing\/<\/li>\n<li>AWS Pricing Calculator: https:\/\/calculator.aws\/#\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (how you get billed)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru pricing is typically based on factors such as:\n&#8211; <strong>Scope of monitoring<\/strong> (the number of resources \/ signals analyzed in your resource collections)\n&#8211; <strong>Optional integrations<\/strong> (for example, if you enable additional supported integrations such as database performance analysis, billing may include those dimensions)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Exact billing units and definitions (for example, resource-hours or instance-hours) can change; <strong>verify current billing dimensions on the official pricing page<\/strong> for your Region.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier \/ trial<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS has historically offered trials for some services and new accounts, but availability changes. 
<strong>Verify the current Free Tier\/trial terms on the DevOps Guru pricing page<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Primary cost drivers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Number of monitored resources<\/strong>: More resources in resource collections typically increase the analysis scope, and with it the charge.<\/li>\n<li><strong>High-churn environments<\/strong>: Constant creation\/deletion of resources can increase monitoring complexity (and can increase indirect costs in your environment).<\/li>\n<li><strong>Optional integrations<\/strong>: Database or tracing integrations may affect cost depending on how they are priced and the volume analyzed.<\/li>\n<li><strong>Multi-Region monitoring<\/strong>: DevOps Guru is enabled per Region, so cost grows with the number of Regions you enable and the resources monitored in each.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs (commonly overlooked)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Even if DevOps Guru cost is manageable, you may incur indirect costs from connected services:\n&#8211; <strong>CloudWatch Logs ingestion and retention<\/strong> (if you increase logging to improve observability)\n&#8211; <strong>AWS X-Ray tracing<\/strong> (if you enable additional tracing)\n&#8211; <strong>SNS deliveries<\/strong> (small cost, but can grow with very high notification volumes)\n&#8211; <strong>Automation costs<\/strong> (Lambda invocations, EventBridge rules, Systems Manager Automation runs)\n&#8211; <strong>Data transfer<\/strong> if you forward notifications to external endpoints (varies by path)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>DevOps Guru itself is a managed AWS service; you don\u2019t pay \u201cnetwork charges\u201d to send internal telemetry to it in the same way you might for self-managed collectors.<\/li>\n<li>If you route notifications to external systems (webhooks) or cross-Region endpoints, normal AWS data transfer charges
can apply.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Start small<\/strong>: Enable DevOps Guru for one or two critical applications first.<\/li>\n<li><strong>Use tight resource collections<\/strong>: Avoid sweeping in unrelated shared resources unless you really want them correlated.<\/li>\n<li><strong>Use consistent tags<\/strong>: Prevent accidental inclusion of temporary\/dev resources in production monitoring.<\/li>\n<li><strong>Tune notification routing<\/strong>: Don\u2019t fan-out every insight to expensive downstream tooling unless needed.<\/li>\n<li><strong>Review coverage periodically<\/strong>: Remove retired stacks, old environments, and unused Regions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (no fabricated prices)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A low-cost starter approach typically looks like:\n&#8211; 1 Region\n&#8211; 1 small application resource collection (for example, a handful of resources)\n&#8211; SNS email notifications only\n&#8211; No optional integrations beyond the default metrics analysis<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To estimate accurately:\n1. List the resources you will monitor (by stack or tags).\n2. Use the <strong>DevOps Guru pricing page<\/strong> for your Region to understand billing units.\n3. 
Use the <strong>AWS Pricing Calculator<\/strong> to model scenarios.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For production, plan for:\n&#8211; Multiple applications\/resource collections\n&#8211; Potential multi-account coverage (if your org structure requires it)\n&#8211; Multiple Regions for global services\n&#8211; Additional signal sources (logs\/traces) that may increase <em>indirect<\/em> costs even if DevOps Guru\u2019s own pricing is stable\n&#8211; Budget alerts:\n  &#8211; Use AWS Budgets and Cost Anomaly Detection for financial guardrails<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This lab focuses on enabling Amazon DevOps Guru safely, creating a resource collection around a sample application deployed with CloudFormation, and configuring notifications. Generating ML-based insights can take time because baselines often require sufficient telemetry history; the lab includes a simple method to create elevated error\/throttle signals, but you should treat \u201can insight appears\u201d as a best-effort validation rather than guaranteed within minutes.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Deploy a small sample workload with AWS CloudFormation.<\/li>\n<li>Enable <strong>Amazon DevOps Guru<\/strong> in a chosen AWS Region.<\/li>\n<li>Create a <strong>resource collection<\/strong> for the CloudFormation stack.<\/li>\n<li>Configure an <strong>Amazon SNS<\/strong> notification channel for insights.<\/li>\n<li>Generate some load and errors to produce meaningful signals.<\/li>\n<li>Validate configuration and learn how to troubleshoot.<\/li>\n<li>Clean up to avoid ongoing charges.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You 
will create:\n&#8211; A CloudFormation stack containing:\n  &#8211; An AWS Lambda function\n  &#8211; An IAM execution role for the function\n  &#8211; A Lambda Function URL that exposes the function over HTTPS (used here instead of Amazon API Gateway to reduce dependencies)\n&#8211; A CloudWatch log group (created automatically by Lambda on first invoke, so it is not part of the stack)\n&#8211; An SNS topic + email subscription\n&#8211; A DevOps Guru resource collection that monitors the stack<\/p>\n\n\n\n<blockquote>\n<p>Note: API Gateway introduces additional moving parts and permissions. To keep the lab simpler and low-risk, we\u2019ll use a <strong>Lambda Function URL<\/strong>. If your organization restricts function URLs, you can adapt this to API Gateway.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Choose a Region and set up tools<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Pick an AWS Region where <strong>Amazon DevOps Guru<\/strong> is available (for example, <code>us-east-1<\/code> or <code>us-west-2<\/code>, but verify availability first).<\/li>\n<li>Ensure you have permissions for:\n   &#8211; CloudFormation create\/update\/delete stacks\n   &#8211; Lambda create\/update\/delete\n   &#8211; IAM create\/delete roles (the template creates the Lambda execution role)\n   &#8211; SNS create topics and subscriptions\n   &#8211; DevOps Guru enable\/configure and manage notification channels<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Optional CLI setup<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">aws --version\naws configure set region us-east-1\naws sts get-caller-identity\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; You know which Region you\u2019ll use.\n&#8211; Your identity is confirmed via STS.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Deploy a sample workload with CloudFormation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create a local file named
<code>devopsguru-lab.yaml<\/code> with the following template:<\/p>\n\n\n\n<pre><code class=\"language-yaml\">AWSTemplateFormatVersion: '2010-09-09'\nDescription: DevOps Guru Lab - Lambda function with Function URL and controlled error behavior\n\nParameters:\n  ErrorRatePercent:\n    Type: Number\n    Default: 20\n    MinValue: 0\n    MaxValue: 100\n    Description: Percentage of requests that intentionally fail (0-100)\n\nResources:\n  LabFunctionRole:\n    Type: AWS::IAM::Role\n    Properties:\n      AssumeRolePolicyDocument:\n        Version: '2012-10-17'\n        Statement:\n          - Effect: Allow\n            Principal:\n              Service:\n                - lambda.amazonaws.com\n            Action:\n              - sts:AssumeRole\n      ManagedPolicyArns:\n        # Basic execution writes to CloudWatch Logs\n        - arn:aws:iam::aws:policy\/service-role\/AWSLambdaBasicExecutionRole\n\n  LabFunction:\n    Type: AWS::Lambda::Function\n    Properties:\n      FunctionName: !Sub devopsguru-lab-fn-${AWS::StackName}\n      Runtime: python3.12\n      Handler: index.handler\n      Role: !GetAtt LabFunctionRole.Arn\n      Timeout: 5\n      MemorySize: 128\n      Environment:\n        Variables:\n          ERROR_RATE_PERCENT: !Ref ErrorRatePercent\n      Code:\n        ZipFile: |\n          import os, json, random, time\n\n          def handler(event, context):\n              # small jitter to create latency variation\n              time.sleep(random.random() * 0.1)\n\n              rate = int(os.environ.get(\"ERROR_RATE_PERCENT\", \"0\"))\n              if random.randint(1, 100) &lt;= rate:\n                  # Return a 500-like response shape\n                  return {\n                      \"statusCode\": 500,\n                      \"headers\": {\"content-type\": \"application\/json\"},\n                      \"body\": json.dumps({\"ok\": False, \"error\": \"Intentional error for lab\"})\n                  }\n\n              return {\n                  
\"statusCode\": 200,\n                  \"headers\": {\"content-type\": \"application\/json\"},\n                  \"body\": json.dumps({\"ok\": True, \"message\": \"Hello from DevOps Guru lab\"})\n              }\n\n  LabFunctionUrl:\n    Type: AWS::Lambda::Url\n    Properties:\n      TargetFunctionArn: !Ref LabFunction\n      AuthType: NONE\n\n  LabFunctionUrlPermission:\n    Type: AWS::Lambda::Permission\n    Properties:\n      FunctionName: !Ref LabFunction\n      Action: lambda:InvokeFunctionUrl\n      Principal: \"*\"\n      FunctionUrlAuthType: NONE\n\nOutputs:\n  FunctionName:\n    Value: !Ref LabFunction\n  FunctionUrl:\n    Value: !GetAtt LabFunctionUrl.FunctionUrl\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Deploy it:<\/p>\n\n\n\n<pre><code class=\"language-bash\">aws cloudformation deploy \\\n  --stack-name devopsguru-lab-stack \\\n  --template-file devopsguru-lab.yaml \\\n  --capabilities CAPABILITY_NAMED_IAM \\\n  --parameter-overrides ErrorRatePercent=20\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Fetch the Function URL:<\/p>\n\n\n\n<pre><code class=\"language-bash\">aws cloudformation describe-stacks \\\n  --stack-name devopsguru-lab-stack \\\n  --query \"Stacks[0].Outputs[?OutputKey=='FunctionUrl'].OutputValue\" \\\n  --output text\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; CloudFormation stack status is <code>CREATE_COMPLETE<\/code>.\n&#8211; You have a public HTTPS Function URL you can test.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Quick test<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">FUNC_URL=\"$(aws cloudformation describe-stacks --stack-name devopsguru-lab-stack --query \"Stacks[0].Outputs[?OutputKey=='FunctionUrl'].OutputValue\" --output text)\"\ncurl -sS \"$FUNC_URL\" | head\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You should see JSON responses; some will be <code>{\"ok\": false, ...}<\/code> due to the configured error 
rate.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Generate baseline traffic and CloudWatch signals<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To give DevOps Guru something to analyze, generate traffic for 10\u201315 minutes. A simple approach:<\/p>\n\n\n\n<pre><code class=\"language-bash\">FUNC_URL=\"$(aws cloudformation describe-stacks --stack-name devopsguru-lab-stack --query \"Stacks[0].Outputs[?OutputKey=='FunctionUrl'].OutputValue\" --output text)\"\n\n# 1500 requests spaced about 0.5 s apart is roughly 12 minutes of traffic\nfor i in $(seq 1 1500); do\n  curl -s -o \/dev\/null -w \"%{http_code}\\n\" \"$FUNC_URL\" &amp;\n  # every 20th request, wait for the backgrounded curl calls to finish\n  if (( i % 20 == 0 )); then wait; fi\n  sleep 0.5\ndone\nwait\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; You generate a mix of 200 and 500 responses.\n&#8211; Lambda metrics (Invocations, Duration) begin to show activity in CloudWatch. The Lambda <code>Errors<\/code> metric stays flat, because the handler returns an HTTP 500 response instead of raising an exception; the intentional failures are visible as 500 status codes in the <code>curl<\/code> output.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; AWS Console \u2192 CloudWatch \u2192 Metrics \u2192 Lambda \u2192 view metrics for your function.\n&#8211; AWS Console \u2192 CloudWatch \u2192 Logs \u2192 log group for your Lambda function exists after invocation.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Enable Amazon DevOps Guru<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>AWS Console \u2192 Search for <strong>DevOps Guru<\/strong> \u2192 open <strong>Amazon DevOps Guru<\/strong>.<\/li>\n<li>If this is your first time:\n   &#8211; Choose <strong>Get started<\/strong> \/ <strong>Enable DevOps Guru<\/strong> (wording may differ by console updates).<\/li>\n<li>Select a scope:\n   &#8211; Prefer <strong>Application \/ resource collection<\/strong> monitoring rather than \u201ceverything\u201d (safer and more cost-controlled for labs).<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; DevOps Guru is enabled in
your chosen Region.\n&#8211; You can create or manage resource collections.<\/p>\n\n\n\n<blockquote>\n<p>If you do not see enablement options, verify IAM permissions and Region availability.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Create a resource collection for the CloudFormation stack<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the DevOps Guru console, find <strong>Resource collections<\/strong> (or similar navigation).<\/li>\n<li>Create a resource collection using <strong>CloudFormation<\/strong> (if available in your console):\n   &#8211; Choose the stack: <code>devopsguru-lab-stack<\/code>\n   &#8211; Name the resource collection: <code>devopsguru-lab-collection<\/code><\/li>\n<li>Save\/confirm.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; The collection is created.\n&#8211; DevOps Guru begins monitoring the resources in that stack.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; In DevOps Guru, locate the resource collection health view (often \u201cResource collection health\u201d).\n&#8211; You should see your collection listed.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Configure an SNS notification channel for DevOps Guru<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create an SNS topic and email subscription:<\/p>\n\n\n\n<pre><code class=\"language-bash\"># create-topic is idempotent and returns the topic ARN, so capture it directly\nTOPIC_ARN=\"$(aws sns create-topic --name devopsguru-lab-insights --query TopicArn --output text)\"\necho \"$TOPIC_ARN\"\n\naws sns subscribe \\\n  --topic-arn \"$TOPIC_ARN\" \\\n  --protocol email \\\n  --notification-endpoint you@example.com\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Confirm the subscription from the email you receive (SNS requires confirmation).<\/li>\n<li>In the DevOps
Guru console:\n   &#8211; Go to <strong>Settings<\/strong> \/ <strong>Notifications<\/strong> (exact location may vary).\n   &#8211; Add an <strong>SNS topic<\/strong> notification channel using the ARN.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; Your email subscription is confirmed.\n&#8211; DevOps Guru is configured to publish insight notifications to your SNS topic.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; AWS Console \u2192 SNS \u2192 Topic \u2192 Subscriptions shows <code>Confirmed<\/code>.\n&#8211; DevOps Guru notification channels list includes your SNS topic.<\/p>\n\n\n\n<blockquote>\n<p>If you prefer ChatOps, you can connect SNS to Slack via AWS Chatbot, but that adds setup steps and permissions.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: (Optional) Increase anomaly likelihood by changing error rate<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Baselines can take time. 
To create a more obvious change, update the stack to increase the intentional error rate (for example from 20% to 70%), then generate traffic again.<\/p>\n\n\n\n<pre><code class=\"language-bash\">aws cloudformation deploy \\\n  --stack-name devopsguru-lab-stack \\\n  --template-file devopsguru-lab.yaml \\\n  --capabilities CAPABILITY_NAMED_IAM \\\n  --parameter-overrides ErrorRatePercent=70\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Generate traffic again for 10\u201315 minutes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome<\/strong>\n&#8211; The share of HTTP 500 responses rises to roughly 70% of requests. The CloudWatch Lambda <code>Errors<\/code> metric counts failed invocations (unhandled exceptions, timeouts), so it does not rise while the handler returns a 500 response; to drive <code>Errors<\/code>, change the handler to raise an exception instead.\n&#8211; DevOps Guru may eventually surface an insight (time varies; not guaranteed in a short lab window).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use this checklist:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Resource collection exists and is healthy<\/strong>\n   &#8211; DevOps Guru console shows the collection and monitored resources.<\/p>\n<\/li>\n<li>\n<p><strong>CloudWatch telemetry exists<\/strong>\n   &#8211; CloudWatch shows Invocations and Duration data points for the Lambda function.<\/p>\n<\/li>\n<li>\n<p><strong>Notification channel is configured<\/strong>\n   &#8211; SNS topic and confirmed subscription exist.\n   &#8211; DevOps Guru notifications include the topic.<\/p>\n<\/li>\n<li>\n<p><strong>Insights (best-effort)<\/strong>\n   &#8211; DevOps Guru console \u2192 Insights: check for new insights.\n   &#8211; If an insight appears, confirm it lists your Lambda function and relevant anomalies.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to explore via CLI, recent AWS CLI releases include a <code>devops-guru<\/code> command group.
Run:<\/p>\n\n\n\n<pre><code class=\"language-bash\"># list-insights requires a status filter; this one lists ongoing reactive insights\naws devops-guru list-insights \\\n  --status-filter '{\"Ongoing\":{\"Type\":\"REACTIVE\"}}' \\\n  --max-results 5\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If this command is not available or returns an error, update the AWS CLI or verify IAM permissions and service availability in your Region.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: DevOps Guru is not available in my Region<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Switch to a Region where DevOps Guru is available.<\/li>\n<li>Check the official docs for Region availability:<\/li>\n<li>https:\/\/docs.aws.amazon.com\/devops-guru\/latest\/userguide\/what-is-devops-guru.html<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: I can\u2019t enable DevOps Guru (access denied)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure your identity has permissions for DevOps Guru actions and any required linked services.<\/li>\n<li>Use the IAM policy simulator to confirm.<\/li>\n<li>Check AWS Organizations SCPs if applicable.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: No insights appear<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">This can be normal in short labs.
Common reasons:\n&#8211; Not enough historical data\/baseline\n&#8211; Traffic volume too low or too inconsistent\n&#8211; Error\/latency changes not statistically significant\n&#8211; Resource collection does not include the intended resources<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">What to do:\n&#8211; Run traffic longer (30\u2013120 minutes).\n&#8211; Increase the error-rate shift (20% \u2192 70%) and keep traffic steady.\n&#8211; Confirm the resource collection includes the Lambda function.\n&#8211; Confirm CloudWatch metrics are present and updating.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: SNS email subscription never confirms<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Check spam\/junk.<\/li>\n<li>Ensure you used the correct email address.<\/li>\n<li>Recreate the subscription if needed.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Issue: Function URL blocked by security policy<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use API Gateway with IAM auth, or follow your organization\u2019s approved pattern for internal endpoints.<\/li>\n<li>Or invoke the Lambda via AWS CLI <code>aws lambda invoke<\/code> from a trusted network to generate metrics.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To avoid ongoing charges:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Remove the DevOps Guru notification channel (optional but clean).<\/li>\n<li>Delete the SNS topic (this deletes subscriptions too):<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">aws sns delete-topic --topic-arn \"$TOPIC_ARN\"\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li>Delete the CloudFormation stack:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">aws cloudformation delete-stack --stack-name devopsguru-lab-stack\naws cloudformation wait stack-delete-complete --stack-name devopsguru-lab-stack\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"4\">\n<li>If you
enabled DevOps Guru only for this lab, disable it or remove the resource collection (console workflow depends on current UI). Ensure you understand billing implications of leaving it enabled.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Define application boundaries<\/strong>:<\/li>\n<li>Prefer CloudFormation stacks per app\/service or consistent tags like:<ul>\n<li><code>App=payments<\/code>, <code>Env=prod<\/code>, <code>Owner=team-a<\/code><\/li>\n<\/ul>\n<\/li>\n<li><strong>Monitor what you own<\/strong>:<\/li>\n<li>Include shared components carefully; define ownership and escalation paths.<\/li>\n<li><strong>Layered observability<\/strong>:<\/li>\n<li>Use DevOps Guru for anomaly\/insight detection.<\/li>\n<li>Use CloudWatch dashboards and SLO tooling for ongoing health tracking.<\/li>\n<li>Use alarms for hard limits and paging.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Least privilege<\/strong>:<\/li>\n<li>Separate roles for:<ul>\n<li>DevOps Guru configuration (admin)<\/li>\n<li>Insight viewing (read-only)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Use AWS Organizations guardrails<\/strong>:<\/li>\n<li>If you operate multi-account, align with SCPs and delegated admin patterns (verify current service support).<\/li>\n<li><strong>Audit configuration changes<\/strong>:<\/li>\n<li>Ensure CloudTrail is enabled and logs are centrally retained.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Start with critical apps<\/strong> and expand based on value.<\/li>\n<li><strong>Avoid blanket monitoring<\/strong> in early stages.<\/li>\n<li><strong>Review resource collections quarterly<\/strong> to remove obsolete 
stacks and environments.<\/li>\n<li><strong>Use budgets<\/strong>:<\/li>\n<li>AWS Budgets for service spend<\/li>\n<li>Cost Anomaly Detection for unexpected changes<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Improve telemetry quality<\/strong>:<\/li>\n<li>Good metrics and consistent naming\/tags make insights more actionable.<\/li>\n<li><strong>Make deployments observable<\/strong>:<\/li>\n<li>Emit deployment markers (where possible) and maintain change logs to correlate with anomalies.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Tie insights to incident response<\/strong>:<\/li>\n<li>Define runbooks and ownership for top insight types.<\/li>\n<li><strong>Use game days<\/strong>:<\/li>\n<li>Intentionally introduce controlled faults and validate whether insights and notifications are useful.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Route notifications with context<\/strong>:<\/li>\n<li>Include account, Region, app name, and severity in downstream messages\/tickets.<\/li>\n<li><strong>Establish an insight triage process<\/strong>:<\/li>\n<li>Who acknowledges?<\/li>\n<li>What is the SLA for investigation?<\/li>\n<li>How do you suppress\/handle known benign patterns?<\/li>\n<li><strong>Integrate with ticketing<\/strong>:<\/li>\n<li>Use SNS \u2192 Lambda to create tickets and attach insight details.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Tag standards<\/strong>:<\/li>\n<li><code>App<\/code>, <code>Service<\/code>, <code>Env<\/code>, <code>Owner<\/code>, <code>CostCenter<\/code>, <code>DataClassification<\/code><\/li>\n<li><strong>Naming conventions<\/strong>:<\/li>\n<li>Stack names and 
resource names should be consistent and human-parsable.<\/li>\n<li><strong>Policy enforcement<\/strong>:<\/li>\n<li>Use IaC checks (cfn-lint, policy-as-code) and tag policies (AWS Organizations) where appropriate.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Controlled by <strong>AWS IAM<\/strong>.<\/li>\n<li>Typical actions to control:<\/li>\n<li>Enable\/disable DevOps Guru<\/li>\n<li>Create\/update resource collections<\/li>\n<li>Manage notification channels<\/li>\n<li>Read insights and recommendations<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Recommendations<\/strong>\n&#8211; Grant write permissions only to a small platform\/admin group.\n&#8211; Provide read-only access to on-call engineers and application owners.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data in transit to AWS services uses TLS.<\/li>\n<li>For encryption at rest specifics (including any customer-managed key options), <strong>verify in official docs<\/strong>, as capabilities can vary by service and evolve over time.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>DevOps Guru is accessed via AWS public service endpoints (like most AWS control-plane services).<\/li>\n<li>Lock down access to the console and API with:<\/li>\n<li>IAM policies<\/li>\n<li>MFA<\/li>\n<li>Conditional access (source IP, VPC endpoints where applicable; verify if DevOps Guru supports specific endpoint types in your region)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">DevOps Guru itself is not a secrets store. 
If you automate remediation:\n&#8211; Store secrets in <strong>AWS Secrets Manager<\/strong> or <strong>SSM Parameter Store<\/strong> (SecureString).\n&#8211; Do not embed secrets in Lambda environment variables without encryption and rotation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure <strong>AWS CloudTrail<\/strong> is enabled for management events.<\/li>\n<li>Centralize logs in a security\/log archive account if you use multi-account.<\/li>\n<li>Track changes to:<\/li>\n<li>resource collections<\/li>\n<li>notification channels<\/li>\n<li>service integrations<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Align with your compliance needs:<\/li>\n<li>Data residency: enable only in approved Regions<\/li>\n<li>Access controls: least privilege and segregation of duties<\/li>\n<li>Retention: CloudTrail log retention and SIEM forwarding<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Over-broad IAM permissions (for example, allowing everyone to modify notification channels)<\/li>\n<li>Routing SNS notifications to untrusted endpoints without validation<\/li>\n<li>Including sensitive environment resources in monitoring without access governance (insight data may include resource identifiers and context)<\/li>\n<li>Failing to log and review configuration changes<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use infrastructure-as-code (IaC) for SNS topics, subscriptions, and automation.<\/li>\n<li>Use KMS encryption for SNS topics (supported by SNS) and enforce encryption where required.<\/li>\n<li>Use least privilege for automation consumers (Lambda that creates tickets should not have admin permissions).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" 
\/>\n\n\n\n<h2 class=\"wp-block-heading\">13. Limitations and Gotchas<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because service behavior and support matrices evolve, treat this list as guidance and confirm details in official docs and Service Quotas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Known limitations \/ realities in practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Insights are not instantaneous<\/strong>: ML baselines and correlation can require time and sufficient telemetry history.<\/li>\n<li><strong>Not a full observability suite<\/strong>: DevOps Guru does not replace log search, tracing analysis tools, or metric dashboards.<\/li>\n<li><strong>Resource grouping quality matters<\/strong>: Poor tagging\/stack boundaries lead to noisy or less actionable insights.<\/li>\n<li><strong>Low-traffic apps may not benefit<\/strong>: Without stable patterns, anomaly detection can be less effective.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Limits can exist for:<\/li>\n<li>number of resource collections<\/li>\n<li>number of monitored resources<\/li>\n<li>notification channels<\/li>\n<li>Check <strong>Service Quotas \u2192 Amazon DevOps Guru<\/strong> for current values.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Regional constraints<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Service availability varies by Region.<\/li>\n<li>If you run multi-Region workloads, you may need to enable DevOps Guru in multiple Regions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing surprises<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Broad resource collections can increase analysis scope and cost.<\/li>\n<li>Indirect costs from enabling more logs\/traces are easy to underestimate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compatibility issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Not all AWS services are equally represented in DevOps Guru 
analysis.<\/li>\n<li>Optional integrations (for example tracing or database performance) may require additional enablement and may not be available everywhere.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational gotchas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Notification routing<\/strong>: SNS is flexible, but if you don\u2019t standardize message handling, responders may ignore insights.<\/li>\n<li><strong>Ownership confusion<\/strong>: Shared resources included in multiple collections can lead to unclear on-call responsibility.<\/li>\n<li><strong>Change correlation<\/strong>: DevOps Guru is not a full change management system; keep deployment\/change logs elsewhere and link them during incidents.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Migration challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If you currently use a third-party AIOps platform, you\u2019ll need to decide:<\/li>\n<li>Which signals remain in that platform<\/li>\n<li>Which alerts are replaced by DevOps Guru insights<\/li>\n<li>How to prevent duplicate paging<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Vendor-specific nuances<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS-native tools integrate well, but you must still design your operational process (incident response, runbooks, postmortems). DevOps Guru provides insights, not process.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. 
Comparison with Alternatives<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru sits between raw observability (metrics\/logs\/traces) and full AIOps platforms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Nearest services in AWS<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Amazon CloudWatch<\/strong>: metrics\/logs\/alarms\/dashboards; deterministic alerting and visualization.<\/li>\n<li><strong>CloudWatch Anomaly Detection<\/strong>: anomaly detection on individual metrics (more metric-specific; less \u201cinsight narrative\u201d).<\/li>\n<li><strong>AWS Compute Optimizer<\/strong>: rightsizing and resource optimization recommendations (cost\/perf), not incident detection.<\/li>\n<li><strong>AWS Trusted Advisor<\/strong>: best-practice checks (cost, security, fault tolerance), not real-time anomaly insights.<\/li>\n<li><strong>AWS Health<\/strong>: AWS service events and account-specific advisories; not your app telemetry analysis.<\/li>\n<li><strong>AWS X-Ray<\/strong>: tracing; deep request-level performance analysis, not cross-service anomaly \u201cinsights\u201d by itself.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Nearest services in other clouds<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Advisor \/ Azure Monitor<\/strong>: recommendations + monitoring; anomaly capabilities exist in Azure Monitor, but operational model differs.<\/li>\n<li><strong>Google Cloud Operations suite<\/strong>: monitoring\/logging\/tracing with alerting and some intelligent features; different integration patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Open-source \/ self-managed alternatives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prometheus + Alertmanager + Grafana<\/strong>: strong metrics stack, but you build correlation and AIOps yourself.<\/li>\n<li><strong>OpenTelemetry + tracing backend<\/strong>: great telemetry foundation, but AIOps correlation is additional.<\/li>\n<li><strong>Elastic 
stack<\/strong>: strong log analytics; AIOps features vary by edition.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Amazon DevOps Guru<\/strong><\/td>\n<td>AWS-native anomaly detection + insights<\/td>\n<td>ML baselines, correlated insights, AWS integration, managed service<\/td>\n<td>Not a full observability platform; insights may require time and good telemetry<\/td>\n<td>You want AWS-native AIOps-style insights with low operational overhead<\/td>\n<\/tr>\n<tr>\n<td><strong>Amazon CloudWatch (metrics\/logs\/alarms)<\/strong><\/td>\n<td>Core AWS monitoring<\/td>\n<td>Deterministic alarms, dashboards, log storage\/queries, broad AWS support<\/td>\n<td>Alarm noise; correlation mostly manual<\/td>\n<td>You need foundational monitoring and paging; always used alongside DevOps Guru<\/td>\n<\/tr>\n<tr>\n<td><strong>CloudWatch Anomaly Detection<\/strong><\/td>\n<td>Single-metric anomaly alerts<\/td>\n<td>Good for specific metrics; integrates with alarms<\/td>\n<td>Less narrative correlation; metric-by-metric setup<\/td>\n<td>You want anomaly bands on key metrics and explicit alerting thresholds<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS X-Ray<\/strong><\/td>\n<td>Tracing and service maps<\/td>\n<td>Great for debugging latency and errors per request<\/td>\n<td>Not an AIOps insight engine<\/td>\n<td>You need deep request-level analysis and dependency tracing<\/td>\n<\/tr>\n<tr>\n<td><strong>Datadog \/ New Relic \/ Dynatrace<\/strong><\/td>\n<td>Cross-cloud observability + AIOps<\/td>\n<td>Strong correlation, dashboards, broad integrations, mature UX<\/td>\n<td>Licensing cost; agent management; vendor lock-in considerations<\/td>\n<td>You need multi-cloud visibility and rich app monitoring 
features<\/td>\n<\/tr>\n<tr>\n<td><strong>Prometheus + Grafana (self-managed)<\/strong><\/td>\n<td>Custom metrics + full control<\/td>\n<td>Flexible, open ecosystem<\/td>\n<td>You operate it; correlation and AIOps are DIY<\/td>\n<td>You want full control and have platform engineering capacity<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example (regulated, multi-team)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem<\/strong>\nA financial services company runs dozens of customer-facing services on AWS. Incidents often start as subtle latency regressions and escalate. On-call teams spend too long correlating CloudWatch alarms, dashboards, and recent changes.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Proposed architecture<\/strong>\n&#8211; Each product domain deploys via CloudFormation with mandatory tags:\n  &#8211; <code>App<\/code>, <code>Env<\/code>, <code>Owner<\/code>, <code>CostCenter<\/code>\n&#8211; Amazon DevOps Guru is enabled in production Regions and configured with:\n  &#8211; Resource collections per application (CloudFormation stacks + tag scoping)\n  &#8211; SNS notifications routed to:\n    &#8211; AWS Chatbot \u2192 Slack channel per domain\n    &#8211; Lambda \u2192 ITSM ticket creation with insight link\n&#8211; CloudTrail logs are centralized for audit\n&#8211; CloudWatch dashboards remain the primary \u201cSLO view,\u201d with DevOps Guru providing anomaly\/insight overlays<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why Amazon DevOps Guru was chosen<\/strong>\n&#8211; AWS-native approach aligned with the organization\u2019s security posture.\n&#8211; Reduced need to deploy\/operate third-party AIOps tooling in regulated environments.\n&#8211; Faster triage from correlated insights without replacing existing CloudWatch investments.<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Expected outcomes<\/strong>\n&#8211; Reduced MTTR via more focused triage\n&#8211; Fewer noisy pages by shifting some attention to insights rather than raw alarms\n&#8211; Better postmortems with consistent insight records and timelines<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example (lean ops)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem<\/strong>\nA small SaaS startup has one platform engineer supporting multiple services. They rely on basic CloudWatch alarms but still miss early signals. They need better detection without building a full observability platform.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Proposed architecture<\/strong>\n&#8211; Enable Amazon DevOps Guru for the production stack only.\n&#8211; Create one resource collection for the main CloudFormation stack.\n&#8211; SNS topic sends insight notifications to:\n  &#8211; Email distro for on-call\n  &#8211; Slack via AWS Chatbot\n&#8211; Keep CloudWatch alarms for paging on known thresholds (CPU saturation, queue depth, 5xx rate)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why Amazon DevOps Guru was chosen<\/strong>\n&#8211; Minimal operational overhead\n&#8211; Adds ML-based anomaly detection on top of existing AWS telemetry\n&#8211; Easy to start small and expand as the team grows<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcomes<\/strong>\n&#8211; Earlier detection of \u201cweird\u201d behavior\n&#8211; Less time correlating metrics during incidents\n&#8211; Improved reliability without a large tooling budget<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">1) Is Amazon DevOps Guru an observability platform?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">No. 
It\u2019s best viewed as an <strong>AIOps-style insight and recommendation layer<\/strong> that analyzes operational signals (commonly CloudWatch metrics) and emits insights. You still use CloudWatch, logs, and tracing tools for deep investigation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2) Does DevOps Guru replace CloudWatch alarms?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Typically no. Use CloudWatch alarms for deterministic paging and guardrails. Use DevOps Guru for anomaly detection, correlation, and triage acceleration.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3) Do I enable DevOps Guru for an entire account or per application?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You generally enable the service in an account\/Region and then configure <strong>resource collections<\/strong> to scope monitoring per application\/workload. Exact setup options can change; verify in the console\/docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4) How does DevOps Guru know what my \u201capplication\u201d is?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You define it using resource collections\u2014commonly via CloudFormation stacks and\/or tags\u2014so it understands which resources belong together.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5) How long does it take to start producing insights?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It depends. ML baselines and meaningful anomaly detection may require time and sufficient telemetry. For new\/low-traffic apps, insights may take longer or be less frequent.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6) Can I use DevOps Guru in dev\/test?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes, but benefits are typically higher in production where traffic patterns are stable enough to learn baselines. 
Dev\/test can still help validate observability and detect big regressions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7) How do notifications work?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">DevOps Guru can publish notifications to channels such as <strong>Amazon SNS<\/strong>. You can then route SNS messages to email, Slack (via AWS Chatbot), webhooks, or automation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">8) Can DevOps Guru open tickets automatically?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Not by itself in all cases, but you can implement it using SNS \u2192 Lambda (or, where available, an Amazon EventBridge rule that matches DevOps Guru insight events) to create tickets in your ITSM tool.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">9) Is DevOps Guru multi-account?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">There are patterns to operate across multiple AWS accounts (often via AWS Organizations), but exact feature support and recommended architectures can evolve. Verify current multi-account guidance in the official docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">10) Does DevOps Guru analyze logs?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">DevOps Guru primarily focuses on operational signals and may offer integrations that include additional context. Whether and how logs are used depends on current service integrations; verify in docs for your Region and services.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">11) Does DevOps Guru analyze traces?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It can integrate with tracing signals in some setups (for example AWS X-Ray), but integration availability depends on current features and Region. Verify in official docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">12) What\u2019s the difference between an anomaly and an insight?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">An <strong>anomaly<\/strong> is typically a detected unusual signal (for example, errors increased). 
An <strong>insight<\/strong> is a higher-level grouping\/correlation of anomalies with context and recommendations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">13) Can I control what resources are monitored?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes. Use resource collections (CloudFormation or tags) to control scope. Keep scope tight initially for cost and relevance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">14) How do I reduce noise?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use accurate app boundaries (tags\/stacks)<\/li>\n<li>Ensure telemetry is meaningful (avoid random noisy metrics)<\/li>\n<li>Route notifications to the right team and avoid blasting every channel<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">15) Is DevOps Guru suitable for compliance-heavy environments?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It can be, if you apply proper IAM controls, auditing (CloudTrail), Region selection, and governance. Always verify compliance requirements and AWS service attestations relevant to your industry.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">16) What skills do engineers need to use DevOps Guru effectively?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Basic AWS monitoring knowledge (CloudWatch metrics\/logs)<\/li>\n<li>Understanding of your application architecture and dependencies<\/li>\n<li>Incident response discipline (runbooks, ownership, escalation)<\/li>\n<li>Tagging and IaC hygiene<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">17) What\u2019s a good first application to onboard?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Choose a production application with:\n&#8211; Clear CloudFormation boundaries and\/or strong tagging\n&#8211; Known operational pain (frequent incidents)\n&#8211; Good CloudWatch metric coverage\nThis maximizes your chance of getting actionable insights quickly.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Amazon DevOps Guru<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Amazon DevOps Guru User Guide: https:\/\/docs.aws.amazon.com\/devops-guru\/latest\/userguide\/what-is-devops-guru.html<\/td>\n<td>Primary source for features, setup, integrations, and concepts<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Amazon DevOps Guru Pricing: https:\/\/aws.amazon.com\/devops-guru\/pricing\/<\/td>\n<td>Accurate pricing dimensions and Region-specific details<\/td>\n<\/tr>\n<tr>\n<td>Pricing tool<\/td>\n<td>AWS Pricing Calculator: https:\/\/calculator.aws\/#\/<\/td>\n<td>Model expected spend for your planned monitoring scope<\/td>\n<\/tr>\n<tr>\n<td>API reference<\/td>\n<td>Amazon DevOps Guru API Reference: https:\/\/docs.aws.amazon.com\/devops-guru\/latest\/APIReference\/Welcome.html<\/td>\n<td>SDK\/CLI automation and integration development<\/td>\n<\/tr>\n<tr>\n<td>AWS CLI reference<\/td>\n<td>AWS CLI Command Reference (search \u201cdevops-guru\u201d): https:\/\/docs.aws.amazon.com\/cli\/latest\/reference\/<\/td>\n<td>Practical automation and scripting for operations<\/td>\n<\/tr>\n<tr>\n<td>Architecture guidance<\/td>\n<td>AWS Architecture Center: https:\/\/aws.amazon.com\/architecture\/<\/td>\n<td>Patterns for ops, reliability, and governance used with DevOps Guru<\/td>\n<\/tr>\n<tr>\n<td>Reliability framework<\/td>\n<td>AWS Well-Architected Framework: https:\/\/docs.aws.amazon.com\/wellarchitected\/latest\/framework\/welcome.html<\/td>\n<td>Best practices that align with DevOps Guru recommendations<\/td>\n<\/tr>\n<tr>\n<td>Notifications<\/td>\n<td>Amazon SNS Developer Guide: https:\/\/docs.aws.amazon.com\/sns\/latest\/dg\/welcome.html<\/td>\n<td>Build robust notification fan-out and automation<\/td>\n<\/tr>\n<tr>\n<td>ChatOps<\/td>\n<td>AWS Chatbot docs: 
https:\/\/docs.aws.amazon.com\/chatbot\/latest\/adminguide\/what-is.html<\/td>\n<td>Send DevOps Guru notifications to Slack\/Chime in a controlled way<\/td>\n<\/tr>\n<tr>\n<td>Observability foundation<\/td>\n<td>Amazon CloudWatch docs: https:\/\/docs.aws.amazon.com\/AmazonCloudWatch\/latest\/monitoring\/WhatIsCloudWatch.html<\/td>\n<td>Understand the metrics\/logs foundation DevOps Guru builds on<\/td>\n<\/tr>\n<tr>\n<td>Tracing (optional)<\/td>\n<td>AWS X-Ray docs: https:\/\/docs.aws.amazon.com\/xray\/latest\/devguide\/aws-xray.html<\/td>\n<td>Add traces to improve incident investigation workflows<\/td>\n<\/tr>\n<tr>\n<td>Updates\/news<\/td>\n<td>AWS \u201cWhat\u2019s New\u201d (search DevOps Guru): https:\/\/aws.amazon.com\/new\/<\/td>\n<td>Track feature launches and changes over time<\/td>\n<\/tr>\n<tr>\n<td>Video learning<\/td>\n<td>AWS YouTube channel: https:\/\/www.youtube.com\/user\/AmazonWebServices<\/td>\n<td>Sessions, demos, and re:Invent talks (search DevOps Guru)<\/td>\n<\/tr>\n<tr>\n<td>Samples (general AWS)<\/td>\n<td>AWS Samples GitHub: https:\/\/github.com\/aws-samples<\/td>\n<td>Look for DevOps Guru-related examples; validate repository trust and recency<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. 
Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, SREs, cloud engineers<\/td>\n<td>AWS operations, DevOps tooling, monitoring\/AIOps concepts<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate DevOps learners<\/td>\n<td>DevOps fundamentals, CI\/CD, operational practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud operations teams, platform engineers<\/td>\n<td>Cloud operations practices, monitoring, reliability<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, reliability engineers, ops leads<\/td>\n<td>SRE principles, incident response, reliability engineering<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams adopting AIOps<\/td>\n<td>AIOps concepts, event correlation, ML-assisted operations<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/Cloud training content (verify specific offerings)<\/td>\n<td>Engineers seeking guided learning<\/td>\n<td>https:\/\/www.rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training and mentoring (verify course scope)<\/td>\n<td>Beginners to advanced DevOps practitioners<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>DevOps consulting\/training style offerings (verify services)<\/td>\n<td>Teams needing practical help<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and training resources (verify scope)<\/td>\n<td>Ops teams needing hands-on support<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company Name<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify detailed portfolio)<\/td>\n<td>Platform engineering, DevOps process, AWS operations<\/td>\n<td>DevOps Guru onboarding plan; SNS\/ChatOps integration; tagging strategy<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps consulting and training (verify service catalog)<\/td>\n<td>DevOps transformation, monitoring strategy, operational maturity<\/td>\n<td>Define resource collections; build incident workflows; optimize notifications<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify offerings and regions served)<\/td>\n<td>CI\/CD + operations alignment, tooling implementation<\/td>\n<td>Integrate DevOps Guru insights into ticketing\/ChatOps; governance and IAM reviews<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Amazon DevOps Guru<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To use DevOps Guru effectively, you should understand:\n&#8211; <strong>AWS fundamentals<\/strong>: IAM, Regions, VPC basics, CloudFormation basics\n&#8211; <strong>Observability basics<\/strong>:\n  &#8211; CloudWatch metrics, dimensions, alarms\n  &#8211; CloudWatch Logs and log retention\n&#8211; <strong>Operations basics<\/strong>:\n  &#8211; Incident response lifecycle\n  &#8211; Runbooks, postmortems, on-call practices\n&#8211; <strong>Tagging and governance<\/strong>:\n  &#8211; Tag strategy, ownership models, cost allocation<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after Amazon DevOps Guru<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Advanced observability<\/strong>:<\/li>\n<li>Distributed tracing (AWS X-Ray or OpenTelemetry)<\/li>\n<li>Service-level objectives (SLOs) and error budgets<\/li>\n<li><strong>Automation<\/strong>:<\/li>\n<li>SNS \u2192 Lambda \u2192 ticketing<\/li>\n<li>Systems Manager Automation runbooks<\/li>\n<li><strong>Reliability engineering<\/strong>:<\/li>\n<li>Chaos engineering\/game days<\/li>\n<li>Resilience testing patterns<\/li>\n<li><strong>Multi-account operations<\/strong>:<\/li>\n<li>Central logging, SIEM integration<\/li>\n<li>Org-wide governance (AWS Organizations SCPs, tag policies)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>DevOps Engineer<\/li>\n<li>Site Reliability Engineer (SRE)<\/li>\n<li>Cloud Operations Engineer<\/li>\n<li>Platform Engineer<\/li>\n<li>Solutions Architect (operational readiness focus)<\/li>\n<li>Technical Product Owner for platform\/operations<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (AWS)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">There is no certification specifically for Amazon DevOps Guru, but relevant AWS 
certifications include:\n&#8211; <strong>AWS Certified SysOps Administrator \u2013 Associate<\/strong>\n&#8211; <strong>AWS Certified DevOps Engineer \u2013 Professional<\/strong>\n&#8211; <strong>AWS Certified Solutions Architect \u2013 Associate\/Professional<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Insight-to-ticket automation<\/strong>: SNS \u2192 Lambda \u2192 create Jira\/ServiceNow ticket with insight metadata.<\/li>\n<li><strong>Multi-environment resource collections<\/strong>: Separate collections for <code>dev<\/code>, <code>staging<\/code>, <code>prod<\/code> using tags and validate notification routing.<\/li>\n<li><strong>Operational readiness scorecard<\/strong>: Combine DevOps Guru insights with Well-Architected reviews and CloudWatch alarm coverage reports.<\/li>\n<li><strong>Game day playbook<\/strong>: Run controlled experiments (in non-prod) and document which signals become insights and how fast responders act.<\/li>\n<li><strong>Cost guardrails<\/strong>: Use AWS Budgets + tagging to keep DevOps Guru scope aligned to critical resources only.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. 
Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>AIOps<\/strong>: Applying analytics\/ML to IT operations data to detect issues, correlate events, and assist remediation.<\/li>\n<li><strong>Anomaly<\/strong>: A deviation from normal behavior (for example, unusual error rate or latency) detected via statistical\/ML methods.<\/li>\n<li><strong>Insight (DevOps Guru)<\/strong>: A correlated, higher-level finding that groups anomalies, impacted resources, and recommendations.<\/li>\n<li><strong>Resource collection<\/strong>: A set of AWS resources grouped as an application\/workload for DevOps Guru monitoring.<\/li>\n<li><strong>Baseline<\/strong>: Learned \u201cnormal\u201d behavior over time used for anomaly detection.<\/li>\n<li><strong>CloudWatch metrics<\/strong>: Time-series data published by AWS services and custom applications.<\/li>\n<li><strong>CloudWatch Logs<\/strong>: Centralized log storage and basic analytics for AWS workloads.<\/li>\n<li><strong>SNS topic<\/strong>: A pub\/sub channel in Amazon SNS to fan out notifications.<\/li>\n<li><strong>Subscription<\/strong>: An endpoint (email, SMS, HTTP, Lambda, etc.) 
that receives SNS messages.<\/li>\n<li><strong>Least privilege<\/strong>: IAM principle of granting only the minimum permissions required.<\/li>\n<li><strong>MTTR<\/strong>: Mean Time To Resolution (or Recovery), a key operational performance metric.<\/li>\n<li><strong>SLO<\/strong>: Service Level Objective; a reliability target (for example 99.9% availability).<\/li>\n<li><strong>Runbook<\/strong>: A documented set of steps to diagnose and fix common operational issues.<\/li>\n<li><strong>CloudFormation stack<\/strong>: A deployable unit of infrastructure-as-code that provisions AWS resources together.<\/li>\n<li><strong>Tagging<\/strong>: Key\/value metadata on AWS resources used for ownership, cost allocation, and automation.<\/li>\n<li><strong>ChatOps<\/strong>: Operational workflows conducted through chat tools (for example Slack) integrated with automation and alerts.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Amazon DevOps Guru is an AWS managed service in the Machine Learning (ML) and Artificial Intelligence (AI) category that helps operations teams detect anomalies, correlate operational signals, and generate actionable insights and recommendations for AWS workloads. It fits alongside Amazon CloudWatch (telemetry and alarms), AWS X-Ray (tracing), and SNS (notifications) to improve incident detection and triage without running your own AIOps platform.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Cost and security success depends on disciplined scoping (tight resource collections), good tagging\/CloudFormation boundaries, least-privilege IAM, and careful notification routing. 
Use Amazon DevOps Guru when you want AWS-native operational intelligence for production workloads and want to reduce alert fatigue and MTTR; avoid relying on it as your only monitoring system.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next step: enable DevOps Guru for one well-defined production application, route insights to your on-call workflow via SNS, and run a small game day to validate that insights and recommendations are operationally useful.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Machine Learning (ML) and Artificial Intelligence (AI)<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,32],"tags":[],"class_list":["post-237","post","type-post","status-publish","format-standard","hentry","category-aws","category-machine-learning-ml-and-artificial-intelligence-ai"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/237","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=237"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/237\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=237"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=237"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=237"}],"curies":[{"name":"wp",
"href":"https:\/\/api.w.org\/{rel}","templated":true}]}}