{"id":201,"date":"2026-04-13T04:33:54","date_gmt":"2026-04-13T04:33:54","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/aws-fault-injection-service-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-developer-tools\/"},"modified":"2026-04-13T04:33:54","modified_gmt":"2026-04-13T04:33:54","slug":"aws-fault-injection-service-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-developer-tools","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/aws-fault-injection-service-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-developer-tools\/","title":{"rendered":"AWS Fault Injection Service Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Developer tools"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Developer tools<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service is an AWS managed service for running controlled fault-injection experiments\u2014often called <em>chaos engineering<\/em>\u2014against AWS workloads. You use it to deliberately introduce failures (for example, instance reboots, network disruptions, or other service-specific faults) to validate that your application and operations practices behave the way you expect under stress.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In simple terms: <strong>you define an experiment, pick the resources to target, choose the fault to inject, set safety guardrails, then run it and observe what happens<\/strong>. 
It\u2019s like a fire drill for your cloud architecture\u2014done in a repeatable, audited, automatable way.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Technically, AWS Fault Injection Service provides an experiment orchestration plane that integrates with AWS APIs and selected AWS services (such as Amazon EC2 and AWS Systems Manager, plus other services depending on region and current support). You define an <strong>experiment template<\/strong> containing <strong>targets<\/strong>, <strong>actions<\/strong>, and <strong>stop conditions<\/strong>. When you start an experiment, the service assumes an IAM role and issues the necessary AWS API calls (or invokes integrated mechanisms) to inject faults. Experiment progress is observable through the service console\/API and is also visible via AWS-native governance services like AWS CloudTrail and Amazon EventBridge.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The problem it solves is straightforward: <strong>many systems appear reliable until they fail in a realistic way<\/strong>. Standard unit tests and integration tests rarely reproduce real dependency failures (latency, node loss, AZ impairment patterns, misbehaving instances). AWS Fault Injection Service helps you find resilience gaps <em>before<\/em> customers do, and it helps teams prove that runbooks, alarms, and auto-healing actually work.<\/p>\n\n\n\n<blockquote>\n<p>Naming note (verify in official docs for your account\/region): AWS has historically used the name <strong>\u201cAWS Fault Injection Simulator (FIS)\u201d<\/strong> in APIs, IAM policy names, and documentation paths. You may still see \u201csimulator\u201d in some places even when the console and marketing pages use <strong>AWS Fault Injection Service<\/strong>.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
What is AWS Fault Injection Service?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service is designed to help you <strong>improve application performance, observability, and resilience<\/strong> by <strong>running fault injection experiments on AWS workloads<\/strong> in a controlled and repeatable manner.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level, AWS Fault Injection Service enables you to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Define experiments<\/strong> as reusable templates<\/li>\n<li><strong>Select targets<\/strong> using tags, resource IDs, or other supported selection methods<\/li>\n<li><strong>Inject faults<\/strong> using supported actions (service-dependent)<\/li>\n<li><strong>Set stop conditions<\/strong> (typically Amazon CloudWatch alarms) to automatically halt experiments if risk thresholds are exceeded<\/li>\n<li><strong>Observe results<\/strong> and integrate experiment execution with operational tooling<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (conceptual model)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Common constructs you\u2019ll work with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Experiment template<\/strong>: A reusable blueprint defining <em>what to target<\/em> and <em>what to do<\/em>.<\/li>\n<li><strong>Experiment<\/strong>: A single run created from a template.<\/li>\n<li><strong>Targets<\/strong>: Sets of AWS resources selected for fault injection (for example, tagged EC2 instances).<\/li>\n<li><strong>Actions<\/strong>: The fault injection steps (for example, reboot an instance).<\/li>\n<li><strong>Stop conditions<\/strong>: Safety controls (most commonly CloudWatch alarms) that stop the experiment when triggered.<\/li>\n<li><strong>IAM role<\/strong>: A role that AWS Fault Injection Service assumes to execute actions 
against your resources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Managed AWS service<\/strong> (control plane), accessed through:\n<ul class=\"wp-block-list\">\n<li>AWS Management Console<\/li>\n<li>AWS CLI and AWS SDKs<\/li>\n<li>AWS APIs<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scope: regional vs global<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service is <strong>regional<\/strong>: templates and experiments live in and operate within a specific AWS Region. Your targets must generally be in the same Region as the experiment. Always confirm Region support and action availability in the official documentation for your selected Region.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the AWS ecosystem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service sits in the \u201cDeveloper tools\u201d\/engineering enablement layer as a resilience testing orchestrator. It pairs naturally with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Amazon CloudWatch<\/strong> (metrics, alarms, dashboards)<\/li>\n<li><strong>AWS CloudTrail<\/strong> (auditability of experiment actions and API calls)<\/li>\n<li><strong>Amazon EventBridge<\/strong> (experiment state-change events routed to notifications\/automation)<\/li>\n<li><strong>AWS Systems Manager<\/strong> (commonly used for OS-level fault injection patterns where supported)<\/li>\n<li><strong>AWS Resilience Hub<\/strong> and <strong>AWS Well-Architected Tool<\/strong> (governance and resilience posture; not a replacement for fault injection)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use AWS Fault Injection Service?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Reduce revenue-impacting outages<\/strong> by identifying failure modes earlier.<\/li>\n<li><strong>Increase customer trust<\/strong> by validating resilience claims with evidence.<\/li>\n<li><strong>Lower incident cost<\/strong> by improving MTTR through practiced responses and validated automation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Test assumptions<\/strong> in distributed systems (timeouts, retries, circuit breakers, fallback behavior).<\/li>\n<li><strong>Validate redundancy<\/strong> (multi-AZ, scaling policies, self-healing) under real disruption.<\/li>\n<li><strong>Find hidden dependencies<\/strong> (shared services, DNS behavior, credential refresh, stateful components).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Turn \u201crunbooks\u201d into verified procedures<\/strong> by running experiments during game days.<\/li>\n<li><strong>Improve monitoring quality<\/strong>: ensure alarms trigger when they should, and only then.<\/li>\n<li><strong>Create repeatable, versionable tests<\/strong> (templates) that can be run on demand or during release cycles.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Auditability<\/strong> via CloudTrail and IAM roles used for experiments.<\/li>\n<li><strong>Controlled access<\/strong> through least-privilege IAM and scoped targets.<\/li>\n<li>Helps demonstrate operational resilience practices for certain compliance programs (requirements vary; verify with your compliance team).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Fault 
injection can validate scaling and performance behavior under partial failure (for example, increased latency causing request queueing). It\u2019s not a load testing tool by itself, but it often complements load testing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Choose AWS Fault Injection Service when you need:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS-native fault injection orchestration<\/li>\n<li>Repeatable experiments with safety guardrails (stop conditions)<\/li>\n<li>Integration with AWS monitoring and governance tools<\/li>\n<li>A managed service (no chaos platform to host\/maintain)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Avoid or defer AWS Fault Injection Service when:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You cannot safely isolate blast radius (no staging, no canary, no strict targeting)<\/li>\n<li>Your organization has not established basic observability (you can\u2019t measure impact)<\/li>\n<li>You need fault types not supported by AWS Fault Injection Service (you may need third-party or self-managed chaos tooling)<\/li>\n<li>You\u2019re looking for classic performance\/load testing\u2014use tools like distributed load generators instead<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is AWS Fault Injection Service used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SaaS and internet services<\/li>\n<li>FinTech and payments (with strict controls and non-production emphasis)<\/li>\n<li>Media\/streaming<\/li>\n<li>E-commerce<\/li>\n<li>Gaming<\/li>\n<li>Healthcare and regulated industries (often in pre-production, with strong governance)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SRE \/ platform engineering<\/li>\n<li>DevOps teams running CI\/CD and operational readiness programs<\/li>\n<li>Cloud infrastructure teams validating multi-AZ patterns<\/li>\n<li>Application teams responsible for service reliability<\/li>\n<li>Security and risk teams coordinating resilience and operational risk controls<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microservices on compute fleets<\/li>\n<li>Containerized workloads (service support varies; verify)<\/li>\n<li>Event-driven systems (queues, streams, workflows)<\/li>\n<li>Tiered web applications (ALB \u2192 compute \u2192 datastore)<\/li>\n<li>Batch pipelines and scheduled workloads<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Multi-AZ architectures where instance-level and network-level failures are meaningful<\/li>\n<li>Service-mesh or gateway-based architectures where failure propagation can be studied<\/li>\n<li>Hybrid patterns (some AWS resources plus on-prem) where AWS-side failures are still important<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Game days<\/strong>: planned resilience exercises with stakeholders.<\/li>\n<li><strong>Release readiness<\/strong>: run experiments before major launches.<\/li>\n<li><strong>Post-incident learning<\/strong>: 
validate that fixes prevent repeat incidents.<\/li>\n<li><strong>Continuous resilience<\/strong>: scheduled experiments in non-production (and sometimes controlled production).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Production vs dev\/test usage<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Non-production<\/strong> is the safest place to start: replicate topology and validate behavior.<\/li>\n<li><strong>Production<\/strong> can be appropriate, but only with:<\/li>\n<li>strict scoping (tags, narrow selection)<\/li>\n<li>mature observability and on-call readiness<\/li>\n<li>clear abort criteria via stop conditions<\/li>\n<li>change management approvals where required<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Below are realistic, commonly implemented scenarios. Exact actions available depend on Region and service support\u2014always confirm in the <strong>Actions<\/strong> section of the official docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) EC2 instance reboot resilience test<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You don\u2019t know if an instance reboot causes customer impact.<\/li>\n<li><strong>Why this service fits<\/strong>: A controlled reboot action can validate load balancer health checks, auto-healing, and graceful shutdown.<\/li>\n<li><strong>Example<\/strong>: Reboot one instance in an Auto Scaling group and confirm traffic stays healthy and alarms behave.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Validate Auto Scaling replacement and warmup<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: After node loss, replacement capacity is slow or misconfigured.<\/li>\n<li><strong>Why this service fits<\/strong>: Fault injection triggers failure while you measure replacement time and error rates.<\/li>\n<li><strong>Example<\/strong>: 
Inject instance termination (in non-prod) to confirm lifecycle hooks and warmup scripts work.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Network impairment drill (latency\/packet loss) for dependency timeouts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Services have overly long timeouts and cause request pileups under latency.<\/li>\n<li><strong>Why this service fits<\/strong>: If supported for your setup, network disruption actions can create controlled latency.<\/li>\n<li><strong>Example<\/strong>: Increase latency to a subset of instances and confirm retries\/backoff are sane.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Validate alarms and stop conditions<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You have alarms, but you\u2019re unsure they trigger under real failures.<\/li>\n<li><strong>Why this service fits<\/strong>: You can wire CloudWatch alarms as stop conditions and verify they trigger and halt experiments.<\/li>\n<li><strong>Example<\/strong>: Trigger a fault expected to raise a specific alarm and confirm the experiment stops automatically.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Game day for operational readiness<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Teams don\u2019t practice responding to failures; runbooks are untested.<\/li>\n<li><strong>Why this service fits<\/strong>: Repeatable experiments enable consistent drills and measurable improvements.<\/li>\n<li><strong>Example<\/strong>: Monthly game day: run the same experiment template and compare MTTR and error budgets over time.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Validate blue\/green or canary deployment rollback behavior<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Rollbacks are assumed to work but rarely tested under pressure.<\/li>\n<li><strong>Why this service fits<\/strong>: Fault injection can introduce controlled 
instability during a release rehearsal.<\/li>\n<li><strong>Example<\/strong>: During a canary, inject a fault in the canary fleet and verify rollback automation triggers safely.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Test resilience of stateful services\u2019 client behavior<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Client libraries behave badly during partial outages (connection storms, thundering herd).<\/li>\n<li><strong>Why this service fits<\/strong>: Inject faults into compute tiers and watch how clients reconnect or retry.<\/li>\n<li><strong>Example<\/strong>: Reboot a percentage of API servers and validate connection pooling and retry jitter.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Validate multi-AZ behavior (application tier)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You think the app is multi-AZ, but traffic is sticky to one AZ or capacity is uneven.<\/li>\n<li><strong>Why this service fits<\/strong>: Target selection can focus on a subset (for example, instances tagged by AZ via automation).<\/li>\n<li><strong>Example<\/strong>: Disrupt compute nodes in one AZ in staging and validate traffic shifts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Confirm incident response automation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Automation exists (e.g., runbooks, tickets, paging), but no one knows if it triggers correctly.<\/li>\n<li><strong>Why this service fits<\/strong>: Use EventBridge experiment events to drive automation (notifications\/tickets).<\/li>\n<li><strong>Example<\/strong>: When an experiment starts\/stops, send messages to a chat channel and create a change record.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Validate queue\/stream backpressure under partial worker loss<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Losing workers causes backlog and violates processing 
SLAs.<\/li>\n<li><strong>Why this service fits<\/strong>: Disrupt workers and measure queue depth and recovery time.<\/li>\n<li><strong>Example<\/strong>: Reboot a fraction of worker instances and confirm autoscaling catches up.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) Assess blast radius of misconfigurations (safely)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A bad config might take down a tier; you want a safe rehearsal.<\/li>\n<li><strong>Why this service fits<\/strong>: Fault injection in staging reveals whether your guardrails and config rollout strategies are safe.<\/li>\n<li><strong>Example<\/strong>: Inject disruption while testing feature flags and staged rollouts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Prove resilience controls for audits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Auditors ask for evidence of resilience testing and governance.<\/li>\n<li><strong>Why this service fits<\/strong>: Templates, roles, CloudTrail logs, and experiment history form a defensible evidence trail.<\/li>\n<li><strong>Example<\/strong>: Provide experiment template definitions and CloudTrail event evidence showing controlled execution.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Feature availability can vary by Region and target service. 
For the authoritative list of supported actions and targets, use the official documentation:\nhttps:\/\/docs.aws.amazon.com\/fis\/latest\/userguide\/what-is.html<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Experiment templates (repeatability)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Lets you define experiments once and run them many times.<\/li>\n<li><strong>Why it matters<\/strong>: Repeatability turns chaos experiments into regression tests for resilience.<\/li>\n<li><strong>Practical benefit<\/strong>: You can run the same test after code changes, infrastructure changes, or incidents.<\/li>\n<li><strong>Caveats<\/strong>: Templates are regional; cross-region experiments typically require separate templates.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Targets and resource selection (blast radius control)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Defines the scope of resources affected (often via tags).<\/li>\n<li><strong>Why it matters<\/strong>: Safe chaos engineering depends on tight targeting.<\/li>\n<li><strong>Practical benefit<\/strong>: You can select \u201conly staging\u201d resources by tag, or \u201conly one instance\u201d for a canary.<\/li>\n<li><strong>Caveats<\/strong>: Tag hygiene becomes a critical safety control; weak tagging increases risk.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Actions (fault injection primitives)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Executes fault types against targets (for example, instance reboot).<\/li>\n<li><strong>Why it matters<\/strong>: Enables realistic failure injection using AWS-integrated mechanisms.<\/li>\n<li><strong>Practical benefit<\/strong>: Simulates failures that are hard to reproduce manually and consistently.<\/li>\n<li><strong>Caveats<\/strong>: Supported actions differ by service\/Region; some actions are disruptive and not easily 
reversible.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Stop conditions (safety guardrails)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Stops the experiment automatically when a specified condition occurs (commonly a CloudWatch alarm).<\/li>\n<li><strong>Why it matters<\/strong>: Prevents experiments from escalating into unacceptable customer impact.<\/li>\n<li><strong>Practical benefit<\/strong>: You can tie stop conditions to key SLO\/SLA signals (5xx rate, latency, queue depth).<\/li>\n<li><strong>Caveats<\/strong>: Stop conditions stop the experiment execution, but they may not automatically \u201cundo\u201d all impacts. Plan rollback steps.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Integration with Amazon CloudWatch (observability)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Uses CloudWatch alarms for stop conditions; you can observe metrics and dashboards during experiments.<\/li>\n<li><strong>Why it matters<\/strong>: Without measurable signals, chaos experiments are guesswork.<\/li>\n<li><strong>Practical benefit<\/strong>: Standardizes resilience testing around metrics that matter to users.<\/li>\n<li><strong>Caveats<\/strong>: Ensure alarms are tuned; overly sensitive alarms can halt experiments prematurely.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Integration with AWS CloudTrail (auditing)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Captures API activity associated with experiments and actions.<\/li>\n<li><strong>Why it matters<\/strong>: You need an audit trail for governance and incident review.<\/li>\n<li><strong>Practical benefit<\/strong>: Enables security teams to verify who ran what, when, and under which role.<\/li>\n<li><strong>Caveats<\/strong>: CloudTrail logs record API calls; interpret in combination with experiment logs and resource metrics.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) 
Integration with Amazon EventBridge (automation hooks)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Emits events about experiment lifecycle state changes.<\/li>\n<li><strong>Why it matters<\/strong>: Helps integrate experiments into ChatOps, ticketing, CI\/CD, or runbook automation.<\/li>\n<li><strong>Practical benefit<\/strong>: Automatically notify on-call when an experiment starts\/stops.<\/li>\n<li><strong>Caveats<\/strong>: Event schemas can evolve; validate rules in a sandbox.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) IAM execution role model (least privilege)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Uses an IAM role that AWS Fault Injection Service assumes to perform actions.<\/li>\n<li><strong>Why it matters<\/strong>: Separates \u201cwho can run experiments\u201d from \u201cwhat experiments can do.\u201d<\/li>\n<li><strong>Practical benefit<\/strong>: Scope the role to only allowed actions and only allowed resources.<\/li>\n<li><strong>Caveats<\/strong>: Misconfigured roles can either block experiments (too restrictive) or create risk (too broad).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Tagging and governance support<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Lets you tag templates and related resources; integrates with AWS governance patterns.<\/li>\n<li><strong>Why it matters<\/strong>: Production-grade chaos programs require inventory, ownership, and cost allocation.<\/li>\n<li><strong>Practical benefit<\/strong>: Tag by environment, owner, application, and change ticket.<\/li>\n<li><strong>Caveats<\/strong>: Establish naming and tagging standards early.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) API\/CLI\/SDK accessibility (automation)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Enables infrastructure-as-code and automation workflows to create and run experiments (where 
supported).<\/li>\n<li><strong>Why it matters<\/strong>: Mature teams treat experiments like code and run them regularly.<\/li>\n<li><strong>Practical benefit<\/strong>: Integrate into pipelines and scheduled validation.<\/li>\n<li><strong>Caveats<\/strong>: Some teams start in console, then codify later; ensure approval workflows before automation in production.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service acts as a <strong>control plane<\/strong> that orchestrates fault injection by calling AWS APIs (or using integrated mechanisms) against selected targets. You define templates, then start experiments. Experiments execute actions while monitoring stop conditions. Results and activity are visible through AWS-native logging and monitoring.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request \/ control flow<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>User (engineer, pipeline, or automation) starts an experiment from a template.<\/li>\n<li>AWS Fault Injection Service assumes the configured IAM role.<\/li>\n<li>The service resolves targets (e.g., by tags) into concrete resource IDs.<\/li>\n<li>The service executes actions against those targets.<\/li>\n<li>Stop conditions (e.g., CloudWatch alarms) are evaluated; if triggered, the experiment stops.<\/li>\n<li>Events are emitted (EventBridge), and API activity is logged (CloudTrail).<\/li>\n<li>Operators observe metrics\/logs and validate application behavior.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Common integrations include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>CloudWatch<\/strong>: alarms (stop conditions), dashboards for validation<\/li>\n<li><strong>CloudTrail<\/strong>: audit trail 
for actions and template changes<\/li>\n<li><strong>EventBridge<\/strong>: experiment lifecycle events, notifications, automation<\/li>\n<li><strong>Systems Manager<\/strong>: used in some patterns (e.g., agent-based disruptions) depending on supported actions<\/li>\n<li><strong>IAM<\/strong>: role assumption model, permission boundaries, SCPs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service relies on the target services it interacts with (EC2, etc.) and on AWS identity\/governance services (IAM, CloudTrail). If your account is constrained by SCPs or permission boundaries, ensure experiments can still execute safely.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>IAM users\/roles need permissions to create templates and start experiments.<\/li>\n<li>Experiments run using an <strong>IAM execution role<\/strong> trusted by the AWS Fault Injection Service principal.<\/li>\n<li>CloudTrail records API calls for governance.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service itself is managed by AWS. Your targets may be in VPCs. 
Some fault types (especially network impairment patterns) may rely on agent-based mechanisms or on security group\/network ACL behavior. Verify the specific action requirements in the official docs for your selected fault type.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Define dashboards and alarms before experiments.<\/li>\n<li>Use CloudTrail to audit <em>who changed templates<\/em> and <em>who ran experiments<\/em>.<\/li>\n<li>Use EventBridge to notify stakeholders when experiments begin\/end.<\/li>\n<li>Use tags for ownership, environment, and change control mapping.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[Engineer \/ CI Pipeline] --&gt;|Start experiment| FIS[AWS Fault Injection Service]\n  FIS --&gt;|Assume role| IAM[IAM Execution Role]\n  FIS --&gt;|Execute action| T[(\"Target Resources\\n(e.g., EC2 instances)\")]\n  CW[Amazon CloudWatch Alarm] --&gt;|Stop condition| FIS\n  FIS --&gt; EB[Amazon EventBridge Events]\n  FIS --&gt; CT[AWS CloudTrail Logs]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph Ops[\"Operations &amp; Governance\"]\n    SOC[Security \/ Audit]\n    ONCALL[On-call \/ SRE]\n    CT[AWS CloudTrail]\n    EB[Amazon EventBridge]\n    CW[Amazon CloudWatch\\nDashboards + Alarms]\n  end\n\n  subgraph App[\"Workload (Multi-AZ example)\"]\n    ALB[Load Balancer]\n    ASG1[Compute Fleet AZ-A]\n    ASG2[Compute Fleet AZ-B]\n    DEP[\"Downstream Dependency\\n(e.g., datastore \/ API)\"]\n  end\n\n  subgraph Chaos[\"Chaos Engineering Control Plane\"]\n    U[Engineer \/ Pipeline]\n    FIS[AWS Fault Injection Service]\n    ROLE[IAM Experiment Execution Role]\n    TEMPLATE[Experiment Template\\nTargets + Actions + Stop Conditions]\n  
end\n\n  U --&gt; TEMPLATE --&gt; FIS\n  FIS --&gt; ROLE\n  FIS --&gt;|Inject fault| ASG1\n  FIS --&gt;|Inject fault| ASG2\n\n  ALB --&gt; ASG1\n  ALB --&gt; ASG2\n  ASG1 --&gt; DEP\n  ASG2 --&gt; DEP\n\n  CW --&gt;|Stop condition| FIS\n  FIS --&gt; EB --&gt; ONCALL\n  FIS --&gt; CT --&gt; SOC\n  CW &lt;--&gt;|Metrics| ALB\n  CW &lt;--&gt;|Metrics| ASG1\n  CW &lt;--&gt;|Metrics| ASG2\n  CW &lt;--&gt;|Metrics| DEP\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An AWS account with permissions to use AWS Fault Injection Service.<\/li>\n<li>A non-production (recommended) or tightly controlled production environment.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You typically need:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>User permissions<\/strong> (for the operator) to:\n<ul class=\"wp-block-list\">\n<li>Create and manage experiment templates<\/li>\n<li>Start\/stop experiments<\/li>\n<li>View experiment history<\/li>\n<li>Create\/read CloudWatch alarms used for stop conditions<\/li>\n<\/ul>\n<\/li>\n<li><strong>Experiment execution role<\/strong> permissions:\n<ul class=\"wp-block-list\">\n<li>A role that AWS Fault Injection Service can assume<\/li>\n<li>Permissions scoped to the specific actions and target resources you intend to use<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Best practice: create a dedicated role per environment (dev\/stage\/prod) and keep it narrowly scoped.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A valid payment method on the account.<\/li>\n<li>Even if AWS Fault Injection Service usage is small, underlying targets (EC2, logs, etc.) 
can incur costs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tools<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For the hands-on lab you can use either:\n&#8211; AWS Management Console (recommended for beginners)\n&#8211; AWS CLI (optional, for provisioning targets)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If using AWS CLI:\n&#8211; AWS CLI v2 installed and configured (<code>aws configure<\/code>)\n&#8211; Permissions to create\/terminate EC2 instances (in the lab)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS Fault Injection Service is regional and not available in all Regions with identical features.<\/li>\n<li><strong>Verify Region availability and supported actions<\/strong> in the official docs for your chosen Region.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas \/ limits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Service quotas apply (templates, experiments, API rates, etc.).<\/li>\n<li>Check:<\/li>\n<li>AWS Service Quotas console<\/li>\n<li>AWS Fault Injection Service documentation for quotas<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services (lab)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Amazon EC2 (a small instance to target)<\/li>\n<li>Amazon CloudWatch (an alarm to use as a stop condition)<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service uses a <strong>usage-based pricing model<\/strong>. 
Pricing details can change and can be Region-specific.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (what you pay for)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Common pricing dimensions include (verify on the official pricing page):<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Experiment\/action execution usage<\/strong>: Often measured by experiment duration or action minutes\/seconds, or per action invocation (model depends on current AWS pricing).<\/li>\n<li><strong>Underlying AWS resources<\/strong> impacted by the experiment:<\/li>\n<li>EC2 instances (compute)<\/li>\n<li>EBS volumes (storage and I\/O)<\/li>\n<li>CloudWatch metrics\/alarms (some metrics are free; alarms typically have costs)<\/li>\n<li>CloudWatch Logs ingestion and retention (if you log heavily)<\/li>\n<li>Data transfer (if the test increases cross-AZ or cross-region traffic)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If a free tier exists for this service, it will be listed on the pricing page. 
<strong>Verify in official docs\/pricing<\/strong>:\n  https:\/\/aws.amazon.com\/fis\/pricing\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>How often you run experiments<\/strong> (frequency)<\/li>\n<li><strong>How long experiments run<\/strong><\/li>\n<li><strong>How many actions per experiment<\/strong><\/li>\n<li><strong>How many targets<\/strong> are affected (and their cost profiles)<\/li>\n<li><strong>Observability overhead<\/strong> (dashboards, logs, alarms, synthetic checks)<\/li>\n<li><strong>Blast radius<\/strong>: larger experiments can indirectly increase infrastructure scaling and therefore costs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Auto Scaling events<\/strong>: if experiments trigger scale-out, you pay for extra instances.<\/li>\n<li><strong>Incident response overhead<\/strong>: game days consume engineering time (worth it, but real cost).<\/li>\n<li><strong>Data transfer<\/strong>: fault injection that changes traffic patterns can increase cross-AZ charges.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service doesn\u2019t \u201ctransfer your data\u201d itself like a data pipeline service, but fault scenarios can change routing behavior and traffic distribution. 
Watch:\n&#8211; Cross-AZ traffic (commonly billed)\n&#8211; NAT Gateway processing if more traffic flows through it during the test\n&#8211; Internet egress if retries amplify outbound calls<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with <strong>small scope<\/strong>: one instance, one action, short runtime.<\/li>\n<li>Use <strong>non-production<\/strong> environments with right-sized instances.<\/li>\n<li>Prefer <strong>reversible actions<\/strong> (like reboot) while learning.<\/li>\n<li>Use <strong>precise tagging<\/strong> to avoid accidentally targeting extra resources.<\/li>\n<li>Keep log retention short for experiment-specific logs, if appropriate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (conceptual)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A minimal starter lab might include:\n&#8211; One small EC2 instance (e.g., a burstable instance)\n&#8211; One CloudWatch alarm\n&#8211; One short experiment run (minutes)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Your primary costs are typically the EC2 instance runtime plus the alarm and any logs. 
<strong>Exact totals depend on Region and current pricing<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In production, costs are less about the AWS Fault Injection Service line item and more about:\n&#8211; Additional telemetry (metrics, logs, traces)\n&#8211; Increased capacity during tests (intentional headroom)\n&#8211; Time spent by on-call and stakeholders\n&#8211; Risk mitigation steps (canary, rollback readiness)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For official pricing and to estimate costs:\n&#8211; Pricing page: https:\/\/aws.amazon.com\/fis\/pricing\/\n&#8211; AWS Pricing Calculator: https:\/\/calculator.aws\/<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Run a safe, beginner-friendly AWS Fault Injection Service experiment that <strong>reboots a single EC2 instance<\/strong> selected by tag, using a <strong>CloudWatch alarm stop condition<\/strong> for safety, and then validate the impact through EC2 instance state and monitoring.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This lab is designed to be low-cost and reversible.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Create a small EC2 instance and tag it as an experiment target.<\/li>\n<li>Create a CloudWatch alarm (stop condition) and a simple dashboard\/metric view to observe.<\/li>\n<li>Create an AWS Fault Injection Service experiment template targeting the tagged instance and performing a reboot action.<\/li>\n<li>Run the experiment and validate:\n   &#8211; experiment lifecycle (started \u2192 completed\/stopped)\n   &#8211; EC2 instance reboot behavior\n   &#8211; CloudTrail\/EventBridge visibility (basic 
checks)<\/li>\n<li>Clean up all created resources.<\/li>\n<\/ol>\n\n\n\n<blockquote>\n<p>Safety note: Run this in a <strong>non-production<\/strong> account or environment. Rebooting an instance will interrupt any sessions and can cause brief downtime for anything hosted on it.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Choose a Region and set basic naming<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Pick an AWS Region where you will run everything (EC2 + CloudWatch + AWS Fault Injection Service).<\/li>\n<li>Decide a consistent prefix, for example:\n   &#8211; Project: <code>fis-lab<\/code>\n   &#8211; Environment tag: <code>dev<\/code><\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You have a single Region selected and a naming convention to avoid confusion.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create a target EC2 instance (low-cost)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can do this via the console (recommended) or AWS CLI.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Option A: Console (recommended)<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open <strong>Amazon EC2<\/strong> console.<\/li>\n<li>Launch an instance:\n   &#8211; AMI: Amazon Linux (or another small Linux AMI you prefer)\n   &#8211; Instance type: choose a small, low-cost type appropriate for your account (commonly a burstable instance)\n   &#8211; Network: default VPC is fine for this lab\n   &#8211; Security group: allow SSH from your IP if you want to log in (optional)<\/li>\n<li>Add tags:\n   &#8211; <code>Name = fis-lab-instance<\/code>\n   &#8211; <code>FisTarget = true<\/code>\n   &#8211; <code>Environment = dev<\/code><\/li>\n<\/ol>\n\n\n\n<h4 class=\"wp-block-heading\">Option B: AWS CLI (example)<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">This example assumes you already have a key pair and you know which subnet 
to use. If not, use the console method.<\/p>\n\n\n\n<pre><code class=\"language-bash\">aws ec2 run-instances \\\n  --image-id &lt;YOUR_AMI_ID&gt; \\\n  --instance-type &lt;YOUR_INSTANCE_TYPE&gt; \\\n  --subnet-id &lt;YOUR_SUBNET_ID&gt; \\\n  --security-group-ids &lt;YOUR_SG_ID&gt; \\\n  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=fis-lab-instance},{Key=FisTarget,Value=true},{Key=Environment,Value=dev}]' \\\n  --count 1\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> One running EC2 instance tagged with <code>FisTarget=true<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; EC2 console \u2192 Instances \u2192 confirm instance state is <strong>Running<\/strong>\n&#8211; Confirm tags exist and are spelled exactly (case-sensitive)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create a CloudWatch alarm (stop condition)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This lab uses a stop condition to demonstrate guardrails. The alarm does not have to trigger; it\u2019s there to show how to wire safety controls.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open <strong>Amazon CloudWatch<\/strong> console.<\/li>\n<li>Go to <strong>Alarms<\/strong> \u2192 <strong>Create alarm<\/strong>.<\/li>\n<li>Select a metric for the EC2 instance. Two common options:\n   &#8211; A metric that indicates instance health (useful for real tests), or\n   &#8211; A metric that indicates user impact (like 5xx rate on a load balancer), if you have one.<\/li>\n<li>For a simple lab on a single instance, choose an EC2 metric available for your instance and set a threshold that is unlikely to trigger accidentally.<\/li>\n<li>Name the alarm:\n   &#8211; <code>fis-lab-stop-alarm<\/code><\/li>\n<li>(Optional) Configure notifications to an SNS topic if you already use one. 
For a minimal lab, you can skip notifications.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> A CloudWatch alarm exists and is in <strong>OK<\/strong> state.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; CloudWatch \u2192 Alarms \u2192 confirm <code>fis-lab-stop-alarm<\/code> is present and OK.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create an AWS Fault Injection Service experiment template<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will now create an experiment template that:\n&#8211; Targets resources with tag <code>FisTarget=true<\/code>\n&#8211; Performs a <strong>reboot<\/strong> action on the selected EC2 instance(s)\n&#8211; Uses the CloudWatch alarm as a <strong>stop condition<\/strong><\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open the <strong>AWS Fault Injection Service<\/strong> console in your Region.<\/li>\n<li>Go to <strong>Experiment templates<\/strong> \u2192 <strong>Create experiment template<\/strong>.<\/li>\n<li>Template basics:\n   &#8211; Name: <code>fis-lab-ec2-reboot<\/code>\n   &#8211; Description: <code>Reboot a tagged EC2 instance (lab)<\/code>\n   &#8211; Tags (recommended):<ul>\n<li><code>Environment=dev<\/code><\/li>\n<li><code>Owner=&lt;your-team-or-name&gt;<\/code><\/li>\n<\/ul>\n<\/li>\n<li>Define <strong>Targets<\/strong>:\n   &#8211; Target name: <code>ec2-target<\/code>\n   &#8211; Resource type: choose the EC2 instance resource type offered in the UI\n   &#8211; Selection method: <strong>Tags<\/strong>\n   &#8211; Tag key: <code>FisTarget<\/code>\n   &#8211; Tag value: <code>true<\/code>\n   &#8211; (If the UI supports further constraints, keep it narrow\u2014this lab should affect only one instance.)<\/li>\n<li>Define <strong>Actions<\/strong>:\n   &#8211; Action name: <code>reboot-ec2<\/code>\n   &#8211; Action type: choose an EC2 reboot action if available (wording may differ by 
console version)\n   &#8211; Target: select <code>ec2-target<\/code><\/li>\n<li>Configure <strong>Stop conditions<\/strong>:\n   &#8211; Add stop condition: select <strong>CloudWatch alarm<\/strong>\n   &#8211; Choose <code>fis-lab-stop-alarm<\/code><\/li>\n<li>\n<p>IAM Role for experiment execution:\n   &#8211; If the console offers to create or select a role, choose the safest option:<\/p>\n<ul>\n<li>Prefer a <strong>new role<\/strong> created for this template, or<\/li>\n<li>Select an existing dedicated FIS execution role that is scoped to EC2 reboot on only the tagged instance(s).<\/li>\n<li>If you are unsure, follow the official docs guidance for the execution role:\n https:\/\/docs.aws.amazon.com\/fis\/latest\/userguide\/getting-started.html (navigate to IAM\/role sections)<\/li>\n<\/ul>\n<\/li>\n<li>\n<p>Create the template.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> The experiment template <code>fis-lab-ec2-reboot<\/code> is created successfully.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; In AWS Fault Injection Service console, the template appears in the list.\n&#8211; The template shows:\n  &#8211; 1 target (tag-based)\n  &#8211; 1 action (reboot)\n  &#8211; 1 stop condition (CloudWatch alarm)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Run the experiment<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the AWS Fault Injection Service console, open the template <code>fis-lab-ec2-reboot<\/code>.<\/li>\n<li>Click <strong>Start experiment<\/strong>.<\/li>\n<li>Confirm the prompt. 
Carefully review:\n   &#8211; Which resources will be targeted\n   &#8211; Which action will be executed<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> An experiment run is created and transitions to <strong>Running<\/strong>, then <strong>Completed<\/strong> (or <strong>Stopped<\/strong> if the stop condition triggers).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; AWS Fault Injection Service \u2192 Experiments \u2192 select the running experiment and watch status changes.\n&#8211; EC2 console: the instance should show signs of reboot (for example, status checks may reset briefly).\n&#8211; If you are connected via SSH, your session will likely drop during reboot.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Observe and validate behavior<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">During and after the experiment, validate both infrastructure and operations signals.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Infrastructure checks<\/strong>\n&#8211; EC2 instance is in the <strong>Running<\/strong> state after the reboot (a reboot does not change the instance state, so it should not show as stopped or terminated).\n&#8211; Instance status checks return to normal.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Operational checks<\/strong>\n&#8211; CloudWatch alarm remains <strong>OK<\/strong> (unless you intentionally tuned it otherwise).\n&#8211; CloudTrail shows relevant API activity (for audit).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Basic CloudTrail check:\n1. Open <strong>CloudTrail<\/strong> \u2192 <strong>Event history<\/strong>.\n2. Filter by:\n   &#8211; Event source such as EC2\n   &#8211; Or by the IAM role used by the experiment\n3. 
Confirm the reboot-related API call is logged.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You can prove the experiment ran, impacted only the intended instance, and returned to a healthy state.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use this checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>[ ] Experiment status reached <strong>Completed<\/strong> (or stopped for a known reason)<\/li>\n<li>[ ] Only the intended instance (tagged target) was affected<\/li>\n<li>[ ] EC2 instance is healthy after the reboot<\/li>\n<li>[ ] CloudWatch alarm did not unexpectedly trigger (or triggered as designed)<\/li>\n<li>[ ] CloudTrail shows an audit trail of the activity<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem: Template creation fails due to IAM permissions<\/strong>\n&#8211; <strong>Cause<\/strong>: Your user\/role lacks permissions to create\/attach roles, or your org uses SCPs.\n&#8211; <strong>Fix<\/strong>:\n  &#8211; Ask an admin to grant the minimum IAM permissions.\n  &#8211; Use a pre-approved FIS execution role.\n  &#8211; Review SCP\/permission boundary constraints.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem: Experiment fails with \u201cno targets matched\u201d<\/strong>\n&#8211; <strong>Cause<\/strong>: Tags don\u2019t match exactly, or resources are in a different Region.\n&#8211; <strong>Fix<\/strong>:\n  &#8211; Confirm tag key\/value spelling and case.\n  &#8211; Confirm the instance is in the same Region as the experiment.\n  &#8211; Ensure the selection method is tag-based and points to the correct tag.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem: Action not available in the console<\/strong>\n&#8211; <strong>Cause<\/strong>: Action support varies by Region 
and resource type.\n&#8211; <strong>Fix<\/strong>:\n  &#8211; Verify supported actions in the official docs for your Region.\n  &#8211; Try a different Region or a different supported target\/action pairing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem: Experiment stops immediately<\/strong>\n&#8211; <strong>Cause<\/strong>: The stop-condition CloudWatch alarm is already in the ALARM state or triggers quickly.\n&#8211; <strong>Fix<\/strong>:\n  &#8211; Ensure the alarm is in the OK state before starting.\n  &#8211; Adjust the alarm threshold\/evaluation periods for the lab.\n  &#8211; Use a different stop condition or temporarily remove it in non-prod (not recommended for production).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Problem: Instance does not recover after reboot<\/strong>\n&#8211; <strong>Cause<\/strong>: The instance had underlying boot issues, disk issues, or configuration drift.\n&#8211; <strong>Fix<\/strong>:\n  &#8211; Check the EC2 system log and instance screenshot.\n  &#8211; Confirm the security group\/NACL allows the required access.\n  &#8211; If this is a disposable lab instance, terminate and recreate it.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To avoid ongoing costs:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Terminate the EC2 instance<\/strong> (a stopped instance still incurs EBS storage charges)\n   &#8211; EC2 console \u2192 Instances \u2192 select instance \u2192 Terminate<\/li>\n<li><strong>Delete the CloudWatch alarm<\/strong>\n   &#8211; CloudWatch \u2192 Alarms \u2192 delete <code>fis-lab-stop-alarm<\/code><\/li>\n<li><strong>Delete the experiment template<\/strong>\n   &#8211; AWS Fault Injection Service \u2192 Experiment templates \u2192 delete <code>fis-lab-ec2-reboot<\/code><\/li>\n<li>(Optional) Remove any IAM roles created specifically for this lab if they are not reused\n   &#8211; IAM console \u2192 Roles \u2192 locate the lab role \u2192 delete (ensure it\u2019s not used 
elsewhere)<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> No lab resources remain that could incur costs.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Start small, then scale scope<\/strong>: one instance \u2192 one tier \u2192 one AZ \u2192 multi-AZ.<\/li>\n<li><strong>Test one hypothesis per experiment<\/strong>: clearer outcomes and safer rollback.<\/li>\n<li><strong>Design for blast radius<\/strong>:<\/li>\n<li>Strict tag-based targeting<\/li>\n<li>Separate templates per environment and per application<\/li>\n<li><strong>Make resilience measurable<\/strong>:<\/li>\n<li>Tie experiments to SLO metrics (latency, error rate, backlog)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>least-privilege execution roles<\/strong> per template\/environment.<\/li>\n<li>Separate permissions:<\/li>\n<li>\u201cTemplate authors\u201d<\/li>\n<li>\u201cExperiment runners\u201d<\/li>\n<li>\u201cApprovers\u201d<\/li>\n<li>Use <strong>permission boundaries<\/strong> and <strong>SCPs<\/strong> to restrict risky actions in production.<\/li>\n<li>Require tagging and enforce it with policy (where possible).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use non-prod and right-size targets.<\/li>\n<li>Keep experiment durations short.<\/li>\n<li>Avoid experiments that trigger expensive scale-out unless that is the purpose.<\/li>\n<li>Control log ingestion and retention.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Monitor retry behavior to avoid amplification (retry storms).<\/li>\n<li>Validate connection pool behavior and 
client timeouts.<\/li>\n<li>Watch downstream dependencies for saturation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Always include:<ul>\n<li>Observability plan (dashboards, alarms)<\/li>\n<li>Rollback plan (how to restore normal state)<\/li>\n<li>Safety plan (stop conditions + human abort procedure)<\/li>\n<\/ul>\n<\/li>\n<li>Schedule game days during staffed hours.<\/li>\n<li>Record results and track action items like you would for incidents.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Integrate with EventBridge to notify:<ul>\n<li>on-call rotations<\/li>\n<li>incident channels<\/li>\n<li>change management systems<\/li>\n<\/ul>\n<\/li>\n<li>Treat experiment templates like code: keep them in source control, documented and reviewed.<\/li>\n<li>Use consistent naming:<ul>\n<li><code>app-env-faulttype-v#<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Recommended tags:\n&#8211; <code>Application<\/code>\n&#8211; <code>Environment<\/code>\n&#8211; <code>Owner<\/code>\n&#8211; <code>CostCenter<\/code>\n&#8211; <code>ChangeTicket<\/code> (if required)\n&#8211; <code>DataClassification<\/code> (if your org uses it)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Two layers of access control<\/strong>:\n  1. Who can create\/run experiments (IAM permissions for operators)\n  2. 
What the experiment can do (execution role permissions)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Key recommendations:\n&#8211; Use a dedicated role per environment.\n&#8211; Restrict role permissions to:\n  &#8211; specific actions\n  &#8211; specific resource ARNs or tag conditions where supported\n&#8211; Restrict who can pass\/assign roles (IAM <code>PassRole<\/code> controls are crucial).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service is a control-plane service. Encryption considerations typically focus on:\n&#8211; <strong>CloudTrail logs<\/strong> (S3 encryption, KMS keys, retention)\n&#8211; <strong>CloudWatch Logs<\/strong> (if used)\n&#8211; Any affected data stores (ensure encryption at rest is already in place)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Fault injection can alter availability. Ensure:<\/li>\n<li>Bastion\/SSM access patterns are resilient if you need emergency access<\/li>\n<li>You don\u2019t depend on a single NAT gateway or single egress path during tests<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do not embed secrets in experiment descriptions or tags.<\/li>\n<li>If your test triggers application failover, ensure secret rotation\/refresh mechanisms behave correctly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable CloudTrail across the account (org trail preferred in enterprises).<\/li>\n<li>Review:<\/li>\n<li>who started experiments<\/li>\n<li>what templates were modified<\/li>\n<li>what actions were executed<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Some compliance frameworks care about operational resilience and change control.<\/li>\n<li>Treat experiments 
like changes:<\/li>\n<li>approvals<\/li>\n<li>maintenance windows<\/li>\n<li>documented outcomes<\/li>\n<li>If you test in production, ensure your change management policy allows it.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Overly broad execution roles that can disrupt too many resources<\/li>\n<li>Poor tagging that accidentally includes production resources<\/li>\n<li>No stop conditions and no on-call notification<\/li>\n<li>No audit review after experiments<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use separate AWS accounts for dev\/stage\/prod (recommended).<\/li>\n<li>Use SCPs to forbid high-risk actions except via approved roles.<\/li>\n<li>Require explicit \u201copt-in\u201d tags for targets:<\/li>\n<li>Example: <code>ChaosReady=true<\/code><\/li>\n<li>Use CloudWatch alarms tied to customer-impacting signals as stop conditions.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. Limitations and Gotchas<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because AWS Fault Injection Service is integrated with specific AWS APIs and mechanisms, there are practical constraints.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Known limitations (general)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Action availability varies<\/strong> by:<\/li>\n<li>AWS Region<\/li>\n<li>Target service\/resource type<\/li>\n<li>Account configuration<\/li>\n<li>Some faults are <strong>not easily reversible<\/strong> (e.g., termination). 
Prefer reversible actions for early adoption.<\/li>\n<li>Stop conditions <strong>stop the experiment<\/strong>, but may not automatically restore all changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Service quotas exist for templates, experiments, and API rate limits.<\/li>\n<li>Check the <strong>Service Quotas<\/strong> console for current limits and request increases if needed.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Regional constraints<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The service is regional; cross-region testing requires separate experiments in each Region.<\/li>\n<li>Multi-account organizations must plan account boundaries and permissions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing surprises<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The major cost is often not the orchestration service, but:<\/li>\n<li>extra capacity triggered by failure<\/li>\n<li>logs\/metrics retention<\/li>\n<li>data transfer increases due to retries or traffic shifts<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compatibility issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Some actions may require prerequisites on targets (for example, management agents or permissions), depending on the action type.<\/li>\n<li>Resources must be discoverable and selectable by the targeting method you choose (tags require consistent tag propagation).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational gotchas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Running experiments without a communication plan can trigger false incident responses.<\/li>\n<li>If you don\u2019t inform on-call teams, they may treat the experiment as a real outage.<\/li>\n<li>Experiments can interact with auto-remediation in unexpected ways (auto-scaling, health-based replacement).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Migration challenges \/ adoption challenges<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Before running meaningful production experiments, organizations often need to mature their:<ul>\n<li>tagging strategies<\/li>\n<li>observability standards<\/li>\n<li>runbook discipline<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Vendor-specific nuances<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS Fault Injection Service is AWS-native and works best when your workload is already built on AWS primitives and governance patterns. If your environment is multi-cloud or heavily on-prem, you may need additional tooling for broader fault coverage.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service is one option in a broader resilience and chaos engineering ecosystem.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Key alternatives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Within AWS<\/strong><ul>\n<li><strong>AWS Systems Manager<\/strong> (Automation\/Run Command): can execute scripted disruptions but lacks a dedicated chaos-experiment model and built-in guardrails.<\/li>\n<li><strong>AWS Resilience Hub<\/strong>: focuses on resilience assessment and improvement plans rather than fault injection execution (complementary, not a direct replacement).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Other clouds<\/strong><ul>\n<li><strong>Azure Chaos Studio<\/strong>: Azure-native chaos engineering service (for Azure workloads).<\/li>\n<\/ul>\n<\/li>\n<li><strong>Third-party \/ open-source<\/strong><ul>\n<li><strong>Gremlin<\/strong> (commercial chaos engineering platform)<\/li>\n<li><strong>Chaos Mesh<\/strong> (commonly used in Kubernetes)<\/li>\n<li><strong>LitmusChaos<\/strong> (Kubernetes-focused chaos engineering)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best 
For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>AWS Fault Injection Service<\/td>\n<td>AWS-native chaos experiments<\/td>\n<td>Managed service, IAM\/CloudTrail\/EventBridge integration, stop conditions, templates<\/td>\n<td>Action coverage varies by Region; AWS-scoped; not all fault types<\/td>\n<td>You run primarily on AWS and want AWS-native governance and low ops overhead<\/td>\n<\/tr>\n<tr>\n<td>AWS Systems Manager (scripts\/automation)<\/td>\n<td>Custom, scripted operational tasks<\/td>\n<td>Very flexible; can run arbitrary commands; good for remediation and ops automation<\/td>\n<td>You must build your own experiment orchestration, safety model, and reporting<\/td>\n<td>You need custom faults not supported by FIS and can invest in building guardrails<\/td>\n<\/tr>\n<tr>\n<td>AWS Resilience Hub<\/td>\n<td>Resilience planning and tracking<\/td>\n<td>Assessment and recommendations, governance view<\/td>\n<td>Not a fault-injection runner; complements FIS<\/td>\n<td>You want posture management and structured resilience improvements alongside experiments<\/td>\n<\/tr>\n<tr>\n<td>Azure Chaos Studio<\/td>\n<td>Azure-native chaos engineering<\/td>\n<td>Deep Azure integration<\/td>\n<td>Not for AWS workloads<\/td>\n<td>Your workloads are primarily in Azure<\/td>\n<\/tr>\n<tr>\n<td>Gremlin<\/td>\n<td>Cross-platform chaos engineering<\/td>\n<td>Broad fault library, mature UX, multi-cloud options<\/td>\n<td>Additional vendor\/cost; integration model differs<\/td>\n<td>You need broader fault types, multi-cloud\/on-prem coverage, or advanced org workflows<\/td>\n<\/tr>\n<tr>\n<td>Chaos Mesh \/ LitmusChaos<\/td>\n<td>Kubernetes-native chaos<\/td>\n<td>Strong for K8s fault scenarios; open-source<\/td>\n<td>You operate it yourself; governance and safety must be designed<\/td>\n<td>You are Kubernetes-centric and want deep cluster-level chaos patterns<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr 
class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: Multi-team SaaS platform validating tier resilience<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong><\/li>\n<li>A SaaS provider runs hundreds of microservices on AWS. They experienced intermittent outages caused by instance-level failures and slow failover behavior. Leadership wants evidence-based resilience improvements.<\/li>\n<li><strong>Proposed architecture<\/strong><\/li>\n<li>Standardized experiment templates per critical tier (ingress, API, workers).<\/li>\n<li>Tagging standard: <code>ChaosReady=true<\/code>, <code>App=&lt;name&gt;<\/code>, <code>Env=stage|prod<\/code>.<\/li>\n<li>CloudWatch SLO alarms (latency\/error rate) used as stop conditions.<\/li>\n<li>EventBridge sends experiment lifecycle events to a central operations channel and change-management system.<\/li>\n<li>CloudTrail logs archived to a central security account.<\/li>\n<li><strong>Why AWS Fault Injection Service was chosen<\/strong><\/li>\n<li>AWS-native IAM controls and CloudTrail auditing fit enterprise governance.<\/li>\n<li>Stop conditions and scoped roles reduced risk for controlled production experiments.<\/li>\n<li><strong>Expected outcomes<\/strong><\/li>\n<li>Measurable improvements in failover time and error budgets.<\/li>\n<li>Fewer \u201cunknown unknowns\u201d during real incidents.<\/li>\n<li>Repeatable game days with consistent metrics and postmortems.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: Single web app validating basic self-healing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong><\/li>\n<li>A startup runs a web app on EC2 behind a load balancer and uses Auto Scaling. 
They assume the system is resilient, but have never tested failures.<\/li>\n<li><strong>Proposed architecture<\/strong><\/li>\n<li>One experiment template in staging:<ul>\n<li>target one instance by tag<\/li>\n<li>reboot action<\/li>\n<li>stop condition on 5xx metric alarm (if using a load balancer)<\/li>\n<\/ul>\n<\/li>\n<li>A simple dashboard to observe latency and errors.<\/li>\n<li><strong>Why AWS Fault Injection Service was chosen<\/strong><\/li>\n<li>Minimal operational overhead (managed service).<\/li>\n<li>Easy to start with a single action and small blast radius.<\/li>\n<li><strong>Expected outcomes<\/strong><\/li>\n<li>Confidence that health checks and Auto Scaling behave correctly.<\/li>\n<li>Early discovery of misconfigured health checks, slow startup times, or missing alarms.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">1) Is AWS Fault Injection Service the same as AWS Fault Injection Simulator (FIS)?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes. Both names refer to the same service and share the acronym FIS; \u201cSimulator\u201d is the earlier name. You may still see the older name in IAM policy names, API references, and older documentation paths. Verify current naming in the official AWS docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">2) Is AWS Fault Injection Service only for chaos engineering?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Its primary use is chaos engineering and resilience validation, but it can also be used for operational readiness tests, game days, and controlled failure drills.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3) Can I run experiments in production?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes, many teams do, but only with strict controls: narrow targets, stop conditions, approvals, and clear rollback plans. 
Start in staging.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4) Do stop conditions automatically undo the fault?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Stop conditions stop <em>experiment execution<\/em>. They do not guarantee that the impacted resources are restored to their pre-experiment state. Plan explicit rollback steps.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">5) What AWS services can be targeted?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It depends on current AWS support per Region and resource type. Consult the official \u201cactions and targets\u201d documentation for the definitive list:\nhttps:\/\/docs.aws.amazon.com\/fis\/latest\/userguide\/what-is.html<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6) Do I need an agent on instances?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It depends on the action. Actions that run through AWS Systems Manager require SSM Agent on the target instance, while others are pure AWS API actions that need nothing installed. Verify action prerequisites in the docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7) How do I prevent accidentally targeting production?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use opt-in tags such as <code>ChaosReady=true<\/code>, keep environments in separate accounts, and enforce both with IAM policies and SCPs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">8) Can I integrate experiment runs into CI\/CD?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes. Use APIs\/CLI\/SDK (where supported) and EventBridge for lifecycle events. 
Ensure you have approvals and guardrails if any environment is customer-facing.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">9) How do I notify my team when an experiment starts?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use EventBridge rules to route experiment events to SNS, chat integrations, incident tools, or ticketing workflows.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">10) Does AWS Fault Injection Service generate logs?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Experiment state is visible in the console and API, and the API calls are recorded in CloudTrail. You can also enable experiment logging in the template to send logs to CloudWatch Logs or Amazon S3 (verify current options in the official docs). Additional logs\/metrics depend on the target services and your observability tooling.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">11) What\u2019s the safest first experiment?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A single, reversible action with narrow targeting, such as rebooting one non-production EC2 instance, while you watch dashboards and alarms.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">12) What\u2019s the difference between fault injection and load testing?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Fault injection introduces failures; load testing increases traffic\/requests. They complement each other. Many reliability issues only appear when both load and failures occur.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">13) How do I measure success for an experiment?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Define a hypothesis and acceptance criteria:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SLO impact (p95 latency, error rate)<\/li>\n<li>recovery time<\/li>\n<li>alarm behavior<\/li>\n<li>runbook correctness<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">14) Can I test multi-AZ failover?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You can design experiments that target resources associated with one AZ (often by tag or selection constraints). 
The exact approach depends on the resource type and your architecture.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">15) Does AWS Fault Injection Service support multi-account organizations?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Yes, but governance is more complex. Many organizations keep experiments within an account boundary and standardize roles, tags, and guardrails across accounts using AWS Organizations patterns.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">16) What permissions do I need to run experiments?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You need permissions to start experiments and to pass or use the execution role. The execution role then needs permissions for the target actions. Use least privilege and verify with your IAM team.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">17) How do I keep experiments from becoming \u201crandom chaos\u201d?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use disciplined experimentation:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>one hypothesis per experiment<\/li>\n<li>defined metrics<\/li>\n<li>documented results<\/li>\n<li>tracked action items<\/li>\n<li>controlled scheduling<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn AWS Fault Injection Service<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>AWS Fault Injection Service User Guide \u2014 https:\/\/docs.aws.amazon.com\/fis\/latest\/userguide\/what-is.html<\/td>\n<td>Authoritative feature descriptions, concepts, and configuration details<\/td>\n<\/tr>\n<tr>\n<td>Official getting started<\/td>\n<td>Getting started (AWS Fault Injection Service) \u2014 https:\/\/docs.aws.amazon.com\/fis\/latest\/userguide\/getting-started.html<\/td>\n<td>Step-by-step onboarding guidance and prerequisites<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>AWS Fault Injection Service Pricing \u2014 https:\/\/aws.amazon.com\/fis\/pricing\/<\/td>\n<td>Current pricing model and Region considerations<\/td>\n<\/tr>\n<tr>\n<td>Pricing tool<\/td>\n<td>AWS Pricing Calculator \u2014 https:\/\/calculator.aws\/<\/td>\n<td>Estimate end-to-end costs including EC2\/CloudWatch and other dependencies<\/td>\n<\/tr>\n<tr>\n<td>API reference<\/td>\n<td>AWS Fault Injection Service API Reference \u2014 https:\/\/docs.aws.amazon.com\/fis\/latest\/APIReference\/Welcome.html<\/td>\n<td>Automate templates\/experiments using API\/CLI\/SDK<\/td>\n<\/tr>\n<tr>\n<td>Governance<\/td>\n<td>AWS CloudTrail \u2014 https:\/\/docs.aws.amazon.com\/awscloudtrail\/latest\/userguide\/cloudtrail-user-guide.html<\/td>\n<td>Audit experiment execution and changes<\/td>\n<\/tr>\n<tr>\n<td>Events\/automation<\/td>\n<td>Amazon EventBridge \u2014 https:\/\/docs.aws.amazon.com\/eventbridge\/latest\/userguide\/eb-what-is.html<\/td>\n<td>Route experiment lifecycle events into notifications and automation<\/td>\n<\/tr>\n<tr>\n<td>Monitoring<\/td>\n<td>Amazon CloudWatch \u2014 https:\/\/docs.aws.amazon.com\/AmazonCloudWatch\/latest\/monitoring\/WhatIsCloudWatch.html<\/td>\n<td>Build dashboards\/alarms and stop 
conditions<\/td>\n<\/tr>\n<tr>\n<td>Architecture guidance<\/td>\n<td>AWS Well-Architected Framework \u2014 https:\/\/docs.aws.amazon.com\/wellarchitected\/latest\/framework\/welcome.html<\/td>\n<td>Reliability pillar concepts to guide what to test and why<\/td>\n<\/tr>\n<tr>\n<td>Workshops\/labs<\/td>\n<td>AWS Workshops (search for FIS\/chaos engineering) \u2014 https:\/\/workshops.aws\/<\/td>\n<td>Hands-on labs maintained by AWS and community contributors (verify lab freshness)<\/td>\n<\/tr>\n<tr>\n<td>Video learning<\/td>\n<td>AWS YouTube channel \u2014 https:\/\/www.youtube.com\/@amazonwebservices<\/td>\n<td>Talks and demos; search for \u201cFault Injection Simulator\/Service\u201d<\/td>\n<\/tr>\n<tr>\n<td>Community reference<\/td>\n<td>AWS Compute Blog \/ Architecture Blog \u2014 https:\/\/aws.amazon.com\/blogs\/<\/td>\n<td>Practical patterns and announcements (verify post dates)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. 
Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, SREs, platform teams<\/td>\n<td>DevOps\/Cloud operations practices; may include resilience\/chaos topics<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>SCM\/DevOps fundamentals, automation, process<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud ops practitioners<\/td>\n<td>Cloud operations and reliability practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, reliability-focused teams<\/td>\n<td>SRE practices, incident response, reliability engineering<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams exploring AIOps<\/td>\n<td>Monitoring\/automation and AIOps concepts<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content<\/td>\n<td>Engineers seeking guided learning<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training and mentoring<\/td>\n<td>Beginners to working professionals<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps expertise<\/td>\n<td>Teams needing short-term coaching\/support<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support services\/training<\/td>\n<td>Ops teams needing practical support<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting<\/td>\n<td>Architecture reviews, reliability programs, automation<\/td>\n<td>Set up chaos engineering guardrails; implement monitoring + stop conditions; define tagging and IAM model<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps consulting and enablement<\/td>\n<td>Training + implementation support<\/td>\n<td>Establish game day process; create experiment templates and operational runbooks; integrate EventBridge notifications<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting<\/td>\n<td>CI\/CD, ops practices, cloud governance<\/td>\n<td>Build repeatable resilience test workflows; align with change management; cost and security reviews<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before AWS Fault Injection Service<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To get real value from AWS Fault Injection Service, you should already understand:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS IAM basics (roles, policies, trust relationships, least privilege)<\/li>\n<li>Amazon EC2 fundamentals (instances, tags, security groups, Auto Scaling)<\/li>\n<li>Amazon CloudWatch basics (metrics, alarms, dashboards)<\/li>\n<li>Basic reliability concepts:<\/li>\n<li>health checks<\/li>\n<li>retries and timeouts<\/li>\n<li>multi-AZ patterns<\/li>\n<li>graceful shutdown<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after AWS Fault Injection Service<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Once you can run safe experiments, level up with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Advanced observability (logs, metrics, traces) and SLO management<\/li>\n<li>Event-driven automation with EventBridge and incident tooling<\/li>\n<li>Resilience engineering frameworks:<\/li>\n<li>AWS Well-Architected Reliability pillar<\/li>\n<li>error budgets and release policies<\/li>\n<li>Infrastructure as Code for repeatable environments (Terraform\/CloudFormation\/CDK)<\/li>\n<li>For containers: Kubernetes failure modes and chaos tooling (if applicable)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Site Reliability Engineer (SRE)<\/li>\n<li>DevOps Engineer \/ Platform Engineer<\/li>\n<li>Cloud Solutions Architect (reliability-focused)<\/li>\n<li>Operations Engineer \/ Incident Manager (for game days)<\/li>\n<li>Security\/GRC engineers (governance and evidence collection)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (AWS)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">AWS certifications don\u2019t typically certify a single service, but AWS Fault Injection Service aligns 
well with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS Certified Solutions Architect (Associate\/Professional)<\/li>\n<li>AWS Certified DevOps Engineer (Professional)<\/li>\n<li>AWS Certified SysOps Administrator (Associate)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Use AWS Fault Injection Service as a practical tool to demonstrate Reliability pillar skills during certification prep.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Build a \u201cresilience test harness\u201d in staging: a dashboard, alarms, and one FIS template per tier.<\/li>\n<li>Implement EventBridge notifications for experiments: Slack\/Teams via SNS or a webhook bridge (implementation varies).<\/li>\n<li>Create a quarterly game day program: documented hypotheses, results, and a remediation backlog.<\/li>\n<li>Validate autoscaling and health checks: instance reboot experiments plus measurement of recovery time.<\/li>\n<li>Create IAM guardrails: least-privilege execution roles and tag-based restrictions.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. 
Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Chaos engineering<\/strong>: The practice of intentionally injecting failures into systems to learn how they behave and improve resilience.<\/li>\n<li><strong>Experiment template<\/strong>: A reusable definition of targets, actions, and stop conditions in AWS Fault Injection Service.<\/li>\n<li><strong>Experiment<\/strong>: A single run created from an experiment template.<\/li>\n<li><strong>Target<\/strong>: The AWS resources selected for fault injection (often via tags).<\/li>\n<li><strong>Action<\/strong>: The fault injection operation performed on targets (e.g., reboot an instance).<\/li>\n<li><strong>Stop condition<\/strong>: A safety rule\u2014often a CloudWatch alarm\u2014that halts an experiment if triggered.<\/li>\n<li><strong>Blast radius<\/strong>: The scope of impact from a change or failure (how many users\/resources are affected).<\/li>\n<li><strong>SLO (Service Level Objective)<\/strong>: A defined reliability target (e.g., 99.9% availability, p95 latency &lt; 200 ms).<\/li>\n<li><strong>Error budget<\/strong>: The allowable amount of unreliability given an SLO, used to guide release and operational decisions.<\/li>\n<li><strong>CloudWatch alarm<\/strong>: A CloudWatch construct that transitions state based on metric thresholds; used as stop conditions.<\/li>\n<li><strong>EventBridge<\/strong>: AWS event bus service used to route events (including experiment lifecycle events) to targets.<\/li>\n<li><strong>CloudTrail<\/strong>: AWS service that logs API calls for auditing and security analysis.<\/li>\n<li><strong>Least privilege<\/strong>: IAM practice of granting only the permissions required to perform a task.<\/li>\n<li><strong>Game day<\/strong>: A planned operational exercise where teams simulate incidents or run failure drills to practice response.<\/li>\n<li><strong>Runbook<\/strong>: Documented operational procedures for responding to incidents or performing 
tasks.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">AWS Fault Injection Service (AWS, Developer tools category) is a managed way to run controlled fault injection experiments against AWS workloads. It matters because reliability gaps often only appear under realistic disruptions\u2014and this service makes those disruptions repeatable, auditable, and safer through targets, IAM scoping, and stop conditions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Architecturally, it fits as a control-plane orchestration layer integrated with CloudWatch (alarms\/metrics), CloudTrail (audit), and EventBridge (automation). Cost is usage-based and highly influenced by indirect drivers like additional capacity, metrics\/logs, and traffic shifts\u2014so start small and measure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From a security perspective, success depends on least-privilege execution roles, strict tagging to control blast radius, and strong governance (CloudTrail, approvals, communication). 
Use it when you have clear hypotheses, measurable signals, and a rollback plan\u2014then expand from simple reversible tests (like a single instance reboot) to more realistic resilience scenarios.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Next learning step: review the official \u201cGetting started\u201d guide, then build one experiment per critical tier in staging and run a monthly game day with documented outcomes:\nhttps:\/\/docs.aws.amazon.com\/fis\/latest\/userguide\/getting-started.html<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Developer tools<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[20,18],"tags":[],"class_list":["post-201","post","type-post","status-publish","format-standard","hentry","category-aws","category-developer-tools"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/201","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=201"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/201\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=201"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=201"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=201"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{r
el}","templated":true}]}}