Azure Spot Virtual Machines Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Compute

1. Introduction

Azure Spot Virtual Machines is an Azure Compute pricing and capacity option that lets you run Virtual Machines (VMs) using unused Azure capacity at steep discounts compared to pay-as-you-go prices. The tradeoff is that Azure can evict (stop) your Spot VM with short notice when Azure needs the capacity back or when your configured maximum price is exceeded.

In simple terms: you get cheaper VMs, but they can be interrupted. If your workload can tolerate interruptions—batch jobs, CI runners, distributed builds, rendering, simulations, big data processing, stateless services, or fault-tolerant microservices—Spot is often the most cost-effective way to scale out.

Technically, Azure Spot Virtual Machines are standard Azure Virtual Machines configured with a “Spot” priority and an eviction policy. They run in an Azure region and optionally in an Availability Zone, use the same VM images, disks, networking, identity, and monitoring stack as regular VMs, and are created via Azure portal, ARM/Bicep, Terraform, Azure CLI, PowerShell, or SDKs. The key difference is that compute capacity is not guaranteed, and eviction is part of the operating model.

The problem it solves: reducing compute costs for interruption-tolerant workloads, often enabling larger scale, faster completion times, and more experimentation under the same budget.

Naming note (important): Azure previously used terms like “low-priority VMs” (notably in some batch/scale-set contexts). Today, the official product term is Azure Spot Virtual Machines. If you see “low-priority” in older materials, treat it as legacy terminology and verify in current docs.

2. What is Azure Spot Virtual Machines?

Official purpose: Azure Spot Virtual Machines let you deploy VMs on unused Azure capacity at discounted rates, with the understanding that Azure can reclaim that capacity at any time through eviction.
Official docs: https://learn.microsoft.com/azure/virtual-machines/spot-vms

Core capabilities

Run standard Azure VMs as Spot (Linux or Windows), using the same VM sizes, images, disks, NICs, and VNets as regular VMs (subject to region/size availability).
Set eviction behavior:
Deallocate (stop and release compute; you keep the VM resources like disks/NICs so you can try to start it again later).
Delete (remove the VM on eviction; resource retention depends on delete options—verify for your scenario).
Control maximum price:
Set a max price you’re willing to pay for the Spot VM.
Optionally allow paying up to the on-demand price (commonly represented as -1 in tooling—verify in official docs/CLI help for your environment).
Receive short eviction notice (commonly ~30 seconds) via Azure Scheduled Events, enabling graceful shutdown/checkpointing where possible.
Scheduled Events docs: https://learn.microsoft.com/azure/virtual-machines/linux/scheduled-events (Linux) and equivalent Windows page.

Major components (what you actually use)

Azure Virtual Machines configured with Spot priority
Azure Virtual Machine Scale Sets (VMSS) (optional but common for resilient Spot fleets)
Azure networking (VNets, NSGs, Load Balancer/Application Gateway as applicable)
Azure Managed Disks / Storage for persistence and checkpointing
Azure Monitor (metrics, logs, alerts)
Azure Policy / RBAC for governance and security

Service type

Compute (pricing/capacity model for Azure Virtual Machines), not a separate standalone compute service.

Scope and availability model

Regional: Spot capacity and pricing are region-specific and depend on unused capacity.
Zonal (optional): You may place Spot VMs into an Availability Zone where supported, but zone placement can reduce capacity options and increase eviction likelihood depending on local demand.
Subscription-scoped management: You deploy Spot VMs into resource groups within an Azure subscription, under a Microsoft Entra ID (formerly Azure AD) tenant.

How it fits into the Azure ecosystem

Azure Spot Virtual Machines is best viewed as a cost optimization lever for Azure Compute:
Pair it with VM Scale Sets for elasticity and self-healing.
Pair it with Azure Batch, AKS spot node pools, CI/CD systems, or event-driven processing for interruption-tolerant work.
Use Managed Identity + Key Vault for secretless access.
Use Azure Monitor + Log Analytics for observability and operational response.

3. Why use Azure Spot Virtual Machines?

Business reasons

Lower compute spend for workloads that don’t require always-on guarantees.
Faster time-to-result by scaling out cheaply (finish jobs sooner by using more parallel nodes).
More experimentation (more test environments, bigger integration tests, more frequent performance runs) under a fixed budget.

Technical reasons

Same VM features as regular VMs (images, disks, VNets, extensions), so migration is often a configuration change.
Elastic scale-out for distributed processing (render farms, Monte Carlo simulations, build farms).
Works well with fault-tolerant patterns: queues, retries, checkpointing, idempotent tasks.

Operational reasons

Integrates with existing VM ops tooling: Azure Monitor, Azure Update Manager (the successor to Update Management), VM extensions, automation, and IaC.
Flexible eviction policies for different recovery models (deallocate vs delete).
Supports automation (ARM/Bicep/Terraform/CLI) for reproducible fleets.

Security/compliance reasons

Uses the same Azure security controls as standard VMs:
Azure RBAC, NSGs, encryption, Defender for Cloud (where enabled), logging
Helps reduce risk from “shadow compute” by making low-cost compute still centrally governed and auditable.

Scalability/performance reasons

Scale-out economically while maintaining performance-per-core characteristics of selected VM families.
Choose GPU/HPC VM sizes as Spot when available (capacity dependent), enabling bursty high-performance runs at lower cost.

When teams should choose it

Choose Azure Spot Virtual Machines when:
Jobs are interruptible and can restart/retry.
Work is stateless or can checkpoint frequently.
You can tolerate losing nodes at any time.
You want high parallelism and can spread across VM sizes/regions for resilience.

When teams should not choose it

Avoid Azure Spot Virtual Machines when:
You need strict availability guarantees or stable capacity (production stateful databases, critical singletons).
You can’t tolerate VM interruption or data loss.
The application cannot handle node churn (no retry logic, no idempotency, no checkpointing).
You require SLA-backed uptime on those instances (Microsoft's documentation states that Spot VMs carry no SLA and no availability guarantee; verify the current wording).

4. Where is Azure Spot Virtual Machines used?

Industries

Media & entertainment (rendering, transcoding)
Finance (risk simulations, backtesting)
Manufacturing/engineering (CAE/CFD simulations)
Healthcare/life sciences (genomics pipelines)
Gaming (build/test automation, large-scale testing)
Retail/e-commerce (load testing, analytics processing)
AI/ML (distributed training, hyperparameter sweeps—where interruption is acceptable)

Team types

Platform engineering teams building internal compute platforms
DevOps/SRE teams optimizing CI/CD and environments
Data engineering teams running batch pipelines
Research teams needing burst compute
FinOps teams driving cost reduction initiatives

Workloads

Batch processing, ETL, Spark-like workloads (when designed for failure)
CI runners and ephemeral build agents
Containerized background workers
Distributed build systems
Parallel test runners and fuzzing
HPC and simulation
Large-scale scraping/crawling (where permitted and ethical)

Architectures and real-world contexts

Queue-based worker pools (Service Bus/Storage Queue + Spot workers)
VM Scale Sets behind a load balancer with autoscale
Hybrid fleets: baseline on-demand + burst Spot
Multi-region execution for best capacity and lowest eviction risk

Production vs dev/test usage

Dev/test: common and straightforward (interruptions are usually acceptable).
Production: viable when production components are designed for interruption (stateless workers, batch pipelines) and when you provide capacity buffering and retry logic.

5. Top Use Cases and Scenarios

Below are realistic scenarios where Azure Spot Virtual Machines is commonly a strong fit.

1) CI/CD build agents (self-hosted runners)

Problem: Hosted CI minutes are costly; self-hosted agents need scale.
Why Spot fits: Build agents are ephemeral; jobs can retry on failure.
Example: GitHub Actions self-hosted runners on Spot VMs in a VM Scale Set; if evicted, jobs re-queued.

2) Batch ETL workers

Problem: Large nightly ETL needs lots of CPU for a short window.
Why Spot fits: Work can be partitioned; failed partitions can rerun.
Example: 200 Spot workers process partitions from a storage container; results written back to ADLS/Blob.

3) Media transcoding farm

Problem: High CPU usage for video transcoding spikes cost.
Why Spot fits: Individual transcode jobs are independent and retryable.
Example: A queue contains file conversion tasks; Spot VMs pull tasks and upload outputs.

4) Monte Carlo / risk simulation

Problem: Need massive parallel simulation runs quickly.
Why Spot fits: Simulations are independent; interruptions only delay some runs.
Example: Thousands of simulation seeds run on Spot; missing seeds are resubmitted.

5) Large-scale integration testing

Problem: Integration tests take too long on small clusters.
Why Spot fits: Tests can be sharded and retried.
Example: A test coordinator schedules shards across Spot VMs; failed shards rerun.

6) Rendering (3D/animation)

Problem: Render jobs need many cores; deadlines matter.
Why Spot fits: Frames are independent; re-render a dropped frame.
Example: Spot VM pool renders frames; eviction re-queues unfinished frames.

7) Containerized background workers (stateless)

Problem: Background processing is throughput-limited and costly on on-demand.
Why Spot fits: Stateless workers can be replaced easily.
Example: VMSS Spot nodes run containers that pull from Service Bus; if evicted, autoscale replenishes.

8) ML hyperparameter sweeps

Problem: Many training trials; some can be interrupted.
Why Spot fits: Individual trials can be restarted; results accumulate.
Example: Hundreds of trials scheduled; partial progress checkpointed to Blob.

9) Web crawling / data collection (rate-limited and compliant)

Problem: Need periodic collection with burst compute.
Why Spot fits: Tasks are chunked; eviction delays some chunks.
Example: Each Spot VM processes a subset of URLs; progress saved frequently.

10) HPC-style embarrassingly parallel workloads

Problem: Need lots of compute; workload can be partitioned.
Why Spot fits: Each job unit is independent.
Example: Finite element jobs distributed; evicted nodes rerun jobs from last checkpoint.

11) Development environments that can be re-created

Problem: Engineers need cheap sandboxes.
Why Spot fits: Environments are disposable; state is kept in Git/remote storage.
Example: Temporary Spot-based dev boxes created via IaC; data stored in repos and storage.

12) Disaster recovery / chaos testing capacity

Problem: Need occasional large test capacity without steady cost.
Why Spot fits: Tests are time-boxed and can tolerate interruptions.
Example: Quarterly failover testing runs on Spot fleets.

6. Core Features

Feature 1: Spot priority (use unused capacity)

What it does: Deploys a VM using Azure’s unused capacity at a discounted rate.
Why it matters: Major compute cost reduction.
Practical benefit: Scale out more nodes for the same money.
Caveat: Capacity can disappear at any time; eviction is expected.

Feature 2: Eviction (interruption) model

What it does: Azure can evict Spot VMs when capacity is needed.
Why it matters: This is the core tradeoff; your workload must handle it.
Practical benefit: Enables the discount.
Caveat: You may get only short notice (commonly ~30 seconds). Always design for abrupt loss.

Feature 3: Eviction policy (Deallocate vs Delete)

What it does:
Deallocate: VM stops; you keep allocated resources such as disks/NIC; you can try to restart later.
Delete: VM is removed when evicted.
Why it matters: Determines recovery workflow and storage lifecycle.
Practical benefit: Deallocate can preserve OS/data disks for later restart; Delete reduces clutter.
Caveat: Resource retention with Delete depends on resource delete options and configuration—verify in official docs for disks/public IP behavior.

Feature 4: Max price control

What it does: Sets the maximum hourly price you’re willing to pay for the Spot VM.
Why it matters: Prevents surprise spend if Spot price rises.
Practical benefit: Hard cap aligned to your budget.
Caveat: If Spot price exceeds your max price, your VM can be evicted.
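The max-price rule above can be sketched as a small Bash helper. This is an illustrative model, not Azure's implementation: the function name price_evicts and the sample prices are hypothetical, and it assumes that -1 means "cap at the on-demand price, never evict for price reasons" as described earlier.

```shell
#!/usr/bin/env bash
# price_evicts MAX_PRICE CURRENT_SPOT_PRICE
# Prints "yes" if the current Spot price exceeds the configured max price
# (so the VM becomes eligible for price-based eviction), otherwise "no".
# Assumption: MAX_PRICE of -1 means "pay up to on-demand", so price alone
# never triggers eviction. Capacity-based eviction can still happen.
price_evicts() {
  local max_price="$1" current="$2"
  if [ "$max_price" = "-1" ]; then
    echo "no"
    return 0
  fi
  # awk handles the decimal comparison; Bash arithmetic is integer-only.
  LC_ALL=C awk -v m="$max_price" -v c="$current" \
    'BEGIN { if (c + 0 > m + 0) print "yes"; else print "no" }'
}

price_evicts -1 0.50      # prints "no"
price_evicts 0.05 0.04    # prints "no"
price_evicts 0.05 0.06    # prints "yes"
```

A helper like this is only useful for reasoning about budgets; the actual eviction decision is made by the platform.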

Feature 5: Works with standard VM building blocks

What it does: Uses the same Azure VM images, extensions, managed disks, VNets, NSGs, and load balancing options.
Why it matters: Low friction adoption.
Practical benefit: You can reuse hardening, monitoring agents, and IaC modules.
Caveat: Some VM sizes/regions may not be available as Spot at the moment you deploy.

Feature 6: Scheduled Events eviction notifications

What it does: Provides an in-VM metadata endpoint to learn about upcoming events like eviction.
Why it matters: Enables graceful shutdown, checkpointing, draining, and deregistration.
Practical benefit: Less lost work when evicted.
Caveat: Not all failures are graceful; always assume you can also lose the VM without notice.
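As a rough sketch of how a workload might recognize an eviction notice, the helper below checks a Scheduled Events response for an event of type Preempt, which is the type Azure uses for Spot eviction. The function name has_preempt_event and both sample payloads are hypothetical; the payloads only imitate the documented field names, so compare them against a real response before relying on this.

```shell
#!/usr/bin/env bash
# has_preempt_event JSON
# Returns 0 if the Scheduled Events JSON text contains an event whose
# EventType is "Preempt", else 1.
# This is a dependency-free text match; a JSON parser such as jq is more robust.
has_preempt_event() {
  printf '%s' "$1" | grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Hypothetical sample payloads for illustration only.
idle='{"DocumentIncarnation":1,"Events":[]}'
evicting='{"DocumentIncarnation":2,"Events":[{"EventId":"example-id","EventType":"Preempt","ResourceType":"VirtualMachine","Resources":["vm-spot-lab01"],"EventStatus":"Scheduled","NotBefore":"Mon, 01 Jan 2024 00:00:30 GMT"}]}'

if has_preempt_event "$evicting"; then
  echo "eviction notice: checkpoint and drain now"
fi
has_preempt_event "$idle" || echo "no eviction pending"
```

Filtering on the event type matters because the same endpoint can also report other events, such as reboots or redeploys, that call for a different response.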

Feature 7: VM Scale Sets compatibility (common pattern)

What it does: Lets you manage pools of Spot VMs with autoscaling and orchestration.
Why it matters: Spot is more reliable when treated as a fleet, not a singleton.
Practical benefit: Replace evicted nodes automatically; integrate with load balancers.
Caveat: The exact capabilities depend on VMSS orchestration mode and configuration—verify current VMSS docs.

Feature 8: Zone and region placement (where supported)

What it does: Deploys into a specific region and optionally an Availability Zone.
Why it matters: Placement influences resiliency and capacity availability.
Practical benefit: Zone-aware designs can reduce correlated failures (but may increase capacity constraints).
Caveat: Zonal Spot capacity can be more volatile; test across zones/regions.

Feature 9: API/IaC automation support

What it does: Deploy via Azure portal, ARM/Bicep, Terraform, CLI, PowerShell, SDKs.
Why it matters: Spot usage often requires automation and rapid replacement.
Practical benefit: Reproducible, scalable fleets.
Caveat: Keep tooling versions current; Spot-related flags/fields can differ by API version—verify.

7. Architecture and How It Works

High-level architecture

Azure Spot Virtual Machines are regular Azure VMs with priority set to Spot, an eviction policy configured, and an optional max price. They run inside your standard Azure infrastructure (VNet + subnet + NSG), use Managed Disks, and are governed by Azure RBAC and Azure Policy.

Control flow (deployment)

You request a Spot VM via Azure Resource Manager (ARM) (Portal/CLI/IaC).
Azure checks current unused capacity for the VM size in the selected region/zone.
If capacity exists and your max price allows it, Azure provisions the VM.
Billing begins at the Spot rate for the VM while it runs (plus normal charges for disks, IPs, etc.).

Runtime flow (eviction)

Azure may decide to reclaim capacity.
A short notification may be delivered via Scheduled Events.
Based on eviction policy, the VM is either deallocated (transitions to stopped/deallocated) or deleted (removed).
Your workload should save progress (checkpoint), re-queue unfinished work, and replace capacity via scale-out or automation.
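The workload-side reaction described above can be sketched as an ordered set of hooks. All three hook names (checkpoint_state, requeue_unfinished, request_replacement) and the dispatcher on_eviction_notice are hypothetical placeholders; the bodies only echo a label so the ordering is visible.

```shell
#!/usr/bin/env bash
# Hypothetical hooks: replace the bodies with your workload's real actions.
checkpoint_state()    { echo "checkpoint"; }  # e.g. upload progress to Blob/ADLS
requeue_unfinished()  { echo "requeue"; }     # e.g. release the in-flight queue message
request_replacement() { echo "replace"; }     # e.g. signal autoscale or automation

# on_eviction_notice runs the hooks in priority order. A failing hook is
# reported but does not block the later ones, because the notice window is short.
on_eviction_notice() {
  local step
  for step in checkpoint_state requeue_unfinished request_replacement; do
    "$step" || echo "warning: $step failed" >&2
  done
}

on_eviction_notice
```

Putting the checkpoint first reflects the short notice window: if time runs out, the most valuable action has already happened.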

Integrations with related services

VM Scale Sets for elasticity and replacement
Azure Load Balancer / Application Gateway for distributing traffic (for stateless services)
Azure Batch for batch scheduling (Spot-like capacity options exist; verify current naming in Batch docs)
AKS spot node pools for container workloads (Spot nodes)
Azure Monitor / Log Analytics for metrics/logs/alerts
Azure Event Grid + Activity Log (pattern) to react to VM lifecycle changes (verify event types)

Dependency services (common)

Storage (Blob/ADLS) for checkpointing and outputs
Key Vault for secrets (prefer Managed Identity)
Managed Identity for credential-less access to Azure resources
DNS / Private DNS for internal naming (optional)

Security/authentication model

Management plane: Microsoft Entra ID (formerly Azure AD) + Azure RBAC
In-VM access to Azure services: Managed Identity recommended
Network security: NSGs, private subnets, and Azure Firewall/NVA if required
Governance: Azure Policy, tags, resource locks (use carefully with ephemeral compute)

Networking model

Spot VMs are placed in your VNet/subnet like any VM.
You choose public IP or private-only.
Inbound access typically via:
Bastion, jump host, private connectivity, or
Restrictive NSG rules if public IP is used (not recommended for fleets).

Monitoring/logging/governance

Use Azure Monitor for VM metrics and guest insights (if enabled).
Use Log Analytics for OS logs and agent-based telemetry where needed.
Use Activity Log for lifecycle operations (start/stop/deallocate/delete), and consider alerting on unexpected deallocations.
Apply tagging for cost allocation: env, app, owner, costCenter, workload, lifecycle=spot.

Simple architecture diagram (Mermaid)

flowchart LR
  U[Engineer / Pipeline] -->|Portal/CLI/IaC| ARM[Azure Resource Manager]
  ARM --> VM[Azure Spot Virtual Machine]
  VM --> VNET[VNet/Subnet + NSG]
  VM --> DISK[Managed Disks]
  VM --> MON[Azure Monitor]
  VM --> STG[Blob/ADLS for checkpoints]

Production-style architecture diagram (Mermaid)

flowchart TB
  subgraph "Azure Subscription"
    subgraph "Resource Group: compute-spot-rg"
      ADO[CI/CD or Scheduler] --> ARM["ARM Deployment (Bicep/Terraform)"]
      ARM --> VMSS[VM Scale Set - Spot instances]

      VMSS --> VNET[VNet/Subnets]
      VNET --> NSG[NSG Rules]
      VMSS --> MI[Managed Identity]
      MI --> KV[Azure Key Vault]
      VMSS --> STG["Storage (Blob/ADLS) - checkpoints/artifacts"]
      VMSS --> LOG[Azure Monitor + Log Analytics]

      VMSS --> LB[Load Balancer / App Gateway]
      LB --> APP[Stateless Services / Workers]

      ACT[Azure Activity Log] --> ALERT[Alerts/Action Group]
      ALERT --> RUN["Automation/Runbook/Function (optional)"]
    end
  end

8. Prerequisites

Account/subscription requirements

An active Azure subscription with billing enabled.
Permission to create:
Resource groups
Virtual machines
Networking resources (VNet/subnet/NSG)
Public IP (optional)

Permissions / IAM roles

At minimum, one of:
Contributor on the subscription or target resource group
A combination of permissions granting Microsoft.Compute/* (VM create/stop/start), Microsoft.Network/* (NIC, VNet, NSG), and Microsoft.Storage/* (if using storage)

For least privilege, prefer resource-group scoped roles.

Billing requirements

Spot VMs bill based on usage at Spot rates for compute, plus normal charges for:
Managed disks
Public IP (if any)
Bandwidth (egress)
Monitoring/log analytics ingestion

Tools needed (for the lab)

Azure CLI (current version recommended): https://learn.microsoft.com/cli/azure/install-azure-cli
An SSH client (OpenSSH)
Optional: jq locally (or use --query in Azure CLI)

Region availability

Spot availability depends on region and VM size capacity at the time of deployment.
Pick a region close to you and be prepared to try an alternate VM size/region if capacity is unavailable.

Quotas/limits

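Spot VMs draw on a separate vCPU quota from regular VMs: a regional Spot vCPU quota that is shared by Spot VMs and Spot scale set instances.
You may need to request a Spot quota increase for your target region before deploying larger fleets.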
Verify quotas in Azure portal: Subscriptions → Usage + quotas.

Prerequisite services (optional but recommended)

Storage account for checkpointing/job output
Log Analytics workspace for centralized logging (optional)
Key Vault (optional)

9. Pricing / Cost

Azure Spot Virtual Machines pricing is usage-based and variable depending on region, VM size, OS, and current capacity/demand conditions.

Official pricing page (Spot):
https://azure.microsoft.com/pricing/details/virtual-machines/spot/

Azure pricing calculator:
https://azure.microsoft.com/pricing/calculator/

Pricing dimensions (what you pay for)

Compute (Spot rate): charged while the VM is running. Spot price varies; you can set a max price.
OS disk and data disks: managed disks are billed regardless of VM priority. If a Spot VM is deallocated due to eviction, disks may still incur storage charges.
Networking: outbound data transfer (egress) is typically billed. Public IP resources may incur charges depending on SKU/usage (verify current rules).
Operations/monitoring: Log Analytics ingestion and retention (if used), and Defender for Cloud (if enabled).
Licensing: Windows licensing is included in Windows VM rates. Licensing programs (e.g., Azure Hybrid Benefit) may affect cost in some scenarios; verify applicability for Spot in official docs.

Free tier

There is no general “free tier” for Spot VMs. Azure has free account credits/trials for eligible new accounts, but Spot itself is not a free service.

Cost drivers (most important)

VM size and hours running
Eviction rate (more evictions → more retries → higher total compute consumed)
Disk footprint (large premium SSDs can dominate cost even if compute is cheap)
Data egress (moving large outputs out of Azure can be costly)
Overprovisioning (running too many nodes “just in case”)
Architecture inefficiency (no checkpointing → wasted work on eviction)
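To make the eviction-rate driver concrete, here is a back-of-the-envelope sketch. It assumes a simple model in which some fraction of useful work has to be repeated after evictions; the function name effective_cost and every number are hypothetical planning inputs, not Azure prices.

```shell
#!/usr/bin/env bash
# effective_cost RATE_PER_HOUR USEFUL_HOURS REWORK_FRACTION
# Prints the estimated compute cost when a fraction of the useful work has to
# be repeated after evictions. All inputs are hypothetical planning numbers.
effective_cost() {
  LC_ALL=C awk -v rate="$1" -v hours="$2" -v rework="$3" \
    'BEGIN { printf "%.2f\n", rate * hours * (1 + rework) }'
}

# Example with made-up numbers: 100 useful VM-hours at 0.02 per hour.
effective_cost 0.02 100 0.00   # no rework: prints 2.00
effective_cost 0.02 100 0.25   # 25% of the work is redone: prints 2.50
```

Comparing the result against the same hours at the on-demand rate gives a quick sense of how much rework the Spot discount can absorb before the savings disappear.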

Hidden or indirect costs to watch

Persistent storage costs after eviction (especially with Deallocate policy)
Rebuild time and CI delays (cost of engineering time and pipeline inefficiency)
Extra monitoring ingestion from large ephemeral fleets
NAT Gateway / Firewall costs if used for egress control in large fleets

Network/data transfer implications

Prefer writing intermediate outputs to same-region storage (Blob/ADLS).
Avoid frequent large egress across regions or to on-prem unless necessary.
Consider compressing outputs and batching uploads.

How to optimize cost (practical)

Use Spot for stateless workers and store state in managed services.
Checkpoint frequently so evictions waste minimal compute.
Use autoscaling and right-size VM families.
Consider multi-size or multi-region strategies to reduce capacity shortages (where operationally acceptable).
Set max price if budget predictability matters; otherwise, allow paying up to on-demand (verify the exact setting).
Keep disks lean; prefer smaller OS disks and attach only what you need.
Use Delete eviction policy for truly ephemeral nodes if you don’t need disk persistence (but verify deletion behavior).
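The value of frequent checkpoints can be estimated with a rough rule of thumb: if an eviction lands at a random point between two checkpoints, about half an interval of work is lost. The sketch below applies that assumption; the function name expected_lost_hours and the sample figures are hypothetical.

```shell
#!/usr/bin/env bash
# expected_lost_hours EVICTIONS CHECKPOINT_INTERVAL_MINUTES
# Rough estimate: each eviction loses about half a checkpoint interval of
# work on the evicted node. Real losses depend on when evictions occur.
expected_lost_hours() {
  LC_ALL=C awk -v ev="$1" -v interval="$2" \
    'BEGIN { printf "%.2f\n", ev * (interval / 2) / 60 }'
}

# Made-up example: 10 evictions during a run.
expected_lost_hours 10 60   # hourly checkpoints: prints 5.00
expected_lost_hours 10 5    # checkpoints every 5 minutes: prints 0.42
```

The estimate ignores the cost of writing the checkpoints themselves, so very short intervals are not automatically better.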

Example low-cost starter estimate (non-numeric)

A minimal lab setup typically includes:
1 small Linux Spot VM (compute billed at Spot rate while running)
1 OS disk (storage billed while it exists)
Minimal outbound bandwidth

This can be low-cost for a short-lived lab, but exact costs vary by region/size and Spot price. Use the pricing calculator to estimate your chosen VM size and region.

Example production cost considerations (non-numeric)

In production, cost modeling should include:
Expected average Spot discount for your chosen sizes
Eviction probability and its impact on retries
Storage costs for checkpoints and outputs
Observability costs (Log Analytics ingestion/retention)
Network egress costs
A baseline of on-demand or reserved capacity if you require guaranteed throughput

10. Step-by-Step Hands-On Tutorial

Objective

Deploy an Azure Spot Virtual Machine (Linux) using Azure CLI, configure it for safe operation, and add an eviction-notice handler using Scheduled Events to demonstrate how you would gracefully checkpoint work.

Lab Overview

You will:
1. Create a resource group and networking.
2. Create an Azure Spot Virtual Machine with an eviction policy.
3. Connect via SSH and install a small sample “worker” script.
4. Configure a background service to watch for eviction notices (Scheduled Events).
5. Validate VM settings and confirm the watcher is running.
6. Clean up resources.

This lab is designed to be low-cost and to avoid creating large fleets.

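Important reality check: Real evictions depend on Azure capacity and pricing, so you cannot predict when one will occur. Azure does provide a simulate-eviction operation for testing (az vm simulate-eviction in Azure CLI; verify availability in your CLI version). This lab focuses on correct deployment and correct handling logic rather than on waiting for a real eviction.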

Step 1: Sign in and select your subscription

az login
az account show --output table
# If needed:
az account set --subscription "<SUBSCRIPTION_ID_OR_NAME>"

Expected outcome: Azure CLI is authenticated and pointed to the correct subscription.

Step 2: Create a resource group

Choose a region close to you. If Spot capacity is unavailable later, you may need to switch regions.

REGION="eastus"
RG="rg-spotvm-lab"
az group create -n "$RG" -l "$REGION"

Expected outcome: Resource group is created.

Verify:

az group show -n "$RG" --query "{name:name, location:location}" -o table

Step 3: Create networking (VNet + subnet + NSG rule for SSH)

VNET="vnet-spotvm-lab"
SUBNET="subnet-spotvm-lab"
NSG="nsg-spotvm-lab"

az network vnet create \
  -g "$RG" -n "$VNET" \
  --address-prefix 10.10.0.0/16 \
  --subnet-name "$SUBNET" \
  --subnet-prefix 10.10.1.0/24

az network nsg create -g "$RG" -n "$NSG"

# Allow SSH from your public IP only (recommended).
MYIP="$(curl -s https://ifconfig.me)/32"
az network nsg rule create \
  -g "$RG" --nsg-name "$NSG" -n "Allow-SSH-MyIP" \
  --priority 1000 \
  --access Allow --protocol Tcp --direction Inbound \
  --source-address-prefixes "$MYIP" --source-port-ranges "*" \
  --destination-address-prefixes "*" --destination-port-ranges 22

# Associate NSG to subnet
az network vnet subnet update \
  -g "$RG" --vnet-name "$VNET" -n "$SUBNET" \
  --network-security-group "$NSG"

Expected outcome: A VNet/subnet exists with SSH allowed only from your IP.

Verify:

az network vnet show -g "$RG" -n "$VNET" --query "{name:name, subnets:subnets[].name}" -o table
az network nsg rule list -g "$RG" --nsg-name "$NSG" -o table

Step 4: Create an Azure Spot Virtual Machine

Pick a small VM size that supports Spot. B-series (burstable) sizes are not supported for Spot, so this lab uses a small D-series size. If provisioning fails due to Spot capacity, try a different size or region.

VM="vm-spot-lab01"
ADMIN="azureuser"

# Create an SSH key locally if you don't have one:
# ssh-keygen -t ed25519 -f ~/.ssh/spotvm_lab -N ""

SSHKEY="$HOME/.ssh/spotvm_lab.pub"

az vm create \
  -g "$RG" -n "$VM" \
  --image "Ubuntu2204" \
  --size "Standard_D2s_v3" \
  --admin-username "$ADMIN" \
  --ssh-key-values "$SSHKEY" \
  --vnet-name "$VNET" \
  --subnet "$SUBNET" \
  --public-ip-sku Standard \
  --priority Spot \
  --eviction-policy Deallocate \
  --max-price -1

Expected outcome: The VM is created as a Spot VM.

Notes:
--priority Spot is the key setting.
--eviction-policy Deallocate keeps the VM resources so you can attempt a restart later.
--max-price -1 commonly means “pay up to on-demand price”; confirm with az vm create -h or official docs for your CLI version.

Verify Spot properties:

az vm show -g "$RG" -n "$VM" --query "{name:name, priority:priority, evictionPolicy:evictionPolicy}" -o table

Get public IP:

IP="$(az vm show -d -g "$RG" -n "$VM" --query publicIps -o tsv)"
echo "$IP"

Step 5: SSH into the VM and install tools

ssh -i ~/.ssh/spotvm_lab "${ADMIN}@${IP}"

Inside the VM:

sudo apt-get update
sudo apt-get install -y curl jq

Expected outcome: You can connect, and curl/jq are installed.

Step 6: Add an eviction watcher (Scheduled Events)

Azure Scheduled Events are exposed via the Azure Instance Metadata Service (IMDS). For Linux, the Scheduled Events endpoint is commonly reachable at:

http://169.254.169.254/metadata/scheduledevents?api-version=...

Create a simple watcher script that polls scheduled events and writes a checkpoint marker. (For real workloads, you’d checkpoint to Blob/ADLS and stop accepting new work.)

Inside the VM:

sudo tee /usr/local/bin/spot-eviction-watch.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

METADATA="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
HDR="Metadata:true"
LOGFILE="/var/log/spot-eviction-watch.log"
CHECKPOINTS="/var/lib/spot-checkpoints.txt"

# The systemd unit below runs this script as root, so no sudo is needed here.
log() { echo "$(date -Is) $*" >> "$LOGFILE"; }

log "Starting spot eviction watcher"

while true; do
  # Query Scheduled Events; tolerate transient failures and bound the wait.
  RESP="$(curl -sS --max-time 10 -H "$HDR" "$METADATA" || true)"

  # If the response is empty, keep trying
  if [[ -z "${RESP}" ]]; then
    log "Empty response from Scheduled Events"
    sleep 5
    continue
  fi

  # Spot evictions are reported with EventType "Preempt".
  # Event schema can evolve; verify with official docs.
  PREEMPTS="$(echo "$RESP" | jq -r '[.Events[]? | select(.EventType == "Preempt")] | length' 2>/dev/null || echo "0")"

  if [[ "$PREEMPTS" != "0" ]]; then
    log "Eviction notice received: $RESP"

    # Write a simple "checkpoint" marker locally
    # In production, upload state to durable storage and stop services gracefully.
    echo "$(date -Is) eviction_notice" >> "$CHECKPOINTS"

    # Optional: attempt graceful stop actions here (systemctl stop myservice, drain, etc.)
  fi

  sleep 5
done
EOF

sudo chmod +x /usr/local/bin/spot-eviction-watch.sh
sudo mkdir -p /var/lib

Create a systemd service:

sudo tee /etc/systemd/system/spot-eviction-watch.service >/dev/null <<'EOF'
[Unit]
Description=Azure Spot VM eviction watcher (Scheduled Events)
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
ExecStart=/usr/local/bin/spot-eviction-watch.sh
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now spot-eviction-watch.service

Expected outcome: The watcher runs continuously and logs to /var/log/spot-eviction-watch.log.

Verify:

sudo systemctl status spot-eviction-watch.service --no-pager
sudo tail -n 20 /var/log/spot-eviction-watch.log

Step 7: Validate Spot configuration from the Azure side

Exit SSH (or keep it open) and run locally:

az vm show -g "$RG" -n "$VM" --query "{
  name:name,
  location:location,
  vmSize:hardwareProfile.vmSize,
  priority:priority,
  evictionPolicy:evictionPolicy
}" -o table

Expected outcome: priority shows Spot, and evictionPolicy shows Deallocate.

Validation

Use this checklist:

VM is Spot: az vm show ... --query priority returns Spot.
SSH connectivity works: you can SSH to the VM.
Eviction watcher service is running: systemctl status spot-eviction-watch.service is active.
Scheduled Events endpoint is reachable. On the VM, run:

curl -sS -H "Metadata:true" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq .

It should return JSON, typically with an Events array (often empty).

Troubleshooting

Issue: VM creation fails with capacity error
Try another VM size (e.g., a different D-series or F-series size; B-series is not supported for Spot).
Try a different region.
Remove zone pinning (if you used zones).
Spot is capacity-driven; failures are normal and should be handled by automation in production.
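In automation, the "try another size" advice is often written as a fallback loop. The sketch below assumes a hypothetical helper, create_with_fallback, that takes the name of a create command and a list of candidate sizes; fake_create is a stand-in so the example runs without Azure. In practice the create command would wrap the az vm create call from Step 4.

```shell
#!/usr/bin/env bash
# create_with_fallback CREATE_CMD SIZE...
# Calls CREATE_CMD with each candidate size until one succeeds, then prints
# the size that worked. Returns 1 if every candidate fails.
# CREATE_CMD is a hypothetical wrapper you supply, for example a function
# that runs: az vm create ... --size "$1" --priority Spot
create_with_fallback() {
  local create_cmd="$1" size
  shift
  for size in "$@"; do
    if "$create_cmd" "$size"; then
      echo "$size"
      return 0
    fi
    echo "no capacity for $size, trying next candidate" >&2
  done
  return 1
}

# Stand-in for the real create call so the sketch runs without Azure:
# it pretends only Standard_D2as_v5 has Spot capacity right now.
fake_create() { [ "$1" = "Standard_D2as_v5" ]; }

create_with_fallback fake_create Standard_D2s_v3 Standard_D2as_v5 Standard_F2s_v2
```

The same loop shape works for a list of regions; a real wrapper should also distinguish capacity errors from configuration errors so that it does not retry a request that can never succeed.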

Issue: SSH times out
Confirm your NSG rule uses your current public IP.
Confirm the VM has a public IP.
Check the VM's status:

az vm get-instance-view -g "$RG" -n "$VM" --query "instanceView.statuses" -o table

Issue: Scheduled Events endpoint returns nothing or errors
Ensure you used the required header Metadata:true.
Ensure curl is installed.
Verify the API version in official docs if the endpoint schema changes.

Issue: VM gets evicted
This can happen at any time.
If eviction policy is Deallocate, try restarting:

az vm start -g "$RG" -n "$VM"

If capacity isn’t available, start may fail; try later or redeploy.

Cleanup

Delete the whole resource group (recommended):

az group delete -n "$RG" --yes --no-wait

Expected outcome: All lab resources are scheduled for deletion, preventing ongoing charges (especially for disks and public IPs).

11. Best Practices

Architecture best practices

Design for eviction from day one: assume nodes disappear.
Prefer fleets over pets:
Use VM Scale Sets or automated redeployments rather than single VMs.
Externalize state:
Store state in managed services (Storage, databases) instead of local disks.
Checkpoint frequently:
Make progress durable every N minutes or per task unit.
Use queue-based work distribution:
Service Bus / Storage queues with retry + poison queue patterns.
Consider hybrid capacity:
Baseline on-demand capacity for guaranteed throughput; add Spot for burst.
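The "checkpoint frequently" and "externalize state" advice above can be sketched as a resumable work loop. This is a minimal illustration under stated assumptions: the function names, the CHECKPOINT_FILE path, and the process_item placeholder are all hypothetical, and a local file stands in for durable storage. In a real Spot fleet the checkpoint would live in Blob Storage or another managed service so that it survives the node.

```shell
# Hypothetical checkpoint location; in production this would be durable external storage.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-/tmp/spot-demo.checkpoint}"

# Persist the index of the last completed item.
# Write to a temp file and rename it, so an eviction mid-write never leaves a torn checkpoint.
save_checkpoint() {
  printf '%s\n' "$1" > "${CHECKPOINT_FILE}.tmp" && mv "${CHECKPOINT_FILE}.tmp" "$CHECKPOINT_FILE"
}

# Print the last completed index, or 0 when no checkpoint exists yet.
load_checkpoint() {
  if [ -f "$CHECKPOINT_FILE" ]; then cat "$CHECKPOINT_FILE"; else echo 0; fi
}

# Placeholder for one unit of work; replace with the real task.
process_item() {
  echo "processed item $1"
}

# Process items (last_done + 1)..TOTAL, checkpointing after each completed item.
run_from_checkpoint() {
  total="$1"
  i=$(( $(load_checkpoint) + 1 ))
  while [ "$i" -le "$total" ]; do
    process_item "$i"
    save_checkpoint "$i"
    i=$(( i + 1 ))
  done
}
```

If a node is evicted after item 3 of 5, a replacement node that calls run_from_checkpoint 5 against the same checkpoint resumes at item 4. Because the checkpoint is written only after an item finishes, an item interrupted mid-flight is repeated, which is why the tasks themselves should be idempotent.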

IAM/security best practices

Use Managed Identity for VM-to-Azure authentication.
Use least privilege:
RBAC scoped to resource groups and specific resources (Storage container, Key Vault secrets).
Control admin access:
Prefer Azure Bastion or private access + jump host.
Restrict SSH/RDP via NSGs to known IPs and use strong key-based auth.

Cost best practices

Treat Spot as compute cost reduction, not a full solution:
Disks and egress can still dominate.
Use Deallocate only when you intend to restart; otherwise Delete can reduce lingering resource costs (verify delete options).
Right-size and measure:
Use autoscale and choose VM families that give best $/work unit, not only lowest $/hour.
Add tags for cost allocation and spot-specific reporting:
priority=spot, workload=batch, owner=..., env=...
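The "$/work unit, not only lowest $/hour" point can be made concrete with a small calculation. The sketch below is illustrative only: the function name cost_per_1000_tasks is hypothetical, and the prices and throughput figures in the usage note are made-up numbers, not Azure prices.

```shell
# cost_per_1000_tasks HOURLY_PRICE TASKS_PER_HOUR
# Prints the cost of 1,000 completed tasks, rounded to 4 decimal places.
# TASKS_PER_HOUR should be the measured effective rate, including work lost to evictions.
cost_per_1000_tasks() {
  awk -v price="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) { print "rate must be > 0" > "/dev/stderr"; exit 1 }
    printf "%.4f\n", price / rate * 1000
  }'
}
```

With hypothetical inputs, cost_per_1000_tasks 0.10 500 prints 0.2000 while cost_per_1000_tasks 0.04 150 prints 0.2667: the size that is cheaper per hour ends up more expensive per unit of work because it completes fewer tasks.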

Performance best practices

Prefer parallel-friendly algorithms and distributed job scheduling.
Avoid heavy warm-up times; use images or pre-baked artifacts to reduce node initialization.
Use proximity placement groups only when necessary; they can reduce capacity flexibility.

Reliability best practices

Assume random node loss:
Use retries with backoff.
Ensure tasks are idempotent.
Spread risk:
Use multiple VM sizes, zones, or regions (where feasible).
For services, use health checks and load balancers to remove evicted nodes quickly.
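The "retries with backoff" item above can be sketched as a small wrapper. This is one possible shape, not a prescribed implementation: the function name retry_with_backoff and its argument order are assumptions made for this example, and it uses plain exponential doubling without jitter.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# The delay doubles after each failure: BASE, 2*BASE, 4*BASE, ...
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, retry_with_backoff 5 30 az vm start -g "$RG" -n "$VM" would attempt to restart a deallocated Spot VM up to five times, waiting 30, 60, 120, and 240 seconds between attempts. Only wrap commands that are safe to repeat, which is where the idempotency advice comes in.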

Operations best practices

Monitor:
VM start failures (capacity)
Unexpected deallocations
Queue depth / job completion rates
Automate replacement:
VMSS autoscale rules
Scheduled redeployments
Maintain golden images and versioning:
Bake dependencies into VM images for consistent, fast scale-out.

Governance/tagging/naming best practices

Standard naming: vm-spot-<app>-<env>-<region>-<nn>
Standard tags:
env, app, owner, costCenter, dataClass, lifecycle=ephemeral, priority=spot
Apply Azure Policy to enforce:
No public IPs (where required)
Approved images
Required tags
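The naming pattern and standard tags above can be generated by small helpers so that every deployment script produces them the same way. This is a sketch under assumptions: the function names spot_vm_name and spot_tags are hypothetical, and the argument values in the usage note are placeholders.

```shell
# spot_vm_name APP ENV REGION INDEX -> vm-spot-<app>-<env>-<region>-<nn>
# INDEX is zero-padded to two digits; pass it without leading zeros (7, not 07).
spot_vm_name() {
  printf 'vm-spot-%s-%s-%s-%02d\n' "$1" "$2" "$3" "$4"
}

# spot_tags ENV APP OWNER COST_CENTER DATA_CLASS -> space-separated key=value pairs,
# the form accepted by the --tags argument of Azure CLI commands.
# Values must not contain spaces.
spot_tags() {
  printf 'env=%s app=%s owner=%s costCenter=%s dataClass=%s lifecycle=ephemeral priority=spot\n' \
    "$1" "$2" "$3" "$4" "$5"
}
```

A deployment script might then pass --name "$(spot_vm_name render prod eastus 7)" and --tags $(spot_tags prod render team-a cc-1234 internal) to az vm create. The tags substitution is left unquoted on purpose so that each key=value pair becomes its own argument.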

12. Security Considerations

Identity and access model

Management plane access uses Microsoft Entra ID (formerly Azure AD) + Azure RBAC.
For in-VM access to Azure resources:
Prefer System-assigned Managed Identity (per-VM) or User-assigned Managed Identity (shared across fleet).
Avoid embedding credentials in VM images or scripts.

Encryption

At rest:
Managed disks support encryption (platform-managed keys by default; customer-managed keys are possible depending on requirements—verify current options).
Storage for checkpoints (Blob/ADLS) supports encryption.
In transit:
Use TLS to storage endpoints and internal services.

Network exposure

Avoid public IPs for fleets when possible.
Use NSGs with least exposure:
Restrict SSH/RDP to known admin IPs or via Bastion.
Consider private endpoints/private networking for Storage/Key Vault when required.

Secrets handling

Use Key Vault + Managed Identity.
If you must use secrets, rotate them and keep them out of logs.
Never bake long-lived secrets into VM images.

Audit/logging

Use:
Azure Activity Log for management operations
Azure Monitor / Log Analytics for guest logs (agent-based)
Alert on:
Unexpected VM deletions/deallocations
Role assignment changes
NSG changes opening inbound access

Compliance considerations

Data residency: choose regions carefully.
Data classification: ensure workloads running on Spot still follow the same compliance controls (encryption, access logging, retention).
If using Spot for sensitive workloads, ensure:
Full disk encryption posture meets requirements
Hardened images and patching strategy are defined

Common security mistakes

Opening SSH/RDP to the internet broadly (0.0.0.0/0).
Using shared admin passwords across ephemeral nodes.
Putting secrets in cloud-init scripts without secure retrieval.
Assuming evicted nodes are fully wiped immediately (design as if disks may persist if deallocated).

Secure deployment recommendations

Private subnets + Bastion/jump host
Managed Identity everywhere
Key Vault for secrets/certs
Minimal inbound rules; outbound control if required (Firewall/NAT policies)
Golden images with CIS-aligned hardening (where applicable)

13. Limitations and Gotchas

No capacity guarantee: Spot VMs may fail to allocate at creation time.
Evictions can happen anytime: design for sudden loss; do not rely solely on graceful shutdown.
Short eviction notice: typically about 30 seconds, delivered through Scheduled Events as a Preempt event; not enough for large data flushes.
Region/size volatility: some VM sizes may frequently be unavailable as Spot.
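Quotas still apply: Spot vCPUs count against a separate Spot quota (shared between single VMs and scale set instances), not your standard VM quota, so check and request Spot quota explicitly.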
Persistent resource costs:
If evicted and deallocated, disks and other resources can still accrue costs.
Stateful workloads are risky:
Databases and single-instance services are poor fits unless architected for interruption and replication.
Price can change:
Spot price varies; max price can protect you but can also increase evictions.
Operational complexity:
Requires retry logic, checkpointing, and fleet management.
Zonal constraints:
Pinning to a zone can reduce available capacity, increasing allocation failures.
Tooling differences:
Fields/flags can differ by API version and tooling; verify with the latest Azure CLI/ARM/Bicep/Terraform docs.

14. Comparison with Alternatives

Alternatives in Azure

Regular Azure Virtual Machines (pay-as-you-go): stable capacity, no eviction.
Reserved Instances / Savings Plans: cost reduction for steady usage without eviction risk (verify current product specifics and applicability by VM type).
Azure VM Scale Sets (on-demand): fleet management without Spot interruptions.
Azure Batch: managed job scheduling; can use lower-cost capacity options (verify current terminology and options).
AKS with Spot node pools: Spot benefits for Kubernetes worker nodes (still eviction-prone).

Alternatives in other clouds

AWS EC2 Spot Instances
Google Cloud Spot VMs (previously “Preemptible VMs” terminology in older materials—verify current naming)

Open-source/self-managed alternatives

Run workloads on your own hardware with a preemptible model (e.g., Kubernetes with cluster autoscaler + mixed node types), but you lose the cloud provider’s elastic spare capacity economics.

Comparison table

Option	Best For	Strengths	Weaknesses	When to Choose
Azure Spot Virtual Machines	Interruptible compute, batch, burst scale-out	Deep discounts, same VM features, good for fleets	Evictions, no capacity guarantee, more engineering	Workloads can retry/checkpoint and tolerate loss
Azure Virtual Machines (pay-as-you-go)	Always-on services, steady workloads	Stable, predictable, no eviction	Higher cost	You need reliability and predictable capacity
Reserved Instances / Savings Plans (Azure)	Steady baseline usage	Lower cost without interruption	Commitment/term, less flexible	You have predictable long-running compute needs
Azure VM Scale Sets (on-demand)	Scalable services without interruptions	Autoscale, self-healing	Higher cost than Spot	You need scale but can’t tolerate eviction
Azure Batch	Managed batch scheduling	Job orchestration, pools, scheduling features	Learning curve, service-specific model	You want a managed batch platform and queues/pools
AKS Spot node pools	Kubernetes worker capacity for stateless pods	Integrates with K8s scheduling, cost reduction	Node churn, pod disruption handling required	You run Kubernetes and can tolerate node interruptions
AWS EC2 Spot / GCP Spot VMs	Cross-cloud equivalents	Similar economics and patterns	Different APIs, behaviors, pricing	Multi-cloud strategy or existing footprint elsewhere

15. Real-World Example

Enterprise example: Risk simulations with checkpointing

Problem: A financial services company runs nightly Monte Carlo simulations. On-demand compute is expensive, and windows are tight.
Proposed architecture:
Job scheduler submits simulation batches to a queue.
A VM Scale Set of Azure Spot Virtual Machines pulls tasks.
Each simulation writes checkpoints every few minutes to Azure Blob Storage.
Results are aggregated into a durable data store.
Monitoring alerts on queue depth and job completion SLA.
Why Azure Spot Virtual Machines was chosen:
Large cost reduction for massively parallel compute.
Evictions are acceptable because simulations are restartable from checkpoints.
Expected outcomes:
Significant reduction in compute spend.
Ability to run more simulations or finish earlier.
Resilience to evictions via retries and checkpointing.

Startup/small-team example: CI runners for monorepo builds

Problem: A startup’s monorepo builds consume many CPU hours. Hosted CI is costly and slow at peak times.
Proposed architecture:
Self-hosted runners on Spot VMs (single VMSS).
Jobs are short-lived and retryable; artifacts stored in remote storage.
Baseline small on-demand runner pool for guaranteed minimum throughput; Spot adds burst.
Why Azure Spot Virtual Machines was chosen:
Major CI cost reduction.
Build jobs are naturally retryable; interruptions are tolerable.
Expected outcomes:
Lower CI spend and faster build throughput during spikes.
Automated scaling without paying for idle capacity.

16. FAQ

What is an Azure Spot Virtual Machine?
A standard Azure VM configured to run on unused Azure capacity at a discounted rate, with the possibility of eviction when Azure needs the capacity back or when your max price is exceeded.
Can Spot VMs be used for production?
Yes, for production workloads that are designed to handle interruptions (stateless services, workers, batch jobs). Avoid using Spot for critical stateful single-instance systems.
What does eviction mean in practice?
Your VM can be stopped (deallocated) or deleted depending on policy. Your application must expect node loss and recover via retries, rescheduling, or autoscaling.
How much cheaper are Spot VMs?
Discounts can be significant, but exact savings vary by region, VM size, and demand. Use the official pricing page and compare to pay-as-you-go for your target SKU.
Do Spot VMs have an SLA?
No. Azure Spot Virtual Machines come with no availability SLA and no capacity guarantee. Review the official SLA documentation and service terms for the current wording.
What is the difference between Deallocate and Delete eviction policies?
Deallocate stops the VM and releases compute, typically keeping attached resources so you can restart later. Delete removes the VM; what happens to disks/IPs depends on delete options—verify before relying on it.
How do I get notified of an eviction?
Use Azure Scheduled Events from inside the VM to detect upcoming eviction notices and trigger graceful shutdown/checkpointing.
Can I prevent eviction?
No. You can only reduce impact through architecture (checkpointing, retries) and reduce likelihood by choosing different sizes/regions or using hybrid capacity.
What is “max price” and how should I set it?
Max price caps what you’re willing to pay per hour. If the Spot price exceeds it, the VM can be evicted. To avoid price-based evictions, set max price to -1: the VM is then evicted only for capacity reasons, and you pay the current Spot price, capped at the pay-as-you-go price. Always verify semantics in official docs.
Do reservations or savings plans apply to Spot VMs?
Typically Spot is billed at Spot rates and doesn’t stack with some commitment discounts. Verify current Azure billing rules for reservations/savings plans vs Spot.
Can I use Spot VMs with VM Scale Sets?
Yes. VM Scale Sets are a common way to run Spot fleets with autoscaling and replacement of evicted instances.
Can Spot VMs run Windows?
Yes, depending on VM size/region availability. Costs and licensing differ—verify pricing for Windows Spot VMs.
What happens to my data when a Spot VM is evicted?
Data on the VM’s temporary/local disk may be lost. Data on managed disks persists if the disks remain. Design for durability by storing state externally.
How do I design workloads for Spot?
Use idempotent tasks, queues, checkpointing, retries, and avoid single points of failure. Treat nodes as disposable.
What’s the fastest way to start using Spot safely?
Start with non-critical batch jobs or CI runners, add checkpointing/retries, and expand gradually. Monitor eviction behavior and tune VM sizes/regions.
Can I combine Spot and on-demand in the same architecture?
Yes. A common approach is baseline on-demand capacity plus Spot for burst. This improves reliability while still saving money.
Are GPUs available as Spot?
Sometimes, depending on region and capacity. GPU Spot is highly capacity-sensitive; be prepared for allocation failures and higher eviction risk.
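The max price answer above can be turned into a small pre-flight check for values passed to az vm create --priority Spot --max-price. This is a hedged sketch: the function name describe_max_price and its messages are made up for this example. It encodes the two documented forms, -1 for "never evict on price" and a positive USD amount with up to five decimal places.

```shell
# describe_max_price VALUE
# Classifies a value intended for: az vm create --priority Spot --max-price VALUE
#   -1            -> no price-based eviction; billed at the Spot price, capped at pay-as-you-go
#   positive USD  -> evicted when the hourly Spot price rises above VALUE (up to 5 decimals)
describe_max_price() {
  case "$1" in
    -1)
      echo "capacity-only eviction (price capped at pay-as-you-go)" ;;
    *)
      if printf '%s' "$1" | grep -Eq '^[0-9]+(\.[0-9]{1,5})?$' \
         && awk -v v="$1" 'BEGIN { exit !(v > 0) }'; then
        echo "evict when Spot price exceeds \$$1/hour"
      else
        echo "invalid max price: $1" >&2
        return 1
      fi ;;
  esac
}
```

Running describe_max_price before a deployment catches typos such as a zero, a negative value other than -1, or a value with more than five decimal places before they reach the Azure CLI.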

17. Top Online Resources to Learn Azure Spot Virtual Machines

Resource Type	Name	Why It Is Useful
Official documentation	Spot Virtual Machines (Azure VMs) – https://learn.microsoft.com/azure/virtual-machines/spot-vms	Core concepts, configuration options, eviction, max price, guidance
Official pricing	Spot pricing – https://azure.microsoft.com/pricing/details/virtual-machines/spot/	Understand the Spot pricing model and constraints
Pricing calculator	Azure Pricing Calculator – https://azure.microsoft.com/pricing/calculator/	Estimate end-to-end costs including disks, bandwidth, monitoring
Official docs (eviction handling)	Azure Scheduled Events (Linux) – https://learn.microsoft.com/azure/virtual-machines/linux/scheduled-events	How to detect eviction notices and other events from inside the VM
Official docs (VM fundamentals)	Azure Virtual Machines documentation – https://learn.microsoft.com/azure/virtual-machines/	VM networking, disks, images, and operational guidance
Official docs (fleet mgmt)	Virtual Machine Scale Sets – https://learn.microsoft.com/azure/virtual-machine-scale-sets/	Autoscale and manage Spot fleets more safely
Official architecture	Azure Architecture Center – https://learn.microsoft.com/azure/architecture/	Patterns for resilient, scalable architectures
Official learning modules	Microsoft Learn (Azure Virtual Machines learning paths) – https://learn.microsoft.com/training/browse/?products=azure-virtual-machines	Structured training; search for VM/scale set modules
Official samples	Azure Quickstart Templates – https://github.com/Azure/azure-quickstart-templates	Many VM/VMSS templates; look for Spot/priority examples
Community (reputable)	Azure Friday / Microsoft Azure YouTube – https://www.youtube.com/@MicrosoftAzure	Videos and demos; verify details against current docs

18. Training and Certification Providers

DevOpsSchool.com – Suitable audience: DevOps engineers, SREs, cloud engineers, platform teams – Likely learning focus: Azure DevOps, cloud operations, automation, IaC, CI/CD (check course listings for Spot/VM/compute coverage) – Mode: Check website – Website URL: https://www.devopsschool.com/
ScmGalaxy.com – Suitable audience: DevOps and SCM learners, build/release engineers – Likely learning focus: SCM, CI/CD pipelines, DevOps foundations (check for Azure modules) – Mode: Check website – Website URL: https://www.scmgalaxy.com/
CloudOpsNow.in – Suitable audience: Cloud operations practitioners, administrators, SREs – Likely learning focus: Cloud operations, monitoring, reliability, cost awareness (verify available Azure content) – Mode: Check website – Website URL: https://www.cloudopsnow.in/
SreSchool.com – Suitable audience: SREs, operations engineers, platform engineers – Likely learning focus: Reliability engineering, incident response, observability, scalability (Spot fits into reliability/cost tradeoffs) – Mode: Check website – Website URL: https://www.sreschool.com/
AiOpsSchool.com – Suitable audience: SRE/ops teams adopting AIOps practices – Likely learning focus: Monitoring, automation, event correlation (useful for operating ephemeral Spot fleets) – Mode: Check website – Website URL: https://www.aiopsschool.com/

19. Top Trainers

RajeshKumar.xyz – Likely specialization: DevOps/cloud training content (verify current offerings) – Suitable audience: Engineers seeking practical DevOps/cloud guidance – Website URL: https://www.rajeshkumar.xyz/
devopstrainer.in – Likely specialization: DevOps tools and cloud-focused training (verify Azure coverage) – Suitable audience: Beginners to intermediate DevOps engineers – Website URL: https://www.devopstrainer.in/
devopsfreelancer.com – Likely specialization: DevOps consulting/training platform content (verify services) – Suitable audience: Teams seeking external DevOps expertise and coaching – Website URL: https://www.devopsfreelancer.com/
devopssupport.in – Likely specialization: DevOps support and training resources (verify Azure offerings) – Suitable audience: Operations/DevOps teams needing guided troubleshooting support – Website URL: https://www.devopssupport.in/

20. Top Consulting Companies

cotocus.com – Likely service area: Cloud/DevOps consulting (verify exact service catalog) – Where they may help: Designing scalable compute platforms, cost optimization, automation, governance – Consulting use case examples: Spot-based CI runner platform; batch worker fleet with checkpointing; VMSS autoscaling and monitoring setup – Website URL: https://www.cotocus.com/
DevOpsSchool.com – Likely service area: DevOps/cloud consulting and enablement (verify current offerings) – Where they may help: DevOps transformations, CI/CD, cloud operations, training-to-implementation engagements – Consulting use case examples: IaC modules for Spot/VMSS; operational runbooks for evictions; cost governance tagging strategy – Website URL: https://www.devopsschool.com/
DEVOPSCONSULTING.IN – Likely service area: DevOps consulting services (verify scope and regions served) – Where they may help: Cloud automation, infrastructure reliability, monitoring and support – Consulting use case examples: Building interruption-tolerant worker platforms; integrating alerts for Spot eviction events; platform hardening and access controls – Website URL: https://www.devopsconsulting.in/

21. Career and Learning Roadmap

What to learn before Azure Spot Virtual Machines

Azure fundamentals:
Subscriptions, resource groups, regions, availability zones
Azure Virtual Machines basics:
VM images, sizing, disks, networking (VNet/subnet/NSG), VM lifecycle
Security basics:
Microsoft Entra ID (formerly Azure AD), RBAC, Managed Identity, Key Vault
Operations basics:
Azure Monitor, logs/metrics, alerting, Activity Log
Reliability basics:
Stateless design, retries, backoff, idempotency, checkpointing
IaC fundamentals:
ARM/Bicep or Terraform basics; Azure CLI

What to learn after

VM Scale Sets advanced patterns:
Autoscale rules, health probes, rolling upgrades
Kubernetes Spot patterns:
AKS spot node pools, pod disruption budgets, cluster autoscaler
Batch orchestration:
Azure Batch, queue-driven processing, workflow engines
FinOps:
Cost allocation with tags, budgets, anomaly detection, unit-cost models
Security hardening:
Golden images, vulnerability scanning, policy-as-code

Job roles that use it

Cloud engineer / Cloud administrator
Solutions architect
DevOps engineer
SRE / Platform engineer
Data engineer (batch platforms)
FinOps analyst (advisory + governance)

Certification path (Azure)

Azure certifications change over time; verify current offerings on Microsoft Learn. Commonly relevant certifications include:
AZ-900 (Azure Fundamentals)
AZ-104 (Azure Administrator)
AZ-305 (Azure Solutions Architect)
Also consider DevOps-focused certifications depending on your role (verify current exam codes and requirements).

Project ideas for practice

Build a queue-based worker system (Service Bus + Spot VMSS) with checkpointing to Blob.
Create a CI runner fleet on Spot with autoscaling and job retry logic.
Implement an eviction-aware daemon that drains work, uploads state, and terminates gracefully.
Do a cost/performance benchmark comparing on-demand vs Spot for a batch workload and report unit cost ($ per 1,000 tasks).

22. Glossary

Azure Spot Virtual Machines: Azure VMs running on unused capacity with variable price and eviction risk.
Eviction: Azure reclaiming capacity; the Spot VM is stopped/deallocated or deleted.
Eviction policy: Setting that controls what happens to the VM on eviction (e.g., Deallocate or Delete).
Max price: The maximum price you’re willing to pay for a Spot VM; exceeding it may cause eviction.
VM Scale Set (VMSS): A service for running and managing a set of identical VMs with autoscaling.
Checkpointing: Periodically saving progress to durable storage so work can resume after interruption.
Idempotent task: A task that can be retried without causing inconsistent results.
Scheduled Events: In-VM metadata API for upcoming platform events like eviction notices.
IMDS (Instance Metadata Service): Metadata endpoint reachable from inside Azure VMs (169.254.169.254).
NSG (Network Security Group): Stateful allow/deny rules that filter traffic to and from subnets and NICs (not to be confused with the Azure Firewall service).
Managed Identity: Azure feature providing an identity for Azure resources to access other services without secrets.
Activity Log: Azure subscription-level log for management operations.
Egress: Outbound network traffic from Azure to the internet or other regions.

23. Summary

Azure Spot Virtual Machines is an Azure Compute option that runs standard Azure VMs on unused capacity at discounted rates, with the critical tradeoff that VMs can be evicted with short notice. It matters because it can dramatically reduce compute costs and enable large-scale parallel processing—when your workload is designed for interruption.

Architecturally, Spot works best for stateless and fault-tolerant workloads, supported by queues, retries, and checkpointing, and usually managed as a fleet (often via VM Scale Sets) rather than individual long-lived servers. From a cost perspective, the biggest savings are on compute, while disks, networking egress, and monitoring can still be meaningful; from a security perspective, Spot uses the same RBAC, networking, and encryption controls as normal VMs, so you should apply standard Azure hardening and governance.

Use Azure Spot Virtual Machines when you can tolerate interruption and want the lowest-cost compute for batch, CI, testing, and scalable workers. Next, deepen your skills by implementing Spot fleets with VM Scale Sets, adding eviction-aware shutdown logic using Scheduled Events, and building a unit-cost model to quantify savings vs interruption overhead.
