Category
Compute
1. Introduction
Azure Spot Virtual Machines is an Azure Compute pricing and capacity option that lets you run Virtual Machines (VMs) using unused Azure capacity at steep discounts compared to pay-as-you-go prices. The tradeoff is that Azure can evict (stop) your Spot VM with short notice when Azure needs the capacity back or when your configured maximum price is exceeded.
In simple terms: you get cheaper VMs, but they can be interrupted. If your workload can tolerate interruptions—batch jobs, CI runners, distributed builds, rendering, simulations, big data processing, stateless services, or fault-tolerant microservices—Spot is often the most cost-effective way to scale out.
Technically, Azure Spot Virtual Machines are standard Azure Virtual Machines configured with a “Spot” priority and an eviction policy. They run in an Azure region and optionally in an Availability Zone, use the same VM images, disks, networking, identity, and monitoring stack as regular VMs, and are created via Azure portal, ARM/Bicep, Terraform, Azure CLI, PowerShell, or SDKs. The key difference is that compute capacity is not guaranteed, and eviction is part of the operating model.
The problem it solves: reducing compute costs for interruption-tolerant workloads, often enabling larger scale, faster completion times, and more experimentation under the same budget.
Naming note (important): Azure previously used terms like “low-priority VMs” (notably in some batch/scale-set contexts). Today, the official product term is Azure Spot Virtual Machines. If you see “low-priority” in older materials, treat it as legacy terminology and verify in current docs.
2. What is Azure Spot Virtual Machines?
Official purpose: Azure Spot Virtual Machines let you deploy VMs on unused Azure capacity at discounted rates, with the understanding that Azure can reclaim that capacity at any time through eviction.
Official docs: https://learn.microsoft.com/azure/virtual-machines/spot-vms
Core capabilities
- Run standard Azure VMs as Spot (Linux or Windows), using the same VM sizes, images, disks, NICs, and VNets as regular VMs (subject to region/size availability).
- Set eviction behavior:
- Deallocate (stop and release compute; you keep the VM resources like disks/NICs so you can try to start it again later).
- Delete (remove the VM on eviction; resource retention depends on delete options—verify for your scenario).
- Control maximum price:
- Set a max price you’re willing to pay for the Spot VM.
- Optionally allow paying up to the on-demand price (commonly represented as -1 in tooling; verify in official docs/CLI help for your environment).
- Receive short eviction notice (commonly ~30 seconds) via Azure Scheduled Events, enabling graceful shutdown/checkpointing where possible.
Scheduled Events docs: https://learn.microsoft.com/azure/virtual-machines/linux/scheduled-events (Linux) and equivalent Windows page.
Major components (what you actually use)
- Azure Virtual Machines configured with Spot priority
- Azure Virtual Machine Scale Sets (VMSS) (optional but common for resilient Spot fleets)
- Azure networking (VNets, NSGs, Load Balancer/Application Gateway as applicable)
- Azure Managed Disks / Storage for persistence and checkpointing
- Azure Monitor (metrics, logs, alerts)
- Azure Policy / RBAC for governance and security
Service type
- Compute (pricing/capacity model for Azure Virtual Machines), not a separate standalone compute service.
Scope and availability model
- Regional: Spot capacity and pricing are region-specific and depend on unused capacity.
- Zonal (optional): You may place Spot VMs into an Availability Zone where supported, but zone placement can reduce capacity options and increase eviction likelihood depending on local demand.
- Subscription-scoped management: You deploy Spot VMs into resource groups within an Azure subscription, under a Microsoft Entra ID (formerly Azure AD) tenant.
How it fits into the Azure ecosystem
Azure Spot Virtual Machines is best viewed as a cost optimization lever for Azure Compute:
– Pair it with VM Scale Sets for elasticity and self-healing.
– Pair it with Azure Batch, AKS spot node pools, CI/CD systems, or event-driven processing for interruption-tolerant work.
– Use Managed Identity + Key Vault for secretless access.
– Use Azure Monitor + Log Analytics for observability and operational response.
3. Why use Azure Spot Virtual Machines?
Business reasons
- Lower compute spend for workloads that don’t require always-on guarantees.
- Faster time-to-result by scaling out cheaply (finish jobs sooner by using more parallel nodes).
- More experimentation (more test environments, bigger integration tests, more frequent performance runs) under a fixed budget.
Technical reasons
- Same VM features as regular VMs (images, disks, VNets, extensions), so migration is often a configuration change.
- Elastic scale-out for distributed processing (render farms, Monte Carlo simulations, build farms).
- Works well with fault-tolerant patterns: queues, retries, checkpointing, idempotent tasks.
Operational reasons
- Integrates with existing VM ops tooling: Azure Monitor, Azure Update Manager, VM extensions, automation, and IaC.
- Flexible eviction policies for different recovery models (deallocate vs delete).
- Supports automation (ARM/Bicep/Terraform/CLI) for reproducible fleets.
Security/compliance reasons
- Uses the same Azure security controls as standard VMs:
- Azure RBAC, NSGs, encryption, Defender for Cloud (where enabled), logging
- Helps reduce risk from “shadow compute” by making low-cost compute still centrally governed and auditable.
Scalability/performance reasons
- Scale-out economically while maintaining performance-per-core characteristics of selected VM families.
- Choose GPU/HPC VM sizes as Spot when available (capacity dependent), enabling bursty high-performance runs at lower cost.
When teams should choose it
Choose Azure Spot Virtual Machines when:
– Jobs are interruptible and can restart/retry.
– Work is stateless or can checkpoint frequently.
– You can tolerate losing nodes at any time.
– You want high parallelism and can spread across VM sizes/regions for resilience.
When teams should not choose it
Avoid Azure Spot Virtual Machines when:
– You need strict availability guarantees or stable capacity (production stateful databases, critical singletons).
– You can’t tolerate VM interruption or data loss.
– The application cannot handle node churn (no retry logic, no idempotency, no checkpointing).
– You require SLA-backed uptime on those instances (Azure provides no SLA for Spot VMs).
4. Where is Azure Spot Virtual Machines used?
Industries
- Media & entertainment (rendering, transcoding)
- Finance (risk simulations, backtesting)
- Manufacturing/engineering (CAE/CFD simulations)
- Healthcare/life sciences (genomics pipelines)
- Gaming (build/test automation, large-scale testing)
- Retail/e-commerce (load testing, analytics processing)
- AI/ML (distributed training, hyperparameter sweeps—where interruption is acceptable)
Team types
- Platform engineering teams building internal compute platforms
- DevOps/SRE teams optimizing CI/CD and environments
- Data engineering teams running batch pipelines
- Research teams needing burst compute
- FinOps teams driving cost reduction initiatives
Workloads
- Batch processing, ETL, Spark-like workloads (when designed for failure)
- CI runners and ephemeral build agents
- Containerized background workers
- Distributed build systems
- Parallel test runners and fuzzing
- HPC and simulation
- Large-scale scraping/crawling (where permitted and ethical)
Architectures and real-world contexts
- Queue-based worker pools (Service Bus/Storage Queue + Spot workers)
- VM Scale Sets behind a load balancer with autoscale
- Hybrid fleets: baseline on-demand + burst Spot
- Multi-region execution for best capacity and lowest eviction risk
Production vs dev/test usage
- Dev/test: common and straightforward (interruptions are usually acceptable).
- Production: viable when production components are designed for interruption (stateless workers, batch pipelines) and when you provide capacity buffering and retry logic.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Azure Spot Virtual Machines is commonly a strong fit.
1) CI/CD build agents (self-hosted runners)
- Problem: Hosted CI minutes are costly; self-hosted agents need scale.
- Why Spot fits: Build agents are ephemeral; jobs can retry on failure.
- Example: GitHub Actions self-hosted runners on Spot VMs in a VM Scale Set; if evicted, jobs re-queued.
2) Batch ETL workers
- Problem: Large nightly ETL needs lots of CPU for a short window.
- Why Spot fits: Work can be partitioned; failed partitions can rerun.
- Example: 200 Spot workers process partitions from a storage container; results written back to ADLS/Blob.
3) Media transcoding farm
- Problem: High CPU usage for video transcoding spikes cost.
- Why Spot fits: Individual transcode jobs are independent and retryable.
- Example: A queue contains file conversion tasks; Spot VMs pull tasks and upload outputs.
4) Monte Carlo / risk simulation
- Problem: Need massive parallel simulation runs quickly.
- Why Spot fits: Simulations are independent; interruptions only delay some runs.
- Example: Thousands of simulation seeds run on Spot; missing seeds are resubmitted.
5) Large-scale integration testing
- Problem: Integration tests take too long on small clusters.
- Why Spot fits: Tests can be sharded and retried.
- Example: A test coordinator schedules shards across Spot VMs; failed shards rerun.
6) Rendering (3D/animation)
- Problem: Render jobs need many cores; deadlines matter.
- Why Spot fits: Frames are independent; re-render a dropped frame.
- Example: Spot VM pool renders frames; eviction re-queues unfinished frames.
7) Containerized background workers (stateless)
- Problem: Background processing is throughput-limited and costly on on-demand.
- Why Spot fits: Stateless workers can be replaced easily.
- Example: VMSS Spot nodes run containers that pull from Service Bus; if evicted, autoscale replenishes.
8) ML hyperparameter sweeps
- Problem: Many training trials; some can be interrupted.
- Why Spot fits: Individual trials can be restarted; results accumulate.
- Example: Hundreds of trials scheduled; partial progress checkpointed to Blob.
9) Web crawling / data collection (rate-limited and compliant)
- Problem: Need periodic collection with burst compute.
- Why Spot fits: Tasks are chunked; eviction delays some chunks.
- Example: Each Spot VM processes a subset of URLs; progress saved frequently.
10) HPC-style embarrassingly parallel workloads
- Problem: Need lots of compute; workload can be partitioned.
- Why Spot fits: Each job unit is independent.
- Example: Finite element jobs distributed; evicted nodes rerun jobs from last checkpoint.
11) Development environments that can be re-created
- Problem: Engineers need cheap sandboxes.
- Why Spot fits: Environments are disposable; state is kept in Git/remote storage.
- Example: Temporary Spot-based dev boxes created via IaC; data stored in repos and storage.
12) Disaster recovery / chaos testing capacity
- Problem: Need occasional large test capacity without steady cost.
- Why Spot fits: Tests are time-boxed and can tolerate interruptions.
- Example: Quarterly failover testing runs on Spot fleets.
6. Core Features
Feature 1: Spot priority (use unused capacity)
- What it does: Deploys a VM using Azure’s unused capacity at a discounted rate.
- Why it matters: Major compute cost reduction.
- Practical benefit: Scale out more nodes for the same money.
- Caveat: Capacity can disappear at any time; eviction is expected.
Feature 2: Eviction (interruption) model
- What it does: Azure can evict Spot VMs when capacity is needed.
- Why it matters: This is the core tradeoff; your workload must handle it.
- Practical benefit: Enables the discount.
- Caveat: You may get only short notice (commonly ~30 seconds). Always design for abrupt loss.
Feature 3: Eviction policy (Deallocate vs Delete)
- What it does:
- Deallocate: VM stops; you keep allocated resources such as disks/NIC; you can try to restart later.
- Delete: VM is removed when evicted.
- Why it matters: Determines recovery workflow and storage lifecycle.
- Practical benefit: Deallocate can preserve OS/data disks for later restart; Delete reduces clutter.
- Caveat: Resource retention with Delete depends on resource delete options and configuration—verify in official docs for disks/public IP behavior.
Feature 4: Max price control
- What it does: Sets the maximum hourly price you’re willing to pay for the Spot VM.
- Why it matters: Prevents surprise spend if Spot price rises.
- Practical benefit: Hard cap aligned to your budget.
- Caveat: If Spot price exceeds your max price, your VM can be evicted.
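The max-price rule can be illustrated with a small check. This is a minimal sketch that models only the price rule, not capacity-driven evictions. It assumes prices are plain decimal USD-per-hour values; the function name would_evict_for_price and the sample prices are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: models only the max-price rule described above.
# Usage: would_evict_for_price <current_spot_price> <max_price>
# Prints "evict" if the current price exceeds the cap, otherwise "keep".
# A max price of -1 means "do not evict for price reasons".
would_evict_for_price() {
  local current="$1" max="$2"
  if [ "$max" = "-1" ]; then
    echo "keep"
    return 0
  fi
  # awk handles the decimal comparison; shell arithmetic is integer-only.
  awk -v c="$current" -v m="$max" \
    'BEGIN { if (c + 0 > m + 0) print "evict"; else print "keep" }'
}

would_evict_for_price 0.012 0.020   # current price is under the cap
would_evict_for_price 0.025 0.020   # current price is over the cap
would_evict_for_price 0.500 -1      # no price cap configured
```

A VM with no price cap can still be evicted when Azure needs the capacity back, so the "keep" result here says nothing about capacity evictions.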
Feature 5: Works with standard VM building blocks
- What it does: Uses the same Azure VM images, extensions, managed disks, VNets, NSGs, and load balancing options.
- Why it matters: Low friction adoption.
- Practical benefit: You can reuse hardening, monitoring agents, and IaC modules.
- Caveat: Some VM sizes/regions may not be available as Spot at the moment you deploy.
Feature 6: Scheduled Events eviction notifications
- What it does: Provides an in-VM metadata endpoint to learn about upcoming events like eviction.
- Why it matters: Enables graceful shutdown, checkpointing, draining, and deregistration.
- Practical benefit: Less lost work when evicted.
- Caveat: Not all failures are graceful; always assume you can also lose the VM without notice.
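To make the notification concrete, the sketch below checks a Scheduled Events document for a Preempt event, the event type used for Spot eviction. The sample payloads are illustrative: the field names follow the Scheduled Events schema, but the values are made up. It uses a plain text match with grep so that it runs without extra packages; a real handler should parse the JSON properly (for example with jq).

```shell
#!/usr/bin/env bash
# Returns 0 if the Scheduled Events document read from stdin contains at
# least one event whose EventType is "Preempt".
# This is a text match, not a JSON parser; prefer jq in real handlers.
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Illustrative payloads; the values are invented for this example.
SAMPLE='{"DocumentIncarnation":2,"Events":[{"EventId":"00000000-0000-0000-0000-000000000000","EventType":"Preempt","ResourceType":"VirtualMachine","Resources":["vm-spot-lab01"],"EventStatus":"Scheduled","NotBefore":"Mon, 01 Jan 2024 00:00:30 GMT"}]}'
EMPTY='{"DocumentIncarnation":1,"Events":[]}'

if printf '%s' "$SAMPLE" | has_preempt_event; then
  echo "eviction notice: checkpoint and drain now"
fi
if ! printf '%s' "$EMPTY" | has_preempt_event; then
  echo "no eviction pending"
fi
```

Matching on the event type matters because the same endpoint can also report other events (such as reboots), which should not be treated as evictions.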
Feature 7: VM Scale Sets compatibility (common pattern)
- What it does: Lets you manage pools of Spot VMs with autoscaling and orchestration.
- Why it matters: Spot is more reliable when treated as a fleet, not a singleton.
- Practical benefit: Replace evicted nodes automatically; integrate with load balancers.
- Caveat: The exact capabilities depend on VMSS orchestration mode and configuration—verify current VMSS docs.
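As a rough illustration of the fleet pattern, the following sketch wraps an az vmss create call that requests Spot instances. The function name, VM size, instance count, and admin username are assumptions chosen for the example, and defaults such as orchestration mode vary by CLI version, so check az vmss create -h before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: create a small scale set of Spot instances.
# Usage: create_spot_vmss <resource_group> <scale_set_name>
create_spot_vmss() {
  local rg="$1" name="$2"
  # Delete is used as the eviction policy because fleet nodes are disposable.
  az vmss create \
    --resource-group "$rg" --name "$name" \
    --image Ubuntu2204 \
    --vm-sku Standard_D2s_v3 \
    --instance-count 2 \
    --admin-username azureuser \
    --generate-ssh-keys \
    --priority Spot \
    --eviction-policy Delete \
    --max-price -1
}

# Example (requires az login and an existing resource group):
# create_spot_vmss "rg-spotvm-lab" "vmss-spot-lab"
```

Replacement of evicted instances still depends on autoscale or other automation being configured for the scale set.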
Feature 8: Zone and region placement (where supported)
- What it does: Deploys into a specific region and optionally an Availability Zone.
- Why it matters: Placement influences resiliency and capacity availability.
- Practical benefit: Zone-aware designs can reduce correlated failures (but may increase capacity constraints).
- Caveat: Zonal Spot capacity can be more volatile; test across zones/regions.
Feature 9: API/IaC automation support
- What it does: Deploy via Azure portal, ARM/Bicep, Terraform, CLI, PowerShell, SDKs.
- Why it matters: Spot usage often requires automation and rapid replacement.
- Practical benefit: Reproducible, scalable fleets.
- Caveat: Keep tooling versions current; Spot-related flags/fields can differ by API version—verify.
7. Architecture and How It Works
High-level architecture
Azure Spot Virtual Machines are regular Azure VMs with:
– Priority set to Spot
– Eviction policy configured
– Max price (optional)
They run inside your standard Azure infrastructure: VNet + subnet + NSG, using Managed Disks, and are governed by Azure RBAC and Azure Policy.
Control flow (deployment)
- You request a Spot VM via Azure Resource Manager (ARM) (Portal/CLI/IaC).
- Azure checks current unused capacity for the VM size in the selected region/zone.
- If capacity exists and your max price allows it, Azure provisions the VM.
- Billing begins at the Spot rate for the VM while it runs (plus normal charges for disks, IPs, etc.).
Runtime flow (eviction)
- Azure may decide to reclaim capacity.
- A short notification may be delivered via Scheduled Events.
- Based on eviction policy:
– Deallocate: VM transitions to stopped/deallocated.
– Delete: VM is removed.
- Your workload should:
– Save progress (checkpoint),
– Re-queue unfinished work,
– Replace capacity via scale-out or automation.
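The checkpoint-and-resume part of this flow can be sketched as follows. This is a simplified local model, not a production pattern: CHECKPOINT_FILE is a local temp file here (a real worker would write to Blob/ADLS), the work items are just numbers, and the stop_after argument is a made-up way to simulate an eviction partway through.

```shell
#!/usr/bin/env bash
# Process numbered work items, recording the last completed item so that a
# replacement node can resume instead of starting over.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-$(mktemp)}"

# Usage: process_items <total_items> [stop_after]
process_items() {
  local total="$1" stop_after="${2:-0}" last=0 done_now=0 i
  # Resume from the last recorded item, if any.
  if [ -s "$CHECKPOINT_FILE" ]; then
    last="$(cat "$CHECKPOINT_FILE")"
  fi
  i=$(( last + 1 ))
  while [ "$i" -le "$total" ]; do
    # ... real work for item $i would go here ...
    echo "$i" > "$CHECKPOINT_FILE"   # make progress durable after each item
    done_now=$(( done_now + 1 ))
    # stop_after simulates an eviction after N items (0 = run to completion).
    if [ "$stop_after" -gt 0 ] && [ "$done_now" -ge "$stop_after" ]; then
      return 0
    fi
    i=$(( i + 1 ))
  done
}

process_items 10 4                            # "evicted" after 4 items
echo "checkpoint: $(cat "$CHECKPOINT_FILE")"
process_items 10                              # replacement resumes at item 5
echo "checkpoint: $(cat "$CHECKPOINT_FILE")"
```

Checkpointing after every item keeps wasted work to at most one item per eviction, at the cost of more writes; coarser intervals trade the other way.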
Integrations with related services
- VM Scale Sets for elasticity and replacement
- Azure Load Balancer / Application Gateway for distributing traffic (for stateless services)
- Azure Batch for batch scheduling (Spot-like capacity options exist; verify current naming in Batch docs)
- AKS spot node pools for container workloads (Spot nodes)
- Azure Monitor / Log Analytics for metrics/logs/alerts
- Azure Event Grid + Activity Log (pattern) to react to VM lifecycle changes (verify event types)
Dependency services (common)
- Storage (Blob/ADLS) for checkpointing and outputs
- Key Vault for secrets (prefer Managed Identity)
- Managed Identity for credential-less access to Azure resources
- DNS / Private DNS for internal naming (optional)
Security/authentication model
- Management plane: Microsoft Entra ID (formerly Azure AD) + Azure RBAC
- In-VM access to Azure services: Managed Identity recommended
- Network security: NSGs, private subnets, and Azure Firewall/NVA if required
- Governance: Azure Policy, tags, resource locks (use carefully with ephemeral compute)
Networking model
- Spot VMs are placed in your VNet/subnet like any VM.
- You choose public IP or private-only.
- Inbound access typically via:
- Bastion, jump host, private connectivity, or
- Restrictive NSG rules if public IP is used (not recommended for fleets).
Monitoring/logging/governance
- Use Azure Monitor for VM metrics and guest insights (if enabled).
- Use Log Analytics for OS logs and agent-based telemetry where needed.
- Use Activity Log for lifecycle operations (start/stop/deallocate/delete), and consider alerting on unexpected deallocations.
- Apply tagging for cost allocation: env, app, owner, costCenter, workload, lifecycle=spot.
Simple architecture diagram (Mermaid)
flowchart LR
U[Engineer / Pipeline] -->|Portal/CLI/IaC| ARM[Azure Resource Manager]
ARM --> VM[Azure Spot Virtual Machine]
VM --> VNET[VNet/Subnet + NSG]
VM --> DISK[Managed Disks]
VM --> MON[Azure Monitor]
VM --> STG[Blob/ADLS for checkpoints]
Production-style architecture diagram (Mermaid)
flowchart TB
subgraph "Azure Subscription"
subgraph "Resource Group: compute-spot-rg"
ADO[CI/CD or Scheduler] --> ARM["ARM Deployment (Bicep/Terraform)"]
ARM --> VMSS[VM Scale Set - Spot instances]
VMSS --> VNET[VNet/Subnets]
VNET --> NSG[NSG Rules]
VMSS --> MI[Managed Identity]
MI --> KV[Azure Key Vault]
VMSS --> STG["Storage (Blob/ADLS) - checkpoints/artifacts"]
VMSS --> LOG[Azure Monitor + Log Analytics]
VMSS --> LB[Load Balancer / App Gateway]
LB --> APP[Stateless Services / Workers]
ACT[Azure Activity Log] --> ALERT[Alerts/Action Group]
ALERT --> RUN["Automation/Runbook/Function (optional)"]
end
end
8. Prerequisites
Account/subscription requirements
- An active Azure subscription with billing enabled.
- Permission to create:
- Resource groups
- Virtual machines
- Networking resources (VNet/subnet/NSG)
- Public IP (optional)
Permissions / IAM roles
At minimum, one of:
– Contributor on the subscription or target resource group
or a combination granting:
– Microsoft.Compute/* (VM create/stop/start)
– Microsoft.Network/* (NIC, VNet, NSG)
– Microsoft.Storage/* (if using storage)
For least privilege, prefer resource-group scoped roles.
Billing requirements
- Spot VMs bill based on usage at Spot rates for compute, plus normal charges for:
- Managed disks
- Public IP (if any)
- Bandwidth (egress)
- Monitoring/log analytics ingestion
Tools needed (for the lab)
- Azure CLI (current version recommended): https://learn.microsoft.com/cli/azure/install-azure-cli
- An SSH client (OpenSSH)
- Optional: jq locally (or use --query in Azure CLI)
Region availability
- Spot availability depends on region and VM size capacity at the time of deployment.
- Pick a region close to you and be prepared to try an alternate VM size/region if capacity is unavailable.
Quotas/limits
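- Spot VMs draw on a separate Spot vCPU quota: a single regional limit shared across VM families and between VMs and scale-set instances, rather than the per-family quotas used by pay-as-you-go VMs.
- You may need to request a Spot quota increase before deploying larger fleets.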
- Verify quotas in Azure portal: Subscriptions → Usage + quotas.
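Quota can also be checked from the CLI. The sketch below lists compute usage for a region with az vm list-usage and keeps only the header and the rows that mention Spot. The function name is hypothetical, and the text filter is an assumption about how the Spot row is labeled (it may appear as "Low-priority" in the output), so inspect the unfiltered table if nothing matches.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show the Spot-related rows of the regional usage table.
# Usage: show_spot_quota <region>
show_spot_quota() {
  local region="$1"
  # Keep the header row plus any row labeled Spot or Low-priority.
  az vm list-usage --location "$region" --output table \
    | grep -i -E 'name|spot|low-priority'
}

# Example (requires az login):
# show_spot_quota "eastus"
```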
Prerequisite services (optional but recommended)
- Storage account for checkpointing/job output
- Log Analytics workspace for centralized logging (optional)
- Key Vault (optional)
9. Pricing / Cost
Azure Spot Virtual Machines pricing is usage-based and variable depending on region, VM size, OS, and current capacity/demand conditions.
Official pricing page (Spot):
https://azure.microsoft.com/pricing/details/virtual-machines/spot/
Azure pricing calculator:
https://azure.microsoft.com/pricing/calculator/
Pricing dimensions (what you pay for)
- Compute (Spot rate)
– Charged while the VM is running.
– Spot price varies; you can set a max price.
- OS disk and data disks
– Managed disks are billed regardless of VM priority.
– If a Spot VM is deallocated due to eviction, disks may still incur storage charges.
- Networking
– Outbound data transfer (egress) is typically billed.
– Public IP resources may incur charges depending on SKU/usage (verify current rules).
- Operations/monitoring
– Log Analytics ingestion and retention (if used)
– Defender for Cloud (if enabled)
- Licensing
– Windows licensing is included in Windows VM rates; licensing programs (e.g., Azure Hybrid Benefit) may affect cost in some scenarios. Verify applicability for Spot in official docs.
Free tier
- There is no general “free tier” for Spot VMs. Azure has free account credits/trials for eligible new accounts, but Spot itself is not a free service.
Cost drivers (most important)
- VM size and hours running
- Eviction rate (more evictions → more retries → higher total compute consumed)
- Disk footprint (large premium SSDs can dominate cost even if compute is cheap)
- Data egress (moving large outputs out of Azure can be costly)
- Overprovisioning (running too many nodes “just in case”)
- Architecture inefficiency (no checkpointing → wasted work on eviction)
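The interaction between discount and eviction-driven rework can be shown with simple arithmetic: if a fraction w of Spot compute is redone after evictions, the cost of useful work is the Spot rate divided by (1 - w). The sketch below applies that formula. All numbers are illustrative placeholders, not Azure prices, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Effective hourly cost of *useful* work on Spot, given that some fraction of
# compute is redone after evictions. Inputs are illustrative, not real prices.
# Usage: effective_spot_cost <on_demand_rate> <discount_fraction> <wasted_fraction>
# wasted_fraction must be less than 1.
effective_spot_cost() {
  awk -v od="$1" -v d="$2" -v w="$3" \
    'BEGIN { printf "%.4f\n", (od * (1 - d)) / (1 - w) }'
}

effective_spot_cost 0.10 0.70 0.00   # no rework: 0.0300 per useful hour
effective_spot_cost 0.10 0.70 0.20   # 20% rework: 0.0375 per useful hour
```

In this example the workload is still far cheaper than the 0.10 on-demand rate, but the gap narrows as rework grows, which is why checkpointing affects cost and not only completion time.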
Hidden or indirect costs to watch
- Persistent storage costs after eviction (especially with Deallocate policy)
- Rebuild time and CI delays (cost of engineering time and pipeline inefficiency)
- Extra monitoring ingestion from large ephemeral fleets
- NAT Gateway / Firewall costs if used for egress control in large fleets
Network/data transfer implications
- Prefer writing intermediate outputs to same-region storage (Blob/ADLS).
- Avoid frequent large egress across regions or to on-prem unless necessary.
- Consider compressing outputs and batching uploads.
How to optimize cost (practical)
- Use Spot for stateless workers and store state in managed services.
- Checkpoint frequently so evictions waste minimal compute.
- Use autoscaling and right-size VM families.
- Consider multi-size or multi-region strategies to reduce capacity shortages (where operationally acceptable).
- Set max price if budget predictability matters; otherwise, allow paying up to on-demand (verify the exact setting).
- Keep disks lean; prefer smaller OS disks and attach only what you need.
- Use Delete eviction policy for truly ephemeral nodes if you don’t need disk persistence (but verify deletion behavior).
Example low-cost starter estimate (non-numeric)
A minimal lab setup typically includes:
– 1 small Linux Spot VM (compute billed at Spot rate while running)
– 1 OS disk (storage billed while it exists)
– Minimal outbound bandwidth
This can be low-cost for a short-lived lab, but exact costs vary by region/size and Spot price. Use the pricing calculator to estimate your chosen VM size and region.
Example production cost considerations (non-numeric)
In production, cost modeling should include:
– Expected average Spot discount for your chosen sizes
– Eviction probability and its impact on retries
– Storage costs for checkpoints and outputs
– Observability costs (Log Analytics ingestion/retention)
– Network egress costs
– A baseline of on-demand or reserved capacity if you require guaranteed throughput
10. Step-by-Step Hands-On Tutorial
Objective
Deploy an Azure Spot Virtual Machine (Linux) using Azure CLI, configure it for safe operation, and add an eviction-notice handler using Scheduled Events to demonstrate how you would gracefully checkpoint work.
Lab Overview
You will:
1. Create a resource group and networking.
2. Create an Azure Spot Virtual Machine with an eviction policy.
3. Connect via SSH and install a small sample “worker” script.
4. Configure a background service to watch for eviction notices (Scheduled Events).
5. Validate VM settings and confirm the watcher is running.
6. Clean up resources.
This lab is designed to be low-cost and to avoid creating large fleets.
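Important reality check: Real evictions depend on Azure capacity and pricing, so you cannot predict when one will happen. Azure does offer a simulated eviction for testing handlers (for example, az vm simulate-eviction in Azure CLI; verify availability for your CLI version). This lab focuses on correct deployment and correct handling logic rather than on waiting for a real eviction.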
Step 1: Sign in and select your subscription
az login
az account show --output table
# If needed:
az account set --subscription "<SUBSCRIPTION_ID_OR_NAME>"
Expected outcome: Azure CLI is authenticated and pointed to the correct subscription.
Step 2: Create a resource group
Choose a region close to you. If Spot capacity is unavailable later, you may need to switch regions.
REGION="eastus"
RG="rg-spotvm-lab"
az group create -n "$RG" -l "$REGION"
Expected outcome: Resource group is created.
Verify:
az group show -n "$RG" --query "{name:name, location:location}" -o table
Step 3: Create networking (VNet + subnet + NSG rule for SSH)
VNET="vnet-spotvm-lab"
SUBNET="subnet-spotvm-lab"
NSG="nsg-spotvm-lab"
az network vnet create \
-g "$RG" -n "$VNET" \
--address-prefix 10.10.0.0/16 \
--subnet-name "$SUBNET" \
--subnet-prefix 10.10.1.0/24
az network nsg create -g "$RG" -n "$NSG"
# Allow SSH from your public IP only (recommended).
MYIP="$(curl -s https://ifconfig.me)/32"
az network nsg rule create \
-g "$RG" --nsg-name "$NSG" -n "Allow-SSH-MyIP" \
--priority 1000 \
--access Allow --protocol Tcp --direction Inbound \
--source-address-prefixes "$MYIP" --source-port-ranges "*" \
--destination-address-prefixes "*" --destination-port-ranges 22
# Associate NSG to subnet
az network vnet subnet update \
-g "$RG" --vnet-name "$VNET" -n "$SUBNET" \
--network-security-group "$NSG"
Expected outcome: A VNet/subnet exists with SSH allowed only from your IP.
Verify:
az network vnet show -g "$RG" -n "$VNET" --query "{name:name, subnets:subnets[].name}" -o table
az network nsg rule list -g "$RG" --nsg-name "$NSG" -o table
Step 4: Create an Azure Spot Virtual Machine
Pick a small VM size that is eligible for Spot; B-series and promo sizes are not supported as Spot VMs. If provisioning fails due to Spot capacity, try a different size or region.
VM="vm-spot-lab01"
ADMIN="azureuser"
# Create an SSH key locally if you don't have one:
# ssh-keygen -t ed25519 -f ~/.ssh/spotvm_lab -N ""
SSHKEY="$HOME/.ssh/spotvm_lab.pub"
az vm create \
-g "$RG" -n "$VM" \
--image "Ubuntu2204" \
--size "Standard_D2s_v3" \
--admin-username "$ADMIN" \
--ssh-key-values "$SSHKEY" \
--vnet-name "$VNET" \
--subnet "$SUBNET" \
--public-ip-sku Standard \
--priority Spot \
--eviction-policy Deallocate \
--max-price -1
Expected outcome: The VM is created as a Spot VM.
Notes:
– --priority Spot is the key setting.
– --eviction-policy Deallocate keeps the VM resources so you can attempt a restart later.
– --max-price -1 commonly means “pay up to on-demand price”; confirm with az vm create -h or official docs for your CLI version.
Verify Spot properties:
az vm show -g "$RG" -n "$VM" --query "{name:name, priority:priority, evictionPolicy:evictionPolicy}" -o table
Get public IP:
IP="$(az vm show -d -g "$RG" -n "$VM" --query publicIps -o tsv)"
echo "$IP"
Step 5: SSH into the VM and install tools
ssh -i ~/.ssh/spotvm_lab "${ADMIN}@${IP}"
Inside the VM:
sudo apt-get update
sudo apt-get install -y curl jq
Expected outcome: You can connect, and curl/jq are installed.
Step 6: Add an eviction watcher (Scheduled Events)
Azure Scheduled Events are exposed via the Azure Instance Metadata Service (IMDS). For Linux, the Scheduled Events endpoint is commonly reachable at:
http://169.254.169.254/metadata/scheduledevents?api-version=...
Create a simple watcher script that polls scheduled events and writes a checkpoint marker. (For real workloads, you’d checkpoint to Blob/ADLS and stop accepting new work.)
Inside the VM:
sudo tee /usr/local/bin/spot-eviction-watch.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
METADATA="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
HDR="Metadata:true"
log() { echo "$(date -Is) $*" | sudo tee -a /var/log/spot-eviction-watch.log >/dev/null; }
log "Starting spot eviction watcher"
while true; do
# Query scheduled events
RESP="$(curl -sS -H "$HDR" "$METADATA" || true)"
# If the response is empty (e.g., the endpoint is not reachable yet), keep trying
if [[ -z "${RESP}" ]]; then
log "Empty response from Scheduled Events"
sleep 5
continue
fi
# Count Preempt events (the event type used for Spot eviction) and ignore other types.
# Event schema can evolve; verify with official docs.
HAS_EVENTS="$(echo "$RESP" | jq -r '[.Events[]? | select(.EventType == "Preempt")] | length' 2>/dev/null || echo "0")"
if [[ "$HAS_EVENTS" != "0" ]]; then
log "Preempt event received: $RESP"
# Write a simple "checkpoint" marker locally
# In production, upload state to durable storage and stop services gracefully.
sudo bash -c 'echo "$(date -Is) eviction_notice" >> /var/lib/spot-checkpoints.txt'
# Optional: attempt graceful stop actions here (systemctl stop myservice, drain, etc.)
fi
sleep 5
done
EOF
sudo chmod +x /usr/local/bin/spot-eviction-watch.sh
sudo mkdir -p /var/lib
Create a systemd service:
sudo tee /etc/systemd/system/spot-eviction-watch.service >/dev/null <<'EOF'
[Unit]
Description=Azure Spot VM eviction watcher (Scheduled Events)
After=network-online.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/spot-eviction-watch.sh
Restart=always
RestartSec=2
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now spot-eviction-watch.service
Expected outcome: The watcher runs continuously and logs to /var/log/spot-eviction-watch.log.
Verify:
sudo systemctl status spot-eviction-watch.service --no-pager
sudo tail -n 20 /var/log/spot-eviction-watch.log
Step 7: Validate Spot configuration from the Azure side
Exit SSH (or keep it open) and run locally:
az vm show -g "$RG" -n "$VM" --query "{
name:name,
location:location,
vmSize:hardwareProfile.vmSize,
priority:priority,
evictionPolicy:evictionPolicy
}" -o table
Expected outcome: priority shows Spot, and evictionPolicy shows Deallocate.
Validation
Use this checklist:
- VM is Spot
– az vm show ... --query priority returns Spot.
- SSH connectivity works
– You can SSH to the VM.
- Eviction watcher service is running
– systemctl status spot-eviction-watch.service is active.
- Scheduled Events endpoint is reachable
– On the VM:
curl -sS -H "Metadata:true" \
"http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01" | jq
– It should return JSON, typically with an Events array (often empty).
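To exercise the watcher end to end, you can request a simulated eviction from your local shell. The sketch below assumes your Azure CLI version includes az vm simulate-eviction and that RG and VM are set as in the earlier steps. The function name and the 60-second wait are assumptions for illustration, not documented timings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: request a simulated eviction, wait, then print the
# VM's power state.
# Usage: simulate_eviction <resource_group> <vm_name> [wait_seconds]
simulate_eviction() {
  local rg="$1" vm="$2" wait_seconds="${3:-60}"
  az vm simulate-eviction --resource-group "$rg" --name "$vm"
  # The eviction is applied shortly after the request; wait before checking.
  sleep "$wait_seconds"
  az vm get-instance-view --resource-group "$rg" --name "$vm" \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus" \
    --output tsv
}

# Example (run locally, not inside the VM; requires az login):
# simulate_eviction "$RG" "$VM"
```

With the Deallocate policy used in this lab, the VM should end up deallocated, and the watcher log on the OS disk can be inspected after restarting the VM.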
Troubleshooting
Issue: VM creation fails with capacity error
– Try another VM size (e.g., a different D-series size; B-series sizes are not available as Spot).
– Try a different region.
– Remove zone pinning (if you used zones).
– Spot is capacity-driven; failures are normal and should be handled by automation in production.
Issue: SSH times out
– Confirm your NSG rule uses your current public IP.
– Confirm the VM has a public IP.
– Check:
az vm get-instance-view -g "$RG" -n "$VM" --query "instanceView.statuses" -o table
Issue: Scheduled Events endpoint returns nothing or errors
– Ensure you used the required header Metadata:true.
– Ensure curl is installed.
– Verify the API version in official docs if the endpoint schema changes.
Issue: VM gets evicted
– This can happen at any time.
– If eviction policy is Deallocate, try restarting:
az vm start -g "$RG" -n "$VM"
If capacity isn’t available, start may fail—try later or redeploy.
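Because a restart can fail while capacity is short, it helps to wrap it in a retry with exponential backoff. The following is a generic sketch; the function name, attempt count, and delays are illustrative choices rather than recommended values.

```shell
#!/usr/bin/env bash
# Retry a command with exponential backoff.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$(( delay * 2 ))       # double the wait after each failure
    attempt=$(( attempt + 1 ))
  done
}

# Example: retry the restart up to 5 times, starting with a 30-second wait.
# retry_with_backoff 5 30 az vm start -g "$RG" -n "$VM"
```

If all attempts fail, redeploying with a different size or region, as noted above, is usually the more productive next step.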
Cleanup
Delete the whole resource group (recommended):
az group delete -n "$RG" --yes --no-wait
Expected outcome: All lab resources are scheduled for deletion, preventing ongoing charges (especially for disks and public IPs).
11. Best Practices
Architecture best practices
- Design for eviction from day one: assume nodes disappear.
- Prefer fleets over pets:
- Use VM Scale Sets or automated redeployments rather than single VMs.
- Externalize state:
- Store state in managed services (Storage, databases) instead of local disks.
- Checkpoint frequently:
- Make progress durable every N minutes or per task unit.
- Use queue-based work distribution:
- Service Bus / Storage queues with retry + poison queue patterns.
- Consider hybrid capacity:
- Baseline on-demand capacity for guaranteed throughput; add Spot for burst.
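The checkpointing practice above can be sketched as a worker that records the last completed unit after each step and resumes from it on restart. This is a minimal sketch under stated assumptions: do_unit and run_with_checkpoint are hypothetical names, and a local file stands in for durable storage (in a real Spot fleet the checkpoint would live outside the VM, for example in Blob Storage).

```bash
# Hypothetical unit of work; replace with real processing.
do_unit() {
  echo "processed unit $1"
}

# Process units 1..total, resuming after the last unit recorded in the checkpoint file.
# Usage: run_with_checkpoint <checkpoint_file> <total_units>
run_with_checkpoint() {
  local checkpoint_file="$1"
  local total="$2"
  local last=0
  if [ -f "$checkpoint_file" ]; then
    last="$(cat "$checkpoint_file")"
  fi
  local unit=$((last + 1))
  while [ "$unit" -le "$total" ]; do
    do_unit "$unit" || return 1
    # Write to a temp file, then rename, so an interrupted write
    # never leaves a half-written checkpoint behind.
    printf '%s\n' "$unit" > "${checkpoint_file}.tmp"
    mv "${checkpoint_file}.tmp" "$checkpoint_file"
    unit=$((unit + 1))
  done
}
```

Checkpointing per unit keeps the redo cost after an eviction to at most one unit; for very small units, checkpointing every N units trades some rework for fewer writes.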
IAM/security best practices
- Use Managed Identity for VM-to-Azure authentication.
- Use least privilege:
- RBAC scoped to resource groups and specific resources (Storage container, Key Vault secrets).
- Control admin access:
- Prefer Azure Bastion or private access + jump host.
- Restrict SSH/RDP via NSGs to known IPs and use strong key-based auth.
Cost best practices
- Treat Spot as compute cost reduction, not a full solution:
- Disks and egress can still dominate.
- Use Deallocate only when you intend to restart; otherwise Delete can reduce lingering resource costs (verify delete options).
- Right-size and measure:
- Use autoscale and choose VM families that give best $/work unit, not only lowest $/hour.
- Add tags for cost allocation and spot-specific reporting: priority=spot, workload=batch, owner=..., env=...
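As an illustration of the tagging practice, the sketch below applies two of the tags to an existing VM and lists resources by tag. The function names are hypothetical, only the priority and workload tags are set (owner and env values are left to you), and flag behavior should be checked against your Azure CLI version.

```bash
# Add Spot reporting tags to an existing VM without replacing its other tags.
# Usage: tag_spot_vm <resource_group> <vm_name>
tag_spot_vm() {
  local rg="$1"
  local vm="$2"
  az vm update -g "$rg" -n "$vm" --set tags.priority=spot tags.workload=batch
}

# List every resource carrying the priority=spot tag, for cost reviews.
list_spot_resources() {
  az resource list --tag priority=spot -o table
}

# Example (not executed here):
# tag_spot_vm "$RG" "$VM"
# list_spot_resources
```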
Performance best practices
- Prefer parallel-friendly algorithms and distributed job scheduling.
- Avoid heavy warm-up times; use images or pre-baked artifacts to reduce node initialization.
- Use proximity placement groups only when necessary; they can reduce capacity flexibility.
Reliability best practices
- Assume random node loss:
- Use retries with backoff.
- Ensure tasks are idempotent.
- Spread risk:
- Use multiple VM sizes, zones, or regions (where feasible).
- For services, use health checks and load balancers to remove evicted nodes quickly.
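One simple way to make retried tasks idempotent is to record completion under a task ID and skip work that is already marked done. The sketch below assumes a hypothetical run_once helper and uses a local directory of marker files as a stand-in for a durable store shared by the fleet.

```bash
# Run a task at most once per task ID (hypothetical helper).
# Usage: run_once <done_dir> <task_id> <command...>
run_once() {
  local done_dir="$1"
  local task_id="$2"
  shift 2
  if [ -e "$done_dir/$task_id.done" ]; then
    echo "skip $task_id (already completed)"
    return 0
  fi
  # Only mark the task done after the command succeeds,
  # so a task interrupted by eviction is retried later.
  "$@" || return 1
  : > "$done_dir/$task_id.done"
}
```

The marker is written after the work, so a crash between the two steps causes a repeat rather than a skipped task; the task itself should therefore tolerate being run twice.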
Operations best practices
- Monitor:
- VM start failures (capacity)
- Unexpected deallocations
- Queue depth / job completion rates
- Automate replacement:
- VMSS autoscale rules
- Scheduled redeployments
- Maintain golden images and versioning:
- Bake dependencies into VM images for consistent, fast scale-out.
Governance/tagging/naming best practices
- Standard naming: vm-spot-<app>-<env>-<region>-<nn>
- Standard tags: env, app, owner, costCenter, dataClass, lifecycle=ephemeral, priority=spot
- Apply Azure Policy to enforce:
- No public IPs (where required)
- Approved images
- Required tags
12. Security Considerations
Identity and access model
- Management plane access uses Microsoft Entra ID (formerly Azure AD) + Azure RBAC.
- For in-VM access to Azure resources:
- Prefer System-assigned Managed Identity (per-VM) or User-assigned Managed Identity (shared across fleet).
- Avoid embedding credentials in VM images or scripts.
Encryption
- At rest:
- Managed disks support encryption (platform-managed keys by default; customer-managed keys are possible depending on requirements—verify current options).
- Storage for checkpoints (Blob/ADLS) supports encryption.
- In transit:
- Use TLS to storage endpoints and internal services.
Network exposure
- Avoid public IPs for fleets when possible.
- Use NSGs with least exposure:
- Restrict SSH/RDP to known admin IPs or via Bastion.
- Consider private endpoints/private networking for Storage/Key Vault when required.
Secrets handling
- Use Key Vault + Managed Identity.
- If you must use secrets, rotate them and keep them out of logs.
- Never bake long-lived secrets into VM images.
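The Key Vault plus Managed Identity pattern can be sketched from inside the VM with Azure CLI. This is an assumption-laden example: fetch_secret is a hypothetical helper, the vault and secret names are placeholders, Azure CLI must be installed on the VM, and the VM's identity must already have permission to read the secret.

```bash
# Read a secret at runtime using the VM's managed identity (no stored credentials).
# Usage: fetch_secret <vault_name> <secret_name>
fetch_secret() {
  local vault_name="$1"
  local secret_name="$2"
  # Sign in as the VM's managed identity; suppress the account listing.
  az login --identity --output none || return 1
  # Print only the secret value.
  az keyvault secret show \
    --vault-name "$vault_name" \
    --name "$secret_name" \
    --query value -o tsv
}

# Example (not executed here; names are placeholders):
# DB_PASSWORD="$(fetch_secret my-vault db-password)"
```

Keep the returned value in memory or an environment variable and avoid echoing it, so it stays out of logs.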
Audit/logging
- Use:
- Azure Activity Log for management operations
- Azure Monitor / Log Analytics for guest logs (agent-based)
- Alert on:
- Unexpected VM deletions/deallocations
- Role assignment changes
- NSG changes opening inbound access
Compliance considerations
- Data residency: choose regions carefully.
- Data classification: ensure workloads running on Spot still follow the same compliance controls (encryption, access logging, retention).
- If using Spot for sensitive workloads, ensure:
- Full disk encryption posture meets requirements
- Hardened images and patching strategy are defined
Common security mistakes
- Opening SSH/RDP to the internet broadly (0.0.0.0/0).
- Using shared admin passwords across ephemeral nodes.
- Putting secrets in cloud-init scripts without secure retrieval.
- Assuming evicted nodes are fully wiped immediately (design as if disks may persist if deallocated).
Secure deployment recommendations
- Private subnets + Bastion/jump host
- Managed Identity everywhere
- Key Vault for secrets/certs
- Minimal inbound rules; outbound control if required (Firewall/NAT policies)
- Golden images with CIS-aligned hardening (where applicable)
13. Limitations and Gotchas
- No capacity guarantee: Spot VMs may fail to allocate at creation time.
- Evictions can happen anytime: design for sudden loss; do not rely solely on graceful shutdown.
- Short eviction notice: often ~30 seconds via Scheduled Events; not enough for large data flushes.
- Region/size volatility: some VM sizes may frequently be unavailable as Spot.
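- Quotas still apply: Spot has its own vCPU quota pool, separate from the quota for regular VMs and shared between Spot VMs and Spot scale set instances; check it and request an increase before scaling out.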
- Persistent resource costs:
- If evicted and deallocated, disks and other resources can still accrue costs.
- Stateful workloads are risky:
- Databases and single-instance services are poor fits unless architected for interruption and replication.
- Price can change:
- Spot price varies; max price can protect you but can also increase evictions.
- Operational complexity:
- Requires retry logic, checkpointing, and fleet management.
- Zonal constraints:
- Pinning to a zone can reduce available capacity, increasing allocation failures.
- Tooling differences:
- Fields/flags can differ by API version and tooling; verify with the latest Azure CLI/ARM/Bicep/Terraform docs.
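The short eviction notice listed above arrives through Scheduled Events, where a Spot eviction appears as an event whose EventType is Preempt. The sketch below polls the endpoint and reacts to that event. The function names are hypothetical, the 5-second poll interval is an assumption, and the grep match is a deliberately simple stand-in for proper JSON parsing (for example with jq).

```bash
IMDS_EVENTS_URL="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"

# Fetch the current Scheduled Events document (only works from inside an Azure VM).
fetch_scheduled_events() {
  curl -sS --max-time 5 -H "Metadata:true" "$IMDS_EVENTS_URL"
}

# Succeed when the JSON on stdin contains an event of type Preempt.
has_preempt_event() {
  grep -Eq '"EventType"[[:space:]]*:[[:space:]]*"Preempt"'
}

# Poll until a Preempt event shows up, then return so the caller can checkpoint and drain.
# Usage: watch_for_eviction [interval_seconds]
watch_for_eviction() {
  local interval="${1:-5}"
  while true; do
    if fetch_scheduled_events | has_preempt_event; then
      echo "Preempt event detected; checkpoint and drain now"
      return 0
    fi
    sleep "$interval"
  done
}

# Example (not executed here):
# watch_for_eviction 5 && systemctl stop my-worker.service   # service name is a placeholder
```

With roughly 30 seconds of notice, the poll interval plus the shutdown work must fit inside that window, which is why the handler should only flush small state and stop taking new work.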
14. Comparison with Alternatives
Alternatives in Azure
- Regular Azure Virtual Machines (pay-as-you-go): stable capacity, no eviction.
- Reserved Instances / Savings Plans: cost reduction for steady usage without eviction risk (verify current product specifics and applicability by VM type).
- Azure VM Scale Sets (on-demand): fleet management without Spot interruptions.
- Azure Batch: managed job scheduling; can use lower-cost capacity options (verify current terminology and options).
- AKS with Spot node pools: Spot benefits for Kubernetes worker nodes (still eviction-prone).
Alternatives in other clouds
- AWS EC2 Spot Instances
- Google Cloud Spot VMs (previously “Preemptible VMs” terminology in older materials—verify current naming)
Open-source/self-managed alternatives
- Run workloads on your own hardware with a preemptible model (e.g., Kubernetes with cluster autoscaler + mixed node types), but you lose the cloud provider’s elastic spare capacity economics.
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Azure Spot Virtual Machines | Interruptible compute, batch, burst scale-out | Deep discounts, same VM features, good for fleets | Evictions, no capacity guarantee, more engineering | Workloads can retry/checkpoint and tolerate loss |
| Azure Virtual Machines (pay-as-you-go) | Always-on services, steady workloads | Stable, predictable, no eviction | Higher cost | You need reliability and predictable capacity |
| Reserved Instances / Savings Plans (Azure) | Steady baseline usage | Lower cost without interruption | Commitment/term, less flexible | You have predictable long-running compute needs |
| Azure VM Scale Sets (on-demand) | Scalable services without interruptions | Autoscale, self-healing | Higher cost than Spot | You need scale but can’t tolerate eviction |
| Azure Batch | Managed batch scheduling | Job orchestration, pools, scheduling features | Learning curve, service-specific model | You want a managed batch platform and queues/pools |
| AKS Spot node pools | Kubernetes worker capacity for stateless pods | Integrates with K8s scheduling, cost reduction | Node churn, pod disruption handling required | You run Kubernetes and can tolerate node interruptions |
| AWS EC2 Spot / GCP Spot VMs | Cross-cloud equivalents | Similar economics and patterns | Different APIs, behaviors, pricing | Multi-cloud strategy or existing footprint elsewhere |
15. Real-World Example
Enterprise example: Risk simulations with checkpointing
- Problem: A financial services company runs nightly Monte Carlo simulations. On-demand compute is expensive, and windows are tight.
- Proposed architecture:
- Job scheduler submits simulation batches to a queue.
- A VM Scale Set of Azure Spot Virtual Machines pulls tasks.
- Each simulation writes checkpoints every few minutes to Azure Blob Storage.
- Results are aggregated into a durable data store.
- Monitoring alerts on queue depth and job completion SLA.
- Why Azure Spot Virtual Machines was chosen:
- Large cost reduction for massively parallel compute.
- Evictions are acceptable because simulations are restartable from checkpoints.
- Expected outcomes:
- Significant reduction in compute spend.
- Ability to run more simulations or finish earlier.
- Resilience to evictions via retries and checkpointing.
Startup/small-team example: CI runners for monorepo builds
- Problem: A startup’s monorepo builds consume many CPU hours. Hosted CI is costly and slow at peak times.
- Proposed architecture:
- Self-hosted runners on Spot VMs (single VMSS).
- Jobs are short-lived and retryable; artifacts stored in remote storage.
- Baseline small on-demand runner pool for guaranteed minimum throughput; Spot adds burst.
- Why Azure Spot Virtual Machines was chosen:
- Major CI cost reduction.
- Build jobs are naturally retryable; interruptions are tolerable.
- Expected outcomes:
- Lower CI spend and faster build throughput during spikes.
- Automated scaling without paying for idle capacity.
16. FAQ
- What is an Azure Spot Virtual Machine?
  A standard Azure VM configured to run on unused Azure capacity at a discounted rate, with the possibility of eviction when Azure needs the capacity back or when your max price is exceeded.
- Can Spot VMs be used for production?
  Yes, for production workloads that are designed to handle interruptions (stateless services, workers, batch jobs). Avoid using Spot for critical stateful single-instance systems.
- What does eviction mean in practice?
  Your VM can be stopped (deallocated) or deleted depending on policy. Your application must expect node loss and recover via retries, rescheduling, or autoscaling.
- How much cheaper are Spot VMs?
  Discounts can be significant, but exact savings vary by region, VM size, and demand. Use the official pricing page and compare to pay-as-you-go for your target SKU.
- Do Spot VMs have an SLA?
  No. Spot VMs carry no availability SLA, and capacity is not guaranteed. Review the official SLA documentation and service terms for details.
- What is the difference between Deallocate and Delete eviction policies?
  Deallocate stops the VM and releases compute, typically keeping attached resources so you can restart later. Delete removes the VM; what happens to disks/IPs depends on delete options, so verify before relying on it.
- How do I get notified of an eviction?
  Use Azure Scheduled Events from inside the VM to detect upcoming eviction notices and trigger graceful shutdown/checkpointing.
- Can I prevent eviction?
  No. You can only reduce impact through architecture (checkpointing, retries) and reduce likelihood by choosing different sizes/regions or using hybrid capacity.
- What is “max price” and how should I set it?
  Max price caps what you’re willing to pay. If the Spot price exceeds it, the VM can be evicted. For fewer price-based evictions, set it higher, or set it to -1 so that the VM is not evicted for price reasons and is billed at the current Spot price, capped at the pay-as-you-go rate. Always verify semantics in official docs.
- Do reservations or savings plans apply to Spot VMs?
  Typically Spot is billed at Spot rates and doesn’t stack with some commitment discounts. Verify current Azure billing rules for reservations/savings plans vs Spot.
- Can I use Spot VMs with VM Scale Sets?
  Yes. VM Scale Sets are a common way to run Spot fleets with autoscaling and replacement of evicted instances.
- Can Spot VMs run Windows?
  Yes, depending on VM size/region availability. Costs and licensing differ, so verify pricing for Windows Spot VMs.
- What happens to my data when a Spot VM is evicted?
  Data on the VM’s temporary/local disk may be lost. Data on managed disks persists if the disks remain. Design for durability by storing state externally.
- How do I design workloads for Spot?
  Use idempotent tasks, queues, checkpointing, retries, and avoid single points of failure. Treat nodes as disposable.
- What’s the fastest way to start using Spot safely?
  Start with non-critical batch jobs or CI runners, add checkpointing/retries, and expand gradually. Monitor eviction behavior and tune VM sizes/regions.
- Can I combine Spot and on-demand in the same architecture?
  Yes. A common approach is baseline on-demand capacity plus Spot for burst. This improves reliability while still saving money.
- Are GPUs available as Spot?
  Sometimes, depending on region and capacity. GPU Spot is highly capacity-sensitive; be prepared for allocation failures and higher eviction risk.
17. Top Online Resources to Learn Azure Spot Virtual Machines
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Spot Virtual Machines (Azure VMs) – https://learn.microsoft.com/azure/virtual-machines/spot-vms | Core concepts, configuration options, eviction, max price, guidance |
| Official pricing | Spot pricing – https://azure.microsoft.com/pricing/details/virtual-machines/spot/ | Understand the Spot pricing model and constraints |
| Pricing calculator | Azure Pricing Calculator – https://azure.microsoft.com/pricing/calculator/ | Estimate end-to-end costs including disks, bandwidth, monitoring |
| Official docs (eviction handling) | Azure Scheduled Events (Linux) – https://learn.microsoft.com/azure/virtual-machines/linux/scheduled-events | How to detect eviction notices and other events from inside the VM |
| Official docs (VM fundamentals) | Azure Virtual Machines documentation – https://learn.microsoft.com/azure/virtual-machines/ | VM networking, disks, images, and operational guidance |
| Official docs (fleet mgmt) | Virtual Machine Scale Sets – https://learn.microsoft.com/azure/virtual-machine-scale-sets/ | Autoscale and manage Spot fleets more safely |
| Official architecture | Azure Architecture Center – https://learn.microsoft.com/azure/architecture/ | Patterns for resilient, scalable architectures |
| Official learning modules | Microsoft Learn (Azure Virtual Machines learning paths) – https://learn.microsoft.com/training/browse/?products=azure-virtual-machines | Structured training; search for VM/scale set modules |
| Official samples | Azure Quickstart Templates – https://github.com/Azure/azure-quickstart-templates | Many VM/VMSS templates; look for Spot/priority examples |
| Community (reputable) | Azure Friday / Microsoft Azure YouTube – https://www.youtube.com/@MicrosoftAzure | Videos and demos; verify details against current docs |
18. Training and Certification Providers
- DevOpsSchool.com
  – Suitable audience: DevOps engineers, SREs, cloud engineers, platform teams
  – Likely learning focus: Azure DevOps, cloud operations, automation, IaC, CI/CD (check course listings for Spot/VM/compute coverage)
  – Mode: Check website
  – Website URL: https://www.devopsschool.com/
- ScmGalaxy.com
  – Suitable audience: DevOps and SCM learners, build/release engineers
  – Likely learning focus: SCM, CI/CD pipelines, DevOps foundations (check for Azure modules)
  – Mode: Check website
  – Website URL: https://www.scmgalaxy.com/
- CLoudOpsNow.in
  – Suitable audience: Cloud operations practitioners, administrators, SREs
  – Likely learning focus: Cloud operations, monitoring, reliability, cost awareness (verify available Azure content)
  – Mode: Check website
  – Website URL: https://www.cloudopsnow.in/
- SreSchool.com
  – Suitable audience: SREs, operations engineers, platform engineers
  – Likely learning focus: Reliability engineering, incident response, observability, scalability (Spot fits into reliability/cost tradeoffs)
  – Mode: Check website
  – Website URL: https://www.sreschool.com/
- AiOpsSchool.com
  – Suitable audience: SRE/ops teams adopting AIOps practices
  – Likely learning focus: Monitoring, automation, event correlation (useful for operating ephemeral Spot fleets)
  – Mode: Check website
  – Website URL: https://www.aiopsschool.com/
19. Top Trainers
- RajeshKumar.xyz
  – Likely specialization: DevOps/cloud training content (verify current offerings)
  – Suitable audience: Engineers seeking practical DevOps/cloud guidance
  – Website URL: https://www.rajeshkumar.xyz/
- devopstrainer.in
  – Likely specialization: DevOps tools and cloud-focused training (verify Azure coverage)
  – Suitable audience: Beginners to intermediate DevOps engineers
  – Website URL: https://www.devopstrainer.in/
- devopsfreelancer.com
  – Likely specialization: DevOps consulting/training platform content (verify services)
  – Suitable audience: Teams seeking external DevOps expertise and coaching
  – Website URL: https://www.devopsfreelancer.com/
- devopssupport.in
  – Likely specialization: DevOps support and training resources (verify Azure offerings)
  – Suitable audience: Operations/DevOps teams needing guided troubleshooting support
  – Website URL: https://www.devopssupport.in/
20. Top Consulting Companies
- cotocus.com
  – Likely service area: Cloud/DevOps consulting (verify exact service catalog)
  – Where they may help: Designing scalable compute platforms, cost optimization, automation, governance
  – Consulting use case examples: Spot-based CI runner platform; batch worker fleet with checkpointing; VMSS autoscaling and monitoring setup
  – Website URL: https://www.cotocus.com/
- DevOpsSchool.com
  – Likely service area: DevOps/cloud consulting and enablement (verify current offerings)
  – Where they may help: DevOps transformations, CI/CD, cloud operations, training-to-implementation engagements
  – Consulting use case examples: IaC modules for Spot/VMSS; operational runbooks for evictions; cost governance tagging strategy
  – Website URL: https://www.devopsschool.com/
- DEVOPSCONSULTING.IN
  – Likely service area: DevOps consulting services (verify scope and regions served)
  – Where they may help: Cloud automation, infrastructure reliability, monitoring and support
  – Consulting use case examples: Building interruption-tolerant worker platforms; integrating alerts for Spot eviction events; platform hardening and access controls
  – Website URL: https://www.devopsconsulting.in/
21. Career and Learning Roadmap
What to learn before Azure Spot Virtual Machines
- Azure fundamentals:
- Subscriptions, resource groups, regions, availability zones
- Azure Virtual Machines basics:
- VM images, sizing, disks, networking (VNet/subnet/NSG), VM lifecycle
- Security basics:
- Microsoft Entra ID (formerly Azure AD), RBAC, Managed Identity, Key Vault
- Operations basics:
- Azure Monitor, logs/metrics, alerting, Activity Log
- Reliability basics:
- Stateless design, retries, backoff, idempotency, checkpointing
- IaC fundamentals:
- ARM/Bicep or Terraform basics; Azure CLI
What to learn after
- VM Scale Sets advanced patterns:
- Autoscale rules, health probes, rolling upgrades
- Kubernetes Spot patterns:
- AKS spot node pools, pod disruption budgets, cluster autoscaler
- Batch orchestration:
- Azure Batch, queue-driven processing, workflow engines
- FinOps:
- Cost allocation with tags, budgets, anomaly detection, unit-cost models
- Security hardening:
- Golden images, vulnerability scanning, policy-as-code
Job roles that use it
- Cloud engineer / Cloud administrator
- Solutions architect
- DevOps engineer
- SRE / Platform engineer
- Data engineer (batch platforms)
- FinOps analyst (advisory + governance)
Certification path (Azure)
Azure certifications change over time; verify current offerings on Microsoft Learn. Commonly relevant certifications include:
- AZ-900 (Azure Fundamentals)
- AZ-104 (Azure Administrator)
- AZ-305 (Azure Solutions Architect)
Also consider DevOps-focused certifications depending on your role (verify current exam codes and requirements).
Project ideas for practice
- Build a queue-based worker system (Service Bus + Spot VMSS) with checkpointing to Blob.
- Create a CI runner fleet on Spot with autoscaling and job retry logic.
- Implement an eviction-aware daemon that drains work, uploads state, and terminates gracefully.
- Do a cost/performance benchmark comparing on-demand vs Spot for a batch workload and report unit cost ($ per 1,000 tasks).
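For the benchmark idea, the unit cost is the compute spend divided by completed tasks, scaled to 1,000 tasks. The helper below is a sketch with a hypothetical name; it covers compute only, so disk, egress, and monitoring costs would need to be added separately.

```bash
# Compute cost per 1,000 completed tasks.
# Usage: unit_cost_per_1000 <hourly_price> <vm_hours> <tasks_completed>
unit_cost_per_1000() {
  awk -v price="$1" -v hours="$2" -v tasks="$3" \
    'BEGIN {
       if (tasks <= 0) exit 1          # avoid dividing by zero
       printf "%.2f\n", price * hours / tasks * 1000
     }'
}

# Made-up numbers for illustration only (not real Azure prices):
# on-demand at 0.10/hour for 50 VM-hours, Spot at 0.02/hour for 80 VM-hours
# (more hours because evicted work is redone), 20,000 tasks each.
unit_cost_per_1000 0.10 50 20000   # prints 0.25
unit_cost_per_1000 0.02 80 20000   # prints 0.08
```

Counting VM-hours actually billed, including time spent on work that was later lost to eviction, is what makes the comparison honest.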
22. Glossary
- Azure Spot Virtual Machines: Azure VMs running on unused capacity with variable price and eviction risk.
- Eviction: Azure reclaiming capacity; the Spot VM is stopped/deallocated or deleted.
- Eviction policy: Setting that controls what happens to the VM on eviction (e.g., Deallocate or Delete).
- Max price: The maximum price you’re willing to pay for a Spot VM; exceeding it may cause eviction.
- VM Scale Set (VMSS): A service for running and managing a set of identical VMs with autoscaling.
- Checkpointing: Periodically saving progress to durable storage so work can resume after interruption.
- Idempotent task: A task that can be retried without causing inconsistent results.
- Scheduled Events: In-VM metadata API for upcoming platform events like eviction notices.
- IMDS (Instance Metadata Service): Metadata endpoint reachable from inside Azure VMs (169.254.169.254).
- NSG (Network Security Group): A set of allow/deny traffic rules applied to subnets or NICs (distinct from the Azure Firewall service).
- Managed Identity: Azure feature providing an identity for Azure resources to access other services without secrets.
- Activity Log: Azure subscription-level log for management operations.
- Egress: Outbound network traffic from Azure to the internet or other regions.
23. Summary
Azure Spot Virtual Machines is an Azure Compute option that runs standard Azure VMs on unused capacity at discounted rates, with the critical tradeoff that VMs can be evicted with short notice. It matters because it can dramatically reduce compute costs and enable large-scale parallel processing—when your workload is designed for interruption.
Architecturally, Spot works best for stateless and fault-tolerant workloads, supported by queues, retries, and checkpointing, and usually managed as a fleet (often via VM Scale Sets) rather than individual long-lived servers. From a cost perspective, the biggest savings are on compute, while disks, networking egress, and monitoring can still be meaningful; from a security perspective, Spot uses the same RBAC, networking, and encryption controls as normal VMs, so you should apply standard Azure hardening and governance.
Use Azure Spot Virtual Machines when you can tolerate interruption and want the lowest-cost compute for batch, CI, testing, and scalable workers. Next, deepen your skills by implementing Spot fleets with VM Scale Sets, adding eviction-aware shutdown logic using Scheduled Events, and building a unit-cost model to quantify savings vs interruption overhead.