{"id":387,"date":"2026-04-13T21:28:14","date_gmt":"2026-04-13T21:28:14","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/azure-compute-fleet-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-compute\/"},"modified":"2026-04-13T21:28:14","modified_gmt":"2026-04-13T21:28:14","slug":"azure-compute-fleet-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-compute","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/azure-compute-fleet-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-compute\/","title":{"rendered":"Azure Compute Fleet Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Compute"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>Compute<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Azure Compute Fleet is an Azure Compute service designed to help you request and manage a \u201cfleet\u201d of virtual machine (VM) capacity using a single construct, instead of manually coordinating many individual VMs or multiple scaling groups.<\/p>\n\n\n\n<p>In simple terms: you describe the VM configuration you want and the amount of capacity you need, and Azure Compute Fleet helps place that capacity across eligible VM sizes and (where supported) zones, while you keep control over cost and availability trade-offs.<\/p>\n\n\n\n<p>In technical terms: Azure Compute Fleet acts as a control-plane resource in Azure that orchestrates provisioning and lifecycle of a set of VM instances according to a VM profile and allocation strategy. 
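<\/p>\n\n\n\n<p>To make the idea of intent more concrete, the Python sketch below models a fleet request as a target instance count plus a ranked list of acceptable VM sizes. It then fills the target greedily from whatever capacity each size has. This is a conceptual illustration under stated assumptions. The <code>FleetIntent<\/code> class, the <code>place<\/code> function, and the availability numbers are hypothetical, they do not reflect the Azure Compute Fleet API, and the real allocation strategies may behave differently.<\/p>

```python
from dataclasses import dataclass, field


@dataclass
class FleetIntent:
    # Hypothetical model of fleet intent (not the Azure API): a target
    # instance count plus acceptable VM sizes, most preferred first.
    target_instances: int
    vm_sizes: list = field(default_factory=list)


def place(intent, available):
    # Greedy fill: take as many instances as possible from each acceptable
    # size, in preference order, until the target is met or sizes run out.
    placement = {}
    remaining = intent.target_instances
    for size in intent.vm_sizes:
        if remaining == 0:
            break
        take = min(remaining, available.get(size, 0))
        if take:
            placement[size] = take
            remaining -= take
    # remaining is the shortfall that could not be placed.
    return placement, remaining


intent = FleetIntent(
    target_instances=10,
    vm_sizes=['Standard_D4s_v5', 'Standard_D4as_v5', 'Standard_D4s_v4'],
)
# Made-up spare capacity per size at request time.
available = {'Standard_D4s_v5': 6, 'Standard_D4as_v5': 2, 'Standard_D4s_v4': 8}
placement, shortfall = place(intent, available)
print(placement, shortfall)
```

<p>In Azure Compute Fleet, the service makes the equivalent decisions according to your allocation strategy, so you do not maintain loops like this yourself. 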
It\u2019s commonly considered alongside VM Scale Sets, Spot VMs, and large-scale compute patterns (batch, render, HPC, large CI runners), where acquiring and placing capacity matters more than tending individual, hand-managed VMs (\u201cpets\u201d).<\/p>\n\n\n\n<p>The problem it solves is \u201ccapacity orchestration at scale\u201d: getting the compute capacity you need (often quickly, sometimes opportunistically at lower cost) while reducing operational overhead such as SKU selection, retries when a size is unavailable, and managing heterogeneous capacity pools.<\/p>\n\n\n\n<blockquote>\n<p>Service status note: Azure services evolve quickly and some features may be in preview or have region\/SKU constraints. Verify the current GA\/preview status, supported regions, and API versions in the official Microsoft Learn documentation for <strong>Azure Compute Fleet<\/strong> before production use.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Azure Compute Fleet?<\/h2>\n\n\n\n<p><strong>Official purpose (practical framing)<\/strong><br\/>\nAzure Compute Fleet is intended to simplify how you acquire and manage a large set of VM compute resources. You express your desired capacity and acceptable options, and Azure places instances across available capacity pools.<\/p>\n\n\n\n<p><strong>Core capabilities (what it generally enables)<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Define a fleet-level desired capacity (number of instances, vCPU capacity, or a similar target, depending on the service\u2019s current model; <strong>verify in official docs<\/strong>).<\/li>\n<li>Provide a VM configuration\/profile (image, OS disk, data disks, networking, identity, extensions\/cloud-init).<\/li>\n<li>Optionally allow flexibility across VM sizes\/series to improve placement success when a specific SKU is constrained (<strong>verify the exact flexibility mechanism in docs<\/strong>).<\/li>\n<li>Operate the lifecycle of the compute as a unit: create, scale, replace, and delete.<\/li>\n<\/ul>\n\n\n\n<p><strong>Major
components (conceptual)<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Fleet resource<\/strong>: the top-level object you manage (create\/update\/delete).<\/li>\n<li><strong>VM profile<\/strong>: the specification for what each instance should look like (image, size options, network, identity, extensions).<\/li>\n<li><strong>Allocation strategy<\/strong>: rules and inputs that guide how Azure chooses among sizes, zones, and purchase options (the exact settings depend on the current API; <strong>verify<\/strong>).<\/li>\n<li><strong>Underlying compute instances<\/strong>: the actual VM instances created and billed (plus disks, NICs, IPs, etc.).<\/li>\n<\/ul>\n\n\n\n<p><strong>Service type<\/strong><br\/>\nControl-plane orchestration for VM-based compute capacity (not a PaaS runtime like Azure App Service, and not a container scheduler like AKS).<\/p>\n\n\n\n<p><strong>Scope (regional\/global, subscription, etc.)<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Typically <strong>regional<\/strong>, because VM capacity and placement are regional and often zonal. Expect fleet resources to be created in a region and to optionally distribute instances across zones where available (<strong>verify zonal behavior per region\/SKU<\/strong>).<\/li>\n<li>Managed within an <strong>Azure subscription<\/strong> and <strong>resource group<\/strong>, and governed by Azure RBAC, Azure Policy, and tagging, like other Azure Compute resources.<\/li>\n<\/ul>\n\n\n\n<p><strong>How it fits into the Azure ecosystem<\/strong><br\/>\nAzure Compute Fleet belongs to the Azure Compute family and works alongside these services:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Azure Virtual Machines<\/strong> (the underlying compute instances)<\/li>\n<li><strong>Azure Virtual Machine Scale Sets<\/strong> (autoscaling and uniform\/flexible instance groups)<\/li>\n<li><strong>Azure Spot Virtual Machines<\/strong> (discounted spare capacity with eviction risk)<\/li>\n<li><strong>Azure Batch<\/strong> (job-queue and task scheduling on pools of VMs)<\/li>\n<li><strong>Azure Monitor<\/strong> (metrics\/logs\/alerts for the fleet and underlying VMs)<\/li>\n<li><strong>Azure Virtual Network<\/strong> (networking for private compute)<\/li>\n<\/ul>\n\n\n\n<p>Compute Fleet is most useful when you want \u201ca lot of VMs, quickly, with placement flexibility,\u201d without building custom logic to manage capacity across many SKUs or zones.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Azure Compute Fleet?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster time to capacity<\/strong>: reduce delays caused by SKU shortages by accepting several VM sizes (where supported).<\/li>\n<li><strong>Cost control options<\/strong>: pair with Spot where appropriate, and constrain the fleet to approved SKUs\/regions to meet budget and compliance needs.<\/li>\n<li><strong>Simplified procurement\/operations<\/strong>: one object to manage instead of many individual instances or scripts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Capacity orchestration<\/strong>: request capacity with constraints and preferences instead of hard-coding one VM size.<\/li>\n<li><strong>Consistency<\/strong>: ensure all instances share a baseline VM profile (image, networking, identity).<\/li>\n<li><strong>Integration with Infrastructure as Code (IaC)<\/strong>: model fleet resources with ARM\/Bicep\/Terraform where supported (<strong>verify provider support in your tooling<\/strong>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Reduced toil<\/strong>: fewer ad-hoc scripts for \u201cretry in another SKU\/zone.\u201d<\/li>\n<li><strong>Fleet-level lifecycle<\/strong>: scale and delete a group as one unit, which simplifies environment management (for example, ephemeral dev\/test fleets).<\/li>\n<li><strong>Standard governance<\/strong>: apply tags, locks, Azure Policy, and RBAC as you would for other Azure resources. Locks and RBAC assignments at resource-group scope also cover resources the fleet creates in that resource group. Tags, however, are not inherited automatically (<strong>verify how the fleet propagates tags<\/strong>, and use Azure Policy to enforce them if needed).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Private networking<\/strong>: place instances in VNets\/subnets and avoid public IPs.<\/li>\n<li><strong>Central IAM 
control<\/strong>: use Azure RBAC and Managed Identities for access to Azure resources.<\/li>\n<li><strong>Auditing<\/strong>: fleet operations are recorded in the Azure Activity Log, and events for the underlying VMs appear in their logs and metrics.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Horizontal scaling<\/strong>: add capacity by increasing the fleet\u2019s target capacity.<\/li>\n<li><strong>Heterogeneous capacity<\/strong>: mix sizes (where supported) to meet performance and availability objectives.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Azure Compute Fleet<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You need <strong>large VM capacity<\/strong> for batch-like workloads, CI runners, render farms, simulations, data processing, or ephemeral environments.<\/li>\n<li>You can tolerate <strong>some heterogeneity<\/strong> in VM sizes\/series (or you explicitly want it).<\/li>\n<li>You care about <strong>placement success<\/strong> and operational simplicity more than pinning to one exact SKU.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You need <strong>strictly one VM size<\/strong> with deterministic placement, and you already have stable capacity through VMSS or Capacity Reservation.<\/li>\n<li>Your workload is <strong>container-native<\/strong> and better served by AKS node pools with the cluster autoscaler or Karpenter-style node provisioning.<\/li>\n<li>You require <strong>advanced job scheduling\/queue semantics<\/strong> (Azure Batch may be a better fit).<\/li>\n<li>You require <strong>stateful pets<\/strong> that need manual care (consider individually managed VMs with tight change control).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Azure Compute Fleet used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Media &amp; entertainment (render, transcode)<\/li>\n<li>Financial services (risk simulations, Monte Carlo)<\/li>\n<li>Manufacturing (CAE simulation)<\/li>\n<li>Healthcare\/life sciences (genomics pipelines)<\/li>\n<li>Software\/SaaS (CI runners, test environments)<\/li>\n<li>Gaming (build farms, backend batch processing)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Platform engineering teams providing \u201ccompute-on-demand\u201d<\/li>\n<li>DevOps\/SRE teams scaling build\/test capacity<\/li>\n<li>Data engineering teams running VM-based ETL or legacy processing<\/li>\n<li>HPC teams running VM-based clusters (sometimes with Slurm\/CycleCloud patterns)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Embarrassingly parallel batch tasks<\/li>\n<li>Short-lived compute bursts<\/li>\n<li>Workloads tolerant to variable capacity (especially with Spot)<\/li>\n<li>VM-based runners where containerization is not feasible<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Event-driven batch pipelines (queue triggers compute)<\/li>\n<li>Periodic scheduled compute bursts (nightly builds)<\/li>\n<li>Multi-zone compute tiers behind a load balancer (stateless services)<\/li>\n<li>Private compute in hub-and-spoke VNets<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production<\/strong>: stateless compute tiers, resilient pipelines, cost-optimized Spot pools with fallback.<\/li>\n<li><strong>Dev\/test<\/strong>: ephemeral fleets spun up per branch\/test suite.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. 
Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic scenarios where Azure Compute Fleet commonly fits. Exact feature fit depends on current API surface\u2014verify details in official docs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) CI\/CD build runner burst capacity<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Your hosted runners can\u2019t keep up during peak hours; self-hosted runners require fast scaling.<\/li>\n<li><strong>Why this service fits:<\/strong> Fleet-style provisioning can acquire many VM instances quickly and scale down afterward.<\/li>\n<li><strong>Example:<\/strong> A platform team spins up a fleet of Ubuntu runners during business hours and scales to zero at night.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Rendering farm for animation frames<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Rendering needs thousands of CPU-hours, but only for short periods.<\/li>\n<li><strong>Why this service fits:<\/strong> Large horizontal scale, cost optimization with flexible SKUs and optional Spot.<\/li>\n<li><strong>Example:<\/strong> A studio requests a fleet sized to 500 vCPUs across multiple VM sizes approved for rendering.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Monte Carlo risk simulations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Need many identical jobs; speed matters; capacity availability is variable.<\/li>\n<li><strong>Why this service fits:<\/strong> Fleet lets you request compute in bulk and tolerate multiple VM sizes.<\/li>\n<li><strong>Example:<\/strong> A bank runs nightly simulations; if one SKU is constrained, fleet can place on alternatives (where supported).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Large-scale integration testing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Running integration tests requires many isolated VMs for parallelism.<\/li>\n<li><strong>Why this 
service fits:<\/strong> Fast provisioning + consistent VM profile (image\/agent).<\/li>\n<li><strong>Example:<\/strong> A SaaS company builds an image with test dependencies and deploys a test fleet per release candidate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Spot-first batch processing with controlled risk<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Want cheaper compute, but Spot evictions can interrupt work.<\/li>\n<li><strong>Why this service fits:<\/strong> Combine Spot instances with retry logic; optionally keep a baseline of regular instances (depending on service features\u2014verify).<\/li>\n<li><strong>Example:<\/strong> A data pipeline uses Spot VMs for most processing and requeues failed tasks on eviction.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Multi-zone stateless service tier<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Need to run stateless workers across zones for resiliency.<\/li>\n<li><strong>Why this service fits:<\/strong> Fleet capacity can be distributed across zones (where supported) and integrated with a load balancer.<\/li>\n<li><strong>Example:<\/strong> A backend worker tier scales to handle traffic spikes and remains available during a zone impairment.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Capacity acquisition for legacy VM workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Legacy applications require VMs and can\u2019t be containerized yet.<\/li>\n<li><strong>Why this service fits:<\/strong> Centralized provisioning and scaling without rewriting the app.<\/li>\n<li><strong>Example:<\/strong> A legacy Windows service is deployed to a fleet where each VM runs the same agent and configuration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Compute for security scanning \/ SAST\/DAST<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Security scans run periodically and 
need burst compute.<\/li>\n<li><strong>Why this service fits:<\/strong> Short-lived fleets for scanning windows reduce long-running costs.<\/li>\n<li><strong>Example:<\/strong> Weekly DAST runs spin up a fleet, execute scans, upload results, then delete.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Simulation workloads with strict subnet isolation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Workloads need isolation and private endpoints for data sources.<\/li>\n<li><strong>Why this service fits:<\/strong> Fleet instances can live in private subnets and use Managed Identity for data access.<\/li>\n<li><strong>Example:<\/strong> CAE simulations read\/write to Azure Storage via private endpoints and restricted NSGs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Education\/labs with ephemeral VM pools<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Training sessions need many identical VMs for a short time.<\/li>\n<li><strong>Why this service fits:<\/strong> Fleet-based provisioning can create a predictable pool and delete after the class.<\/li>\n<li><strong>Example:<\/strong> A university lab provisions a fleet for 3 hours and tears it down automatically.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) Blue\/green compute worker refresh<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> You need to rotate VM images safely without stopping all workers.<\/li>\n<li><strong>Why this service fits:<\/strong> Create a second fleet with a new image, shift workload, then remove the old fleet.<\/li>\n<li><strong>Example:<\/strong> A platform team rolls a new hardened image and migrates workers in waves.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Large-scale log processing on VMs (legacy stack)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A toolchain requires VMs and local disks; containers aren\u2019t 
supported.<\/li>\n<li><strong>Why this service fits:<\/strong> Rapidly provision VM workers to drain a queue backlog, then scale down.<\/li>\n<li><strong>Example:<\/strong> A log backlog after an outage is cleared by temporarily scaling the fleet.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<p>Because Azure Compute Fleet capabilities can evolve (especially if preview), treat the following as the typical feature set and verify the exact behavior and property names in Microsoft Learn.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Fleet-level capacity request<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Lets you request\/manage a group of VM instances as a unit.<\/li>\n<li><strong>Why it matters:<\/strong> Simplifies provisioning and scaling actions.<\/li>\n<li><strong>Practical benefit:<\/strong> One change to desired capacity instead of orchestrating many VMs.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> The exact capacity target model (instances vs vCPU vs other) may vary\u2014<strong>verify<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">VM profile (common configuration for instances)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Defines core VM settings: image, OS disk, extensions, networking, identity.<\/li>\n<li><strong>Why it matters:<\/strong> Ensures instances are consistently configured.<\/li>\n<li><strong>Practical benefit:<\/strong> Standardized workers\/runners without per-VM drift.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Some VM features depend on region\/SKU; images may have marketplace terms.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">SKU flexibility (multi-size eligibility)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Allows placement across a set of VM sizes\/series you specify (if supported).<\/li>\n<li><strong>Why it matters:<\/strong> Increases provisioning success when 
one SKU is constrained.<\/li>\n<li><strong>Practical benefit:<\/strong> Better capacity acquisition under pressure.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Application must tolerate different CPU\/memory ratios; license constraints may apply.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Zone-aware placement (where available)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Distributes instances across Availability Zones.<\/li>\n<li><strong>Why it matters:<\/strong> Improves resiliency and reduces correlated failures.<\/li>\n<li><strong>Practical benefit:<\/strong> Zone fault tolerance for stateless workloads.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Not all regions have zones; not all VM sizes are zonal.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integration with Spot Virtual Machines (cost optimization)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Enables using Spot capacity for cost savings (if supported by fleet configuration).<\/li>\n<li><strong>Why it matters:<\/strong> Can significantly reduce compute cost for interruptible workloads.<\/li>\n<li><strong>Practical benefit:<\/strong> Lower $\/compute for batch and elastic workloads.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Spot instances can be evicted; you must design for interruption.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Azure Resource Manager (ARM) lifecycle + IaC compatibility<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Manages fleet as an Azure resource with create\/update\/delete operations.<\/li>\n<li><strong>Why it matters:<\/strong> Enables repeatable deployments and governance.<\/li>\n<li><strong>Practical benefit:<\/strong> CI\/CD-managed infrastructure, consistent environments.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Tooling support (Terraform providers, Az CLI commands) may 
lag\u2014<strong>verify<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Azure RBAC and Managed Identity support (typical pattern)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Uses Azure RBAC for who can manage the fleet; instances can use Managed Identity to access Azure services.<\/li>\n<li><strong>Why it matters:<\/strong> Avoids secrets in code; least privilege.<\/li>\n<li><strong>Practical benefit:<\/strong> VMs can access Key Vault\/Storage\/Service Bus securely.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Identity assignment method and scope depends on implementation\u2014<strong>verify<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Observability through Azure Monitor (via underlying resources)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Enables metrics\/logs\/alerts via Azure Monitor for VM health and performance (and fleet operations via Activity Log).<\/li>\n<li><strong>Why it matters:<\/strong> Large-scale compute without observability becomes unmanageable.<\/li>\n<li><strong>Practical benefit:<\/strong> Central dashboards, alerting, and troubleshooting.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> There may not be fleet-specific metrics; you may monitor underlying VMs\/VMSS resources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance: tags, policy, locks (resource management plane)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Apply tags, Azure Policy, resource locks to control changes and enforce standards.<\/li>\n<li><strong>Why it matters:<\/strong> Prevents configuration drift and unmanaged cost.<\/li>\n<li><strong>Practical benefit:<\/strong> Enforce \u201cno public IP,\u201d required tags, approved SKUs, approved regions.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Policy evaluation depends on the underlying resources created and how they\u2019re modeled.<\/li>\n<\/ul>\n\n\n\n<h2 
class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level service architecture<\/h3>\n\n\n\n<p>At a high level, Azure Compute Fleet sits in the management plane and orchestrates provisioning in Azure Compute. The fleet resource defines <em>intent<\/em> (capacity + profile). Azure then creates and manages underlying compute instances that satisfy that intent.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Control flow (request path)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>You create or update an <strong>Azure Compute Fleet<\/strong> resource (Portal\/ARM\/Bicep\/CLI\/SDK).<\/li>\n<li>Azure validates:<ul>\n<li>Subscription and resource provider registration<\/li>\n<li>Quotas (vCPUs per region and per VM family)<\/li>\n<li>Policy constraints (allowed SKUs, required tags, network rules)<\/li>\n<\/ul>\n<\/li>\n<li>Fleet orchestration selects eligible placement targets (SKU, zone, etc., depending on configuration).<\/li>\n<li>Azure provisions the underlying compute instances plus dependent resources (NICs, disks, etc.).<\/li>\n<li>You monitor health\/performance and scale the fleet up\/down as needed.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Data flow (workload path)<\/h3>\n\n\n\n<p>Compute Fleet itself doesn\u2019t process your business data. 
Your workload runs on the VMs it provisions, so your application data flow is the same as for standard Azure VMs:<\/p>\n<ul class=\"wp-block-list\">\n<li>VMs read\/write to Storage, databases, queues, and internal services over the VNet or private endpoints.<\/li>\n<li>VMs send logs\/metrics to Azure Monitor through the Azure Monitor agent.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical integrations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Virtual Network (VNet):<\/strong> subnets, NSGs, route tables, NAT Gateway, private endpoints.<\/li>\n<li><strong>Azure Storage:<\/strong> boot diagnostics, application data, checkpoints.<\/li>\n<li><strong>Azure Monitor \/ Log Analytics:<\/strong> metrics, logs, alerts; Activity Log for fleet operations.<\/li>\n<li><strong>Azure Key Vault:<\/strong> secrets\/certificates (prefer access via Managed Identity).<\/li>\n<li><strong>Azure Load Balancer \/ Application Gateway:<\/strong> if fleet instances serve traffic (verify exact compatibility).<\/li>\n<li><strong>Azure Policy:<\/strong> enforce approved VM sizes, regions, tagging, and security controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p>Expect the same resource provider (RP) dependencies as a standard VM deployment:<\/p>\n<ul class=\"wp-block-list\">\n<li>Compute RP (<code>Microsoft.Compute<\/code>)<\/li>\n<li>Network RP (<code>Microsoft.Network<\/code>)<\/li>\n<li>Storage RP (<code>Microsoft.Storage<\/code>), if you use boot diagnostics or store data<\/li>\n<li>Monitor RP (<code>Microsoft.Insights<\/code>), if you enable diagnostics<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Management-plane:<\/strong> Azure RBAC controls who can create\/update\/delete fleet resources (e.g., Contributor, Virtual Machine Contributor, or custom roles).<\/li>\n<li><strong>Data-plane:<\/strong> Workload access from VMs to Storage\/Key Vault\/etc. 
is best done via Managed Identity + RBAC.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Instances attach to one or more subnets of a VNet, as defined in the VM profile.<\/li>\n<li>Use NSGs and UDRs for segmentation.<\/li>\n<li>Prefer <strong>no public IPs<\/strong>; use Bastion, VPN\/ExpressRoute, or jump hosts for admin access.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use the Azure Activity Log to track create\/update\/delete operations and deployment failures.<\/li>\n<li>Monitor underlying VMs: CPU\/memory (via agent), disk IOPS, network.<\/li>\n<li>Track scaling events and instance churn (especially with Spot).<\/li>\n<li>Use tags for cost allocation and environment separation.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Simple architecture diagram<\/h4>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[Engineer \/ CI Pipeline] --&gt;|ARM\/Portal| F[Azure Compute Fleet]\n  F --&gt; C[\"Azure Compute (VM instances)\"]\n  C --&gt; VNET[Azure VNet\/Subnet]\n  C --&gt; STG[Azure Storage]\n  C --&gt; MON[Azure Monitor \/ Log Analytics]\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Production-style architecture diagram<\/h4>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph DevOps[\"Delivery &amp; Governance\"]\n    ADO[CI\/CD Pipeline&lt;br\/&gt;Bicep\/ARM\/Terraform] --&gt; ARM[Azure Resource Manager]\n    POL[Azure Policy] --&gt; ARM\n    RBAC[Azure RBAC] --&gt; ARM\n  end\n\n  ARM --&gt; FLEET[Azure Compute Fleet]\n\n  subgraph Net[\"Networking (Hub-Spoke)\"]\n    HUB[Hub VNet&lt;br\/&gt;Firewall\/NVA&lt;br\/&gt;DNS] --- SPOKE[Spoke VNet]\n    SPOKE --&gt; SUBNET[Private Subnet&lt;br\/&gt;NSG + UDR]\n    NAT[\"NAT Gateway&lt;br\/&gt;(egress)\"] --&gt; SUBNET\n    PE[Private Endpoints] --&gt; HUB\n  end\n\n  FLEET --&gt; VM[\"Fleet Instances (VMs)\"]\n  VM --&gt; SUBNET\n  VM --&gt; 
PE\n  VM --&gt; KV[Azure Key Vault]\n  VM --&gt; STG2[Azure Storage \/ Data Lake]\n  VM --&gt; SB[Service Bus \/ Queue]\n\n  subgraph Obs[\"Observability\"]\n    AM[Azure Monitor] --&gt; LA[Log Analytics Workspace]\n    VM --&gt; AM\n    FLEET --&gt; AL[Activity Log]\n  end\n\n  subgraph Sec[\"Security\"]\n    MI[Managed Identity] --&gt; KV\n  end\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<p>Before you start, confirm the current requirements in the official Azure Compute Fleet documentation (especially if the service is in preview).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Account\/subscription requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An Azure subscription where you can create Compute and Network resources.<\/li>\n<li>If Azure Compute Fleet is in preview or limited availability, you may need region allow-listing, preview enrollment, or feature registration (<strong>verify in official docs<\/strong>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions (IAM\/RBAC)<\/h3>\n\n\n\n<p>At minimum, you typically need one of the following:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Contributor<\/strong> on the resource group (for a lab).<\/li>\n<li>A combination of narrower permissions:<ul>\n<li>Compute contributor permissions (create compute resources)<\/li>\n<li>Network contributor permissions (create NICs, subnets, and NSGs if needed)<\/li>\n<li>Managed identity permissions (if assigning identities)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>For production, prefer least privilege with custom roles and separate RBAC assignments for:<\/p>\n<ul class=\"wp-block-list\">\n<li>Fleet operators (scale operations)<\/li>\n<li>Image administrators<\/li>\n<li>Network administrators<\/li>\n<li>Security\/compliance reviewers<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A payment method and subscription in good standing.<\/li>\n<li>Sufficient quota for the VM sizes\/regions you plan to use.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tools<\/h3>\n\n\n\n<p>For the hands-on 
portion, install:<\/p>\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/learn.microsoft.com\/cli\/azure\/install-azure-cli\">Azure CLI<\/a><\/li>\n<li>Optional: Visual Studio Code with the Bicep extension (if you use IaC)<\/li>\n<li>Optional: an SSH client (OpenSSH) if you enable SSH access (exposing SSH is not recommended for production)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Confirm Azure Compute Fleet is available in your target region(s) and supports the VM families you need (<strong>verify in docs<\/strong>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<p>Common quota constraints to check:<\/p>\n<ul class=\"wp-block-list\">\n<li><strong>Regional vCPU quotas<\/strong> (overall and per VM family)<\/li>\n<li><strong>Spot vCPU quotas<\/strong> (if using Spot)<\/li>\n<li><strong>Public IP quotas<\/strong> (if you create many public endpoints)<\/li>\n<li><strong>Disk limits<\/strong> and <strong>NIC limits<\/strong> at scale<\/li>\n<\/ul>\n\n\n\n<p>Check quotas in the Azure portal under <strong>Subscriptions \u2192 Usage + quotas<\/strong>, and request increases if needed.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services (typical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Resource group<\/li>\n<li>Virtual network + subnet<\/li>\n<li>(Recommended) Log Analytics workspace for monitoring<\/li>\n<li>(Recommended) Key Vault for secrets\/certificates<\/li>\n<li>(Optional) NAT Gateway for stable outbound IP and better SNAT scaling<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p>Azure Compute Fleet pricing is best understood as <strong>\u201cyou pay for what it provisions\u201d<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing model (what to expect)<\/h3>\n\n\n\n<p>In many Azure orchestration features, the control-plane resource itself is not billed separately; you are billed for the underlying resources created (VMs, disks, IPs, load balancers, monitoring). 
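<\/p>\n\n\n\n<p>As a rough illustration of paying for what the fleet provisions, the Python sketch below totals the hourly cost of a fleet that mixes regular and Spot instances. The rates, the flat per-VM disk figure, and the helper name <code>fleet_hourly_cost<\/code> are made-up assumptions. The sketch also assumes that the fleet resource itself carries no charge. Take real prices from the Azure pricing calculator.<\/p>

```python
# Made-up hourly rates in USD, keyed by (VM size, priority). Use the Azure
# pricing calculator for real figures in your region and OS.
RATES = {
    ('Standard_D4s_v5', 'regular'): 0.20,
    ('Standard_D4s_v5', 'spot'): 0.04,
}
HOURS_PER_MONTH = 730  # approximation of one month


def fleet_hourly_cost(instances, disk_cost_per_vm_hour=0.0):
    # instances: (vm_size, priority, count) tuples for what the fleet actually
    # provisioned. Assumes the fleet resource itself adds no charge, so only
    # per-VM compute and a flat per-VM disk cost are summed.
    total = 0.0
    for vm_size, priority, count in instances:
        total += count * (RATES[(vm_size, priority)] + disk_cost_per_vm_hour)
    return total


mix = [('Standard_D4s_v5', 'regular', 2), ('Standard_D4s_v5', 'spot', 8)]
hourly = fleet_hourly_cost(mix, disk_cost_per_vm_hour=0.01)
print(round(hourly, 2), round(hourly * HOURS_PER_MONTH, 2))
```

<p>Extend the same arithmetic with networking, monitoring, and storage costs. 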
For Azure Compute Fleet specifically, <strong>verify in official pricing documentation<\/strong> whether there is any standalone charge.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (primary drivers)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Virtual Machines<\/strong>\n<ul>\n<li>VM size (vCPU\/RAM)<\/li>\n<li>Region<\/li>\n<li>OS (Windows licensing vs Linux)<\/li>\n<li>Purchase option:\n<ul>\n<li>Pay-as-you-go<\/li>\n<li>Reserved Instances<\/li>\n<li>Savings Plan for Compute<\/li>\n<li>Spot (if used)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><strong>Managed disks<\/strong>\n<ul>\n<li>Disk type (Standard HDD\/SSD, Premium SSD, Ultra Disk)<\/li>\n<li>Capacity and IOPS tiers<\/li>\n<\/ul>\n<\/li>\n<li><strong>Networking<\/strong>\n<ul>\n<li>Outbound data transfer (egress)<\/li>\n<li>Public IP addresses (especially if many)<\/li>\n<li>Load Balancer \/ Application Gateway (if used)<\/li>\n<li>NAT Gateway (if used)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Monitoring<\/strong>\n<ul>\n<li>Log Analytics ingestion and retention<\/li>\n<li>VM insights \/ agent data volume<\/li>\n<\/ul>\n<\/li>\n<li><strong>Storage<\/strong>\n<ul>\n<li>Boot diagnostics storage (if enabled)<\/li>\n<li>Application data and checkpoints<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Compute Fleet itself typically does not have a \u201cfree tier\u201d concept. VM compute is billable. 
Some monitoring has free grants depending on SKU and promotions\u2014<strong>verify current offers<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs to watch<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Log ingestion<\/strong>: fleets can produce large log volumes quickly.<\/li>\n<li><strong>Egress<\/strong>: moving data out of Azure or across regions can dominate costs.<\/li>\n<li><strong>Public IP sprawl<\/strong>: a fleet with per-VM public IPs is expensive and risky.<\/li>\n<li><strong>Premium disks<\/strong> at scale: convenient but costly; right-size IOPS.<\/li>\n<li><strong>Build artifacts<\/strong> and container\/image pulls: repeated downloads across many VMs increase bandwidth and time.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep data sources in the same region as the fleet when possible.<\/li>\n<li>Prefer private endpoints to keep traffic on Microsoft backbone (and reduce exposure).<\/li>\n<li>Use caching (artifact caching, package mirrors) for large fleets.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost optimization strategies<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Spot<\/strong> for interruptible workloads; design for eviction.<\/li>\n<li>Use <strong>approved size lists<\/strong> that are both cost-effective and widely available.<\/li>\n<li>Prefer <strong>ephemeral OS disks<\/strong> where appropriate (if supported by your chosen VM size) to reduce OS disk cost and speed provisioning\u2014<strong>verify fleet support<\/strong>.<\/li>\n<li>Scale down aggressively; implement scheduled scaling or event-driven scale.<\/li>\n<li>Use <strong>Savings Plan<\/strong> or <strong>Reserved Instances<\/strong> for baseline always-on capacity.<\/li>\n<li>Centralize outbound via <strong>NAT Gateway<\/strong> for predictable networking and fewer public IPs.<\/li>\n<li>Tag everything: 
<code>env<\/code>, <code>owner<\/code>, <code>costCenter<\/code>, <code>workload<\/code>, <code>dataSensitivity<\/code>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate<\/h3>\n\n\n\n<p>A minimal lab might be:<\/p>\n<ul class=\"wp-block-list\">\n<li>1 small Linux VM instance in a fleet<\/li>\n<li>Standard SSD OS disk<\/li>\n<li>No load balancer<\/li>\n<li>Minimal Log Analytics retention<\/li>\n<\/ul>\n\n\n\n<p>Estimate approach:<\/p>\n<ul class=\"wp-block-list\">\n<li>VM hourly rate \u00d7 hours used<\/li>\n<li>Disk monthly rate, prorated for the time the disk exists<\/li>\n<li>Log Analytics: data ingested (GB) \u00d7 per-GB ingestion rate, plus retention charges if you keep data beyond the included retention period<\/li>\n<\/ul>\n\n\n\n<p>Use:<\/p>\n<ul class=\"wp-block-list\">\n<li>Azure Pricing calculator: https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/li>\n<li>Azure Virtual Machines pricing: https:\/\/azure.microsoft.com\/pricing\/details\/virtual-machines\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>In production, costs grow with:<\/p>\n<ul class=\"wp-block-list\">\n<li>Peak instance count and how long it is held<\/li>\n<li>Disk performance tiers (Premium\/Ultra)<\/li>\n<li>Data movement patterns<\/li>\n<li>Monitoring volume<\/li>\n<li>High availability topology (multi-zone, load balancers, gateways)<\/li>\n<\/ul>\n\n\n\n<p>A good production cost model includes:<\/p>\n<ul class=\"wp-block-list\">\n<li>Baseline capacity on Savings Plan\/RI<\/li>\n<li>Burst capacity on pay-as-you-go or Spot<\/li>\n<li>Separate line items for monitoring, egress, and storage<\/li>\n<li>A chargeback\/showback plan using tags and Cost Management<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>This lab is designed to be <strong>safe and low-cost<\/strong> by using a small number of instances, private networking, and short runtime. 
Because Azure Compute Fleet availability and exact creation steps can vary (preview\/region), the lab includes both <strong>core steps<\/strong> and <strong>verification\/alternatives<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Create a small Azure Compute Fleet in a dedicated resource group, place instances into a private subnet, validate that underlying VM instances are created, and set up basic monitoring visibility. Then clean up all resources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. Create a resource group and basic networking (VNet\/subnet\/NSG).\n2. Create an Azure Compute Fleet (small capacity).\n3. Verify underlying compute instances and connectivity signals.\n4. Add basic monitoring (Activity Log + optional Log Analytics).\n5. Clean up.<\/p>\n\n\n\n<p><strong>Expected cost:<\/strong> Primarily the VM runtime + disk + any monitoring ingestion. Keep the fleet small (e.g., 1 instance) and delete it after validation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a resource group<\/h3>\n\n\n\n<p><strong>What you do:<\/strong> Create a dedicated resource group for clean teardown.<\/p>\n\n\n\n<pre><code class=\"language-bash\">az login\naz account show\naz account set --subscription \"&lt;YOUR_SUBSCRIPTION_ID&gt;\"\n\n# Choose a region you have quota for\nREGION=\"eastus\"\nRG=\"rg-compute-fleet-lab\"\n\naz group create -n \"$RG\" -l \"$REGION\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> Resource group is created.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az group show -n \"$RG\" --query \"{name:name, location:location}\" -o table\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create networking (VNet, subnet, NSG)<\/h3>\n\n\n\n<p>This uses standard Azure networking resources (not Compute Fleet-specific), and is fully executable.<\/p>\n\n\n\n<pre><code 
class=\"language-bash\">VNET=\"vnet-fleet-lab\"\nSUBNET=\"snet-fleet\"\nNSG=\"nsg-fleet\"\n\naz network vnet create \\\n  -g \"$RG\" -n \"$VNET\" -l \"$REGION\" \\\n  --address-prefix 10.50.0.0\/16 \\\n  --subnet-name \"$SUBNET\" --subnet-prefix 10.50.1.0\/24\n\naz network nsg create -g \"$RG\" -n \"$NSG\" -l \"$REGION\"\n<\/code><\/pre>\n\n\n\n<p>For a low-risk lab, avoid inbound from the internet. If you must test SSH, restrict to your IP and remove after.<\/p>\n\n\n\n<p><strong>Optional (restricted SSH rule):<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">MYIP=\"$(curl -s ifconfig.me)\/32\"\naz network nsg rule create \\\n  -g \"$RG\" --nsg-name \"$NSG\" -n \"AllowSSHFromMyIP\" \\\n  --priority 1000 --access Allow --direction Inbound --protocol Tcp \\\n  --source-address-prefixes \"$MYIP\" --source-port-ranges \"*\" \\\n  --destination-address-prefixes \"*\" --destination-port-ranges 22\n<\/code><\/pre>\n\n\n\n<p>Associate NSG to subnet:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az network vnet subnet update \\\n  -g \"$RG\" --vnet-name \"$VNET\" -n \"$SUBNET\" \\\n  --network-security-group \"$NSG\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> You have a subnet ready for private VM placement.<\/p>\n\n\n\n<p><strong>Verify:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az network vnet subnet show -g \"$RG\" --vnet-name \"$VNET\" -n \"$SUBNET\" -o table\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Confirm Azure Compute Fleet availability in your subscription\/region<\/h3>\n\n\n\n<p>Because Azure Compute Fleet may be preview\/region-limited, validate that your subscription can see the resource provider and (if applicable) the resource type.<\/p>\n\n\n\n<p>1) Ensure <code>Microsoft.Compute<\/code> is registered:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az provider show -n Microsoft.Compute --query \"registrationState\" -o tsv\naz provider register -n Microsoft.Compute\n<\/code><\/pre>\n\n\n\n<p>2) 
Register the Azure Compute Fleet resource provider and check that its <code>fleets<\/code> resource type is offered in your region. Azure Compute Fleet resources are published under the <code>Microsoft.AzureFleet<\/code> namespace (resource type <code>Microsoft.AzureFleet\/fleets<\/code>), not <code>Microsoft.Compute<\/code>. Confirm the namespace in the current docs before relying on it:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az provider register -n Microsoft.AzureFleet\naz provider show -n Microsoft.AzureFleet --query \"resourceTypes[?resourceType=='fleets'].locations\" -o json\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>If the output lists regions for the <code>fleets<\/code> resource type and your target region is among them, you can proceed.<\/li>\n<li>If the output is empty or your region is missing, Azure Compute Fleet may not be enabled for your subscription\/region. Provider registration can also take a few minutes; re-run <code>az provider show -n Microsoft.AzureFleet --query registrationState -o tsv<\/code> until it reports <code>Registered<\/code>.<\/li>\n<\/ul>\n\n\n\n<p><strong>If not available:<\/strong><\/p>\n<ul class=\"wp-block-list\">\n<li>Use Microsoft Learn search to find the latest enrollment\/enablement steps: https:\/\/learn.microsoft.com\/search\/?terms=Azure%20Compute%20Fleet<\/li>\n<li>To learn the general pattern, you can still complete the networking, monitoring, governance, and cost-modeling steps, and compare with VM Scale Sets in Section 14.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create an Azure Compute Fleet<\/h3>\n\n\n\n<p>Creation methods vary (Portal, ARM\/Bicep, CLI, REST). The most reliable approach is to follow the <strong>official Azure Compute Fleet quickstart<\/strong> for the current schema and API version.<\/p>\n\n\n\n<p><strong>Option A (recommended): Azure Portal<\/strong>\n1. Azure Portal \u2192 <strong>Create a resource<\/strong>\n2. Search for <strong>Azure Compute Fleet<\/strong>\n3. Select the offering and click <strong>Create<\/strong>\n4. Configure:\n   &#8211; Subscription, Resource group: <code>rg-compute-fleet-lab<\/code>\n   &#8211; Region: same as your VNet\n   &#8211; Instance configuration: choose a small Linux image (Ubuntu LTS) and a small size or an allowed size list (depending on the UI)\n   &#8211; Networking: select <code>vnet-fleet-lab<\/code> and <code>snet-fleet<\/code>\n   &#8211; Authentication: SSH key (preferred) or password (not recommended)\n   &#8211; Tags: <code>env=lab<\/code>, <code>owner=&lt;you&gt;<\/code>\n5. 
Review + create.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> A fleet resource is created, and then underlying VM instance resources begin provisioning.<\/p>\n\n\n\n<p><strong>Option B: Infrastructure as Code (ARM\/Bicep\/Terraform)<\/strong>\nUse the official template\/schema from Microsoft Learn or official samples. Because property names and API versions can change, do not copy random templates from the internet.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Find official docs: https:\/\/learn.microsoft.com\/search\/?terms=Azure%20Compute%20Fleet%20quickstart  <\/li>\n<li>Find official samples (if available): https:\/\/github.com\/Azure (search within repos for \u201ccompute fleet\u201d)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Verify underlying instances were created<\/h3>\n\n\n\n<p>Even if the fleet resource is new to you, you can verify what got created by listing compute resources in the resource group.<\/p>\n\n\n\n<p>List resources:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az resource list -g \"$RG\" -o table\n<\/code><\/pre>\n\n\n\n<p>List VMs:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az vm list -g \"$RG\" -d -o table\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> You should see one or more VM instances (or related compute resources) created as part of the fleet, with private IPs in <code>10.50.1.0\/24<\/code>.<\/p>\n\n\n\n<p>Check NICs\/private IPs:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az network nic list -g \"$RG\" --query \"[].{name:name, ip:ipConfigurations[0].privateIPAddress}\" -o table\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Validate access and basic health signals<\/h3>\n\n\n\n<p>How you validate depends on whether you gave the instances public access.<\/p>\n\n\n\n<p><strong>Recommended validation (no public IP):<\/strong>\n&#8211; Use Azure Portal \u2192 VM \u2192 <strong>Run command<\/strong> (if available on the underlying instances) to run:\n  &#8211; 
<code>uname -a<\/code>\n  &#8211; <code>systemctl status &lt;your-service&gt;<\/code>\n&#8211; Or use a jump host\/Bastion (more secure but can add cost).<\/p>\n\n\n\n<p><strong>If you allowed SSH (restricted to your IP):<\/strong>\n1. Find the public IP (if any were created):<\/p>\n\n\n\n<pre><code class=\"language-bash\">az network public-ip list -g \"$RG\" -o table\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>SSH:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">ssh &lt;username&gt;@&lt;public-ip&gt;\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li>Confirm hostname and network:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">hostname\nip a\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> You can run a command on an instance and confirm it\u2019s alive and on the expected subnet.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Set up basic monitoring and audit visibility<\/h3>\n\n\n\n<p>Even without fleet-specific metrics, you can (and should) monitor:\n&#8211; Fleet operations: Activity Log\n&#8211; VM performance: Azure Monitor metrics + VM Insights (optional)<\/p>\n\n\n\n<p><strong>Activity Log query (last 24h):<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az monitor activity-log list --resource-group \"$RG\" --offset 24h -o table\n<\/code><\/pre>\n\n\n\n<p><strong>Optional: Create a Log Analytics workspace<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">LAW=\"law-fleet-lab-$RANDOM\"\n\naz monitor log-analytics workspace create \\\n  -g \"$RG\" -n \"$LAW\" -l \"$REGION\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> You have a workspace for logs if you choose to enable VM Insights\/agents. 
Enabling agents at scale can increase cost, so enable them sparingly in labs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>Use this checklist:<\/p>\n<ul class=\"wp-block-list\">\n<li>[ ] <code>rg-compute-fleet-lab<\/code> exists<\/li>\n<li>[ ] VNet\/subnet\/NSG created and associated<\/li>\n<li>[ ] Fleet resource exists in the RG (Portal or <code>az resource list<\/code>)<\/li>\n<li>[ ] Underlying VM instances exist and have private IPs in the subnet<\/li>\n<li>[ ] Activity Log shows create operations and provisioning events<\/li>\n<li>[ ] (Optional) You can run a command\/SSH to confirm instance health<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p>Common issues and fixes:<\/p>\n\n\n\n<p>1) <strong><code>The subscription is not registered to use namespace 'Microsoft.Compute'<\/code><\/strong>\n&#8211; Fix: Register the namespace named in the error, for example <code>az provider register -n Microsoft.Compute<\/code>.<\/p>\n\n\n\n<p>2) <strong>Fleet resource type not found \/ cannot create<\/strong>\n&#8211; Cause: Azure Compute Fleet is not enabled, the preview is not available in the region, or you lack permissions.\n&#8211; Fix: Check the official docs for enablement steps and supported regions: https:\/\/learn.microsoft.com\/search\/?terms=Azure%20Compute%20Fleet<\/p>\n\n\n\n<p>3) <strong>Quota exceeded (vCPU)<\/strong>\n&#8211; Cause: Not enough regional cores for the selected size\/family.\n&#8211; Fix: Request a quota increase, choose smaller sizes, or reduce the desired capacity.<\/p>\n\n\n\n<p>4) <strong>SKU not available \/ allocation failed<\/strong>\n&#8211; Cause: Capacity constraints in that region\/zone\/SKU.\n&#8211; Fix: Add flexibility (more eligible sizes), change zones, or use another region (subject to governance).<\/p>\n\n\n\n<p>5) <strong>Spot eviction \/ capacity not available (Spot)<\/strong>\n&#8211; Cause: Spot is best-effort.\n&#8211; Fix: Design for interruption, checkpoint often, and keep a baseline of regular capacity if needed.<\/p>\n\n\n\n<p>6) <strong>Cannot connect to instance<\/strong>\n&#8211; 
Cause: No public IP, a missing NSG rule, or a missing route.\n&#8211; Fix: Use private access methods (Bastion\/VPN), or temporarily allow restricted SSH and then remove the rule.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid ongoing charges, delete the resource group.<\/p>\n\n\n\n<pre><code class=\"language-bash\">az group delete -n \"$RG\" --yes --no-wait\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> Because of <code>--no-wait<\/code>, the command returns immediately and deletion continues in the background. Once it finishes, all resources in the lab RG are deleted, including the fleet and underlying compute, and <code>az group exists -n \"$RG\"<\/code> returns <code>false<\/code>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Design fleet instances as <strong>stateless workers<\/strong> where possible.<\/li>\n<li>Store state externally (Storage, databases) and use <strong>checkpointing<\/strong> for long tasks.<\/li>\n<li>Use <strong>private subnets<\/strong>; avoid per-instance public IPs.<\/li>\n<li>Separate concerns:\n<ul class=\"wp-block-list\">\n<li>One fleet per workload (or per environment) for blast-radius control.<\/li>\n<li>Separate fleets for Spot vs regular capacity if your operational model requires different policies.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use least privilege:\n<ul class=\"wp-block-list\">\n<li>Separate the \u201cscale operator\u201d role from the \u201cadmin who can change images\/network\u201d role.<\/li>\n<\/ul>\n<\/li>\n<li>Prefer <strong>Managed Identity<\/strong> for instances; avoid secrets in scripts.<\/li>\n<li>Restrict who can change:\n<ul class=\"wp-block-list\">\n<li>VM images (golden image pipeline)<\/li>\n<li>Network profile (subnet\/NSG)<\/li>\n<li>Allowed SKUs<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep an approved <strong>SKU allow-list<\/strong> that balances availability and cost.<\/li>\n<li>Use Spot only when the workload can tolerate eviction.<\/li>\n<li>Use Savings Plan\/RI for 
baseline capacity; use fleet scaling for bursts.<\/li>\n<li>Control monitoring costs:<\/li>\n<li>Start with essential logs\/metrics<\/li>\n<li>Set retention appropriately<\/li>\n<li>Sample high-volume logs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Benchmark on representative sizes if you allow multiple SKUs.<\/li>\n<li>Use accelerated networking where required and supported by chosen SKUs.<\/li>\n<li>Optimize disk type\/tier to workload I\/O.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer multi-zone designs for stateless services (where supported).<\/li>\n<li>Implement workload-level retries and idempotency.<\/li>\n<li>For Spot-heavy fleets, implement:<\/li>\n<li>Frequent checkpoints<\/li>\n<li>Queue-based work distribution<\/li>\n<li>Rapid rehydration (fast bootstrap)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardize:<\/li>\n<li>Naming: <code>fleet-&lt;workload&gt;-&lt;env&gt;-&lt;region&gt;<\/code><\/li>\n<li>Tags: <code>env<\/code>, <code>owner<\/code>, <code>costCenter<\/code>, <code>dataClass<\/code>, <code>app<\/code><\/li>\n<li>Use Azure Monitor alerts for:<\/li>\n<li>High CPU\/disk queue<\/li>\n<li>Instance provisioning failures<\/li>\n<li>Unexpected scale-out<\/li>\n<li>Automate cleanup for ephemeral fleets (scheduled jobs, pipeline teardown).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enforce tags via Azure Policy.<\/li>\n<li>Use Azure Policy to restrict:<\/li>\n<li>Public IP creation<\/li>\n<li>Unapproved VM sizes<\/li>\n<li>Unapproved regions<\/li>\n<li>Apply resource locks only when appropriate (locks can block auto-remediation\/cleanup).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure RBAC<\/strong> controls management operations on the fleet and underlying resources.<\/li>\n<li>Prefer <strong>Managed Identity<\/strong> on the instances for accessing:\n<ul class=\"wp-block-list\">\n<li>Azure Key Vault (secrets\/certs)<\/li>\n<li>Azure Storage (data\/checkpoints)<\/li>\n<li>Service Bus\/Queue services<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>Recommended pattern:<\/p>\n<ul class=\"wp-block-list\">\n<li>System-assigned or user-assigned managed identity on instances (verify fleet support and assignment method).<\/li>\n<li>Grant the identity only the minimal roles needed (e.g., \u201cStorage Blob Data Contributor\u201d scoped to a single container, not the whole account).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Managed disks are encrypted at rest by default with Azure Storage server-side encryption using platform-managed keys. For stricter requirements, use:\n<ul class=\"wp-block-list\">\n<li>Customer-managed keys (CMK) where supported<\/li>\n<li>Azure Disk Encryption or encryption at host (depending on requirements and SKU support; verify)<\/li>\n<\/ul>\n<\/li>\n<li>Encrypt data in transit:\n<ul class=\"wp-block-list\">\n<li>TLS for service endpoints<\/li>\n<li>Private endpoints to reduce exposure<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid public inbound access to fleet instances.<\/li>\n<li>For administrative access, use:\n<ul class=\"wp-block-list\">\n<li>Azure Bastion (secure admin)<\/li>\n<li>Private access via VPN\/ExpressRoute<\/li>\n<li>Just-In-Time access (Defender for Cloud) where applicable<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do not embed secrets in cloud-init\/custom data.<\/li>\n<li>Use Key Vault + Managed Identity.<\/li>\n<li>Rotate secrets regularly; prefer short-lived credentials.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use 
Azure Activity Log for fleet create\/update\/delete operations.<\/li>\n<li>Centralize logs in Log Analytics and\/or a SIEM.<\/li>\n<li>Track:<\/li>\n<li>Who scaled capacity<\/li>\n<li>Who changed the VM profile<\/li>\n<li>Unexpected deployments in unapproved regions\/SKUs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: keep compute and data in approved regions.<\/li>\n<li>Harden images: CIS baselines, Defender for Cloud recommendations.<\/li>\n<li>Use Azure Policy initiatives aligned to your standard (e.g., Azure Security Benchmark).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Assigning Contributor at subscription scope to operators who only need scale permissions.<\/li>\n<li>Allowing public IPs for every instance.<\/li>\n<li>Using admin passwords shared across instances.<\/li>\n<li>Not limiting outbound (data exfiltration risk) and not using NAT + firewall controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Private subnet + NSG default deny inbound.<\/li>\n<li>Managed Identity + Key Vault.<\/li>\n<li>Azure Policy to enforce:<\/li>\n<li>No public IP<\/li>\n<li>Required tags<\/li>\n<li>Approved SKUs\/regions<\/li>\n<li>Central logging and alerting for provisioning failures and unusual scale.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<p>Because Azure Compute Fleet can be preview\/feature-evolving, validate these points in official docs for your environment.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Known limitations to check (verify)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Regional availability<\/strong>: not all regions may support it.<\/li>\n<li><strong>Zonal behavior<\/strong>: zone distribution may have constraints.<\/li>\n<li><strong>SKU eligibility<\/strong>: not every VM size\/family may be supported.<\/li>\n<li><strong>Tooling support<\/strong>: CLI\/Terraform support may lag behind REST\/Portal.<\/li>\n<li><strong>Autoscale semantics<\/strong>: whether fleet integrates with Azure Autoscale or uses its own scaling model\u2014verify.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Regional vCPU quotas are often the first blocker at scale.<\/li>\n<li>Spot quotas can be separate from pay-as-you-go quotas.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing surprises<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Log Analytics ingestion at high scale.<\/li>\n<li>Egress from downloading packages\/artifacts repeatedly.<\/li>\n<li>Load balancing and public IPs for many instances.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compatibility issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Mixed SKUs can affect performance consistency.<\/li>\n<li>Some VM extensions are OS- or SKU-dependent.<\/li>\n<li>Marketplace images may require legal terms acceptance.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational gotchas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spot evictions can look like \u201crandom\u201d failures if you don\u2019t track eviction events.<\/li>\n<li>Scaling down without draining work can cause task loss\u2014use queues and graceful shutdown.<\/li>\n<li>Policies applied to the resource group can block underlying resource creation (NICs, 
IPs, disks).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Migration challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Moving from VMSS\/Batch to Compute Fleet may require:<\/li>\n<li>Reworking scaling logic<\/li>\n<li>Updating monitoring queries and dashboards<\/li>\n<li>Adjusting governance policies for a new resource type<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p>Azure Compute Fleet sits among several \u201cscale compute\u201d options. The best choice depends on whether you need job scheduling, autoscale, or container orchestration.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Azure Compute Fleet<\/strong><\/td>\n<td>Large-scale VM capacity acquisition with flexibility<\/td>\n<td>Single fleet abstraction; potential SKU\/zone flexibility; good for burst capacity<\/td>\n<td>Availability\/feature set may vary; tooling maturity may vary<\/td>\n<td>You want orchestrated VM capacity with flexibility and simplified ops<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Virtual Machine Scale Sets (VMSS)<\/strong><\/td>\n<td>Autoscaling and managing groups of VMs<\/td>\n<td>Mature; autoscale integration; well-known patterns<\/td>\n<td>You manage SKU choice more directly; heterogeneous capacity patterns can be more complex<\/td>\n<td>You want a proven VM grouping and autoscaling model<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Batch<\/strong><\/td>\n<td>Batch job scheduling and task execution<\/td>\n<td>Job\/queue semantics; task retries; pool management<\/td>\n<td>Adds a scheduler layer; learning curve; may not fit always-on services<\/td>\n<td>You need job scheduling more than raw capacity orchestration<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Kubernetes Service (AKS)<\/strong><\/td>\n<td>Containerized workloads needing 
orchestration<\/td>\n<td>Powerful scheduling; autoscaling; rolling updates<\/td>\n<td>Requires containerization; cluster ops overhead<\/td>\n<td>Your workload is container-ready and benefits from Kubernetes<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure CycleCloud (HPC)<\/strong><\/td>\n<td>HPC clusters with schedulers (Slurm, PBS)<\/td>\n<td>HPC patterns, cluster lifecycle<\/td>\n<td>HPC-specific; added tooling<\/td>\n<td>You run classic HPC clusters and schedulers<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS EC2 Fleet<\/strong> (other cloud)<\/td>\n<td>Similar \u201cfleet\u201d concept<\/td>\n<td>Mature in AWS ecosystem<\/td>\n<td>Different cloud\/provider lock-in<\/td>\n<td>You\u2019re on AWS or comparing patterns<\/td>\n<\/tr>\n<tr>\n<td><strong>GCP Managed Instance Groups<\/strong> (other cloud)<\/td>\n<td>VM groups with autoscaling<\/td>\n<td>Mature on GCP<\/td>\n<td>Different cloud\/provider lock-in<\/td>\n<td>You\u2019re on GCP or comparing patterns<\/td>\n<\/tr>\n<tr>\n<td><strong>Self-managed autoscaling scripts<\/strong><\/td>\n<td>Custom niche requirements<\/td>\n<td>Fully customizable<\/td>\n<td>High toil; error-prone; security risk<\/td>\n<td>Only when managed services can\u2019t meet requirements<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example (regulated industry)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A financial services company runs nightly risk simulations requiring thousands of CPU cores for a 3\u20135 hour window. 
SKU availability varies, and the environment must be private with strict governance.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Azure Compute Fleet in a locked-down subscription\/resource group<\/li>\n<li>Private subnet with NSGs and UDRs through a hub firewall<\/li>\n<li>Managed Identity for access to Storage (input data + checkpoint outputs)<\/li>\n<li>Queue-based dispatcher (Service Bus) to distribute tasks<\/li>\n<li>Central monitoring in Log Analytics; Activity Log streamed to SIEM<\/li>\n<li><strong>Why Azure Compute Fleet was chosen:<\/strong><\/li>\n<li>A fleet-level request simplifies operations for large bursts.<\/li>\n<li>Flexibility across approved SKUs improves the probability of acquiring capacity on time (where supported).<\/li>\n<li>Integrates with standard Azure governance (Policy\/RBAC).<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Reduced run-to-run variability in capacity acquisition.<\/li>\n<li>Lower ops toil (less manual retrying in different SKUs\/zones).<\/li>\n<li>Stronger governance and audit posture.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A startup\u2019s CI pipeline occasionally backs up when multiple PRs are opened, slowing releases. 
They need burst compute but want to control spend.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Small Azure Compute Fleet used as ephemeral runner pool<\/li>\n<li>Golden image with runner agent preinstalled<\/li>\n<li>Autoscale via pipeline schedules (scale up during working hours, down at night)<\/li>\n<li>Minimal monitoring + cost alerts<\/li>\n<li><strong>Why Azure Compute Fleet was chosen:<\/strong><\/li>\n<li>Simplifies managing burst VM capacity as one construct.<\/li>\n<li>Can incorporate cost strategies like Spot (only for non-critical pipeline stages) if supported and appropriate.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Faster CI throughput during peaks.<\/li>\n<li>Predictable cost via aggressive scale-down and tagging.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>What is Azure Compute Fleet in one sentence?<\/strong><br\/>\nA service to orchestrate and manage a group (\u201cfleet\u201d) of Azure VM compute capacity using a single resource-level intent (capacity + profile).<\/p>\n\n\n\n<p>2) <strong>Is Azure Compute Fleet the same as VM Scale Sets?<\/strong><br\/>\nNo. VMSS is a mature VM grouping\/autoscaling service; Compute Fleet focuses on fleet-style capacity acquisition and orchestration. The exact relationship and capabilities should be verified in official docs.<\/p>\n\n\n\n<p>3) <strong>Do I pay extra for Azure Compute Fleet?<\/strong><br\/>\nTypically you pay for the underlying VMs, disks, networking, and monitoring. Verify whether the fleet control-plane has any standalone charge in current official pricing.<\/p>\n\n\n\n<p>4) <strong>Is Azure Compute Fleet regional or global?<\/strong><br\/>\nCompute placement is regional, and fleets are generally created in a region (and may place across zones). 
Verify region\/zonal behavior for your configuration.<\/p>\n\n\n\n<p>5) <strong>Can Azure Compute Fleet use Spot VMs?<\/strong><br\/>\nSpot capacity is often a key fleet scenario, but the exact configuration options depend on the current service version; verify them in the official docs.<\/p>\n\n\n\n<p>6) <strong>How do I handle Spot evictions in a fleet?<\/strong><br\/>\nDesign for interruption: checkpoint work, use queues, retry tasks, and avoid storing critical state on the VM OS disk.<\/p>\n\n\n\n<p>7) <strong>Can I restrict which VM sizes the fleet uses?<\/strong><br\/>\nRestricting a fleet to an approved list of SKUs is a common best practice. Verify whether the fleet supports a list of eligible sizes and how it prioritizes them.<\/p>\n\n\n\n<p>8) <strong>How do I scale Azure Compute Fleet?<\/strong><br\/>\nTypically, by updating the fleet\u2019s desired capacity target. The exact scaling API and semantics depend on the current resource model.<\/p>\n\n\n\n<p>9) <strong>Does Compute Fleet support autoscaling rules?<\/strong><br\/>\nAutoscale integration varies by service. Verify whether Compute Fleet integrates with Azure Monitor autoscale or provides its own scaling mechanisms.<\/p>\n\n\n\n<p>10) <strong>How do I monitor a fleet?<\/strong><br\/>\nMonitor the underlying VMs (metrics and logs) and use the Activity Log to track fleet operations. If fleet-level metrics exist, collect them with Azure Monitor; verify what is available.<\/p>\n\n\n\n<p>11) <strong>Can I deploy fleet instances into a private subnet only?<\/strong><br\/>\nYes, and that is the recommended security posture: a private subnet with controlled outbound access. Ensure your VM profile does not create public IPs.<\/p>\n\n\n\n<p>12) <strong>Can I use custom images with Azure Compute Fleet?<\/strong><br\/>\nVM orchestration services typically support Azure Marketplace images and custom images (managed images or Azure Compute Gallery, formerly Shared Image Gallery). 
Verify supported image sources for Compute Fleet.<\/p>\n\n\n\n<p>13) <strong>How does Azure Policy affect fleet creation?<\/strong><br\/>\nPolicy can deny creation of the underlying resources (NICs, public IPs, certain VM sizes). Validate your policy assignments before scaling out.<\/p>\n\n\n\n<p>14) <strong>What are the main failure modes when creating fleets?<\/strong><br\/>\nThe most common are quota limits, SKU capacity shortages, policy denials, and network misconfiguration.<\/p>\n\n\n\n<p>15) <strong>Is Azure Compute Fleet good for stateful workloads?<\/strong><br\/>\nUsually not. Fleets are best for stateless or checkpointed workloads where instances can be replaced without data loss.<\/p>\n\n\n\n<p>16) <strong>How do I avoid public IP sprawl in large fleets?<\/strong><br\/>\nDo not assign a public IP to each instance. Use private subnets, controlled inbound access (Azure Bastion or VPN), and a NAT gateway for outbound traffic.<\/p>\n\n\n\n<p>17) <strong>What\u2019s a good alternative if Azure Compute Fleet isn\u2019t available in my region?<\/strong><br\/>\nUse VM Scale Sets (Flexible orchestration) for general VM groups, or Azure Batch for batch workloads, depending on your needs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Azure Compute Fleet<\/h2>\n\n\n\n<p>Because product URLs can change (especially for preview services), a Microsoft Learn search is the most reliable starting point.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Microsoft Learn search: Azure Compute Fleet<\/td>\n<td>Most reliable way to find the current overview, API versions, and quickstarts: https:\/\/learn.microsoft.com\/search\/?terms=Azure%20Compute%20Fleet<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Azure Virtual Machines pricing<\/td>\n<td>Fleet cost typically maps to VM pricing + dependencies: https:\/\/azure.microsoft.com\/pricing\/details\/virtual-machines\/<\/td>\n<\/tr>\n<tr>\n<td>Pricing tools<\/td>\n<td>Azure Pricing Calculator<\/td>\n<td>Build scenario-based estimates: https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/td>\n<\/tr>\n<tr>\n<td>Official architecture guidance<\/td>\n<td>Azure Architecture Center<\/td>\n<td>Reference architectures and best practices for compute, networking, and resiliency: https:\/\/learn.microsoft.com\/azure\/architecture\/<\/td>\n<\/tr>\n<tr>\n<td>Related docs (compute grouping)<\/td>\n<td>Virtual Machine Scale Sets documentation<\/td>\n<td>Key alternative and complementary patterns: https:\/\/learn.microsoft.com\/azure\/virtual-machine-scale-sets\/<\/td>\n<\/tr>\n<tr>\n<td>Related docs (Spot)<\/td>\n<td>Azure Spot Virtual Machines<\/td>\n<td>Spot concepts, eviction, best practices: https:\/\/learn.microsoft.com\/azure\/virtual-machines\/spot-vms<\/td>\n<\/tr>\n<tr>\n<td>Related docs (batch scheduling)<\/td>\n<td>Azure Batch documentation<\/td>\n<td>When you need job\/task scheduling: https:\/\/learn.microsoft.com\/azure\/batch\/<\/td>\n<\/tr>\n<tr>\n<td>Monitoring<\/td>\n<td>Azure Monitor documentation<\/td>\n<td>Metrics, logs, alerts, workbooks: 
https:\/\/learn.microsoft.com\/azure\/azure-monitor\/<\/td>\n<\/tr>\n<tr>\n<td>Governance<\/td>\n<td>Azure Policy documentation<\/td>\n<td>Enforce allowed SKUs, regions, tags, and security controls: https:\/\/learn.microsoft.com\/azure\/governance\/policy\/<\/td>\n<\/tr>\n<tr>\n<td>Samples<\/td>\n<td>GitHub (Azure org) search<\/td>\n<td>Look for official examples and templates: https:\/\/github.com\/Azure<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<p>The following training providers may offer Azure, DevOps, SRE, and cloud operations courses. Confirm current syllabi and delivery modes on their websites.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>DevOpsSchool.com<\/strong><br\/>\n   &#8211; <strong>Suitable audience:<\/strong> DevOps engineers, SREs, platform teams, beginners to intermediate<br\/>\n   &#8211; <strong>Likely learning focus:<\/strong> Azure operations, DevOps, CI\/CD, IaC, monitoring<br\/>\n   &#8211; <strong>Mode:<\/strong> Check website<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.devopsschool.com\/<\/p>\n<\/li>\n<li>\n<p><strong>ScmGalaxy.com<\/strong><br\/>\n   &#8211; <strong>Suitable audience:<\/strong> DevOps and SCM learners, build\/release engineers<br\/>\n   &#8211; <strong>Likely learning focus:<\/strong> Source control, CI\/CD, DevOps tooling, cloud basics<br\/>\n   &#8211; <strong>Mode:<\/strong> Check website<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.scmgalaxy.com\/<\/p>\n<\/li>\n<li>\n<p><strong>CloudOpsNow.in<\/strong><br\/>\n   &#8211; <strong>Suitable audience:<\/strong> Cloud operations engineers, sysadmins transitioning to cloud<br\/>\n   &#8211; <strong>Likely learning focus:<\/strong> Cloud ops, monitoring, incident response, cost controls<br\/>\n   &#8211; <strong>Mode:<\/strong> Check website<br\/>\n   &#8211; <strong>Website URL:<\/strong> 
https:\/\/www.cloudopsnow.in\/<\/p>\n<\/li>\n<li>\n<p><strong>SreSchool.com<\/strong><br\/>\n   &#8211; <strong>Suitable audience:<\/strong> SREs, operations teams, reliability-focused engineers<br\/>\n   &#8211; <strong>Likely learning focus:<\/strong> SRE practices, SLIs\/SLOs, observability, incident management<br\/>\n   &#8211; <strong>Mode:<\/strong> Check website<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.sreschool.com\/<\/p>\n<\/li>\n<li>\n<p><strong>AiOpsSchool.com<\/strong><br\/>\n   &#8211; <strong>Suitable audience:<\/strong> Ops teams adopting AIOps and automation<br\/>\n   &#8211; <strong>Likely learning focus:<\/strong> Observability, event correlation, automation, AIOps concepts<br\/>\n   &#8211; <strong>Mode:<\/strong> Check website<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.aiopsschool.com\/<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">19. Top Trainers<\/h2>\n\n\n\n<p>These sites may provide training, mentoring, or trainer directories. 
Verify offerings and credentials directly on each site.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>RajeshKumar.xyz<\/strong><br\/>\n   &#8211; <strong>Likely specialization:<\/strong> DevOps\/cloud coaching (verify specific topics on site)<br\/>\n   &#8211; <strong>Suitable audience:<\/strong> Individuals and teams seeking practical mentoring<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/rajeshkumar.xyz\/<\/p>\n<\/li>\n<li>\n<p><strong>devopstrainer.in<\/strong><br\/>\n   &#8211; <strong>Likely specialization:<\/strong> DevOps tools and practices training<br\/>\n   &#8211; <strong>Suitable audience:<\/strong> Beginners to intermediate DevOps learners<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.devopstrainer.in\/<\/p>\n<\/li>\n<li>\n<p><strong>devopsfreelancer.com<\/strong><br\/>\n   &#8211; <strong>Likely specialization:<\/strong> DevOps freelancing services and\/or training resources (verify)<br\/>\n   &#8211; <strong>Suitable audience:<\/strong> Teams seeking flexible engagement models<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.devopsfreelancer.com\/<\/p>\n<\/li>\n<li>\n<p><strong>devopssupport.in<\/strong><br\/>\n   &#8211; <strong>Likely specialization:<\/strong> DevOps support services and training resources (verify)<br\/>\n   &#8211; <strong>Suitable audience:<\/strong> Teams needing operational support and enablement<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.devopssupport.in\/<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">20. Top Consulting Companies<\/h2>\n\n\n\n<p>These organizations may offer consulting related to DevOps, cloud architecture, and operations. 
Validate service offerings and case studies directly with each company.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>cotocus.com<\/strong><br\/>\n   &#8211; <strong>Likely service area:<\/strong> Cloud\/DevOps consulting (verify exact offerings)<br\/>\n   &#8211; <strong>Where they may help:<\/strong> Architecture reviews, cloud migrations, automation, operations setup<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/cotocus.com\/<br\/>\n   &#8211; <strong>Consulting use case examples:<\/strong><\/p>\n<ul>\n<li>Designing secure private networking for large-scale compute<\/li>\n<li>Setting up IaC pipelines and governance (RBAC\/Policy\/tagging)<\/li>\n<li>Cost optimization for burst compute workloads<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>DevOpsSchool.com<\/strong><br\/>\n   &#8211; <strong>Likely service area:<\/strong> DevOps enablement, training, and consulting<br\/>\n   &#8211; <strong>Where they may help:<\/strong> Platform engineering, CI\/CD, cloud operations, monitoring<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.devopsschool.com\/<br\/>\n   &#8211; <strong>Consulting use case examples:<\/strong><\/p>\n<ul>\n<li>Building a repeatable Azure compute provisioning framework<\/li>\n<li>Implementing observability and cost governance for VM-based platforms<\/li>\n<li>Establishing SRE practices for large-scale compute environments<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>DEVOPSCONSULTING.IN<\/strong><br\/>\n   &#8211; <strong>Likely service area:<\/strong> DevOps and cloud consulting (verify exact offerings)<br\/>\n   &#8211; <strong>Where they may help:<\/strong> DevOps transformations, automation, release engineering<br\/>\n   &#8211; <strong>Website URL:<\/strong> https:\/\/www.devopsconsulting.in\/<br\/>\n   &#8211; <strong>Consulting use case examples:<\/strong><\/p>\n<ul>\n<li>CI runner fleet design and scaling strategy<\/li>\n<li>Security hardening and policy enforcement for compute resources<\/li>\n<li>Incident response readiness and operational 
dashboards<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">21. Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Azure Compute Fleet<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure fundamentals: subscriptions, resource groups, RBAC, Azure Policy<\/li>\n<li>Azure networking: VNets, subnets, NSGs, route tables, private endpoints, NAT<\/li>\n<li>Azure Virtual Machines basics: images, disks, extensions, boot diagnostics<\/li>\n<li>Cost fundamentals: Azure Cost Management, tagging, quotas<\/li>\n<li>Observability: Azure Monitor metrics, Activity Log, Log Analytics basics<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after Azure Compute Fleet<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>VM Scale Sets deep dive (autoscale, upgrade policies, instance health)<\/li>\n<li>Azure Batch for job scheduling patterns<\/li>\n<li>Golden image pipelines (Azure VM Image Builder \/ Azure Compute Gallery, formerly Shared Image Gallery)<\/li>\n<li>Security hardening: Microsoft Defender for Cloud, vulnerability management, patching strategy<\/li>\n<li>Advanced governance: Policy initiatives, landing zones, management groups<\/li>\n<li>Reliability engineering: SLOs, error budgets, capacity planning for burst compute<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cloud engineer \/ Azure engineer<\/li>\n<li>DevOps engineer \/ platform engineer<\/li>\n<li>SRE \/ reliability engineer<\/li>\n<li>HPC engineer (VM-based clusters)<\/li>\n<li>Security engineer (governance and hardening of compute)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (Azure)<\/h3>\n\n\n\n<p>Azure certifications change over time; verify the current certification lineup. 
Commonly relevant tracks include:<\/p>\n<ul class=\"wp-block-list\">\n<li>Azure Fundamentals (AZ-900)<\/li>\n<li>Azure Administrator Associate (AZ-104)<\/li>\n<li>Azure Solutions Architect Expert (AZ-305)<\/li>\n<li>DevOps Engineer Expert (AZ-400)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Build a \u201ccompute burst\u201d environment with a fleet (or VMSS), a work queue, workers, and dashboards.<\/li>\n<li>Implement cost controls: tags, budgets, alerts, and scheduled scale-down.<\/li>\n<li>Set up a security baseline: a no-public-IP policy, Managed Identity, Key Vault, and private endpoints.<\/li>\n<li>Build a Spot-resilient pipeline: checkpointing, retries, and eviction-aware workers.<\/li>\n<li>Practice multi-environment promotion: dev\/test fleet \u2192 staging fleet \u2192 prod fleet using IaC.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Compute Fleet:<\/strong> An Azure resource used to orchestrate a pool of VM compute capacity under a unified intent (capacity + profile). 
Verify exact definition and capabilities in Microsoft Learn.<\/li>\n<li><strong>VM (Virtual Machine):<\/strong> Compute instance providing OS-level control (IaaS).<\/li>\n<li><strong>SKU:<\/strong> A specific VM size\/type (vCPU, RAM, capabilities).<\/li>\n<li><strong>Availability Zone:<\/strong> A physically separate location within an Azure region, with independent power, cooling, and networking, used for higher resiliency.<\/li>\n<li><strong>Spot VM:<\/strong> Discounted compute that uses spare Azure capacity and can be evicted (for example, when Azure needs the capacity back).<\/li>\n<li><strong>Managed Identity:<\/strong> A Microsoft Entra ID (formerly Azure AD) identity that lets Azure resources access other services without stored secrets.<\/li>\n<li><strong>NSG (Network Security Group):<\/strong> Stateful firewall rules for subnets\/NICs.<\/li>\n<li><strong>UDR (User Defined Route):<\/strong> Custom routing rules for subnet traffic.<\/li>\n<li><strong>NAT Gateway:<\/strong> Managed outbound internet connectivity for private subnets, with scalable SNAT port allocation.<\/li>\n<li><strong>Activity Log:<\/strong> Subscription-level log of management-plane operations.<\/li>\n<li><strong>Log Analytics:<\/strong> Workspace for storing and querying logs (KQL).<\/li>\n<li><strong>IaC (Infrastructure as Code):<\/strong> Managing infrastructure using declarative templates (Bicep\/ARM\/Terraform).<\/li>\n<li><strong>Quota:<\/strong> Subscription limits on resources, such as vCPUs per region and per VM family.<\/li>\n<li><strong>Checkpointing:<\/strong> Persisting intermediate state so a task can resume after interruption (e.g., Spot eviction).<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Azure Compute Fleet is an Azure Compute service that helps you request and manage VM capacity as a single \u201cfleet,\u201d focusing on simplifying large-scale capacity acquisition and lifecycle operations. 
It matters when you need burst compute, want to reduce manual toil around SKU\/placement challenges, and prefer managing compute in a standardized, governable way.<\/p>\n\n\n\n<p>Cost-wise, your primary spend is usually the underlying VMs plus disks, networking, and monitoring; keep an eye on log ingestion and data egress. Security-wise, use private networking, Azure RBAC, Managed Identity, and Azure Policy to prevent public exposure and enforce standards.<\/p>\n\n\n\n<p>Use Azure Compute Fleet when you need scalable VM capacity for stateless or checkpointed workloads (CI runners, batch processing, render, simulation). If you need mature autoscaling semantics or job scheduling, compare VM Scale Sets and Azure Batch. Next, validate service availability and supported features in Microsoft Learn and run a small lab fleet in your target region before committing to production.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Compute<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40,26],"tags":[],"class_list":["post-387","post","type-post","status-publish","format-standard","hentry","category-azure","category-compute"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/387","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=387"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/387\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/
tutorials\/wp-json\/wp\/v2\/media?parent=387"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=387"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=387"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}