{"id":390,"date":"2026-04-13T21:42:42","date_gmt":"2026-04-13T21:42:42","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/azure-cyclecloud-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-compute\/"},"modified":"2026-04-13T21:42:42","modified_gmt":"2026-04-13T21:42:42","slug":"azure-cyclecloud-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-compute","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/azure-cyclecloud-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-compute\/","title":{"rendered":"Azure CycleCloud Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Compute"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Compute<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud is Azure\u2019s cluster orchestration and lifecycle management solution for running high-performance computing (HPC) and other scale-out, scheduler-driven workloads on Azure infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In simple terms: <strong>Azure CycleCloud helps you stand up an HPC cluster (like Slurm or PBS), scale it up and down automatically, and manage it consistently<\/strong>\u2014without manually creating, configuring, and tearing down large fleets of virtual machines (VMs).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In more technical terms: <strong>Azure CycleCloud is software you deploy in your own Azure subscription<\/strong> (typically as a VM from Azure Marketplace). It uses Azure APIs to provision and configure cluster nodes, integrates with common schedulers, supports autoscaling based on queued jobs, and provides a UI\/CLI for cluster operations. 
It is not the same thing as Azure Batch; instead, it focuses on <strong>IaaS-based HPC clusters<\/strong> with familiar schedulers and control over VM images, networking, and storage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The problem it solves:<\/strong> teams that need traditional HPC schedulers, tightly controlled VM images, specialized VM SKUs (including HPC SKUs), and predictable cluster architecture often struggle with \u201chand-built\u201d VM farms. Azure CycleCloud addresses this by providing repeatable cluster templates, automated scaling, and operational tooling that fits HPC and engineering workloads.<\/p>\n\n\n\n<blockquote>\n<p>Service status note: <strong>\u201cAzure CycleCloud\u201d is the current Microsoft name used in Azure documentation.<\/strong> Always verify the latest deployment methods and supported schedulers in the official documentation because HPC integrations evolve over time. Official docs: https:\/\/learn.microsoft.com\/azure\/cyclecloud\/<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Azure CycleCloud?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud\u2019s purpose is to <strong>create, manage, operate, and optimize HPC and other compute clusters on Azure<\/strong>. 
It helps you deploy scheduler-based clusters, manage node lifecycle, and scale compute resources to match workload demand.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities (high level)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cluster provisioning from templates<\/strong> (repeatable cluster definitions).<\/li>\n<li><strong>Scheduler integration<\/strong> (commonly used HPC schedulers; exact list varies by release\u2014verify in docs).<\/li>\n<li><strong>Autoscaling<\/strong> based on scheduler\/job queue state.<\/li>\n<li><strong>Image and configuration management<\/strong> for consistent node builds.<\/li>\n<li><strong>Operational management<\/strong> (start\/stop, add\/remove nodes, monitor cluster state).<\/li>\n<li><strong>Integration with Azure infrastructure<\/strong> (VNets, subnets, NSGs, managed disks, Azure Files\/NFS options depending on design).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">While implementations vary by cluster type and scheduler, most deployments include:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>CycleCloud Server<\/strong>\n   &#8211; A VM you deploy in your subscription (often from Azure Marketplace).\n   &#8211; Provides the <strong>web UI<\/strong> and <strong>CycleCloud CLI<\/strong> endpoint.\n   &#8211; Holds cluster definitions, state, and configuration.<\/p>\n<\/li>\n<li>\n<p><strong>Cluster Nodes<\/strong>\n   &#8211; <strong>Head\/login node<\/strong> (scheduler controller, login gateway, or management node depending on template).\n   &#8211; <strong>Compute nodes<\/strong> (scale-out worker nodes; often in VM Scale Sets or managed as a node array\u2014implementation depends on template and Azure CycleCloud version).\n   &#8211; Optional <strong>visualization<\/strong>, <strong>broker<\/strong>, <strong>gateway<\/strong>, or <strong>storage<\/strong> nodes depending on 
workload.<\/p>\n<\/li>\n<li>\n<p><strong>Scheduler<\/strong>\n   &#8211; Deployed and configured as part of the cluster template (for supported schedulers).\n   &#8211; Controls job queueing, placement, priorities, and execution.<\/p>\n<\/li>\n<li>\n<p><strong>Azure Infrastructure Resources<\/strong>\n   &#8211; Virtual network\/subnets, network security groups, public IPs (optional), storage (managed disks, Azure Files, or partner NFS offerings), and identity resources.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Not a fully managed Azure control-plane service<\/strong> like Azure Batch.<\/li>\n<li><strong>Software-managed orchestration<\/strong> you run inside your subscription (IaaS-hosted management server + Azure API-driven provisioning).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scope: subscription and resource-group centric<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In a typical deployment, Azure CycleCloud:\n&#8211; Is <strong>deployed into a specific Azure subscription<\/strong> and one or more resource groups.\n&#8211; Operates within the boundaries of your Azure RBAC permissions and quotas.\n&#8211; Creates cluster resources in your subscription, so governance, policy, and cost management apply normally.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Regional \/ zonal considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The CycleCloud Server VM is deployed into a specific <strong>Azure region<\/strong>.<\/li>\n<li>Clusters are usually deployed into the same region for latency and simplicity. Multi-region patterns are possible but add complexity and should be validated against official guidance.<\/li>\n<li>Availability Zones may be used depending on selected VM SKUs and region support (verify per region\/SKU).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Azure ecosystem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud sits 
\u201cabove\u201d Azure Compute and Networking services:\n&#8211; Uses <strong>Azure Virtual Machines<\/strong> (including HPC VM families where appropriate).\n&#8211; Uses <strong>Azure Virtual Network<\/strong> for cluster networking.\n&#8211; Uses storage options (e.g., <strong>managed disks<\/strong>, <strong>Azure Files<\/strong>, and HPC-oriented NFS approaches). The right storage depends on throughput\/IOPS and POSIX requirements.\n&#8211; Works alongside governance tools like <strong>Azure Policy<\/strong>, <strong>Azure Monitor<\/strong>, and <strong>Cost Management<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. Why use Azure CycleCloud?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster time-to-results for HPC adoption on Azure:<\/strong> templates reduce the time to build a scheduler-based cluster from weeks to hours.<\/li>\n<li><strong>Elastic cost model:<\/strong> autoscaling reduces paying for idle compute.<\/li>\n<li><strong>Repeatability:<\/strong> standard cluster blueprints reduce errors and rework.<\/li>\n<li><strong>Migration path:<\/strong> supports \u201clift-and-shift\u201d patterns for teams already using schedulers like Slurm\/PBS on-prem (validate exact scheduler support).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scheduler-first design:<\/strong> ideal for workloads that require job queues, reservations, node features, partitions\/queues, and MPI-friendly placement.<\/li>\n<li><strong>Control over infrastructure:<\/strong> choose VM sizes, images, networks, and storage architecture.<\/li>\n<li><strong>Custom images and bootstrap:<\/strong> standardize OS packages, MPI stacks, drivers, and app dependencies.<\/li>\n<li><strong>Integration with Azure primitives:<\/strong> use your existing VNets, subnets, NSGs, and identity 
patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Automated node lifecycle:<\/strong> provision, configure, and remove nodes based on demand.<\/li>\n<li><strong>Central management UI + CLI:<\/strong> consistent operations for multiple clusters.<\/li>\n<li><strong>Scaling policies:<\/strong> align compute growth with actual queued work.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Runs inside your subscription:<\/strong> you control network isolation, encryption, and logging.<\/li>\n<li><strong>Works with Azure RBAC and enterprise governance<\/strong> (deployment permissions, tagging, policy).<\/li>\n<li>Enables architectures that keep clusters private (no public IPs), using bastion\/jumpbox patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scale-out compute<\/strong> with large node counts (subject to quotas and SKU availability).<\/li>\n<li>Designed for HPC-style scaling patterns: bursts, backfill, and queue-driven elasticity.<\/li>\n<li>Can be paired with HPC-oriented VM SKUs and storage designs (where appropriate).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Azure CycleCloud<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Choose Azure CycleCloud when you:\n&#8211; Need a <strong>traditional HPC scheduler<\/strong> experience on Azure.\n&#8211; Want <strong>elastic clusters<\/strong> that scale based on queued jobs.\n&#8211; Need to <strong>control OS images, drivers, and system-level tuning<\/strong>.\n&#8211; Want repeatable cluster deployments via templates and infrastructure-as-code-like patterns.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud may not be the best fit 
when you:\n&#8211; Prefer <strong>fully managed batch processing<\/strong> without managing a scheduler VM (consider Azure Batch).\n&#8211; Want <strong>container-native<\/strong> orchestration and CI\/CD patterns (consider AKS).\n&#8211; Only need a small fixed-size VM pool; CycleCloud may be unnecessary overhead.\n&#8211; Lack HPC admin skills (scheduler configuration, Linux tuning, MPI, shared filesystems).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. Where is Azure CycleCloud used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Engineering and manufacturing (CAE\/CFD\/FEA, EDA)<\/li>\n<li>Life sciences (genomics pipelines, molecular dynamics\u2014scheduler-based)<\/li>\n<li>Financial services (risk simulations, Monte Carlo)<\/li>\n<li>Media and rendering (frame rendering with queue-based scheduling)<\/li>\n<li>Research and academia (MPI\/HTC clusters)<\/li>\n<li>Energy (reservoir simulations, seismic processing)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>HPC platform teams<\/li>\n<li>DevOps\/SRE teams supporting research compute<\/li>\n<li>Computational science teams<\/li>\n<li>Enterprise infrastructure teams modernizing on-prem HPC<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>MPI-based simulations<\/li>\n<li>Parameter sweeps (HTC)<\/li>\n<li>EDA toolchains<\/li>\n<li>Rendering and transcoding farms<\/li>\n<li>Large-scale data preprocessing where a scheduler is preferred<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Private VNet HPC clusters with a login node<\/li>\n<li>Hub-and-spoke networks (shared services in hub, clusters in spokes)<\/li>\n<li>Hybrid identity + private DNS patterns<\/li>\n<li>Burst-to-cloud extensions of on-prem schedulers (complex; validate 
integration approach)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production:<\/strong> regulated environments, controlled images, private endpoints, strict network rules, logging\/monitoring, change management.<\/li>\n<li><strong>Dev\/test:<\/strong> smaller clusters for workflow validation; spot\/low-priority patterns where allowed; ephemeral clusters per project.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Below are realistic scenarios where Azure CycleCloud commonly fits. Exact scheduler\/templates and deployment steps vary\u2014verify supported templates and schedulers in the official docs: https:\/\/learn.microsoft.com\/azure\/cyclecloud\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Elastic Slurm cluster for engineering simulations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Simulation jobs arrive in bursts; fixed clusters waste money.<\/li>\n<li><strong>Why Azure CycleCloud fits:<\/strong> Autoscaling based on queued jobs; repeatable Slurm cluster deployment.<\/li>\n<li><strong>Example:<\/strong> A mechanical engineering team runs nightly CFD jobs; cluster scales from 0 to 200 nodes at night and back to minimal footprint by morning.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) PBS-based cluster for legacy HPC workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Applications are certified on PBS workflows and job scripts.<\/li>\n<li><strong>Why it fits:<\/strong> CycleCloud supports scheduler-driven IaaS clusters and can standardize node images.<\/li>\n<li><strong>Example:<\/strong> A research lab migrates PBS scripts to Azure with minimal changes, keeping user workflows consistent.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) On-demand rendering farm with 
queue-driven scaling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Rendering jobs spike around deadlines; artists need predictable turnaround.<\/li>\n<li><strong>Why it fits:<\/strong> Queue length drives node count; compute nodes can be transient.<\/li>\n<li><strong>Example:<\/strong> A studio spins up 500 CPU nodes for a weekend render, then deallocates them.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Parameter sweep \/ HTC cluster for model calibration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Thousands of independent jobs; needs fast provisioning and teardown.<\/li>\n<li><strong>Why it fits:<\/strong> HPC schedulers + autoscale handle large job counts; scaling policies reduce idle.<\/li>\n<li><strong>Example:<\/strong> A quant team runs Monte Carlo sweeps with job arrays and scales worker nodes as the queue grows.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Secure, private HPC cluster for regulated workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Compliance requires no public exposure and strict egress control.<\/li>\n<li><strong>Why it fits:<\/strong> CycleCloud can operate in private VNets; you control NSGs, routing, and private access patterns.<\/li>\n<li><strong>Example:<\/strong> A healthcare organization processes sensitive datasets on a private cluster accessible only via VPN\/ExpressRoute and bastion.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Standardized \u201ccluster as a product\u201d for internal teams<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Multiple departments request clusters; ad hoc setups cause drift and outages.<\/li>\n<li><strong>Why it fits:<\/strong> Templates + governance make deployments consistent.<\/li>\n<li><strong>Example:<\/strong> Central IT offers approved templates (small\/medium\/large) with standard tagging and logging.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">7) Preemptible\/spot-friendly burst compute (where supported)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Need cheaper compute for interruptible workloads.<\/li>\n<li><strong>Why it fits:<\/strong> Cluster policies can incorporate VM priority options where appropriate (verify current support and best practices).<\/li>\n<li><strong>Example:<\/strong> A research group runs interruptible parameter sweeps on spot VMs; failed tasks automatically resubmit.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Multi-queue cluster with different VM types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Workloads need different CPU\/memory ratios and sometimes GPUs.<\/li>\n<li><strong>Why it fits:<\/strong> Scheduler partitions\/queues map well to different node arrays and VM sizes.<\/li>\n<li><strong>Example:<\/strong> One partition uses memory-optimized VMs; another uses GPU VMs for acceleration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Temporary training clusters for classes or workshops<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Need consistent clusters for labs; must be easy to reset.<\/li>\n<li><strong>Why it fits:<\/strong> Templates make \u201cknown good\u201d environments reproducible.<\/li>\n<li><strong>Example:<\/strong> A university deploys a Slurm cluster per class section, then deletes after the course.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) \u201cBurst to Azure\u201d for peak periods<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> On-prem cluster is full during peak; jobs wait too long.<\/li>\n<li><strong>Why it fits:<\/strong> Azure CycleCloud enables rapid scale-out in Azure using similar scheduling patterns (hybrid bursting designs require careful DNS\/identity\/network planning).<\/li>\n<li><strong>Example:<\/strong> End-of-quarter risk runs overflow to Azure to meet 
deadlines.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Feature availability and supported schedulers can change. Validate in the official documentation: https:\/\/learn.microsoft.com\/azure\/cyclecloud\/<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">1) Cluster templates (repeatable deployments)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Defines cluster topology (head node, compute node arrays, networking, storage mounts, scheduler config).<\/li>\n<li><strong>Why it matters:<\/strong> Reduces manual configuration and drift.<\/li>\n<li><strong>Practical benefit:<\/strong> You can redeploy identical clusters for dev\/test\/prod or different projects.<\/li>\n<li><strong>Caveats:<\/strong> Templates must be versioned and tested; changes can break bootstrapping or scheduler configuration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Scheduler-based autoscaling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Scales compute nodes based on scheduler demand (queued jobs, resource requests).<\/li>\n<li><strong>Why it matters:<\/strong> Avoid paying for idle nodes while keeping queue wait times low.<\/li>\n<li><strong>Practical benefit:<\/strong> Elastic clusters that respond to real workload demand.<\/li>\n<li><strong>Caveats:<\/strong> Autoscaling depends on correct scheduler configuration and accurate resource requests in job submissions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Integrated cluster lifecycle operations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Start\/stop clusters, add\/remove node arrays, and manage node states from a central interface.<\/li>\n<li><strong>Why it matters:<\/strong> Day-2 operations are where HPC platforms often struggle.<\/li>\n<li><strong>Practical benefit:<\/strong> Standard operational workflow across 
teams.<\/li>\n<li><strong>Caveats:<\/strong> Operational runbooks still matter\u2014especially around patching, image updates, and scheduler upgrades.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Node provisioning and configuration (bootstrap)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Installs packages, configures mounts, joins nodes to the scheduler, sets up users\/SSH access (pattern depends on template).<\/li>\n<li><strong>Why it matters:<\/strong> Repeatable and automated node bring-up is essential for scale.<\/li>\n<li><strong>Practical benefit:<\/strong> New nodes become ready quickly and consistently.<\/li>\n<li><strong>Caveats:<\/strong> Bootstrap scripts must be idempotent; avoid long-running steps that slow scale-out.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Support for custom VM images<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Lets you use custom images for head\/compute nodes to preinstall HPC libraries, drivers, security agents, etc.<\/li>\n<li><strong>Why it matters:<\/strong> Faster provisioning and better consistency.<\/li>\n<li><strong>Practical benefit:<\/strong> Reduced \u201ctime to ready\u201d per node and fewer runtime downloads.<\/li>\n<li><strong>Caveats:<\/strong> Image pipelines must be maintained; GPU\/HPC drivers must match kernel versions and SKU requirements.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Azure infrastructure integration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Works with VNets\/subnets, NSGs, route tables, managed disks, and Azure identity constructs.<\/li>\n<li><strong>Why it matters:<\/strong> HPC clusters must fit enterprise network and governance patterns.<\/li>\n<li><strong>Practical benefit:<\/strong> Deploy into existing landing zones and shared services.<\/li>\n<li><strong>Caveats:<\/strong> Network restrictions (no outbound internet, forced tunneling) can break 
package installs unless mirrored repositories are used.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Multi-cluster management (single CycleCloud server)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> A single CycleCloud Server can manage multiple clusters (subject to sizing and design).<\/li>\n<li><strong>Why it matters:<\/strong> Centralized operations.<\/li>\n<li><strong>Practical benefit:<\/strong> Shared templates, common policy, consolidated audit\/ops.<\/li>\n<li><strong>Caveats:<\/strong> Treat the server as critical infrastructure; implement backups and HA strategies as appropriate (verify supported patterns).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) CLI automation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Supports scripting cluster actions (create, start\/stop, scale) using CLI tooling.<\/li>\n<li><strong>Why it matters:<\/strong> Enables CI\/CD-like workflows for infrastructure and cluster management.<\/li>\n<li><strong>Practical benefit:<\/strong> Repeatable operations and GitOps-style automation.<\/li>\n<li><strong>Caveats:<\/strong> Manage credentials securely; use least privilege for automation identities.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Tagging and governance alignment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Enables tagging of cluster resources for cost allocation and policy compliance (implementation varies).<\/li>\n<li><strong>Why it matters:<\/strong> HPC spend can be significant; tagging is critical for chargeback\/showback.<\/li>\n<li><strong>Practical benefit:<\/strong> Better cost reporting and governance.<\/li>\n<li><strong>Caveats:<\/strong> Enforce tags with Azure Policy; otherwise drift is common.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. 
Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>You deploy <strong>CycleCloud Server<\/strong> (typically a VM) into your Azure subscription.<\/li>\n<li>You define\/import <strong>cluster templates<\/strong> (e.g., Slurm-based).<\/li>\n<li>From the CycleCloud UI\/CLI, you create a cluster.<\/li>\n<li>CycleCloud calls Azure APIs to provision:\n   &#8211; Head node(s)\n   &#8211; Compute node arrays (scale sets or VM instances depending on template\/version)\n   &#8211; Networking components (if not precreated)\n   &#8211; Storage attachments\/mounts<\/li>\n<li>The scheduler runs on the head node and manages jobs.<\/li>\n<li>Autoscaling monitors demand and requests more nodes; nodes join the scheduler, run jobs, and are deallocated\/removed when idle (based on policy).<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane flow:<\/strong> Admin \u2192 CycleCloud UI\/CLI \u2192 CycleCloud Server \u2192 Azure Resource Manager \/ Compute APIs \u2192 VM provisioning.<\/li>\n<li><strong>Job flow:<\/strong> User \u2192 login\/head node \u2192 scheduler queue \u2192 compute node executes \u2192 writes results to shared storage \u2192 user retrieves results.<\/li>\n<li><strong>Telemetry flow:<\/strong> Nodes\/OS logs \u2192 Azure Monitor \/ Log Analytics agent (optional) \u2192 central monitoring workspace.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services (common patterns)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Virtual Machines<\/strong> for head\/compute nodes.<\/li>\n<li><strong>Azure Virtual Network<\/strong> for private cluster communication.<\/li>\n<li><strong>Azure DNS \/ Private DNS<\/strong> for internal name resolution (optional but recommended in enterprise networks).<\/li>\n<li><strong>Azure Monitor<\/strong> for 
metrics\/logs (agent-based or extensions).<\/li>\n<li><strong>Azure Bastion<\/strong> or jumpbox for secure SSH\/RDP access without public IPs.<\/li>\n<li><strong>Azure Key Vault<\/strong> for secrets (e.g., retrieving tokens\/keys in bootstrap scripts\u2014design carefully).<\/li>\n<li><strong>Azure Files \/ NFS solutions<\/strong> or third-party NFS for shared home\/work directories (choose based on performance\/semantics).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At minimum, a working CycleCloud deployment depends on:\n&#8211; Azure subscription with adequate <strong>compute quota<\/strong> for chosen VM SKUs.\n&#8211; Networking (VNet\/subnet) with correct routing\/DNS.\n&#8211; Identity configured so CycleCloud can provision resources (RBAC\/credentials).\n&#8211; Storage choices for shared directories (depending on workload).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model (conceptual)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure RBAC:<\/strong> governs who can create CycleCloud server and clusters, and what resources can be provisioned.<\/li>\n<li><strong>CycleCloud application access:<\/strong> users authenticate to the CycleCloud UI (mechanism depends on configuration; verify exact auth options in docs).<\/li>\n<li><strong>Node access:<\/strong> typically SSH keys for Linux-based clusters; restrict inbound access with NSGs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Most HPC deployments use:\n&#8211; One VNet with subnets for:\n  &#8211; CycleCloud Server (management)\n  &#8211; Head\/login node\n  &#8211; Compute nodes\n&#8211; NSGs to restrict inbound to:\n  &#8211; HTTPS to CycleCloud UI (from admin network only)\n  &#8211; SSH to head\/login (from admin network only)\n  &#8211; East-west traffic within subnets as required by scheduler\/MPI\n&#8211; Optional NAT 
Gateway or controlled egress for package repositories and licensing servers.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Monitor metrics:<\/strong> VM CPU, memory (guest-based), disk, network.<\/li>\n<li><strong>Log Analytics:<\/strong> OS logs, scheduler logs (forward via agent), audit trails.<\/li>\n<li><strong>Azure Activity Log:<\/strong> records ARM operations (cluster creation triggers many operations).<\/li>\n<li><strong>Tagging:<\/strong> cost allocation and lifecycle management.<\/li>\n<li><strong>Backups:<\/strong> back up CycleCloud Server state and cluster templates (verify recommended backup method in docs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (conceptual)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  Admin[\"Admin\/Engineer\"] --&gt;|HTTPS| CC[\"Azure CycleCloud Server (VM)\"]\n  CC --&gt;|ARM\/Compute API| Azure[\"Azure Resource Manager\"]\n  Azure --&gt; Head[\"Head\/Login Node (Scheduler)\"]\n  Azure --&gt; Compute[\"Compute Nodes (Scale-out)\"]\n  Head --&gt;|Jobs| Compute\n  Head &lt;--&gt; Storage[\"Shared Storage (e.g., NFS\/Azure Files depending on design)\"]\n  Compute &lt;--&gt; Storage\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (reference pattern)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph Hub[Hub VNet \/ Shared Services]\n    Bastion[Azure Bastion or Jumpbox]\n    DNS[Private DNS]\n    Monitor[Azure Monitor + Log Analytics]\n    KV[Azure Key Vault]\n  end\n\n  subgraph Spoke[Spoke VNet: HPC]\n    CC[CycleCloud Server VM]\n    Head[HPC Head\/Login Node + Scheduler]\n    subgraph Scale[Compute Subnet]\n      CN1[Compute Nodes]\n      CN2[Compute Nodes]\n      CNn[Compute Nodes...]\n    end\n    Storage[Shared Storage Endpoint]\n  end\n\n  Admin[Admins\/Users] --&gt;|VPN\/ER| Bastion\n  Bastion --&gt;|SSH\/HTTPS| 
CC\n  Bastion --&gt;|SSH| Head\n\n  CC --&gt;|Provision\/Scale| Scale\n  Head --&gt;|Dispatch jobs| Scale\n  Head &lt;--&gt; Storage\n  Scale &lt;--&gt; Storage\n\n  CC --&gt; Monitor\n  Head --&gt; Monitor\n  Scale --&gt; Monitor\n\n  CC --&gt; KV\n  Head --&gt; DNS\n  Scale --&gt; DNS\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Azure account\/subscription requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An Azure subscription where you can deploy Marketplace images and create compute\/network\/storage resources.<\/li>\n<li>Ability to register required resource providers if not already registered (commonly <code>Microsoft.Compute<\/code>, <code>Microsoft.Network<\/code>, <code>Microsoft.Storage<\/code>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Minimum for a lab (typical):\n&#8211; <strong>Contributor<\/strong> on a resource group (or subscription) to create VNets, VMs, NSGs, and storage resources.\n&#8211; Permissions to create and assign <strong>managed identities<\/strong> (if your chosen setup uses them).\n&#8211; For enterprise: separate roles for networking and compute may be used; coordinate with your platform team.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A billing-enabled subscription with quota for your chosen VM SKUs.<\/li>\n<li>HPC-sized SKUs may require quota requests and regional availability checks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tools needed<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Portal access.<\/li>\n<li>Azure CLI (recommended) for cleanup and verification: https:\/\/learn.microsoft.com\/cli\/azure\/install-azure-cli<\/li>\n<li>SSH client (e.g., OpenSSH).<\/li>\n<li>Optional: Git for managing templates\/scripts.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure CycleCloud Server can be deployed where the Marketplace offer is available and where required VM sizes are available.<\/li>\n<li>Choose a region that supports:\n<ul class=\"wp-block-list\">\n<li>Your desired VM families (for example Dv5, Ev5, HB, HC, or ND)<\/li>\n<li>Availability Zones, if needed<\/li>\n<\/ul>\n<\/li>\n<li>Always verify region + SKU availability before committing.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>vCPU quotas<\/strong> per VM family and per region can block cluster scaling.<\/li>\n<li>Public IP, NIC, and storage limits can also matter at scale.<\/li>\n<li>Check quotas: Azure Portal \u2192 Subscriptions \u2192 Usage + quotas.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services and design decisions<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Before hands-on work, decide:\n&#8211; Networking: new VNet for lab, or use existing.\n&#8211; Access model: public IP for quick lab (less secure) vs Bastion\/private (recommended for production).\n&#8211; Storage: for a small lab, you can often start without a complex shared filesystem, but many HPC workflows require shared home\/work directories.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing model (accurate and practical)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud is typically <strong>not billed as a standalone metered Azure service<\/strong> in the way many PaaS services are. In most deployments, you pay for the <strong>underlying Azure resources<\/strong> you deploy to run CycleCloud and the clusters it orchestrates:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Primary cost dimensions:\n1. <strong>CycleCloud Server VM<\/strong>\n   &#8211; VM compute (hours), OS disk, and associated networking (public IP if used).\n2. 
<strong>Head\/login node VM<\/strong>\n   &#8211; Compute hours + disk + networking.\n3. <strong>Compute nodes<\/strong>\n   &#8211; The main cost driver: VM hours across all scaled nodes.\n4. <strong>Storage<\/strong>\n   &#8211; Managed disks (OS and data disks)\n   &#8211; Shared filesystem service (varies by design)\n   &#8211; Backup storage (if configured)\n5. <strong>Networking<\/strong>\n   &#8211; Outbound data transfer (egress) from Azure\n   &#8211; NAT Gateway, load balancers (if used)\n   &#8211; Inter-region traffic if multi-region\n6. <strong>Monitoring<\/strong>\n   &#8211; Log Analytics ingestion and retention\n   &#8211; Azure Monitor features you enable\n7. <strong>Optional software licensing<\/strong>\n   &#8211; Some HPC applications and some scheduler components may have separate licensing (BYOL). This is workload-specific.<\/p>\n\n\n\n<blockquote>\n<p>Verify current billing specifics for the CycleCloud Marketplace image\/offer you select (if any charges apply) in the Azure Marketplace listing and official docs:\n&#8211; Official documentation: https:\/\/learn.microsoft.com\/azure\/cyclecloud\/\n&#8211; Azure Pricing Calculator: https:\/\/azure.microsoft.com\/pricing\/calculator\/\n&#8211; VM pricing: https:\/\/azure.microsoft.com\/pricing\/details\/virtual-machines\/\n&#8211; Storage pricing: https:\/\/azure.microsoft.com\/pricing\/details\/storage\/<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>There is no universal \u201cfree tier\u201d for HPC clusters; even a tiny lab uses billable VMs and disks.<\/li>\n<li>Some Azure accounts include credits (e.g., Visual Studio subscriptions). 
That\u2019s account-dependent.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers (what makes bills spike)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Compute node hours<\/strong>: the number of nodes * hours running.<\/li>\n<li><strong>Idle nodes<\/strong>: misconfigured autoscale or scheduler settings can keep nodes running.<\/li>\n<li><strong>Large or premium storage<\/strong>: high-performance storage can cost more than compute in some cases.<\/li>\n<li><strong>Egress<\/strong>: moving data out of Azure can be expensive.<\/li>\n<li><strong>Over-provisioned head nodes<\/strong>: head nodes often run 24\/7; size them appropriately.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden\/indirect costs to plan for<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Public IPs<\/strong> (small cost but often overlooked).<\/li>\n<li><strong>Log ingestion<\/strong> when forwarding verbose scheduler logs.<\/li>\n<li><strong>Image build pipelines<\/strong> (compute time for Packer\/VM image builds).<\/li>\n<li><strong>Data staging<\/strong>: repeated downloads of application datasets if you don\u2019t centralize storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost (practical tactics)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use autoscaling with conservative <strong>idle timeout<\/strong> settings.<\/li>\n<li>Prefer <strong>ephemeral clusters<\/strong> for short projects: create, run, delete.<\/li>\n<li>Right-size the <strong>head node<\/strong>; it rarely needs HPC-scale compute.<\/li>\n<li>Use <strong>reserved instances<\/strong> or savings plans for steady baseline nodes (head\/login, long-running partitions).<\/li>\n<li>Consider <strong>spot VMs<\/strong> for interruptible workloads (verify suitability and template support).<\/li>\n<li>Minimize egress by keeping post-processing in Azure or using the same region for dependent services.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example 
low-cost starter estimate (model, not numbers)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For a small lab you can estimate using the pricing calculator:\n&#8211; 1 small VM for CycleCloud Server (runs while you\u2019re using the lab)\n&#8211; 1 small VM for head node\n&#8211; 0\u20132 small compute nodes for brief testing\n&#8211; Standard SSD OS disks\n&#8211; Minimal logging retention<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because prices vary by region and VM family, use the calculator to plug in:\n&#8211; VM size\n&#8211; Hours per month (or per day)\n&#8211; Disk size and type\n&#8211; Expected outbound data<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For production HPC:\n&#8211; Compute nodes may scale into hundreds or thousands of vCPUs.\n&#8211; Storage and data throughput (IOPS\/GB\/s) can become the dominant cost driver.\n&#8211; Monitoring and security tooling (SIEM integration, long retention) adds cost.\n&#8211; If you run multi-tenant clusters, include chargeback tagging and budget alerts.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This lab builds a <strong>small, low-cost Azure CycleCloud environment<\/strong> and deploys a <strong>minimal scheduler-based cluster<\/strong>. The exact template names and steps can differ by Azure CycleCloud version and the templates available in your environment. 
Where UI text or template catalogs vary, this lab tells you what to look for and where to verify in official docs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Official docs start here: https:\/\/learn.microsoft.com\/azure\/cyclecloud\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Deploy Azure CycleCloud Server in Azure, create a small cluster (e.g., Slurm-based), run a simple job, validate autoscaling basics, and then clean up to avoid ongoing charges.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will:\n1. Create a resource group and networking.\n2. Deploy the Azure CycleCloud Server VM (Marketplace).\n3. Perform initial CycleCloud setup and configure Azure credentials for provisioning.\n4. Import\/select an HPC cluster template (commonly Slurm) and deploy a small cluster.\n5. SSH to the head node and submit a simple job.\n6. Validate nodes and job execution.\n7. Delete resources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a resource group and basic network (Azure CLI)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why this matters:<\/strong> keeping everything in one resource group makes cleanup easy and reduces the chance of orphaned resources.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1) Sign in and set your subscription:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az login\naz account set --subscription \"&lt;YOUR_SUBSCRIPTION_ID_OR_NAME&gt;\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">2) Create a resource group:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export LOCATION=\"eastus\"\nexport RG=\"rg-cyclecloud-lab\"\n\naz group create \\\n  --name \"$RG\" \\\n  --location \"$LOCATION\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">3) Create a VNet and subnet:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export VNET=\"vnet-cyclecloud-lab\"\nexport SUBNET=\"subnet-hpc\"\n\naz network vnet create \\\n 
 --resource-group \"$RG\" \\\n  --name \"$VNET\" \\\n  --address-prefixes 10.10.0.0\/16 \\\n  --subnet-name \"$SUBNET\" \\\n  --subnet-prefixes 10.10.1.0\/24\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> a resource group with a VNet and subnet that can host the CycleCloud Server and cluster nodes.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Deploy the Azure CycleCloud Server VM (Azure Portal)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why portal here:<\/strong> Marketplace deployments often require accepting terms and selecting plan details that are easiest to confirm in the Portal.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1) Go to Azure Portal: https:\/\/portal.azure.com<br\/>\n2) Search for <strong>Azure CycleCloud<\/strong> in <strong>Marketplace<\/strong>.\n3) Choose the official <strong>Azure CycleCloud<\/strong> offer (publisher should be Microsoft\/Azure-related; verify in listing).\n4) Click <strong>Create<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">During creation, choose:\n&#8211; <strong>Resource group:<\/strong> <code>rg-cyclecloud-lab<\/code>\n&#8211; <strong>Region:<\/strong> same as your VNet (e.g., East US)\n&#8211; <strong>Virtual network:<\/strong> <code>vnet-cyclecloud-lab<\/code>\n&#8211; <strong>Subnet:<\/strong> <code>subnet-hpc<\/code>\n&#8211; <strong>Authentication:<\/strong> SSH public key (recommended) or password (not recommended)\n&#8211; <strong>Inbound ports:<\/strong> For a lab, you may allow HTTPS to the CycleCloud UI from your IP.\n  &#8211; Prefer: restrict source to your public IP\n  &#8211; Production: do not expose publicly; use Bastion\/VPN\/Private access<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">5) Review + create.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> a running VM acting as the CycleCloud Server, visible in the resource group.<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; Azure Portal \u2192 Resource group \u2192 Virtual machines \u2192 CycleCloud VM is <strong>Running<\/strong>\n&#8211; Note the VM\u2019s private IP and (if used) public IP or DNS name.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Access the CycleCloud UI and complete initial setup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">1) From your workstation, browse to the CycleCloud UI endpoint:\n&#8211; If public IP was assigned: <code>https:\/\/&lt;PUBLIC_IP_OR_DNS&gt;<\/code>\n&#8211; If private-only: connect via Bastion\/jumpbox first, or use port forwarding.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2) Complete the initial wizard:\n&#8211; Create the first admin user (store credentials securely).\n&#8211; Review any license\/terms prompts shown in the UI.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> you can log in to Azure CycleCloud and reach the dashboard.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; You can log out\/in successfully.\n&#8211; You can see a page for clusters\/templates\/projects (UI varies by version).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Grant Azure permissions for cluster provisioning (identity setup)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">CycleCloud needs Azure permissions to create VMs, NICs, disks, and related resources. 
The exact configuration can be done using:\n&#8211; A <strong>service principal<\/strong> (app registration) with a client secret\/cert, or\n&#8211; A <strong>managed identity<\/strong> (in some architectures), or\n&#8211; Another supported credential method documented by Microsoft.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because identity setup details are environment-specific, follow the official instructions for \u201cconfigure Azure credentials\u201d in CycleCloud docs:\n&#8211; https:\/\/learn.microsoft.com\/azure\/cyclecloud\/ (navigate to configuration\/credentials sections)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A common approach (service principal) looks like this conceptually:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1) Create a service principal scoped to your lab resource group:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export SP_NAME=\"sp-cyclecloud-lab\"\naz ad sp create-for-rbac \\\n  --name \"$SP_NAME\" \\\n  --role \"Contributor\" \\\n  --scopes \"\/subscriptions\/$(az account show --query id -o tsv)\/resourceGroups\/$RG\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">2) Capture:\n&#8211; appId (client ID)\n&#8211; password (client secret)\n&#8211; tenant<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">3) In CycleCloud UI, add an Azure \u201ccloud account\u201d \/ credentials entry using those values.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> CycleCloud can successfully validate Azure credentials and list\/create resources.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; In the CycleCloud UI, the Azure account\/credentials show as <strong>valid<\/strong> or <strong>connected<\/strong>.\n&#8211; If the UI provides a \u201ctest\u201d action, run it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Common error &amp; fix:<\/strong>\n&#8211; <strong>Error:<\/strong> Authorization failed \/ insufficient privileges<br\/>\n<strong>Fix:<\/strong> ensure the 
service principal has Contributor on the RG (or necessary granular roles) and that the subscription\/tenant values match.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Import or select a scheduler cluster template (e.g., Slurm)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">CycleCloud typically uses templates for schedulers and reference clusters. The exact process varies:\n&#8211; Some environments include built-in templates.\n&#8211; Others require importing templates (sometimes from official repositories).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Do this in the CycleCloud UI:\n1) Navigate to <strong>Templates<\/strong> (or equivalent section).\n2) Find the template for your target scheduler (commonly <strong>Slurm<\/strong> in many HPC environments).\n3) If templates must be imported, follow the official template guidance for your CycleCloud version (verify in docs).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> the scheduler template is available and selectable when creating a cluster.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; You can see a template entry with parameters for head node, compute nodes, VM types, and scaling.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Create a small cluster (minimize cost)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create a cluster with conservative sizing:\n&#8211; Head node VM size: small general-purpose VM (e.g., D-series)\n&#8211; Compute nodes: start with <strong>0<\/strong> minimum, small maximum (e.g., 1\u20132 nodes)\n&#8211; Networking: use your lab VNet\/subnet\n&#8211; Public IPs: prefer none for compute nodes; for head node, use private and SSH via bastion\/jumpbox where possible<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In CycleCloud UI:\n1) Click <strong>Create Cluster<\/strong>.\n2) Choose your scheduler template.\n3) Set:\n   &#8211; 
Cluster name: <code>slurm-lab<\/code> (example)\n   &#8211; Resource group: <code>rg-cyclecloud-lab<\/code> (or a separate RG if your governance requires it)\n   &#8211; VM sizes: choose low-cost sizes supported in your region\n   &#8211; Max nodes: 1 or 2 for this lab\n   &#8211; Idle timeout: short (to scale down quickly)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">4) Click <strong>Start<\/strong> (or <strong>Create and Start<\/strong>, depending on UI).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> CycleCloud provisions the head node and the cluster reaches a \u201crunning\u201d state.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; In CycleCloud UI, cluster shows <strong>Running<\/strong> or <strong>Started<\/strong>.\n&#8211; Head node appears healthy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Common error &amp; fix:<\/strong>\n&#8211; <strong>Error:<\/strong> Quota exceeded<br\/>\n<strong>Fix:<\/strong> choose a smaller VM size or request quota increase for the VM family\/region.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: SSH to the head node and run a test job<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">1) Get the head node IP:\n&#8211; From CycleCloud UI cluster view, locate the head node details.\n&#8211; Or in Azure Portal, find the head node VM and check private IP (or public if used).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2) SSH to head node:<\/p>\n\n\n\n<pre><code class=\"language-bash\">ssh &lt;USERNAME&gt;@&lt;HEAD_NODE_IP&gt;\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">3) Verify scheduler commands exist (example for Slurm):<\/p>\n\n\n\n<pre><code class=\"language-bash\">which sbatch || true\nwhich srun || true\nwhich sinfo || true\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">4) Submit a simple job (Slurm example):<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; hello.sbatch 
&lt;&lt;'EOF'\n#!\/bin\/bash\n#SBATCH --job-name=hello\n#SBATCH --output=hello-%j.out\n#SBATCH --time=00:02:00\n#SBATCH --ntasks=1\n\nhostname\ndate\necho \"Hello from Slurm on Azure CycleCloud\"\nEOF\n\nsbatch hello.sbatch\nsqueue\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong>\n&#8211; Job is accepted.\n&#8211; If compute nodes are at 0, autoscaling should add a compute node and then run the job (timing varies).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">5) After the job completes:<\/p>\n\n\n\n<pre><code class=\"language-bash\">squeue\nls -l hello-*.out\ncat hello-*.out\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; Output file exists and includes hostname\/date and the hello message.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use these checks:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1) CycleCloud UI:\n&#8211; Cluster status is healthy.\n&#8211; Compute nodes scale up when a job is queued (if autoscaling is configured).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2) Head node:\n&#8211; Scheduler reports nodes (Slurm example):<\/p>\n\n\n\n<pre><code class=\"language-bash\">sinfo\nscontrol show nodes | head\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">3) Azure Portal:\n&#8211; You can see compute VMs created when work is queued.\n&#8211; After the idle timeout, compute nodes deallocate\/terminate according to policy.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Common issues and practical fixes:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">1) <strong>Cannot reach CycleCloud UI<\/strong>\n&#8211; Check NSG rules: allow HTTPS (443) from your IP (lab) or only from Bastion\/jump network (production).\n&#8211; Confirm the CycleCloud Server VM is running.\n&#8211; If using private 
access: ensure you are connected via VPN\/ExpressRoute\/Bastion.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2) <strong>Cluster creation fails immediately<\/strong>\n&#8211; Check CycleCloud \u201cevents\u201d or logs in the UI.\n&#8211; Verify credentials (service principal\/managed identity).\n&#8211; Confirm the target subnet has enough IP space.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">3) <strong>Nodes provision but never join scheduler<\/strong>\n&#8211; DNS or outbound connectivity issues can break bootstrap installs.\n&#8211; Verify head node can resolve names and reach package repositories (or configure internal mirrors).\n&#8211; Check bootstrap logs on the node (location depends on OS\/template\u2014verify in template docs).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">4) <strong>Autoscaling doesn\u2019t add compute nodes<\/strong>\n&#8211; Job requests may not match node configuration (e.g., requesting GPU when no GPU nodes exist).\n&#8211; Partition\/queue mismatch.\n&#8211; Idle settings or max node limits set too low.\n&#8211; Verify the template\u2019s autoscale configuration and scheduler integration.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">5) <strong>Quota exceeded \/ SKU not available<\/strong>\n&#8211; Switch to a more available VM size.\n&#8211; Change region.\n&#8211; Request quota increases early for production.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To avoid ongoing charges, delete the lab resource group. 
This removes the CycleCloud Server VM, head\/compute nodes, disks, and most dependent resources created inside the RG.<\/p>\n\n\n\n<pre><code class=\"language-bash\">az group delete --name \"$RG\" --yes --no-wait\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>If you created a service principal for the lab<\/strong>, delete it too:<\/p>\n\n\n\n<pre><code class=\"language-bash\"># Look up the appId if you didn\u2019t save it\nAPP_ID=$(az ad sp list --display-name \"$SP_NAME\" --query \"[0].appId\" -o tsv)\n\n# Then delete by appId\naz ad sp delete --id \"$APP_ID\"\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> Resource group deletion completes and no CycleCloud-related resources remain in that RG. Because of <code>--no-wait<\/code>, deletion continues in the background; <code>az group exists --name \"$RG\"<\/code> returns <code>false<\/code> once it has finished.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Separate management from compute:<\/strong> place CycleCloud Server and head node in a management subnet; compute nodes in a dedicated compute subnet.<\/li>\n<li><strong>Use hub-and-spoke networking<\/strong> for enterprise: shared DNS, security tooling, and egress control in hub; clusters in spokes.<\/li>\n<li><strong>Plan shared storage early:<\/strong> many HPC workloads need POSIX-like semantics; validate performance and locking requirements before choosing storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>least privilege<\/strong> for the identity CycleCloud uses to create resources:<\/li>\n<li>Prefer scoped permissions (resource group) for dev\/test.<\/li>\n<li>For production, consider custom roles with only required actions (verify required actions in docs).<\/li>\n<li>Restrict UI and SSH access:<\/li>\n<li>No public IPs for compute nodes.<\/li>\n<li>Limit head\/login node 
exposure.<\/li>\n<li>Use Bastion\/VPN and NSGs with source IP restrictions.<\/li>\n<li>Rotate secrets (service principal secrets, SSH keys) and store them securely.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Set conservative <strong>min nodes<\/strong> (often 0) and sensible <strong>max nodes<\/strong>.<\/li>\n<li>Use autoscale policies with <strong>idle timeout<\/strong> and <strong>graceful drain<\/strong> behavior (scheduler-specific).<\/li>\n<li>Enforce <strong>tagging<\/strong> (owner, cost center, environment, project) with Azure Policy.<\/li>\n<li>Add <strong>budgets and alerts<\/strong> in Azure Cost Management for HPC subscriptions\/resource groups.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Select VM SKUs appropriate to workload:<\/li>\n<li>CPU-bound vs memory-bound vs network\/MPI-bound.<\/li>\n<li>For MPI-heavy workloads, ensure:<\/li>\n<li>VM SKUs with suitable network performance<\/li>\n<li>Placement and topology considerations (verify Azure HPC guidance)<\/li>\n<li>Minimize bootstrap time:<\/li>\n<li>Use custom images<\/li>\n<li>Cache packages<\/li>\n<li>Avoid large downloads at scale-out time<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Treat head node and CycleCloud Server as critical:<\/li>\n<li>Backup configurations\/templates<\/li>\n<li>Use disciplined change management<\/li>\n<li>Use Availability Sets\/Zones where appropriate and supported (verify patterns and tradeoffs).<\/li>\n<li>Validate failure modes: what happens to running jobs if head node reboots?<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Centralize logs: scheduler logs, system logs, and provisioning events into Log Analytics.<\/li>\n<li>Create runbooks 
for:<\/li>\n<li>Cluster start\/stop<\/li>\n<li>Node drain\/replace<\/li>\n<li>Scheduler upgrades<\/li>\n<li>Image updates and rollback<\/li>\n<li>Implement patch strategy for OS and critical packages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Naming convention example:<\/li>\n<li><code>cc-&lt;env&gt;-&lt;region&gt;-&lt;team&gt;<\/code> for CycleCloud server<\/li>\n<li><code>hpc-&lt;project&gt;-&lt;env&gt;-head<\/code><\/li>\n<li><code>hpc-&lt;project&gt;-&lt;env&gt;-compute<\/code><\/li>\n<li>Tags:<\/li>\n<li><code>Owner<\/code>, <code>CostCenter<\/code>, <code>Project<\/code>, <code>Environment<\/code>, <code>DataSensitivity<\/code>, <code>ExpirationDate<\/code><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure RBAC<\/strong> controls resource creation and changes.<\/li>\n<li><strong>CycleCloud UI<\/strong> has its own access control model (verify exact auth and RBAC options in the version you deploy).<\/li>\n<li>For automation, use:<\/li>\n<li>Service principals with scoped permissions, or<\/li>\n<li>Managed identities where supported and appropriate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>At rest:<\/strong> Azure managed disks are encrypted by default (platform-managed keys); CMK options exist depending on disk\/storage type.<\/li>\n<li><strong>In transit:<\/strong> use HTTPS for UI and SSH for node access.<\/li>\n<li>For shared storage, ensure encryption in transit is enabled where supported.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid public IPs for compute nodes.<\/li>\n<li>Prefer private IP access and 
controlled admin entry points (Bastion\/VPN).<\/li>\n<li>Use NSGs to restrict:<\/li>\n<li>HTTPS (443) to CycleCloud UI from admin networks only<\/li>\n<li>SSH (22) to head\/login from admin networks only<\/li>\n<li>Control outbound egress (NAT Gateway, firewall) to reduce exfiltration risk and improve auditability.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Don\u2019t hardcode secrets in templates or bootstrap scripts.<\/li>\n<li>Use <strong>Azure Key Vault<\/strong> for storing secrets where possible, but design carefully:<\/li>\n<li>Ensure cluster nodes can reach Key Vault endpoints (private endpoints if required).<\/li>\n<li>Use managed identity on nodes if your design supports it (verify).<\/li>\n<li>Rotate and audit credentials.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable:<\/li>\n<li>Azure Activity Log forwarding (resource operations)<\/li>\n<li>Log Analytics for OS and scheduler logs<\/li>\n<li>NSG flow logs (if required)<\/li>\n<li>Ensure logs are retained per policy and protected from tampering (e.g., centralized workspace with RBAC).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: keep cluster, storage, and dependent services in approved regions.<\/li>\n<li>Access controls: enforce MFA and privileged access workflows for admins.<\/li>\n<li>Vulnerability management: patch OS images and track CVEs affecting scheduler stack.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Exposing CycleCloud UI to the internet with weak authentication.<\/li>\n<li>Leaving SSH open to <code>0.0.0.0\/0<\/code>.<\/li>\n<li>Using a high-privilege service principal at subscription scope for convenience.<\/li>\n<li>Allowing unrestricted outbound internet without 
logging\/controls.<\/li>\n<li>Not tagging resources, leading to unknown ownership and abandoned clusters.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Private cluster design (no public IPs), access via Bastion\/VPN.<\/li>\n<li>Separate resource groups for management vs compute if governance requires.<\/li>\n<li>Use Azure Policy to enforce:<\/li>\n<li>Required tags<\/li>\n<li>Allowed VM SKUs\/regions<\/li>\n<li>Deny public IP creation except approved cases<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. Limitations and Gotchas<\/h2>\n\n\n\n<blockquote>\n<p>These are common real-world issues. For authoritative limits, always verify in official docs and Azure service quotas.<\/p>\n<\/blockquote>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Not a fully managed service:<\/strong> you manage the CycleCloud Server VM, OS patching, and scheduler components.<\/li>\n<li><strong>Quota constraints:<\/strong> HPC scaling is often blocked by vCPU quotas per VM family\/region.<\/li>\n<li><strong>SKU availability:<\/strong> desired HPC SKUs may be unavailable in some regions or may require capacity planning.<\/li>\n<li><strong>Bootstrap fragility:<\/strong> scale-out relies on successful bootstrapping; locked-down networks often break package installs.<\/li>\n<li><strong>Shared storage complexity:<\/strong> HPC workflows often assume POSIX semantics; choosing storage that meets performance and locking needs is critical.<\/li>\n<li><strong>Autoscaling tuning:<\/strong> misconfigured policies can cause:<\/li>\n<li>Too many nodes (cost spike)<\/li>\n<li>Too few nodes (long queue wait)<\/li>\n<li>Thrashing (scale up\/down too often)<\/li>\n<li><strong>Head node as SPOF:<\/strong> unless you implement HA patterns supported by your scheduler and architecture, head node issues can disrupt scheduling.<\/li>\n<li><strong>Logging volume:<\/strong> 
scheduler and provisioning logs can be large; Log Analytics ingestion costs can surprise teams.<\/li>\n<li><strong>Template drift:<\/strong> unversioned template changes can break reproducibility; treat templates like code.<\/li>\n<li><strong>Networking for MPI:<\/strong> some workloads need specific network performance and topology; test at scale.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud is one option in Azure\u2019s Compute ecosystem for large-scale workloads. Alternatives vary depending on whether you want scheduler-managed VMs, managed batch, containers, or a DIY approach.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Azure CycleCloud<\/strong><\/td>\n<td>HPC clusters with schedulers (e.g., Slurm\/PBS), IaaS control<\/td>\n<td>Template-driven clusters, scheduler integration, autoscaling, Azure integration<\/td>\n<td>You manage server + scheduler; storage\/network complexity<\/td>\n<td>You need HPC scheduler workflows and elastic VM clusters<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Batch<\/strong><\/td>\n<td>Managed batch\/HTC workloads without managing scheduler VMs<\/td>\n<td>Managed service, job\/task model, simplified operations<\/td>\n<td>Different programming model than traditional HPC schedulers; less \u201cHPC admin\u201d feel<\/td>\n<td>You want managed orchestration and can adapt to Batch model<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Virtual Machine Scale Sets (DIY)<\/strong><\/td>\n<td>Custom VM fleets without HPC scheduler integration<\/td>\n<td>Full control, simple scaling mechanics<\/td>\n<td>You must build orchestration, scheduling, node config, and autoscale 
logic<\/td>\n<td>You have custom orchestration needs and strong automation capability<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Kubernetes Service (AKS)<\/strong><\/td>\n<td>Container-native compute platforms<\/td>\n<td>Strong ecosystem, GitOps, autoscaling<\/td>\n<td>Not a drop-in replacement for HPC schedulers; MPI and high-perf storage require expertise<\/td>\n<td>Your workloads are containerized and platform team runs Kubernetes<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS ParallelCluster<\/strong> (other cloud)<\/td>\n<td>HPC clusters on AWS<\/td>\n<td>HPC templates and scheduler integration<\/td>\n<td>Cloud-specific; migration effort<\/td>\n<td>You are standardized on AWS or need AWS-native integrations<\/td>\n<\/tr>\n<tr>\n<td><strong>Google Cloud Cluster Toolkit \/ HPC solutions<\/strong> (other cloud)<\/td>\n<td>HPC clusters on GCP<\/td>\n<td>Infrastructure blueprints for HPC<\/td>\n<td>Cloud-specific; migration effort<\/td>\n<td>You are standardized on GCP<\/td>\n<\/tr>\n<tr>\n<td><strong>Self-managed Slurm\/PBS on VMs<\/strong><\/td>\n<td>Full DIY HPC<\/td>\n<td>Maximum control<\/td>\n<td>Highest ops burden; scaling\/templating is on you<\/td>\n<td>You need bespoke architecture and accept operational load<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: regulated engineering simulation platform<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> An automotive supplier runs CAE simulations that spike during design milestones. 
On-prem HPC is saturated at quarter-end; workloads must stay private and compliant.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Hub-and-spoke network with private DNS, centralized logging, and controlled egress.<\/li>\n<li>Azure CycleCloud Server in a management subnet (private access only).<\/li>\n<li>Scheduler head\/login nodes in a secure subnet with restricted SSH.<\/li>\n<li>Compute nodes in a compute subnet, no public IPs, autoscaling from 0 to N.<\/li>\n<li>Shared storage designed for throughput and POSIX semantics (choose solution appropriate to workload; verify with HPC storage guidance).<\/li>\n<li>Azure Monitor + Log Analytics for OS\/scheduler logs; Activity Log forwarded to SIEM.<\/li>\n<li><strong>Why Azure CycleCloud was chosen:<\/strong><\/li>\n<li>Preserves scheduler-based workflow familiar to HPC users.<\/li>\n<li>Enables elastic scaling while keeping strict network controls.<\/li>\n<li>Template-driven deployments support governance and standardization.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Reduced queue time during peaks by bursting to Azure.<\/li>\n<li>Improved cost control via autoscaling and chargeback tags.<\/li>\n<li>More consistent cluster deployments across teams.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: elastic compute for parameter sweeps<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A small biotech startup runs thousands of independent parameter sweep jobs weekly. 
They need low ops overhead but still want scheduler-style job submission and autoscaling.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Single Azure CycleCloud Server VM (small size).<\/li>\n<li>One scheduler cluster template with max 20 nodes; min 0.<\/li>\n<li>Basic shared working directory; outputs stored in Azure storage.<\/li>\n<li>Budget alerts and forced TTL tags to delete stale resources.<\/li>\n<li><strong>Why Azure CycleCloud was chosen:<\/strong><\/li>\n<li>Faster than hand-rolling VM orchestration.<\/li>\n<li>Autoscaling reduces idle compute costs.<\/li>\n<li>Works well with Linux-based HPC tooling and scripts.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Repeatable environment for pipelines.<\/li>\n<li>Lower compute spend due to scaling down between runs.<\/li>\n<li>Ability to increase capacity quickly when experiments expand.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">1) <strong>Is Azure CycleCloud a fully managed Azure service?<\/strong><br\/>\nNo. Azure CycleCloud is typically deployed as software (often a VM from Marketplace) in your subscription. You manage the server VM, patching, and scheduler components.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2) <strong>What is the difference between Azure CycleCloud and Azure Batch?<\/strong><br\/>\nAzure CycleCloud focuses on deploying and operating <strong>scheduler-based VM clusters<\/strong> (common in HPC). Azure Batch is a <strong>managed batch processing service<\/strong> with a different job\/task model.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">3) <strong>Do I pay separately for Azure CycleCloud?<\/strong><br\/>\nIn many cases, costs are primarily for the underlying Azure resources (VMs, disks, networking, monitoring). 
Check the details of the Marketplace offer you deploy to confirm any additional charges.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">4) <strong>Which schedulers are supported?<\/strong><br\/>\nAzure CycleCloud provides built-in support for common HPC schedulers such as Slurm, OpenPBS, and HTCondor. The supported list can change, so verify it in the official docs for your version.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">5) <strong>Can I run GPU workloads with Azure CycleCloud?<\/strong><br\/>\nYes, if your template supports GPU node arrays and you select GPU-capable VM sizes. Ensure drivers and images are compatible with your chosen SKU and region.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">6) <strong>Can clusters be fully private (no public IPs)?<\/strong><br\/>\nYes. A recommended production approach is private networking with access via Bastion\/VPN\/ExpressRoute and restrictive NSGs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">7) <strong>How does autoscaling work?<\/strong><br\/>\nAutoscaling typically reacts to scheduler demand (queued jobs and requested resources) and provisions compute nodes accordingly, then deallocates or removes nodes after an idle timeout (scheduler\/template-dependent).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">8) <strong>What are the most common reasons cluster creation fails?<\/strong><br\/>\nInsufficient Azure permissions, exhausted quotas, VM SKUs that are unavailable in the target region, blocked outbound connectivity during bootstrap, or misconfigured networking (DNS\/routing\/NSGs).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">9) <strong>What storage should I use for shared home and scratch?<\/strong><br\/>\nIt depends on performance and POSIX requirements. Many HPC workloads require NFS-like semantics and high throughput. Validate options with Azure HPC storage guidance and your application requirements.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">10) <strong>Can I use custom images?<\/strong><br\/>\nYes. 
Custom images are often recommended to reduce bootstrap time and ensure consistent libraries\/drivers\/security agents\u2014maintain an image pipeline and rollback strategy.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">11) <strong>How do I control costs?<\/strong><br\/>\nAutoscale with min=0 where possible, use short idle timeouts, enforce max node counts, right-size head nodes, use tagging and budgets, and avoid unnecessary logging\/egress.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">12) <strong>Is Azure CycleCloud suitable for containerized workloads?<\/strong><br\/>\nIt can be used, but if your primary model is containers and Kubernetes, AKS may be a better fit. CycleCloud is typically chosen for scheduler-based HPC VM clusters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">13) <strong>How do I monitor clusters?<\/strong><br\/>\nUse Azure Monitor for VM metrics, Log Analytics for OS and scheduler logs, and Azure Activity Log for provisioning operations. Decide what to ingest to control cost.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">14) <strong>Can I manage multiple clusters with one CycleCloud Server?<\/strong><br\/>\nOften yes. Capacity depends on server sizing and operational practices. Treat the server as critical shared infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">15) <strong>What is the recommended way to learn Azure CycleCloud?<\/strong><br\/>\nStart with Microsoft\u2019s official documentation and a small lab cluster, then learn scheduler fundamentals (Slurm\/PBS), Azure networking for private clusters, and storage design for HPC.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Azure CycleCloud<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official Documentation<\/td>\n<td>Azure CycleCloud documentation \u2014 https:\/\/learn.microsoft.com\/azure\/cyclecloud\/<\/td>\n<td>Authoritative guidance on installation, configuration, templates, and operations<\/td>\n<\/tr>\n<tr>\n<td>Marketplace Listing<\/td>\n<td>Azure Marketplace (search \u201cAzure CycleCloud\u201d) \u2014 https:\/\/azuremarketplace.microsoft.com\/<\/td>\n<td>Shows deployment options, plan details, and any offer-specific terms<\/td>\n<\/tr>\n<tr>\n<td>Pricing Calculator<\/td>\n<td>Azure Pricing Calculator \u2014 https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/td>\n<td>Build realistic estimates for VM, storage, and networking costs<\/td>\n<\/tr>\n<tr>\n<td>VM Pricing<\/td>\n<td>Virtual Machines pricing \u2014 https:\/\/azure.microsoft.com\/pricing\/details\/virtual-machines\/<\/td>\n<td>Understand compute cost drivers by VM family and region<\/td>\n<\/tr>\n<tr>\n<td>Storage Pricing<\/td>\n<td>Azure Storage pricing \u2014 https:\/\/azure.microsoft.com\/pricing\/details\/storage\/<\/td>\n<td>Plan shared storage and data costs<\/td>\n<\/tr>\n<tr>\n<td>Monitoring Docs<\/td>\n<td>Azure Monitor documentation \u2014 https:\/\/learn.microsoft.com\/azure\/azure-monitor\/<\/td>\n<td>Implement metrics\/logging for clusters and nodes<\/td>\n<\/tr>\n<tr>\n<td>Governance<\/td>\n<td>Azure Policy documentation \u2014 https:\/\/learn.microsoft.com\/azure\/governance\/policy\/<\/td>\n<td>Enforce tags, allowed SKUs, and security guardrails for HPC environments<\/td>\n<\/tr>\n<tr>\n<td>Identity<\/td>\n<td>Azure RBAC documentation \u2014 https:\/\/learn.microsoft.com\/azure\/role-based-access-control\/<\/td>\n<td>Apply least privilege and secure automation identities<\/td>\n<\/tr>\n<tr>\n<td>Architecture Guidance<\/td>\n<td>Azure 
Architecture Center \u2014 https:\/\/learn.microsoft.com\/azure\/architecture\/<\/td>\n<td>Reference architectures and design principles (useful for landing zones and governance)<\/td>\n<\/tr>\n<tr>\n<td>HPC Overview<\/td>\n<td>Azure high performance computing (HPC) resources (verify current entry points) \u2014 https:\/\/learn.microsoft.com\/azure\/<\/td>\n<td>Helps with VM selection, networking, and storage considerations for HPC on Azure<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, SREs, platform teams<\/td>\n<td>DevOps practices, cloud operations, automation fundamentals relevant to running HPC platforms<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>DevOps\/SCM learning paths that support infrastructure automation skills<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud operations teams<\/td>\n<td>Cloud operations practices, monitoring, governance, cost control<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, operations engineers<\/td>\n<td>Reliability engineering practices applicable to HPC platforms<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams exploring AIOps<\/td>\n<td>Monitoring, automation, and operational analytics concepts<\/td>\n<td>Check 
website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content (verify exact offerings)<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>https:\/\/www.rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training (tools and practices)<\/td>\n<td>DevOps engineers, admins<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>DevOps consulting\/training marketplace style resource (verify)<\/td>\n<td>Teams needing hands-on guidance<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support\/training resource (verify)<\/td>\n<td>Operations teams and engineers<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company Name<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify service catalog)<\/td>\n<td>Architecture, implementation support, operations processes<\/td>\n<td>Landing zone alignment, network design review, automation pipelines for cluster templates<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps and cloud consulting\/training services<\/td>\n<td>Skills enablement and implementation guidance<\/td>\n<td>Operational runbooks, monitoring\/logging strategy, IaC adoption around cluster deployments<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify service catalog)<\/td>\n<td>CI\/CD, automation, operational best practices<\/td>\n<td>Cost governance setup, RBAC hardening, deployment automation for Azure HPC environments<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Azure CycleCloud<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Azure fundamentals<\/strong><ul class=\"wp-block-list\">\n<li>Resource groups, VNets\/subnets, NSGs, managed disks<\/li>\n<li>Azure RBAC, managed identities, service principals<\/li>\n<\/ul>\n<\/li>\n<li><strong>Linux administration<\/strong><ul class=\"wp-block-list\">\n<li>SSH, systemd, package management, logs, network troubleshooting<\/li>\n<\/ul>\n<\/li>\n<li><strong>HPC basics<\/strong><ul class=\"wp-block-list\">\n<li>Scheduler concepts: queues\/partitions, nodes, job submission, backfill<\/li>\n<li>MPI fundamentals if running tightly coupled workloads<\/li>\n<\/ul>\n<\/li>\n<li><strong>Infrastructure as Code and automation<\/strong><ul class=\"wp-block-list\">\n<li>Azure CLI, scripting<\/li>\n<li>(Optional) Terraform\/Bicep for repeatable deployments<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after Azure CycleCloud<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Advanced scheduler administration (fairshare, reservations, accounting)<\/li>\n<li>Image pipelines for HPC nodes (Packer, Azure Image Builder)<\/li>\n<li>Storage performance engineering (IO patterns, throughput vs IOPS, caching)<\/li>\n<li>Observability at scale (Log Analytics cost control, dashboards, alerting)<\/li>\n<li>Network performance tuning for MPI-capable workloads<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>HPC Cloud Architect<\/li>\n<li>Cloud Platform Engineer (HPC)<\/li>\n<li>DevOps Engineer supporting compute platforms<\/li>\n<li>SRE\/Operations Engineer for research compute<\/li>\n<li>Computational infrastructure engineer<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (Azure)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">No CycleCloud-specific certification is commonly referenced as a standalone credential. 
A practical path is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with <strong>Azure fundamentals<\/strong> certifications\/learning paths.<\/li>\n<li>Then focus on <strong>Azure Administrator<\/strong> \/ <strong>Azure Solutions Architect<\/strong> skills.<\/li>\n<li>Add Linux and HPC scheduler expertise as a specialization.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Verify current Microsoft certification offerings: https:\/\/learn.microsoft.com\/credentials\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Build a private CycleCloud environment with Bastion-only access.<\/li>\n<li>Implement autoscaling policies and measure queue time vs cost.<\/li>\n<li>Create a custom VM image with preinstalled libraries and compare node \u201ctime to ready.\u201d<\/li>\n<li>Build a tagging + budget + alerting framework for HPC resource groups.<\/li>\n<li>Centralize scheduler logs into Log Analytics and create operational dashboards.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. 
Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>HPC (High-Performance Computing):<\/strong> Compute workloads that depend on parallelism, high throughput, or low-latency interconnects.<\/li>\n<li><strong>Scheduler:<\/strong> Software that queues and assigns jobs to compute resources (e.g., Slurm, PBS).<\/li>\n<li><strong>Head node \/ Login node:<\/strong> The node users connect to for submitting jobs; scheduler control services typically run on the head node, which may also serve as the login node.<\/li>\n<li><strong>Compute node:<\/strong> Worker node that executes jobs.<\/li>\n<li><strong>Autoscaling:<\/strong> Automatically adding\/removing compute nodes based on demand\/policy.<\/li>\n<li><strong>Template:<\/strong> A repeatable cluster definition used to deploy consistent infrastructure and configuration.<\/li>\n<li><strong>VM SKU:<\/strong> The VM size\/family defining CPU, memory, disk, and network capabilities.<\/li>\n<li><strong>NSG (Network Security Group):<\/strong> A set of Azure rules that allow or deny network traffic at the subnet or NIC level.<\/li>\n<li><strong>VNet:<\/strong> Azure virtual network.<\/li>\n<li><strong>Egress:<\/strong> Outbound network traffic leaving Azure (often billed).<\/li>\n<li><strong>Quota:<\/strong> Azure limits on resources (e.g., vCPU per VM family per region).<\/li>\n<li><strong>Bootstrap:<\/strong> Initialization scripts\/tasks that configure a node at first boot.<\/li>\n<li><strong>Log Analytics:<\/strong> Azure service for log collection, querying, and retention (cost based on ingestion\/retention).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Azure CycleCloud is an Azure Compute-focused solution for deploying and operating <strong>scheduler-based HPC clusters<\/strong> on Azure infrastructure. 
It matters because it gives teams a practical path to run familiar HPC schedulers with <strong>repeatable templates<\/strong> and <strong>queue-driven autoscaling<\/strong>, while keeping control over VM images, networking, and storage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Cost-wise, the main drivers are <strong>compute node hours<\/strong>, storage performance choices, monitoring ingestion, and data egress. Security-wise, treat the CycleCloud Server and head node as critical assets: keep them private, enforce least privilege, and centralize audit and operational logs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use Azure CycleCloud when you need HPC scheduler workflows and elastic VM clusters; consider Azure Batch or AKS when you want managed batch or container-native orchestration instead. The best next step is to complete the hands-on lab, then deepen your scheduler, storage, and Azure networking knowledge using the official documentation: https:\/\/learn.microsoft.com\/azure\/cyclecloud\/<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Compute<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40,26],"tags":[],"class_list":["post-390","post","type-post","status-publish","format-standard","hentry","category-azure","category-compute"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/390","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=390"}],"version-history":[{"count":0,"href":"https:\/\/www
.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/390\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=390"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=390"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=390"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}