{"id":21,"date":"2026-04-12T13:30:44","date_gmt":"2026-04-12T13:30:44","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/alibaba-cloud-elastic-high-performance-computing-e-hpc-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-computing\/"},"modified":"2026-04-12T13:30:44","modified_gmt":"2026-04-12T13:30:44","slug":"alibaba-cloud-elastic-high-performance-computing-e-hpc-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-computing","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/alibaba-cloud-elastic-high-performance-computing-e-hpc-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-computing\/","title":{"rendered":"Alibaba Cloud Elastic High Performance Computing (E-HPC) Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Computing"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Computing<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Elastic High Performance Computing (E-HPC) is Alibaba Cloud\u2019s managed cluster service for running high-performance computing (HPC) workloads on elastic cloud infrastructure. It helps you build an HPC cluster faster by automating the setup of compute nodes, scheduler, shared storage, and core networking\u2014so you can focus on running jobs instead of assembling a cluster from scratch.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In simple terms: <strong>E-HPC creates and manages an HPC \u201cfarm\u201d of ECS instances<\/strong> (login\/manager + compute nodes) connected in a VPC, typically with shared storage (for example NAS), and a job scheduler (commonly Slurm; verify supported schedulers in the official docs). 
You submit batch jobs to the scheduler, and E-HPC helps the cluster scale, operate, and remain consistent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Technically, E-HPC orchestrates a set of Alibaba Cloud resources (primarily ECS, VPC, Security Groups, storage such as NAS\/OSS, and optionally GPUs\/fast networking depending on instance families and region). It provisions a cluster manager node, attaches compute nodes, configures scheduler services, configures user access (for example via SSH), and integrates with Alibaba Cloud identity and monitoring capabilities (for example via RAM\/CloudMonitor; verify exact integration points in your region and account).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>The problem it solves:<\/strong> building an HPC environment is traditionally slow and error-prone\u2014networking, scheduler configuration, image consistency, node lifecycle, shared storage mounts, and permissions must all be correct. E-HPC reduces that complexity and makes HPC clusters repeatable and elastic on Alibaba Cloud\u2019s Computing platform.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Elastic High Performance Computing (E-HPC)?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Elastic High Performance Computing (E-HPC) is an Alibaba Cloud service in the <strong>Computing<\/strong> category designed to <strong>create, manage, and operate HPC clusters<\/strong> on Alibaba Cloud infrastructure. Its goal is to provide a cluster experience (scheduler + nodes + shared storage + network) suitable for scientific computing, engineering simulation, EDA, rendering, and other parallel workloads.<\/p>\n\n\n\n<blockquote>\n<p>Service naming\/status note: The current service name used by Alibaba Cloud is <strong>Elastic High Performance Computing (E-HPC)<\/strong>. 
If you see older references to \u201cHPC Cluster\u201d or older HPC tooling, treat them as historical context and <strong>verify in official docs<\/strong> for the current workflow and supported components.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities (high level)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cluster provisioning<\/strong>: Create an HPC cluster with common topologies (manager\/login + compute nodes).<\/li>\n<li><strong>Scheduler-based batch execution<\/strong>: Submit jobs, manage queues\/partitions, allocate resources (CPU\/memory\/GPU where applicable).<\/li>\n<li><strong>Elasticity<\/strong>: Add\/remove compute capacity by changing node counts; many HPC solutions support autoscaling patterns\u2014<strong>verify E-HPC autoscaling options and limits in the official docs<\/strong>.<\/li>\n<li><strong>Shared storage integration<\/strong>: Provide shared directories for job input\/output and software stacks (commonly via NAS; OSS is frequently used for dataset staging\/archival).<\/li>\n<li><strong>Networking and access<\/strong>: Create in a VPC with security groups; provide controlled SSH access.<\/li>\n<li><strong>Operations support<\/strong>: Standardize images\/config, simplify node lifecycle, and integrate with Alibaba Cloud monitoring\/logging where applicable.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (conceptual)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>E-HPC Cluster<\/strong>: The managed grouping (metadata + lifecycle) that ties together nodes, scheduler, and associated resources.<\/li>\n<li><strong>Manager (or control) node<\/strong>: Runs scheduler controller services, cluster services, and often acts as the primary administration point.<\/li>\n<li><strong>Login node (sometimes separate)<\/strong>: Where users SSH to compile code, submit jobs, and manage data. 
(In small clusters, login and manager roles may be combined; verify templates\/options.)<\/li>\n<li><strong>Compute nodes<\/strong>: ECS instances that execute scheduled jobs.<\/li>\n<li><strong>Scheduler<\/strong>: HPC workload manager (commonly Slurm; E-HPC may support other schedulers\u2014verify).<\/li>\n<li><strong>Shared storage<\/strong>: NAS for POSIX-like shared filesystem, plus optional OSS for object storage workflows.<\/li>\n<li><strong>Networking<\/strong>: VPC, vSwitch(es), security groups, and optionally enhanced networking depending on instance families and region.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Service type and scope<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Service type<\/strong>: Managed orchestration\/cluster provisioning service on top of Alibaba Cloud IaaS (ECS\/VPC\/storage).<\/li>\n<li><strong>Scope<\/strong>: Practically <strong>regional<\/strong>\u2014you create clusters in a specific Alibaba Cloud region and attach them to VPC\/vSwitch resources in that region. 
Nodes run in one or more zones depending on how you design the VPC\/vSwitch layout (many HPC designs keep nodes in the same zone for predictable latency; verify best practice for your workload and region).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Alibaba Cloud ecosystem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">E-HPC is not a general-purpose compute platform by itself; it is a <strong>Computing orchestration layer<\/strong> that coordinates:\n&#8211; <strong>Elastic Compute Service (ECS)<\/strong> for compute nodes (CPU\/GPU instance families)\n&#8211; <strong>VPC<\/strong> for private networking and security group controls\n&#8211; <strong>Storage services<\/strong> such as <strong>Apsara File Storage NAS<\/strong> (shared filesystem) and <strong>Object Storage Service (OSS)<\/strong> (datasets\/archives)\n&#8211; <strong>CloudMonitor<\/strong> \/ logging services for operational visibility (verify exact integrations)\n&#8211; <strong>Resource Access Management (RAM)<\/strong> for identity and authorization around cluster creation and management<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Elastic High Performance Computing (E-HPC)?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster time-to-results<\/strong>: Stand up HPC clusters in hours\/minutes instead of days.<\/li>\n<li><strong>Elastic cost profile<\/strong>: Scale out for peaks and scale in after jobs complete, reducing the need for always-on capacity.<\/li>\n<li><strong>Standardization<\/strong>: Repeatable cluster configurations for teams (dev\/test\/prod).<\/li>\n<li><strong>Reduced operational overhead<\/strong>: Less bespoke scripting to maintain node configuration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scheduler-driven efficiency<\/strong>: Batch scheduling reduces contention and improves resource utilization for multi-user, multi-project environments.<\/li>\n<li><strong>HPC-friendly architecture<\/strong>: Align compute nodes, shared storage, and network design for parallel workloads.<\/li>\n<li><strong>Better reproducibility<\/strong>: Consistent images and shared environment across nodes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Lifecycle management<\/strong>: Create\/delete clusters cleanly; add\/remove compute nodes with lower risk.<\/li>\n<li><strong>Centralized administration<\/strong>: Manager\/login nodes provide an operational \u201ccenter of gravity\u201d.<\/li>\n<li><strong>Integration with Alibaba Cloud primitives<\/strong>: Use security groups, VPC routing, monitoring, and tagging.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Network isolation<\/strong>: Private VPC clusters minimize public exposure.<\/li>\n<li><strong>IAM boundaries<\/strong>: Use RAM policies to restrict who can create clusters, manage nodes, and access 
data.<\/li>\n<li><strong>Auditable infrastructure<\/strong>: Combine with ActionTrail (audit) and logging services (verify availability and integration).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scale-out compute<\/strong>: Use ECS instance families aligned to compute\/memory\/GPU needs.<\/li>\n<li><strong>Specialized instances<\/strong>: Select HPC-optimized instances (where available in your region; verify).<\/li>\n<li><strong>Short-lived fleets<\/strong>: Run large experiments temporarily rather than maintaining permanent clusters.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Choose Elastic High Performance Computing (E-HPC) on Alibaba Cloud when:\n&#8211; You have <strong>batch-oriented parallel workloads<\/strong> (MPI-style, parameter sweeps, simulation ensembles, rendering frames, EDA flows).\n&#8211; You need <strong>multi-user scheduling<\/strong> and fair-share\/queue controls.\n&#8211; You want <strong>repeatable clusters<\/strong> with controlled access and shared storage.\n&#8211; You can benefit from elasticity (bursty research cycles, seasonal compute, deadline-driven simulation).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should <em>not<\/em> choose it<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Avoid (or reconsider) E-HPC if:\n&#8211; Your workload is primarily <strong>stateless microservices<\/strong> or long-running web services (consider ACK\/Kubernetes, ECS autoscaling groups, or PaaS alternatives).\n&#8211; You need <strong>interactive, low-latency services<\/strong> rather than batch scheduling (you may still use HPC, but E-HPC may not be the primary tool).\n&#8211; You require complex hybrid scheduler customizations that must match an on-prem environment exactly (self-managed Slurm\/PBS may be more appropriate).\n&#8211; Your organization cannot 
accommodate HPC-style operations (user accounts, SSH workflows, shared POSIX storage patterns).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. Where is Elastic High Performance Computing (E-HPC) used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Life sciences<\/strong>: genomics pipelines, molecular dynamics, protein docking<\/li>\n<li><strong>Manufacturing\/CAE<\/strong>: finite element analysis (FEA), CFD simulation, structural mechanics<\/li>\n<li><strong>Media &amp; entertainment<\/strong>: offline rendering, transcoding farms (batch), VFX simulation<\/li>\n<li><strong>Energy<\/strong>: reservoir simulation, seismic processing<\/li>\n<li><strong>Finance<\/strong>: risk simulations, Monte Carlo workloads<\/li>\n<li><strong>Semiconductors<\/strong>: EDA flows (timing, simulation, verification)<\/li>\n<li><strong>Academia\/research<\/strong>: computational chemistry, climate, physics simulations<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Research engineering teams<\/li>\n<li>Platform\/infra teams supporting scientists and analysts<\/li>\n<li>DevOps\/SRE teams operating compute platforms for internal users<\/li>\n<li>Data science teams with compute-intensive training\/evaluation (some teams use HPC schedulers for GPU batch; verify GPU support patterns in E-HPC for your region)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Embarrassingly parallel<\/strong> parameter sweeps<\/li>\n<li><strong>MPI and tightly coupled parallel jobs<\/strong> (where network latency and instance selection matter)<\/li>\n<li><strong>Hybrid CPU\/GPU<\/strong> batch jobs (where supported)<\/li>\n<li><strong>Pipeline batch stages<\/strong> (pre-processing \u2192 compute \u2192 post-processing)<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>VPC-isolated cluster with NAT\/proxy for outbound traffic<\/li>\n<li>Shared filesystem + object storage staging<\/li>\n<li>Multi-queue scheduler partitions for dev\/test\/prod<\/li>\n<li>\u201cBurst\u201d cluster model that spins up for campaigns and deletes afterward<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Production vs dev\/test usage<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Dev\/test<\/strong>: small cluster, cheaper instance types, smaller shared storage, relaxed scheduling policies.<\/li>\n<li><strong>Production<\/strong>: dedicated VPC\/subnets, strict IAM, hardened images, multi-tenant access controls, strong monitoring, stable shared storage, and documented runbooks.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Below are realistic scenarios where Elastic High Performance Computing (E-HPC) fits well.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) CFD simulation campaign (CAE)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Run hundreds of CFD parameter variants and compare results.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Scheduler queues and batch submission simplify throughput; shared filesystem centralizes inputs\/outputs.<\/li>\n<li><strong>Example<\/strong>: Submit 500 jobs (mesh variations). 
Use separate partitions for small\/large runs and scale compute nodes during the campaign.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Monte Carlo risk simulations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Compute many independent trials across CPU cores.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: High-throughput batch scheduling with fair-share and quotas.<\/li>\n<li><strong>Example<\/strong>: Nightly VaR runs submit thousands of small jobs; cluster scales up after market close.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Genomics variant calling pipeline (batch stages)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Process large numbers of samples with repeated pipeline stages.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Parallelizes across samples; shared filesystem keeps reference data and sample outputs accessible.<\/li>\n<li><strong>Example<\/strong>: Stage FASTQ files from OSS \u2192 run alignment jobs \u2192 run variant calling.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Rendering farm for animation frames<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Render many frames reliably with predictable job control.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Job arrays and queue management suit frame rendering.<\/li>\n<li><strong>Example<\/strong>: Submit one job per frame; autoscale compute nodes; store frames in NAS or push outputs to OSS.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) EDA verification regression runs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Large regressions require consistent tool versions and distributed execution.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Shared software stack + scheduler policies.<\/li>\n<li><strong>Example<\/strong>: A nightly regression uses a dedicated queue and priority rules; outputs archived to OSS.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">6) Academic research burst cluster<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Grant-funded compute needs spikes around deadlines.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Temporary cluster creation and clean teardown.<\/li>\n<li><strong>Example<\/strong>: Spin up a 200-core cluster for a week, then delete it while keeping results in OSS.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Multi-tenant internal HPC platform (small enterprise)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Multiple teams need shared compute with isolation and quotas.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Scheduler partitions\/queues; controlled SSH access; VPC isolation.<\/li>\n<li><strong>Example<\/strong>: Separate partitions per department and chargeback based on scheduler accounting (verify scheduler accounting support and configuration).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Machine learning batch inference (GPU where applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Run batch inference jobs nightly\/weekly.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Batch scheduling and controlled GPU allocation (where supported).<\/li>\n<li><strong>Example<\/strong>: Submit inference jobs that read from OSS and write back results; scale GPU nodes only during inference windows.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Large-scale scientific simulation with checkpointing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Long-running simulations must checkpoint and restart reliably.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Shared storage for checkpoints; scheduler supports requeue\/restart patterns (depends on scheduler; verify).<\/li>\n<li><strong>Example<\/strong>: Job writes checkpoints to shared filesystem; if preempted or failed, restart from checkpoint.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">10) Benchmarking and performance testing of HPC codes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Compare performance across instance families and configurations.<\/li>\n<li><strong>Why E-HPC fits<\/strong>: Repeatable cluster provisioning; controlled test environment.<\/li>\n<li><strong>Example<\/strong>: Create cluster A with compute-optimized ECS and cluster B with memory-optimized ECS, run the same workload, compare.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Feature availability varies by region, scheduler, and account. Where a capability depends on options, this section uses \u201ccommonly\u201d and recommends checking the official docs.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">1) Managed HPC cluster provisioning<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Creates an HPC cluster composed of manager\/login and compute nodes, wired into a VPC and security groups.<\/li>\n<li><strong>Why it matters<\/strong>: Reduces manual steps and misconfiguration risk.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster setup; consistent cluster blueprint.<\/li>\n<li><strong>Caveats<\/strong>: You still pay for underlying resources (ECS\/NAS\/EIP). 
Quotas apply for ECS cores, vSwitch IPs, etc.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Scheduler integration (batch job management)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Provides a scheduler-based environment for job submission and resource allocation.<\/li>\n<li><strong>Why it matters<\/strong>: HPC workloads require structured scheduling and isolation.<\/li>\n<li><strong>Practical benefit<\/strong>: Queues\/partitions, job priority, job states, controlled concurrency.<\/li>\n<li><strong>Caveats<\/strong>: Supported schedulers and versions vary\u2014<strong>verify supported scheduler types in official docs<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Compute node lifecycle management<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Adds\/removes compute nodes attached to the cluster.<\/li>\n<li><strong>Why it matters<\/strong>: HPC demand is bursty; you rarely want fixed capacity.<\/li>\n<li><strong>Practical benefit<\/strong>: Scale out for a run, then scale in.<\/li>\n<li><strong>Caveats<\/strong>: Scaling behavior depends on cluster configuration and scheduler integration; verify whether E-HPC supports automatic scaling in your configuration or only manual scaling.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Shared storage integration (NAS\/OSS patterns)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Mounts shared storage for home directories, shared project space, and job I\/O; supports common data staging patterns with OSS.<\/li>\n<li><strong>Why it matters<\/strong>: HPC workflows require fast shared access to input data and stable output storage.<\/li>\n<li><strong>Practical benefit<\/strong>: Standard paths across nodes (for example <code>\/home<\/code>, <code>\/shared<\/code>).<\/li>\n<li><strong>Caveats<\/strong>: NAS throughput\/IOPS depends on NAS type and mount configuration; OSS is object storage (not 
POSIX) and typically used via tools\/SDKs rather than as a native filesystem unless you use additional components (verify supported approaches).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) VPC-based networking and security groups<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Keeps cluster nodes in a private network; controls inbound\/outbound rules.<\/li>\n<li><strong>Why it matters<\/strong>: HPC clusters often contain sensitive data and should not expose compute nodes directly to the internet.<\/li>\n<li><strong>Practical benefit<\/strong>: Reduced attack surface; controlled SSH entry points.<\/li>\n<li><strong>Caveats<\/strong>: If you enable public IPs or EIPs, you must harden SSH and restrict sources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Cluster access and user workflows (SSH-based)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Provides standard login to cluster nodes for job submission.<\/li>\n<li><strong>Why it matters<\/strong>: HPC users expect SSH, modules, compilers, and batch submission.<\/li>\n<li><strong>Practical benefit<\/strong>: Familiar workflow for engineering\/science teams.<\/li>\n<li><strong>Caveats<\/strong>: User account management patterns vary (local users, directory integration, etc.). 
<strong>Verify supported user management options<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Image consistency \/ software environment standardization<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Helps ensure nodes are created with consistent OS images and cluster configuration.<\/li>\n<li><strong>Why it matters<\/strong>: \u201cIt works on one node\u201d is a major HPC failure mode.<\/li>\n<li><strong>Practical benefit<\/strong>: Predictable behavior across compute fleet.<\/li>\n<li><strong>Caveats<\/strong>: For custom stacks, you may need to build your own images or use configuration management; verify supported customization mechanisms.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Integration with Alibaba Cloud monitoring\/auditing (where applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Enables operational visibility via Alibaba Cloud\u2019s monitoring\/audit ecosystem.<\/li>\n<li><strong>Why it matters<\/strong>: You need node health, job health signals, and audit trails in production.<\/li>\n<li><strong>Practical benefit<\/strong>: Alerting on node failures, capacity issues, and abnormal usage.<\/li>\n<li><strong>Caveats<\/strong>: Exact metrics\/log integrations can vary\u2014<strong>verify in official docs<\/strong>. 
You may need to install agents.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Tagging and resource governance (via underlying services)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Apply tags to ECS\/NAS and related resources for cost allocation and governance.<\/li>\n<li><strong>Why it matters<\/strong>: HPC costs are often shared across teams\/projects.<\/li>\n<li><strong>Practical benefit<\/strong>: Chargeback\/showback and inventory management.<\/li>\n<li><strong>Caveats<\/strong>: Ensure tagging policies are enforced consistently; E-HPC may not automatically tag all underlying resources unless configured\u2014verify behavior.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level service architecture<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level, Elastic High Performance Computing (E-HPC) orchestrates:\n1. <strong>Control plane actions<\/strong> (via Alibaba Cloud console\/API): cluster creation, node scaling, configuration.\n2. <strong>Data plane<\/strong>: ECS instances in your VPC (manager\/login\/compute) and mounted storage.\n3. <strong>Scheduler control<\/strong>: job submission from login node to scheduler; scheduler dispatches tasks to compute nodes.\n4. 
<strong>Storage paths<\/strong>: shared filesystem for input\/output; optional OSS for datasets\/archives.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow (typical)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>An administrator uses the <strong>Alibaba Cloud Console<\/strong> (or OpenAPI\/CLI) to create an E-HPC cluster in a region and VPC.<\/li>\n<li>E-HPC provisions ECS instances and configures scheduler services on the manager node.<\/li>\n<li>Users SSH to the login node (or manager node depending on topology).<\/li>\n<li>Users place input data on shared storage (NAS) or stage from OSS.<\/li>\n<li>Users submit jobs to the scheduler (<code>sbatch<\/code>\/<code>srun<\/code> for Slurm, or equivalent).<\/li>\n<li>Scheduler allocates resources and runs tasks on compute nodes.<\/li>\n<li>Outputs are written to shared storage; optional archiving to OSS.<\/li>\n<li>Monitoring\/auditing captures logs\/metrics (CloudMonitor, ActionTrail, log services\u2014verify setup).<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services (common)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>ECS<\/strong>: compute nodes and manager\/login nodes.<\/li>\n<li><strong>VPC + Security Groups<\/strong>: network isolation and traffic control.<\/li>\n<li><strong>NAS<\/strong>: shared POSIX-like storage for jobs.<\/li>\n<li><strong>OSS<\/strong>: data staging and archival.<\/li>\n<li><strong>RAM<\/strong>: permissions to create\/modify clusters and underlying resources.<\/li>\n<li><strong>CloudMonitor \/ SLS \/ ActionTrail<\/strong>: operations, logging, and auditing (verify exact integration path).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">E-HPC depends on the availability and quotas of:\n&#8211; ECS instance families in your region\/zone\n&#8211; VPC\/vSwitch IP capacity\n&#8211; NAS mount targets and throughput capabilities\n&#8211; EIP\/NAT gateways (if 
you need internet access)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane<\/strong>: Alibaba Cloud RAM users\/roles and policies determine who can create\/manage clusters.<\/li>\n<li><strong>Data plane<\/strong>: SSH key pairs\/password policies, OS-level user accounts, and security groups determine access to nodes.<\/li>\n<li><strong>Service-to-service<\/strong>: Some workflows may use instance roles (RAM roles) to access OSS without embedding keys (recommended; verify supported patterns for your environment).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cluster nodes reside in your <strong>VPC<\/strong> (recommended: no public IPs on compute nodes).<\/li>\n<li>SSH access typically goes to a <strong>bastion<\/strong> or a dedicated login node with restricted inbound rules.<\/li>\n<li>Shared storage mounts (NAS) occur over private network endpoints.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use node-level monitoring (CPU\/mem\/disk\/network) plus scheduler-level signals (queue depth, job states).<\/li>\n<li>Centralize logs (OS logs, scheduler logs) for troubleshooting and compliance.<\/li>\n<li>Use tags and naming to support cost allocation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[User\/Engineer] --&gt;|SSH| L[\"Login Node (ECS)\"]\n  L --&gt;|submit job| S[Scheduler on Manager Node]\n  S --&gt; C1[\"Compute Node(s) (ECS)\"]\n  C1 --&gt;|read\/write| NAS[(NAS Shared Storage)]\n  L --&gt;|\"stage data (optional)\"| OSS[(OSS Bucket)]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart 
TB\n  subgraph VPC[\"Alibaba Cloud VPC (Private)\"]\n    subgraph SubnetA[\"vSwitch \/ Subnet (HPC)\"]\n      M[\"Manager Node (ECS)\\nScheduler Controller\"]\n      L[\"Login Node (ECS)\\nUser Access + Submission\"]\n      CN[\"Compute Nodes (ECS Fleet)\"]\n    end\n\n    NAS[(Apsara File Storage NAS\\nShared FS Mount Target)]\n    CM[\"CloudMonitor \/ Metrics\\n(verify integration)\"]\n    LOG[\"Central Logs (SLS)\\n(verify integration)\"]\n  end\n\n  subgraph Edge[\"Controlled Access\"]\n    B[\"Bastion Host or VPN\\n(recommended)\"]\n    NAT[\"NAT Gateway (optional)\"]\n  end\n\n  OSS[(OSS Bucket\\nData Staging\/Archive)]\n\n  User[Users \/ CI] --&gt;|VPN\/SSH via bastion| B --&gt; L\n  L --&gt; M\n  M --&gt; CN\n  CN --&gt; NAS\n  L --&gt; NAS\n  CN --&gt;|optional stage| OSS\n  L --&gt;|optional stage| OSS\n  M --&gt; CM\n  CN --&gt; CM\n  M --&gt; LOG\n  CN --&gt; LOG\n  NAT -. optional outbound .-&gt; OSS\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account and billing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An <strong>Alibaba Cloud account<\/strong> with billing enabled.<\/li>\n<li>Permission to create and pay for ECS, VPC, NAS, and optionally EIP\/NAT and OSS resources.<\/li>\n<li>If your organization uses a finance\/approval workflow, confirm that resource creation in the chosen region is allowed.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM (RAM)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You typically need permissions for:\n&#8211; <strong>E-HPC<\/strong> cluster management (create\/update\/delete clusters).\n&#8211; <strong>ECS<\/strong> instance lifecycle (create\/attach\/detach\/delete).\n&#8211; <strong>VPC<\/strong> and <strong>Security Group<\/strong> management.\n&#8211; <strong>NAS<\/strong> file system and mount target creation (if used).\n&#8211; <strong>OSS<\/strong> bucket access (if used).<\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\">If you are in a regulated environment, use least privilege:\n&#8211; Admins: provisioning privileges.\n&#8211; Users: cluster login only; no ability to create\/modify infrastructure.<\/p>\n\n\n\n<blockquote>\n<p>Exact RAM policy actions vary. Use Alibaba Cloud RAM policy references for E-HPC and related services and <strong>verify in official docs<\/strong>:\n&#8211; RAM overview: https:\/\/www.alibabacloud.com\/help\/en\/ram\/<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Tools<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alibaba Cloud Console access.<\/li>\n<li>SSH client:<\/li>\n<li>macOS\/Linux: OpenSSH<\/li>\n<li>Windows: PowerShell OpenSSH or PuTTY<\/li>\n<li>(Optional) Alibaba Cloud CLI (<code>aliyun<\/code>) if you plan to automate via OpenAPI:<\/li>\n<li>CLI overview: https:\/\/www.alibabacloud.com\/help\/en\/alibaba-cloud-cli\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>E-HPC is not necessarily available in every region and may have region-specific instance family availability (especially HPC\/GPU instance families). 
<strong>Verify in the E-HPC product page\/docs and your target region.<\/strong><\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits to check before the lab<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>ECS quotas: vCPU count, instance count, security group rules.<\/li>\n<li>VPC\/vSwitch IP capacity for the number of nodes.<\/li>\n<li>NAS mount target limits.<\/li>\n<li>Public IP\/EIP quotas if you plan to expose a login node (not recommended for production).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>ECS<\/strong>, <strong>VPC<\/strong>, and <strong>Security Groups<\/strong> will be used by almost every E-HPC cluster.<\/li>\n<li><strong>NAS<\/strong> is strongly recommended for shared storage; OSS is optional for staging and archival.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Elastic High Performance Computing (E-HPC) pricing is primarily driven by the <strong>underlying infrastructure<\/strong> it provisions and operates.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (typical)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>ECS instances<\/strong>\n   &#8211; Manager\/login nodes (always-on while the cluster exists)\n   &#8211; Compute nodes (can be scaled up\/down)\n   &#8211; Instance family choice (compute-optimized, memory-optimized, GPU, etc.)\n   &#8211; Billing method (Pay-As-You-Go vs Subscription), region-dependent<\/li>\n<li><strong>Storage<\/strong>\n   &#8211; NAS capacity and performance tier (pricing varies by NAS type and region)\n   &#8211; OSS storage, requests, and data retrieval (if used)<\/li>\n<li><strong>Networking<\/strong>\n   &#8211; VPC is typically not charged directly, but <strong>NAT Gateway<\/strong>, <strong>EIP<\/strong>, and <strong>internet data transfer<\/strong> can add cost\n   &#8211; 
Cross-zone traffic may have implications (verify your region\u2019s rules)<\/li>\n<li><strong>Snapshots \/ Images<\/strong>\n   &#8211; ECS snapshots or custom images used for cluster consistency<\/li>\n<li><strong>Monitoring\/logging<\/strong>\n   &#8211; CloudMonitor is often included for basic metrics, but advanced features and Log Service (SLS) ingestion\/storage may cost extra (verify your configuration and region)<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Is there a free tier?<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>E-HPC as a control\/orchestration service may not have a separate \u201cfree tier\u201d; many Alibaba Cloud services charge mainly for resources you create. <strong>Verify E-HPC billing in the official pricing or product documentation<\/strong>, and treat the major spend as ECS + storage + network.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers (what most affects your bill)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Number and size of compute nodes (vCPU\/GPU hours)<\/li>\n<li>Whether nodes are always on vs scaled down when idle<\/li>\n<li>Storage performance tier and size (NAS)<\/li>\n<li>Outbound internet traffic (especially if moving large datasets)<\/li>\n<li>Idle manager\/login nodes left running<\/li>\n<li>Using public IPs \/ NAT gateways unnecessarily<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Keeping a cluster \u201caround\u201d<\/strong> for convenience: manager\/login nodes and storage persist.<\/li>\n<li><strong>Data egress<\/strong> to the public internet or other regions\/clouds.<\/li>\n<li><strong>Build and staging<\/strong>: CI pipelines that copy datasets frequently.<\/li>\n<li><strong>Over-provisioned NAS<\/strong>: paying for large capacity or performance tier for long periods.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Keep data and compute in the <strong>same region<\/strong> when possible.<\/li>\n<li>Prefer private connectivity (VPC endpoints, private mounts) and avoid internet egress.<\/li>\n<li>If collaborating with on-prem, consider VPN\/Express Connect patterns\u2014costs depend on bandwidth and connectivity services (verify).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Pay-As-You-Go for burst campaigns; delete clusters after campaigns.<\/li>\n<li>Minimize always-on nodes (keep only one small login\/manager node if acceptable).<\/li>\n<li>Use multiple partitions\/queues so small jobs don\u2019t force large node allocation.<\/li>\n<li>Use OSS for archival and NAS for active working sets.<\/li>\n<li>Tag resources for chargeback and enforce budget alerts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Official pricing resources<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alibaba Cloud Pricing Calculator: https:\/\/www.alibabacloud.com\/pricing\/calculator<\/li>\n<li>E-HPC product page (for current positioning and any billing notes): https:\/\/www.alibabacloud.com\/product\/ehpc<\/li>\n<li>ECS pricing: https:\/\/www.alibabacloud.com\/product\/ecs<\/li>\n<li>NAS pricing\/product: https:\/\/www.alibabacloud.com\/product\/nas<\/li>\n<li>OSS pricing\/product: https:\/\/www.alibabacloud.com\/product\/oss<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>Exact prices vary by <strong>region<\/strong>, <strong>instance family<\/strong>, <strong>billing method<\/strong>, and <strong>storage tier<\/strong>. 
Do not rely on static numbers; use the calculator and your region\u2019s pricing pages.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A minimal learning lab typically includes:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>1 small ECS instance (manager\/login combined)<\/li>\n<li>1 small ECS instance (single compute node)<\/li>\n<li>A small NAS file system for shared <code>\/home<\/code> or <code>\/shared<\/code> (optional but recommended)<\/li>\n<li>No EIP (use a VPN\/bastion, or a temporary EIP on the login node only)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">To estimate:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Use the pricing calculator to price <strong>2 ECS Pay-As-You-Go instances<\/strong> for a few hours.<\/li>\n<li>Add NAS capacity for the number of GB you plan to keep during the lab.<\/li>\n<li>Add any EIP\/NAT data transfer if used.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For production HPC, cost planning should include:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Always-on login nodes and at least one manager\/controller node<\/li>\n<li>Compute nodes scaled by queue demand (possibly hundreds\/thousands)<\/li>\n<li>NAS sized for active working sets; OSS for archives<\/li>\n<li>Monitoring\/logging retention<\/li>\n<li>Backup strategy (snapshots, OSS versioning, etc.)<\/li>\n<li>Network egress constraints and data locality planning<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This lab creates a small E-HPC cluster and runs a simple scheduled job. 
The goal is to learn the end-to-end workflow: provision \u2192 connect \u2192 submit job \u2192 observe \u2192 clean up.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Provision an Elastic High Performance Computing (E-HPC) cluster in Alibaba Cloud and run a simple multi-node job (or a single-node job if you choose a 1-compute-node cluster).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will:\n1. Create (or choose) a VPC and vSwitch.\n2. Create a small E-HPC cluster with a scheduler (commonly Slurm).\n3. SSH into the login node.\n4. Submit a batch job that runs <code>hostname<\/code> across allocated nodes.\n5. Validate job output and scheduler status.\n6. Delete the cluster to stop costs.<\/p>\n\n\n\n<blockquote>\n<p>Notes before you start:\n&#8211; Console screens and options can change. Use the official E-HPC \u201cGet Started\u201d guide for your region if any label differs.<br\/>\n&#8211; If your organization restricts public IPs, use a bastion host or VPN to access the login node.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Choose a region and check quotas<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Pick an Alibaba Cloud <strong>region<\/strong> where E-HPC is available and where the ECS instance family you want exists.<\/li>\n<li>Check you have quota for at least:\n   &#8211; 2 ECS instances (1 login\/manager + 1 compute)<br\/>\n   &#8211; 1 security group<br\/>\n   &#8211; Enough vSwitch IPs for 2 instances  <\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You confirm the region and quotas support the lab.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; In the ECS console, confirm your instance quotas and current usage (exact location varies; verify in console).\n&#8211; Confirm the region shows E-HPC in the product list.<\/p>\n\n\n\n<hr 
class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create (or select) a VPC and vSwitch<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you already have a suitable VPC, you can reuse it.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open the Alibaba Cloud Console \u2192 <strong>VPC<\/strong>.<\/li>\n<li>Create a VPC (example):\n   &#8211; IPv4 CIDR: <code>10.0.0.0\/16<\/code><\/li>\n<li>Create a vSwitch in one zone (example):\n   &#8211; vSwitch CIDR: <code>10.0.1.0\/24<\/code><\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> A VPC and vSwitch exist in your chosen region\/zone.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; VPC list shows the new VPC.\n&#8211; vSwitch list shows the vSwitch in the chosen zone.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Prepare access (SSH key pair recommended)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the ECS console, create or select an <strong>SSH key pair<\/strong>.<\/li>\n<li>Store the private key securely on your workstation.<\/li>\n<li>Decide your access pattern:\n   &#8211; <strong>Recommended for production<\/strong>: VPN + private login node, or a bastion host.\n   &#8211; <strong>For this lab<\/strong>: You may temporarily use an EIP on the login node only (avoid public IPs on compute nodes).<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You have an SSH key pair ready.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; The key pair appears in ECS \u2192 Key Pairs.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create an E-HPC cluster<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open Alibaba Cloud Console \u2192 <strong>Elastic High Performance Computing (E-HPC)<\/strong>.<\/li>\n<li>Select <strong>Create 
Cluster<\/strong>.<\/li>\n<li>Choose basic settings:\n   &#8211; Region: your chosen region\n   &#8211; VPC\/vSwitch: the ones from Step 2\n   &#8211; Scheduler: choose the available scheduler (commonly <strong>Slurm<\/strong>; if multiple are available, pick the default\/recommended option and <strong>verify in official docs<\/strong>)<\/li>\n<li>Choose node settings (minimal lab):\n   &#8211; Manager\/login node: 1 instance\n   &#8211; Compute nodes: 1 instance (or 2 if budget allows)\n   &#8211; Instance type: pick a low-cost general-purpose type that is supported in your region<\/li>\n<li>Storage:\n   &#8211; Enable shared storage (NAS) if the wizard offers it (recommended)<\/li>\n<li>\n<p>Access:\n   &#8211; Assign your SSH key pair to the login node\n   &#8211; If the wizard offers public access options, restrict by IP allowlist and use a strong policy (or avoid public IPs)<\/p>\n<\/li>\n<li>\n<p>Create the cluster and wait until status is <strong>Running<\/strong> (or equivalent).<\/p>\n<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> The cluster is created and shows a healthy\/running status.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; E-HPC cluster list shows cluster state as Running.\n&#8211; ECS console shows the manager\/login and compute nodes created in the specified VPC\/vSwitch.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Connect to the login node with SSH<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">How you obtain the login endpoint depends on whether you attached an EIP or use VPN\/bastion.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In E-HPC cluster details, locate the <strong>login node<\/strong> or connection info.<\/li>\n<li>SSH from your workstation:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">ssh -i \/path\/to\/your-key.pem 
&lt;username&gt;@&lt;login-node-ip-or-hostname&gt;\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The username depends on the OS image (often <code>root<\/code> or a distro-specific default). Use the value shown by the wizard or ECS instance details. <strong>Do not guess<\/strong>\u2014verify in your cluster\/instance settings.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You have a shell on the login node.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; Run:<\/p>\n\n\n\n<pre><code class=\"language-bash\">hostname\nwhoami\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You should see the node hostname and your username.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Confirm scheduler commands and node visibility<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If your cluster uses Slurm (common), try:<\/p>\n\n\n\n<pre><code class=\"language-bash\">sinfo\nsqueue\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If <code>sinfo<\/code> is not found, your cluster might use another scheduler or the PATH differs. Check the E-HPC docs for the selected scheduler and verify which commands apply.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You can see at least one partition\/queue and one compute node.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; <code>sinfo<\/code> shows node states like <code>idle<\/code> (or similar).\n&#8211; <code>squeue<\/code> shows no jobs (empty) in a new cluster.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Create and submit a simple batch job<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Create a shared working directory (location depends on your cluster mounts). 
Common paths include <code>\/shared<\/code> or your home directory.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Create a working folder and a script:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">mkdir -p ~\/ehpc-lab\ncd ~\/ehpc-lab\n\ncat &gt; hostnames.sbatch &lt;&lt;'EOF'\n#!\/bin\/bash\n#SBATCH --job-name=ehpc-hostnames\n#SBATCH --output=hostnames-%j.out\n#SBATCH --error=hostnames-%j.err\n#SBATCH --nodes=1\n#SBATCH --ntasks=1\n#SBATCH --time=00:02:00\n\necho \"Job started on: $(date)\"\necho \"Running on host: $(hostname)\"\necho \"Done.\"\nEOF\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>Submit the job:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">sbatch hostnames.sbatch\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"3\">\n<li>Watch the queue:<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-bash\">squeue\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> The job transitions from <code>PENDING<\/code> to <code>RUNNING<\/code> to <code>COMPLETED<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong>\n&#8211; After completion, view output:<\/p>\n\n\n\n<pre><code class=\"language-bash\">ls -l\ncat hostnames-*.out\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You should see the hostname and timestamps.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8 (Optional): Run a multi-node \u201chostname fan-out\u201d job<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">If you created <strong>2+ compute nodes<\/strong>, you can run a multi-node job.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Update the script to request multiple nodes (example for 2 nodes):<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; multinode-hostnames.sbatch &lt;&lt;'EOF'\n#!\/bin\/bash\n#SBATCH --job-name=ehpc-multinode\n#SBATCH --output=multinode-%j.out\n#SBATCH --error=multinode-%j.err\n#SBATCH 
--nodes=2\n#SBATCH --ntasks=2\n#SBATCH --time=00:02:00\n\necho \"Allocated nodes:\"\nscontrol show hostnames \"$SLURM_JOB_NODELIST\"\n\necho \"Task hostnames:\"\nsrun -n 2 hostname\nEOF\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Submit:<\/p>\n\n\n\n<pre><code class=\"language-bash\">sbatch multinode-hostnames.sbatch\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> The output shows two different hostnames (one per task), assuming two nodes are available and allocated.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">cat multinode-*.out\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use this checklist:\n&#8211; Cluster state is Running in E-HPC console.\n&#8211; You can SSH to the login node.\n&#8211; Scheduler commands show nodes available.\n&#8211; Jobs can be submitted and complete successfully.\n&#8211; Output files are created and readable.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Problem: Cannot SSH to the login node<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Common causes and fixes:\n&#8211; <strong>Security group inbound rules<\/strong>: Ensure port 22 is open <em>only<\/em> from your IP (or from bastion\/VPN subnet).\n&#8211; <strong>Wrong username<\/strong>: Check the image default login user shown in ECS\/E-HPC settings.\n&#8211; <strong>Key pair mismatch<\/strong>: Ensure the correct private key is used and that the key pair is attached to the login node.\n&#8211; <strong>No public route<\/strong>: If using EIP, verify EIP association and routing. 
If using VPN\/bastion, verify connectivity to the private IP.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Problem: Job stays in PENDING<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">Common causes:\n&#8211; <strong>No compute nodes<\/strong> or nodes are not in <code>idle<\/code> state (<code>sinfo<\/code>).\n&#8211; <strong>Insufficient resources<\/strong>: job requests more nodes\/cores than available.\n&#8211; <strong>Partition\/queue mismatch<\/strong>: your scheduler might require specifying a partition\/queue. Check <code>sinfo<\/code> output and add <code>#SBATCH -p &lt;partition&gt;<\/code> if needed.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Problem: <code>sinfo<\/code> \/ <code>sbatch<\/code> not found<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The selected scheduler might not be Slurm, or the environment PATH is not configured as expected.<\/li>\n<li>Confirm the scheduler type in the cluster settings and consult the official E-HPC documentation for that scheduler.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Problem: Shared storage path not found<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Verify NAS was enabled and mounted.<\/li>\n<li>Check mount points:<\/li>\n<\/ul>\n\n\n\n<pre><code class=\"language-bash\">mount | grep -E 'nfs|nas'\ndf -h\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">If no shared filesystem is mounted, use your home directory for the lab and revisit cluster storage options.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To avoid ongoing charges:\n1. In the E-HPC console, <strong>Delete the cluster<\/strong>.\n2. Verify associated ECS instances are terminated.\n3. Verify shared storage resources:\n   &#8211; NAS filesystem (if it was created for the cluster)<br\/>\n   &#8211; EIP\/NAT gateways (if created)<br\/>\n4. 
Verify OSS buckets (if created) and remove test data if needed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> No compute instances remain running and ongoing storage\/network charges are minimized.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Separate login and manager roles<\/strong> for production, especially for multi-tenant clusters.<\/li>\n<li>Keep compute nodes and shared storage in the <strong>same region<\/strong> and preferably same zone for predictable performance (verify your region\u2019s best practice).<\/li>\n<li>Use <strong>NAS for POSIX shared working directories<\/strong> and <strong>OSS for archival\/staging<\/strong>.<\/li>\n<li>Design partitions\/queues aligned to workload types (small\/large, CPU\/GPU, short\/long).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>RAM least privilege<\/strong>: separate roles for cluster admins vs cluster users.<\/li>\n<li>Prefer <strong>SSH key-based auth<\/strong>; disable password SSH where feasible.<\/li>\n<li>Use <strong>RAM roles for ECS<\/strong> to access OSS (avoid long-lived AccessKey secrets on nodes) where supported by your workflow.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Minimize always-on nodes; keep manager\/login nodes small if feasible.<\/li>\n<li>Scale down compute nodes aggressively after work completes.<\/li>\n<li>Use budgets, alerts, and tags for cost governance.<\/li>\n<li>Avoid internet egress for large datasets; keep data near compute.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Select ECS instance families appropriate 
for HPC (compute-optimized, memory-optimized, GPU, enhanced networking) <strong>based on region availability<\/strong>.<\/li>\n<li>Keep tightly coupled jobs on nodes with consistent performance characteristics.<\/li>\n<li>Avoid oversubscribing shared storage. Plan NAS throughput and directory layouts (scratch vs home).<\/li>\n<li>Use job submission best practices (request realistic resources, avoid single job monopolizing all nodes).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Store critical outputs in durable storage (NAS + snapshots, or OSS).<\/li>\n<li>Use checkpointing for long runs; store checkpoints on shared storage.<\/li>\n<li>Have a documented \u201ccluster rebuild\u201d playbook so you can recreate from images\/templates quickly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardize naming: cluster name, partitions, node naming, tag schemas.<\/li>\n<li>Centralize logs and retain scheduler logs for incident response.<\/li>\n<li>Monitor:<\/li>\n<li>node health (CPU\/memory\/disk\/network)<\/li>\n<li>scheduler queue depth and job failure rate<\/li>\n<li>storage capacity and throughput<\/li>\n<li>Maintain runbooks for common incidents (node down, jobs pending, storage mount failures).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Tag resources with:<\/li>\n<li><code>Project<\/code>, <code>Environment<\/code>, <code>Owner<\/code>, <code>CostCenter<\/code>, <code>DataClassification<\/code><\/li>\n<li>Use consistent cluster naming:<\/li>\n<li><code>ehpc-&lt;team&gt;-&lt;env&gt;-&lt;region&gt;-&lt;purpose&gt;<\/code><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane (E-HPC console\/API)<\/strong>: governed by <strong>RAM<\/strong> policies.<\/li>\n<li><strong>Data plane (cluster nodes)<\/strong>: governed by OS user accounts, SSH keys, and security groups.<\/li>\n<li>Maintain strict separation:<\/li>\n<li>Infra admins can create\/modify clusters.<\/li>\n<li>HPC users can SSH and submit jobs but cannot change infrastructure.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>In transit<\/strong>:<\/li>\n<li>SSH encrypts admin\/user access.<\/li>\n<li>NAS traffic encryption depends on NAS and mount options\u2014<strong>verify in official docs<\/strong> for your NAS type and supported encryption-in-transit.<\/li>\n<li><strong>At rest<\/strong>:<\/li>\n<li>OSS supports server-side encryption options; verify SSE-KMS\/SSE-OSS for your bucket.<\/li>\n<li>ECS disks can be encrypted (encryption features vary by disk type and region; verify).<\/li>\n<li>For sensitive datasets, define where encryption is mandatory (NAS, OSS, snapshots, backups).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid public IPs on compute nodes.<\/li>\n<li>Restrict SSH inbound:<\/li>\n<li>Prefer VPN\/bastion.<\/li>\n<li>If using EIP on login node, allowlist corporate IPs only.<\/li>\n<li>Use separate security groups for:<\/li>\n<li>login node (limited inbound)<\/li>\n<li>compute nodes (no inbound from internet)<\/li>\n<li>storage endpoints (private only)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid embedding AccessKey secrets on nodes.<\/li>\n<li>Prefer instance RAM roles and short-lived credentials where possible.<\/li>\n<li>Store secrets in a managed secret store if your organization 
uses one (Alibaba Cloud options exist; verify your chosen service and integration).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable <strong>ActionTrail<\/strong> (if available in your account\/region) for auditing control plane actions (verify): https:\/\/www.alibabacloud.com\/help\/en\/actiontrail\/<\/li>\n<li>Retain scheduler logs and SSH logs for investigation.<\/li>\n<li>Consider central log collection for production (SLS; verify): https:\/\/www.alibabacloud.com\/product\/log-service<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: keep clusters and storage in approved regions.<\/li>\n<li>Access controls: document who can access datasets and cluster nodes.<\/li>\n<li>Retention: define how long job outputs and logs are retained.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Exposing compute nodes directly to the internet.<\/li>\n<li>Using shared SSH keys across teams without rotation.<\/li>\n<li>Storing long-lived AccessKeys on the filesystem.<\/li>\n<li>Using overly permissive security groups (<code>0.0.0.0\/0<\/code> SSH).<\/li>\n<li>No audit trail for cluster changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use a bastion\/VPN for access.<\/li>\n<li>Use least-privilege RAM roles.<\/li>\n<li>Encrypt disks\/snapshots where required.<\/li>\n<li>Apply patch management to base images and rebuild clusters regularly.<\/li>\n<li>Separate environments (dev\/test\/prod) by VPC and RAM policies.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. Limitations and Gotchas<\/h2>\n\n\n\n<blockquote>\n<p>Treat these as planning reminders. 
Exact limits depend on your region, ECS instance families, NAS type, and E-HPC capabilities. Verify in official docs and account quota pages.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Known limitations \/ common constraints<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Regional availability<\/strong>: Not all regions support E-HPC or the same scheduler\/instance families.<\/li>\n<li><strong>Instance family constraints<\/strong>: HPC-optimized or GPU instances may have limited stock or require approvals.<\/li>\n<li><strong>Quota bottlenecks<\/strong>:<\/li>\n<li>vCPU quotas<\/li>\n<li>vSwitch IP exhaustion when scaling large clusters<\/li>\n<li>NAS mount target limits<\/li>\n<li><strong>Single point of access<\/strong>: A single login node can become a bottleneck; consider scaling login nodes or using bastion patterns (verify supported designs).<\/li>\n<li><strong>Storage performance mismatch<\/strong>: NAS performance tier may not meet scratch needs; consider separating scratch and home\/project directories and selecting appropriate storage.<\/li>\n<li><strong>Data locality<\/strong>: Moving large datasets into\/out of the region is slow and costly.<\/li>\n<li><strong>Scheduler learning curve<\/strong>: Users must learn job submission and resource requests; mis-specified jobs can sit pending or waste resources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing surprises<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Leaving manager\/login nodes running continuously.<\/li>\n<li>Paying for NAT Gateway and outbound traffic while staging data repeatedly.<\/li>\n<li>Over-provisioning NAS capacity\/performance for long periods.<\/li>\n<li>Large log retention in centralized logging.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compatibility issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Some HPC applications assume specific kernel\/driver\/MPI stacks; you may need custom images and careful validation.<\/li>\n<li>MPI 
performance depends heavily on instance type and networking; benchmark before committing.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational gotchas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Security group changes can break cluster internal communications.<\/li>\n<li>OS updates can change scheduler behavior; apply changes in a controlled manner.<\/li>\n<li>If compute nodes are frequently recycled, ensure your software stack and mounts are idempotent.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Migration challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>On-prem HPC environments often include:<\/li>\n<li>directory services (LDAP)<\/li>\n<li>proprietary schedulers<\/li>\n<li>bespoke filesystems<\/li>\n<li>license servers<\/li>\n<li>Plan migration as an architecture project (identity, licensing, data sync, and reproducibility).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Vendor-specific nuances<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alibaba Cloud service limits and naming differ across regions and accounts.<\/li>\n<li>Some automation requires OpenAPI usage; API versions can change\u2014verify current OpenAPI docs before scripting.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. 
Comparison with Alternatives<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Within Alibaba Cloud (nearest options)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Self-managed HPC on ECS<\/strong>: You build Slurm\/PBS yourself with Terraform\/Ansible.<\/li>\n<li><strong>ACK (Alibaba Cloud Container Service for Kubernetes)<\/strong>: Great for containerized services and batch (with Kubernetes batch tooling), but not always a drop-in replacement for classic HPC schedulers.<\/li>\n<li><strong>BatchCompute (if encountered)<\/strong>: Alibaba Cloud had batch compute offerings historically; <strong>verify current product availability and suitability<\/strong> if you see it referenced.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Other clouds (nearest equivalents)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>AWS ParallelCluster<\/strong>: HPC cluster provisioning with Slurm on AWS.<\/li>\n<li><strong>Azure CycleCloud<\/strong>: HPC cluster orchestration on Azure.<\/li>\n<li><strong>Google Cloud HPC Toolkit \/ Cluster Toolkit<\/strong>: HPC blueprints on GCP.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Open-source\/self-managed alternatives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Slurm + NFS\/parallel filesystem + custom autoscaling scripts.<\/li>\n<li>Kubernetes + batch controllers (for container-native HPC-like workloads), but scheduler semantics differ.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Alibaba Cloud Elastic High Performance Computing (E-HPC)<\/strong><\/td>\n<td>HPC clusters on Alibaba Cloud with managed provisioning<\/td>\n<td>Faster cluster setup, integrated with ECS\/VPC\/NAS, standardized workflow<\/td>\n<td>Still requires HPC ops knowledge; feature availability varies by 
region<\/td>\n<td>You want managed cluster provisioning and scheduler-based HPC on Alibaba Cloud<\/td>\n<\/tr>\n<tr>\n<td>Self-managed Slurm\/PBS on ECS<\/td>\n<td>Maximum customization<\/td>\n<td>Full control over scheduler\/config, exact parity with on-prem possible<\/td>\n<td>Higher ops burden; longer setup; more risk<\/td>\n<td>You need deep customization or exact scheduler\/version parity<\/td>\n<\/tr>\n<tr>\n<td>Alibaba Cloud ACK (Kubernetes)<\/td>\n<td>Containerized apps, microservices, cloud-native batch<\/td>\n<td>Strong ecosystem, portability for containers<\/td>\n<td>Not the same as classic HPC scheduling; MPI\/tightly coupled needs careful design<\/td>\n<td>Your workloads are container-native or you want unified platform for services + batch<\/td>\n<\/tr>\n<tr>\n<td>AWS ParallelCluster<\/td>\n<td>HPC on AWS<\/td>\n<td>Mature HPC patterns, AWS ecosystem<\/td>\n<td>Different cloud\/provider; migration effort<\/td>\n<td>Your organization runs primarily on AWS<\/td>\n<\/tr>\n<tr>\n<td>Azure CycleCloud<\/td>\n<td>HPC on Azure<\/td>\n<td>Strong enterprise integration on Azure<\/td>\n<td>Different cloud\/provider; migration effort<\/td>\n<td>Your organization runs primarily on Azure<\/td>\n<\/tr>\n<tr>\n<td>Google Cloud HPC Toolkit<\/td>\n<td>HPC on GCP<\/td>\n<td>Infrastructure blueprints and automation<\/td>\n<td>Different cloud\/provider; migration effort<\/td>\n<td>Your organization runs primarily on GCP<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: CAE simulation platform for manufacturing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A manufacturing company runs CAE simulations (CFD + structural) that spike during design cycles. 
The on-prem cluster is saturated and procurement is slow.<\/li>\n<li><strong>Proposed architecture<\/strong>:\n<ul class=\"wp-block-list\">\n<li>E-HPC cluster in a dedicated VPC<\/li>\n<li>Separate login node behind VPN\/bastion<\/li>\n<li>NAS shared filesystem for active project data and solver scratch (tier selected based on benchmark)<\/li>\n<li>OSS bucket for dataset archive and result export<\/li>\n<li>RAM roles for admins and users; ActionTrail for auditing (verify)<\/li>\n<li>CloudMonitor alerts for node health and NAS capacity<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why this service was chosen<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Faster provisioning than self-managed clusters<\/li>\n<li>Elastic scaling for peak simulation campaigns<\/li>\n<li>Integrates with Alibaba Cloud Computing primitives (ECS\/VPC\/NAS)<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Reduced cycle time for simulation campaigns<\/li>\n<li>Better cost control by scaling down off-peak<\/li>\n<li>Improved governance through tags and standardized environments<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: rendering farm for product videos<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A small media team needs burst rendering for product launches but can\u2019t justify permanent render hardware.<\/li>\n<li><strong>Proposed architecture<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Small E-HPC cluster with a login node and elastic compute nodes<\/li>\n<li>NAS for shared assets and render outputs<\/li>\n<li>OSS for source assets and long-term storage of rendered frames<\/li>\n<li>Simple queue policy for nightly rendering<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why this service was chosen<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Quick start without deep HPC cluster buildout<\/li>\n<li>Batch scheduling aligns with frame rendering jobs<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Predictable rendering throughput during launch windows<\/li>\n<li>Lower costs outside launch periods by deleting\/scaling down clusters<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr 
class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">1) <strong>What is Elastic High Performance Computing (E-HPC) used for?<\/strong><br\/>\nIt\u2019s used to run HPC workloads (simulations, EDA, rendering, scientific computing) using a scheduler-managed cluster on Alibaba Cloud Computing infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">2) <strong>Do I pay separately for E-HPC, or only for the resources it creates?<\/strong><br\/>\nIn many managed cluster services, the main cost is underlying ECS\/storage\/network. <strong>Verify E-HPC billing behavior in the official product\/pricing documentation<\/strong>, then plan costs around ECS + NAS + network + logging.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">3) <strong>Which schedulers does E-HPC support?<\/strong><br\/>\nSlurm is commonly used in cloud HPC. E-HPC may support additional schedulers depending on region and product evolution. <strong>Verify supported schedulers and versions in the official E-HPC docs.<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">4) <strong>Can I run MPI jobs on E-HPC?<\/strong><br\/>\nHPC clusters are commonly used for MPI workloads, but performance depends on instance type, networking, and MPI stack. <strong>Verify recommended instance families and MPI guidance<\/strong> in official docs and benchmark.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">5) <strong>Is E-HPC suitable for GPU workloads?<\/strong><br\/>\nIf you choose GPU-capable ECS instances as compute nodes, you can run GPU batch jobs. Availability depends on region and instance families. Verify GPU options and any driver\/image requirements.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">6) <strong>How do users access the cluster?<\/strong><br\/>\nTypically via SSH to a login node (or bastion \u2192 login node). 
Compute nodes should not be internet-exposed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">7) <strong>Can I place the cluster in a private VPC without public IPs?<\/strong><br\/>\nYes, that\u2019s the recommended approach. Use VPN or a bastion host for access, and NAT gateway only if outbound internet is required.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">8) <strong>How do I store large datasets for HPC jobs?<\/strong><br\/>\nUse NAS for shared POSIX-like file access during jobs and OSS for dataset staging\/archival. Keep data in the same region for performance and cost.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">9) <strong>What\u2019s the difference between NAS and OSS for HPC?<\/strong><br\/>\nNAS is a shared file system (POSIX-like) suitable for job I\/O. OSS is object storage suited for durable, cost-effective storage and data distribution, but not a native POSIX filesystem.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">10) <strong>How do I control who can create clusters vs who can submit jobs?<\/strong><br\/>\nUse RAM policies for cluster creation and infrastructure changes. Use OS-level accounts\/SSH and scheduler policies for job submission control.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">11) <strong>How do I prevent runaway costs?<\/strong><br\/>\nUse budgets\/alerts, tag resources, delete clusters when not needed, minimize always-on nodes, and avoid unnecessary NAT\/EIP usage.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">12) <strong>Can I integrate E-HPC with CI\/CD pipelines?<\/strong><br\/>\nYes, typically by SSH-based job submission or OpenAPI\/CLI automation. Verify E-HPC OpenAPI support and authentication best practices before implementing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">13) <strong>How do I monitor jobs and node health?<\/strong><br\/>\nUse scheduler commands for job status, and Alibaba Cloud monitoring for node metrics. 
Consider centralized logs for scheduler and OS logs (verify integration options).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">14) <strong>What happens if a compute node fails during a job?<\/strong><br\/>\nBehavior depends on scheduler configuration and application checkpointing. For long runs, implement checkpoint\/restart and test failure scenarios.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">15) <strong>Is E-HPC a replacement for Kubernetes?<\/strong><br\/>\nNot generally. Kubernetes excels at container orchestration for services and cloud-native batch, while E-HPC targets classic HPC scheduler-based workflows.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">16) <strong>Can I reuse my on-prem HPC software licenses?<\/strong><br\/>\nPossibly, but licensing terms vary. You may need license servers reachable from the VPC and compliant usage tracking. Engage your vendor and verify network\/security design.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">17) <strong>How do I migrate an on-prem Slurm cluster to E-HPC?<\/strong><br\/>\nStart by matching OS\/toolchain, scheduler version (if possible), and filesystem layout. Pilot with a small queue and benchmark; then scale. Expect changes in identity, networking, and storage.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Elastic High Performance Computing (E-HPC)<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>E-HPC Documentation (Alibaba Cloud Help Center) \u2013 https:\/\/www.alibabacloud.com\/help\/en\/ehpc\/<\/td>\n<td>Primary source for current features, supported schedulers, and setup steps<\/td>\n<\/tr>\n<tr>\n<td>Official product page<\/td>\n<td>E-HPC Product Page \u2013 https:\/\/www.alibabacloud.com\/product\/ehpc<\/td>\n<td>High-level overview and links to docs\/pricing notes<\/td>\n<\/tr>\n<tr>\n<td>Pricing calculator<\/td>\n<td>Alibaba Cloud Pricing Calculator \u2013 https:\/\/www.alibabacloud.com\/pricing\/calculator<\/td>\n<td>Build region-accurate estimates without guessing prices<\/td>\n<\/tr>\n<tr>\n<td>Compute reference<\/td>\n<td>Elastic Compute Service (ECS) \u2013 https:\/\/www.alibabacloud.com\/product\/ecs<\/td>\n<td>Instance families, billing, and performance options used by E-HPC nodes<\/td>\n<\/tr>\n<tr>\n<td>Storage reference<\/td>\n<td>Apsara File Storage NAS \u2013 https:\/\/www.alibabacloud.com\/product\/nas<\/td>\n<td>Shared storage fundamentals for HPC clusters<\/td>\n<\/tr>\n<tr>\n<td>Storage reference<\/td>\n<td>Object Storage Service (OSS) \u2013 https:\/\/www.alibabacloud.com\/product\/oss<\/td>\n<td>Data staging\/archival patterns and pricing<\/td>\n<\/tr>\n<tr>\n<td>IAM reference<\/td>\n<td>Resource Access Management (RAM) \u2013 https:\/\/www.alibabacloud.com\/help\/en\/ram\/<\/td>\n<td>Least privilege, roles, and policies for E-HPC operations<\/td>\n<\/tr>\n<tr>\n<td>Audit reference<\/td>\n<td>ActionTrail \u2013 https:\/\/www.alibabacloud.com\/help\/en\/actiontrail\/<\/td>\n<td>Auditing control-plane changes (verify service availability in your region\/account)<\/td>\n<\/tr>\n<tr>\n<td>CLI tooling<\/td>\n<td>Alibaba Cloud CLI \u2013 
https:\/\/www.alibabacloud.com\/help\/en\/alibaba-cloud-cli\/<\/td>\n<td>Automate provisioning\/operations via CLI and OpenAPI calls<\/td>\n<\/tr>\n<tr>\n<td>Logging\/ops<\/td>\n<td>Log Service (SLS) \u2013 https:\/\/www.alibabacloud.com\/product\/log-service<\/td>\n<td>Centralized log collection and retention planning (verify integration approach)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, SREs, platform teams<\/td>\n<td>DevOps foundations, cloud operations, automation concepts applicable to HPC ops<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>SCM, DevOps toolchains, operational practices<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CloudOpsNow.in<\/td>\n<td>Cloud engineers, operations teams<\/td>\n<td>Cloud operations practices, monitoring, governance<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, reliability engineers<\/td>\n<td>Reliability engineering, monitoring, incident response patterns<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams adopting automation<\/td>\n<td>AIOps concepts, operational analytics, automation<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content (verify offerings)<\/td>\n<td>Beginners to intermediate<\/td>\n<td>https:\/\/www.rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training services (verify scope)<\/td>\n<td>DevOps engineers, admins<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps\/ops guidance (verify scope)<\/td>\n<td>Small teams needing practical help<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>Ops\/DevOps support services (verify scope)<\/td>\n<td>Teams needing production support patterns<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company Name<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify services)<\/td>\n<td>Architecture, migration planning, operations setup<\/td>\n<td>HPC environment planning, cost governance setup, automation pipelines<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps consulting &amp; training (verify offerings)<\/td>\n<td>DevOps processes, automation, operational maturity<\/td>\n<td>IaC practices for cluster lifecycle, monitoring\/logging rollout<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify services)<\/td>\n<td>CI\/CD, infrastructure automation, operations<\/td>\n<td>Build repeatable cluster provisioning workflows, security reviews, runbooks<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Linux fundamentals: SSH, users\/groups, permissions, systemd basics<\/li>\n<li>Networking basics: VPC\/subnets, routing, security groups, DNS<\/li>\n<li>Storage basics: POSIX filesystems vs object storage, NFS concepts<\/li>\n<li>Alibaba Cloud foundations: ECS, VPC, RAM, NAS, OSS pricing and quotas<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Scheduler mastery (for your chosen scheduler): partitions\/queues, job arrays, priorities, accounting<\/li>\n<li>Image building and configuration management (Packer\/Ansible\/Terraform-style patterns)<\/li>\n<li>Performance engineering: CPU pinning, NUMA basics, I\/O profiling, benchmarking<\/li>\n<li>Security hardening: bastions, VPN, key rotation, audit logging, secret management<\/li>\n<li>Cost optimization: capacity planning, usage analytics, tagging and chargeback<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cloud\/Platform Engineer (HPC platform)<\/li>\n<li>DevOps Engineer supporting research\/engineering<\/li>\n<li>SRE for compute platforms<\/li>\n<li>Research Computing Engineer \/ HPC Administrator<\/li>\n<li>Solutions Architect designing HPC workloads on cloud<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Alibaba Cloud certification programs evolve. 
Check the official Alibaba Cloud Certification portal for current tracks relevant to Computing, ECS, and architecture.<br\/>\nCertifications landing page (verify current URL): https:\/\/www.alibabacloud.com\/certification<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Build a reproducible E-HPC cluster pattern with:\n<ul class=\"wp-block-list\">\n<li>hardened login node access (bastion\/VPN)<\/li>\n<li>NAS for shared workspace<\/li>\n<li>OSS for datasets and result archive<\/li>\n<\/ul>\n<\/li>\n<li>Implement job submission from CI:\n<ul class=\"wp-block-list\">\n<li>CI pipeline that connects to the login node over SSH, submits a job, and collects the outputs<\/li>\n<\/ul>\n<\/li>\n<li>Benchmark study:\n<ul class=\"wp-block-list\">\n<li>compare 2\u20133 ECS instance families for your code (runtime, cost per job)<\/li>\n<\/ul>\n<\/li>\n<li>Governance project:\n<ul class=\"wp-block-list\">\n<li>tagging + cost allocation + budget alerts for HPC campaigns<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>HPC (High Performance Computing)<\/strong>: Computing focused on running large, compute-intensive workloads using parallelism.<\/li>\n<li><strong>Cluster<\/strong>: A group of machines (nodes) working together, typically managed by a scheduler.<\/li>\n<li><strong>Node<\/strong>: A server\/VM in the cluster (manager\/login\/compute).<\/li>\n<li><strong>Manager node<\/strong>: Runs cluster control services and the scheduler controller.<\/li>\n<li><strong>Login node<\/strong>: Entry point for users to submit jobs and manage files.<\/li>\n<li><strong>Compute node<\/strong>: Executes scheduled workload tasks.<\/li>\n<li><strong>Scheduler<\/strong>: Software that allocates cluster resources to jobs (e.g., Slurm; verify supported schedulers).<\/li>\n<li><strong>Partition\/Queue<\/strong>: A scheduler concept grouping resources and policies for specific job types.<\/li>\n<li><strong>Job<\/strong>: A unit of work submitted to the 
scheduler.<\/li>\n<li><strong>Job array<\/strong>: A set of similar jobs submitted together as one unit.<\/li>\n<li><strong>NAS<\/strong>: A managed shared file storage service suitable for POSIX-like access patterns.<\/li>\n<li><strong>OSS<\/strong>: Object Storage Service, used for durable storage, staging, and archival.<\/li>\n<li><strong>VPC<\/strong>: Virtual Private Cloud, an isolated virtual network in Alibaba Cloud.<\/li>\n<li><strong>vSwitch<\/strong>: Subnet within a VPC.<\/li>\n<li><strong>Security Group<\/strong>: Virtual firewall controlling inbound\/outbound traffic for ECS instances.<\/li>\n<li><strong>EIP<\/strong>: Elastic IP Address, a public IP address that can be associated with an ECS instance.<\/li>\n<li><strong>NAT Gateway<\/strong>: Provides outbound internet access for private subnets without public IPs on instances.<\/li>\n<li><strong>RAM<\/strong>: Resource Access Management, the Alibaba Cloud IAM service.<\/li>\n<li><strong>ActionTrail<\/strong>: Service that records API actions for auditing (verify usage for your account\/region).<\/li>\n<li><strong>CloudMonitor<\/strong>: Alibaba Cloud monitoring service for metrics and alerting.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Elastic High Performance Computing (E-HPC) is Alibaba Cloud\u2019s <strong>Computing<\/strong> service for provisioning and operating scheduler-based HPC clusters on elastic cloud infrastructure. It matters because it shortens the path from \u201cwe need HPC\u201d to \u201cwe can run jobs,\u201d while keeping clusters consistent and easier to scale than fully manual builds.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Alibaba Cloud architectures, E-HPC typically sits on top of <strong>ECS + VPC + NAS<\/strong>, with <strong>OSS<\/strong> commonly used for data staging and archival. 
Cost is driven mainly by <strong>ECS instance hours<\/strong>, <strong>shared storage<\/strong>, and <strong>network egress\/NAT<\/strong>\u2014so deleting idle clusters, minimizing always-on nodes, and keeping data local are key optimizations. Security comes from <strong>VPC isolation<\/strong>, tight <strong>security group rules<\/strong>, and <strong>RAM least privilege<\/strong>, with audit\/logging added for production governance.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use Elastic High Performance Computing (E-HPC) when you need a practical, scheduler-managed HPC environment on Alibaba Cloud. Next step: read the official E-HPC documentation for your scheduler and region, then extend the lab into a production blueprint with hardened access, standardized images, monitoring\/logging, and cost governance.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Computing<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,5],"tags":[],"class_list":["post-21","post","type-post","status-publish","format-standard","hentry","category-alibaba-cloud","category-computing"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/21","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=21"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/21\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=21"}],"wp:term"
:[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=21"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=21"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}