1. Introduction
Elastic GPU Service is Alibaba Cloud’s GPU-accelerated computing offering used to run workloads that need massively parallel processing—most commonly AI/ML training and inference, graphics rendering, and high-performance computing (HPC). In Alibaba Cloud documentation, “Elastic GPU Service” is closely associated with (and in practice delivered through) GPU-accelerated Elastic Compute Service (ECS) instance families.
In simple terms: you create a GPU-enabled virtual machine, connect to it over the network, install (or use a prebuilt) GPU software stack, and run your GPU applications. You pay based on the chosen billing method (for example, pay-as-you-go or subscription), the selected instance type (GPU model/quantity, vCPU, RAM), storage, and network.
Technically, Elastic GPU Service uses ECS infrastructure with attached GPU hardware. It integrates with foundational Alibaba Cloud services such as VPC, Security Groups, Elastic IP (EIP), CloudMonitor, ActionTrail, Resource Access Management (RAM), and storage services like ESSD disks, OSS, and NAS. You get the familiar VM lifecycle (create, stop, start, snapshot, image, scale) while leveraging GPUs for throughput.
The problem it solves: enabling teams to access GPUs on-demand without buying and operating physical GPU servers—while still retaining VM-level control for drivers, runtimes, and performance tuning.
Naming note (verify in official docs): Alibaba Cloud uses “Elastic GPU Service” as a product documentation entry, while GPU compute is provisioned via ECS GPU instance families. Always confirm the latest positioning, supported instance families, and regions in official Alibaba Cloud documentation before standardizing an internal platform design.
2. What is Elastic GPU Service?
Official purpose
Elastic GPU Service provides GPU-accelerated compute capacity on Alibaba Cloud so customers can run workloads that benefit from GPU parallelism—AI, visualization, rendering, video processing, scientific computing, and more.
Core capabilities
- Provision GPU-enabled instances (VMs) with different GPU models and sizes (availability varies by region).
- Run standard OS images (Linux/Windows) and install GPU drivers and frameworks.
- Integrate into VPC networking with security groups, private subnets (vSwitches), and optional public access via EIP.
- Support common ECS operations: scaling via instance replacement, creating custom images, snapshots, monitoring, and automation.
Major components (as used in real deployments)
- ECS GPU instance: the VM that includes one or more physical GPUs (and corresponding vCPU/RAM).
- System disk and data disks: typically ESSD (performance SSD) volumes for OS and datasets.
- VPC, vSwitch, security group: networking and firewall boundaries.
- EIP / NAT Gateway / SLB (optional): controlled ingress/egress and service publishing.
- RAM users/roles/policies: access control and least privilege.
- CloudMonitor + ActionTrail: metrics/alarms and audit trails.
- OSS/NAS (optional): data lake and shared storage for models, artifacts, and datasets.
Service type
- Infrastructure-as-a-Service (IaaS) GPU compute delivered as ECS instances.
Scope (regional/zonal/account)
- Regional service with zonal instance placement (availability depends on region/zone capacity).
- Account-scoped resources under your Alibaba Cloud account; access controlled via RAM.
- Certain attributes (instance type availability, GPU models, quotas) are region/zone dependent.
How it fits into the Alibaba Cloud ecosystem
Elastic GPU Service is a compute building block. Typical ecosystem pairings:
- Compute orchestration: Auto Scaling (to scale stateless inference) and/or ACK (Kubernetes) with GPU nodes (verify GPU support and scheduling in your ACK version/region).
- Data: OSS for object storage; NAS for shared POSIX-like storage.
- Security: RAM, KMS (for secrets/keys), security groups, VPC isolation.
- Operations: CloudMonitor for metrics/alarms; ActionTrail for auditing; Log Service (SLS) for log centralization (verify the exact integration pattern you choose).
3. Why use Elastic GPU Service?
Business reasons
- Faster time-to-value: spin up GPU capacity in minutes rather than procuring hardware.
- Elasticity: match GPU spend to demand (training bursts, rendering deadlines, seasonal inference load).
- Global deployment: place GPU compute nearer users or data (where regions support GPU capacity).
Technical reasons
- GPU acceleration: dramatically higher throughput for parallel workloads.
- VM-level control: choose OS, install drivers, pin framework versions, tune performance.
- Workload fit: supports diverse stacks (CUDA-based workloads, ML frameworks, rendering engines), subject to driver and GPU compatibility.
Operational reasons
- Standard ECS lifecycle: familiar instance operations, disk snapshots, images, monitoring.
- Repeatability: bake golden images with drivers and frameworks; use Infrastructure as Code (IaC) (for example, Terraform—verify provider resources and versions).
- Isolation: per-instance isolation fits teams that require dedicated environments.
Security/compliance reasons
- Network isolation with VPC and security groups.
- Centralized access control via RAM; audit via ActionTrail.
- Encryption options for disks and objects (verify which encryption modes you enable and in which regions).
Scalability/performance reasons
- Scale out (more instances for inference/render farms) or scale up (larger GPU instance types) depending on your application.
- Place instances near data sources (OSS endpoints, NAS) to reduce latency and egress.
When teams should choose it
- You need GPU acceleration and want VM control over drivers and runtime.
- Your workload can be packaged into an image or reproducible bootstrap scripts.
- You want a stepping stone between managed AI platforms and bare metal.
When they should not choose it
- You do not actually need GPUs (many “AI” workloads run fine on CPU; profile first).
- You want a fully managed training/inference platform with minimal infrastructure management—consider Alibaba Cloud AI platform services instead (for example, PAI offerings; verify best-fit product).
- You have strict requirements for a specific GPU model/feature and it is not available in your target region/zone or quota.
4. Where is Elastic GPU Service used?
Industries
- Media & entertainment (rendering, transcoding acceleration)
- Retail and e-commerce (recommendation inference, vision search)
- Manufacturing (computer vision QC, digital twins)
- Healthcare/life sciences (imaging inference, research compute)
- Finance (risk modeling acceleration, NLP inference)
- Education & research (GPU labs, coursework environments)
- Gaming (asset rendering, AI NPC training)
Team types
- ML engineering and data science teams
- Platform engineering teams building shared GPU platforms
- DevOps/SRE teams running GPU-backed services
- Research teams running iterative experiments
- Media pipeline teams operating render farms
Workloads
- Deep learning training (batch jobs)
- Deep learning inference (online services)
- GPU-accelerated ETL or feature processing
- Rendering (frame/scene rendering)
- Simulation and scientific computing
Architectures
- Single-instance experimentation
- Multi-instance distributed jobs (framework-dependent; verify networking requirements)
- Inference microservices behind load balancers
- Batch pipelines that read from OSS and write results back
- Kubernetes clusters with GPU nodes (ACK)
Production vs dev/test usage
- Dev/test: smaller GPU instances, spot/preemptible when acceptable, short-lived experiments, per-branch environments.
- Production: stable instance families, reserved/subscription capacity, multi-zone design where possible, strong monitoring, and well-defined patching and image pipelines.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Elastic GPU Service is commonly used. Availability of GPU models and instance families varies by region/zone—verify in official docs and the console.
1) ML model training on demand
- Problem: Training jobs need GPUs for hours/days, but idle GPUs are expensive.
- Why this fits: Provision GPU instances only during training windows; use OSS/NAS for datasets.
- Example: A team launches a GPU VM nightly to retrain a demand forecast model, then shuts it down.
2) Real-time inference API for computer vision
- Problem: CPU inference latency is too high for image classification/detection.
- Why this fits: GPU instances reduce inference latency and increase throughput.
- Example: A retail app uses GPU instances behind a load balancer to classify product images in real time.
3) Batch inference at scale
- Problem: Running inference on millions of images/videos takes too long on CPU.
- Why this fits: GPU batch processing + OSS data lake improves throughput.
- Example: A media company runs nightly GPU batch inference to tag scenes for search.
4) Rendering farm (animation/3D)
- Problem: Rendering frames locally is too slow; deadlines require parallel rendering.
- Why this fits: Scale out GPU instances for a render burst, then release them.
- Example: A studio launches 50 GPU instances for a weekend render push and tears them down Monday.
5) GPU-accelerated video processing
- Problem: Video transcoding and filters are compute-heavy.
- Why this fits: GPU acceleration can increase throughput per node (framework/codec dependent).
- Example: A streaming platform uses GPU instances to accelerate transcoding pipelines.
6) Interactive data science workstation
- Problem: Data scientists need a consistent GPU environment without local setup pain.
- Why this fits: A GPU VM with preinstalled drivers and Jupyter stack provides a reproducible workstation.
- Example: A team standardizes on a golden GPU image and gives each scientist an isolated VM.
7) NLP inference for chat/semantic search
- Problem: Large transformer inference is slow on CPU and expensive at scale.
- Why this fits: GPUs improve token throughput (model/framework dependent).
- Example: A SaaS company serves embeddings and reranking models on GPU VMs.
8) Scientific simulation / HPC kernels
- Problem: Numerical kernels run for days on CPU.
- Why this fits: GPU acceleration reduces runtime for compatible codes.
- Example: A research lab runs GPU-enabled simulations, storing outputs in OSS.
9) Game development asset pipelines
- Problem: Texture/asset generation and validation pipelines need acceleration.
- Why this fits: GPU VMs can run pipeline tooling and scale with CI workloads.
- Example: CI triggers GPU-based asset validation jobs during release cycles.
10) Proof-of-concept for GPU migration
- Problem: On-prem GPU apps need a safe cloud POC before migration.
- Why this fits: VM-level control mirrors on-prem patterns; lift-and-shift is straightforward.
- Example: A company clones an on-prem inference stack to a GPU ECS instance to validate performance.
6. Core Features
Note: Exact feature names and availability can vary by region and by ECS instance family. Confirm in the Elastic GPU Service and ECS documentation.
GPU-accelerated ECS instances
- What it does: Provides ECS instance types that include one or more GPUs.
- Why it matters: GPU hardware enables parallel computation for ML and graphics.
- Practical benefit: Faster training/inference or rendering compared to CPU-only instances.
- Caveats: GPU models and counts vary; not all regions/zones have the same capacity.
Multiple billing methods (via ECS)
- What it does: Typically supports pay-as-you-go and subscription; some regions may support preemptible/spot (verify).
- Why it matters: Aligns cost with workload patterns.
- Practical benefit: Use subscription for steady inference; pay-as-you-go for bursty experiments.
- Caveats: Spot/preemptible instances can be reclaimed; design for interruption.
Image-based provisioning and automation
- What it does: Use public images, custom images, or snapshots to standardize GPU environments.
- Why it matters: GPU stacks are sensitive to driver/CUDA/framework version compatibility.
- Practical benefit: Golden images reduce “works on my machine” issues.
- Caveats: Driver updates can break ABI compatibility—pin versions and test.
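Version pinning can be checked mechanically at image-build time or in CI. A minimal sketch, assuming you can collect installed versions into a dict; the component names and version strings below are illustrative placeholders:

```python
# Compare installed component versions against a pinned manifest for a golden
# image. Names and versions are placeholders, not recommended values.
def check_pins(installed: dict, pinned: dict) -> list:
    """Return human-readable mismatches between installed and pinned versions."""
    problems = []
    for component, want in pinned.items():
        have = installed.get(component)
        if have is None:
            problems.append(f"{component}: missing (want {want})")
        elif have != want:
            problems.append(f"{component}: {have} != pinned {want}")
    return problems

if __name__ == "__main__":
    pinned = {"nvidia-driver": "535.161.08", "cuda": "12.2", "torch": "2.2.1"}
    installed = {"nvidia-driver": "535.161.08", "cuda": "12.2", "torch": "2.3.0"}
    print(check_pins(installed, pinned))
```

A check like this run at instance boot fails fast when someone launches from a drifted image, instead of surfacing as a cryptic CUDA error mid-training.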
VPC networking and Security Groups
- What it does: Place GPU instances into private networks; control inbound/outbound with security group rules.
- Why it matters: Many GPU workloads handle sensitive datasets and models.
- Practical benefit: Minimize public exposure; use bastion or VPN for admin access.
- Caveats: Misconfigured security groups are a common source of exposure.
Elastic storage options (ECS disks, OSS, NAS)
- What it does: Attach high-performance system/data disks; integrate with OSS/NAS for datasets and artifacts.
- Why it matters: ML pipelines are often data-bound, not compute-bound.
- Practical benefit: Keep datasets in OSS, cache locally on ESSD for hot reads.
- Caveats: Data transfer and storage requests can add cost; plan caching strategy.
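The OSS-to-local-disk caching pattern can be as simple as a download-if-missing wrapper. A hedged sketch: `fetch` is a stand-in for your real download call (ossutil or an SDK), not an actual OSS API:

```python
import os

# Cache remote objects (for example, from OSS) on a local ESSD scratch path so
# repeated reads during training hit the disk, not the object store.
def cached_path(key: str, cache_dir: str, fetch) -> str:
    """Return a local path for `key`, downloading via `fetch` only on a miss."""
    local = os.path.join(cache_dir, key.replace("/", "_"))
    if not os.path.exists(local):
        os.makedirs(cache_dir, exist_ok=True)
        with open(local, "wb") as f:
            f.write(fetch(key))  # download once, reuse afterwards
    return local
```

The first call per key pays the OSS request and transfer cost; later calls (subsequent epochs, restarted jobs on the same disk) are local reads.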
Monitoring and auditing (CloudMonitor, ActionTrail)
- What it does: Collect instance metrics, set alarms, and track API actions.
- Why it matters: GPUs are expensive; you need visibility and governance.
- Practical benefit: Alert on idle instances, high costs, and suspicious operations.
- Caveats: GPU-level metrics may require in-guest tooling (for example, via nvidia-smi and an exporter); verify what CloudMonitor provides natively for your instance family.
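If you do export GPU metrics yourself, the usual source is nvidia-smi's CSV query mode. A minimal parsing sketch, assuming standard nvidia-smi query flags; the exporter side is omitted and the sample output is illustrative:

```python
# Parse the output of:
#   nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total \
#              --format=csv,noheader,nounits
# into per-GPU records suitable for pushing to a metrics backend.
def parse_gpu_stats(csv_text: str) -> list:
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        stats.append({
            "gpu": int(index),
            "util_pct": float(util),
            "mem_used_mib": float(mem_used),
            "mem_total_mib": float(mem_total),
        })
    return stats

if __name__ == "__main__":
    # Illustrative sample output for a two-GPU instance.
    sample = "0, 87, 10240, 16384\n1, 3, 512, 16384"
    for rec in parse_gpu_stats(sample):
        print(rec)
```

In practice you would run the query on a timer (for example, from an exporter agent) and forward the records to CloudMonitor custom metrics or your own backend.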
Integration with container and orchestration platforms (workload-dependent)
- What it does: GPU ECS instances can act as nodes for containerized workloads (for example, Kubernetes via ACK).
- Why it matters: Many modern ML inference stacks run in containers.
- Practical benefit: Standard deployment, rolling updates, autoscaling patterns.
- Caveats: GPU scheduling requires correct device plugins and runtime configuration; validate with your ACK version and documentation.
7. Architecture and How It Works
High-level architecture
At a high level, Elastic GPU Service workloads run on GPU-enabled ECS instances within your VPC:
- Control plane: You create and manage instances using Alibaba Cloud console/API/CLI. IAM is enforced via RAM.
- Network plane: Instances connect through VPC and security groups. Public access is typically through an EIP (or via NAT/Bastion/VPN).
- Data plane: Workloads read datasets and models from OSS/NAS and write results back. Local disks provide low-latency scratch space.
- Observability: CloudMonitor collects ECS metrics; additional agents/exporters can push GPU metrics and logs to Log Service (implementation-dependent).
Request/data/control flow
- Provisioning: User/API → ECS/Elastic GPU Service → allocate host with GPU → attach disks/network → boot OS.
- Runtime:
- App requests → (optional) SLB/API Gateway → GPU inference service on ECS.
- Data reads → OSS/NAS → cached to local disk → processed on GPU → outputs stored to OSS/DB.
- Operations:
- Metrics → CloudMonitor; logs → Log Service (if configured).
- Audit events → ActionTrail.
Integrations with related services (common patterns)
- VPC + Security Groups: network segmentation and micro-perimeter.
- EIP / NAT Gateway: controlled outbound internet for package installs; controlled inbound admin access.
- OSS: dataset/model artifact storage and sharing.
- NAS: shared filesystem for multi-instance workloads (throughput/latency characteristics vary; verify).
- ACK (Kubernetes): GPU nodes for containerized inference/training (verify support and GPU scheduling details).
- RAM: least privilege, MFA, and role-based access.
- KMS: encrypt secrets and data keys (verify service integration options).
- CloudMonitor + ActionTrail: monitoring, alerting, auditing.
Dependency services
- ECS (primary compute resource manager)
- VPC, vSwitch, Security Groups
- Storage (cloud disks; optional OSS/NAS)
- IAM (RAM)
- Monitoring/audit (CloudMonitor, ActionTrail)
Security/authentication model
- Human/API access: RAM users, RAM roles, policies; use MFA and least privilege.
- Instance-to-service access: use instance RAM roles (where supported) to access OSS/other APIs without long-lived keys (verify your pattern in official docs).
- Network security: security groups and private subnets; optionally bastion host.
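As an illustration of the instance-role pattern, an application on the instance fetches short-lived STS credentials from the metadata service instead of storing static access keys. The endpoint address and response field names below follow the documented ECS metadata layout, but verify them against current official docs before relying on them:

```python
import json

# ECS instance metadata base for instance RAM role credentials
# (verify the path and fields in official Alibaba Cloud documentation).
METADATA_BASE = "http://100.100.100.200/latest/meta-data/ram/security-credentials/"

def parse_sts(body: str) -> dict:
    """Extract the short-lived credential triplet from a metadata response."""
    doc = json.loads(body)
    return {
        "access_key_id": doc["AccessKeyId"],
        "access_key_secret": doc["AccessKeySecret"],
        "security_token": doc["SecurityToken"],
        "expiration": doc["Expiration"],
    }

if __name__ == "__main__":
    # In practice you would GET METADATA_BASE + <role-name> from inside the
    # instance; a sample body stands in here.
    sample = ('{"AccessKeyId": "STS.example", "AccessKeySecret": "secret", '
              '"SecurityToken": "token", "Expiration": "2030-01-01T00:00:00Z"}')
    print(parse_sts(sample)["access_key_id"])
```

Credentials obtained this way expire automatically, so nothing long-lived is written to disk or baked into images.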
Networking model
- Instances run in a VPC and vSwitch (subnet).
- Ingress/egress governed by security group rules.
- Public internet access typically via:
- EIP bound to the instance, or
- NAT Gateway for outbound-only access, with bastion/VPN for admin.
Monitoring/logging/governance considerations
- Baseline: CPU/memory/disk/network metrics in CloudMonitor.
- Add-on: GPU utilization/temperature metrics via in-guest exporters (implementation-dependent).
- Centralize logs to Log Service for retention and querying (optional).
- Use ActionTrail for auditing instance create/modify/delete operations.
Simple architecture diagram (Mermaid)
flowchart LR
User[Engineer / Data Scientist] -->|Console/API| ECS["ECS: Elastic GPU Service Instance"]
ECS --> VPC[VPC + vSwitch]
ECS -->|Read/Write| OSS["Object Storage Service (OSS)"]
ECS -->|Metrics| CM[CloudMonitor]
ECS -->|Audit events| AT[ActionTrail]
User -->|SSH via EIP| EIP[Elastic IP]
EIP --> ECS
Production-style architecture diagram (Mermaid)
flowchart TB
subgraph VPC1["VPC (Production)"]
subgraph ZoneA[Zone A]
SLB[Server Load Balancer] --> INF1["GPU ECS Inference #1"]
SLB --> INF2["GPU ECS Inference #2"]
end
subgraph ZoneB[Zone B]
SLB --> INF3["GPU ECS Inference #3"]
end
INF1 --> NAS["NAS (shared model cache)"]
INF2 --> NAS
INF3 --> NAS
INF1 --> OSS["OSS (datasets/models/artifacts)"]
INF2 --> OSS
INF3 --> OSS
INF1 --> SLS["Log Service (SLS)"]
INF2 --> SLS
INF3 --> SLS
end
Users[Clients] -->|HTTPS| SLB
Sec[RAM + KMS + Security Groups] -.govern.-> VPC1
CM[CloudMonitor Alarms] --> Ops[Ops/SRE On-call]
AT[ActionTrail] --> SecOps[Security Review / Audit]
8. Prerequisites
Account and billing
- An active Alibaba Cloud account with a valid billing method.
- Billing enabled for ECS and related services (VPC is typically free; EIP, disks, OSS requests/storage can incur cost—verify).
Permissions / IAM (RAM)
Minimum recommended:
- Ability to create/manage ECS instances, VPC resources, security groups, disks, and EIPs.
- Read access to monitoring/audit services.
- If using OSS/NAS, permissions to access those resources.
Practical guidance:
- Use RAM users for humans, not root account credentials.
- Use least-privilege policies; scope by region/resource groups where possible (verify RAM policy capabilities in your account).
Tools (optional but recommended)
- SSH client:
  - macOS/Linux: the built-in ssh command
  - Windows: Windows Terminal with OpenSSH, or PuTTY
- (Optional) Alibaba Cloud CLI:
- Alibaba Cloud CLI documentation (verify current install steps): https://www.alibabacloud.com/help
- (Optional) Docker for containerized GPU apps (installed on the instance).
Region availability
- GPU instance families are not available in every region/zone and may have capacity constraints.
- Confirm:
- Supported regions and zones
- Available GPU instance families and GPU models
- Whether spot/preemptible is available
Use the official Elastic GPU Service and ECS instance type pages for confirmation (see Resources section).
Quotas/limits
Common quota considerations (verify exact limits in your account/region):
- vCPU and instance quotas per region
- GPU instance quotas per region/zone
- EIP quota
- Disk quota and snapshot quota
Prerequisite services
- VPC and vSwitch (subnet)
- Security Group
- Optional: EIP (for SSH), OSS/NAS (for data), Log Service (for logs)
9. Pricing / Cost
Elastic GPU Service cost is primarily the cost of GPU-enabled ECS instances, plus associated storage and networking.
Do not rely on static numbers in articles—GPU pricing is region- and instance-family dependent and changes over time. Use official pricing pages and calculators.
Pricing dimensions (typical)
- Instance type (major driver)
  - GPU model and count
  - vCPU and memory size
  - Instance family generation
- Billing method
  - Pay-as-you-go (hourly/second-level granularity depends on ECS rules—verify)
  - Subscription (reserved capacity for a term)
  - Preemptible/spot (if available; interruptible)
- Storage
  - System disk and data disks (ESSD categories and size)
  - Snapshots (snapshot storage and API usage)
- Network
  - EIP (public IP) charges
  - Internet outbound bandwidth charges (billing model depends on EIP settings)
  - Cross-region traffic (if any)
- Data services
  - OSS storage (GB-month), requests, and data transfer
  - NAS capacity/throughput billing model (verify)
Free tier
- GPU instances are typically not part of free tiers. Verify Alibaba Cloud promotions/free trials for your account/region.
Cost drivers (what surprises people)
- Idle GPU instances: the most common and expensive mistake.
- Always-on EIPs: public IP resources accrue cost even when you’re not actively SSH’d.
- Large disks: oversized ESSD volumes or snapshots kept forever.
- OSS request costs: repeated small reads/writes during training can add up.
- Data egress: moving datasets out of Alibaba Cloud can be costly.
Network/data transfer implications
- Keeping data in-region (OSS + ECS in same region) typically reduces latency and avoids cross-region transfer charges.
- Pulling large container images and packages over the internet increases bandwidth usage; consider local mirrors or image caching.
How to optimize cost
- Prefer short-lived GPU instances for experiments; shut down and release when done.
- Use custom images with preinstalled drivers/frameworks to reduce bootstrapping time.
- Store datasets in OSS and cache on local disks only when needed.
- Evaluate spot/preemptible for fault-tolerant training and batch inference (design for interruption).
- For steady production inference, compare subscription vs pay-as-you-go break-even (use the calculator).
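The break-even comparison is simple arithmetic once you have real prices. A sketch with placeholder numbers; pull actual rates from the official pricing calculator for your region and instance type:

```python
# Break-even between pay-as-you-go and subscription for one GPU instance.
# The rates used below are illustrative placeholders, not real prices.
def breakeven_hours(hourly_rate: float, monthly_subscription: float) -> float:
    """Hours of use per month above which subscription is cheaper."""
    return monthly_subscription / hourly_rate

if __name__ == "__main__":
    hours = breakeven_hours(hourly_rate=2.50, monthly_subscription=1200.0)
    # ~730 hours in an average month
    print(f"Subscription wins above {hours:.0f} hours/month "
          f"({hours / 730:.0%} utilization)")
```

If your measured utilization sits well below the break-even fraction, stay on pay-as-you-go (or evaluate spot); above it, price out a subscription term.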
Example low-cost starter estimate (no fabricated prices)
A “starter lab” cost typically includes:
- 1 small GPU ECS instance (pay-as-you-go) for 1–2 hours
- 40–100 GB system disk (ESSD)
- Minimal EIP bandwidth for SSH and package installs
Because GPU SKUs and EIP billing vary by region, calculate using:
- ECS pricing page / calculator (see official resources)
- The console’s “Buy” page cost estimate for the chosen instance type
Example production cost considerations
For a production inference service, model your monthly cost around:
- N GPU instances (often multiple for HA and rolling deployments)
- SLB (if used), EIP/NAT Gateway
- Disk snapshots, image storage, logs (SLS), OSS model storage
- Headroom for scaling during peak
A practical approach:
- Start with load testing to measure requests/sec per GPU instance.
- Convert peak QPS into an instance count with a utilization target (for example, 60–70% GPU utilization).
- Then compare pay-as-you-go vs subscription pricing.
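The QPS-to-instance-count conversion can be captured in a few lines. All numbers below are placeholders from a hypothetical load test:

```python
import math

# Convert measured per-instance throughput and a utilization target into an
# instance count for peak load. Inputs are illustrative placeholders.
def instances_for_peak(peak_qps: float, qps_per_instance: float,
                       target_utilization: float = 0.65,
                       min_instances: int = 2) -> int:
    """min_instances >= 2 keeps headroom for rolling deploys and zone loss."""
    needed = math.ceil(peak_qps / (qps_per_instance * target_utilization))
    return max(needed, min_instances)

if __name__ == "__main__":
    # e.g. load test measured 120 req/s per GPU instance, peak traffic 900 req/s
    print(instances_for_peak(peak_qps=900, qps_per_instance=120))
```

Feed the resulting count into both the pay-as-you-go and subscription price models to see which billing method wins at your traffic profile.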
Official pricing sources (use these)
- ECS pricing: https://www.alibabacloud.com/product/ecs (navigate to pricing)
- Alibaba Cloud Pricing Calculator (verify current URL in official site navigation): https://www.alibabacloud.com/pricing
If you find a dedicated Elastic GPU Service pricing page for your region, prefer that over secondary sources.
10. Step-by-Step Hands-On Tutorial
This lab creates a GPU-enabled ECS instance (Elastic GPU Service), verifies the GPU is accessible, runs a simple GPU validation, and then cleans up to avoid ongoing cost.
Objective
Provision a low-cost, short-lived Elastic GPU Service instance on Alibaba Cloud, install/verify the NVIDIA driver stack (or confirm it’s already present), and validate GPU compute availability with nvidia-smi and a small CUDA/container test.
Lab Overview
You will:
1. Create networking (VPC, vSwitch) and a security group.
2. Launch a GPU ECS instance and connect over SSH.
3. Install the NVIDIA driver (if required) and validate GPU visibility.
4. Optionally run a container-based GPU workload.
5. Set up basic monitoring, then clean up all resources.
Important: GPU availability is region/zone dependent. If you cannot find a GPU instance type in your zone, choose a different zone/region.
Step 1: Create a VPC and vSwitch (subnet)
Console path (typical):
1. Go to the Alibaba Cloud Console.
2. Navigate to VPC.
3. Create a VPC:
– IPv4 CIDR example: 10.0.0.0/16
4. Create a vSwitch in a specific zone:
– CIDR example: 10.0.1.0/24
Expected outcome – You have a VPC and one vSwitch ready for instance placement.
Verification – In the VPC console, confirm the VPC and vSwitch show “Available”.
Step 2: Create a Security Group with minimal inbound rules
Create a security group in the same region/VPC.
Inbound rules (recommended for the lab)
- SSH (TCP 22) from your public IP only (preferred)
- If you cannot restrict to one IP (for example, dynamic IP), use a temporary broader range and tighten later.
Optional (only if you serve a web app in this lab) – HTTP/HTTPS from specific sources
Expected outcome – A security group that allows you to SSH into the instance.
Verification – Confirm the inbound rule exists and is scoped properly.
Step 3: Create an SSH key pair (recommended)
Console path (typical): – ECS → Network & Security → Key Pairs (naming may vary)
Create a key pair and download the private key file (.pem).
Expected outcome – You have a key pair available to attach to the instance.
Verification – Confirm the key pair is listed in your region.
Step 4: Launch a GPU ECS instance (Elastic GPU Service)
Console path (typical): – ECS → Instances → Create Instance
Choose:
- Region/Zone: pick the zone where GPU instances are available.
- Instance type: choose any GPU-accelerated instance type shown in the wizard. The console may show GPU family names and specs; select a smaller option for cost.
- Image: Ubuntu LTS is a good default for driver installation. If the console offers a GPU-optimized image with drivers, you can use it; verify what it includes in the image description.
- System disk: ESSD, 40–100 GB for the lab.
- Network: select your VPC/vSwitch.
- Security group: select the one you created.
- Login credential: choose your key pair.
- Public IP:
  - Option A: allocate an EIP and bind it to the instance (common for labs).
  - Option B: enable a public IPv4 address if offered in the wizard (region-dependent).
  - Prefer an EIP so you can release it explicitly during cleanup.
Create the instance.
Expected outcome – A running GPU ECS instance.
Verification
- ECS console shows instance state as Running.
- You can see the public IP (EIP) address assigned.
Step 5: SSH to the instance
From your local machine (Linux/macOS), set permissions and connect:
chmod 600 ~/Downloads/your-key.pem
ssh -i ~/Downloads/your-key.pem ubuntu@<EIP_or_Public_IP>
If you used a different image, the default username might differ (for example, root). Check the console connection instructions.
Expected outcome – You are logged into the instance shell.
Verification
uname -a
Step 6: Check whether GPU is visible (and whether drivers are installed)
Run:
lspci | grep -i nvidia || true
nvidia-smi || true
Interpretation:
– If lspci shows an NVIDIA device but nvidia-smi fails, drivers may not be installed.
– If nvidia-smi works, drivers are installed and the GPU is visible.
Expected outcome – You confirm whether GPU hardware is present and whether drivers are installed.
Step 7: Install NVIDIA drivers (Ubuntu example)
Driver installation is the most variable part because it depends on GPU model, kernel version, and image contents. Follow Alibaba Cloud’s official GPU driver guidance for your instance family whenever possible. If your image already includes drivers, skip this step.
Update packages:
sudo apt-get update
sudo apt-get -y install ubuntu-drivers-common
List recommended drivers:
ubuntu-drivers devices
Install a recommended driver (example command; choose the recommended one shown on your VM):
sudo ubuntu-drivers autoinstall
Reboot:
sudo reboot
Reconnect via SSH and run:
nvidia-smi
Expected outcome
– nvidia-smi outputs GPU model, driver version, and current GPU utilization.
Verification
– nvidia-smi returns exit code 0 and prints a GPU table.
Step 8 (Optional): Run a GPU smoke test using a container
This is a practical way to validate that:
- The driver is functional
- The runtime can access the GPU
You need Docker installed. Install Docker (Ubuntu):
sudo apt-get -y install docker.io
sudo usermod -aG docker $USER
newgrp docker
Now, GPU containers require the NVIDIA container runtime. The exact install steps vary; follow NVIDIA’s official instructions and verify compatibility with your driver and OS: – NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
After installing NVIDIA container runtime, test with a CUDA base image (image tags vary; pick a current one compatible with your setup):
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
Expected outcome
– The container prints nvidia-smi output from inside the container.
Verification – The container sees the GPU(s) and prints the same (or similar) output as the host.
If pulling images is slow/expensive, consider stopping at Step 7. Container image downloads can increase outbound bandwidth cost.
Step 9: (Optional) Basic monitoring checks
In the console:
– Open CloudMonitor → ECS monitoring.
– Confirm CPU/network metrics are visible.
– For GPU utilization, rely on nvidia-smi for the lab unless you have an established GPU exporter pipeline.
Expected outcome – You can see baseline instance metrics and confirm the instance is healthy.
Validation
Run these checks:
# GPU visibility
nvidia-smi
# Kernel/driver sanity
lsmod | grep -i nvidia || true
# Disk and memory checks
df -h
free -h
What “good” looks like:
– nvidia-smi shows at least one GPU and no fatal errors.
– Disk has free space.
– No unusual system load at idle.
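The manual checks above can also be scripted for repeatable validation. A small sketch using only the standard library; nvidia-smi presence is probed rather than assumed:

```python
import shutil
import subprocess

# Scriptable versions of the manual health checks: GPU visibility via
# nvidia-smi's exit code, and free disk space via the standard library.
def gpu_visible() -> bool:
    try:
        return subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0
    except FileNotFoundError:
        return False  # driver/tooling not installed on this host

def disk_free_fraction(path: str = "/") -> float:
    """Fraction of the filesystem at `path` that is still free (0.0 - 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

if __name__ == "__main__":
    print(f"GPU visible: {gpu_visible()}")
    print(f"Free disk on /: {disk_free_fraction():.0%}")
```

Wiring a script like this into the boot sequence (or CI for golden images) catches broken driver installs before a workload lands on the instance.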
Troubleshooting
Common issues and realistic fixes:
- No GPU instance types available in your zone
  - Cause: capacity or region limitations.
  - Fix: choose a different zone/region; request quota/capacity via Alibaba Cloud support if needed.
- nvidia-smi: command not found
  - Cause: drivers not installed (or PATH not set).
  - Fix: install drivers (Step 7) or use a GPU-optimized image (verify its contents).
- nvidia-smi fails after driver install
  - Possible causes: mismatched driver/kernel, secure boot constraints (less common on cloud), incomplete install.
  - Fix:
    - Re-check the recommended driver version via ubuntu-drivers devices
    - Ensure you rebooted after installation
    - Review /var/log/syslog and dmesg | grep -i nvidia
    - Prefer Alibaba Cloud’s recommended driver/CUDA guidance for that instance family
- SSH connection timeout
  - Cause: security group missing port 22, wrong source IP, or no public route.
  - Fix: verify the inbound rule, verify EIP binding, and confirm the instance has public connectivity.
- Docker GPU test fails: “could not select device driver”
  - Cause: NVIDIA container runtime not installed/configured.
  - Fix: install the NVIDIA Container Toolkit and configure the Docker runtime per NVIDIA docs.
Cleanup
To avoid ongoing charges, delete resources in this order:
- Terminate/release the ECS instance – ECS → Instances → Select instance → Release
- Release EIP (if you allocated one) – VPC → EIPs → Release
- Delete unused disks/snapshots – ECS → Disks / Snapshots (ensure nothing remains billable)
- Delete security group (optional)
- Delete vSwitch and VPC (optional, if created only for this lab)
Expected outcome – No GPU instances, no EIPs, and no unattached billable disks remain.
11. Best Practices
Architecture best practices
- Separate training and inference environments:
- Training: bursty, batch, interruption-tolerant (spot where acceptable).
- Inference: stable, HA, scaled behind load balancers.
- Keep data close to compute:
- Same region for OSS/NAS and GPU instances.
- Use immutable images:
- Build golden images with pinned driver/CUDA/framework versions.
- Design for scale-out:
- Prefer horizontal scaling for inference when possible; keep instances stateless and load models from shared storage.
IAM/security best practices
- Use RAM users with MFA; avoid using root credentials.
- Use RAM roles for instances to access OSS or other APIs without embedding long-lived keys (verify exact feature availability and configuration).
- Scope permissions by resource group, region, and tags.
Cost best practices
- Implement “idle shutdown” automation for dev/test GPU instances.
- Use budgets/alerts:
- CloudMonitor alarms for running instances beyond expected windows
- Billing center budgets (verify feature availability in your account)
- Right-size:
- Track GPU utilization; if consistently low, move to a smaller GPU SKU or CPU inference.
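The idle-shutdown pattern above can be sketched as a small decision function. The utilization source and the actual stop call are assumptions: in practice you would feed it samples parsed from `nvidia-smi` (or an in-guest agent) and call the ECS stop-instance API when it fires. The threshold and window defaults are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IdleTracker:
    """Decides when a GPU instance has been idle long enough to stop."""
    threshold_pct: float = 5.0   # below this utilization, a sample counts as idle
    idle_minutes: int = 30       # stop after this many consecutive idle samples
    samples: List[float] = field(default_factory=list)

    def record(self, gpu_util_pct: float) -> bool:
        """Record one per-minute sample; return True when it is time to stop."""
        if gpu_util_pct < self.threshold_pct:
            self.samples.append(gpu_util_pct)
        else:
            self.samples.clear()  # any busy sample resets the idle window
        return len(self.samples) >= self.idle_minutes

tracker = IdleTracker(idle_minutes=3)
readings = [80.0, 2.0, 1.0, 0.0]  # e.g. from `nvidia-smi --query-gpu=utilization.gpu`
decisions = [tracker.record(r) for r in readings]
# the third consecutive idle sample triggers shutdown
```

The same utilization history also supports right-sizing: if the tracker rarely clears, the instance is a candidate for a smaller SKU.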
Performance best practices
- Match GPU model to workload:
- Inference vs training, FP16/BF16 support, memory requirements (verify GPU capabilities).
- Avoid I/O bottlenecks:
- Use local ESSD as cache/scratch for training.
- Use OSS multipart downloads or prefetching where appropriate.
- Pin software versions:
- Driver ↔ CUDA ↔ framework compatibility is critical.
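One way to enforce the pinning advice above is to fail an image build when versions drift. The matrix entries below are placeholders, not an authoritative compatibility list; always verify against NVIDIA and framework release notes.

```python
# Hypothetical pinned stack for a golden image; the specific versions are
# illustrative only -- verify against real compatibility matrices.
PINNED = {
    "driver": "535.104.05",
    "cuda": "12.2",
    "pytorch": "2.1.0",
}

def check_stack(found: dict) -> list:
    """Return human-readable mismatches between the pinned stack and an image."""
    return [
        f"{name}: expected {want}, found {found.get(name, 'missing')}"
        for name, want in PINNED.items()
        if found.get(name) != want
    ]

# In a real pipeline, `found` would be parsed from `nvidia-smi`,
# `nvcc --version`, and `torch.__version__` inside the image under test.
problems = check_stack({"driver": "535.104.05", "cuda": "12.2", "pytorch": "2.2.0"})
```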
Reliability best practices
- For production inference:
- Deploy across zones when available.
- Use health checks and rolling updates.
- For batch training:
- Checkpoint frequently to OSS so jobs can resume after interruption.
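The checkpoint-and-resume pattern for interruptible training can be sketched as follows. A local file stands in for OSS here, and `save_every` is an assumed knob; with a real bucket you would upload/download the checkpoint object via the OSS SDK instead.

```python
import json
import os
import tempfile

def train(checkpoint_path: str, total_steps: int = 10, save_every: int = 2) -> int:
    """Run (or resume) a toy training loop, persisting progress so an
    interrupted job can pick up where it left off."""
    step = 0
    if os.path.exists(checkpoint_path):            # resume after interruption
        with open(checkpoint_path) as f:
            step = json.load(f)["step"]
    while step < total_steps:
        step += 1                                  # stand-in for one real training step
        if step % save_every == 0 or step == total_steps:
            with open(checkpoint_path, "w") as f:  # with OSS: upload this file instead
                json.dump({"step": step}, f)
    return step

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
train(ckpt, total_steps=4)             # first run completes steps 1-4 and checkpoints
resumed = train(ckpt, total_steps=10)  # a "restarted" job resumes from step 4
```

The same structure is what makes spot/preemptible instances viable for training: an interruption costs at most `save_every` steps of work.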
Operations best practices
- Standardize:
- naming conventions (project-env-role-index)
- tags (Owner, CostCenter, Environment, DataSensitivity)
- Patch management:
- Update images in CI and roll out via instance replacement.
- Observability:
- Centralize logs; alert on OOM, disk full, high error rates.
- Collect GPU utilization metrics via agents if you operate at scale.
Governance/tagging/naming best practices
- Enforce tags via policy/process:
- Owner
- Application
- Environment
- Cost center
- Keep separate accounts or resource groups for dev/test vs production.
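A tag and naming policy like the one above is often easiest to enforce as a pre-provisioning check. The required tag keys and the name pattern mirror the conventions suggested in this section (project-env-role-index) but are otherwise assumptions; adapt them to your own standards.

```python
import re

REQUIRED_TAGS = {"Owner", "Application", "Environment", "CostCenter"}
# project-env-role-index, e.g. "search-prod-inference-01"
NAME_PATTERN = re.compile(r"^[a-z0-9]+-(dev|test|staging|prod)-[a-z0-9]+-\d{2}$")

def validate_instance(name: str, tags: dict) -> list:
    """Return a list of policy violations for a proposed instance."""
    issues = []
    if not NAME_PATTERN.match(name):
        issues.append(f"name {name!r} does not match project-env-role-index")
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        issues.append(f"missing required tags: {sorted(missing)}")
    return issues

ok = validate_instance(
    "search-prod-inference-01",
    {"Owner": "ml-team", "Application": "search", "Environment": "prod", "CostCenter": "42"},
)
bad = validate_instance("MyGPUBox", {"Owner": "ml-team"})
```

Wired into a provisioning pipeline (or run periodically against the live inventory), this keeps untagged GPU spend visible and attributable.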
12. Security Considerations
Identity and access model
- Use RAM for:
- Human access (RAM users, SSO if available)
- Programmatic access (RAM roles, access keys with rotation policies)
- Prefer instance roles over embedding credentials in code.
Encryption
- At rest:
- Enable disk encryption where supported/required (verify ECS disk encryption options and constraints).
- Encrypt OSS buckets/objects as required (server-side encryption options—verify).
- In transit:
- Use SSH for admin, TLS for APIs, HTTPS endpoints for OSS access.
Network exposure
- Avoid public SSH whenever possible:
- Use a bastion host, VPN, or private connectivity (best practice).
- If you must use EIP:
- Restrict security group inbound rules to your IP.
- Consider changing SSH port only as a minor hardening measure—real security comes from IP restriction and keys.
Secrets handling
- Do not store access keys in code or in custom images.
- Use environment injection from a secrets store if available (for example, KMS-based patterns; verify Alibaba Cloud-native secrets solutions you adopt).
- Rotate credentials and audit usage.
Audit/logging
- Enable ActionTrail and retain logs according to your policy.
- Collect OS logs and application logs to a centralized service (SLS) for incident response (implementation-dependent).
Compliance considerations
- Data residency: choose regions aligned with data requirements.
- Access logging: ensure administrative actions are traceable.
- Least privilege: restrict who can create GPU instances (cost and data risk).
Common security mistakes
- Opening SSH (22) to `0.0.0.0/0`
- Leaving EIPs attached to instances permanently
- Using shared SSH keys across the organization
- Running workloads as root inside the VM without controls
- Copying datasets to local disks and forgetting to wipe/decommission properly
Secure deployment recommendations
- Private subnets + NAT Gateway for outbound-only traffic.
- Bastion host with strong authentication and session recording (if required).
- Golden images with pre-hardened baseline and CIS-like settings (where applicable).
- Use separate resource groups/accounts for different sensitivity levels.
13. Limitations and Gotchas
Confirm exact values and availability in official docs; limits vary by region, instance family, and account quotas.
- Regional/zone capacity constraints: GPU capacity can be scarce. Plan procurement early for production.
- Quota limitations: GPU instance quotas may be low by default and require increases.
- Driver/framework compatibility: CUDA, cuDNN, TensorRT, and ML frameworks have strict version compatibility.
- Spot/preemptible interruptions: great for cost but requires checkpointing and fault tolerance.
- Data gravity and egress: moving large datasets across regions or out of cloud costs money and time.
- GPU monitoring: Cloud-native metrics may not include detailed GPU utilization; you may need in-guest monitoring.
- Image baking complexity: maintaining a secure, patched GPU image pipeline requires discipline and testing.
- Kubernetes GPU scheduling complexity (if using ACK): device plugin, node labeling/taints, and runtime setup require careful validation.
14. Comparison with Alternatives
Elastic GPU Service is best viewed as “GPU-enabled ECS.” Alternatives include other Alibaba Cloud services that abstract infrastructure, other cloud providers’ GPU VMs, or self-managed on-prem GPU clusters.
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Alibaba Cloud Elastic GPU Service (GPU ECS) | Teams needing VM control for GPU workloads | Full OS control, integrates with VPC/RAM/OSS, flexible stacks | You manage drivers, patching, scaling patterns | You want GPU compute with VM-level flexibility |
| Alibaba Cloud ECS (CPU-only) | Non-GPU workloads, lightweight inference | Lower cost, simpler ops | Not suitable for heavy training/inference/rendering | When profiling shows CPU is sufficient |
| Alibaba Cloud ACK with GPU nodes (uses GPU ECS nodes) | Containerized GPU inference/training | Orchestration, rolling updates, scaling patterns | More operational complexity; GPU scheduling setup | When you standardize on Kubernetes for deployment |
| Alibaba Cloud PAI (managed AI platform offerings) | Managed training/inference pipelines | Higher-level abstractions, potentially less ops | Less low-level control; product fit varies | When you want managed ML workflows over raw VMs |
| AWS EC2 GPU instances | Multi-cloud or AWS-native stacks | Broad ecosystem, mature tooling | Different IAM/networking model; migration effort | When the rest of your platform is on AWS |
| Azure GPU VMs | Azure-native ML/VDI | Tight integration with Azure services | Different tooling and costs | When you’re standardized on Azure |
| Google Cloud GPU VMs | GCP-native ML/data stacks | Integration with GCP data/AI services | Different networking/IAM | When your data platform is on GCP |
| On-prem GPU servers | Fixed high utilization, strict data locality | Full control, potentially lower long-term cost at high utilization | CapEx, capacity planning, ops burden | When GPUs are near-100% utilized and data must stay on-prem |
| Self-managed Kubernetes + GPUs (anywhere) | Custom platform engineering | Portable patterns | High complexity | When you have a platform team and portability is critical |
15. Real-World Example
Enterprise example: Visual quality inspection in manufacturing
- Problem: Multiple factories produce high-resolution images; CPU inference can’t meet latency/throughput, and the company needs strong network isolation.
- Proposed architecture
- GPU ECS instances (Elastic GPU Service) run inference services.
- VPC with private subnets; access via internal SLB.
- OSS stores images and model artifacts; NAS caches frequently used models.
- CloudMonitor alarms on instance health; ActionTrail for audit.
- Why Elastic GPU Service was chosen
- VM-level control to pin driver/CUDA/framework versions for validation.
- Predictable performance and easier compliance alignment than ad-hoc desktops.
- Expected outcomes
- Lower per-image inference time.
- Centralized governance and auditability.
- Faster rollout of updated models using golden images and rolling instance replacement.
Startup/small-team example: GPU-backed semantic search MVP
- Problem: A startup needs fast embedding generation and reranking for search; they can’t justify on-prem GPUs.
- Proposed architecture
- One small GPU ECS instance for inference.
- OSS bucket for model artifacts and logs.
- EIP only for admin access; app served behind a managed load balancer when scaling.
- Why Elastic GPU Service was chosen
- Quick provisioning and pay-as-you-go experimentation.
- Simple VM deployment without building a full Kubernetes platform on day one.
- Expected outcomes
- MVP performance meets product needs.
- Ability to scale out by cloning the instance image and adding a load balancer.
16. FAQ
- Is Elastic GPU Service different from ECS?
  In practice, Elastic GPU Service is delivered through GPU-enabled ECS instance families. Think of it as the GPU compute capability of ECS. Verify current product positioning in Alibaba Cloud docs.
- Which GPU models are available?
  Availability depends on region/zone and instance family. Check the ECS instance type list in your region and the Elastic GPU Service documentation.
- Can I use pay-as-you-go billing?
  Typically yes (as an ECS billing method). Availability of subscription/spot/preemptible varies; verify in the console for your region.
- Do GPU instances include NVIDIA drivers by default?
  Some images may include drivers, others do not. Always verify the selected image description and test with `nvidia-smi`.
- How do I verify the GPU is working?
  Run `nvidia-smi`. For deeper validation, run a small CUDA sample or a container test with the NVIDIA runtime.
- Can I use Docker containers with the GPU?
  Yes, but you must configure the NVIDIA container runtime/toolkit and ensure driver compatibility. Follow NVIDIA’s official documentation.
- Can I use Kubernetes (ACK) with GPUs?
  Many teams use GPU ECS instances as Kubernetes worker nodes. You must configure GPU scheduling and device plugins; verify ACK GPU guidance for your region/version.
- What storage is best for ML datasets?
  Common pattern: store datasets and artifacts in OSS, cache hot data on local ESSD, and optionally use NAS for shared filesystem needs.
- What are the main cost risks?
  Leaving GPU instances running idle, paying for EIPs, and data egress. Set budgets and automate shutdown for non-production.
- How do I secure SSH access?
  Restrict security group inbound to your IP, use key-based auth, and ideally use a bastion host or VPN rather than public SSH.
- How do I scale an inference service?
  Keep inference nodes stateless, load models from OSS/NAS, place them behind SLB, and scale out by adding instances or using Auto Scaling (the pattern depends on your stack).
- How do I handle spot/preemptible interruptions?
  Design training to checkpoint frequently to OSS and resume. For inference, use multiple instances and graceful draining.
- Can I snapshot a GPU instance and replicate it?
  You can create custom images/snapshots like normal ECS, but ensure driver licensing/compatibility and validate after cloning.
- Do I get GPU utilization metrics in CloudMonitor?
  Baseline ECS metrics are available. Detailed GPU metrics often require in-guest tooling/exporters. Verify what CloudMonitor provides natively for your instance family.
- What’s the safest way to keep driver/CUDA consistent?
  Use a golden image pipeline: build → test → publish. Pin versions and document compatibility matrices.
- Is Elastic GPU Service suitable for regulated workloads?
  It can be, if you design for encryption, access controls, auditing, and region residency requirements. Confirm compliance needs and Alibaba Cloud controls with official documentation and your compliance team.
17. Top Online Resources to Learn Elastic GPU Service
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Elastic GPU Service documentation (Alibaba Cloud Help Center) – https://www.alibabacloud.com/help | Primary reference for capabilities, regions, and guidance (verify the Elastic GPU Service section) |
| Official documentation | ECS documentation – https://www.alibabacloud.com/help/en/ecs | GPU instances are provisioned as ECS; this covers instance lifecycle, networking, disks, images |
| Official documentation | VPC documentation – https://www.alibabacloud.com/help/en/vpc | Networking fundamentals for secure GPU deployments |
| Official documentation | RAM documentation – https://www.alibabacloud.com/help/en/ram | Access control, least privilege policies, roles |
| Official documentation | CloudMonitor documentation – https://www.alibabacloud.com/help/en/cloudmonitor | Monitoring/alarms for ECS and related resources |
| Official documentation | ActionTrail documentation – https://www.alibabacloud.com/help/en/actiontrail | Audit logging of control-plane actions |
| Official pricing | Alibaba Cloud pricing overview – https://www.alibabacloud.com/pricing | Pricing entry point; use it to reach ECS pricing and calculator |
| Official product page | Elastic Compute Service (ECS) product page – https://www.alibabacloud.com/product/ecs | Background, billing methods, and entry point to pricing/specs |
| Official docs (data) | OSS documentation – https://www.alibabacloud.com/help/en/oss | Best practices for dataset/model storage |
| Official architecture | Alibaba Cloud Architecture Center – https://www.alibabacloud.com/architecture | Reference architectures; verify GPU-specific patterns available |
| External (vendor official) | NVIDIA Container Toolkit install guide – https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html | Required to run GPU workloads in Docker containers reliably |
| Community (use with care) | Trusted GitHub examples for CUDA/PyTorch/TensorFlow | Helps with smoke tests; ensure they match your driver/CUDA versions and security policies |
18. Training and Certification Providers
Below are training providers (neutral listing). Confirm current course catalogs and delivery modes on their websites.
- DevOpsSchool.com – Suitable audience: DevOps engineers, SREs, platform teams, cloud engineers – Likely learning focus: cloud operations, DevOps practices, infrastructure automation – Mode: check website – Website: https://www.devopsschool.com/
- ScmGalaxy.com – Suitable audience: software engineers, DevOps learners, build/release engineers – Likely learning focus: SCM, CI/CD, DevOps tooling foundations – Mode: check website – Website: https://www.scmgalaxy.com/
- CloudOpsNow.in – Suitable audience: cloud operations engineers, sysadmins moving to cloud – Likely learning focus: cloud operations, monitoring, reliability practices – Mode: check website – Website: https://www.cloudopsnow.in/
- SreSchool.com – Suitable audience: SREs, operations, platform engineering – Likely learning focus: SRE principles, observability, incident response – Mode: check website – Website: https://www.sreschool.com/
- AiOpsSchool.com – Suitable audience: operations teams adopting AIOps, monitoring/automation engineers – Likely learning focus: AIOps concepts, automation, operational analytics – Mode: check website – Website: https://www.aiopsschool.com/
19. Top Trainers
Listed as training resources/platforms (neutral listing). Verify specific trainer profiles and offerings on each site.
- RajeshKumar.xyz – Likely specialization: check website (often individual trainer branding) – Suitable audience: learners seeking guided training/mentorship – Website: https://rajeshkumar.xyz/
- devopstrainer.in – Likely specialization: DevOps tools and practices training – Suitable audience: DevOps engineers, CI/CD learners – Website: https://www.devopstrainer.in/
- devopsfreelancer.com – Likely specialization: DevOps freelancing/services and training resources – Suitable audience: teams and individuals seeking practical DevOps support – Website: https://www.devopsfreelancer.com/
- devopssupport.in – Likely specialization: DevOps support services and training – Suitable audience: ops teams and DevOps practitioners – Website: https://www.devopssupport.in/
20. Top Consulting Companies
Neutral listing based on provided names. Verify service lines and case studies directly with each company.
- cotocus.com – Likely service area: cloud/DevOps consulting (verify on website) – Where they may help: architecture reviews, cloud migrations, ops automation – Consulting use case examples:
  - Designing a secure VPC layout for GPU workloads
  - Building an image pipeline for GPU instances
  - Website: https://cotocus.com/
- DevOpsSchool.com – Likely service area: DevOps consulting and corporate training (verify on website) – Where they may help: CI/CD, infrastructure automation, operational readiness – Consulting use case examples:
  - Implementing IaC for ECS GPU fleets
  - Setting up monitoring/alerting standards for GPU services
  - Website: https://www.devopsschool.com/
- DEVOPSCONSULTING.IN – Likely service area: DevOps consulting services (verify on website) – Where they may help: DevOps transformations, reliability improvements, automation – Consulting use case examples:
  - Hardening SSH/bastion patterns for production GPU instances
  - Cost optimization for dev/test GPU usage
  - Website: https://www.devopsconsulting.in/
21. Career and Learning Roadmap
What to learn before Elastic GPU Service
- Linux fundamentals: SSH, systemd, logs, package managers, storage.
- Networking: VPC concepts, subnets, routing, security groups, NAT.
- Cloud basics on Alibaba Cloud:
- ECS instance lifecycle
- RAM and least privilege
- OSS basics
- GPU basics:
- What drivers do
- CUDA conceptually (even if you don’t write CUDA kernels)
What to learn after
- Image pipelines: Packer or similar image build workflows (verify your toolchain).
- Containers for GPU: Docker + NVIDIA runtime; container security.
- Orchestration: ACK Kubernetes GPU scheduling, autoscaling patterns.
- MLOps (if ML workloads):
- Model registry patterns, CI for models, canary deploys, drift monitoring (tooling varies).
- Observability at scale: centralized logs, tracing, metrics pipelines.
Job roles that use it
- Cloud engineer / infrastructure engineer
- DevOps engineer / SRE
- ML engineer / MLOps engineer
- Data scientist (advanced users managing their own GPU environments)
- Platform engineer building shared compute platforms
Certification path (if available)
- Alibaba Cloud certifications change over time and vary by region. Verify current Alibaba Cloud certification tracks on the official site:
- https://edu.alibabacloud.com/ (verify current certification pages and availability)
Project ideas for practice
- Build a golden GPU image with pinned driver + CUDA + PyTorch.
- Create an inference service on a GPU ECS instance and publish via SLB.
- Implement a training job that checkpoints to OSS and resumes after interruption.
- Build a cost-control script that shuts down idle GPU instances after N minutes.
22. Glossary
- Elastic GPU Service: Alibaba Cloud offering for GPU-accelerated compute, typically provisioned via GPU-enabled ECS instances.
- ECS (Elastic Compute Service): Alibaba Cloud virtual machine service used to run compute workloads.
- GPU: Graphics Processing Unit; excels at parallel workloads.
- VPC (Virtual Private Cloud): logically isolated virtual network in Alibaba Cloud.
- vSwitch: subnet within a VPC in a specific zone.
- Security Group: stateful virtual firewall controlling inbound/outbound traffic to instances.
- EIP (Elastic IP): static public IP that can be bound to cloud resources.
- ESSD: enterprise SSD cloud disk types for ECS (performance varies by category).
- OSS (Object Storage Service): Alibaba Cloud object storage for datasets, models, artifacts.
- NAS: managed shared file storage service (POSIX-like access).
- RAM (Resource Access Management): Alibaba Cloud IAM service for users, roles, and policies.
- KMS (Key Management Service): service for managing encryption keys (verify integrations you use).
- CloudMonitor: Alibaba Cloud monitoring and alerting service.
- ActionTrail: Alibaba Cloud audit logging of API actions.
- Golden image: prebuilt VM image with standardized configuration and software.
- CUDA: NVIDIA GPU computing platform and API ecosystem.
- `nvidia-smi`: NVIDIA tool to display GPU status, driver info, and utilization.
23. Summary
Elastic GPU Service on Alibaba Cloud (Computing) provides GPU-accelerated compute primarily through GPU-enabled ECS instances. It matters because GPUs are often the difference between feasible and impractical runtimes for AI training/inference, rendering, and parallel compute workloads.
Architecturally, treat it as a secure, VPC-isolated GPU VM layer that integrates with OSS/NAS for data, RAM for access control, and CloudMonitor/ActionTrail for operations and governance. Cost-wise, focus on the big drivers: instance type selection, running time (avoid idle), storage, and network egress/EIP usage. Security-wise, prioritize least privilege RAM policies, private networking, restricted SSH, encryption, and audit trails.
Use Elastic GPU Service when you need GPU acceleration with VM-level control and reproducibility; consider higher-level managed platforms when you want less infrastructure management. Next, deepen your skills by building a golden image pipeline and—if you operate at scale—validating GPU orchestration patterns with ACK and robust monitoring.