Alibaba Cloud CloudMonitor Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Developer Tools

Category

Developer Tools

1. Introduction

CloudMonitor is Alibaba Cloud’s native monitoring and alerting service for cloud resources and workloads. It helps you collect metrics, visualize health and performance, and trigger notifications or automated responses when something abnormal happens—before users notice.

In simple terms: CloudMonitor watches your Alibaba Cloud services (like ECS, RDS, SLB, and many others), tracks key performance indicators (CPU, memory, latency, errors, throughput, and more depending on the service), and sends alerts when thresholds are breached.

Technically, CloudMonitor is a metrics and events observability layer integrated into the Alibaba Cloud control plane. It provides built-in metric collection for many Alibaba Cloud services, plus the ability to ingest custom metrics (for your own applications), define alarm rules, route notifications, and build dashboards for operations teams.

CloudMonitor solves a core production problem: you can’t operate what you can’t observe. Without consistent monitoring and alerting, teams discover failures late, struggle to troubleshoot, and can’t reliably prove SLO/SLA compliance or capacity needs.


2. What is CloudMonitor?

CloudMonitor is an Alibaba Cloud service designed to monitor cloud resources and applications by collecting metrics/events, presenting them in dashboards, and enabling alerting and notification workflows.

Official purpose (what it’s for)

CloudMonitor’s purpose is to provide:

  • Monitoring of Alibaba Cloud services (built-in metrics)
  • Alerting via alarm rules and notification channels
  • Visualization via dashboards/metric charts
  • Custom monitoring for user-defined metrics (where supported)

(For the authoritative scope and feature list, verify in official docs: https://www.alibabacloud.com/help/en/cloudmonitor/)

Core capabilities

Common CloudMonitor capabilities include:

  • Cloud service monitoring: collect and chart metrics from supported Alibaba Cloud services
  • Host monitoring: OS-level metrics for ECS (often requires an agent; verify per OS/region)
  • Custom metrics: push application/business metrics into CloudMonitor (API-based)
  • Alert rules: threshold/condition-based alarms
  • Notification management: contacts, contact groups, and notification channels (availability varies; verify)
  • Dashboards: view metrics across multiple resources in one place
  • Events / event-driven monitoring: view resource/system events and alert on them (verify exact event sources in docs)

Major components (conceptual model)

CloudMonitor typically includes:

  • Metric collection: built-in service metrics + optional agent-based host metrics + custom metric ingestion.
  • Metric storage & query: time-series storage with query APIs (retention and granularity depend on metric type and product rules; verify).
  • Dashboards / visualization: console-based dashboards and charts; some environments integrate with Grafana (often via Prometheus or other services; verify for your setup).
  • Alarming & notification: alarm rules evaluate metrics and route alerts to contacts/channels.
  • Access control & audit: controlled via Alibaba Cloud RAM policies and audited via ActionTrail (verify logging coverage).

Service type

CloudMonitor is a managed monitoring/alerting platform service integrated across Alibaba Cloud.

Scope (regional/global/account/project)

CloudMonitor is account-scoped (per Alibaba Cloud account / Resource Account under a Resource Directory), while:

  • Metrics are typically tied to the region of the monitored resource (for example, ECS in cn-hangzhou vs ap-southeast-1).
  • The CloudMonitor console experience is centralized, but you select regions/resources for queries and alarms.

Exact regional behavior (especially for custom metrics and event sources) can vary—verify in official docs.

How it fits into the Alibaba Cloud ecosystem

CloudMonitor is part of the operational foundation for Alibaba Cloud workloads:

  • Works with compute (ECS), networking (SLB/ALB), storage (OSS), databases (RDS and others), and many SaaS services.
  • Complements (does not replace) log-focused products like Simple Log Service (SLS) and application tracing/APM products like ARMS. A common pattern is:
    • CloudMonitor for infrastructure/service metrics + alarms
    • SLS for logs + log analytics + alerting on log patterns
    • ARMS for application performance monitoring and distributed tracing


3. Why use CloudMonitor?

Business reasons

  • Reduced downtime and faster incident response: Detect and alert on issues early.
  • Operational visibility for stakeholders: Dashboards provide a shared source of truth.
  • SLA/SLO support: Monitoring is necessary for reliability commitments.

Technical reasons

  • Native integration with Alibaba Cloud services: built-in metrics reduce instrumentation effort.
  • Unified monitoring plane: standardize alerting patterns across teams and services.
  • Custom metrics (where supported) let you monitor business KPIs alongside infrastructure signals.

Operational reasons

  • Alarm automation: notify on-call engineers, trigger runbooks, or integrate with incident workflows.
  • Capacity planning: trend analysis helps forecast scale needs.
  • Change impact visibility: detect regression after deployments or infrastructure changes.

Security/compliance reasons

  • Auditability: monitoring/alerting supports compliance controls (detect anomalies, track operational status).
  • Separation of duties: RAM policies can limit who can modify alarm rules and notification channels.
  • Continuous control validation: confirm that key resources are healthy and within expected bounds.

Scalability/performance reasons

  • Handle growth: consistent metrics across regions/services support large-scale operations.
  • Performance baselines: define “normal” and alert when deviations occur.

When teams should choose CloudMonitor

Choose CloudMonitor when you:

  • Primarily run workloads on Alibaba Cloud and want a native monitoring platform.
  • Need standard service metrics and operational alarms quickly.
  • Want to centralize dashboards and alarms across Alibaba Cloud services.

When teams should not choose it (or should augment it)

CloudMonitor alone may not be enough when you:

  • Need deep application tracing/APM → consider ARMS (verify product fit).
  • Need full log analytics, indexing, and search → use Simple Log Service (SLS).
  • Require a single observability tool across multiple clouds/on-prem with a unified backend → consider Prometheus + Grafana or a third-party observability platform (plus integration).


4. Where is CloudMonitor used?

Industries

  • SaaS and internet services
  • E-commerce and mobile apps
  • FinTech and payments (with strict availability monitoring)
  • Gaming (latency and regional performance monitoring)
  • Manufacturing/IoT backends (device ingestion systems on Alibaba Cloud)
  • Education and media streaming platforms

Team types

  • DevOps and SRE teams
  • Platform engineering teams
  • Cloud infrastructure operations
  • Security operations (for certain operational anomaly detection)
  • Application teams (for service-level dashboards and KPIs)

Workloads

  • Web applications (ECS + SLB/ALB + RDS)
  • Containerized platforms (often augmented with Prometheus/Kubernetes metrics—verify your product stack)
  • Batch processing and scheduled workloads
  • API gateways and microservices (often paired with ARMS)
  • Storage-heavy workloads using OSS

Architectures

  • Single-region production with HA inside a region
  • Multi-region active-active or active-passive
  • Multi-account (Resource Directory) with centralized ops dashboards
  • Hybrid observability (CloudMonitor + SLS + ARMS)

Production vs dev/test usage

  • Production: comprehensive alarms (availability, error rate, saturation), escalation paths, on-call routing, tighter IAM controls, dashboards for NOC/SRE.
  • Dev/test: fewer alarms, focus on debugging and performance tests; careful cost control (custom metrics and probes can add cost).

5. Top Use Cases and Scenarios

Below are realistic scenarios where CloudMonitor is commonly used.

1) ECS CPU saturation alerting

  • Problem: Instances become slow or unresponsive due to CPU exhaustion.
  • Why CloudMonitor fits: ECS exposes built-in CPU utilization metrics; CloudMonitor alarms can notify on thresholds.
  • Scenario: Trigger an alarm when CPU > 85% for 5 minutes on production ECS instances.

2) ECS disk and memory monitoring (host monitoring)

  • Problem: Out-of-memory kills or disk-full errors cause outages.
  • Why CloudMonitor fits: With host monitoring/agent (where supported), you can capture OS-level memory/disk usage.
  • Scenario: Alert when / filesystem usage > 90% or memory available < 10%.
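
As a rough local cross-check of the disk part of this scenario, the sketch below computes the used-space percentage with Python's standard library. The function names and the 90% default are illustrative assumptions, and the CloudMonitor agent may calculate usage differently (for example, in how reserved blocks are counted), so treat the result as an approximation.

```python
import shutil

def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes

def filesystem_over_threshold(path: str = "/", threshold_pct: float = 90.0) -> bool:
    """Check whether the filesystem holding `path` is fuller than the threshold."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    return percent_used(usage.total, usage.used) > threshold_pct

# Hypothetical usage: mirrors the "/ usage > 90%" alarm condition locally.
print(f"/ over 90% full: {filesystem_over_threshold('/')}")
```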

3) RDS connection and storage threshold alarms

  • Problem: Applications fail due to max connections reached or storage exhaustion.
  • Why CloudMonitor fits: RDS provides operational metrics; alarms help prevent incident escalation.
  • Scenario: Notify DBAs when connections exceed 80% of limit or storage approaches capacity.

4) Load balancer health and traffic anomalies

  • Problem: Sudden drops in traffic or back-end health issues.
  • Why CloudMonitor fits: SLB/ALB metrics and health indicators can be monitored.
  • Scenario: Alert when healthy backend server count drops below threshold.

5) OSS request error monitoring

  • Problem: Increased 4xx/5xx errors from OSS disrupt downloads/uploads.
  • Why CloudMonitor fits: OSS exposes request and error metrics; alarms on them catch regressions quickly.
  • Scenario: Alert when OSS 5xx rate spikes above baseline.
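
One simple way to express “above baseline” is a mean-plus-k-standard-deviations test over recent samples. This is a generic statistical sketch, not CloudMonitor's alarm logic; the function name, the multiplier k, and the sample values are assumptions.

```python
from statistics import mean, pstdev
from typing import Sequence

def is_spike(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Return True when `current` exceeds mean(history) + k * stdev(history)."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline = mean(history)
    spread = pstdev(history)  # population standard deviation of the window
    return current > baseline + k * spread

# Hypothetical 5xx rates (percent of requests) from recent intervals.
recent_5xx_rates = [0.1, 0.2, 0.1, 0.2]
print(is_spike(recent_5xx_rates, current=0.5))
```

A perfectly flat history has zero spread, so any increase is flagged; a production check would likely add a minimum absolute margin.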

6) Website availability monitoring (synthetic/site monitoring)

  • Problem: Users report “site down” but infra metrics look fine.
  • Why CloudMonitor fits: CloudMonitor commonly offers site/synthetic monitoring (verify the exact “site monitoring” feature availability in your region).
  • Scenario: Probe https://api.example.com/health every minute from multiple locations.
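
A synthetic probe boils down to an HTTP request with a timeout and a pass/fail decision. The following is a minimal standard-library sketch of that idea, not the CloudMonitor site-monitoring feature; the `probe` function and its `opener` parameter (added so the check can be exercised without network access) are hypothetical.

```python
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0, opener=urllib.request.urlopen) -> bool:
    """Return True when the endpoint answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP 4xx/5xx
        # (urlopen raises HTTPError, a URLError subclass, for those).
        return False
```

Calling `probe("https://api.example.com/health")` once a minute from a scheduler approximates a single probe location; a managed service adds multiple locations and alerting on top.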

7) Business KPI monitoring via custom metrics

  • Problem: Infrastructure is healthy but orders drop or payments fail.
  • Why CloudMonitor fits: Custom metrics allow pushing business signals.
  • Scenario: Push “successful_checkout_count” metric and alert if it drops to near zero.

8) Release regression detection

  • Problem: A deployment increases latency and error rate.
  • Why CloudMonitor fits: Dashboards compare pre/post release patterns; alarms catch threshold breaches.
  • Scenario: After a release, monitor 95th percentile latency and error counts.
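
For reference, a 95th percentile can be computed from raw latency samples with the nearest-rank method, sketched below. This is one of several percentile definitions and is not necessarily the one any particular monitoring backend uses; the function name and sample values are illustrative.

```python
import math
from typing import Sequence

def percentile_nearest_rank(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds, before and after a release.
before = [110, 120, 115, 130, 125, 118, 122, 119, 121, 400]
after = [150, 160, 155, 170, 165, 158, 162, 159, 161, 900]
print(percentile_nearest_rank(before, 95), percentile_nearest_rank(after, 95))
```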

9) Cost anomaly early-warning (indirect)

  • Problem: A runaway job increases load and resource consumption.
  • Why CloudMonitor fits: Resource utilization spikes are often the earliest indicator of unexpected cost growth.
  • Scenario: Alert when outbound traffic or CPU usage grows unusually fast.

10) Multi-account centralized NOC dashboard

  • Problem: Large organizations struggle to view service health across accounts/teams.
  • Why CloudMonitor fits: Monitoring is account-scoped, so a central view relies on RAM permissions and cross-account access patterns (verify recommended patterns in the docs).
  • Scenario: Platform team builds standard dashboards and enforces baseline alarms across accounts.

11) Event-driven operational alerts (maintenance/instance lifecycle)

  • Problem: Unexpected maintenance, restarts, or lifecycle actions cause disruption.
  • Why CloudMonitor fits: Event monitoring can surface system/resource events (verify the event types available).
  • Scenario: Alert when an ECS instance is stopped/started unexpectedly.

12) SLO-driven alerting for critical APIs (combined approach)

  • Problem: You need service-level alerts (latency/error budgets), not just CPU.
  • Why CloudMonitor fits: CloudMonitor metrics + custom metrics can approximate SLO signals; for deeper tracing use ARMS.
  • Scenario: Push request success rate as a custom metric and alarm on error budget burn.
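
Error budget burn is usually expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below shows that arithmetic; the function name, the 99.9% target, and the sample success rate are assumptions, and the alerting threshold you apply to the result is a policy choice.

```python
def error_budget_burn_rate(success_rate: float, slo_target: float) -> float:
    """Return observed error rate divided by the allowed error rate (1 - SLO).

    A value of 1.0 means the budget is being spent exactly as fast as the SLO
    permits; values above 1.0 mean it will run out before the window ends.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    observed_error_rate = 1.0 - success_rate
    allowed_error_rate = 1.0 - slo_target
    return observed_error_rate / allowed_error_rate

# Hypothetical: 99% of requests succeeded against a 99.9% SLO.
print(round(error_budget_burn_rate(0.99, 0.999), 2))
```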

6. Core Features

Feature availability can differ by region, account type, or product edition. Verify in official docs for your environment: https://www.alibabacloud.com/help/en/cloudmonitor/

1) Cloud service monitoring (built-in metrics)

  • What it does: Collects metrics from supported Alibaba Cloud services automatically.
  • Why it matters: You get immediate visibility without installing agents or building collectors.
  • Practical benefit: Faster onboarding; consistent metric naming and dashboards.
  • Caveats: Not all services expose the same granularity; some metrics may have collection delays. Verify metric resolution/retention per service.

2) Host monitoring for ECS (agent-based OS metrics)

  • What it does: Collects OS-level metrics such as memory, disk usage, processes (depending on supported agent/OS).
  • Why it matters: CPU alone doesn’t explain many incidents (OOM, disk full, inode exhaustion).
  • Practical benefit: Alerts on memory and disk capacity prevent avoidable outages.
  • Caveats: Requires installing and maintaining an agent; ensure outbound connectivity and proper permissions. Verify supported OS versions and agent install steps.

3) Custom monitoring (custom metrics ingestion)

  • What it does: Lets you push your own metrics into CloudMonitor via API.
  • Why it matters: Infrastructure health does not always correlate with business health.
  • Practical benefit: Monitor KPIs like order counts, queue depth, feature flags, and cron job success.
  • Caveats: Custom metrics may be billable and subject to quotas; verify ingestion rate, retention, and pricing.
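
Before pushing custom metrics, it can help to shape data points and batch them client-side. The sketch below illustrates that preparation step only. The field names are placeholders and do not reproduce the actual request schema, and the batch size is an arbitrary assumption; consult the CloudMonitor API reference for the real parameters and limits.

```python
import time
from typing import Dict, Iterator, List, Optional

def build_point(name: str, value: float, dimensions: Dict[str, str],
                timestamp_ms: Optional[int] = None) -> dict:
    """Build one data point. Field names here are hypothetical placeholders."""
    return {
        "metric_name": name,
        "value": float(value),
        "dimensions": dict(sorted(dimensions.items())),  # stable key order
        "timestamp_ms": int(time.time() * 1000) if timestamp_ms is None else timestamp_ms,
    }

def batches(points: List[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `batch_size` points."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size]

# Hypothetical KPI from the checkout scenario above.
point = build_point("successful_checkout_count", 42,
                    {"service": "checkout", "env": "prod"})
print(point["metric_name"], point["dimensions"])
```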

4) Alarm rules (threshold-based alerting)

  • What it does: Evaluates metric conditions and triggers alarms when rules match (for example, CPU > 85% for 5 minutes).
  • Why it matters: Automates detection and reduces mean time to detect (MTTD).
  • Practical benefit: Standard “golden signal” alerting across services.
  • Caveats: Poorly tuned thresholds cause alert fatigue; use baselines and severity tiers.
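
The “CPU > 85% for 5 minutes” style of rule can be read as “the last N evaluation periods all breached the threshold”. The sketch below models that reading with plain Python; it is an illustration of the concept, not CloudMonitor's evaluator, and the function name and sample values are assumptions.

```python
from typing import Sequence

def breaches_for_consecutive_periods(values: Sequence[float], threshold: float,
                                     periods: int) -> bool:
    """Return True when the most recent `periods` values all exceed `threshold`."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(values) < periods:
        return False  # not enough data points to decide yet
    return all(value > threshold for value in values[-periods:])

# Hypothetical one-minute CPU samples (percent), oldest first.
cpu_samples = [50, 90, 91, 88, 95, 92]
print(breaches_for_consecutive_periods(cpu_samples, threshold=85, periods=5))
```

Requiring several consecutive breaches is what keeps a single short spike from paging anyone, at the price of a slower trigger.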

5) Notification management (contacts, groups, channels)

  • What it does: Routes alarms to contacts and contact groups through configured notification methods (email/SMS/webhook options vary—verify).
  • Why it matters: The best alert is useless if it doesn’t reach the right responders.
  • Practical benefit: On-call routing; team-based ownership.
  • Caveats: SMS/voice notifications often have additional costs; confirm notification pricing and regional availability.

6) Dashboards and visualization

  • What it does: Provides charts, dashboards, and multi-metric views.
  • Why it matters: Operations work is faster with curated dashboards.
  • Practical benefit: Single page view for service health and incident triage.
  • Caveats: Dashboard features can vary; some advanced visualization needs Grafana/Prometheus.

7) Metric query APIs / OpenAPI integration

  • What it does: Provides APIs to query metrics and manage alarms programmatically.
  • Why it matters: Enables “monitoring as code” patterns.
  • Practical benefit: Automate baseline alarms for every new resource; integrate with CI/CD.
  • Caveats: API calls are rate-limited and authenticate with AccessKeys or RAM roles; secure key management is critical.
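
A “monitoring as code” workflow typically generates the desired alarm definitions from an inventory and then creates whatever is missing. The sketch below shows that shape with plain dictionaries. The rule schema, naming convention, and threshold values are hypothetical and would need to be mapped onto the real CloudMonitor API or an IaC provider.

```python
from typing import Dict, Iterable, List, Set

def baseline_cpu_rules(instance_ids: Iterable[str], env: str) -> List[Dict[str, object]]:
    """Generate one CPU alarm definition per instance (hypothetical schema)."""
    return [
        {
            "name": f"{env}-{instance_id}-cpu-high",
            "instance_id": instance_id,
            "metric": "CPUUtilization",
            "comparison": ">",
            "threshold": 85,
            "consecutive_periods": 5,
        }
        for instance_id in sorted(instance_ids)
    ]

def missing_rules(desired: List[Dict[str, object]],
                  existing_names: Set[str]) -> List[Dict[str, object]]:
    """Return the desired rules whose names are not present yet."""
    return [rule for rule in desired if rule["name"] not in existing_names]

# Hypothetical inventory: two instances, one of which already has its rule.
desired = baseline_cpu_rules(["i-bbb", "i-aaa"], env="prod")
print([rule["name"] for rule in missing_rules(desired, {"prod-i-aaa-cpu-high"})])
```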

8) Event monitoring (resource/system events)

  • What it does: Surfaces events about resource state changes and platform operations (coverage varies; verify event sources).
  • Why it matters: Some incidents start as events (maintenance, instance reboot, failed scaling).
  • Practical benefit: Faster correlation between “what changed” and “what broke.”
  • Caveats: Event completeness differs by service; do not rely on events alone for availability monitoring.

9) Tag-based monitoring and grouping (where supported)

  • What it does: Use tags to filter/group resources in dashboards and alarm targeting.
  • Why it matters: Tagging is essential at scale.
  • Practical benefit: Team ownership, environment separation (prod/stage), cost allocation.
  • Caveats: Requires consistent tagging discipline and governance.
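
Tag-based grouping amounts to bucketing resources by the value of one tag key. The sketch below shows the idea on an in-memory inventory; the resource dictionaries and the "untagged" fallback are assumptions made for illustration, not a CloudMonitor data format.

```python
from collections import defaultdict
from typing import Dict, List

def group_by_tag(resources: List[dict], tag_key: str) -> Dict[str, List[str]]:
    """Group resource IDs by the value of `tag_key`; untagged resources get their own bucket."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for resource in resources:
        tag_value = resource.get("tags", {}).get(tag_key, "untagged")
        groups[tag_value].append(resource["id"])
    return dict(groups)

# Hypothetical inventory.
inventory = [
    {"id": "i-001", "tags": {"env": "prod", "team": "checkout"}},
    {"id": "i-002", "tags": {"env": "stage"}},
    {"id": "i-003", "tags": {}},
]
print(group_by_tag(inventory, "env"))
```

Surfacing the "untagged" bucket is a cheap way to find resources that would silently fall outside tag-targeted alarms.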

7. Architecture and How It Works

High-level service architecture

CloudMonitor sits between your resources and your operators/automation:

  1. Metric sources
     • Alibaba Cloud services emit metrics (ECS, RDS, SLB/ALB, OSS, etc.).
     • Optional host agent sends OS metrics from ECS.
     • Your apps can push custom metrics using APIs/SDKs.

  2. CloudMonitor ingestion and storage
     • Receives metrics and events.
     • Stores time-series data with defined retention/granularity rules (verify specifics per metric).

  3. Evaluation and alerting
     • Alarm rules periodically evaluate conditions.
     • Alarm state changes trigger notifications.

  4. Visualization and access
     • Console dashboards and charts for humans.
     • APIs for automation and integration.

Request/data/control flow (typical)

  • Data plane: metrics/events flow from services/agents/apps → CloudMonitor.
  • Control plane: operators define alarm rules/dashboards → CloudMonitor configuration is stored and applied.
  • Notification flow: alarm triggers → notification system → email/SMS/webhook (depending on configuration; verify).

Integrations with related services (common)

  • RAM (Resource Access Management): access control for CloudMonitor operations.
  • ActionTrail: audit of API calls/changes (verify event coverage).
  • Simple Log Service (SLS): log collection and analysis; often paired with CloudMonitor.
  • ARMS: application performance monitoring/tracing; complements CloudMonitor metrics.
  • Resource Directory: multi-account governance (patterns vary; verify best practices).

Dependency services

CloudMonitor is managed; you generally do not deploy dependencies yourself. Your main dependencies are:

  • Properly configured RAM permissions
  • Network access for any required agents
  • Notification endpoints (email/SMS/webhooks)

Security/authentication model

  • Console and API calls authenticate through Alibaba Cloud identity mechanisms.
  • Programmatic access commonly uses:
    • RAM users with least-privilege policies
    • RAM roles for services/automation where applicable (preferred over long-lived AccessKeys when possible)
  • Always follow least privilege; restrict “write” actions (create/modify alarms, contacts) to ops automation or a small group.

Networking model

  • Cloud service metrics are collected internally by Alibaba Cloud.
  • Host monitoring agents (if used) may require outbound connectivity to Alibaba Cloud endpoints; exact endpoints/ports vary. Verify in official docs.

Monitoring/logging/governance considerations

  • Treat alarms and dashboards as production configuration:
    • version-control via IaC or scripts where possible
    • consistent naming conventions
    • tagging for ownership and environment
  • Audit alarm changes using ActionTrail.
  • Review quotas and rate limits to avoid blind spots.

Simple architecture diagram (Mermaid)

flowchart LR
  subgraph AlibabaCloud["Alibaba Cloud Account"]
    ECS["ECS Instances"]
    RDS["RDS Database"]
    SLB["SLB/ALB"]
    APP["Custom App Metrics (API)"]
  end

  ECS -->|Service Metrics| CMS["CloudMonitor"]
  RDS -->|Service Metrics| CMS
  SLB -->|Service Metrics| CMS
  APP -->|PutCustomMetric API| CMS

  CMS --> DASH["Dashboards"]
  CMS --> ALARM["Alarm Rules"]
  ALARM --> NOTIF["Notifications (Email/SMS/Webhook*)"]

  note1["* Notification types vary by region/account. Verify in official docs."]
  NOTIF --- note1

Production-style architecture diagram (Mermaid)

flowchart TB
  subgraph RD["Resource Directory / Multi-Account (Optional)"]
    A1["Prod Account"]
    A2["Shared Services Account"]
  end

  subgraph Prod["Production VPC"]
    LB["SLB/ALB"]
    ECSASG["ECS/ASG App Tier"]
    RDS1["RDS Primary"]
    REDIS["ApsaraDB for Redis (optional)"]
  end

  subgraph Obs["Observability Stack (Managed)"]
    CMS["CloudMonitor\n(Metrics, Alarms, Dashboards)"]
    SLS["Simple Log Service\n(Logs, Search, Alerts)"]
    ARMS["ARMS\n(APM/Tracing, optional)"]
    AT["ActionTrail\n(Audit Logs)"]
  end

  Users["Users"] --> LB --> ECSASG --> RDS1
  ECSASG --> REDIS

  LB -->|Metrics| CMS
  ECSASG -->|Metrics| CMS
  RDS1 -->|Metrics| CMS
  ECSASG -->|Logs| SLS
  ECSASG -->|Traces/Metrics*| ARMS

  CMS -->|Alarm Notifications| OnCall["On-call (Email/SMS/Webhook)"]
  CMS --> NOC["NOC Dashboard"]

  CMS -->|API Calls| AT
  SLS -->|API Calls| AT
  ARMS -->|API Calls| AT

  note2["* ARMS integration depends on your instrumentation and licensing. Verify in official docs."]
  ARMS --- note2

8. Prerequisites

Before starting, ensure you have the following.

Account requirements

  • An active Alibaba Cloud account.
  • If using multiple accounts (Resource Directory), ensure you understand where metrics and alarms are managed (verify in official docs).

Permissions / IAM (RAM)

You need permissions to:

  • View monitored resources and metrics
  • Create/manage alarm rules
  • Create/manage contacts/contact groups
  • (Optional) install host monitoring agent on ECS

Practical approach:

  • Use a dedicated RAM user or RAM role for monitoring administration.
  • Apply least privilege: read-only for viewers, write permissions for ops automation.

Verify exact policy actions in the CloudMonitor API reference and RAM policy docs:

  • CloudMonitor docs: https://www.alibabacloud.com/help/en/cloudmonitor/
  • RAM docs: https://www.alibabacloud.com/help/en/ram/

Billing requirements

  • A billing method configured (Pay-as-you-go is common for labs).
  • Some CloudMonitor features may incur charges (custom metrics, synthetic monitoring, notifications). Confirm pricing before enabling.

Tools (optional but recommended)

  • Alibaba Cloud console access
  • Alibaba Cloud CLI (optional): https://www.alibabacloud.com/help/en/alibaba-cloud-cli/
  • SSH client to access an ECS instance for generating load

Region availability

  • CloudMonitor is available broadly, but feature availability may vary by region. Verify in official docs.

Quotas / limits

CloudMonitor typically enforces quotas such as:

  • Maximum alarm rules
  • Custom metric ingestion limits
  • API rate limits

Do not assume defaults—check quotas in the CloudMonitor console or official docs.

Prerequisite services (for this lab)

  • An ECS instance in any region you can access via SSH.
  • If you don’t have one, create a small pay-as-you-go ECS instance (cost depends on region, instance type, disk, bandwidth).
  • An email address for receiving alarm notifications.

9. Pricing / Cost

CloudMonitor pricing can be a combination of:

  • Included/basic monitoring for many Alibaba Cloud services
  • Usage-based charges for value-added capabilities (often custom metrics, synthetic monitoring, advanced alerting/notification channels, longer retention, etc.)

Rely on the official pricing pages, because Alibaba Cloud pricing varies by:

  • region
  • account type/contract
  • metric types and retention
  • notification method (SMS can be billable)

Official pricing sources (verify)

  • Product page (often links to pricing): https://www.alibabacloud.com/product/cloudmonitor
  • CloudMonitor documentation entry point: https://www.alibabacloud.com/help/en/cloudmonitor/
  • Alibaba Cloud pricing center: https://www.alibabacloud.com/pricing (navigate to CloudMonitor if listed)
  • If an official calculator is available for CloudMonitor, use it (availability varies). Verify in official sources.

Pricing dimensions (typical)

When evaluating CloudMonitor cost, expect these dimensions (confirm for your account):

  • Number/type of monitored metrics: built-in service metrics may be included; custom metrics may be billed by count and/or ingestion frequency.
  • Data points ingestion rate: higher-frequency metrics can increase cost and quota usage.
  • Alarm rules count and evaluation frequency: many alarms across many dimensions can increase evaluation load (pricing varies).
  • Notification volume and channel: SMS/voice notifications can be a direct billable item.
  • Synthetic/site monitoring probes: usually billed by number of probes, frequency, and locations.
  • Retention: longer retention or high granularity may be part of paid tiers (verify).

Cost drivers (most common)

  • Enabling custom metrics widely (high-cardinality labels/dimensions can explode metric count).
  • High-frequency monitoring (for example, 10-second intervals) if supported/paid.
  • SMS notifications for every alarm flapping incident.
  • Synthetic checks from many locations at short intervals.
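
The cardinality and frequency drivers above are easy to quantify: the number of distinct time series is the product of the distinct values of each dimension, and data points per day scale inversely with the reporting interval. The sketch below works through that arithmetic; the dimension counts are made-up examples and no pricing is implied.

```python
from math import prod
from typing import Dict

def series_count(distinct_values_per_dimension: Dict[str, int]) -> int:
    """Worst-case number of time series for one metric name."""
    return prod(distinct_values_per_dimension.values())

def datapoints_per_day(series: int, interval_seconds: int) -> int:
    """Data points written per day when every series reports at the given interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return series * (86_400 // interval_seconds)

# Hypothetical dimension cardinalities.
low = series_count({"service": 5, "env": 3})
high = series_count({"service": 5, "env": 3, "user_id": 100_000})
print(low, high)
print(datapoints_per_day(low, 60), datapoints_per_day(low, 10))
```

Adding a single per-user dimension multiplies the series count by the number of users, which is why identifiers like user IDs are a poor fit for metric dimensions.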

Hidden/indirect costs

  • Data transfer: not usually charged for internal metric collection, but:
    • host agents may generate outbound traffic (typically small, but verify)
    • your own custom metric push from outside Alibaba Cloud may incur internet egress from the sender side
  • Operational cost: time spent tuning thresholds, deduplicating alerts, and maintaining dashboards.

How to optimize cost

  • Prefer built-in service metrics where possible.
  • Avoid high-cardinality custom metrics (don’t use user IDs as metric dimensions).
  • Use email/webhook alerts where acceptable; reserve SMS for high-severity paging.
  • Reduce alarm noise: add appropriate durations, suppression, and dependency-based alerting patterns.
  • Standardize dashboards and alarms using templates and reuse.
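
Suppression of repeated notifications can be pictured as a per-alarm cooldown: once an alert is sent, further alerts for the same alarm are dropped until the window expires. The class below is a generic sketch of that pattern with hypothetical names; alarm rules may offer a built-in mute or silence setting that does the same job, so check the documentation before building your own.

```python
from typing import Dict

class AlertSuppressor:
    """Drop repeat notifications for the same alarm inside a cooldown window."""

    def __init__(self, cooldown_seconds: float) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_notify(self, alarm_name: str, now: float) -> bool:
        """Return True (and record the send time) if the alarm is outside its cooldown."""
        last = self._last_sent.get(alarm_name)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still cooling down: suppress
        self._last_sent[alarm_name] = now
        return True

# Hypothetical flapping alarm with a 10-minute cooldown; times are in seconds.
suppressor = AlertSuppressor(cooldown_seconds=600)
for t in (0, 120, 300, 600):
    print(t, suppressor.should_notify("lab-ecs-cpu-high", now=t))
```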

Example low-cost starter estimate (conceptual)

A low-cost lab typically includes:

  • 1 ECS instance basic monitoring
  • 1–3 alarm rules
  • Email notifications only

This is often near-zero incremental CloudMonitor cost if you stay within included metrics and avoid paid add-ons—but verify in official pricing for your region/account.

Example production cost considerations

In production, cost planning should consider:

  • hundreds/thousands of resources
  • per-service dashboards and alert rules
  • custom metrics for business and application signals
  • synthetic checks for critical endpoints
  • paging channels (SMS) and alert volumes

Best practice: run a one-week pilot, measure metric and alert volumes, and then validate charges in Billing Center.


10. Step-by-Step Hands-On Tutorial

This lab sets up a practical CloudMonitor alarm for an ECS instance CPU metric, generates load to trigger it, verifies notifications, and cleans up safely.

Objective

Create a CloudMonitor alarm that notifies you by email when an ECS instance CPU utilization stays high for several minutes, then validate it by generating CPU load on the instance.

Lab Overview

You will:

  1. Prepare an ECS instance and confirm metrics are visible.
  2. Create CloudMonitor contact and contact group.
  3. Create an alarm rule for ECS CPU utilization.
  4. Generate CPU load to trigger the alarm.
  5. Validate the alarm state and notification delivery.
  6. Clean up (delete alarm rule and optional test tools).

Notes:

  • Exact console labels may vary slightly by region or UI version.
  • Some accounts require enabling monitoring features or accepting service terms. Follow the console prompts.
  • To keep costs low, use email notifications rather than SMS.


Step 1: Prepare an ECS instance and confirm metrics are visible

  1. Sign in to the Alibaba Cloud console.
  2. Navigate to ECS and select a region.
  3. Ensure you have one running Linux ECS instance you can SSH into.

Expected outcome

  • ECS instance is running and reachable via SSH.

Verify metrics in CloudMonitor

  1. Open CloudMonitor in the console.
  2. Find Cloud Service Monitoring (or similar) and select ECS.
  3. Locate your instance and open its metric charts.
  4. Confirm you can see CPUUtilization (or similarly named CPU usage metric).

Expected outcome

  • You can view CPU usage charts for your ECS instance.

If you cannot see metrics:

  • Confirm you selected the correct region.
  • Confirm the ECS instance is running.
  • Wait a few minutes for metrics to appear after instance creation.


Step 2: Create an alarm contact and contact group

CloudMonitor typically routes alarms to contacts and contact groups.

  1. In CloudMonitor, go to Alerts / Alarm Service (naming may vary).
  2. Go to Contacts and create a new contact:
     • Name: lab-contact
     • Email: your email address
  3. Confirm/verify the email if the console prompts for verification.
  4. Create a Contact Group:
     • Name: lab-oncall
     • Add lab-contact to the group

Expected outcome

  • You have a contact group ready for alarm notifications.


Step 3: Create a CPU utilization alarm rule for ECS

  1. In CloudMonitor, go to Alarm Rules and choose Create Alarm Rule.
  2. Select the product/namespace for ECS metrics (often “ECS” or “Compute/ECS”).
  3. Target your ECS instance (InstanceId).
  4. Configure the rule (example values):
     • Metric: CPU utilization (for example, CPUUtilization)
     • Condition: > 80 (percent)
     • Duration: 5 minutes (or “5 consecutive periods” depending on UI)
     • Alarm level/severity: Warning (or equivalent)
  5. Notification:
     • Contact group: lab-oncall
     • Notification method: Email (avoid SMS for low cost)
  6. Name the rule: lab-ecs-cpu-high
  7. Create/Save.

Expected outcome

  • Alarm rule is created and shows status “Enabled”.

Verification

  • The alarm appears in the alarm rules list.
  • Alarm history is empty (no trigger yet), which is expected.


Step 4: Generate CPU load on the ECS instance

SSH into your ECS instance and run a CPU stress tool.

Option A: Using stress-ng (recommended if available)

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y stress-ng

CentOS/RHEL/Alibaba Cloud Linux (package availability varies):

sudo yum install -y stress-ng || true

Run CPU load (example: use 2 workers for 10 minutes):

sudo stress-ng --cpu 2 --timeout 10m

Option B: Using stress (often available)

Ubuntu/Debian:

sudo apt-get update
sudo apt-get install -y stress
sudo stress --cpu 2 --timeout 600

CentOS/RHEL:

sudo yum install -y epel-release || true
sudo yum install -y stress || true
sudo stress --cpu 2 --timeout 600

Option C: Simple shell loop (no packages)

This is less controlled, but works when you cannot install packages:

for i in 1 2; do
  (while :; do :; done) &
done
echo "CPU loops started. Remember to stop them later."

To stop the loops later:

# Kills this shell's background jobs; run it in the session that started the loops.
kill $(jobs -p) 2>/dev/null || true

Expected outcome

  • CPU utilization rises significantly and stays high for several minutes.

Verify on the instance

top

You should see CPU usage near 100% (depending on vCPU count and workers).


Step 5: Observe the alarm triggering

  1. Return to the CloudMonitor console.
  2. Open your ECS CPU chart and confirm CPU is above the threshold.
  3. Go to Alarm History (or “Alarm Events”).
  4. Wait for the evaluation window to pass (for a 5-minute rule, it can take 5–10 minutes depending on metric delay and evaluation interval).

Expected outcome

  • The alarm transitions to a triggered state (often “ALARM”).
  • You receive an email notification.

If you do not receive email:

  • Check email spam/junk folders.
  • Confirm the contact email was verified.
  • Confirm the rule’s notification settings include your contact group.
  • Confirm the alarm condition truly stayed above threshold for the configured duration.


Step 6: Recover and confirm alarm clears (optional but recommended)

Stop the stress workload (if not timed):

If using stress-ng or stress, it stops automatically after timeout.
If using shell loops:

# Kills this shell's background jobs; run it in the session that started the loops.
kill $(jobs -p) 2>/dev/null || true

Wait several minutes and observe CPU drop below threshold.

Expected outcome

  • Alarm eventually returns to normal/OK (depending on rule behavior and whether “recovery notifications” are enabled).


Validation

Use this checklist:

  • [ ] ECS CPU metrics are visible in CloudMonitor charts
  • [ ] Contact and contact group exist, email is verified
  • [ ] Alarm rule lab-ecs-cpu-high is enabled
  • [ ] CPU load sustained above threshold long enough to trigger the alarm
  • [ ] Alarm history shows trigger event
  • [ ] Email notification received

Troubleshooting

Issue: No metrics appear for ECS

  • Confirm correct region in CloudMonitor.
  • Wait 5–15 minutes after instance creation.
  • Verify the instance is running.
  • Some metrics require specific instance types or agents; check ECS metric docs (verify in official docs).

Issue: Alarm does not trigger – Ensure condition matches the metric scale (percent vs fraction). – Increase stress load (more workers) or lower threshold. – Increase duration window awareness: metrics and alarm evaluation can lag.

Issue: No email notification – Confirm email verification status. – Confirm contact group is attached to the alarm rule. – Check notification preferences and alarm severity routing. – Verify whether your account/region restricts certain notification methods.

Issue: CPU doesn’t go high – Your instance may have multiple vCPUs; --cpu 2 might not saturate it. Increase workers: bash sudo stress-ng --cpu 4 --timeout 10m – Use top to confirm actual CPU usage.
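Rather than guessing a worker count, it can be derived from the number of vCPUs. The following is a minimal Python sketch that only builds the command line; it assumes stress-ng is installed as in the earlier step, and `build_stress_command` is a hypothetical helper name, not part of any Alibaba Cloud tooling.

```python
import os
from typing import List, Optional


def build_stress_command(minutes: int, vcpus: Optional[int] = None) -> List[str]:
    """Build a stress-ng command line with one CPU worker per vCPU."""
    if minutes < 1:
        raise ValueError("minutes must be at least 1")
    # os.cpu_count() may return None, so fall back to a single worker.
    workers = vcpus if vcpus is not None else (os.cpu_count() or 1)
    if workers < 1:
        raise ValueError("vcpus must be at least 1")
    return ["sudo", "stress-ng", "--cpu", str(workers), "--timeout", f"{minutes}m"]


# Hypothetical 4-vCPU instance and a 10-minute run.
print(" ".join(build_stress_command(minutes=10, vcpus=4)))
# prints: sudo stress-ng --cpu 4 --timeout 10m
```

When `vcpus` is omitted, the helper uses the vCPU count of the machine it runs on, so it is best run on the ECS instance itself.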


Cleanup

To avoid noise and potential cost:

  1. Delete (or disable) the alarm rule lab-ecs-cpu-high.
  2. Delete the lab-oncall contact group and lab-contact contact (optional).
  3. Remove stress tools (optional). On Ubuntu/Debian:

sudo apt-get remove -y stress-ng stress || true

  4. If you created an ECS instance just for this lab, stop or release it to avoid compute charges.

Expected outcome – No active alarm rules remain from the lab; no ongoing notifications.


11. Best Practices

Architecture best practices

  • Monitor the “golden signals”: latency, traffic, errors, and saturation.
  • Add dependency-aware dashboards: load balancer → app tier → database → storage.
  • Use multi-region views for active-active architectures.

IAM/security best practices

  • Use RAM roles where possible instead of long-lived AccessKeys.
  • Separate duties:
      • Read-only dashboards for most users
      • Limited write permissions for ops/platform team
  • Restrict who can change:
      • alarm rules
      • contact groups
      • notification channels

Cost best practices

  • Keep custom metrics low-cardinality:
      • Good: service=checkout, env=prod
      • Bad: user_id=123456
  • Reduce SMS usage; reserve for critical pages.
  • Use fewer, higher-quality alarms instead of hundreds of noisy ones.
  • Periodically prune unused dashboards/alarm rules.
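To make the cardinality point concrete, the number of distinct time series for one metric is bounded by the product of the distinct values of each dimension. The Python sketch below uses the dimension names from the list above; the counts (12 services, 3 environments, 50,000 users) are invented for illustration.

```python
from math import prod
from typing import Dict


def series_count(distinct_values: Dict[str, int]) -> int:
    """Upper bound on time series for one metric.

    Each key is a dimension name; each value is how many distinct
    values that dimension can take.
    """
    return prod(distinct_values.values())


# Illustrative numbers only.
low_cardinality = {"service": 12, "env": 3}
high_cardinality = {"service": 12, "env": 3, "user_id": 50_000}

print(series_count(low_cardinality))   # 36
print(series_count(high_cardinality))  # 1800000
```

Adding a single per-user dimension multiplies the series count by the number of users, which is why identifiers such as user_id belong in logs rather than in metric dimensions.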

Performance best practices

  • Prefer built-in metrics when they exist.
  • Avoid pushing custom metrics at unnecessarily high frequency.
  • Use aggregation (sum/avg/max) at the source when possible.
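Aggregating at the source can be as simple as bucketing raw samples by minute and pushing one summary per bucket. This is a sketch under assumptions: samples are `(unix_timestamp, value)` pairs, the bucket size is 60 seconds, and `aggregate_per_minute` is a hypothetical helper. It does not call any CloudMonitor API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_per_minute(
    samples: Iterable[Tuple[float, float]],
) -> Dict[int, Dict[str, float]]:
    """Group (unix_timestamp, value) samples into one-minute buckets.

    Returns a dict keyed by the bucket's start timestamp, with the
    sum, avg, max, and count of the values in that bucket.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        minute_start = int(timestamp // 60) * 60
        buckets[minute_start].append(value)
    return {
        minute: {
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "max": max(values),
            "count": len(values),
        }
        for minute, values in sorted(buckets.items())
    }


# Four hypothetical latency samples (ms) spanning two minutes.
raw = [(0, 100.0), (15, 200.0), (59, 300.0), (60, 50.0)]
for minute, summary in aggregate_per_minute(raw).items():
    print(minute, summary)
```

Here four raw samples collapse into two data points, one per minute, which is what would be pushed instead of every individual measurement.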

Reliability best practices

  • Make alarms actionable:
      • include runbook links in descriptions (if supported)
      • include owner/team tag in the alarm name or metadata
  • Use multiple severity levels:
      • Warning (email)
      • Critical (page)
  • Prevent flapping:
      • use durations and proper thresholds
      • use silence/maintenance windows during planned work (verify feature availability)
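The effect of a duration on flapping can be shown with a small simulation. This Python sketch is a simplified model, not CloudMonitor's evaluator: it assumes a state change (in either direction) needs `consecutive` datapoints in a row on the other side of the threshold, and the CPU series is invented.

```python
from typing import Sequence


def count_state_changes(
    datapoints: Sequence[float], threshold: float, consecutive: int
) -> int:
    """Count OK/ALARM transitions for a threshold rule with a duration.

    The state flips only after `consecutive` datapoints in a row
    disagree with the current state.
    """
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    in_alarm = False
    streak = 0
    changes = 0
    for value in datapoints:
        breaching = value > threshold
        if breaching != in_alarm:
            streak += 1
            if streak >= consecutive:
                in_alarm = breaching
                changes += 1
                streak = 0
        else:
            streak = 0
    return changes


# Hypothetical CPU series hovering around an 80% threshold, then rising.
cpu = [78, 82, 79, 83, 78, 81, 79, 84, 85, 86, 87]
print(count_state_changes(cpu, threshold=80, consecutive=1))  # 7
print(count_state_changes(cpu, threshold=80, consecutive=3))  # 1
```

With no duration, the rule changes state seven times on this series; requiring three consecutive datapoints reduces that to a single transition once the load is genuinely sustained.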

Operations best practices

  • Standardize naming:
      • prod-<service>-<resource>-<signal>
  • Create baseline dashboards:
      • per service
      • per environment
  • Review alarms monthly:
      • remove stale ones
      • tune thresholds based on real incidents
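A naming standard is easier to keep when names are generated rather than typed. The sketch below follows the `prod-<service>-<resource>-<signal>` pattern; treating the first segment as an environment drawn from prod, stage, and dev, and forbidding hyphens inside segments, are assumptions made here so that names stay parseable. `alarm_name` is a hypothetical helper.

```python
import re

# Assumed environment prefixes; the pattern in the text only shows "prod".
ALLOWED_ENVS = ("prod", "stage", "dev")
# Segments are lowercase alphanumerics so "-" remains an unambiguous separator.
SEGMENT = re.compile(r"[a-z0-9]+")


def alarm_name(env: str, service: str, resource: str, signal: str) -> str:
    """Build an alarm name of the form <env>-<service>-<resource>-<signal>."""
    parts = [part.strip().lower() for part in (env, service, resource, signal)]
    if parts[0] not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment: {parts[0]!r}")
    for part in parts[1:]:
        if not SEGMENT.fullmatch(part):
            raise ValueError(f"invalid name segment: {part!r}")
    return "-".join(parts)


print(alarm_name("prod", "Checkout", "ecs", "cpu"))
# prints: prod-checkout-ecs-cpu
```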

Governance/tagging/naming best practices

  • Enforce tags:
      • Environment=prod|stage|dev
      • OwnerTeam=...
      • Application=...
      • CostCenter=...
  • Use tags as filters for dashboards and alarm targeting (where supported).
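Tag enforcement can start as a simple check run against an inventory export. The following Python sketch validates the four tags listed above; it works on plain dictionaries and does not query any Alibaba Cloud API, and `tag_violations` is a hypothetical helper name.

```python
from typing import Dict, List

REQUIRED_TAGS = ("Environment", "OwnerTeam", "Application", "CostCenter")
ALLOWED_ENVIRONMENTS = {"prod", "stage", "dev"}


def tag_violations(tags: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems with a resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems


# Hypothetical resource with an unknown environment and missing tags.
print(tag_violations({"Environment": "qa", "OwnerTeam": "payments"}))
```

An empty list means the resource passes; anything else can be reported to the owning team or used to block a deployment.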

12. Security Considerations

Identity and access model

  • CloudMonitor access is governed by RAM.
  • Apply least privilege:
      • Viewers: read-only metrics/dashboards
      • Operators: manage alarms
      • Admins: manage notification configurations and integrations

Encryption

  • Data is managed by Alibaba Cloud; for in-transit and at-rest controls, verify CloudMonitor security documentation and your compliance needs.
  • For custom metrics, ensure your client uses official endpoints and TLS (standard for Alibaba Cloud APIs).

Network exposure

  • Built-in metrics require no inbound access to your VPC.
  • Host monitoring agents may require outbound connectivity; restrict via:
      • security groups
      • egress policies
      • private endpoints/VPC endpoints if supported (verify)

Secrets handling

  • Avoid embedding AccessKeys in scripts on ECS.
  • Prefer:
      • RAM roles (where applicable)
      • secure secret stores (for example, KMS/Secrets Manager patterns; verify Alibaba Cloud offerings and best fit)
  • Rotate AccessKeys if you must use them.
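If AccessKeys cannot be avoided, reading them from the environment at runtime keeps them out of scripts and repositories. This Python sketch assumes the variable names `ALIBABA_CLOUD_ACCESS_KEY_ID` and `ALIBABA_CLOUD_ACCESS_KEY_SECRET`, which are commonly used by Alibaba Cloud SDKs; confirm the names for your SDK version. `load_access_key` and `mask` are hypothetical helpers.

```python
import os
from typing import Mapping, Tuple

# Assumed variable names; verify against your SDK's credential documentation.
KEY_ID_VAR = "ALIBABA_CLOUD_ACCESS_KEY_ID"
KEY_SECRET_VAR = "ALIBABA_CLOUD_ACCESS_KEY_SECRET"


def load_access_key(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Read the AccessKey pair from the environment, failing loudly if absent."""
    key_id = env.get(KEY_ID_VAR, "")
    key_secret = env.get(KEY_SECRET_VAR, "")
    if not key_id or not key_secret:
        raise RuntimeError(f"set {KEY_ID_VAR} and {KEY_SECRET_VAR} before running")
    return key_id, key_secret


def mask(secret: str) -> str:
    """Hide all but the last four characters so logs never show the full secret."""
    if len(secret) <= 4:
        return "*" * len(secret)
    return "*" * (len(secret) - 4) + secret[-4:]


# Fake values passed explicitly so the example runs without real credentials.
fake_env = {KEY_ID_VAR: "example-id", KEY_SECRET_VAR: "example-secret-1234"}
key_id, key_secret = load_access_key(fake_env)
print(key_id, mask(key_secret))
```

On ECS, a RAM role attached to the instance remains the preferred option, since it removes the long-lived secret entirely.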

Audit/logging

  • Use ActionTrail to audit CloudMonitor configuration changes (alarm creation, contact changes).
  • Export audit logs to SLS for retention and search if needed (verify integration).

Compliance considerations

  • Determine where monitoring data is stored/processed (region, retention).
  • Ensure contact/notification data (email/phone numbers) is handled per privacy policies.
  • For regulated industries, confirm:
      • data residency
      • retention controls
      • access logging

Common security mistakes

  • Over-permissive RAM policies that allow anyone to disable alarms.
  • Storing AccessKeys in plaintext on instances or in code repos.
  • Sending critical alerts to shared inboxes without access controls.
  • Not auditing alarm rule changes (leading to silent monitoring gaps).

Secure deployment recommendations

  • Maintain a minimal set of operators who can modify alarm configurations.
  • Use change management for alarm changes (ticket, PR, approval).
  • Regularly test that alerts reach the on-call rotation.

13. Limitations and Gotchas

Because CloudMonitor is a managed service and deeply integrated with Alibaba Cloud resources, watch for these common issues:

  • Region mismatches: Metrics are often region-bound; selecting the wrong region makes resources “disappear.”
  • Metric delays: Some metrics are not real-time; alarm evaluation can lag behind actual behavior.
  • Different granularity per service: ECS CPU may be frequent; other services might have coarser resolution.
  • Agent requirements: OS-level metrics often require an agent; missing agent = missing memory/disk signals.
  • Quota limits: Alarm rules, custom metrics, and API calls can hit quotas.
  • Alert fatigue: Default thresholds can be noisy; tune based on baselines and service behavior.
  • Notification costs: SMS/voice can create surprise bills during incident storms.
  • High-cardinality custom metrics: Can explode costs and degrade manageability.
  • Cross-account visibility: Multi-account setups require careful RAM and governance patterns (verify best practice in official docs).
  • Service overlaps: Logs and APM are separate products (SLS, ARMS). Don’t expect CloudMonitor to replace them.

14. Comparison with Alternatives

CloudMonitor sits in the “metrics + alerting for Alibaba Cloud resources” space. Here’s how it compares.

Comparison table

| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Alibaba Cloud CloudMonitor | Native monitoring/alerting for Alibaba Cloud services | Tight integration with Alibaba Cloud services, fast setup, managed dashboards/alarms | Not a full log analytics platform; deep APM requires other services | You run primarily on Alibaba Cloud and want a first-party monitoring baseline |
| Alibaba Cloud ARMS | Application performance monitoring & tracing | Deep app insights, traces, service topology (verify exact features) | Requires instrumentation; cost and complexity may be higher | You need application-level latency breakdowns and distributed tracing |
| Alibaba Cloud Simple Log Service (SLS) | Log collection, search, analytics, log-based alerts | Powerful log analytics, indexing, long-term retention options | Not a metrics-first tool; requires log pipeline setup | You need to investigate errors via logs or alert on log patterns |
| Managed Service for Prometheus + Grafana (Alibaba Cloud ecosystem) | Cloud-native/Kubernetes metrics and standard Prometheus ecosystem | PromQL, broad ecosystem, Grafana dashboards | Requires Prometheus model knowledge; integration work | You run Kubernetes/microservices and want Prometheus-native observability |
| AWS CloudWatch | Monitoring on AWS | Mature cross-service integration in AWS | Not native to Alibaba Cloud | Only if your workloads are primarily on AWS |
| Azure Monitor | Monitoring on Azure | Strong Azure integrations | Not native to Alibaba Cloud | Only if your workloads are primarily on Azure |
| Google Cloud Monitoring | Monitoring on GCP | Deep GCP integration | Not native to Alibaba Cloud | Only if your workloads are primarily on GCP |
| Prometheus + Grafana (self-managed) | Full control, hybrid/multi-cloud | Portable, flexible, large ecosystem | Operational overhead; scaling/HA complexity | You need maximum portability and can operate the stack |

15. Real-World Example

Enterprise example: Multi-region e-commerce platform on Alibaba Cloud

Problem

A large e-commerce company runs production in multiple regions. During flash sales, they face:

  • CPU saturation on ECS
  • database connection spikes on RDS
  • intermittent 5xx from load balancers

They need standardized alerting, dashboards for NOC, and audit trails for changes.

Proposed architecture

  • CloudMonitor monitors ECS, SLB/ALB, RDS metrics in each region.
  • Alarm rules:
      • critical: LB 5xx spikes, RDS connections near limit
      • warning: CPU/memory/disk thresholds
  • Dashboards:
      • per region and global view
      • per business service (checkout/search/catalog)
  • ActionTrail audits all alarm configuration changes.
  • SLS collects application logs for root-cause analysis; ARMS is used for deep tracing in critical services (optional).

Why CloudMonitor was chosen

  • Native Alibaba Cloud integration reduces deployment effort.
  • Centralized alarming and dashboards are faster to standardize.
  • Works well as the baseline metric layer for all core services.

Expected outcomes

  • Faster detection of saturation and error spikes
  • Reduced time to triage via standardized dashboards
  • Better governance and auditability for monitoring configuration


Startup/small-team example: Single-region SaaS API

Problem

A small startup runs:

  • 2 ECS instances behind a load balancer
  • RDS database

They need basic monitoring and reliable alerts without operating a full observability stack.

Proposed architecture

  • CloudMonitor for:
      • ECS CPU alarms
      • RDS storage and connections alarms
      • LB health metrics dashboard
  • Email notifications to the shared on-call inbox; SMS reserved for critical alarms only.
  • Minimal dashboard for daily checks.

Why CloudMonitor was chosen

  • Low operational overhead (managed service).
  • Quick setup and good default metrics for Alibaba Cloud services.

Expected outcomes

  • Incidents detected early without building custom tooling
  • Lean ops process suitable for a small team


16. FAQ

1) Is CloudMonitor the same as AWS CloudWatch?

No. CloudMonitor is Alibaba Cloud’s monitoring service. AWS CloudWatch is specific to AWS. They solve similar problems but are separate products with different APIs and integrations.

2) Do I need to install anything to monitor ECS CPU usage?

Usually no—ECS basic metrics such as CPU usage are typically available as built-in service metrics. OS-level metrics (memory/disk) often require an agent. Verify ECS metric coverage in official docs.

3) Can CloudMonitor monitor on-premises servers?

CloudMonitor is primarily for Alibaba Cloud resources. Some monitoring models might allow external/custom metrics ingestion, but on-prem host monitoring is not guaranteed. Verify supported hybrid options in official docs.

4) Can I create custom metrics for my application?

CloudMonitor commonly supports custom metrics via API (custom monitoring). Availability, quotas, and pricing vary—verify in official docs and pricing pages.

5) How do I avoid alert fatigue?

Use:

  • meaningful thresholds tied to user impact
  • evaluation durations (e.g., 5 minutes)
  • severity levels
  • fewer, higher-quality alerts

Also review and tune regularly based on incidents.

6) What notification methods are supported?

Typically email and sometimes SMS or webhooks/integrations. Exact methods vary by region/account and may change—verify in your CloudMonitor console and docs.

7) Does SMS alerting cost extra?

Often yes—telecom-based notifications typically incur charges. Confirm in Alibaba Cloud pricing and your Billing Center.

8) How long does CloudMonitor retain metrics?

Retention depends on metric type and service rules. Built-in metrics and custom metrics may differ. Verify retention in official docs.

9) Can I manage alarms as code?

CloudMonitor offers APIs (OpenAPI) for many operations. You can script alarm rule creation and updates. Verify API coverage in the CloudMonitor API reference.
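One way to approach alarms as code is to keep the desired rules in version control and compute a plan against what currently exists before calling any API. The Python sketch below covers only the planning step and makes no API calls; the `AlarmRule` fields, the metric names, and `plan_changes` are illustrative assumptions, not CloudMonitor's schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlarmRule:
    """Illustrative rule shape; real rules carry more fields."""
    name: str
    metric: str
    threshold: float
    consecutive_periods: int


def plan_changes(
    desired: List[AlarmRule], existing: List[AlarmRule]
) -> Dict[str, List[str]]:
    """Compare desired and existing rules by name and list what must change."""
    desired_by_name = {rule.name: rule for rule in desired}
    existing_by_name = {rule.name: rule for rule in existing}
    shared = desired_by_name.keys() & existing_by_name.keys()
    return {
        "create": sorted(desired_by_name.keys() - existing_by_name.keys()),
        "update": sorted(
            name for name in shared if desired_by_name[name] != existing_by_name[name]
        ),
        "delete": sorted(existing_by_name.keys() - desired_by_name.keys()),
    }


# Hypothetical rule sets: one threshold changed, one rule added, one retired.
desired = [
    AlarmRule("prod-checkout-ecs-cpu", "CPUUtilization", 80.0, 5),
    AlarmRule("prod-checkout-rds-connections", "ConnectionUsage", 85.0, 3),
]
existing = [
    AlarmRule("prod-checkout-ecs-cpu", "CPUUtilization", 90.0, 5),
    AlarmRule("prod-legacy-ecs-cpu", "CPUUtilization", 80.0, 3),
]
print(plan_changes(desired, existing))
```

The resulting plan can be reviewed in a pull request, and each entry then maps to whichever create, update, or delete operation the CloudMonitor API reference documents.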

10) Can I monitor Kubernetes metrics with CloudMonitor?

Kubernetes monitoring is usually handled through Prometheus-based solutions and/or ARMS/other Alibaba Cloud services. CloudMonitor may integrate at the infrastructure level. Verify the recommended Alibaba Cloud approach for ACK/Kubernetes monitoring.

11) What’s the difference between CloudMonitor and SLS?

CloudMonitor focuses on metrics (time-series). SLS focuses on logs (search, indexing, analytics). They complement each other.

12) What’s the difference between CloudMonitor and ARMS?

CloudMonitor is mainly infrastructure/service metrics and alerting. ARMS is application performance monitoring and tracing (APM). Use ARMS when you need code-level insights and distributed tracing.

13) Why do I see different metrics for different services?

Each Alibaba Cloud service exposes a different metric set and granularity based on what’s meaningful for that service. Always consult that service’s metric reference.

14) How do I secure access to monitoring data?

Use RAM least-privilege policies, restrict alarm modifications, rotate keys, and audit changes with ActionTrail.

15) Why do alarms trigger late?

Common reasons:

  • metric publishing delay
  • evaluation window duration
  • rule configuration (period, consecutive breaches)

Tune the rule and consider metric resolution constraints.
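These factors add up. As a rough back-of-the-envelope model (a simplification, not CloudMonitor's documented behavior), the time from the start of a sustained breach to the earliest possible trigger is the period multiplied by the required consecutive breaches, plus the publishing delay. The 120-second delay below is an assumed figure.

```python
def detection_delay_seconds(
    period_seconds: int, consecutive_periods: int, publish_delay_seconds: int
) -> int:
    """Approximate earliest time-to-trigger for a sustained breach."""
    if period_seconds <= 0 or consecutive_periods <= 0 or publish_delay_seconds < 0:
        raise ValueError("period and consecutive_periods must be positive; delay non-negative")
    return period_seconds * consecutive_periods + publish_delay_seconds


# Example: 1-minute period, 5 consecutive breaches, assumed 2-minute publishing delay.
delay = detection_delay_seconds(60, 5, 120)
print(delay / 60)  # 7.0 (minutes)
```

Shortening the period or the number of consecutive breaches reduces the delay but makes the rule more sensitive to short spikes, so the two need to be tuned together.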


17. Top Online Resources to Learn CloudMonitor

| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | CloudMonitor Documentation | Canonical feature descriptions, configuration guides, and references: https://www.alibabacloud.com/help/en/cloudmonitor/ |
| Official product page | CloudMonitor Product Page | Overview and entry point for pricing and positioning: https://www.alibabacloud.com/product/cloudmonitor |
| Official API reference | CloudMonitor API Reference (OpenAPI) | Automate alarms/metrics queries; confirm actions and parameters (navigate from docs): https://www.alibabacloud.com/help/en/cloudmonitor/ |
| Official CLI docs | Alibaba Cloud CLI | Learn how to authenticate and call CloudMonitor APIs via CLI: https://www.alibabacloud.com/help/en/alibaba-cloud-cli/ |
| Official RAM docs | Resource Access Management (RAM) | Secure CloudMonitor access with least privilege: https://www.alibabacloud.com/help/en/ram/ |
| Official audit docs | ActionTrail | Audit monitoring configuration changes: https://www.alibabacloud.com/help/en/actiontrail/ |
| Official logging docs | Simple Log Service (SLS) | Complement metrics with logs and log-based alerting: https://www.alibabacloud.com/help/en/sls/ |
| Official APM docs | ARMS | Add application tracing and APM where needed: https://www.alibabacloud.com/help/en/arms/ |
| Architecture guidance | Alibaba Cloud Architecture Center | Reference architectures and operational patterns (search within): https://www.alibabacloud.com/architecture |
| Community learning | Alibaba Cloud Blog | Practical posts and announcements; verify against docs: https://www.alibabacloud.com/blog |

18. Training and Certification Providers

| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, SREs, cloud engineers | DevOps practices, monitoring/observability fundamentals, cloud operations | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate engineers | SCM/DevOps concepts, operational tooling | Check website | https://www.scmgalaxy.com/ |
| CLoudOpsNow.in | Cloud operations teams | Cloud operations practices, monitoring and reliability | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs and platform teams | SRE principles, incident response, monitoring and alerting design | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops and engineering leaders, SREs | AIOps concepts, event correlation, automation (verify course outlines) | Check website | https://www.aiopsschool.com/ |

19. Top Trainers

| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/cloud coaching (verify offerings) | Students, working engineers | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and mentoring (verify offerings) | Beginners to intermediate | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps support/training (verify offerings) | Teams needing practical guidance | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support and enablement (verify offerings) | Ops teams and small organizations | https://www.devopssupport.in/ |

20. Top Consulting Companies

| Company Name | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify service catalog) | Observability adoption, cloud operations processes | Baseline monitoring rollout, alert tuning workshops, dashboard standardization | https://cotocus.com/ |
| DevOpsSchool.com | DevOps consulting and enablement (verify service catalog) | DevOps transformation, monitoring practices, training | Monitoring strategy, incident response process, “monitoring as code” implementation | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify service catalog) | CI/CD, cloud ops, reliability | Alarm rationalization, operational readiness reviews, SRE playbooks | https://www.devopsconsulting.in/ |

21. Career and Learning Roadmap

What to learn before CloudMonitor

  • Alibaba Cloud fundamentals:
      • ECS, VPC, security groups, SLB/ALB, RDS basics
  • Monitoring fundamentals:
      • metrics vs logs vs traces
      • SLI/SLO/SLA concepts
      • alert fatigue and on-call basics
  • IAM basics:
      • RAM users/roles/policies
      • least privilege and audit logging

What to learn after CloudMonitor

  • Simple Log Service (SLS) for log pipelines and log analytics
  • ARMS for tracing and APM (if you build microservices)
  • Incident management:
      • runbooks, postmortems, error budgets
  • Infrastructure as Code:
      • automate alarms and dashboards (via OpenAPI/CLI/Terraform patterns; verify official support and providers)

Job roles that use CloudMonitor

  • Cloud engineer / cloud operations
  • DevOps engineer
  • SRE
  • Platform engineer
  • Security engineer (operational monitoring and audit support)
  • Solutions architect (designing production-ready operations)

Certification path (if available)

Alibaba Cloud certifications evolve over time and vary by region. Verify current certification tracks and paths on the official certification portal: https://edu.alibabacloud.com/

Project ideas for practice

  1. Build a “production baseline” dashboard for ECS + RDS + SLB/ALB.
  2. Implement a standard set of alarms (CPU, disk, LB 5xx, DB connections).
  3. Add custom metrics for a sample API (requests/sec, error rate).
  4. Create an “alert review” process: track top noisy alarms and tune them.
  5. Multi-region failover drill: monitor primary and standby health and alert on failover signals.

22. Glossary

  • Alarm rule: A condition evaluated against a metric that triggers notifications when breached.
  • Metric: A numerical time-series measurement (CPU %, latency ms, requests count).
  • Namespace: Logical grouping for metrics (often per product/service).
  • Dimension: Metadata that identifies a metric series (e.g., InstanceId, device).
  • Retention: How long monitoring data is stored.
  • Granularity / resolution: The time interval between metric data points (e.g., 1 minute).
  • SLI (Service Level Indicator): A measurable indicator of service performance (latency, availability).
  • SLO (Service Level Objective): Target value/range for an SLI.
  • SLA (Service Level Agreement): Contractual commitment, often derived from SLOs.
  • Alert fatigue: When too many low-quality alerts cause responders to ignore alerts.
  • Host monitoring: OS-level monitoring, often via an agent installed on the server.
  • Custom metric: A metric defined and pushed by the user/application rather than provided by the cloud service.
  • RAM: Resource Access Management, Alibaba Cloud’s IAM service.
  • ActionTrail: Alibaba Cloud audit logging service for API actions.

23. Summary

CloudMonitor is Alibaba Cloud’s native monitoring, alerting, and dashboard service—an essential part of Developer Tools for operating workloads reliably. It provides built-in metrics for many Alibaba Cloud services, supports alarms and notifications, and can be extended with custom monitoring for application or business KPIs (where supported).

It matters because consistent observability reduces downtime, speeds up incident response, and supports scalable operations. Cost is typically driven by value-added features (custom metrics, synthetic checks, and certain notification channels like SMS), so confirm pricing in official sources and design alerts to minimize noise and high-volume paging.

Use CloudMonitor when you need a managed, Alibaba Cloud-integrated monitoring baseline. Pair it with SLS for logs and ARMS for deep application tracing when your workloads require broader observability.

Next step: build a production-ready dashboard and a minimal set of actionable alarms for ECS + RDS + SLB/ALB, then expand into logs (SLS) and tracing (ARMS) as your system grows.