Category
Observability and Management
1. Introduction
Oracle Cloud Infrastructure (OCI) Monitoring is the metrics and alerting service in Oracle Cloud under the Observability and Management category. It collects time-series metrics from OCI services (and optionally from your applications via custom metrics), lets you explore and query those metrics, and triggers alarms when conditions occur.
In simple terms: Monitoring tells you what is happening right now (and what happened recently) in your OCI resources—CPU is high, a load balancer is failing health checks, a database is running out of storage, or your application’s error rate has increased—and it can notify your team automatically.
Technically, Monitoring is a regional metrics platform that stores and serves metrics (service metrics and custom metrics), supports metric query and aggregation, and evaluates alarms based on metric query rules. Alarms typically deliver notifications through OCI Notifications (email, SMS in supported regions, HTTPS endpoints, Functions, etc., depending on Notifications capabilities in your tenancy/region).
Monitoring solves common operational problems:
- Detecting outages and performance regressions early
- Turning “someone noticed something is slow” into measurable SLO-driven operations
- Reducing mean time to detect (MTTD) and mean time to resolve (MTTR)
- Providing evidence for incident timelines and capacity planning inputs
Service status note: As of this writing, Monitoring is an active OCI service. OCI also uses the term Telemetry in some API/endpoint naming for metrics. Always verify the latest naming and feature scope in the official docs.
2. What is Monitoring?
Official purpose
OCI Monitoring provides a way to observe the health, performance, and behavior of resources by collecting and querying metrics, and to act on those metrics by configuring alarms that trigger notifications when conditions are met.
Core capabilities (what you can do)
- View service metrics emitted automatically by OCI services (for example, compute, networking, load balancers, databases—availability depends on the service).
- Publish custom metrics from your own applications and systems.
- Explore metrics using the console (Metric Explorer) and query metrics using APIs/CLI/SDKs.
- Create alarms driven by metric queries to detect thresholds, errors, saturation, or absence of signals.
- Route alarm notifications via OCI Notifications topics and subscriptions.
Major components
- Metrics
- Service metrics: Provided by OCI services.
- Custom metrics: You publish metric datapoints to a custom namespace.
- Namespaces: Logical grouping of metrics (service namespaces and your custom namespaces).
- Dimensions: Key/value attributes that describe and filter metrics (for example, resourceId, availabilityDomain, app, environment).
- Metric queries: Queries that aggregate and filter time-series data.
- Alarms: Rules that evaluate metric queries and trigger notifications.
- Notifications integration: Alarm destinations are typically Notifications topics (OCI Notifications service).
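The components above can be sketched in a few lines. This is an illustration of the identity model only, not OCI's implementation: a distinct time series exists for each unique combination of namespace, metric name, and dimension set (the names lab_metrics and orders_processed are examples reused from the lab later in this guide).

```python
# Illustration of the Monitoring data model: a time series is keyed by
# (namespace, metric name, dimension set); datapoints attach to that key.

def series_key(namespace: str, name: str, dimensions: dict) -> tuple:
    """Build a hashable identity for one time series."""
    return (namespace, name, tuple(sorted(dimensions.items())))

# Same dimensions (in any order) -> same series...
k1 = series_key("lab_metrics", "orders_processed",
                {"app": "demo-store", "environment": "lab"})
k2 = series_key("lab_metrics", "orders_processed",
                {"environment": "lab", "app": "demo-store"})
# ...while changing any dimension value creates a new series.
k3 = series_key("lab_metrics", "orders_processed",
                {"app": "demo-store", "environment": "prod"})

print(k1 == k2)  # True: dimension order does not matter
print(k1 == k3)  # False: different environment -> different series
```

This is why dimension design matters: every new dimension value silently creates another series.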
Service type
- Managed cloud service for metrics storage, querying, and alarm evaluation.
- Integrates tightly with other OCI services for metric emission and alerting workflows.
Scope (regional/global, tenancy/compartment)
- Monitoring is regional: metrics and alarms are evaluated and stored in the region where they are created and where the emitting resources exist.
- Access and organization are tenancy- and compartment-aware through OCI IAM policies.
- Metrics belong to a compartment (for service metrics, typically the resource’s compartment; for custom metrics, you specify the compartment when posting datapoints).
Fit in the Oracle Cloud ecosystem
In OCI Observability and Management, Monitoring typically works alongside:
- Notifications: Deliver alarm events to people or systems.
- Logging / Logging Analytics: Investigate logs related to alarm triggers.
- Events: Event-driven automation (separate service; often used with Notifications/Functions).
- APM (Application Performance Monitoring): Tracing and application-level observability (separate product area).
- Dashboards: Build operational dashboards that can visualize metrics (separate OCI dashboard capability; verify your console’s current dashboard offering).
Monitoring is the “metrics + alerts” foundation; other observability services add logs, traces, and deeper analytics.
3. Why use Monitoring?
Business reasons
- Reduce downtime cost by detecting failures faster and alerting the right team automatically.
- Improve customer experience by catching latency or saturation before it becomes an incident.
- Operational accountability: metrics and alarm history provide auditability of incident conditions.
- Enable SLO/SLA reporting inputs (Monitoring provides signals; reporting often requires additional tooling/process).
Technical reasons
- Built-in service metrics: You often get useful metrics without deploying agents.
- Custom metrics support: publish application KPIs (orders/minute, queue depth, error rate) to the same platform.
- Programmatic access: CLI/SDK/API lets you automate alarm creation and metric retrieval in CI/CD and IaC.
Operational reasons
- Centralized alerting using alarms and Notifications topics.
- Standardized metric model: namespaces, dimensions, aggregations, and queries.
- Faster troubleshooting: correlate metric changes with deployments or infrastructure changes.
Security/compliance reasons
- IAM-controlled access to metrics and alarms.
- Supports governance patterns (compartment isolation, tagging strategies, least privilege).
- Integrates with OCI’s auditing model (actions on alarms/policies are auditable via OCI Audit; verify details in official docs).
Scalability/performance reasons
- Managed service scales with your footprint; no need to operate a metrics backend for common cases.
- Enables consistent alarms across hundreds/thousands of resources using dimensions and consistent naming.
When teams should choose Monitoring
- You are running workloads on OCI and want a first-party way to monitor OCI resource health.
- You want basic-to-advanced metric alerting integrated with OCI IAM and Notifications.
- You want to publish custom business metrics without running your own time-series database (or as a complement to one).
When teams should not choose Monitoring (or should complement it)
- You need full observability stacks with long retention, complex dashboards, cross-cloud correlation, or deep tracing: consider APM, Logging Analytics, or third-party tools, and use Monitoring as a signal source.
- You have a mature Prometheus/Grafana ecosystem and want to keep a single metrics backend for all environments; you might still use OCI Monitoring for OCI-native alarms or integrate via exporters/bridges (verify supported integrations in current docs).
- You require very long metric retention or specialized analytics that Monitoring does not provide—verify retention and capabilities in official docs.
4. Where is Monitoring used?
Industries
- SaaS and technology (platform reliability, SLO monitoring)
- Finance (availability and latency monitoring with strict change control)
- Retail/e-commerce (traffic and order pipeline metrics)
- Healthcare (system health, audit-driven operations)
- Manufacturing/IoT backends (telemetry aggregation signals—often combined with Streaming and custom metrics)
- Education and public sector (cost-controlled baseline monitoring)
Team types
- DevOps and SRE teams (incident response, on-call, automation)
- Platform engineering (golden alarms, baseline dashboards, tenancy governance)
- Cloud operations/NOC teams (central alarm routing and triage)
- Security and compliance teams (monitoring critical controls and availability signals)
- Application teams (custom metrics + alerting for app KPIs)
Workloads
- OCI Compute instances and autoscaling groups
- Containerized workloads (OKE/Kubernetes—often combined with Prometheus, verify current OCI observability options)
- API backends behind OCI Load Balancers
- Databases (Autonomous Database or DB systems—service metrics)
- Event-driven/serverless (Functions, Streaming-based pipelines)
Architectures
- Single-region production with local alarms and notifications
- Multi-compartment multi-environment setups (dev/test/prod)
- Multi-region DR: regional alarms per region, centralized incident routing (often via shared Notifications integrations)
Real-world deployment contexts
- Production: alarms must be tuned (avoid noise), integrated with incident management, and use strong IAM boundaries.
- Dev/test: fewer alarms, shorter retention needs, focus on validating metrics and alarm logic; keep costs low by minimizing custom metric cardinality and ingestion volume.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Oracle Cloud Monitoring fits well.
1) Compute CPU saturation alarm
- Problem: An application VM becomes CPU-bound and starts timing out.
- Why Monitoring fits: OCI emits compute-related metrics; alarms can detect sustained high utilization.
- Scenario: Trigger an alarm when CPU utilization exceeds a threshold for 10 minutes and notify on-call.
2) Memory pressure detection (when available via agent/service metrics)
- Problem: Instances fail due to OOM or swapping, but CPU looks fine.
- Why Monitoring fits: Some memory metrics may be available via OCI agents/plugins depending on OS and configuration (verify in official docs).
- Scenario: Alarm on memory utilization or swap usage to act before incidents.
3) Load balancer backend health degradation
- Problem: Backends become unhealthy and traffic errors increase.
- Why Monitoring fits: Load balancer services typically emit health/HTTP metrics; alarms can detect unhealthy backend count or error rate.
- Scenario: Alarm when unhealthy backends > 0 for 5 minutes; notify and trigger an automated remediation runbook.
4) Autonomous Database storage or CPU threshold alerting
- Problem: Database resources approach limits; performance degrades.
- Why Monitoring fits: OCI database services emit service metrics; alarms can notify proactively.
- Scenario: Alarm when storage used exceeds a percentage or when CPU is consistently high.
5) Custom business KPI: orders per minute
- Problem: Infrastructure is “green” but business throughput drops.
- Why Monitoring fits: Custom metrics allow app-level KPIs, enabling operational alerting on business impact.
- Scenario: Publish an orders_processed metric; alarm if it drops below baseline during peak hours.
6) Custom metric: queue depth / lag for data pipelines
- Problem: Consumers fall behind; processing latency increases.
- Why Monitoring fits: Custom metrics can represent queue depth, lag, or backlog.
- Scenario: Alarm if backlog exceeds threshold; notify data engineering.
7) Detect “silence” (absence of expected metrics)
- Problem: A scheduled job stops running; no failures are logged centrally.
- Why Monitoring fits: Alarms can be built around missing signals (depending on supported query patterns; verify in official docs).
- Scenario: Publish a heartbeat metric; alarm if no datapoints are received in a window.
8) Multi-compartment operational guardrails
- Problem: Different teams deploy resources inconsistently, leading to monitoring gaps.
- Why Monitoring fits: Standard alarm patterns can be applied per compartment, with IAM controls and standardized notification routing.
- Scenario: Platform team provides Terraform modules that create baseline alarms for new workloads.
9) Capacity trending inputs
- Problem: You need capacity data to plan scale-ups.
- Why Monitoring fits: Metric history supports trend views; export via API to external analytics if needed.
- Scenario: Pull CPU/memory/network metrics regularly to a data lake for forecasting.
10) Incident correlation with logs and deployments
- Problem: Alert fires; you need fast root cause.
- Why Monitoring fits: Monitoring provides the “signal”; you correlate with OCI Logging/Logging Analytics and your CI/CD deployment timeline.
- Scenario: Alarm triggers; on-call checks logs for the same time window and compares to last deployment.
11) SLA monitoring at the edge (combined design)
- Problem: Need external availability checks.
- Why Monitoring fits: Monitoring can ingest custom results (for example, synthetic check results posted as custom metrics) or be combined with OCI Health Checks (separate service).
- Scenario: Synthetic probe posts api_availability and latency_ms metrics; alarms notify if availability drops.
12) Security operations signals (availability and misconfig indicators)
- Problem: You want to detect unusual spikes (traffic, errors) that may indicate abuse.
- Why Monitoring fits: Alarms on network/edge metrics can be an early indicator; integrate with security workflows.
- Scenario: Alarm on sudden surge of 4xx/5xx responses; notify security/on-call for investigation.
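Scenario 7 (detecting "silence") is worth a closer look, because the logic is inverted: the alarm fires when data stops arriving. The sketch below illustrates the heartbeat idea in plain Python; it is not OCI's evaluator, and in practice you would implement this with an alarm on absent datapoints (verify supported patterns in the official docs).

```python
from datetime import datetime, timedelta, timezone

def heartbeat_missing(last_seen, window, now=None):
    """Return True if no heartbeat datapoint arrived within the window."""
    now = now or datetime.now(timezone.utc)
    return (now - last_seen) > window

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)

# Job last reported 3 minutes ago, 10-minute window: healthy.
ok = heartbeat_missing(now - timedelta(minutes=3), timedelta(minutes=10), now)
# Job last reported 25 minutes ago: the scheduled job has gone silent.
bad = heartbeat_missing(now - timedelta(minutes=25), timedelta(minutes=10), now)
print(ok, bad)  # False True
```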
6. Core Features
Feature availability can vary by region and by OCI service integration. Verify specifics in the official docs for your region and tenancy.
6.1 Service metrics (OCI-provided metrics)
- What it does: Collects and stores metrics emitted automatically by OCI services.
- Why it matters: You can monitor core infrastructure health without deploying a metrics pipeline.
- Practical benefit: Fast setup—go from “no metrics” to dashboards/alarms quickly.
- Caveats: Metric names, dimensions, and availability differ by service; some signals require enabling agents/plugins or service-specific settings.
6.2 Custom metrics (publish your own datapoints)
- What it does: Lets you publish time-series datapoints to Monitoring under a custom namespace.
- Why it matters: Enables application-level and business-level observability.
- Practical benefit: You can alert on KPIs (orders/minute, queue depth) not visible via infrastructure metrics.
- Caveats: Custom metrics can introduce cost and complexity—especially high-cardinality dimensions (many unique dimension values).
6.3 Namespaces, metrics, dimensions model
- What it does: Organizes metrics by namespace and describes series by dimensions.
- Why it matters: Dimension filters are how you isolate metrics for specific resources, apps, environments, or tenants.
- Practical benefit: Scales monitoring patterns; one metric name can cover many resources.
- Caveats: Too many unique dimension combinations can explode the number of time series and increase cost/limits consumption.
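The cardinality caveat is multiplicative, which is easy to underestimate. A rough worst-case estimate (an illustration, not an OCI billing formula) is the product of unique values per dimension:

```python
from math import prod

def series_count(dimension_values):
    """Worst-case time series count: product of unique values per dimension."""
    return prod(dimension_values.values())

# Low-cardinality design: a few environments and apps.
low = series_count({"environment": 3, "app": 10})
# Adding a per-user dimension multiplies the series count.
high = series_count({"environment": 3, "app": 10, "userId": 50_000})
print(low, high)  # 30 1500000
```

One extra dimension turned 30 series into 1.5 million; keep identifiers like userId out of dimensions.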
6.4 Metric Explorer (console visualization)
- What it does: Interactive browsing, filtering, and charting of metrics in the OCI Console.
- Why it matters: Great for quick investigation and validating that metrics are flowing.
- Practical benefit: Reduce time to diagnose and validate alarm conditions.
- Caveats: For advanced dashboards and long-term views, you may need dedicated dashboards tooling (OCI dashboards or external tools—verify current options).
6.5 Metric Query Language (MQL) for alarms and queries
- What it does: Lets you define how to aggregate and evaluate metric data over time windows (for example, mean CPU over 5 minutes).
- Why it matters: Alarm correctness depends on correct query design (window, aggregation, filters).
- Practical benefit: Detect sustained issues rather than spikes; reduce alert noise.
- Caveats: Query syntax and functions are specific to OCI; confirm supported syntax and patterns in official MQL documentation.
6.6 Alarms (metric-based alert rules)
- What it does: Evaluates metric queries and transitions alarm states when conditions are met.
- Why it matters: Alarms are your automation boundary—turn metrics into action.
- Practical benefit: Detect incidents proactively and consistently.
- Caveats: Poorly tuned alarms cause fatigue; missing dimension filters can alert on the wrong resource(s).
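The tuning caveat comes down to query design. As an illustration (plain Python, not OCI's actual alarm evaluator or MQL), a "sustained" condition averages recent samples instead of firing on a single spike:

```python
def sustained_breach(values, threshold, window):
    """Fire only if the mean of the last `window` samples exceeds threshold."""
    if len(values) < window:
        return False
    recent = values[-window:]
    return sum(recent) / window > threshold

# One transient spike: averaged out, no alert.
print(sustained_breach([20, 25, 95, 30, 22], 80, 3))  # False
# Sustained load: alert fires.
print(sustained_breach([85, 90, 92, 88, 91], 80, 3))  # True
```

In OCI you express the same intent with the alarm's metric query window and statistic (for example, a mean over several minutes); verify the exact MQL syntax in the official documentation.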
6.7 Notifications integration (alarm destinations)
- What it does: Sends alarm notifications to OCI Notifications topics; subscribers receive messages (email, HTTPS, etc., depending on configuration).
- Why it matters: Separates “alarm evaluation” from “message delivery,” enabling fan-out and routing.
- Practical benefit: One alarm can notify multiple teams/systems through topic subscriptions.
- Caveats: Email subscriptions require confirmation; delivery formats and endpoints must be validated.
6.8 Alarm history and state transitions
- What it does: Tracks when alarms fire, clear, and change state.
- Why it matters: Helps reconstruct incidents and validate alarm tuning.
- Practical benefit: Post-incident reviews can use alarm timestamps to correlate events.
- Caveats: Historical depth and retention for alarm history should be verified in official docs.
6.9 API/CLI/SDK access (automation)
- What it does: Programmatically publish metrics, query metrics, and manage alarms.
- Why it matters: Enables IaC and GitOps patterns for monitoring configuration.
- Practical benefit: Repeatable, reviewable monitoring changes across environments.
- Caveats: Requires careful IAM design and secrets handling for automation credentials.
6.10 Compartment-aware governance
- What it does: Uses OCI compartments as a security and management boundary for metrics and alarms.
- Why it matters: Large organizations need isolation between teams/environments.
- Practical benefit: Separate prod vs non-prod alarms, limit who can modify alarms.
- Caveats: Cross-compartment visibility requires explicit IAM policies.
7. Architecture and How It Works
High-level service architecture
At a high level:
- Metrics are emitted either:
  - Automatically by OCI services (service metrics), or
  - Explicitly by your code/automation (custom metrics API).
- Monitoring stores metrics as time-series keyed by namespace + metric name + dimension set.
- Users and tools query and visualize metrics in the console or through API/CLI/SDK.
- Alarms evaluate metric queries on a schedule.
- When an alarm condition is met, Monitoring publishes a message to an OCI Notifications topic.
- Notifications delivers to configured subscriptions (email, HTTPS, Functions, etc., depending on Notifications support).
Request/data/control flow
- Data plane
- Datapoints flow into Monitoring (service-emitted or posted).
- Alarms evaluate stored datapoints.
- Control plane
- IAM policies determine who can read metrics, post custom metrics, and manage alarms.
- Console/CLI/SDK calls create/update/delete alarms, query metrics.
Integrations with related services
- OCI Notifications: alarm destinations for alert delivery.
- OCI Logging: investigate logs during alarm events.
- OCI Events: often used for automation patterns (not required for Monitoring itself).
- OCI Functions / Streaming: frequently used as downstream targets through Notifications or event-driven pipelines.
- Terraform / Resource Manager: manage alarms and topics as code (verify provider resource names in current Terraform docs).
Dependency services
- OCI IAM: authentication/authorization.
- OCI Notifications: if you want delivered alerts.
- The monitored OCI services: compute, networking, database, etc., for service metrics.
Security/authentication model
- Requests are authenticated using OCI IAM:
- Console sessions (federated or local users)
- API signing keys (for CLI/SDK)
- Instance Principals / Resource Principals (for workloads posting custom metrics—verify best practice for your architecture)
- Authorization is controlled by IAM policies at tenancy/compartment scope.
Networking model
- The Monitoring API endpoints are OCI regional service endpoints.
- From within OCI, you may use public endpoints or private access patterns depending on your network design (for example, using NAT/Service Gateway patterns—verify current OCI guidance for accessing public OCI services privately).
- For notifications delivered to HTTPS endpoints, ensure your endpoint is reachable and secured (TLS, auth).
Monitoring/logging/governance considerations
- Monitoring is itself an operational control; treat alarm configuration as production code:
- version control alarm definitions (Terraform)
- least privilege IAM
- standardized naming/tags
- Use OCI Audit to track changes to alarm configuration and IAM policies (verify audit event coverage in official docs).
Simple architecture diagram (Mermaid)
flowchart LR
A[OCI Resource<br/>Compute / LB / DB] -->|Service metrics| M[OCI Monitoring]
C[App / Script] -->|Post custom metrics| M
U[Engineer / SRE] -->|Query & charts| M
M -->|Alarm triggers| N[OCI Notifications Topic]
N --> E["Email / SMS / HTTPS / Function<br/>(per Notifications subscriptions)"]
Production-style architecture diagram (Mermaid)
flowchart TB
subgraph Tenancy["Oracle Cloud Tenancy"]
subgraph Compartments["Compartments: prod / nonprod / shared"]
subgraph Prod["Prod Compartment"]
W1[OKE / Compute Apps]
LB[Load Balancer]
DB[(Database Service)]
end
subgraph Shared["Shared Ops Compartment"]
MON[Monitoring<br/>Metrics + Alarms]
TOPIC["Notifications Topic(s)"]
end
end
end
W1 -->|"Custom metrics (KPI, errors)"| MON
LB -->|Service metrics| MON
DB -->|Service metrics| MON
MON -->|Alarm messages| TOPIC
TOPIC --> ONCALL[On-call Email/ChatOps Gateway]
TOPIC --> WEBHOOK[HTTPS Webhook<br/>Incident Mgmt / SOAR]
TOPIC --> FN[OCI Function<br/>Auto-remediation]
ONCALL --> RUNBOOK[Runbooks + Dashboards + Logs]
WEBHOOK --> RUNBOOK
FN --> RUNBOOK
8. Prerequisites
Tenancy and account requirements
- An active Oracle Cloud tenancy with permission to use Observability and Management services.
- A user account (or federated identity) with rights to create/read Monitoring resources.
Permissions / IAM policies
You need IAM permissions for at least:
– Reading metrics (to explore metrics)
– Managing alarms (to create alarm rules)
– Posting custom metrics (for the hands-on lab)
– Managing Notifications topics/subscriptions (to receive alarm messages)
OCI IAM policies are expressed in human-readable statements. Exact policy verbs and resource families can vary; verify the latest Monitoring and Notifications IAM policy examples in the official docs.
Typical patterns include:
– Allow a group to manage alarms in a compartment
– Allow a group to read/use metrics in a compartment
– Allow a group to manage topics/subscriptions in a compartment
– If posting custom metrics from automation: allow that principal to post metrics
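As a rough illustration of what those patterns look like as OCI policy statements (group and compartment names are placeholders; verify the exact verbs and resource families against the current Monitoring and Notifications policy references before using):

```
# Illustrative only -- confirm verbs/resource families in official docs.
Allow group ObsTeam to read metrics in compartment lab-monitoring
Allow group ObsTeam to use metrics in compartment lab-monitoring
Allow group ObsTeam to manage alarms in compartment lab-monitoring
Allow group ObsTeam to manage ons-topics in compartment lab-monitoring
Allow group ObsTeam to manage ons-subscriptions in compartment lab-monitoring
```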
Billing requirements
- Service metrics and basic alarm usage may be included or have minimal direct cost, but custom metrics ingestion/storage and downstream services may incur charges depending on your usage and region.
- You need a payment method configured if you plan to exceed Always Free or free allocations (verify current free tier details).
Tools
For the hands-on tutorial, you can use:
– OCI Console (web UI)
– OCI CLI for posting custom metrics and optional validation
CLI docs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cliconcepts.htm
Optional:
– Terraform (for IaC)
– SDKs (Python/Java/Go/etc.) for programmatic posting and querying
Region availability
- Monitoring is available in OCI commercial regions and many other OCI regions, but availability can vary.
- Confirm on the OCI region/service availability pages (verify in official docs).
Quotas/limits
- Limits exist for alarms, metric ingestion, namespaces, and API rates.
- Check OCI Service Limits and Quotas for Monitoring and Notifications in your tenancy (Console: Governance & Administration → Limits/Quotas; exact navigation may vary).
Prerequisite services
- OCI Notifications is required if you want alarms to send messages to email/webhooks.
- No compute resources are strictly required to try custom metrics (you can post from your local machine using OCI CLI).
9. Pricing / Cost
Do not treat this section as a quote. OCI pricing is region-dependent and can change. Always confirm in official pricing pages and your tenancy’s rate card.
Official pricing sources
- OCI price list (Observability and Management): https://www.oracle.com/cloud/price-list/#observability-and-management
- OCI Cost Estimator: https://www.oracle.com/cloud/costestimator.html
- OCI Free Tier overview: https://www.oracle.com/cloud/free/
Pricing dimensions (how costs are commonly determined)
OCI Monitoring cost typically depends on factors such as:
– Custom metrics ingestion: how many datapoints you publish (frequency × number of time series).
– Custom metrics storage/retention: how long datapoints are retained (if priced separately; verify current model).
– API requests: heavy querying/exporting can have request costs or rate limits (verify).
– Alarms: some providers price per alarm or per evaluation; OCI’s model must be confirmed in the official price list for your region.
– Notifications delivery: topic usage and delivery endpoints may have their own pricing dimensions (verify Notifications pricing).
Service metrics emitted by OCI services are often available without a separate “ingestion charge,” but your overall bill still includes the monitored services (compute, database, networking, etc.). Treat Monitoring as a cost multiplier only when you add custom metrics, heavy queries, long retention requirements, or downstream integrations.
Free tier (if applicable)
OCI has a Free Tier; Monitoring and Notifications may have Always Free components or free allocations. Verify current Always Free limits for Monitoring and Notifications on the official Free Tier pages and the price list.
Main cost drivers
- High-cardinality custom metrics
  – Example: dimensions include userId, requestId, or podUid with thousands of unique values.
  – Result: explosive growth in time series count and datapoints.
- High-frequency datapoints
  – Publishing every 1 second instead of every 60 seconds increases ingestion by 60×.
- Many environments
  – Dev/test/prod each with its own custom metrics and alarms.
- Exporting/reading at scale
  – Frequent dashboards, external exporters, and API-based polling.
- Downstream alert delivery
  – Notifications to many endpoints and high alert volume.
Hidden or indirect costs
- Compute/network costs for any collectors/agents you run to generate custom metrics.
- Data egress if you send notifications to external systems or export metrics to external locations (depends on architecture; verify OCI data transfer pricing).
- Operational overhead: alert fatigue costs real engineering time.
Network/data transfer implications
- Posting custom metrics from outside OCI uses public endpoints; your local network egress is on your side; OCI ingress is typically not charged but verify.
- Sending alerts to external HTTPS endpoints could involve OCI egress from the Notifications service path (verify).
How to optimize cost
- Prefer service metrics when available.
- For custom metrics:
- Use low cardinality dimensions (environment, app, region) rather than per-user identifiers.
- Publish at the lowest frequency that meets your alerting needs (often 1 minute).
- Aggregate upstream when possible (publish counts/sums rather than raw event-per-request).
- Reduce alert noise: fewer triggered alarms reduces downstream delivery volume and operational load.
- Use compartments and tagging to track cost by team/app.
Example low-cost starter estimate (no fabricated numbers)
A low-cost starter approach typically looks like:
– Use service metrics for compute/load balancer/database.
– Add a small number of custom metrics (single namespace, a few metric names, 1-minute resolution) for key KPIs.
– Create a handful of alarms (CPU high, LB backend unhealthy, KPI drop) routed to one Notifications topic.
To estimate:
1. Determine custom metric datapoints per month:
   datapoints = (time series count) × (datapoints per minute) × (minutes per month)
2. Plug datapoints into the Monitoring price dimension for your region in the price list.
3. Add Notifications delivery estimates if applicable.
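The estimation formula is simple arithmetic; a small helper makes the frequency trade-off concrete (the series counts below are illustrative, and the result is a datapoint volume, not a price — plug it into your region's price list):

```python
def monthly_datapoints(series, per_minute, minutes_per_month=60 * 24 * 30):
    """datapoints = series count x datapoints/minute x minutes/month."""
    return int(series * per_minute * minutes_per_month)

# Example: 5 KPIs x 3 environments = 15 series at 1-minute resolution.
print(monthly_datapoints(15, 1))   # 648000 datapoints/month
# The same series published every second: 60x the ingestion volume.
print(monthly_datapoints(15, 60))  # 38880000 datapoints/month
```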
Example production cost considerations
In production, costs can rise due to:
– Many microservices each emitting multiple custom metrics
– Multiple clusters/regions and per-team compartments
– Extensive dashboards, exports, and third-party integrations
– Alert storms causing high message volume downstream
For production, formalize:
– A metric taxonomy and dimension policy
– A custom metrics budget per team
– Automated checks in CI to prevent high-cardinality dimensions
10. Step-by-Step Hands-On Tutorial
This lab creates a complete “metrics → alarm → notification” flow using Oracle Cloud Monitoring and OCI Notifications. It uses custom metrics so you can complete the tutorial without provisioning compute resources.
Objective
- Post a custom metric datapoint into Oracle Cloud Monitoring
- Visualize it in the OCI Console
- Create an alarm that triggers based on the metric
- Receive an email notification through OCI Notifications
- Clean up all created resources
Lab Overview
You will:
1. Create (or choose) a compartment for the lab.
2. Create a Notifications topic and email subscription.
3. Configure OCI CLI (if not already configured).
4. Post custom metric datapoints to Monitoring.
5. Create an alarm on that custom metric and route it to the topic.
6. Trigger the alarm and validate the email notification.
7. Clean up.
Expected time: ~30–60 minutes (depends mostly on email subscription confirmation).
Step 1: Choose or create a compartment for the lab
Console steps
1. Open the OCI Console.
2. Navigate to Identity & Security → Compartments.
3. Either:
– Select an existing non-production compartment you can use, or
– Create a new compartment, for example: lab-monitoring
Expected outcome – You have a compartment OCID available for later steps.
Verification – In the compartment details page:
– Copy the Compartment OCID
– Confirm it shows as Active
Step 2: Ensure you have the required IAM permissions
To complete the lab, your user/group needs permissions for:
– Monitoring metrics (post/read)
– Monitoring alarms (create/manage)
– Notifications topics/subscriptions (create/manage)
If you do not have admin access, ask your tenancy administrator to grant you the minimum required permissions. OCI policies vary by org; use official examples and least privilege.
Where to verify
– Official Monitoring docs (IAM/policies section): https://docs.oracle.com/en-us/iaas/Content/Monitoring/home.htm
– Official Notifications docs (IAM/policies section): https://docs.oracle.com/en-us/iaas/Content/Notification/home.htm
Expected outcome – You can create a topic, create an alarm, and post custom metrics without authorization errors.
Common error
– NotAuthorizedOrNotFound when creating alarms or posting metrics: usually missing IAM policy or using the wrong compartment.
Step 3: Create a Notifications topic and email subscription
Alarms typically publish to a Notifications topic. Subscriptions deliver messages to endpoints such as email.
Console steps
1. Navigate to Observability & Management → Notifications (in some consoles: Developer Services → Notifications).
2. Select your lab compartment.
3. Click Create Topic
– Name: lab-monitoring-topic
– (Optional) Description: Alarm notifications for Monitoring lab
4. After the topic is created, open it and click Create Subscription
– Protocol: EMAIL
– Email: your email address
5. Check your inbox for the confirmation email and confirm the subscription.
Expected outcome – A topic exists and the subscription is in Confirmed state.
Verification – In the topic’s subscription list, confirm:
– Protocol: EMAIL
– Lifecycle: Confirmed (or similar state wording)
Common error – Subscription stays pending: check spam/junk folder; resubmit the subscription if needed.
Step 4: Set up OCI CLI (local machine)
If you already use OCI CLI, you can skip to Step 5.
Install and configure – OCI CLI install docs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cliconcepts.htm
After installing, configure:
oci setup config
You will be prompted for:
– Tenancy OCID
– User OCID
– Region (for example us-ashburn-1)
– Path for config and keys
Expected outcome
– You have ~/.oci/config created (or equivalent on Windows).
– oci commands run successfully.
Verification – Run:
oci iam region list --output table
If authentication is correct, you will see a table of regions.
Common errors and fixes
– Failed to verify the SSL certificate: update CA certificates or corporate proxy settings.
– NotAuthorizedOrNotFound: wrong region/OCID or missing IAM policy.
Step 5: Post a custom metric datapoint to Monitoring
Now you will publish a custom metric called orders_processed in namespace lab_metrics.
5.1 Gather required IDs
You need: – The compartment OCID for your lab compartment (from Step 1)
Set it in your shell:
export COMPARTMENT_OCID="ocid1.compartment.oc1..exampleuniqueID"
Also set a timestamp:
export TS="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
echo "$TS"
5.2 Create the metric payload file
Create a file named metric_data.json:
cat > metric_data.json <<EOF
[
{
"namespace": "lab_metrics",
"compartmentId": "${COMPARTMENT_OCID}",
"name": "orders_processed",
"dimensions": {
"app": "demo-store",
"environment": "lab"
},
"datapoints": [
{
"timestamp": "${TS}",
"value": 1
}
]
}
]
EOF
JSON structure note: the OCI CLI expects a list of metric objects for --metric-data. If the CLI interface changes, verify the exact payload format in the current CLI docs for oci monitoring metric-data post.
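Before posting, it is worth validating the file locally. The sketch below rebuilds the payload with placeholder fallbacks (so it runs even without the earlier exports) and checks that it parses as JSON — no OCI call is involved:

```shell
# Fall back to placeholder values so this check runs standalone
COMPARTMENT_OCID="${COMPARTMENT_OCID:-ocid1.compartment.oc1..exampleuniqueID}"
TS="${TS:-$(date -u +"%Y-%m-%dT%H:%M:%SZ")}"

cat > metric_data.json <<EOF
[
  {
    "namespace": "lab_metrics",
    "compartmentId": "${COMPARTMENT_OCID}",
    "name": "orders_processed",
    "dimensions": { "app": "demo-store", "environment": "lab" },
    "datapoints": [ { "timestamp": "${TS}", "value": 1 } ]
  }
]
EOF

# python3 -m json.tool exits nonzero on malformed JSON
python3 -m json.tool metric_data.json > /dev/null && echo "payload is valid JSON"
```

This catches the most common failure (a stray comma or unexpanded variable producing invalid JSON) before the CLI rejects the request.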
5.3 Post the metric
oci monitoring metric-data post --metric-data file://metric_data.json
Expected outcome – The command returns a response indicating the datapoints were accepted (look for a successful HTTP status/response).
Verification (console)
1. Go to Observability & Management → Monitoring → Metrics Explorer (naming may vary slightly).
2. Select the region and compartment.
3. Choose namespace: lab_metrics
4. Find metric name: orders_processed
5. Filter dimensions:
– app=demo-store
– environment=lab
6. Set the time window to “Last 5–15 minutes” and confirm you see the datapoint.
Common errors
– InvalidParameter or payload errors: JSON format mismatch; re-check commas/quotes and consult CLI reference.
– No datapoint appears: confirm you’re viewing the same region and compartment you posted to, and widen time range.
Step 6: Create an alarm on the custom metric
Now you’ll create an alarm that fires when orders_processed is greater than or equal to 10 (we will post a value of 10+ to trigger it).
Console steps
1. Navigate to Observability & Management → Monitoring → Alarms.
2. Select the lab compartment.
3. Click Create Alarm.
4. Set:
– Alarm name: lab-orders-processed-alarm
– Metric namespace: lab_metrics
– Metric name: orders_processed
– Dimensions: app=demo-store, environment=lab (so the alarm is scoped)
– Statistic/Aggregation: choose an appropriate aggregation for your use case (for example, max or sum).
– Trigger rule: threshold >= 10
– Evaluation window / interval: choose defaults or a short evaluation window for the lab (exact UI wording varies).
5. Destination
– Choose Notifications topic: lab-monitoring-topic
6. Create the alarm.
Alarm query note: OCI uses a metric query language for alarms. If the console shows the underlying query, review it carefully and confirm dimension filters are applied. For MQL syntax and best practices, verify in official docs.
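For illustration, an alarm query matching this lab's metric and dimensions might look like the line below. This is a syntax sketch only; confirm the exact MQL grammar in the official docs:

```
orders_processed[1m]{app = "demo-store", environment = "lab"}.max() >= 10
```

Reading left to right: metric name, evaluation interval, dimension filters, aggregation, and trigger condition.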
Expected outcome – Alarm is created and appears in the Alarms list. – Initial state is typically OK/No data (depends on your datapoints and evaluation settings).
Verification – Open the alarm details and confirm: – Destination topic is correct – Metric namespace/name and dimensions match your posted datapoints
Step 7: Trigger the alarm by posting a higher datapoint
Update the timestamp and value and post again.
export TS="$(date -u +"%Y-%m-%dT%H:%M:%SZ")"
cat > metric_data.json <<EOF
[
{
"namespace": "lab_metrics",
"compartmentId": "${COMPARTMENT_OCID}",
"name": "orders_processed",
"dimensions": {
"app": "demo-store",
"environment": "lab"
},
"datapoints": [
{
"timestamp": "${TS}",
"value": 15
}
]
}
]
EOF
oci monitoring metric-data post --metric-data file://metric_data.json
Expected outcome – Within the alarm evaluation period, the alarm changes to Firing (or equivalent). – You receive an email notification via Notifications.
Verification – In Monitoring → Alarms, open the alarm and check: – Current state shows firing – Alarm history shows a transition event – Check your email for the alarm message.
Validation
Use this checklist:
- Metric exists
  – Metric Explorer shows lab_metrics / orders_processed with your dimensions.
- Alarm exists
  – Alarm is created and scoped to the correct compartment and dimensions.
- Alarm triggers
  – Alarm state transitions to Firing after posting value: 15.
- Notification delivered
  – Email subscription is confirmed.
  – You received the alarm email.
If any item fails, use Troubleshooting below.
Troubleshooting
Problem: No datapoints visible in Metric Explorer – Confirm you are in the correct region. – Confirm the compartment is correct. – Expand time range to last 1 hour. – Re-check dimensions: if your Explorer filter doesn’t match posted dimensions, you won’t see the series.
Problem: NotAuthorizedOrNotFound from CLI
– Check that your OCI CLI profile is pointing to the correct tenancy/user/region in ~/.oci/config.
– Confirm your user/group has the required IAM policies for metrics posting and alarm management.
Problem: Alarm never fires – Ensure the alarm’s metric query filters match your metric’s dimensions. – Confirm the alarm uses an aggregation and interval that will catch your datapoint (for example, if you used a longer window, wait longer). – Post multiple datapoints to ensure the evaluation window contains values.
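A single datapoint can fall outside the evaluation window. One way to make the test deterministic is to post several datapoints spanning the last few minutes. The payload builder below sketches this (it assumes GNU date for the -d offsets and only writes the file; post it with the same oci monitoring metric-data post command as before):

```shell
COMPARTMENT_OCID="${COMPARTMENT_OCID:-ocid1.compartment.oc1..exampleuniqueID}"

# Build datapoints at now, -60s, and -120s (GNU date syntax)
points=""
for offset in 120 60 0; do
  ts="$(date -u -d "-${offset} seconds" +"%Y-%m-%dT%H:%M:%SZ")"
  points="${points}{\"timestamp\": \"${ts}\", \"value\": 15},"
done
points="${points%,}"   # drop the trailing comma

cat > metric_data.json <<EOF
[
  {
    "namespace": "lab_metrics",
    "compartmentId": "${COMPARTMENT_OCID}",
    "name": "orders_processed",
    "dimensions": { "app": "demo-store", "environment": "lab" },
    "datapoints": [ ${points} ]
  }
]
EOF

grep -o '"timestamp"' metric_data.json | wc -l
```

Three datapoints a minute apart give the alarm's evaluation window a sustained condition to detect.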
Problem: Email never arrives – Confirm the subscription is Confirmed. – Check spam/junk. – Verify the alarm destination topic is correct. – If your organization blocks automated emails, consider an HTTPS subscription endpoint instead (verify Notifications options).
Cleanup
To avoid ongoing cost and to keep your tenancy tidy, delete lab resources.
Console cleanup
1. Delete the alarm:
– Monitoring → Alarms → select lab-orders-processed-alarm → Delete
2. Delete the Notifications subscription and topic:
– Notifications → open lab-monitoring-topic
– Delete the email subscription
– Delete the topic
3. If you created a compartment for the lab:
– Ensure it has no remaining resources
– Delete the compartment (it will move to a “Deleted” state after resources are removed)
Local cleanup
– Remove metric_data.json if desired.
11. Best Practices
Architecture best practices
- Start with service metrics before adding custom metrics.
- Design a metric taxonomy:
  - Namespaces by domain (payments, orders, platform)
  - Metric names that are stable and consistent (request_count, error_count, latency_ms)
  - Dimensions that support filtering without high cardinality (environment, service, region)
- Use compartments to separate environments and teams: prod, stage, dev, shared-ops
- Standardize alarm patterns:
  - Saturation (CPU, memory), errors (5xx), latency, availability, and “silence” heartbeats
IAM/security best practices
- Apply least privilege:
- Separate roles: viewers (read metrics), operators (manage alarms), publishers (post custom metrics)
- Avoid using long-lived user API keys in apps:
- Prefer Instance Principals or Resource Principals for OCI-native workloads (verify supported auth for your architecture).
- Restrict who can change alarm destinations to avoid misrouting alerts.
Cost best practices
- Control custom metric volume:
- Publish at 1-minute intervals unless you truly need higher frequency.
- Aggregate before publishing (send sums/averages, not per-request metrics).
- Keep dimensions low cardinality:
  - Do not use requestId, sessionId, or userId as dimensions.
- Use tags on alarms and topics to enable cost allocation and ownership tracking.
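The arithmetic behind this guidance is simple: the number of unique time series per metric name is the product of each dimension's distinct-value count. A toy estimate (the counts below are hypothetical):

```shell
# Unique time series per metric name = product of per-dimension value counts
environments=3
services=12
regions=2
series=$((environments * services * regions))
echo "estimated time series per metric name: $series"
# Adding a userId dimension with 50,000 distinct values would
# multiply this figure by 50,000 — the classic cardinality explosion.
```

Running this kind of estimate before adding a new dimension makes the cost and limit impact concrete.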
Performance best practices
- Use dimension filters in queries to avoid pulling broad datasets.
- Avoid creating alarms that scan huge sets of time series unless necessary.
- Prefer a small number of well-designed alarms over many noisy alarms.
Reliability best practices
- Use multi-channel alerting for critical alarms:
- Email + webhook to incident management (depending on Notifications options)
- Build runbooks linked from alarm descriptions:
- “What does this alarm mean?”
- “What’s the first check?”
- “What are safe mitigations?”
- Regularly test alarms (game days).
Operations best practices
- Maintain an “alarm hygiene” routine:
- Monthly review of top noisy alarms
- Remove obsolete alarms after architecture changes
- Use naming conventions:
  - env.service.signal.severity (example: prod.orders.5xx_rate.critical)
- Use consistent severity definitions and on-call routing topics per team.
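A convention is only useful if it is checked. The sketch below validates an alarm name against one possible encoding of this pattern; the regex is an assumption, so adjust it to your organization's standard:

```shell
# Check an alarm name against the env.service.signal.severity convention
name="prod.orders.5xx_rate.critical"
if echo "$name" | grep -Eq '^(prod|stage|dev)\.[a-z0-9_-]+\.[a-z0-9_]+\.(critical|warning|info)$'; then
  echo "name ok: $name"
else
  echo "name violates convention: $name"
fi
```

A check like this fits naturally into a CI step for Terraform-managed alarms.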
Governance/tagging/naming best practices
- Tag alarms and topics with: owner, costCenter, environment, application, team
- Enforce standards through IaC (Terraform) and code review.
12. Security Considerations
Identity and access model
- Monitoring access is controlled by OCI IAM policies at tenancy and compartment scope.
- Separate privileges for:
- Reading metrics (engineers, dashboards)
- Managing alarms (ops/SRE)
- Posting custom metrics (applications/automation)
Recommendation
– Use dedicated dynamic groups and principals for workloads posting metrics.
– Do not grant broad manage all-resources unless absolutely necessary.
Encryption
- OCI services generally encrypt data at rest and in transit. For Monitoring-specific encryption guarantees, verify the official Monitoring security documentation and Oracle Cloud security documentation.
Network exposure
- Posting custom metrics uses OCI service endpoints; ensure:
- TLS is used (default)
- Your environment’s egress policy allows access to OCI endpoints
- For HTTPS subscriptions (webhooks), ensure your endpoint:
- Uses TLS
- Requires authentication/verification (to prevent spoofed alerts)
- Has rate limiting (to withstand alert storms)
Secrets handling
- Avoid embedding OCI config files and API keys in containers or repos.
- Use OCI-native identity (instance/resource principals) when possible.
- If you must use API keys, store them in OCI Vault (separate service) and rotate regularly (verify your org’s standard).
Audit/logging
- Use OCI Audit to track changes to:
- IAM policies that grant Monitoring permissions
- Alarm creation/modification/deletion
- Notifications topics and subscriptions
Verify exact audit event coverage in official docs.
Compliance considerations
- Alarms and metrics can contain sensitive context if you encode it in dimensions (for example, customer identifiers).
- Treat custom metric payload design as a data classification issue:
- Do not put PII into metric dimensions or names.
- Use anonymized or aggregated identifiers.
Common security mistakes
- Posting high-cardinality identifiers (PII) as dimensions.
- Granting developers manage alarms in production without controls.
- Using a shared email topic for all severities (leaks incident details broadly).
- Not authenticating webhook subscribers.
Secure deployment recommendations
- Compartmentalize prod monitoring resources (alarms/topics) and restrict modifications.
- Use separate topics per:
- Severity (critical vs warning)
- Team ownership (payments vs platform)
- Validate webhook endpoints and log delivery outcomes.
13. Limitations and Gotchas
Exact numeric limits can change. Always check OCI Service Limits for Monitoring and Notifications in your region/tenancy.
Known limitation categories
- Regional scope
- Metrics and alarms are regional; multi-region monitoring requires per-region configuration and aggregation outside Monitoring if needed.
- Service metric availability
- Not all services emit all desired metrics; some require enabling agent plugins or service-specific options.
- Custom metric cardinality
- Too many dimension combinations can:
- hit service limits
- increase ingestion cost
- make queries slow or confusing
- Alarm noise
- A poorly designed alarm (too sensitive, no delay, no aggregation) will flap and create alert fatigue.
- Email confirmation requirement
- Notifications email subscriptions require confirmation; missing confirmation causes “silent” non-delivery.
- IAM complexity
- Cross-compartment visibility is not automatic; missing policies are a frequent cause of “no metrics found” confusion.
- Time window mismatches
- Alarm evaluation windows and metric publishing intervals must align; single datapoints might not trigger if evaluation expects sustained conditions.
- Dimension mismatches
- A common gotcha: your alarm filters don’t match posted dimensions exactly (case/typos), resulting in “no data.”
Pricing surprises
- Custom metrics can be inexpensive at small scale but can grow rapidly with:
- per-pod/per-container dimensions
- second-level publishing
- many environments and microservices
Compatibility issues
- If you rely on agents/plugins for OS-level metrics, compatibility depends on:
- OS version
- agent version
- network egress and permissions
Verify in official docs for the specific agent/plugin involved.
Migration challenges (from other tools)
- Metric naming and query semantics differ from Prometheus/AWS CloudWatch/Azure Monitor.
- Alarm threshold semantics and aggregation windows may need redesign, not just a “lift and shift.”
14. Comparison with Alternatives
In Oracle Cloud (nearest services)
- Logging: event/log record collection; not a metrics system. Great for root cause after alarms.
- Logging Analytics: advanced log analytics and correlation; complements Monitoring.
- APM: application tracing, spans, transactions; complements Monitoring for deep app performance.
- Health Checks: external availability probing (separate service); complements Monitoring.
- Dashboards: visualization layer; complements Monitoring.
Other clouds (nearest equivalents)
- AWS: CloudWatch (metrics/alarms/logs), with different pricing and query patterns.
- Azure: Azure Monitor (metrics/logs/alerts).
- GCP: Cloud Monitoring (metrics/alerting), integrated with Cloud Logging.
Open-source/self-managed alternatives
- Prometheus + Alertmanager + Grafana
- VictoriaMetrics / Thanos / Cortex for scalable metrics backends
- OpenTelemetry metrics pipelines (then choose backend)
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| OCI Monitoring | OCI-native metrics + alarms | Integrated with OCI services/IAM; managed; service metrics out of the box; custom metrics supported | Regional scope; feature set focused on OCI; custom metric costs/limits must be managed | You run primarily on OCI and want first-party monitoring and alerting |
| OCI Logging | Log collection and troubleshooting | Great for forensic analysis; structured/unstructured logs | Not a metrics platform; alerting differs | Use with Monitoring to investigate alarm triggers |
| OCI Logging Analytics | Advanced log analytics | Powerful search/correlation on logs | Additional setup/cost; not a pure metrics replacement | You need deep log insights alongside Monitoring |
| OCI APM | App tracing and deep performance | End-to-end tracing, app-level visibility | Requires instrumentation/agents; separate pricing | You need tracing and app performance diagnostics beyond metrics |
| AWS CloudWatch | AWS workloads | Mature ecosystem; integrated metrics/logs | Different semantics/pricing; not OCI-native | You are primarily on AWS |
| Azure Monitor | Azure workloads | Broad monitoring suite | Not OCI-native | You are primarily on Azure |
| GCP Cloud Monitoring | GCP workloads | Strong managed monitoring | Not OCI-native | You are primarily on GCP |
| Prometheus + Grafana | Cloud-neutral Kubernetes and custom monitoring | Industry-standard; flexible queries (PromQL); portable | You operate and scale it; storage/HA burden; integration effort | You need portability, Kubernetes-native metrics, or custom control of retention and dashboards |
15. Real-World Example
Enterprise example: regulated financial services platform
Problem A financial services company runs customer-facing APIs on OCI across multiple compartments (prod, staging, shared). They need strong operational control, least privilege access, and reliable incident routing with auditability.
Proposed architecture
– OCI Monitoring for service metrics:
– Load balancer health and error rates
– Compute resource saturation
– Database service metrics
– Custom metrics:
– transaction_success_rate
– authorization_latency_ms
– queue_backlog
– Alarms per tier:
– Critical alarms route to a prod-critical Notifications topic
– Warning alarms route to prod-warning
– Notifications:
– Email distribution lists for on-call
– HTTPS webhook to incident management system
– Governance:
– IAM policies restrict alarm modification to SRE group
– Alarms managed via Terraform with code review
– Tags enforce ownership and cost attribution
Why Monitoring was chosen – OCI-native metrics and IAM integration fits regulated environments. – Service metrics reduce operational overhead. – Custom metrics enable business-level alerting without a separate metrics stack.
Expected outcomes – Reduced MTTD with consistent alerting across compartments – Better auditability and change control over monitoring rules – Improved incident response via routing and runbooks linked in alarm metadata
Startup/small-team example: SaaS MVP on OCI
Problem A small team runs a single-region SaaS app on OCI with a small VM pool and a managed database. They need basic alerting without operating Prometheus.
Proposed architecture
– OCI Monitoring service metrics:
– VM CPU utilization
– Load balancer backend health
– Database CPU/storage metrics
– Minimal custom metrics:
– signup_count
– job_failures
– A few alarms routed to one Notifications topic with email subscriptions to founders/on-call.
Why Monitoring was chosen – Low operational burden, quick setup. – Sufficient for MVP operational needs. – Scales as they add more services and compartments.
Expected outcomes – Basic reliability guardrails without additional infrastructure – Faster debugging when performance issues happen – Controlled costs by limiting custom metrics
16. FAQ
1) Is Oracle Cloud Monitoring the same as Logging?
No. Monitoring focuses on metrics (time-series numeric values) and alarms. Logging captures log events (text/structured records). They complement each other.
2) Is Monitoring a regional service in OCI?
Yes, Monitoring is typically regional. Metrics and alarms are tied to the region. For multi-region architectures, plan alarms per region and centralize routing downstream if needed.
3) What are service metrics vs custom metrics?
Service metrics are emitted by OCI services automatically. Custom metrics are datapoints you publish to Monitoring for your apps/systems.
4) Do I need an agent to use Monitoring?
For many OCI services, no—service metrics are automatic. For OS-level metrics (like memory) you may need an OCI agent/plugin depending on the service and OS. Verify in official docs for your target metric.
5) How do alarms send notifications?
Alarms usually publish to an OCI Notifications topic. Subscriptions on the topic deliver messages to email/HTTPS/etc.
6) Can I create alarms with Terraform?
Yes, typically you can manage alarms and topics as code using OCI Terraform provider resources. Verify current provider documentation for the exact resource names and arguments.
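As a sketch, this lab's alarm and topic could be expressed roughly as follows. The resource and argument names come from the OCI Terraform provider but should be verified against its current documentation:

```hcl
resource "oci_ons_notification_topic" "lab" {
  compartment_id = var.compartment_ocid
  name           = "lab-monitoring-topic"
}

resource "oci_monitoring_alarm" "orders" {
  compartment_id        = var.compartment_ocid
  metric_compartment_id = var.compartment_ocid
  display_name          = "lab-orders-processed-alarm"
  namespace             = "lab_metrics"
  query                 = "orders_processed[1m]{app = \"demo-store\", environment = \"lab\"}.max() >= 10"
  severity              = "CRITICAL"
  destinations          = [oci_ons_notification_topic.lab.topic_id]
  is_enabled            = true
}
```

Managing alarms as code gives you review, history, and consistent rollout across compartments.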
7) What’s the biggest cost risk in Monitoring?
High-volume/high-cardinality custom metrics. Avoid dimensions that create many unique time series.
8) Can I monitor Kubernetes (OKE) with OCI Monitoring?
You can monitor OCI service metrics around OKE and related infrastructure. For detailed pod/container metrics, many teams use Prometheus-based tooling. Verify current OCI OKE observability guidance.
9) How do I avoid alert fatigue?
Use aggregation windows, delays, and clear thresholds; scope alarms with dimensions; classify severity; review noisy alarms regularly.
10) Can I trigger automation from an alarm?
Indirectly, yes—alarms publish to Notifications. Notifications can deliver to endpoints like HTTPS or Functions (depending on Notifications features). Use this to trigger auto-remediation carefully.
11) How do I design custom metrics for business KPIs?
Publish aggregated counts/rates/latency percentiles (if you compute them upstream), use stable metric names, and include low-cardinality dimensions like service, environment.
12) Why do I see “No data” for an alarm?
Common causes: wrong region/compartment, wrong dimension filters, publishing interval too sparse for evaluation window, or metric not emitted as expected.
13) Can multiple teams share the same Monitoring setup?
Yes—use compartments and IAM policies to isolate. Share only common topics/routing if desired.
14) How quickly do alarms detect issues?
Depends on metric emission frequency and alarm evaluation settings (window, interval). Choose settings that balance speed and noise. Verify exact evaluation behavior in official docs.
15) Can I export metrics to external systems?
Yes, you can query via API/CLI/SDK and forward to external systems. Consider API rate limits and data transfer costs.
16) Is there a built-in dashboard for all metrics?
You can explore metrics in Metric Explorer; for curated dashboards, use OCI’s dashboard capabilities or external tools. Verify your tenancy’s current dashboard options.
17) What’s the difference between Monitoring and APM?
Monitoring is metrics and alarms for infrastructure and custom numeric signals. APM adds tracing and application performance diagnostics (transactions, spans), typically requiring instrumentation.
17. Top Online Resources to Learn Monitoring
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | OCI Monitoring documentation | Primary reference for metrics, namespaces, alarms, and APIs: https://docs.oracle.com/en-us/iaas/Content/Monitoring/home.htm |
| Official documentation | OCI Notifications documentation | Required for alarm delivery via topics/subscriptions: https://docs.oracle.com/en-us/iaas/Content/Notification/home.htm |
| Official documentation | OCI CLI concepts and setup | Install/configure CLI used in labs and automation: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cliconcepts.htm |
| Official pricing | OCI price list (Observability and Management) | Official pricing dimensions for Monitoring/related services: https://www.oracle.com/cloud/price-list/#observability-and-management |
| Official tool | OCI Cost Estimator | Model regional cost impacts: https://www.oracle.com/cloud/costestimator.html |
| Official free tier | Oracle Cloud Free Tier | Understand Always Free and trial allowances: https://www.oracle.com/cloud/free/ |
| Architecture guidance | Oracle Architecture Center | Reference architectures and operational patterns (search for observability): https://www.oracle.com/cloud/architecture-center/ |
| Tutorials/labs | Oracle LiveLabs | Hands-on labs for OCI services including observability topics: https://livelabs.oracle.com/ |
| Official GitHub | OCI CLI repository | Source, releases, and examples for CLI: https://github.com/oracle/oci-cli |
| Official GitHub | OCI SDKs | Programmatic access samples and SDKs: https://github.com/oracle/oci-python-sdk (and related org repos) |
| Community (reputable) | Oracle Cloud blogs and solution playbooks | Practical patterns and updates (validate against docs): https://blogs.oracle.com/cloud-infrastructure/ |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, SREs, platform teams | DevOps practices, cloud operations, monitoring/observability fundamentals | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate engineers | DevOps, CI/CD, operations tooling foundations | Check website | https://www.scmgalaxy.com/ |
| CloudOpsNow.in | Cloud engineers, operations teams | Cloud operations and monitoring practices | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, reliability engineers | SRE principles, alerting, SLOs, incident management | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops/SRE, automation-focused teams | AIOps concepts, event correlation, automation | Check website | https://www.aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/cloud coaching and consulting-style training resources (verify offerings) | Engineers seeking guided learning | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training programs (verify current courses) | Beginners to intermediate DevOps learners | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps services and potentially training/support resources (verify scope) | Teams needing hands-on help | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support and training resources (verify offerings) | Ops teams needing troubleshooting help | https://www.devopssupport.in/ |
20. Top Consulting Companies
| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify current portfolio) | Observability setup, automation, cloud operations | Alarm strategy design, custom metrics pipeline design, IaC for alarms/topics | https://cotocus.com/ |
| DevOpsSchool.com | DevOps consulting and enablement | Training + implementation support for DevOps/observability | Monitoring baseline implementation, on-call readiness, CI/CD integration for alarm-as-code | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting services (verify current offerings) | Implementation and support | Setting up alert routing, building runbooks, governance and IAM reviews | https://www.devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Monitoring
- OCI fundamentals:
- Regions, availability domains, compartments
- OCI IAM basics (groups, policies, dynamic groups)
- Core services: Compute, Networking (VCN), Load Balancer
- Observability fundamentals:
- Metrics vs logs vs traces
- Basic SRE concepts: SLIs/SLOs, alert fatigue, incident lifecycle
- CLI basics:
- Authentication, profiles, regions, compartments
What to learn after Monitoring
- OCI Logging and Logging Analytics for root cause analysis.
- OCI APM for tracing and application diagnostics (if you own app performance).
- Automation:
- Terraform modules for alarms and notification topics
- Functions-based remediation workflows
- Reliability engineering:
- SLO-based alerting and error budgets
- Capacity planning and performance testing
Job roles that use Monitoring
- Cloud Engineer (OCI)
- DevOps Engineer
- Site Reliability Engineer (SRE)
- Platform Engineer
- Cloud Solutions Architect
- Operations/NOC Engineer
- Security Engineer (for availability and abuse signals)
Certification path (if available)
Oracle certification offerings change. Look for OCI certifications that cover:
– OCI Foundations (baseline)
– Architect or DevOps-focused OCI certifications
Verify current certification tracks on Oracle University / Oracle Certification pages (official): https://education.oracle.com/
Project ideas for practice
- Golden alarms module – Build Terraform module that creates standard alarms (CPU, LB health, DB storage).
- Custom KPI monitoring – Publish 5 KPIs from a demo app (requests, errors, latency, queue depth, throughput).
- Alarm routing by severity – Two topics (critical/warning), subscriptions to different teams.
- Game day – Intentionally trigger CPU saturation or simulated error rate and validate notifications/runbooks.
- Cost guardrails – Implement checks to prevent high-cardinality dimensions in custom metrics.
22. Glossary
- Alarm: A rule that evaluates a metric query and triggers notifications when conditions are met.
- Aggregation: A method to combine datapoints over time (for example, mean/max/sum) for evaluation and charting.
- Compartment: An OCI organizational boundary for resources and IAM policies.
- Custom metric: A metric you publish to OCI Monitoring via API/CLI/SDK.
- Datapoint: A single metric value at a timestamp.
- Dimension: A key/value attribute that describes a time series and enables filtering (for example, resourceId, app, environment).
- Metric: A named time-series signal, typically numeric, representing a system or application measurement.
- Metric Explorer: OCI Console UI for browsing and charting metrics.
- Namespace: A container for related metrics (service namespace or custom namespace).
- Notifications topic: A message channel in OCI Notifications; publishers send messages to a topic, and subscribers receive them.
- Subscription: A delivery endpoint (email/HTTPS/etc.) attached to a Notifications topic.
- SLO (Service Level Objective): A reliability target (for example, 99.9% availability).
- SLI (Service Level Indicator): A measurement that feeds an SLO (for example, success rate).
- Telemetry: A general term for metrics data; used in some OCI API naming around Monitoring.
23. Summary
Oracle Cloud Monitoring is OCI’s managed metrics and alarms service in the Observability and Management category. It collects service metrics from OCI resources, accepts custom metrics from your applications, and evaluates alarms that can notify teams through OCI Notifications.
It matters because it forms the operational backbone for detecting incidents early, reducing downtime, and turning system behavior into actionable alerts. Cost and scale considerations center on custom metrics volume and cardinality, while security hinges on least-privilege IAM, compartment design, and safe notification endpoints.
Use Monitoring when you want OCI-native, IAM-integrated metrics and alerting with minimal operational overhead. For deeper troubleshooting and correlation, pair it with Logging, Logging Analytics, and APM as appropriate.
Next step: implement a “golden signals” alarm baseline (latency, traffic, errors, saturation) in Terraform and roll it out across your compartments with consistent tagging and routing.