Category
Data Management
1. Introduction
Oracle Cloud Data Catalog is Oracle Cloud Infrastructure’s managed service for discovering, organizing, and governing metadata about the data your organization stores across databases, data lakes, and analytics platforms.
In simple terms: Data Catalog helps you answer “What data do we have, where is it, who owns it, and how should it be used?”—without moving or copying the underlying data.
Technically, Data Catalog is a metadata management and data discovery service. You create a catalog, register data sources (called data assets), run harvest jobs to extract technical metadata (schemas, tables, files, columns, etc.), and enrich that metadata with business context such as glossary terms, tags, and custom properties. Consumers then use search and browsing to find trusted datasets faster.
It solves common data-management problems such as:
- Lack of visibility into what data exists across teams and clouds
- Inconsistent definitions (e.g., “customer”, “revenue”, “active user”)
- Difficulty finding the right dataset and its owner/steward
- Governance needs for audits and compliance (knowing what exists, where, and how it’s classified)
Service name check: Oracle documents this service as Oracle Cloud Infrastructure (OCI) Data Catalog; this tutorial refers to it as Data Catalog. If Oracle renames any UI labels or endpoints in your region, verify them in the official docs.
2. What is Data Catalog?
Official purpose
In Oracle Cloud’s Data Management portfolio, Data Catalog is intended to provide a centralized place to:
- Collect technical metadata from supported data sources
- Organize and curate that metadata for discoverability
- Add business context using glossary, tags, and properties
- Support governance by making ownership and definitions explicit
Core capabilities (what it does)
Data Catalog typically supports the following capability areas (exact source coverage depends on your region and connectors; verify supported data assets in official docs):
- Metadata harvesting from registered data assets
- Search and discovery across harvested entities (tables, views, files, columns, etc.)
- Business glossary for definitions and standard terminology
- Curation and enrichment via tags, custom properties, and relationships
- Access control using Oracle Cloud IAM and compartments
Major components (mental model)
- Catalog: The top-level container for metadata. Created in a specific Oracle Cloud region and compartment.
- Data asset: A registered data source (for example, Object Storage, Autonomous Database, or other supported sources). Think of it as “this is where metadata can be harvested from.”
- Connection / credential: How Data Catalog authenticates to the data asset (varies by source type; may use IAM/service access for OCI-native services or credentials for databases).
- Harvest: A job (manual or scheduled) that extracts metadata from a data asset into the catalog.
- Entities: The harvested objects (schemas, tables, columns, files, etc.) represented in the catalog.
- Glossary / terms: Business definitions linked to harvested entities to clarify meaning and intended use.
- Tags and custom properties: Lightweight governance controls (classification, sensitivity, owner, SLA tier, domain, etc.)
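The component list above can be pictured as a small in-memory model. This is only a sketch to make the relationships concrete: the class and field names are illustrative assumptions, not the service's actual resource schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """A column or field harvested from a source."""
    name: str
    data_type: str
    tags: set[str] = field(default_factory=set)


@dataclass
class Entity:
    """A harvested object such as a table or a file."""
    name: str
    attributes: list[Attribute] = field(default_factory=list)
    tags: set[str] = field(default_factory=set)
    glossary_terms: set[str] = field(default_factory=set)
    custom_properties: dict[str, str] = field(default_factory=dict)


@dataclass
class DataAsset:
    """A registered data source that metadata is harvested from."""
    name: str
    asset_type: str
    entities: list[Entity] = field(default_factory=list)


@dataclass
class Catalog:
    """Top-level container for metadata, tied to one region and compartment."""
    name: str
    region: str
    compartment: str
    data_assets: list[DataAsset] = field(default_factory=list)
    glossary: dict[str, str] = field(default_factory=dict)  # term -> definition

    def search(self, keyword: str) -> list[Entity]:
        """Return entities whose name, tags, or linked terms contain the keyword."""
        needle = keyword.lower()
        return [
            entity
            for asset in self.data_assets
            for entity in asset.entities
            if needle in entity.name.lower()
            or any(needle in label.lower() for label in entity.tags | entity.glossary_terms)
        ]


# Illustrative usage with the names used later in the lab.
catalog = Catalog(name="lab-catalog", region="us-ashburn-1", compartment="lab-datacatalog")
asset = DataAsset(name="lab-os-asset", asset_type="Object Storage")
customers = Entity(
    name="customers.csv",
    attributes=[Attribute("customer_id", "NUMBER"), Attribute("email", "STRING")],
)
customers.tags.add("Sensitivity:Internal")
catalog.glossary["Customer"] = "A person or organization that has signed up for our service."
customers.glossary_terms.add("Customer")
asset.entities.append(customers)
catalog.data_assets.append(asset)
```

The point of the sketch is the containment order (catalog, data asset, entity, attribute) and that tags, glossary terms, and custom properties hang off harvested entities rather than off the underlying data.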
Service type
- Managed Oracle Cloud service (control plane managed by Oracle)
- Metadata system (stores metadata and governance context, not the underlying data)
Scope: regional vs global
Data Catalog is created in a specific Oracle Cloud region and a compartment within your tenancy. You can catalog sources across compartments if IAM policies allow it. Cross-region cataloging patterns exist, but the catalog itself is regional; plan accordingly and verify current cross-region support in official docs.
How it fits into the Oracle Cloud ecosystem
Data Catalog sits at the center of a typical Oracle Cloud Data Management and analytics environment:
- Data producers store data in Object Storage, Autonomous Database, and other platforms.
- Data engineers transform data using services such as OCI Data Integration, OCI Data Flow, and other processing engines.
- Data Catalog provides the “system of record” for metadata, helping analysts and engineers find and interpret datasets.
- Security and governance rely on OCI IAM, Audit, and tagging strategies.
3. Why use Data Catalog?
Business reasons
- Faster time-to-data: Teams spend less time searching and re-creating datasets.
- Better decision-making: Shared definitions reduce reporting conflicts.
- Reduced risk: Easier to identify sensitive data locations for compliance initiatives.
- Increased reuse: Analysts find trusted datasets instead of building shadow copies.
Technical reasons
- Central metadata index for multiple sources
- Searchable inventory of tables/files/columns and their attributes
- Standardization via glossary and curated metadata
- Extensibility through tags and custom properties
Operational reasons
- Repeatable harvesting (manual/scheduled) to keep metadata current
- Ownership and stewardship captured alongside metadata
- Better handoffs between engineering, analytics, and governance teams
Security/compliance reasons
- Supports governance patterns like:
- “Know where PII might exist”
- “Who owns this dataset?”
- “What’s the approved definition of a metric?”
- Integrates with IAM for access control and with auditing capabilities in Oracle Cloud.
Scalability/performance reasons
Data Catalog is designed to scale in metadata volume and user access patterns typical of medium-to-large enterprises. The underlying data stays in place; you manage metadata, which is far lighter than copying datasets.
When teams should choose Data Catalog
Choose Data Catalog when:
- You have multiple data sources and need a single discovery experience
- You need a business glossary tied to real datasets
- You want to operationalize data governance without building a custom metadata system
- You want an Oracle-managed metadata catalog integrated with Oracle Cloud IAM
When teams should not choose it
Data Catalog may not be the right fit if:
- You only have one small data store and discovery is trivial
- You need a full data-quality rules engine or master data management (different tool category)
- You require capabilities not currently supported by Data Catalog connectors in your region (verify first)
- You want a fully open-source/self-managed solution with deep customization and are willing to operate it
4. Where is Data Catalog used?
Industries
- Financial services (regulatory reporting, audit readiness)
- Healthcare/life sciences (data sensitivity classification)
- Retail/e-commerce (product/customer analytics definitions)
- Telecom (large-scale data platforms with many producers)
- Government/public sector (data inventories and stewardship)
- SaaS companies (internal analytics governance)
Team types
- Data platform teams
- Data engineering and ETL teams
- Analytics engineering teams
- BI teams and data analysts
- Security and compliance teams
- Enterprise architecture and governance teams
Workloads
- Data lake discovery (Object Storage)
- Data warehouse cataloging (Autonomous Data Warehouse and other supported DBs)
- Cross-domain metrics standardization (glossary-driven analytics)
- Migration governance (inventory before moving data)
- Audit response (identify datasets and owners)
Architectures
- Central lakehouse with multiple pipelines
- Multi-compartment data mesh-like layouts (domain-based compartments)
- Hybrid environments (OCI plus external sources where supported; verify connector coverage)
Real-world deployment contexts
- Production: catalog is used by analysts and governance daily; harvesting is scheduled and monitored.
- Dev/test: used to validate metadata extraction and glossary structure before scaling.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Oracle Cloud Data Catalog commonly fits.
1) Data lake discovery for Object Storage
- Problem: Hundreds of buckets and folders; nobody knows what’s inside.
- Why it fits: Data Catalog can harvest and index metadata for supported Object Storage structures (verify exact capabilities for file formats and depth).
- Scenario: A data platform team catalogs curated datasets in Object Storage so analysts can search for “orders” and find the canonical dataset plus owner.
2) Cataloging a data warehouse for self-service analytics
- Problem: Analysts can query the warehouse but don’t know table meanings.
- Why it fits: Harvest tables/columns and enrich them with glossary terms and curated descriptions.
- Scenario: Finance defines “Net Revenue” as a glossary term and links it to the correct column(s) in the warehouse.
3) Standardizing business definitions across departments
- Problem: “Customer” means different things in Sales vs Support.
- Why it fits: Glossary provides a shared vocabulary with stewarded definitions.
- Scenario: Governance team defines “Customer (Bill-to)” and “Customer (User)” as separate terms and maps datasets accordingly.
4) Ownership and stewardship mapping (operational governance)
- Problem: No one knows who to contact about a dataset.
- Why it fits: Use custom properties/tags to record owner, steward, support channel, SLA tier.
- Scenario: Every curated dataset includes Owner, Steward, SlackChannel, and RefreshFrequency.
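A convention like this is easier to keep when it is checked automatically. The sketch below assumes the custom properties of a dataset have already been exported into a plain dictionary; the function name and the sample values are hypothetical, and how you read properties out of the catalog depends on the UI/API available to you.

```python
# Property names taken from the scenario above.
REQUIRED_PROPERTIES = ("Owner", "Steward", "SlackChannel", "RefreshFrequency")


def missing_properties(properties: dict[str, str]) -> list[str]:
    """Return required property names that are absent or blank."""
    return [
        name
        for name in REQUIRED_PROPERTIES
        if not properties.get(name, "").strip()
    ]


# Hypothetical dataset with one blank and one missing property.
dataset_properties = {
    "Owner": "data-platform",
    "Steward": "",
    "RefreshFrequency": "daily",
}
gaps = missing_properties(dataset_properties)
print(gaps)  # ['Steward', 'SlackChannel']
```

Treating blank values as missing keeps placeholder entries from passing the check.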
5) Sensitive data discovery support (classification workflow)
- Problem: Compliance asks where PII exists; teams respond manually.
- Why it fits: Tag entities/attributes with classifications; create views of sensitive datasets.
- Scenario: A quarterly review exports a list of entities tagged as PII for follow-up controls (actual export/reporting methods depend on UI/API; verify).
6) Pre-migration inventory and rationalization
- Problem: Before migrating to OCI, you need an inventory of sources and schemas.
- Why it fits: Data Catalog becomes a landing place for harvested metadata, highlighting duplicates and unused datasets.
- Scenario: During warehouse modernization, teams catalog legacy schemas, then mark deprecated datasets with tags.
7) Data product catalog for a platform team (data mesh-ish)
- Problem: Domain teams publish data products but discovery is fragmented.
- Why it fits: Central catalog with domain-based tags and glossary.
- Scenario: Marketing and Supply Chain publish certified datasets; Data Catalog becomes the discovery portal.
8) Faster onboarding for new engineers and analysts
- Problem: New hires take weeks to learn data landscape.
- Why it fits: Search, browse, and glossary shorten ramp-up time.
- Scenario: A new analyst searches “returns” and quickly finds the curated returns dataset and its definition.
9) Pipeline change impact analysis (metadata-based)
- Problem: Schema changes break dashboards; teams don’t see dependencies.
- Why it fits: Metadata and relationships can help document dependencies; if lineage integrations are available in your setup, it’s even stronger (verify lineage support/integration).
- Scenario: Data engineers annotate downstream consumers in custom properties and use consistent tags for impacted domains.
10) Audit response and evidence collection
- Problem: Auditors ask for data inventory, ownership, and definitions.
- Why it fits: Catalog provides centralized metadata, ownership, and governance artifacts.
- Scenario: Security exports a list of datasets tagged Confidential and shows steward approvals recorded in process (process tooling is external; catalog supports the metadata).
11) Shared KPI metric governance for BI
- Problem: Multiple dashboards calculate metrics differently.
- Why it fits: Glossary defines metrics and points to canonical datasets/columns.
- Scenario: “Active Subscriber” is defined once, used across BI reports.
12) Cross-team dataset certification
- Problem: Users can’t tell trusted datasets from experimental ones.
- Why it fits: Tag datasets as Certified, Bronze/Silver/Gold, or Trusted.
- Scenario: Platform team certifies “Gold” tables after validation; analysts filter search to only certified assets.
6. Core Features
Feature availability can vary by region, permissions, and connector type. Confirm exact UI labels and supported source types in the official documentation.
1) Catalogs (metadata containers)
- What it does: Provides a top-level container to store metadata, glossary, tags, and enrichment.
- Why it matters: Separates environments or domains (e.g., “Prod Catalog” vs “Sandbox Catalog”).
- Practical benefit: Cleaner governance boundaries and access control.
- Caveats: Catalog is regional; plan for multi-region architectures.
2) Data assets (source registration)
- What it does: Registers a data source for harvesting.
- Why it matters: Establishes the “where” for metadata.
- Practical benefit: Standardized onboarding process for new sources.
- Caveats: Each asset type has distinct connection requirements.
3) Harvesting (metadata extraction jobs)
- What it does: Extracts and updates technical metadata from a data asset into the catalog.
- Why it matters: Keeps metadata current as schemas/files evolve.
- Practical benefit: Repeatable scheduled refresh reduces manual documentation.
- Caveats: Requires correct IAM/credentials and network access; harvesting can fail if policies are missing.
4) Search and browse
- What it does: Lets users find entities using keywords, filters, and navigation.
- Why it matters: Discovery is the core value of a catalog.
- Practical benefit: Reduces tribal knowledge dependency.
- Caveats: Search quality depends on metadata quality; add descriptions, glossary terms, tags.
5) Business glossary
- What it does: Stores business terms, definitions, and associations to technical assets.
- Why it matters: Aligns teams on consistent definitions.
- Practical benefit: BI and analytics become more reliable.
- Caveats: Glossary governance is a people/process challenge; needs steward ownership.
6) Tags (classification and organization)
- What it does: Apply labels to assets/entities/attributes.
- Why it matters: Enables filtering, governance, and lifecycle management.
- Practical benefit: Common tags: PII, Confidential, Certified, Domain:Marketing.
- Caveats: Without naming conventions, tags become messy and duplicated.
7) Custom properties (metadata enrichment)
- What it does: Adds organization-specific fields (owner, SLA, refresh frequency, cost center).
- Why it matters: Most governance needs are organization-specific.
- Practical benefit: Convert tribal knowledge into structured metadata.
- Caveats: Over-customization can reduce usability; keep a controlled list.
8) IAM integration (access control)
- What it does: Uses Oracle Cloud IAM policies and compartments to control who can manage catalogs, assets, harvest, and metadata.
- Why it matters: Governance requires role-based access.
- Practical benefit: Separate duties between admins, stewards, and consumers.
- Caveats: Harvesting access to source systems often requires additional policies/credentials.
9) Auditability (via Oracle Cloud auditing capabilities)
- What it does: Administrative actions can be audited via OCI Audit (exact event coverage: verify).
- Why it matters: Compliance needs traceability.
- Practical benefit: Investigate who changed glossary definitions or asset registrations.
- Caveats: You must enable and retain logs per policy and compliance requirements.
10) API/SDK/CLI support (automation)
- What it does: Enables automation of catalog lifecycle, asset creation, harvesting, and metadata operations via APIs (verify the set of operations you need).
- Why it matters: Scales onboarding and governance workflows.
- Practical benefit: “Catalog as code” patterns for enterprise consistency.
- Caveats: IAM and rate limits apply; build idempotent automation.
7. Architecture and How It Works
High-level architecture
Data Catalog sits between:
- Metadata producers (data sources such as Object Storage and databases)
- Metadata consumers (analysts, engineers, governance users)
- Governance controls (IAM, tagging standards, auditing)
Key principle: Data Catalog stores metadata, not the data itself. Harvesting reads source metadata and indexes it in the catalog.
Request/data/control flow (typical)
- An administrator creates a catalog in a compartment and region.
- They register a data asset and configure access (IAM policies and/or credentials).
- They run a harvest job:
  - The service connects to the source
  - Reads technical metadata (schemas, tables, files, columns)
  - Stores metadata objects in the catalog
- Stewards enrich metadata with glossary terms, tags, and custom properties.
- Consumers search/browse to find datasets and interpret them correctly.
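The harvest-then-enrich part of this flow can be mimicked with a toy model. This is a conceptual sketch, not how the service is implemented: the dictionary layout and the `harvest` function are assumptions chosen to show technical metadata being refreshed on each run while steward-added tags stay in place.

```python
def harvest(catalog: dict[str, dict], source_metadata: dict[str, list[str]]) -> dict[str, int]:
    """Upsert technical metadata; leave steward-added tags on existing entities."""
    counts = {"created": 0, "updated": 0}
    for entity_name, columns in source_metadata.items():
        if entity_name in catalog:
            # Refresh only the technical part of an entity that is already known.
            catalog[entity_name]["columns"] = list(columns)
            counts["updated"] += 1
        else:
            catalog[entity_name] = {"columns": list(columns), "tags": set()}
            counts["created"] += 1
    return counts


catalog: dict[str, dict] = {}

# First harvest discovers the entity.
first_run = harvest(catalog, {"customers.csv": ["customer_id", "full_name", "email"]})

# A steward enriches it with a tag.
catalog["customers.csv"]["tags"].add("Domain:Lab")

# A later harvest picks up a new column without dropping the tag.
second_run = harvest(
    catalog, {"customers.csv": ["customer_id", "full_name", "email", "country"]}
)
```

Keeping technical metadata and enrichment in separate fields is what lets repeated harvests run safely alongside manual curation in this model.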
Integrations with related Oracle Cloud services (common patterns)
- Object Storage: catalog data lake buckets and curated datasets.
- Autonomous Database / Autonomous Data Warehouse: catalog tables/views (connector support varies; verify).
- OCI Vault: store database credentials/secrets (pattern depends on connector; verify).
- OCI Events + Notifications: notify teams when harvest jobs fail or complete (pattern depends on available events; verify).
- OCI Logging / Audit: operational traceability and compliance evidence.
- OCI Data Integration / Data Flow: data pipelines; catalog provides metadata context. (Lineage availability depends on integration; verify.)
Dependency services
- OCI IAM: policies, compartments, groups (mandatory)
- Networking (VCN): required when harvesting private data sources (if supported via private endpoints; verify)
- Source services: Object Storage, databases, etc.
Security/authentication model
- User access to Data Catalog is governed by OCI IAM.
- Service access (Data Catalog reading metadata from sources) typically requires:
- OCI-native access policies for OCI resources (Object Storage, etc.)
- Credentials for database sources (stored securely; exact method depends on connector—verify in docs)
- Prefer least privilege: only allow read access required for metadata extraction.
Networking model
- Access to Data Catalog is via Oracle Cloud endpoints in the region.
- Harvesting network path depends on the source:
- For OCI public endpoints (like Object Storage), IAM permission is often the primary gate.
- For private databases, you may need private connectivity (VCN/private endpoint patterns—verify what Data Catalog supports in your region).
Monitoring/logging/governance considerations
- Treat harvesting as an operational workload:
- Schedule harvest windows
- Monitor job outcomes
- Track changes to glossary and tags
- Use IAM and compartments to separate:
- Platform admins
- Data stewards
- Read-only consumers
Simple architecture diagram (Mermaid)
flowchart LR
U[User: Admin/Steward/Analyst] -->|Console/API| DC[Oracle Cloud Data Catalog]
DC -->|Harvest metadata| OS[OCI Object Storage Bucket]
DC --> M[(Metadata Index\nEntities/Attributes/Tags/Glossary)]
U -->|Search/Browse| DC
Production-style architecture diagram (Mermaid)
flowchart TB
  subgraph Tenancy[Oracle Cloud Tenancy]
    subgraph IAM[OCI IAM]
      G[Groups/Roles]
      P[Policies]
    end
    subgraph Region["Region (e.g., us-ashburn-1)"]
      DC["Data Catalog (Regional)"]
      AUD[Audit]
      LOG[Logging/Monitoring]
      EVT[Events/Notifications]
      VAULT["OCI Vault (Secrets)"]
      subgraph DataLake[Data Lake Compartment]
        OS1[Object Storage: Raw Bucket]
        OS2[Object Storage: Curated Bucket]
      end
      subgraph Warehouse[Analytics Compartment]
        ADB[Autonomous Database / ADW]
      end
      subgraph Network["VCN (if needed)"]
        PE["Private Connectivity / Endpoint\n(verify Data Catalog support)"]
      end
    end
  end
  G --> P
  U1[Admins/Stewards/Consumers] -->|IAM AuthZ| DC
  DC -->|Harvest| OS2
  DC -->|"Harvest (if supported)"| ADB
  DC -->|"Read secrets (pattern)"| VAULT
  DC --> AUD
  DC --> LOG
  DC --> EVT
  ADB --- PE
8. Prerequisites
Tenancy and billing
- An active Oracle Cloud tenancy
- Ability to create resources in the chosen region and compartment
- Billing/credits as required by your account (Data Catalog may be metered; verify pricing and free tier eligibility)
Permissions / IAM roles
You need permissions to:
- Create and manage Data Catalog resources in a compartment
- Create and manage Object Storage resources for the lab (bucket + objects)
- Grant Data Catalog (as a service) permission to read metadata from the target source (policy requirements vary)
Because IAM policies are security-critical and can change, use the official doc patterns for:
– Data Catalog administrators
– Data Catalog users
– Service access to Object Storage or databases
Verify in official docs: https://docs.oracle.com/en-us/iaas/data-catalog/home.htm
Tools
- Oracle Cloud Console access (browser)
- Optional:
- OCI CLI (if you want automation): https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm
- SDKs (Python/Java/Go) if integrating with pipelines (verify Data Catalog SDK coverage)
Region availability
- Data Catalog is not necessarily available in every region. Confirm in your region in the Console or official docs/service availability pages.
Quotas/limits
- Catalog count, harvest frequency, and metadata volume may be governed by service limits.
- Check Service Limits in the OCI Console for Data Catalog and related services.
Prerequisite services for this lab
- Object Storage bucket in the same tenancy (and ideally same region)
- A compartment to contain the lab resources
9. Pricing / Cost
Pricing changes over time and can be region-dependent. Do not rely on blog posts for exact numbers.
Current pricing model (how to confirm)
Oracle publishes OCI pricing on the official price list and pricing pages. Confirm Data Catalog pricing here:
- OCI Pricing / Price List: https://www.oracle.com/cloud/price-list/
- OCI Cost Estimator (calculator): https://www.oracle.com/cloud/costestimator.html (if redirected, use the OCI cost estimator from the Oracle Cloud site)
Look for Data Management → Data Catalog in the price list. If the pricing page breaks out billable dimensions (for example, per catalog, per metadata volume, per user, per harvest, etc.), treat that as the source of truth.
Typical pricing dimensions to look for (verify)
Depending on Oracle’s current SKU model, pricing can be based on items such as:
- Number of catalogs or capacity units
- Amount of metadata stored/indexed
- Number of users or requests
- Harvest operations or scheduling frequency
Because these dimensions can change, verify in the official pricing entry for Data Catalog.
Cost drivers (direct and indirect)
Direct or near-direct drivers:
- Number of catalogs (dev/test/prod separation can multiply costs)
- Number of data assets and the metadata volume harvested
- Frequency of harvest jobs (daily vs hourly)
- Number of active users (if user-based pricing applies in your current SKU model)
Indirect drivers:
- Object Storage cost for storing sample/curated datasets (your underlying data)
- Network egress (generally avoid cross-region data access patterns if they cause additional cost)
- Operational overhead: governance workflows and stewardship time
- If private connectivity is required for sources, networking components may have cost
Network/data transfer implications
- Harvesting reads metadata; for OCI-native services in the same region, data transfer charges are typically lower than cross-region or internet egress scenarios.
- If cataloging sources across regions or through complex network paths, validate whether any data transfer fees apply.
How to optimize cost
- Start with one catalog and a small number of assets; expand after standards are proven.
- Harvest only what you need (avoid cataloging every raw bucket if it’s not useful).
- Use tags/properties to identify “curated” vs “raw” datasets and prioritize harvesting curated zones.
- Schedule harvesting at a reasonable cadence (nightly for many warehouses is enough; hourly harvesting can increase cost and operational noise).
- Enforce lifecycle: retire/deprecate obsolete assets rather than leaving them searchable forever.
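To see how cadence changes the operational load, it helps to count harvest runs per month. The sketch below is plain arithmetic with assumed inputs (three data assets, a 30-day month); it says nothing about price, since whether harvest runs are billable must be confirmed in the official pricing entry.

```python
def harvest_runs_per_month(runs_per_day: int, assets: int, days: int = 30) -> int:
    """Total harvest job runs in a month for a given cadence and asset count."""
    return runs_per_day * assets * days


# Assumed example: three data assets.
nightly = harvest_runs_per_month(runs_per_day=1, assets=3)   # 1 * 3 * 30 = 90
hourly = harvest_runs_per_month(runs_per_day=24, assets=3)   # 24 * 3 * 30 = 2160

print(nightly, hourly, hourly // nightly)  # 90 2160 24
```

Moving from nightly to hourly multiplies the number of runs to monitor by 24, whatever the asset count.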
Example low-cost starter estimate (conceptual)
A low-cost starter typically looks like:
- 1 catalog (single region)
- 1–3 data assets (Object Storage curated bucket + one warehouse)
- Harvest run manually during setup, then scheduled nightly
- Limited steward group (2–5 users)
Use the OCI Cost Estimator and the Data Catalog pricing entry to compute your estimate. Do not assume “free” unless the official pricing explicitly states a free tier for your tenancy/region.
Example production cost considerations
In production, the cost shape is driven by:
- Many assets across domains (Marketing, Finance, Ops)
- Higher metadata object counts (tables, columns, partitions, files)
- Frequent harvest schedules and governance workflows
- Potential multi-region requirements (which can imply multiple catalogs)
10. Step-by-Step Hands-On Tutorial
Objective
Create an Oracle Cloud Data Catalog, catalog an Object Storage bucket by harvesting metadata, and enrich one discovered dataset with tags and a glossary term—all using a safe, beginner-friendly workflow.
Lab Overview
You will:
1. Create a compartment and an Object Storage bucket with a small sample dataset.
2. Create a Data Catalog in Oracle Cloud.
3. Configure IAM access so Data Catalog can read Object Storage metadata (policy statements vary; you will validate using official docs).
4. Register the bucket as a data asset and run a harvest job.
5. Search for the harvested dataset and enrich it with tags and glossary.
Expected end state:
- A catalog exists and contains harvested metadata for a bucket/object path.
- You can search and find an entity representing your dataset.
- The entity is tagged and linked to a glossary term.
Step 1: Create a compartment for the lab
- In the Oracle Cloud Console, open the navigation menu.
- Go to Identity & Security → Compartments.
- Click Create Compartment.
- Name it: lab-datacatalog
- (Optional) Description: Hands-on lab for Data Catalog tutorial
- Click Create.
Expected outcome: A new compartment appears and becomes available within seconds (sometimes minutes).
Step 2: Create an Object Storage bucket and upload a sample file
- Go to Storage → Object Storage & Archive Storage → Buckets.
- Ensure you’re in the correct region and compartment (lab-datacatalog).
- Click Create Bucket.
- Bucket name: lab-dc-bucket-<unique-suffix>
- Defaults are usually fine for a lab. Click Create.
Now create a small CSV file locally named customers.csv:
customer_id,full_name,email,country,signup_date
1001,Alice Johnson,alice@example.com,US,2024-01-12
1002,Bob Smith,bob@example.com,GB,2024-02-03
1003,Chandra Patel,chandra@example.com,IN,2024-02-19
Upload it:
1. Open your bucket.
2. Click Upload.
3. Select customers.csv.
4. Click Upload.
Expected outcome: The bucket contains customers.csv.
Verification: You can click the object name and view details (size, last modified).
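If you prefer to generate the sample file from Step 2 with a script rather than typing it by hand, a short sketch using only the Python standard library follows. The helper names are illustrative, and the file is written to a temporary directory here so the example has no side effects in your working directory.

```python
import csv
import tempfile
from pathlib import Path

# Same rows as the sample file shown in this step.
ROWS = [
    ("customer_id", "full_name", "email", "country", "signup_date"),
    ("1001", "Alice Johnson", "alice@example.com", "US", "2024-01-12"),
    ("1002", "Bob Smith", "bob@example.com", "GB", "2024-02-03"),
    ("1003", "Chandra Patel", "chandra@example.com", "IN", "2024-02-19"),
]


def write_sample(path: Path) -> Path:
    """Write the sample rows as a CSV file and return its path."""
    with path.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle, lineterminator="\n").writerows(ROWS)
    return path


def read_header(path: Path) -> list[str]:
    """Return the column names from the first line of the CSV file."""
    with path.open(newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle))


sample_path = write_sample(Path(tempfile.mkdtemp()) / "customers.csv")
header = read_header(sample_path)
print(sample_path, header)
```

Reading the header back is a quick local check that the file has the columns you expect to look for after harvesting.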
Step 3: Create (or confirm) IAM permissions for Data Catalog and for your user
3A) Ensure your user/group can manage Data Catalog
If you’re in a training tenancy you might already be an admin. If not, you need IAM policies allowing your group to manage Data Catalog in the compartment.
Because policy naming and required verbs must be exact, use the official documentation’s IAM policy examples for Data Catalog:
- Docs home (navigate to IAM/policies section): https://docs.oracle.com/en-us/iaas/data-catalog/home.htm
Create policies in: Identity & Security → Policies
Common pattern (example only—verify exact service names, resource-types, and verbs in docs):
– Allow a group to manage Data Catalog resources in compartment lab-datacatalog.
3B) Allow Data Catalog to read Object Storage metadata
Harvesting needs permission to read Object Storage (at least bucket/object metadata, possibly object listings).
Use the official Data Catalog documentation for Object Storage harvesting IAM policy statements. Create them in a policy attached to the compartment containing the bucket.
Important: Do not over-permission. Grant read-only access and scope it to the lab compartment where possible.
Expected outcome: Policies exist and are attached to the correct compartment.
Verification: IAM policy changes can take a short time to propagate. If harvest fails with authorization errors, wait a few minutes and retry after confirming policies.
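When policies for several compartments are needed, generating the statement text from a template helps avoid typos. The sketch below is an assumption-heavy illustration: the resource-type names `data-catalog-family` and `object-family`, the use of a dynamic group for the catalog, and the group names are all assumptions to check against the official Data Catalog policy documentation before use. The script only builds strings; it does not call any Oracle API.

```python
def manage_catalog_statement(group: str, compartment: str) -> str:
    """Statement letting a user group manage Data Catalog resources (3A).

    Assumes the aggregate resource type is named data-catalog-family.
    """
    return f"Allow group {group} to manage data-catalog-family in compartment {compartment}"


def harvest_read_statement(dynamic_group: str, compartment: str) -> str:
    """Read-only statement for harvesting Object Storage metadata (3B).

    Assumes the catalog is matched by a dynamic group and that read access
    to object-family is sufficient; verify both in the official docs.
    """
    return (
        f"Allow dynamic-group {dynamic_group} to read object-family "
        f"in compartment {compartment}"
    )


# Hypothetical group names; the compartment name comes from Step 1.
statements = [
    manage_catalog_statement("DataCatalogAdmins", "lab-datacatalog"),
    harvest_read_statement("lab-dc-dynamic-group", "lab-datacatalog"),
]
for statement in statements:
    print(statement)
```

Both statements are scoped to the lab compartment and the harvesting one uses the read verb only, in line with the least-privilege note above.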
Step 4: Create a Data Catalog
- Go to Analytics & AI (or search for Data Catalog in the console search bar).
- Open Data Catalog.
- Select compartment: lab-datacatalog.
- Click Create Catalog.
- Name: lab-catalog
- (Optional) Description: Catalog for Object Storage metadata harvesting lab
- Create.
Expected outcome: Catalog is created and appears as Active.
Verification: Open the catalog and confirm you can see catalog details and navigation items (Data Assets, Glossary, etc.).
Step 5: Register Object Storage as a Data Asset
- Inside your catalog, go to Data Assets.
- Click Create Data Asset.
- Choose the data asset type for Object Storage (label can vary; select the OCI Object Storage option).
- Provide:
  - Name: lab-os-asset
  - Description: Object Storage bucket for lab dataset
  - Bucket details: select/enter your bucket and namespace as required by the UI
- Save/Create.
Expected outcome: A data asset representing your bucket exists in the catalog.
Verification: The data asset appears in the list and shows connection details (where configured).
Step 6: Run a harvest job to ingest metadata
- Open the data asset lab-os-asset.
- Locate Harvest (or “Harvesting”) in the asset actions.
- Create a harvest job (or run a harvest immediately):
  - Harvest type: choose the default “metadata harvest” option shown
  - Scope: optionally limit to a prefix/path if your UI supports it (useful for large buckets)
- Start the harvest.
Expected outcome: Harvest job starts and then completes successfully.
Verification:
- Check harvest job status: Succeeded/Completed.
- If the UI provides a job run log, review it for counts of discovered entities.
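If you automate this step, checking the job status becomes a polling loop with a timeout. The sketch below assumes you supply a `get_status` callable that returns the current status as a string; the status values ("Running", "Succeeded", "Completed", "Failed") are placeholders based on the wording above, so map them to whatever states the API actually reports.

```python
import time
from typing import Callable

# Placeholder status names; adjust to the states your API returns.
TERMINAL_OK = {"Succeeded", "Completed"}
TERMINAL_FAILED = {"Failed"}


def wait_for_harvest(
    get_status: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the job reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_OK:
            return status
        if status in TERMINAL_FAILED:
            raise RuntimeError(f"harvest ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"harvest still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Demo with a canned status sequence instead of a real job.
_statuses = iter(["Running", "Running", "Succeeded"])
result = wait_for_harvest(lambda: next(_statuses), interval_s=0.0)
print(result)  # Succeeded
```

Raising on failure and on timeout, rather than returning a status to inspect, keeps a scheduled pipeline from silently continuing after a bad harvest.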
Step 7: Search for the harvested dataset and enrich metadata
7A) Find the dataset
- In the catalog, use Search.
- Search for: customers (or customers.csv depending on how the entity is represented).
- Open the entity representing your dataset.
Expected outcome: You can view metadata such as name, location/path, and possibly inferred schema/columns (exact metadata depends on connector support).
7B) Add tags
- In the entity details, find Tags (or classification).
- Add tags such as:
  - Domain:Lab
  - Sensitivity:Internal
  - Lifecycle:Demo
Expected outcome: Tags appear on the entity and become searchable filters.
7C) Create a glossary term and link it
- Go to Glossary.
- Create a term:
  - Term: Customer
  - Definition: A person or organization that has signed up for our service.
- Return to the customers entity and associate/link the glossary term (UI wording varies).
Expected outcome: The entity now shows an associated glossary term, improving business clarity.
Validation
Use this checklist:
- Catalog exists and is Active.
- Data asset exists for Object Storage bucket.
- Harvest job succeeded.
- Searching for customers returns at least one entity.
- Entity shows your tags and linked glossary term.
If any item fails, use the troubleshooting section below.
Troubleshooting
Issue: Harvest fails with authorization/403 errors
- Cause: Missing or incorrect IAM policy allowing Data Catalog service to read Object Storage.
- Fix:
- Re-check the official Data Catalog Object Storage harvesting policy examples.
- Confirm policy is in the correct compartment (where the bucket resides).
- Wait for IAM propagation (a few minutes) and retry harvest.
Issue: Bucket or namespace not found
- Cause: Wrong region/compartment selected, or incorrect namespace.
- Fix: Confirm region at the top right and the compartment selector in Object Storage and Data Catalog.
Issue: No entities found after harvest
- Cause: Harvest scope/prefix excludes the object, or connector doesn’t infer metadata from the file type.
- Fix:
- Confirm customers.csv exists in the bucket.
- Re-run harvest without prefix filters.
- Check whether file-level metadata vs schema inference is supported for your connector/version (verify in docs).
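To see how a prefix filter can silently exclude the lab file, here is a small sketch. It assumes the harvest scope behaves like a plain string-prefix match on object names; the real connector may apply different rules, so treat this as a reasoning aid only. The object names other than customers.csv are made up for illustration.

```python
def objects_in_scope(object_names, prefix=""):
    """Return the object names that a prefix-scoped harvest would consider.

    Assumption: the scope is a simple "name starts with prefix" check.
    """
    return [name for name in object_names if name.startswith(prefix)]


# Hypothetical bucket listing; only customers.csv comes from the lab.
bucket_objects = ["customers.csv", "raw/orders.csv", "raw/events.json"]

print(objects_in_scope(bucket_objects, prefix="raw/"))  # customers.csv is excluded
print(objects_in_scope(bucket_objects))                 # no prefix: everything is in scope
```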
Issue: Can’t see Data Catalog in console
- Cause: Service not enabled/available in your region, or you lack IAM permissions.
- Fix: Switch regions and confirm service availability; request access from your tenancy administrator.
Cleanup
To avoid ongoing costs and clutter, remove lab resources:
- In Data Catalog:
– Delete harvest jobs (if required by the UI)
– Delete the data asset lab-os-asset
– Delete the catalog lab-catalog
- In Object Storage:
– Delete customers.csv
– Delete the bucket lab-dc-bucket-...
- In IAM:
– Remove lab-specific policies if they were created only for this exercise
- Delete the compartment lab-datacatalog (only after all resources inside are deleted)
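The cleanup order matters because some resources cannot be removed while others still depend on them. One way to reason about it is as a dependency graph, sketched below with Python's standard `graphlib` module (Python 3.9+). The resource names come from this lab; the dependency edges are an assumption based on the steps above, not rules confirmed by the service.

```python
from graphlib import TopologicalSorter

# Each resource maps to the resources that must be deleted BEFORE it.
# The edges are assumptions derived from the cleanup steps in this lab.
delete_after = {
    "data asset lab-os-asset": {"harvest jobs"},
    "catalog lab-catalog": {"data asset lab-os-asset"},
    "bucket lab-dc-bucket-...": {"object customers.csv"},
    "compartment lab-datacatalog": {
        "catalog lab-catalog",
        "bucket lab-dc-bucket-...",
        "lab IAM policies",
    },
}


def deletion_order(dependencies):
    """Return one valid deletion order (dependencies first)."""
    return list(TopologicalSorter(dependencies).static_order())


for step, resource in enumerate(deletion_order(delete_after), start=1):
    print(f"{step}. delete {resource}")
```

Several orders are valid; the only guarantees are that each resource appears after everything it depends on and that the compartment comes last.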
11. Best Practices
Architecture best practices
- Start with a curated zone: Catalog your “silver/gold” datasets before raw ingestion zones.
- Design for domains: Use consistent tagging like Domain:<name> and map assets to domain ownership.
- Separate environments: Use separate catalogs or clear naming (and separate compartments) for dev/test/prod depending on governance needs and pricing.
IAM/security best practices
- Use least privilege for both:
- Human users (stewards vs consumers)
- Service access for harvesting (read-only where possible)
- Keep catalog administration limited to a small group.
- Use compartments to enforce boundaries between domains or business units.
Cost best practices
- Avoid harvesting everything. Harvesting should be intentional and tied to discovery value.
- Set harvest schedules carefully; nightly is often enough.
- Periodically deprecate/remove assets no longer needed.
Performance best practices
- Use a naming standard for assets and entities to improve search quality.
- Enforce required metadata fields (owner, description) through governance processes.
- Keep tags controlled (avoid dozens of near-duplicates like PII, Pii, pii).
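Near-duplicate tags are easy to detect once you have a list of the tags in use. The sketch below groups tags that differ only by letter case or surrounding whitespace. How you export the tag list from your catalog is left open, and the sample tags are illustrative.

```python
from collections import defaultdict


def find_near_duplicates(tags):
    """Group tags that differ only by case or surrounding whitespace."""
    groups = defaultdict(set)
    for tag in tags:
        groups[tag.strip().casefold()].add(tag)
    # Keep only the groups that contain more than one spelling.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }


print(find_near_duplicates(["PII", "Pii", "pii", "Finance", "finance ", "Gold"]))
```

Each reported group is a candidate for merging into one canonical spelling.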
Reliability best practices
- Treat harvest as a production job:
- Define RACI for failures
- Add alerts/notifications (where supported)
- Document rollback/mitigation (e.g., last-known-good metadata)
Operations best practices
- Create an operational runbook:
- Harvest cadence
- Failure handling
- Change management for glossary
- Use Audit and logging to track administrative activity.
Governance/tagging/naming best practices
- Tag strategy examples:
– Sensitivity:Public|Internal|Confidential|Restricted
– Certification:Bronze|Silver|Gold
– Domain:<DomainName>
– OwnerTeam:<TeamName>
- Name catalogs and assets with predictable prefixes:
– prod-, nonprod-, sandbox-
- Require a short description for every data asset and key entity.
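A tag strategy like the one above is easier to enforce when it is written down as data. The sketch below checks Key:Value tags against the example vocabulary and checks names against the example prefixes. Treating Domain and OwnerTeam as free-form keys is an assumption made for illustration; you may prefer fixed lists for those as well.

```python
ALLOWED_TAGS = {
    "Sensitivity": {"Public", "Internal", "Confidential", "Restricted"},
    "Certification": {"Bronze", "Silver", "Gold"},
}
FREE_FORM_KEYS = {"Domain", "OwnerTeam"}  # assumption: any non-empty value is accepted
NAME_PREFIXES = ("prod-", "nonprod-", "sandbox-")


def validate_tag(tag):
    """Return None if `tag` (Key:Value) is acceptable, else an error message."""
    key, sep, value = tag.partition(":")
    if not sep or not value:
        return f"{tag!r}: expected Key:Value"
    if key in FREE_FORM_KEYS:
        return None
    if key not in ALLOWED_TAGS:
        return f"{tag!r}: unknown key {key!r}"
    if value not in ALLOWED_TAGS[key]:
        return f"{tag!r}: {value!r} is not allowed for {key}"
    return None


def has_valid_prefix(name):
    """Check that a catalog or asset name starts with an agreed prefix."""
    return name.startswith(NAME_PREFIXES)
```

A check like this could run in a review step before tags are applied, so that invalid values are caught before they reach the catalog.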
12. Security Considerations
Identity and access model
- Data Catalog uses OCI IAM for authentication and authorization.
- Use groups and policies to separate:
- Catalog administrators (create/manage catalogs, assets, harvest)
- Data stewards (edit glossary, curation fields)
- Consumers (read-only search/browse)
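The role separation above could be expressed as one policy statement per group. The sketch below only builds the statement text, following the general OCI form "Allow group ... to ... in compartment ...". The group names, the verb chosen for each role, and the data-catalog-family resource type are all assumptions here; verify them against the official Data Catalog policy reference before use.

```python
# Assumed mapping of role to policy verb; adjust after checking the docs.
VERB_BY_ROLE = {
    "admins": "manage",
    "stewards": "use",
    "consumers": "read",
}


def policy_statements(
    compartment,
    group_prefix="datacatalog-",          # hypothetical group naming scheme
    resource_type="data-catalog-family",  # assumed resource type; verify in docs
):
    """Build one policy statement per role for the given compartment."""
    return [
        f"Allow group {group_prefix}{role} to {verb} {resource_type} "
        f"in compartment {compartment}"
        for role, verb in VERB_BY_ROLE.items()
    ]


for statement in policy_statements("lab-datacatalog"):
    print(statement)
```

Generating statements from a small table keeps the three roles consistent across compartments and makes the differences between them easy to review.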
Encryption
- Oracle Cloud services typically encrypt data at rest and in transit. Confirm Data Catalog’s encryption specifics and key management options (Oracle-managed keys vs customer-managed keys, if available) in official docs.
Network exposure
- Console/API access uses Oracle Cloud endpoints.
- Harvest connectivity to private sources may require private networking patterns (verify private endpoint support and requirements).
Secrets handling
- If harvesting requires credentials (common for databases), store secrets securely:
- Prefer Oracle Cloud Vault where supported by the connector pattern (verify).
- Restrict who can view/rotate credentials.
- Rotate secrets regularly and on staff changes.
Audit/logging
- Use OCI Audit to record administrative events.
- Retain logs per compliance requirements.
- Monitor harvest activity and unexpected changes to glossary terms/tags.
Compliance considerations
Data Catalog helps with:
– Data inventory visibility
– Ownership and stewardship traceability
– Classification tagging workflows
But it does not replace:
– DLP tooling
– Full data access monitoring on underlying stores
– Data retention enforcement (that remains with the storage/database systems)
Common security mistakes
- Granting overly broad Object Storage permissions to the Data Catalog service or to users
- Using shared personal credentials for database harvesting
- Allowing anyone to edit glossary terms (definitions become untrusted)
- Not tracking who changed sensitive classification tags
Secure deployment recommendations
- Use compartment boundaries and least privilege.
- Centralize naming/tagging standards.
- Restrict write privileges to curated metadata fields.
- Establish a review workflow for high-impact glossary changes.
13. Limitations and Gotchas
Treat this section as a checklist to validate early; details vary by region and connector.
- Connector coverage varies: Not all data sources are supported everywhere. Confirm supported data asset types in your region.
- Regional service: Catalogs are regional. Multi-region organizations may need multiple catalogs and governance alignment.
- IAM complexity: Harvesting often fails due to missing service permissions to source systems.
- Metadata ≠ data: Data Catalog doesn’t grant access to the underlying data; it only indexes metadata.
- Glossary success depends on process: Without stewardship and standards, glossary becomes stale.
- Tag sprawl risk: Without controlled vocabulary, tags become inconsistent and reduce search value.
- Private network sources: Harvesting private databases can require networking setup; validate what Data Catalog supports (private endpoints/connectivity).
- Operational visibility: If you need detailed metrics/alerts, verify what native monitoring and events exist; you may need process tooling around it.
- Deletion dependencies: You may need to delete harvest jobs or assets before deleting catalogs, depending on UI rules.
14. Comparison with Alternatives
Data Catalog is one component of a broader Data Management stack. Here’s how it compares to nearby options.
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Oracle Cloud Data Catalog | OCI-centric metadata discovery and governance | Managed service, integrates with OCI IAM/compartments, glossary + enrichment | Connector coverage and regional scope must be validated; governance requires process | You run data platforms on Oracle Cloud and want a managed metadata catalog |
| OCI Data Integration (metadata features) | ETL/ELT pipeline building with some metadata context | Strong for building pipelines; can complement a catalog | Not a dedicated enterprise catalog by itself | You need data pipelines first, and cataloging is a secondary need |
| Custom metadata in a database/wiki | Very small environments | Simple, cheap at tiny scale | Not searchable at enterprise scale; not governed; becomes stale | Small team with limited sources and minimal compliance requirements |
| AWS Glue Data Catalog | AWS data lake and analytics | Tight AWS integration; common in AWS ecosystems | AWS-specific; different IAM model | Your platform is primarily on AWS |
| Microsoft Purview | Microsoft-centric governance and cataloging | Broad governance suite, integrations across Microsoft stack | Complexity and licensing can be significant | Your ecosystem is Microsoft/Azure-first and you need a broad governance suite |
| Google Cloud Dataplex Catalog (and related GCP governance tools) | GCP data governance | Integrates with GCP data services | GCP-specific | You are GCP-first and need native governance/catalog |
| Apache Atlas (self-managed) | Highly customizable governance | Open-source, extensible | Operational burden; scaling and UX depend on your implementation | You need deep customization and can operate the platform |
| DataHub / Amundsen (self-managed) | Modern metadata platforms | Strong community, flexible ingestion | You run/scale it; integrations vary | You want open ecosystem control and can invest in operations |
15. Real-World Example
Enterprise example (regulated industry)
Problem: A financial services company runs multiple analytics domains on Oracle Cloud. Auditors request a repeatable inventory of datasets used for regulatory reporting, including definitions and owners. Teams also struggle with inconsistent KPI definitions across departments.
Proposed architecture:
– One regional Data Catalog per primary region
– Data assets for:
– Curated Object Storage buckets (domain-based)
– Autonomous Data Warehouse (core reporting)
– Governance model:
– Data stewards manage glossary and certification tags
– Platform admins manage assets/harvesting
– Consumers get read-only access
– Operational integration:
– Scheduled nightly harvest for curated sources
– Audit log retention aligned to compliance policy
Why Data Catalog was chosen:
– Native integration with Oracle Cloud IAM and compartments
– Central business glossary connected to technical assets
– Managed service reduces operational overhead vs self-hosting
Expected outcomes:
– Faster audit response (inventory + ownership in one place)
– Reduced KPI disputes due to glossary-driven definitions
– Improved analyst productivity via search and certified datasets
Startup/small-team example
Problem: A SaaS startup stores product analytics events in Object Storage and a small warehouse. New team members don’t know which datasets are safe to use, and dashboards are inconsistent.
Proposed architecture:
– Single Data Catalog in the team’s region
– Catalog only curated datasets:
– analytics_curated bucket paths
– Warehouse schema BI_MART
– Simple glossary:
– “Active User”, “Conversion”, “Churn”
– Tagging:
– Certified:Gold for tables used in executive dashboards
Why Data Catalog was chosen:
– Quick setup without building a custom system
– Glossary + tags provide immediate value for a small team
– Scales as the startup adds data sources
Expected outcomes:
– New hires onboard faster
– Fewer broken dashboards from misunderstanding data meaning
– Better reuse of curated datasets
16. FAQ
1) Does Data Catalog store my actual data?
No. Data Catalog stores metadata (information about data). The underlying data remains in Object Storage, databases, or other systems.
2) Is Data Catalog a data governance platform?
It supports governance workflows (glossary, tags, ownership metadata), but full governance often requires processes and potentially additional tools.
3) Can Data Catalog catalog Object Storage buckets?
Commonly yes, through a data asset and harvest job for Object Storage. Confirm exact connector behavior and supported formats in the official docs.
4) Can I catalog Autonomous Data Warehouse or Autonomous Database?
Often yes, depending on connector support and your configuration. Verify supported sources and required credentials/networking.
5) How do users access Data Catalog?
Through the Oracle Cloud Console and APIs, controlled by OCI IAM policies.
6) How do I keep metadata up to date?
Use scheduled harvests (if supported in your UI) or run harvest jobs periodically. Also operationalize steward updates for business context.
7) What’s the difference between a catalog and a data asset?
A catalog is the container. A data asset is a registered source inside the catalog.
8) What’s a harvest job?
A harvest job connects to a data asset and extracts technical metadata into the catalog.
9) Can I restrict who can edit glossary terms?
Yes—use IAM policies and role separation so only stewards/admins can modify governed fields.
10) Will Data Catalog improve query performance?
No. It’s not a query engine. It improves discovery and understanding, not execution speed.
11) How do I classify sensitive fields (like email)?
Apply tags and/or custom properties at the entity/attribute level as supported. The exact tagging granularity depends on the harvested metadata model.
12) Does Data Catalog automatically detect PII?
Some catalogs provide classification features; do not assume automatic detection. Verify whether OCI Data Catalog includes automated classification in your current version/region, and consider complementary tooling if needed.
13) Can I automate onboarding of new datasets?
Yes, using APIs/CLI/SDK where supported. Many teams implement “catalog as code” patterns plus standard tags/properties.
14) What’s the best way to design tags?
Use controlled vocabularies and a small number of standardized dimensions (Sensitivity, Certification, Domain, OwnerTeam).
15) How do I estimate cost?
Use the official price list entry for Data Catalog and the OCI Cost Estimator. Costs depend on the pricing dimensions Oracle currently uses for this service—verify before scaling.
16) Should I create one catalog or many?
Start with one per environment or region, then scale only if governance boundaries require it. Multiple catalogs increase operational overhead and may increase cost.
17) Can I integrate Data Catalog with CI/CD?
Yes, by calling APIs in pipelines to create assets, apply tags, or trigger harvest. Ensure policies and secrets management are handled securely.
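For the automation questions above (13 and 17), a common "catalog as code" pattern is to keep a desired-state manifest in version control and compute the difference against what the catalog currently holds. The sketch below shows only that diff step. The manifest shape (asset name mapped to a set of tags) is hypothetical, and applying the plan through the API, CLI, or SDK is deliberately left out.

```python
def plan_changes(desired, actual):
    """Compare desired vs. actual data assets (name -> set of tags).

    Returns the asset names to create, delete, and retag.
    """
    to_create = sorted(desired.keys() - actual.keys())
    to_delete = sorted(actual.keys() - desired.keys())
    to_retag = sorted(
        name
        for name in desired.keys() & actual.keys()
        if desired[name] != actual[name]
    )
    return {"create": to_create, "delete": to_delete, "retag": to_retag}


# Hypothetical manifest and current state, for illustration only.
desired = {
    "prod-sales": {"Domain:Sales"},
    "prod-hr": {"Domain:HR", "Sensitivity:Confidential"},
}
actual = {
    "prod-hr": {"Domain:HR"},
    "sandbox-old": set(),
}

print(plan_changes(desired, actual))
```

Printing the plan in a pipeline before applying it gives reviewers a chance to catch unintended deletions.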
17. Top Online Resources to Learn Data Catalog
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official Documentation | OCI Data Catalog Documentation | Primary source for concepts, connectors, IAM policies, and API references: https://docs.oracle.com/en-us/iaas/data-catalog/home.htm |
| Official Pricing | Oracle Cloud Price List | Find Data Catalog under Data Management and confirm current billable dimensions: https://www.oracle.com/cloud/price-list/ |
| Pricing Calculator | OCI Cost Estimator | Build scenario estimates using current SKUs: https://www.oracle.com/cloud/costestimator.html |
| Official Console | Oracle Cloud Console | Hands-on creation of catalogs, data assets, harvest jobs: https://cloud.oracle.com/ |
| Architecture Center | Oracle Architecture Center | Reference architectures for data platforms that commonly include cataloging/governance patterns (search within): https://docs.oracle.com/en/solutions/ |
| Tutorials / Workshops | Oracle LiveLabs | Hands-on labs (search for “Data Catalog” and verify lab availability): https://apexapps.oracle.com/pls/apex/r/dbpm/livelabs/home |
| API/CLI Docs | OCI CLI Installation and Usage | If you automate Data Catalog operations: https://docs.oracle.com/en-us/iaas/Content/API/SDKDocs/cliinstall.htm |
| Community Learning | Oracle Cloud Customer Connect / Community | Practical troubleshooting and patterns (validate against docs): https://community.oracle.com/customerconnect/categories/oracle-cloud-infrastructure |
18. Training and Certification Providers
| Institute Name | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | DevOps engineers, platform teams, cloud engineers | OCI fundamentals, DevOps practices, cloud operations (verify course specifics) | check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate engineers | SCM/DevOps foundations, automation practices (verify OCI coverage) | check website | https://www.scmgalaxy.com/ |
| CloudOpsNow.in | Cloud operations, SRE, platform operations | Cloud ops practices, monitoring, reliability (verify OCI content) | check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, production operations teams | Reliability engineering, incident response, observability (verify cloud modules) | check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops teams adopting AIOps | AIOps concepts, operations analytics (verify integrations) | check website | https://www.aiopsschool.com/ |
19. Top Trainers
| Platform/Site Name | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/cloud training resources (verify specific offerings) | Students and working engineers | https://www.rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and mentoring (verify course catalog) | Beginners to intermediate DevOps engineers | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps guidance/training resources (verify offerings) | Teams needing short-term enablement | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support and enablement resources (verify services) | Ops and DevOps teams | https://www.devopssupport.in/ |
20. Top Consulting Companies
| Company Name | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify current portfolio) | Platform engineering, cloud adoption, operations | Standing up governance-friendly cloud landing zones; automation and operational readiness | https://www.cotocus.com/ |
| DevOpsSchool.com | Training + consulting (verify service catalog) | Enablement, DevOps transformation, cloud best practices | Designing IAM and operational runbooks; implementing CI/CD and automation around data platforms | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify offerings) | DevOps tooling, reliability improvements | Building monitoring/alerting and incident processes; automation for cloud resource provisioning | https://www.devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Data Catalog
- Oracle Cloud fundamentals:
- Tenancy, compartments, IAM policies, groups
- Regions and availability
- Object Storage basics (buckets, objects, namespaces)
- Basic data concepts:
- Schemas, tables, partitions, file formats
- Data lake vs data warehouse
- Governance foundations:
- Data ownership, stewardship, classification
What to learn after Data Catalog
- Data pipelines and processing:
- OCI Data Integration, OCI Data Flow (or your preferred tools)
- Security hardening:
- OCI Vault, key management, network segmentation
- Observability and operations:
- OCI Logging, Monitoring, Audit, and alerting patterns
- Advanced governance:
- Data-quality checks, access reviews, retention policies (implemented in source systems)
Job roles that use it
- Data Engineer (metadata-aware pipelines)
- Analytics Engineer (semantic definitions, curated marts)
- Data Steward / Governance Analyst (glossary, classification)
- Cloud Engineer / Platform Engineer (IAM, compartments, automation)
- Security Engineer (classification workflows, audit readiness)
- Solution Architect (data platform design)
Certification path (if available)
Oracle’s certification catalog changes over time. Look for:
– OCI architect and data-related certifications on the official Oracle University pages.
Verify current paths here: https://education.oracle.com/
Project ideas for practice
- Curated dataset certification workflow: Tag assets as Bronze/Silver/Gold and document steward review steps.
- Glossary-driven metrics: Build a glossary for 20 key KPIs and link them to warehouse columns.
- Automated asset onboarding: Script creation of data assets and harvesting (API/CLI), then auto-apply tags.
- Compliance inventory: Maintain a list of datasets tagged Confidential and perform quarterly owner reviews.
- Multi-compartment domain model: Organize assets by domain compartments and implement least-privilege access.
22. Glossary
- Catalog: A regional container in Oracle Cloud Data Catalog that stores harvested metadata and business context.
- Data Asset: A registered data source (Object Storage, database, etc.) that can be harvested.
- Harvest: The process/job that extracts technical metadata from a data asset into the catalog.
- Entity: A metadata object in the catalog (table, file, view, column/attribute, etc.).
- Business Glossary: A curated set of business terms and definitions linked to technical metadata.
- Tag: A label applied to catalog objects for classification and discovery.
- Custom Property: An organization-defined metadata field added to catalog objects (owner, SLA, domain, etc.).
- Compartment: OCI logical container for organizing resources and applying IAM access control.
- IAM Policy: A statement that grants permissions to groups/users/services for OCI resources.
- Steward: A role responsible for maintaining business definitions and governance metadata.
- Certified Dataset: A dataset that has been reviewed and approved for broad use (implemented via tags/process).
- Metadata: Data about data—schema, structure, definitions, location, and governance annotations.
23. Summary
Oracle Cloud Data Catalog is a managed metadata discovery and governance service in the Data Management category. It helps organizations find datasets faster, standardize definitions with a business glossary, and operationalize stewardship through tags and custom properties—without moving the underlying data.
Architecturally, it works by registering data assets and running harvest jobs to ingest technical metadata, then enabling users to search and enrich that metadata securely using OCI IAM controls. Cost depends on Oracle’s current pricing dimensions for Data Catalog (confirm in the official price list), and indirect costs are mostly driven by how broadly and frequently you harvest.
Use Data Catalog when you need reliable data discovery and shared definitions across multiple teams and sources in Oracle Cloud. Next step: expand from the lab by cataloging one curated production domain, establishing a minimal glossary, and implementing a controlled tagging standard backed by IAM role separation.