Category
Other Services
1. Introduction
Big Data Discovery is an Oracle product designed to help people explore, profile, transform, and visualize large datasets—especially data stored in Hadoop ecosystems—without requiring every user to write code.
In simple terms: Big Data Discovery is a “data exploration and preparation” tool for big data. It lets analysts and engineers ingest data (often from Hadoop), understand it quickly (profiling and sampling), clean it (enrichment and transforms), and publish curated datasets for downstream analytics.
In more technical terms: Big Data Discovery combines a browser-based exploration experience with a backend processing and indexing layer that can work with big-data storage and query engines. It was commonly positioned alongside Oracle Big Data Appliance and related Oracle big-data components. It overlaps conceptually with modern “data prep + exploration + visualization” workflows that today are frequently implemented using cloud-native services (data lake, Spark, SQL engines, and BI).
What problem it solves: teams often have large, messy datasets with unknown schema quality, missing values, inconsistent formatting, and unclear distributions. Big Data Discovery addresses the “time-to-first-insight” gap by providing interactive discovery, profiling, and transformation workflows so teams can build trusted, analysis-ready datasets faster.
Important lifecycle note (read first): Oracle’s “Big Data Discovery” has historically existed as a product integrated with Oracle’s big data stack (commonly associated with Oracle Big Data Appliance) and was also offered in some form in older Oracle Cloud environments. In many Oracle Cloud Infrastructure (OCI) tenancies today, Big Data Discovery does not appear as a native OCI managed service in the console. Availability, lifecycle status (active vs. legacy), and procurement/licensing can vary by Oracle program and contract. Verify current availability and lifecycle status in official Oracle documentation and with Oracle Sales/Support before designing new long-term architectures around it.
Because of that reality, this tutorial does two things:
1. Teaches Big Data Discovery accurately as a product and where it fits.
2. Provides a practical, executable OCI lab that recreates a “Big Data Discovery-style” workflow using current Oracle Cloud services (Object Storage + Autonomous Database + built-in analytics tools). This is often the most practical approach for new projects on OCI.
2. What is Big Data Discovery?
Official purpose (product intent)
Big Data Discovery is intended to help users:
- Connect to large data sources (commonly Hadoop/Hive/HDFS in Oracle big data deployments).
- Explore datasets interactively (search, filter, facet, profile).
- Perform data preparation (cleaning, enrichment, transformations).
- Publish curated datasets for analytics and reporting.
If you have Big Data Discovery in your environment, consult the Oracle Big Data Discovery documentation set available through Oracle Help Center (see resources at the end) for the exact version and supported integrations.
Core capabilities (what it typically provides)
Big Data Discovery capabilities, as described in Oracle materials for the product, typically include:
- Dataset ingestion from big-data repositories and structured sources (implementation depends on deployment).
- Data profiling (type inference, cardinality, distributions, outliers).
- Search and faceted exploration for fast slicing/dicing of large datasets.
- Data preparation workflows (standardization, parsing, filtering, joining, deriving fields).
- Publishing/export of curated outputs for downstream BI or data science workflows.
Caveat: exact connectors, processing engines, and export targets are version- and deployment-dependent. Verify in the documentation for your Big Data Discovery version.
Major components (conceptual)
A typical Big Data Discovery deployment historically included:
- A web-based “Studio” experience for interactive discovery and preparation.
- A processing layer to run transformations at scale (often integrated with big-data processing frameworks in the environment).
- An indexing/search layer enabling fast interactive filtering and faceting.
- Admin/configuration components for connectivity, security, and operations.
Names and internal architecture details vary by release. Use the official admin and installation guides for your exact build.
Service type
In the Oracle Cloud “Other Services” category context, it’s best to think of Big Data Discovery as a product/workload that you run as part of a broader big-data platform, not necessarily a first-class “OCI native managed service” (like Object Storage or Autonomous Database).
Scope: regional/global/zonal?
Because Big Data Discovery is not universally exposed as a native OCI resource, “scope” is generally:
- Deployment-scoped: it runs where you deploy it (on-prem, appliance, or customer-managed compute).
- Its effective availability is determined by your infrastructure and licensing rather than OCI region catalogs.
How it fits into the Oracle Cloud ecosystem
In modern OCI architectures, Big Data Discovery’s role is often fulfilled by a combination of:
- Oracle Cloud Infrastructure Object Storage (data lake storage)
- OCI Data Flow (Apache Spark) or OCI Big Data Service (processing)
- Autonomous Database (curated/serving layer)
- Oracle Analytics Cloud (BI/visualization)
- OCI Data Integration (ETL/ELT orchestration)
So even when Big Data Discovery itself is not used, the workflow it represents remains a common requirement: interactive discovery + preparation + publishing trusted datasets.
3. Why use Big Data Discovery?
Business reasons
- Faster insight from large datasets: reduce the time spent just understanding data shape and quality.
- Self-service discovery: analysts can explore data without waiting for custom engineering pipelines for every question.
- Improved data trust: profiling and preparation steps help produce cleaner datasets for decision-making.
Technical reasons
- Interactive exploration over large data: supports discovery patterns that SQL-only workflows can make slower for ad-hoc questions (depending on indexing/engine).
- Repeatable transformations: data prep can be standardized and reused.
- Bridge between raw data and analytics: publish curated outputs for BI, ML, or downstream reporting.
Operational reasons
- Standardized discovery tooling: reduces “spreadsheet chaos” and inconsistent local scripts.
- Governance alignment: centralized platform is easier to govern than scattered personal scripts (when deployed and managed properly).
Security/compliance reasons
- Centralized access control and auditability (deployment-dependent).
- Reduced need to copy data to unmanaged endpoints for exploration.
Scalability/performance reasons
- Designed for large datasets and big data ecosystems (particularly where deployed with Hadoop-related storage/query engines).
- Supports sampling and summary-based exploration patterns to keep UIs responsive.
When teams should choose it
Choose Big Data Discovery when:
- You already have it licensed/deployed (or part of an Oracle big data platform) and it matches your sources.
- You need an interactive data prep and discovery experience tightly integrated with your big data environment.
- You have the operational maturity to maintain the platform (patching, scaling, governance).
When teams should not choose it
Avoid Big Data Discovery when:
- You’re starting greenfield on OCI and need a fully managed, roadmap-forward cloud service (Big Data Discovery may be legacy for many customers).
- Your main need is BI dashboards over curated data (Oracle Analytics Cloud may be a simpler fit).
- You want open lakehouse formats (Parquet/Iceberg/Delta) with modern query engines and minimal proprietary dependencies; evaluate OCI Data Flow + Trino/Presto patterns instead.
- You cannot staff platform operations (customer-managed software can be operationally heavy).
4. Where is Big Data Discovery used?
Industries
- Financial services (risk analytics, fraud exploration, compliance datasets)
- Retail/e-commerce (clickstream exploration, product analytics)
- Telecom (CDR exploration, network event analytics)
- Manufacturing/IoT (sensor data quality and anomaly exploration)
- Healthcare (claims analytics, operational reporting; subject to strict compliance)
- Public sector (case analytics, citizen service data quality)
Team types
- Data analysts and BI teams doing exploratory work
- Data engineering teams preparing curated datasets
- Data science teams validating features and distributions
- Platform teams standardizing data exploration tooling
Workloads
- Exploratory data analysis (EDA) over big-data repositories
- Data quality and profiling at scale
- Building curated datasets from raw lakes
- Publishing “analysis-ready” datasets for BI/ML
Architectures
- Hadoop-centric environments (historically common)
- Data lake + processing + serving layer patterns
- Hybrid: on-prem big data + cloud analytics serving
Real-world deployment contexts
- Existing Oracle big data platforms where Big Data Discovery is already part of the stack
- Migration scenarios: using Big Data Discovery outputs to transition to OCI analytics services
Production vs dev/test usage
- Production: curated dataset publishing, governed exploration, standardized transformations.
- Dev/test: discovery of new sources, profiling, POCs for analytics use cases.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Big Data Discovery (or a Big Data Discovery-style workflow) is commonly applied.
1) Data lake profiling before onboarding to analytics
- Problem: raw data arrives with unknown schema drift and inconsistent quality.
- Why Big Data Discovery fits: profiling + interactive exploration helps teams understand distributions, nulls, and anomalies quickly.
- Example: a retail team receives daily CSV/JSON dumps of transactions and needs to validate fields and detect missing store IDs before building reports.
2) Self-service exploration for analysts on big datasets
- Problem: analysts are blocked waiting for engineering to build custom extracts.
- Why it fits: interactive filtering/faceting reduces dependence on ad-hoc pipelines.
- Example: marketing analysts explore clickstream attributes to identify top referral sources.
3) Preparing curated datasets for BI dashboards
- Problem: BI dashboards fail due to inconsistent formats and dirty dimensions.
- Why it fits: standardized transforms create consistent columns (dates, categories, IDs).
- Example: telecom team standardizes device model strings and publishes a clean “subscriber_device_dim”.
4) Joining heterogeneous sources into a unified dataset
- Problem: data lives in multiple sources (events + reference data) with inconsistent keys.
- Why it fits: preparation steps can include joins/derivations (capability depends on version).
- Example: manufacturer joins sensor readings with equipment metadata for plant-level KPIs.
5) Detecting outliers and data quality issues early
- Problem: pipelines silently ingest bad data leading to wrong decisions.
- Why it fits: profiling and distributions reveal outliers and breaks.
- Example: finance sees a sudden spike in “transaction_amount” due to a unit conversion bug.
6) Publishing datasets for downstream ML feature engineering
- Problem: data scientists spend too long cleaning data before modeling.
- Why it fits: curated, standardized datasets reduce duplicated cleanup.
- Example: fraud modelers receive a prepared dataset with consistent merchant categories and cleaned timestamps.
7) Investigating operational incidents with fast filtering
- Problem: SRE/ops teams need to explore event logs at scale.
- Why it fits: faceted exploration supports quick narrowing by host, error code, region (depending on ingestion).
- Example: ops team investigates “payment_timeouts” across services after a deployment.
8) Compliance reporting dataset preparation
- Problem: compliance needs repeatable datasets with lineage and consistent logic.
- Why it fits: repeatable transformations reduce one-off spreadsheet manipulation.
- Example: bank prepares monthly AML case dataset with standardized customer identifiers.
9) Enrichment and standardization of semi-structured fields
- Problem: addresses, names, product codes are messy.
- Why it fits: transformations can parse and standardize values (exact enrichment varies).
- Example: e-commerce standardizes shipping address fields and extracts postal code.
10) Migration discovery: understanding Hadoop datasets before moving to OCI
- Problem: organizations want to migrate but don’t know which datasets are important or clean.
- Why it fits: discovery identifies high-value datasets and quality issues to plan migration.
- Example: enterprise profiles Hive tables, identifies top-used columns, and prioritizes migration into OCI lakehouse patterns.
6. Core Features
Because Big Data Discovery’s availability and packaging can vary, this section describes commonly documented feature categories. Confirm exact feature availability in your Big Data Discovery version docs.
Feature 1: Interactive data exploration (search/filter/facets)
- What it does: lets users explore datasets by filtering, searching, and slicing attributes interactively.
- Why it matters: reduces time spent writing exploratory queries; accelerates understanding of data.
- Practical benefit: faster ad-hoc investigations and quicker iteration with stakeholders.
- Limitations/caveats: responsiveness depends on indexing/engine configuration and dataset size; some transformations may require batch processing.
Feature 2: Data profiling and summary statistics
- What it does: provides distributions, cardinality, missing values, and type inference.
- Why it matters: data quality issues are common in lakes; profiling surfaces them early.
- Practical benefit: improves downstream pipeline reliability and reduces “silent failures.”
- Limitations/caveats: profiling large datasets may rely on sampling; verify how sampling is configured.
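To make the idea concrete, here is a minimal stdlib-Python sketch (illustrative only, not Big Data Discovery code) of the kind of per-column statistics a profiler computes: null count, cardinality, and a naive inferred type:

```python
# Illustrative column profiler: null rate, cardinality, naive type inference
# over a list of row dicts (stdlib only).
from collections import Counter

def profile_column(rows, col):
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    # Naive type inference: numeric if the string is all digits
    # (allowing one decimal point and a leading minus sign)
    types = Counter(
        "number" if str(v).replace(".", "", 1).lstrip("-").isdigit() else "string"
        for v in non_null
    )
    return {
        "total": len(values),
        "nulls": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "inferred_type": types.most_common(1)[0][0] if types else "unknown",
    }

rows = [
    {"customer_id": "C001", "qty": "1"},
    {"customer_id": "", "qty": "3"},
    {"customer_id": "C001", "qty": "2"},
]
print(profile_column(rows, "customer_id"))
# {'total': 3, 'nulls': 1, 'cardinality': 1, 'inferred_type': 'string'}
```

Real profilers add distributions, outlier detection, and pattern analysis, but the inputs and outputs have this shape.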
Feature 3: Data preparation / transformation workflows
- What it does: supports common transforms like filtering rows, deriving columns, parsing strings/dates, and standardizing values.
- Why it matters: most analytics value comes after cleaning/curation.
- Practical benefit: repeatable prep steps reduce spreadsheet-based manipulation and duplicated scripts.
- Limitations/caveats: not every transform equals full ETL; complex workflows may still require Spark/SQL pipelines.
Feature 4: Sampling to keep exploration responsive
- What it does: allows users to work on representative subsets of huge datasets.
- Why it matters: exploration UIs can’t always operate on full-scale data interactively.
- Practical benefit: quick iteration on cleaning logic before full-scale apply.
- Limitations/caveats: sampling can mislead if data is highly skewed; validate with full runs.
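The skew caveat is easy to demonstrate. In this small Python sketch, a "first N rows" sample of data that happens to be sorted reports a 0% rate for a category whose true rate is 1%:

```python
# Sketch: "first N rows" sampling vs the true distribution on sorted data.
from collections import Counter

# 990 rows from region "A" followed by 10 rows from rare region "B"
regions = ["A"] * 990 + ["B"] * 10

head = Counter(regions[:100])      # naive head-of-file sample
full = Counter(regions)            # full-scan distribution

print(head["B"] / 100)             # 0.0  (the rare region is invisible)
print(full["B"] / len(regions))    # 0.01 (its true rate)
```

This is why uniform or stratified sampling, plus periodic full-scale validation runs, matter for skewed datasets.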
Feature 5: Publishing curated datasets
- What it does: exports/publishes prepared datasets for consumption by BI or other systems.
- Why it matters: turns exploratory work into reusable assets.
- Practical benefit: consistent curated datasets across teams.
- Limitations/caveats: export targets and formats depend on environment integration; confirm supported sinks.
Feature 6: Collaboration and project organization
- What it does: organizes work into projects/datasets with saved steps and shareable assets.
- Why it matters: prevents knowledge loss and “tribal scripts.”
- Practical benefit: repeatable prep pipelines and easier onboarding.
- Limitations/caveats: governance depends on how access control is implemented.
Feature 7: Security integration (authentication/authorization)
- What it does: controls who can access datasets and features.
- Why it matters: discovery tools often surface sensitive columns; least privilege is critical.
- Practical benefit: safer self-service analytics.
- Limitations/caveats: integration method depends on deployment (e.g., enterprise identity providers); verify supported modes.
Feature 8: Administrative controls and monitoring hooks
- What it does: provides ways to configure sources, manage users, and monitor health.
- Why it matters: discovery platforms need operational oversight to stay stable.
- Practical benefit: more predictable performance and better troubleshooting.
- Limitations/caveats: monitoring integrations vary; you may need external monitoring stacks.
7. Architecture and How It Works
High-level architecture
A Big Data Discovery-style platform typically has:
1. Data sources (HDFS/Hive tables, object storage, databases, depending on connectors).
2. Ingestion/metadata layer to register datasets.
3. Indexing and exploration layer to support fast interactive filtering and profiling.
4. Processing layer to run transformations at scale.
5. Publishing layer to produce curated outputs.
6. Access control integrated with enterprise identity/IAM.
Request/data/control flow (conceptual)
- A user logs into the Studio/UI.
- The user selects or ingests a dataset.
- The system profiles the dataset and builds indexes/metadata to enable interactive exploration.
- The user defines transformations (cleaning/derivations).
- The system executes transforms (possibly using a cluster compute engine).
- Results are published to downstream destinations.
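The flow above can be sketched in miniature. This is a toy, pure-Python stand-in for what the platform layers do at scale (profiling, cleaning, publishing), not actual product code:

```python
# Compact sketch of the conceptual flow: ingest -> profile -> transform -> publish.
raw = [
    {"order_id": "1", "qty": "2"},
    {"order_id": "1", "qty": "2"},   # duplicate row
    {"order_id": "2", "qty": ""},    # missing qty
]

# Profile: count rows, duplicate keys, and missing values
ids = [r["order_id"] for r in raw]
profile = {
    "rows": len(raw),
    "dup_ids": len(ids) - len(set(ids)),
    "missing_qty": sum(1 for r in raw if not r["qty"]),
}

# Transform: dedupe by order_id and drop rows with missing qty
seen, curated = set(), []
for r in raw:
    if r["order_id"] not in seen and r["qty"]:
        seen.add(r["order_id"])
        curated.append({"order_id": int(r["order_id"]), "qty": int(r["qty"])})

print(profile)   # {'rows': 3, 'dup_ids': 1, 'missing_qty': 1}
print(curated)   # [{'order_id': 1, 'qty': 2}]
```

The OCI lab in Section 10 performs the same three stages with Object Storage, DBMS_CLOUD, and SQL instead of Python.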
Integrations with related services (OCI context)
If you are implementing this workflow on OCI today (even without Big Data Discovery), typical integrations are:
- OCI Object Storage for raw and curated data.
- Autonomous Database (ADW/ATP) for curated serving datasets.
- OCI Data Flow (Spark) for transformation at scale.
- OCI Logging for centralized logs (for OCI-native services).
- OCI IAM for least-privilege access to buckets and databases.
- Oracle Analytics Cloud for visualization (licensed service).
Dependency services
- Identity: OCI IAM (OCI-native workflows), or enterprise IdP for legacy deployments.
- Storage: Object Storage / HDFS / database storage depending on deployment.
- Compute/processing: Spark/Hadoop/YARN or OCI Data Flow.
Security/authentication model
- OCI-native pattern: users and services authenticate via IAM policies, dynamic groups, and resource principals (for services like Data Flow).
- Legacy/platform pattern: user auth may be integrated with LDAP/SSO depending on deployment.
Networking model
- OCI-native: private endpoints for Autonomous Database; private access to Object Storage via Service Gateway; limit public exposure.
- Legacy: depends on how the platform is deployed (on-prem network segmentation, firewalls).
Monitoring/logging/governance considerations
- Define ownership for datasets and transformations (data product mindset).
- Centralize logs:
- OCI Logging for OCI services.
- Database audit logs for Autonomous Database.
- Set tagging standards and cost tracking in OCI.
Simple architecture diagram (conceptual)
flowchart LR
U[User / Analyst] --> UI[Big Data Discovery Studio<br/>or Discovery UI]
UI --> META[Metadata + Profiling]
META --> IDX[Index/Search Layer]
UI --> PROC[Processing/Transform Layer]
PROC --> SRC[(Raw Data Store<br/>HDFS / Object Storage)]
PROC --> CUR[(Curated Output<br/>DB / Object Storage)]
CUR --> BI[BI / Analytics Consumers]
Production-style architecture diagram (OCI-native replacement pattern)
flowchart TB
subgraph OCI["Oracle Cloud Infrastructure (OCI)"]
subgraph Net["VCN (private)"]
ADB[(Autonomous Database<br/>Private Endpoint)]
BAST[Admin Bastion / Private Admin Host]
end
OS[(Object Storage Buckets<br/>Raw + Curated)]
DF["OCI Data Flow (Spark Jobs)"]
LOG[OCI Logging]
AUD[Audit Logs]
IAM[IAM Policies<br/>+ Dynamic Groups]
end
EXT[Enterprise Users] -->|SSO/IAM| IAM
EXT -->|SQL/Web| ADB
EXT -->|Console/API| OS
DF -->|Read/Write| OS
DF -->|Load curated| ADB
DF --> LOG
ADB --> AUD
OS --> AUD
BAST -->|Private admin| ADB
8. Prerequisites
Because Big Data Discovery itself may not be a universally available OCI service, prerequisites are split into two parts:
- A) If you already have Big Data Discovery (legacy/product environment)
- B) If you will follow the OCI hands-on lab (recommended for most new OCI users)
A) Big Data Discovery (product) prerequisites (verify in your version docs)
- Access to a Big Data Discovery environment (often part of an Oracle big data platform deployment).
- Admin-provisioned connectivity to your data sources (Hive/HDFS/etc., depending on your environment).
- User authentication set up (SSO/LDAP/IAM depending on deployment).
- Permissions to create projects/datasets and run transformations.
B) OCI hands-on lab prerequisites (Big Data Discovery-style workflow on Oracle Cloud)
Account/tenancy
- An active Oracle Cloud (OCI) tenancy with billing enabled (or free trial).
- Permission to create and manage:
  - Object Storage buckets/objects
  - Autonomous Database (Always Free if available in your region)
  - IAM policies (or an admin who can create required policies)
IAM permissions
- You need a group with policies similar to:
  - Manage Object Storage in a compartment
  - Manage Autonomous Database in a compartment
  - Use Cloud Shell (optional)
- If you are not an admin, ask your OCI admin to grant least-privilege permissions.
Tools
- OCI Console access.
- Optionally:
  - OCI Cloud Shell (recommended) or OCI CLI installed locally.
  - A SQL client: SQL Developer, SQLcl, or the built-in Autonomous Database SQL tools.
Region availability
- Object Storage and Autonomous Database are widely available, but Always Free availability can vary.
- If a service isn’t available in your region, select a different OCI region (if your tenancy allows) or use paid resources.
Quotas/limits
- Autonomous Database Always Free has resource limits.
- Object Storage has tenancy-level service limits.
- If you hit a limit error, request a service limit increase (paid accounts) or use a smaller dataset.
Prerequisite services
- OCI Object Storage
- Oracle Autonomous Database (ADW or ATP)
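The IAM permissions listed above can be expressed as OCI policy statements. A minimal sketch, assuming a group named bdd-lab-users (hypothetical) and the bdd-lab compartment created in Step 1; tighten verbs and resource families to match your security team’s least-privilege standards:

```text
Allow group bdd-lab-users to manage object-family in compartment bdd-lab
Allow group bdd-lab-users to manage autonomous-database-family in compartment bdd-lab
Allow group bdd-lab-users to use cloud-shell in tenancy
```

For production, prefer narrower verbs (for example, `use` or `read` instead of `manage`) and scope to the smallest compartment that works.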
9. Pricing / Cost
Big Data Discovery pricing model (important reality)
Big Data Discovery is not typically priced like a modern OCI consumption service with a public “per-GB/per-OCPU” meter in the OCI price list. Instead, it has historically been:
- Included in certain Oracle big-data platform offerings, or
- Licensed as software (terms vary)
What to do:
- Verify Big Data Discovery commercial and licensing terms with your Oracle Sales/Account team.
- Check your support contract and product availability for your environment.
Because public, region-based OCI pricing pages may not list Big Data Discovery explicitly, you should not assume it behaves like a pay-as-you-go OCI native service.
Cost model for the OCI lab (Big Data Discovery-style workflow)
The lab in this tutorial uses common OCI services with published pricing:
- OCI Object Storage: billed by stored GB-month and requests (and data egress if applicable).
- Autonomous Database: the Always Free option may cost $0 within limits; paid tiers bill by OCPU and storage.
- Optional additions (not required):
  - OCI Data Flow: billed by OCPU time (Spark job runtime) and possibly other dimensions depending on SKU.
Free tier notes
- Autonomous Database has an Always Free option in many regions/tenancies (verify in your OCI console).
- Object Storage has limited “free” components depending on promotions; assume storage is billed unless your tenancy offers credits.
Cost drivers
Direct cost drivers:
- Object Storage data volume (raw + curated + logs/exports).
- Autonomous Database size and compute (if not Always Free).
- Any optional Spark processing (Data Flow) runtime.
Indirect/hidden costs:
- Data transfer/egress if you move data out of OCI regions.
- Operational overhead (time) if you maintain self-managed tooling.
- Backups and retention if you store many copies.
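To reason about these drivers, it helps to sketch the arithmetic. The rates below are hypothetical placeholders, not Oracle prices; substitute real figures from the OCI pricing pages:

```python
# Sketch of the cost arithmetic only. RATE_* values are hypothetical
# placeholders, NOT Oracle prices; pull real rates from the pricing pages.
RATE_PER_GB_MONTH = 0.02      # hypothetical storage rate
RATE_PER_OCPU_HOUR = 0.30     # hypothetical compute rate

def monthly_cost(storage_gb, ocpus, hours_per_month=730):
    storage = storage_gb * RATE_PER_GB_MONTH
    compute = ocpus * hours_per_month * RATE_PER_OCPU_HOUR
    return storage + compute

# 500 GB stored + 2 OCPUs running all month:
print(round(monthly_cost(500, 2), 2))  # 500*0.02 + 2*730*0.30 = 448.0
```

The structure (storage term + compute-hours term) is what matters; it makes clear why idling down compute and pruning stored copies are the two biggest levers.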
Network/data transfer implications
- Ingress to OCI is typically not billed, but egress to the internet is usually billed. Verify OCI data transfer pricing for your region.
- Cross-region replication and reads may incur additional costs.
How to optimize cost
- Keep raw data in Object Storage and only curate what you need into the database.
- Use compressed columnar formats (Parquet) for curated lake data when possible.
- Use Always Free Autonomous Database for small labs and prototypes.
- Set lifecycle policies on buckets to archive or delete older objects.
- Tag resources for cost tracking and shut down/delete unused resources.
Example low-cost starter estimate (no fabricated numbers)
A low-cost starter design usually includes:
- One Object Storage bucket with a small dataset (a few GB or less).
- One Autonomous Database Always Free instance.
- Optional: no Data Flow jobs.
Cost should be minimal (often near $0 if Always Free is used and storage is small), but verify in the OCI Cost Estimator.
Example production cost considerations
For production-scale discovery/prep workflows:
- Expect significant Object Storage growth (raw + curated + historical).
- Autonomous Database paid tiers if you need larger compute/storage and higher concurrency.
- Spark processing costs (OCI Data Flow) if you run frequent large jobs.
- Monitoring/log retention costs and security services.
Official pricing references (start here)
- OCI pricing overview and calculator:
- https://www.oracle.com/cloud/costestimator.html
- https://www.oracle.com/cloud/pricing/
- OCI Object Storage pricing (navigate to “Storage”):
- https://www.oracle.com/cloud/pricing/
- Autonomous Database pricing:
- https://www.oracle.com/autonomous-database/pricing/
- OCI Data Flow pricing (if used):
- https://www.oracle.com/cloud/pricing/
Pricing pages can be reorganized over time. If a link changes, start from the OCI pricing page and drill down by service.
10. Step-by-Step Hands-On Tutorial
Because Big Data Discovery may not be available as a native OCI managed service in your tenancy, this lab provides a Big Data Discovery-style workflow on Oracle Cloud using commonly available OCI services. The end result is the same outcome Big Data Discovery is typically used for: ingest → profile → clean/transform → publish → explore.
Objective
Build a small, realistic “discovery and preparation” pipeline on Oracle Cloud:
1. Store a raw CSV dataset in OCI Object Storage.
2. Load it into an Autonomous Database (Always Free where available) using DBMS_CLOUD.
3. Profile and transform the dataset with SQL.
4. Explore results with built-in Autonomous Database tools (and optionally connect a BI tool later).
Lab Overview
You will create:
- 1 compartment (optional but recommended)
- 1 Object Storage bucket + uploaded dataset
- 1 Autonomous Database (ADW or ATP)
- 1 database user + credential to read from Object Storage
- 1 raw table + 1 curated table
- Simple profiling queries and a “publish” view
Estimated time: 60–120 minutes
Cost: Low. Potentially $0 if you use Autonomous Database Always Free and a small dataset.
Skill level: Beginner-friendly; includes IAM and SQL fundamentals.
Step 1: Create a compartment (recommended)
Why: compartments help isolate access and costs for labs.
- In the OCI Console, open the navigation menu → Identity & Security → Compartments.
- Click Create Compartment.
- Name: bdd-lab
- Description: Big Data Discovery style lab resources
- Click Create Compartment.
Expected outcome: You have a bdd-lab compartment to place all resources.
Verification: You can select bdd-lab in the compartment picker.
Step 2: Create an Object Storage bucket and upload a dataset
2.1 Create a bucket
- Go to Storage → Buckets.
- Ensure the compartment is bdd-lab.
- Click Create Bucket.
- Name: bdd-lab-raw
- Keep defaults (Standard storage tier is fine for the lab).
- Click Create.
Expected outcome: bucket bdd-lab-raw exists.
2.2 Upload a sample CSV
Pick a small dataset you can legally use. Two good options:
- A public dataset from a government open data portal
- A synthetic dataset you generate yourself
For a quick lab, you can generate a synthetic CSV locally:
cat > sales_raw.csv <<'EOF'
order_id,order_ts,customer_id,region,product,qty,unit_price,status
1,2026-01-05T10:15:00Z,C001,us-phx,keyboard,1,45.00,SHIPPED
2,2026-01-05T11:02:00Z,C002,us-ashburn,mouse,2,18.50,SHIPPED
3,2026-01-06T09:41:00Z,C003,eu-frankfurt,monitor,1,199.99,PENDING
4,2026-01-06T09:41:00Z,C003,eu-frankfurt,monitor,1,199.99,PENDING
5,2026-01-07T14:20:00Z,,us-phx,usb-c cable,3,9.99,CANCELLED
6,2026-01-08T08:05:00Z,C004,us-phx,laptop,1,899.00,SHIPPED
7,2026-01-08T08:07:00Z,C004,us-phx,laptop,1,899.00,SHIPPED
8,2026-01-10T16:55:00Z,C005,ap-tokyo,headset,2,59.90,SHIPPED
EOF
Upload it:
– Buckets → bdd-lab-raw → Upload → select sales_raw.csv
Expected outcome: sales_raw.csv is stored in Object Storage.
Verification: You can see the object in the bucket listing.
Step 3: Create an Autonomous Database (Always Free if available)
- Go to Oracle Database → Autonomous Database.
- Select compartment: bdd-lab.
- Click Create Autonomous Database.
- Choose a workload: Autonomous Data Warehouse (ADW) is often a good fit for analytics labs.
- Display name: bdd_lab_adw
- Database name: BDDLAB
- Choose Always Free if available.
- Set the admin password (store it securely).
- Networking: for the simplest lab, you can use public access with allowed IPs; for more secure setups, use a private endpoint in a VCN (adds complexity).
- Click Create.
Expected outcome: an Autonomous Database instance is provisioned.
Verification: status becomes Available.
Step 4: Create an Object Storage auth token and database credential
Autonomous Database uses DBMS_CLOUD to access Object Storage. The common approach is:
- Create an OCI user Auth Token
- Store it as a database credential
4.1 Create an Auth Token (OCI user)
- In OCI Console: Identity & Security → Users → your user.
- Open Auth Tokens.
- Click Generate Token.
- Description: bdd-lab-dbms-cloud
- Copy the token value (you won’t see it again).
Expected outcome: you have an auth token.
4.2 Create a database user (optional but recommended)
In Autonomous Database, open Database Actions (or your SQL tool) and run:
CREATE USER bdd_lab IDENTIFIED BY "UseAStrongPassword#1";
GRANT CONNECT, RESOURCE TO bdd_lab;
-- RESOURCE no longer implies tablespace quota; grant it explicitly
-- (DATA is the default tablespace in Autonomous Database):
ALTER USER bdd_lab QUOTA UNLIMITED ON DATA;
-- For DBMS_CLOUD usage:
GRANT EXECUTE ON DBMS_CLOUD TO bdd_lab;
Expected outcome: user bdd_lab exists.
Verification:
SELECT username FROM all_users WHERE username = 'BDD_LAB';
4.3 Create a DBMS_CLOUD credential
Connect as bdd_lab and run:
BEGIN
DBMS_CLOUD.CREATE_CREDENTIAL(
credential_name => 'OBJ_STORE_CRED',
username => '<your_oci_username>',
password => '<your_auth_token>'
);
END;
/
Expected outcome: credential is created.
Verification:
SELECT credential_name FROM user_credentials WHERE credential_name = 'OBJ_STORE_CRED';
If you can’t use an auth token due to org policy, use an approved pattern (for example, resource principals in some OCI services). Follow your security team’s guidance.
Step 5: Load the CSV from Object Storage into a raw table
5.1 Create a raw staging table
CREATE TABLE sales_raw (
order_id NUMBER,
order_ts VARCHAR2(30),
customer_id VARCHAR2(20),
region VARCHAR2(50),
product VARCHAR2(100),
qty NUMBER,
unit_price NUMBER(10,2),
status VARCHAR2(20)
);
5.2 Identify the Object Storage URL
In the bucket object details, find the Object URL. OCI also provides a “URI” format you can use.
A common pattern is:
- Object Storage endpoint: https://objectstorage.<region>.oraclecloud.com
- Namespace + bucket + object path: /n/<namespace>/b/<bucket>/o/<object>
So the full URL looks like:
https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/bdd-lab-raw/o/sales_raw.csv
Use the exact URL from your console to avoid mistakes.
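The URL assembly can be sketched in a few lines of Python. The region and namespace values below are hypothetical placeholders; use the values from your own console:

```python
# Sketch: assembling the native Object Storage URL from its parts.
from urllib.parse import quote

def object_url(region, namespace, bucket, obj):
    # Object names may contain characters that need URL-encoding
    return (f"https://objectstorage.{region}.oraclecloud.com"
            f"/n/{namespace}/b/{bucket}/o/{quote(obj)}")

print(object_url("us-phoenix-1", "mytenancy", "bdd-lab-raw", "sales_raw.csv"))
# https://objectstorage.us-phoenix-1.oraclecloud.com/n/mytenancy/b/bdd-lab-raw/o/sales_raw.csv
```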
5.3 Load using DBMS_CLOUD
BEGIN
DBMS_CLOUD.COPY_DATA(
table_name => 'SALES_RAW',
credential_name => 'OBJ_STORE_CRED',
file_uri_list => 'https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/bdd-lab-raw/o/sales_raw.csv',
format => JSON_OBJECT(
'type' VALUE 'csv',
'skipheaders' VALUE '1',
'delimiter' VALUE ',',
'ignoremissingcolumns' VALUE 'true'
)
);
END;
/
Expected outcome: rows are loaded into SALES_RAW.
Verification:
SELECT COUNT(*) AS row_count FROM sales_raw;
SELECT * FROM sales_raw FETCH FIRST 5 ROWS ONLY;
Step 6: Profile the data (Big Data Discovery-style checks)
Run quick profiling queries similar to what Big Data Discovery would surface:
6.1 Null checks
SELECT
-- TRIM returns NULL for NULL, empty, and all-blank values, so one check suffices
SUM(CASE WHEN TRIM(customer_id) IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
COUNT(*) AS total_rows
FROM sales_raw;
6.2 Duplicate detection
SELECT order_id, COUNT(*) AS cnt
FROM sales_raw
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC;
6.3 Distribution by region/status
SELECT region, status, COUNT(*) AS cnt
FROM sales_raw
GROUP BY region, status
ORDER BY cnt DESC;
Expected outcome: you identify:
- Missing customer_id rows
- Duplicate order_id rows
- Basic frequency breakdowns
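The same three checks can be expressed in plain Python, which is handy for spot-checking a CSV before it ever reaches the database. This sketch runs against a tiny in-memory sample shaped like SALES_RAW; the rows and values are invented for illustration.

```python
from collections import Counter

# Tiny sample shaped like SALES_RAW (values are made up).
rows = [
    {"order_id": 1, "customer_id": "C1", "region": "east", "status": "SHIPPED"},
    {"order_id": 2, "customer_id": "  ",  "region": "west", "status": "NEW"},
    {"order_id": 2, "customer_id": "C3", "region": "west", "status": "NEW"},
]

# 6.1 Null/blank check (mirrors TRIM(customer_id) IS NULL)
null_customer_id = sum(
    1 for r in rows if r["customer_id"] is None or not r["customer_id"].strip()
)

# 6.2 Duplicate detection (mirrors GROUP BY order_id HAVING COUNT(*) > 1)
dupes = {k: c for k, c in Counter(r["order_id"] for r in rows).items() if c > 1}

# 6.3 Distribution by region/status
dist = Counter((r["region"], r["status"]) for r in rows)

print(null_customer_id, dupes, dist.most_common(1))
```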
Step 7: Transform into a curated table (clean + dedupe + typed timestamp)
7.1 Create a curated table
This example:
- Parses ISO timestamps
- Deduplicates by keeping the first row per order_id (simple rule)
- Filters out rows missing customer_id (business rule example)
CREATE TABLE sales_curated AS
WITH typed AS (
SELECT
order_id,
-- "Z" is matched as a quoted literal here; the result uses the session time zone
TO_TIMESTAMP_TZ(order_ts, 'YYYY-MM-DD"T"HH24:MI:SS"Z"') AS order_ts_tz,
TRIM(customer_id) AS customer_id,
LOWER(TRIM(region)) AS region,
TRIM(product) AS product,
qty,
unit_price,
UPPER(TRIM(status)) AS status
FROM sales_raw
),
deduped AS (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts_tz) AS rn
FROM typed t
)
SELECT
order_id,
order_ts_tz,
customer_id,
region,
product,
qty,
unit_price,
status,
(qty * unit_price) AS line_amount
FROM deduped
WHERE rn = 1
AND customer_id IS NOT NULL;
Expected outcome: SALES_CURATED exists and is cleaner.
Verification:
SELECT COUNT(*) AS curated_rows FROM sales_curated;
SELECT * FROM sales_curated ORDER BY order_id;
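To make the dedupe-and-clean rule concrete outside SQL, here is the same logic sketched in plain Python: normalize fields, keep the earliest row per order_id, drop rows with a blank customer_id, and derive line_amount. The sample rows are invented for illustration; ISO-8601 timestamp strings sort chronologically, which is what makes the simple sort work.

```python
# Invented sample rows shaped like SALES_RAW.
raw = [
    {"order_id": 1, "order_ts": "2024-01-02T10:00:00Z", "customer_id": "C1",
     "region": " East ", "qty": 2, "unit_price": 5.0},
    {"order_id": 1, "order_ts": "2024-01-01T09:00:00Z", "customer_id": "C1",
     "region": "east", "qty": 2, "unit_price": 5.0},
    {"order_id": 2, "order_ts": "2024-01-03T11:00:00Z", "customer_id": "  ",
     "region": "west", "qty": 1, "unit_price": 9.0},
]

# "typed" CTE equivalent: trim/normalize and derive line_amount.
typed = [
    {**r,
     "customer_id": (r["customer_id"] or "").strip(),
     "region": r["region"].strip().lower(),
     "line_amount": r["qty"] * r["unit_price"]}
    for r in raw
]

# ROW_NUMBER() equivalent: first row per order_id, ordered by timestamp.
first_per_order = {}
for r in sorted(typed, key=lambda r: (r["order_id"], r["order_ts"])):
    first_per_order.setdefault(r["order_id"], r)

# Business rule: drop rows with a blank customer_id.
curated = [r for r in first_per_order.values() if r["customer_id"]]
print(len(curated), curated[0]["order_ts"], curated[0]["line_amount"])
```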
Step 8: “Publish” a consumption-friendly view
In Big Data Discovery-style workflows, publishing often means creating a stable dataset interface for BI/analytics consumers.
CREATE OR REPLACE VIEW sales_summary_v AS
SELECT
region,
status,
COUNT(*) AS orders,
SUM(line_amount) AS revenue
FROM sales_curated
GROUP BY region, status;
Verification:
SELECT * FROM sales_summary_v ORDER BY revenue DESC;
Expected outcome: a stable view for dashboards and reports.
Validation
You should be able to confirm all of the following:
- Object exists in Object Storage: bucket bdd-lab-raw contains sales_raw.csv
- Data loaded: SELECT COUNT(*) FROM sales_raw; returns the expected row count (8 in the sample)
- Curated table created: SELECT COUNT(*) FROM sales_curated; returns fewer rows than raw (because of null removal + dedupe)
- Published summary view works: SELECT * FROM sales_summary_v; returns region/status rollups
Troubleshooting
Error: ORA-20000: ... Unauthorized or 401 when running DBMS_CLOUD.COPY_DATA
Likely causes:
- Wrong auth token (token not copied correctly)
- Wrong username (OCI username mismatch)
- Wrong Object Storage URL/namespace/region
- IAM policy does not allow Object Storage access
Fixes:
- Regenerate the auth token and recreate the credential.
- Copy the Object URL directly from the console.
- Confirm your user/group has permissions for Object Storage in that compartment.
Error: ORA-00942: table or view does not exist when selecting user_credentials
- Query the USER_CREDENTIALS view (as shown) and confirm your privileges.
- Make sure you created the credential in the same schema you’re querying from.
Error: timestamp parsing fails
- Confirm the timestamp format in your CSV.
- Adjust the TO_TIMESTAMP_TZ format mask accordingly.
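One quick way to sanity-check the timestamp format is to parse a sample value outside the database first. This Python sketch handles the trailing "Z" used in the lab's CSV; `datetime.fromisoformat` accepts "Z" directly on Python 3.11+, so the `replace` below keeps it portable to older versions.

```python
from datetime import datetime, timezone

# Sample value in the lab's assumed CSV timestamp format.
ts = "2024-01-15T09:30:00Z"

# Normalize "Z" to an explicit UTC offset for older Python versions.
parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
print(parsed.isoformat())
```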
Data looks duplicated or inconsistent
- Review the dedupe rule (ROW_NUMBER() partitioned by order_id).
- In real datasets, you may need a more robust business key and ordering rule.
Cleanup
To avoid ongoing costs, remove resources when done:
- Drop database objects
DROP VIEW sales_summary_v;
DROP TABLE sales_curated PURGE;
DROP TABLE sales_raw PURGE;
BEGIN
DBMS_CLOUD.DROP_CREDENTIAL('OBJ_STORE_CRED');
END;
/
- Delete Autonomous Database: OCI Console → Autonomous Database → bdd_lab_adw → Terminate
- Delete Object Storage object and bucket: Buckets → bdd-lab-raw → delete sales_raw.csv, then delete the bucket bdd-lab-raw
- Remove IAM auth token: Identity → Users → your user → Auth Tokens → delete bdd-lab-dbms-cloud
- Optionally delete the compartment bdd-lab (only after it is empty).
11. Best Practices
Architecture best practices
- Treat discovery/prep as a repeatable pipeline, not a one-time activity.
- Separate zones:
- Raw zone (immutable, append-only) in Object Storage
- Curated zone (cleaned, standardized)
- Serving zone (database views/tables for BI)
- Prefer open formats (CSV for ingestion, Parquet for curated at scale) when building lake patterns.
IAM/security best practices
- Enforce least privilege:
- Bucket read-only for consumers
- Write access only for pipeline identities
- Use compartments per environment (dev/test/prod).
- Prefer private access paths:
- Autonomous Database private endpoint
- Object Storage via Service Gateway in a VCN (where feasible)
- Rotate credentials (auth tokens) and avoid embedding secrets in scripts.
Cost best practices
- Use Always Free resources for labs where possible.
- Implement Object Storage lifecycle rules (delete/archive old staging outputs).
- Avoid duplicating large datasets into databases unless necessary.
- Monitor egress and cross-region data movement.
Performance best practices
- For large data:
- Do transforms in scalable engines (Spark/SQL) rather than interactive tools
- Partition and compress curated datasets
- In databases:
- Use appropriate indexing/materialized views for BI query patterns
- Avoid SELECT * in production semantic layers
Reliability best practices
- Make transformations idempotent.
- Version curated datasets and keep lineage of rules.
- Automate loads and validation checks.
- Use backups and retention (database and object storage) aligned to RPO/RTO.
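To make "idempotent" concrete: a load that upserts by business key can be re-run safely after a failure, whereas a blind append duplicates rows on every retry. A minimal sketch of the idea, using an in-memory dict as a stand-in for the target table:

```python
# Stand-in for the target table, keyed by the business key (order_id).
target = {}

def load(rows):
    # Upsert: re-running the same batch overwrites rows, never duplicates them.
    for r in rows:
        target[r["order_id"]] = r

batch = [{"order_id": 1, "qty": 2}, {"order_id": 2, "qty": 1}]
load(batch)
load(batch)          # a retry of the same batch is a no-op in effect
print(len(target))   # 2, not 4
```

In SQL terms, this corresponds to using MERGE (or truncate-and-reload) instead of plain INSERT for repeatable pipeline runs.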
Operations best practices
- Centralize logs and metrics:
- OCI Logging for OCI services
- Database auditing for data access
- Use tagging: CostCenter, Environment, Owner, DataDomain
- Define SLOs for data freshness and pipeline success rate.
Governance/tagging/naming best practices
- Standard naming: bdd-<env>-raw, bdd-<env>-curated
- Data cataloging:
- Document dataset purpose, owners, and sensitivity classification.
- Apply consistent tags to buckets, DBs, and networking resources.
12. Security Considerations
Identity and access model
- In OCI, access to Object Storage and databases is governed by IAM policies.
- For discovery workflows:
- Create distinct identities for pipelines vs human users.
- Limit who can access raw sensitive data.
Encryption
- OCI Object Storage encrypts data at rest by default (service-managed keys are typical).
- Autonomous Database encrypts storage at rest by default.
- For stricter requirements, use customer-managed keys via OCI Vault (verify service support and configuration requirements).
Network exposure
- Avoid public Autonomous Database access for production.
- Restrict access with IP allowlists if public access is required.
- Prefer private endpoints and controlled ingress via bastion hosts.
Secrets handling
- Do not store auth tokens in plaintext files or source control.
- Prefer OCI Vault for secret storage and rotation where applicable.
- In this tutorial lab, you used an auth token; in production, design a safer credential strategy.
Audit/logging
- Use OCI Audit to track control-plane actions (bucket creation, DB changes).
- Use database auditing for data access and schema changes.
- Store logs in a central, tamper-resistant logging account if required.
Compliance considerations
- Classify data: PII/PHI/PCI.
- Apply data minimization: do not copy sensitive raw data to too many places.
- Enforce retention and deletion policies.
- Validate region/legal constraints for data residency.
Common security mistakes
- Granting broad “manage all-resources” policies to analysts.
- Leaving public buckets or permissive pre-authenticated requests.
- Using shared credentials across teams.
- No audit trail for who accessed sensitive columns.
Secure deployment recommendations
- Compartmentalize by environment and sensitivity.
- Use private networking for databases and processing.
- Implement a data access review process.
- Standardize dataset publishing with documented schemas and ownership.
13. Limitations and Gotchas
Big Data Discovery product lifecycle and availability
- Big Data Discovery may be legacy or not available as an OCI native managed service in many tenancies.
- Documentation may exist while new deployments are limited.
- Verify in official Oracle docs and with Oracle before committing to it long-term.
Quotas and limits (OCI lab)
- Autonomous Database Always Free has compute/storage limits.
- Object Storage has service limits (objects, requests) at tenancy level.
- Large CSV loads can be slower than parquet-based pipelines.
Regional constraints
- Always Free availability differs by region.
- Some services (like Oracle Analytics Cloud) may not be available in all regions.
Pricing surprises
- Data egress costs can surprise teams if they export large datasets to the internet.
- Storing many curated copies can multiply storage cost.
Compatibility issues
- CSV parsing is error-prone: delimiter issues, quoting, encoding.
- Timestamp formats often break ingestion unless standardized.
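Two of the classic CSV pitfalls mentioned above are quoted fields that contain the delimiter and quoted fields that contain newlines. Naive splitting on "," mishandles both, while a real CSV parser does not, which is why loader options (delimiter, quoting) matter. A small Python demonstration:

```python
import csv
import io

# A quoted field containing the delimiter, and one containing a newline.
data = 'order_id,product\n1,"Widget, large"\n2,"Multi\nline name"\n'

# Naive approach: split the second physical line on commas.
naive = data.splitlines()[1].split(",")

# Correct approach: let the csv module handle quoting and embedded newlines.
rows = list(csv.reader(io.StringIO(data)))[1:]

print(naive)   # the quoted field is broken apart
print(rows)    # both records parse correctly
```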
Operational gotchas
- Ad-hoc “discovery transforms” can become production dependencies. Treat them as code where possible.
- Without governance, multiple “curated” datasets may diverge and confuse consumers.
Migration challenges
- If migrating from a Hadoop-centric Big Data Discovery environment:
- Mapping transformations to Spark/SQL pipelines can require rework.
- Access patterns may change (index-based exploration vs SQL queries).
- Plan for data format conversion and partitioning.
Vendor-specific nuances
- Oracle’s big data and analytics tooling portfolio evolves. What replaces Big Data Discovery depends on your target architecture (data lake, warehouse, or lakehouse pattern).
14. Comparison with Alternatives
Nearest options in Oracle Cloud (OCI)
- Oracle Analytics Cloud (OAC): BI + visualization + data prep features (licensed service).
- OCI Data Integration: managed ETL/ELT orchestration for moving and transforming data (focus on pipelines rather than interactive discovery).
- OCI Data Flow (Apache Spark): scalable processing; requires engineering patterns, not a discovery UI.
- OCI Big Data Service: managed Hadoop ecosystem for customers who need it.
- Autonomous Database + APEX/Database Actions: fast SQL-based profiling and lightweight exploration.
Nearest options in other clouds
- AWS: Glue + Athena + Lake Formation + QuickSight
- Azure: Data Factory + Synapse + Purview + Power BI
- GCP: Dataproc + BigQuery + Dataplex + Looker
Open-source/self-managed alternatives
- Apache Superset (BI exploration)
- Trino/Presto + a metastore + a BI tool
- Jupyter notebooks + Spark
- OpenSearch/Kibana for log/event exploration
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Big Data Discovery (Oracle) | Existing Oracle big data deployments needing interactive discovery | Integrated discovery + prep experience for big data (deployment dependent) | Availability/lifecycle uncertainty for new OCI projects; may be legacy | You already have it and it meets requirements; short-to-mid term use while planning roadmap |
| Oracle Analytics Cloud | BI dashboards + governed analytics | Strong visualization, semantic modeling, enterprise BI capabilities | Licensed cost; may require curated datasets | You need enterprise BI and a supported OCI analytics roadmap |
| OCI Data Integration | ETL/ELT orchestration | Managed pipelines, scheduling, connectors | Not primarily interactive discovery | You need repeatable data movement and transformations |
| OCI Data Flow (Spark) | Large-scale processing | Scales Spark without managing clusters | Engineering-heavy; no “discovery UI” | You need big transformations at scale |
| Autonomous Database + SQL tools | SQL-centric profiling + curated serving | Fast iteration with SQL, strong governance controls | Not a specialized discovery UI | You want a cost-effective, governable curated layer on OCI |
| AWS Glue + Athena + QuickSight | AWS-native lake analytics | Strong managed lake query and BI ecosystem | AWS lock-in; cost can grow with query volume | Your platform is AWS and you want managed lake analytics |
| Open-source (Trino/Superset) | Custom lakehouse stacks | Flexibility, open formats | Ops burden, security/integration work | You have strong platform engineering and want maximum portability |
15. Real-World Example
Enterprise example: telecom data quality and churn analytics
- Problem: A telecom company ingests billions of network events and customer interactions. Analysts struggle to explore raw data and detect schema drift and quality issues before reporting.
- Proposed architecture:
- Raw events land in OCI Object Storage (raw zone).
- Transformations run via Spark (OCI Data Flow) to generate curated customer-event aggregates.
- Curated datasets stored in Autonomous Data Warehouse for high-concurrency BI.
- BI dashboards in Oracle Analytics Cloud.
- Governance with IAM policies, auditing, and data classification tags.
- Why Big Data Discovery was chosen (or evaluated):
- Historically, it offered a self-service discovery layer on big-data stores for analysts.
- In a modernization effort, the company maps those workflows to OCI-native services for long-term supportability.
- Expected outcomes:
- Faster detection of bad data (null spikes, outliers).
- Reduced time to publish curated datasets for churn models.
- Stronger access controls and auditability.
Startup/small-team example: e-commerce order analytics
- Problem: A small team needs quick insight into order data quality and basic revenue reporting without hiring a full data platform team.
- Proposed architecture:
- Store raw CSV exports in OCI Object Storage.
- Load into Autonomous Database Always Free (for small volumes).
- Use SQL views as the published semantic layer.
- Lightweight exploration via Database Actions/APEX; later add Oracle Analytics Cloud if needed.
- Why Big Data Discovery-style approach:
- The team needs the workflow outcome (discover → clean → publish) more than the specific legacy product.
- Expected outcomes:
- Clean, deduplicated order dataset.
- Simple dashboards and reports with minimal operational overhead.
- Low costs and easy scaling path.
16. FAQ
1) Is Big Data Discovery an OCI native managed service?
In many OCI tenancies, Big Data Discovery does not appear as a standard OCI managed service. It has historically been delivered as part of Oracle’s broader big data product stack. Verify current availability in official Oracle documentation and your OCI tenancy/service catalog.
2) What is Big Data Discovery primarily used for?
Interactive data discovery, profiling, preparation, and publishing curated datasets—often for big-data sources like Hadoop ecosystems.
3) Is Big Data Discovery the same as Oracle Analytics Cloud?
No. Oracle Analytics Cloud is an enterprise BI/analytics service. Big Data Discovery focuses more on discovery and preparation over big data sources (though there is conceptual overlap).
4) What replaced Big Data Discovery on OCI?
There isn’t always a single 1:1 replacement. Many teams implement the workflow using Object Storage + Data Flow + Autonomous Database + Oracle Analytics Cloud. Choose based on your needs and Oracle’s current roadmap.
5) Can I still follow this tutorial if I don’t have Big Data Discovery?
Yes. The hands-on lab is designed to be executable using common OCI services and recreates Big Data Discovery-style outcomes.
6) Does the lab require Oracle Analytics Cloud?
No. The lab uses Autonomous Database SQL and views for exploration. You can optionally connect OAC or another BI tool later.
7) What dataset size is appropriate for the lab?
Start small (MBs to a few GB). Always Free Autonomous Database is limited, and CSV loads are not optimized for huge datasets.
8) What’s the best storage format for curated big data on OCI?
For large-scale lake analytics, columnar formats like Parquet are common. For simple ingestion, CSV is fine but less efficient.
9) How do I control who can access raw vs curated data?
Use IAM policies and separate buckets/compartments. In the database, use schemas/roles and views to restrict columns and rows.
10) How do I avoid copying sensitive data into too many places?
Apply data minimization: keep raw data in one controlled zone, publish only necessary curated datasets, and enforce retention policies.
11) What’s the biggest operational risk with discovery tools?
Un-governed transformations can become “shadow production” logic. Treat curated outputs as products: versioning, testing, and ownership.
12) How do I monitor this pipeline?
Use OCI Audit for control-plane actions, database auditing for data access, and (if you add Spark/ETL) use the service’s logging + OCI Logging.
13) Can Autonomous Database load directly from Object Storage securely?
Yes, using DBMS_CLOUD with credentials. For more secure designs, evaluate private networking and approved credential storage patterns.
14) What’s the most common ingestion error?
Incorrect Object Storage URL/namespace or invalid credentials, leading to authorization failures.
15) Should I build a new long-term platform around Big Data Discovery today?
Only after confirming lifecycle status and supportability for your organization. For many OCI-first projects, OCI-native services provide a clearer forward path.
17. Top Online Resources to Learn Big Data Discovery
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation (search) | Oracle Help Center search for “Big Data Discovery”: https://docs.oracle.com/en/search/?q=Big%20Data%20Discovery | Safest starting point to find the correct versioned docs and guides |
| Official documentation (platform context) | Oracle Big Data Appliance documentation (Oracle Help Center): https://docs.oracle.com/en/ | Big Data Discovery is often discussed in the context of Oracle’s big data platform; use this to locate install/admin context |
| Official pricing | OCI Pricing: https://www.oracle.com/cloud/pricing/ | For pricing of OCI services used as alternatives (Object Storage, Data Flow, etc.) |
| Official cost estimator | OCI Cost Estimator: https://www.oracle.com/cloud/costestimator.html | Build region-specific estimates without guessing |
| Official docs (Object Storage) | OCI Object Storage docs: https://docs.oracle.com/en-us/iaas/Content/Object/home.htm | Required for secure bucket design, URLs, lifecycle policies |
| Official docs (Autonomous Database) | Autonomous Database docs: https://docs.oracle.com/en/cloud/paas/autonomous-database/ | Covers provisioning, security, Database Actions, connectivity |
| Official docs (DBMS_CLOUD) | DBMS_CLOUD documentation (in Autonomous DB docs): https://docs.oracle.com/en/cloud/paas/autonomous-database/adbsa/ | Authoritative reference for loading from Object Storage |
| Official docs (IAM) | OCI IAM docs: https://docs.oracle.com/en-us/iaas/Content/Identity/home.htm | Correct policy patterns and least privilege guidance |
| Official videos | Oracle Cloud YouTube channel: https://www.youtube.com/@OracleCloud | Often includes practical demos for OCI data services (verify availability for specific topics) |
| Community learning | Oracle Cloud Customer Connect: https://community.oracle.com/customerconnect/ | Practical Q&A with Oracle community and product teams (verify advice against docs) |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | Engineers, DevOps, architects | Cloud/DevOps fundamentals, automation, CI/CD; check for OCI data tracks | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | DevOps learners, build/release teams | SCM, DevOps, tooling foundations; may complement cloud labs | Check website | https://www.scmgalaxy.com/ |
| CLoudOpsNow.in | Cloud ops teams, SREs | Cloud operations practices, monitoring, reliability | Check website | https://cloudopsnow.in/ |
| SreSchool.com | SREs, platform engineers | SRE practices, observability, incident management | Check website | https://sreschool.com/ |
| AiOpsSchool.com | Ops + data/AI practitioners | AIOps concepts, automation, operational analytics | Check website | https://aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/cloud training content (verify specific OCI coverage) | Beginners to intermediate engineers | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and mentoring | DevOps practitioners | https://devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps/services platform | Teams seeking hands-on help or coaching | https://devopsfreelancer.com/ |
| devopssupport.in | DevOps support/training resources | Ops teams needing implementation support | https://devopssupport.in/ |
20. Top Consulting Companies
| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify offerings) | Architecture reviews, implementation help, operations | Designing OCI landing zones; setting up CI/CD for data pipelines; cost optimization reviews | https://cotocus.com/ |
| DevOpsSchool.com | Training + consulting services (verify scope) | Upskilling teams; implementing DevOps practices around cloud workloads | Building automation for OCI deployments; operational best practices for data platforms | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify offerings) | DevOps transformations, tooling, operational maturity | Implementing observability; infrastructure automation; governance processes | https://devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Big Data Discovery-style work
- Data fundamentals: CSV/JSON/Parquet, schemas, partitioning
- SQL fundamentals: joins, aggregates, window functions
- Cloud basics: IAM, networking, compartments/projects, encryption
- Object storage concepts: buckets, prefixes, lifecycle rules
- Basic data governance: data classification and least privilege
What to learn after
- Spark fundamentals (especially if using OCI Data Flow)
- Data modeling for analytics (star schema, slowly changing dimensions)
- CI/CD for data pipelines (testing, versioning transformations)
- Observability for data platforms (data quality checks, pipeline SLAs)
- BI semantic modeling (Oracle Analytics Cloud or equivalent)
Job roles that use it (or equivalent workflows)
- Data Engineer
- Analytics Engineer
- BI Developer / BI Engineer
- Cloud Data Architect
- Platform Engineer (data platform)
- Data Governance / Data Quality Engineer
Certification path (if available)
- For Big Data Discovery specifically, certification availability may be limited depending on lifecycle.
- For OCI pathways, consider Oracle’s OCI certifications (associate/professional) relevant to:
- Cloud infrastructure
- Data management
- Analytics
Verify the current catalog on Oracle University: https://education.oracle.com/
Project ideas for practice
- Build a raw-to-curated pipeline for web logs and publish a KPI view.
- Implement data quality checks (null thresholds, uniqueness constraints) with automated alerts.
- Create a curated dataset with PII masking and column-level access controls.
- Cost-optimization exercise: lifecycle policies + partition strategies.
22. Glossary
- Big Data Discovery: Oracle product for interactive discovery, profiling, and preparation of large datasets (availability and packaging vary).
- OCI (Oracle Cloud Infrastructure): Oracle’s public cloud platform.
- Object Storage: Durable blob storage for files/objects in buckets.
- Autonomous Database: Managed Oracle database with automated operations; includes ADW/ATP.
- ADW (Autonomous Data Warehouse): Autonomous Database workload optimized for analytics.
- DBMS_CLOUD: Oracle-supplied PL/SQL package commonly used to load data into Autonomous Database from Object Storage and other cloud locations.
- Raw zone: Storage location for unmodified ingested data.
- Curated zone: Storage or tables with cleaned, standardized, analysis-ready data.
- Serving layer: Optimized data structures (tables/views) used by BI tools and applications.
- Faceted search: Exploration approach where users filter by attribute “facets” (e.g., region, status).
- Schema drift: Changes in incoming data schema over time (new columns, changed types).
- Least privilege: Security principle of granting only the permissions required to perform a task.
- Egress: Outbound data transfer from a cloud to the internet or another region.
23. Summary
Big Data Discovery is an Oracle offering aimed at interactive exploration, profiling, preparation, and publishing of large datasets, historically aligned with Oracle’s big data platform deployments. It matters because it addresses a consistent pain point in analytics programs: turning raw, messy data into trusted, consumable datasets quickly.
In Oracle Cloud (OCI), Big Data Discovery may not be present as a standard managed service in many tenancies, so treat its lifecycle and availability as something to verify in official Oracle documentation and with Oracle Sales/Support. For most new OCI projects, the practical approach is to implement the same workflow using OCI-native building blocks: Object Storage for the lake, Autonomous Database for curated/serving datasets, and (optionally) Spark-based processing and enterprise BI.
Key cost and security points:
- Costs come mainly from storage growth, compute for transformations, and BI licensing.
- Secure designs rely on compartments, least-privilege IAM policies, private networking for databases, encryption, and auditing.
When to use it:
- Use Big Data Discovery if you already have it and it aligns with your platform.
- Use OCI-native services to achieve the same outcomes when building forward-looking architectures on Oracle Cloud.
Next step: run the hands-on lab in this guide, then expand it by adding a scalable processing layer (OCI Data Flow) and a governed BI layer (Oracle Analytics Cloud) as your requirements grow.