Category
Other Services
1. Introduction
Big Data Discovery is an Oracle product designed to help people explore, profile, transform, and visualize large datasets—especially data stored in Hadoop ecosystems—without requiring every user to write code.
In simple terms: Big Data Discovery is a “data exploration and preparation” tool for big data. It lets analysts and engineers ingest data (often from Hadoop), understand it quickly (profiling and sampling), clean it (enrichment and transforms), and publish curated datasets for downstream analytics.
In more technical terms: Big Data Discovery combines a browser-based exploration experience with a backend processing and indexing layer that can work with big-data storage and query engines. It was commonly positioned alongside Oracle Big Data Appliance and related Oracle big-data components. It overlaps conceptually with modern “data prep + exploration + visualization” workflows that today are frequently implemented using cloud-native services (data lake, Spark, SQL engines, and BI).
What problem it solves: teams often have large, messy datasets with unknown schema quality, missing values, inconsistent formatting, and unclear distributions. Big Data Discovery addresses the “time-to-first-insight” gap by providing interactive discovery, profiling, and transformation workflows so teams can build trusted, analysis-ready datasets faster.
Important lifecycle note (read first): Oracle’s “Big Data Discovery” has historically existed as a product integrated with Oracle’s big data stack (commonly associated with Oracle Big Data Appliance) and was also offered in some form in older Oracle Cloud environments. In many Oracle Cloud Infrastructure (OCI) tenancies today, Big Data Discovery does not appear as a native OCI managed service in the console. Availability, lifecycle status (active vs. legacy), and procurement/licensing can vary by Oracle program and contract. Verify current availability and lifecycle status in official Oracle documentation and with Oracle Sales/Support before designing new long-term architectures around it.
Because of that reality, this tutorial does two things:
1. Teaches Big Data Discovery accurately as a product and where it fits.
2. Provides a practical, executable OCI lab that recreates a “Big Data Discovery-style” workflow using current Oracle Cloud services (Object Storage + Autonomous Database + built-in analytics tools). This is often the most practical approach for new projects on OCI.
2. What is Big Data Discovery?
Official purpose (product intent)
Big Data Discovery is intended to help users:
- Connect to large data sources (commonly Hadoop/Hive/HDFS in Oracle big data deployments).
- Explore datasets interactively (search, filter, facet, profile).
- Perform data preparation (cleaning, enrichment, transformations).
- Publish curated datasets for analytics and reporting.
If you have Big Data Discovery in your environment, consult the Oracle Big Data Discovery documentation set available through Oracle Help Center (see resources at the end) for the exact version and supported integrations.
Core capabilities (what it typically provides)
Big Data Discovery capabilities, as described in Oracle materials for the product, typically include:
- Dataset ingestion from big-data repositories and structured sources (implementation depends on deployment).
- Data profiling (type inference, cardinality, distributions, outliers).
- Search and faceted exploration for fast slicing/dicing of large datasets.
- Data preparation workflows (standardization, parsing, filtering, joining, deriving fields).
- Publishing/export of curated outputs for downstream BI or data science workflows.
Caveat: exact connectors, processing engines, and export targets are version- and deployment-dependent. Verify in the documentation for your Big Data Discovery version.
Major components (conceptual)
A typical Big Data Discovery deployment historically included:
- A web-based “Studio” experience for interactive discovery and preparation.
- A processing layer to run transformations at scale (often integrated with big-data processing frameworks in the environment).
- An indexing/search layer enabling fast interactive filtering and faceting.
- Admin/configuration components for connectivity, security, and operations.
Names and internal architecture details vary by release. Use the official admin and installation guides for your exact build.
Service type
In the Oracle Cloud “Other Services” category context, it’s best to think of Big Data Discovery as a product/workload that you run as part of a broader big-data platform, not necessarily a first-class “OCI native managed service” (like Object Storage or Autonomous Database).
Scope: regional/global/zonal?
Because Big Data Discovery is not universally exposed as a native OCI resource, “scope” is generally:
- Deployment-scoped: it runs where you deploy it (on-prem, appliance, or customer-managed compute).
- Its effective availability is determined by your infrastructure and licensing rather than OCI region catalogs.
How it fits into the Oracle Cloud ecosystem
In modern OCI architectures, Big Data Discovery’s role is often fulfilled by a combination of:
- Oracle Cloud Infrastructure Object Storage (data lake storage)
- OCI Data Flow (Apache Spark) or OCI Big Data Service (processing)
- Autonomous Database (curated/serving layer)
- Oracle Analytics Cloud (BI/visualization)
- OCI Data Integration (ETL/ELT orchestration)
So even when Big Data Discovery itself is not used, the workflow it represents remains a common requirement: interactive discovery + preparation + publishing trusted datasets.
3. Why use Big Data Discovery?
Business reasons
- Faster insight from large datasets: reduce the time spent just understanding data shape and quality.
- Self-service discovery: analysts can explore data without waiting for custom engineering pipelines for every question.
- Improved data trust: profiling and preparation steps help produce cleaner datasets for decision-making.
Technical reasons
- Interactive exploration over large data: supports discovery patterns that SQL-only workflows can make slower for ad-hoc questions (depending on indexing/engine).
- Repeatable transformations: data prep can be standardized and reused.
- Bridge between raw data and analytics: publish curated outputs for BI, ML, or downstream reporting.
Operational reasons
- Standardized discovery tooling: reduces “spreadsheet chaos” and inconsistent local scripts.
- Governance alignment: centralized platform is easier to govern than scattered personal scripts (when deployed and managed properly).
Security/compliance reasons
- Centralized access control and auditability (deployment-dependent).
- Reduced need to copy data to unmanaged endpoints for exploration.
Scalability/performance reasons
- Designed for large datasets and big data ecosystems (particularly where deployed with Hadoop-related storage/query engines).
- Supports sampling and summary-based exploration patterns to keep UIs responsive.
When teams should choose it
Choose Big Data Discovery when:
- You already have it licensed/deployed (or part of an Oracle big data platform) and it matches your sources.
- You need an interactive data prep and discovery experience tightly integrated with your big data environment.
- You have the operational maturity to maintain the platform (patching, scaling, governance).
When teams should not choose it
Avoid Big Data Discovery when:
- You’re starting greenfield on OCI and need a fully managed, roadmap-forward cloud service (Big Data Discovery may be legacy for many customers).
- Your main need is BI dashboards over curated data (Oracle Analytics Cloud may be a simpler fit).
- You want open lakehouse formats (Parquet/Iceberg/Delta) with modern query engines and minimal proprietary dependencies; evaluate OCI Data Flow + Trino/Presto patterns instead.
- You cannot staff platform operations (customer-managed software can be operationally heavy).
4. Where is Big Data Discovery used?
Industries
- Financial services (risk analytics, fraud exploration, compliance datasets)
- Retail/e-commerce (clickstream exploration, product analytics)
- Telecom (CDR exploration, network event analytics)
- Manufacturing/IoT (sensor data quality and anomaly exploration)
- Healthcare (claims analytics, operational reporting; subject to strict compliance)
- Public sector (case analytics, citizen service data quality)
Team types
- Data analysts and BI teams doing exploratory work
- Data engineering teams preparing curated datasets
- Data science teams validating features and distributions
- Platform teams standardizing data exploration tooling
Workloads
- Exploratory data analysis (EDA) over big-data repositories
- Data quality and profiling at scale
- Building curated datasets from raw lakes
- Publishing “analysis-ready” datasets for BI/ML
Architectures
- Hadoop-centric environments (historically common)
- Data lake + processing + serving layer patterns
- Hybrid: on-prem big data + cloud analytics serving
Real-world deployment contexts
- Existing Oracle big data platforms where Big Data Discovery is already part of the stack
- Migration scenarios: using Big Data Discovery outputs to transition to OCI analytics services
Production vs dev/test usage
- Production: curated dataset publishing, governed exploration, standardized transformations.
- Dev/test: discovery of new sources, profiling, POCs for analytics use cases.
5. Top Use Cases and Scenarios
Below are realistic scenarios where Big Data Discovery (or a Big Data Discovery-style workflow) is commonly applied.
1) Data lake profiling before onboarding to analytics
- Problem: raw data arrives with unknown schema drift and inconsistent quality.
- Why Big Data Discovery fits: profiling + interactive exploration helps teams understand distributions, nulls, and anomalies quickly.
- Example: a retail team receives daily CSV/JSON dumps of transactions and needs to validate fields and detect missing store IDs before building reports.
2) Self-service exploration for analysts on big datasets
- Problem: analysts are blocked waiting for engineering to build custom extracts.
- Why it fits: interactive filtering/faceting reduces dependence on ad-hoc pipelines.
- Example: marketing analysts explore clickstream attributes to identify top referral sources.
3) Preparing curated datasets for BI dashboards
- Problem: BI dashboards fail due to inconsistent formats and dirty dimensions.
- Why it fits: standardized transforms create consistent columns (dates, categories, IDs).
- Example: telecom team standardizes device model strings and publishes a clean “subscriber_device_dim”.
4) Joining heterogeneous sources into a unified dataset
- Problem: data lives in multiple sources (events + reference data) with inconsistent keys.
- Why it fits: preparation steps can include joins/derivations (capability depends on version).
- Example: manufacturer joins sensor readings with equipment metadata for plant-level KPIs.
5) Detecting outliers and data quality issues early
- Problem: pipelines silently ingest bad data leading to wrong decisions.
- Why it fits: profiling and distributions reveal outliers and breaks.
- Example: finance sees a sudden spike in “transaction_amount” due to a unit conversion bug.
6) Publishing datasets for downstream ML feature engineering
- Problem: data scientists spend too long cleaning data before modeling.
- Why it fits: curated, standardized datasets reduce duplicated cleanup.
- Example: fraud modelers receive a prepared dataset with consistent merchant categories and cleaned timestamps.
7) Investigating operational incidents with fast filtering
- Problem: SRE/ops teams need to explore event logs at scale.
- Why it fits: faceted exploration supports quick narrowing by host, error code, region (depending on ingestion).
- Example: ops team investigates “payment_timeouts” across services after a deployment.
8) Compliance reporting dataset preparation
- Problem: compliance needs repeatable datasets with lineage and consistent logic.
- Why it fits: repeatable transformations reduce one-off spreadsheet manipulation.
- Example: bank prepares monthly AML case dataset with standardized customer identifiers.
9) Enrichment and standardization of semi-structured fields
- Problem: addresses, names, product codes are messy.
- Why it fits: transformations can parse and standardize values (exact enrichment varies).
- Example: e-commerce standardizes shipping address fields and extracts postal code.
10) Migration discovery: understanding Hadoop datasets before moving to OCI
- Problem: organizations want to migrate but don’t know which datasets are important or clean.
- Why it fits: discovery identifies high-value datasets and quality issues to plan migration.
- Example: enterprise profiles Hive tables, identifies top-used columns, and prioritizes migration into OCI lakehouse patterns.
6. Core Features
Because Big Data Discovery’s availability and packaging can vary, this section describes commonly documented feature categories. Confirm exact feature availability in your Big Data Discovery version docs.
Feature 1: Interactive data exploration (search/filter/facets)
- What it does: lets users explore datasets by filtering, searching, and slicing attributes interactively.
- Why it matters: reduces time spent writing exploratory queries; accelerates understanding of data.
- Practical benefit: faster ad-hoc investigations and quicker iteration with stakeholders.
- Limitations/caveats: responsiveness depends on indexing/engine configuration and dataset size; some transformations may require batch processing.
Feature 2: Data profiling and summary statistics
- What it does: provides distributions, cardinality, missing values, and type inference.
- Why it matters: data quality issues are common in lakes; profiling surfaces them early.
- Practical benefit: improves downstream pipeline reliability and reduces “silent failures.”
- Limitations/caveats: profiling large datasets may rely on sampling; verify how sampling is configured.
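To make the idea concrete, here is a minimal stdlib-Python sketch (illustrative only, not Big Data Discovery code) of the kind of per-column statistics a profiler computes: null count, cardinality, and a naive inferred type:

```python
# Illustrative column profiler: null rate, cardinality, naive type inference
# over a list of row dicts (stdlib only).
from collections import Counter

def profile_column(rows, col):
    values = [r.get(col) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    # Naive type inference: numeric if the string is all digits
    # (allowing one decimal point and a leading minus sign)
    types = Counter(
        "number" if str(v).replace(".", "", 1).lstrip("-").isdigit() else "string"
        for v in non_null
    )
    return {
        "total": len(values),
        "nulls": len(values) - len(non_null),
        "cardinality": len(set(non_null)),
        "inferred_type": types.most_common(1)[0][0] if types else "unknown",
    }

rows = [
    {"customer_id": "C001", "qty": "1"},
    {"customer_id": "", "qty": "3"},
    {"customer_id": "C001", "qty": "2"},
]
print(profile_column(rows, "customer_id"))
# {'total': 3, 'nulls': 1, 'cardinality': 1, 'inferred_type': 'string'}
```

Real profilers add distributions, outlier detection, and pattern analysis, but the inputs and outputs have this shape.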
Feature 3: Data preparation / transformation workflows
- What it does: supports common transforms like filtering rows, deriving columns, parsing strings/dates, and standardizing values.
- Why it matters: most analytics value comes after cleaning/curation.
- Practical benefit: repeatable prep steps reduce spreadsheet-based manipulation and duplicated scripts.
- Limitations/caveats: not every transform equals full ETL; complex workflows may still require Spark/SQL pipelines.
Feature 4: Sampling to keep exploration responsive
- What it does: allows users to work on representative subsets of huge datasets.
- Why it matters: exploration UIs can’t always operate on full-scale data interactively.
- Practical benefit: quick iteration on cleaning logic before full-scale apply.
- Limitations/caveats: sampling can mislead if data is highly skewed; validate with full runs.
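The skew caveat is easy to demonstrate. In this small Python sketch, a "first N rows" sample of data that happens to be sorted reports a 0% rate for a category whose true rate is 1%:

```python
# Sketch: "first N rows" sampling vs the true distribution on sorted data.
from collections import Counter

# 990 rows from region "A" followed by 10 rows from rare region "B"
regions = ["A"] * 990 + ["B"] * 10

head = Counter(regions[:100])      # naive head-of-file sample
full = Counter(regions)            # full-scan distribution

print(head["B"] / 100)             # 0.0  (the rare region is invisible)
print(full["B"] / len(regions))    # 0.01 (its true rate)
```

This is why uniform or stratified sampling, plus periodic full-scale validation runs, matter for skewed datasets.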
Feature 5: Publishing curated datasets
- What it does: exports/publishes prepared datasets for consumption by BI or other systems.
- Why it matters: turns exploratory work into reusable assets.
- Practical benefit: consistent curated datasets across teams.
- Limitations/caveats: export targets and formats depend on environment integration; confirm supported sinks.
Feature 6: Collaboration and project organization
- What it does: organizes work into projects/datasets with saved steps and shareable assets.
- Why it matters: prevents knowledge loss and “tribal scripts.”
- Practical benefit: repeatable prep pipelines and easier onboarding.
- Limitations/caveats: governance depends on how access control is implemented.
Feature 7: Security integration (authentication/authorization)
- What it does: controls who can access datasets and features.
- Why it matters: discovery tools often surface sensitive columns; least privilege is critical.
- Practical benefit: safer self-service analytics.
- Limitations/caveats: integration method depends on deployment (e.g., enterprise identity providers); verify supported modes.
Feature 8: Administrative controls and monitoring hooks
- What it does: provides ways to configure sources, manage users, and monitor health.
- Why it matters: discovery platforms need operational oversight to stay stable.
- Practical benefit: more predictable performance and better troubleshooting.
- Limitations/caveats: monitoring integrations vary; you may need external monitoring stacks.
7. Architecture and How It Works
High-level architecture
A Big Data Discovery-style platform typically has:
1. Data sources (HDFS/Hive tables, object storage, databases, depending on connectors).
2. Ingestion/metadata layer to register datasets.
3. Indexing and exploration layer to support fast interactive filtering and profiling.
4. Processing layer to run transformations at scale.
5. Publishing layer to produce curated outputs.
6. Access control integrated with enterprise identity/IAM.
Request/data/control flow (conceptual)
- A user logs into the Studio/UI.
- The user selects or ingests a dataset.
- The system profiles the dataset and builds indexes/metadata to enable interactive exploration.
- The user defines transformations (cleaning/derivations).
- The system executes transforms (possibly using a cluster compute engine).
- Results are published to downstream destinations.
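The flow above can be sketched in miniature. This is a toy, pure-Python stand-in for what the platform layers do at scale (profiling, cleaning, publishing), not actual product code:

```python
# Compact sketch of the conceptual flow: ingest -> profile -> transform -> publish.
raw = [
    {"order_id": "1", "qty": "2"},
    {"order_id": "1", "qty": "2"},   # duplicate row
    {"order_id": "2", "qty": ""},    # missing qty
]

# Profile: count rows, duplicate keys, and missing values
ids = [r["order_id"] for r in raw]
profile = {
    "rows": len(raw),
    "dup_ids": len(ids) - len(set(ids)),
    "missing_qty": sum(1 for r in raw if not r["qty"]),
}

# Transform: dedupe by order_id and drop rows with missing qty
seen, curated = set(), []
for r in raw:
    if r["order_id"] not in seen and r["qty"]:
        seen.add(r["order_id"])
        curated.append({"order_id": int(r["order_id"]), "qty": int(r["qty"])})

print(profile)   # {'rows': 3, 'dup_ids': 1, 'missing_qty': 1}
print(curated)   # [{'order_id': 1, 'qty': 2}]
```

The OCI lab in Section 10 performs the same three stages with Object Storage, DBMS_CLOUD, and SQL instead of Python.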
Integrations with related services (OCI context)
If you are implementing this workflow on OCI today (even without Big Data Discovery), typical integrations are:
- OCI Object Storage for raw and curated data.
- Autonomous Database (ADW/ATP) for curated serving datasets.
- OCI Data Flow (Spark) for transformation at scale.
- OCI Logging for centralized logs (for OCI-native services).
- OCI IAM for least-privilege access to buckets and databases.
- Oracle Analytics Cloud for visualization (licensed service).
Dependency services
- Identity: OCI IAM (OCI-native workflows), or enterprise IdP for legacy deployments.
- Storage: Object Storage / HDFS / database storage depending on deployment.
- Compute/processing: Spark/Hadoop/YARN or OCI Data Flow.
Security/authentication model
- OCI-native pattern: users and services authenticate via IAM policies, dynamic groups, and resource principals (for services like Data Flow).
- Legacy/platform pattern: user auth may be integrated with LDAP/SSO depending on deployment.
Networking model
- OCI-native: private endpoints for Autonomous Database; private access to Object Storage via Service Gateway; limit public exposure.
- Legacy: depends on how the platform is deployed (on-prem network segmentation, firewalls).
Monitoring/logging/governance considerations
- Define ownership for datasets and transformations (data product mindset).
- Centralize logs:
- OCI Logging for OCI services.
- Database audit logs for Autonomous Database.
- Set tagging standards and cost tracking in OCI.
Simple architecture diagram (conceptual)
flowchart LR
U[User / Analyst] --> UI[Big Data Discovery Studio<br/>or Discovery UI]
UI --> META[Metadata + Profiling]
META --> IDX[Index/Search Layer]
UI --> PROC[Processing/Transform Layer]
PROC --> SRC[(Raw Data Store<br/>HDFS / Object Storage)]
PROC --> CUR[(Curated Output<br/>DB / Object Storage)]
CUR --> BI[BI / Analytics Consumers]
Production-style architecture diagram (OCI-native replacement pattern)
flowchart TB
subgraph OCI["Oracle Cloud Infrastructure (OCI)"]
subgraph Net["VCN (private)"]
ADB[(Autonomous Database<br/>Private Endpoint)]
BAST[Admin Bastion / Private Admin Host]
end
OS[(Object Storage Buckets<br/>Raw + Curated)]
DF["OCI Data Flow (Spark Jobs)"]
LOG[OCI Logging]
AUD[Audit Logs]
IAM[IAM Policies<br/>+ Dynamic Groups]
end
EXT[Enterprise Users] -->|SSO/IAM| IAM
EXT -->|SQL/Web| ADB
EXT -->|Console/API| OS
DF -->|Read/Write| OS
DF -->|Load curated| ADB
DF --> LOG
ADB --> AUD
OS --> AUD
BAST -->|Private admin| ADB
8. Prerequisites
Because Big Data Discovery itself may not be a universally available OCI service, prerequisites are split into two parts:
- A) If you already have Big Data Discovery (legacy/product environment)
- B) If you will follow the OCI hands-on lab (recommended for most new OCI users)
A) Big Data Discovery (product) prerequisites (verify in your version docs)
- Access to a Big Data Discovery environment (often part of an Oracle big data platform deployment).
- Admin-provisioned connectivity to your data sources (Hive/HDFS/etc., depending on your environment).
- User authentication set up (SSO/LDAP/IAM depending on deployment).
- Permissions to create projects/datasets and run transformations.
B) OCI hands-on lab prerequisites (Big Data Discovery-style workflow on Oracle Cloud)
Account/tenancy
- An active Oracle Cloud (OCI) tenancy with billing enabled (or free trial).
- Permission to create and manage:
  - Object Storage buckets/objects
  - Autonomous Database (Always Free if available in your region)
  - IAM policies (or an admin who can create required policies)
IAM permissions
- You need a group with policies similar to:
  - Manage Object Storage in a compartment
  - Manage Autonomous Database in a compartment
  - Use Cloud Shell (optional)
- If you are not an admin, ask your OCI admin to grant least-privilege permissions.
Tools
- OCI Console access.
- Optionally:
  - OCI Cloud Shell (recommended) or OCI CLI installed locally.
  - A SQL client: SQL Developer, SQLcl, or the built-in Autonomous Database SQL tools.
Region availability
- Object Storage and Autonomous Database are widely available, but Always Free availability can vary.
- If a service isn’t available in your region, select a different OCI region (if your tenancy allows) or use paid resources.
Quotas/limits
- Autonomous Database Always Free has resource limits.
- Object Storage has tenancy-level service limits.
- If you hit a limit error, request a service limit increase (paid accounts) or use a smaller dataset.
Prerequisite services
- OCI Object Storage
- Oracle Autonomous Database (ADW or ATP)
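The IAM permissions listed above can be expressed as OCI policy statements. A minimal sketch, assuming a group named bdd-lab-users (hypothetical) and the bdd-lab compartment created in Step 1; tighten verbs and resource families to match your security team’s least-privilege standards:

```text
Allow group bdd-lab-users to manage object-family in compartment bdd-lab
Allow group bdd-lab-users to manage autonomous-database-family in compartment bdd-lab
Allow group bdd-lab-users to use cloud-shell in tenancy
```

For production, prefer narrower verbs (for example, `use` or `read` instead of `manage`) and scope to the smallest compartment that works.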
9. Pricing / Cost
Big Data Discovery pricing model (important reality)
Big Data Discovery is not typically priced like a modern OCI consumption service with a public “per-GB/per-OCPU” meter in the OCI price list. Instead, it has historically been:
- Included in certain Oracle big-data platform offerings, or
- Licensed as software (terms vary)
What to do:
- Verify Big Data Discovery commercial and licensing terms with your Oracle Sales/Account team.
- Check your support contract and product availability for your environment.
Because public, region-based OCI pricing pages may not list Big Data Discovery explicitly, you should not assume it behaves like a pay-as-you-go OCI native service.
Cost model for the OCI lab (Big Data Discovery-style workflow)
The lab in this tutorial uses common OCI services with published pricing:
- OCI Object Storage: billed by stored GB-month and requests (and data egress if applicable).
- Autonomous Database: the Always Free option may cost $0 within limits; paid tiers bill by OCPU and storage.
- Optional additions (not required):
  - OCI Data Flow: billed by OCPU time (Spark job runtime) and possibly other dimensions depending on SKU.
Free tier notes
- Autonomous Database has an Always Free option in many regions/tenancies (verify in your OCI console).
- Object Storage has limited “free” components depending on promotions; assume storage is billed unless your tenancy offers credits.
Cost drivers
Direct cost drivers:
- Object Storage data volume (raw + curated + logs/exports).
- Autonomous Database size and compute (if not Always Free).
- Any optional Spark processing (Data Flow) runtime.
Indirect/hidden costs:
- Data transfer/egress if you move data out of OCI regions.
- Operational overhead (time) if you maintain self-managed tooling.
- Backups and retention if you store many copies.
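To reason about these drivers, it helps to sketch the arithmetic. The rates below are hypothetical placeholders, not Oracle prices; substitute real figures from the OCI pricing pages:

```python
# Sketch of the cost arithmetic only. RATE_* values are hypothetical
# placeholders, NOT Oracle prices; pull real rates from the pricing pages.
RATE_PER_GB_MONTH = 0.02      # hypothetical storage rate
RATE_PER_OCPU_HOUR = 0.30     # hypothetical compute rate

def monthly_cost(storage_gb, ocpus, hours_per_month=730):
    storage = storage_gb * RATE_PER_GB_MONTH
    compute = ocpus * hours_per_month * RATE_PER_OCPU_HOUR
    return storage + compute

# 500 GB stored + 2 OCPUs running all month:
print(round(monthly_cost(500, 2), 2))  # 500*0.02 + 2*730*0.30 = 448.0
```

The structure (storage term + compute-hours term) is what matters; it makes clear why idling down compute and pruning stored copies are the two biggest levers.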
Network/data transfer implications
- Ingress to OCI is typically not billed, but egress to the internet is usually billed. Verify OCI data transfer pricing for your region.
- Cross-region replication and reads may incur additional costs.
How to optimize cost
- Keep raw data in Object Storage and only curate what you need into the database.
- Use compressed columnar formats (Parquet) for curated lake data when possible.
- Use Always Free Autonomous Database for small labs and prototypes.
- Set lifecycle policies on buckets to archive or delete older objects.
- Tag resources for cost tracking and shut down/delete unused resources.
Example low-cost starter estimate (no fabricated numbers)
A low-cost starter design usually includes:
- One Object Storage bucket with a small dataset (a few GB or less).
- One Autonomous Database Always Free instance.
- Optional: no Data Flow jobs.
Cost should be minimal (often near $0 if Always Free is used and storage is small), but verify in the OCI Cost Estimator.
Example production cost considerations
For production-scale discovery/prep workflows:
- Expect significant Object Storage growth (raw + curated + historical).
- Autonomous Database paid tiers if you need larger compute/storage and higher concurrency.
- Spark processing costs (OCI Data Flow) if you run frequent large jobs.
- Monitoring/log retention costs and security services.
Official pricing references (start here)
- OCI pricing overview and calculator:
- https://www.oracle.com/cloud/costestimator.html
- https://www.oracle.com/cloud/pricing/
- OCI Object Storage pricing (navigate to “Storage”):
- https://www.oracle.com/cloud/pricing/
- Autonomous Database pricing:
- https://www.oracle.com/autonomous-database/pricing/
- OCI Data Flow pricing (if used):
- https://www.oracle.com/cloud/pricing/
Pricing pages can be reorganized over time. If a link changes, start from the OCI pricing page and drill down by service.
10. Step-by-Step Hands-On Tutorial
Because Big Data Discovery may not be available as a native OCI managed service in your tenancy, this lab provides a Big Data Discovery-style workflow on Oracle Cloud using commonly available OCI services. The end result is the same outcome Big Data Discovery is typically used for: ingest → profile → clean/transform → publish → explore.
Objective
Build a small, realistic “discovery and preparation” pipeline on Oracle Cloud:
1. Store a raw CSV dataset in OCI Object Storage.
2. Load it into an Autonomous Database (Always Free where available) using DBMS_CLOUD.
3. Profile and transform the dataset with SQL.
4. Explore results with built-in Autonomous Database tools (and optionally connect a BI tool later).
Lab Overview
You will create:
- 1 compartment (optional but recommended)
- 1 Object Storage bucket + uploaded dataset
- 1 Autonomous Database (ADW or ATP)
- 1 database user + credential to read from Object Storage
- 1 raw table + 1 curated table
- Simple profiling queries and a “publish” view
Estimated time: 60–120 minutes
Cost: Low. Potentially $0 if you use Autonomous Database Always Free and a small dataset.
Skill level: Beginner-friendly; includes IAM and SQL fundamentals.
Step 1: Create a compartment (recommended)
Why: compartments help isolate access and costs for labs.
- In the OCI Console, open the navigation menu → Identity & Security → Compartments.
- Click Create Compartment.
- Name: bdd-lab
- Description: Big Data Discovery style lab resources
- Click Create Compartment.
Expected outcome: You have a bdd-lab compartment to place all resources.
Verification: You can select bdd-lab in the compartment picker.
Step 2: Create an Object Storage bucket and upload a dataset
2.1 Create a bucket
- Go to Storage → Buckets.
- Ensure the compartment is bdd-lab.
- Click Create Bucket.
- Name: bdd-lab-raw
- Keep defaults (Standard storage tier is fine for the lab).
- Click Create.
Expected outcome: bucket bdd-lab-raw exists.
2.2 Upload a sample CSV
Pick a small dataset you can legally use. Two good options:
- A public dataset from a government open data portal
- A synthetic dataset you generate yourself
For a quick lab, you can generate a synthetic CSV locally:
cat > sales_raw.csv <<'EOF'
order_id,order_ts,customer_id,region,product,qty,unit_price,status
1,2026-01-05T10:15:00Z,C001,us-phx,keyboard,1,45.00,SHIPPED
2,2026-01-05T11:02:00Z,C002,us-ashburn,mouse,2,18.50,SHIPPED
3,2026-01-06T09:41:00Z,C003,eu-frankfurt,monitor,1,199.99,PENDING
4,2026-01-06T09:41:00Z,C003,eu-frankfurt,monitor,1,199.99,PENDING
5,2026-01-07T14:20:00Z,,us-phx,usb-c cable,3,9.99,CANCELLED
6,2026-01-08T08:05:00Z,C004,us-phx,laptop,1,899.00,SHIPPED
7,2026-01-08T08:07:00Z,C004,us-phx,laptop,1,899.00,SHIPPED
8,2026-01-10T16:55:00Z,C005,ap-tokyo,headset,2,59.90,SHIPPED
EOF
Upload it:
– Buckets → bdd-lab-raw → Upload → select sales_raw.csv
Expected outcome: sales_raw.csv is stored in Object Storage.
Verification: You can see the object in the bucket listing.
Step 3: Create an Autonomous Database (Always Free if available)
- Go to Oracle Database → Autonomous Database.
- Select compartment: bdd-lab.
- Click Create Autonomous Database.
- Choose a workload: Autonomous Data Warehouse (ADW) is often a good fit for analytics labs.
- Display name: bdd_lab_adw
- Database name: BDDLAB
- Choose Always Free if available.
- Set the admin password (store it securely).
- Networking: for the simplest lab, you can use public access with allowed IPs; for more secure setups, use a private endpoint in a VCN (adds complexity).
- Click Create.
Expected outcome: an Autonomous Database instance is provisioned.
Verification: status becomes Available.
Step 4: Create an Object Storage auth token and database credential
Autonomous Database uses DBMS_CLOUD to access Object Storage. The common approach is:
- Create an OCI user Auth Token
- Store it as a database credential
4.1 Create an Auth Token (OCI user)
- In OCI Console: Identity & Security → Users → your user.
- Open Auth Tokens.
- Click Generate Token.
- Description: bdd-lab-dbms-cloud
- Copy the token value (you won’t see it again).
Expected outcome: you have an auth token.
4.2 Create a database user (optional but recommended)
In Autonomous Database, open Database Actions (or your SQL tool) and run:
CREATE USER bdd_lab IDENTIFIED BY "UseAStrongPassword#1";
GRANT CONNECT, RESOURCE TO bdd_lab;
-- RESOURCE no longer implies tablespace quota; grant it explicitly
-- (DATA is the default tablespace in Autonomous Database):
ALTER USER bdd_lab QUOTA UNLIMITED ON DATA;
-- For DBMS_CLOUD usage:
GRANT EXECUTE ON DBMS_CLOUD TO bdd_lab;
Expected outcome: user bdd_lab exists.
Verification:
SELECT username FROM all_users WHERE username = 'BDD_LAB';
4.3 Create a DBMS_CLOUD credential
Connect as bdd_lab and run:
BEGIN
DBMS_CLOUD.CREATE_CREDENTIAL(
credential_name => 'OBJ_STORE_CRED',
username => '<your_oci_username>',
password => '<your_auth_token>'
);
END;
/
Expected outcome: credential is created.
Verification:
SELECT credential_name FROM user_credentials WHERE credential_name = 'OBJ_STORE_CRED';
If you can’t use an auth token due to org policy, use an approved pattern (for example, resource principals in some OCI services). Follow your security team’s guidance.
Step 5: Load the CSV from Object Storage into a raw table
5.1 Create a raw staging table
CREATE TABLE sales_raw (
order_id NUMBER,
order_ts VARCHAR2(30),
customer_id VARCHAR2(20),
region VARCHAR2(50),
product VARCHAR2(100),
qty NUMBER,
unit_price NUMBER(10,2),
status VARCHAR2(20)
);
5.2 Identify the Object Storage URL
In the bucket object details, find the Object URL. OCI also provides a “URI” format you can use.
A common pattern is:
- Object Storage endpoint: https://objectstorage.<region>.oraclecloud.com
- Namespace + bucket + object path: /n/<namespace>/b/<bucket>/o/<object>
So the full URL looks like:
https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/bdd-lab-raw/o/sales_raw.csv
Use the exact URL from your console to avoid mistakes.
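The URL assembly can be sketched in a few lines of Python. The region and namespace values below are hypothetical placeholders; use the values from your own console:

```python
# Sketch: assembling the native Object Storage URL from its parts.
from urllib.parse import quote

def object_url(region, namespace, bucket, obj):
    # Object names may contain characters that need URL-encoding
    return (f"https://objectstorage.{region}.oraclecloud.com"
            f"/n/{namespace}/b/{bucket}/o/{quote(obj)}")

print(object_url("us-phoenix-1", "mytenancy", "bdd-lab-raw", "sales_raw.csv"))
# https://objectstorage.us-phoenix-1.oraclecloud.com/n/mytenancy/b/bdd-lab-raw/o/sales_raw.csv
```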
5.3 Load using DBMS_CLOUD
BEGIN
DBMS_CLOUD.COPY_DATA(
table_name => 'SALES_RAW',
credential_name => 'OBJ_STORE_CRED',
file_uri_list => 'https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/bdd-lab-raw/o/sales_raw.csv',
format => JSON_OBJECT(
'type' VALUE 'csv',
'skipheaders' VALUE '1',
'delimiter' VALUE ',',
'ignoremissingcolumns' VALUE 'true'
)
);
END;
/
Expected outcome: rows are loaded into SALES_RAW.
Verification:
SELECT COUNT(*) AS row_count FROM sales_raw;
SELECT * FROM sales_raw FETCH FIRST 5 ROWS ONLY;
Step 6: Profile the data (Big Data Discovery-style checks)
Run quick profiling queries similar to what Big Data Discovery would surface:
6.1 Null checks
SELECT
-- TRIM returns NULL for NULL, empty, and all-blank values, so one check suffices
SUM(CASE WHEN TRIM(customer_id) IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
COUNT(*) AS total_rows
FROM sales_raw;
6.2 Duplicate detection
SELECT order_id, COUNT(*) AS cnt
FROM sales_raw
GROUP BY order_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC;
6.3 Distribution by region/status
SELECT region, status, COUNT(*) AS cnt
FROM sales_raw
GROUP BY region, status
ORDER BY cnt DESC;
Expected outcome: you identify:
- Missing customer_id rows
- Duplicate order_id rows
- Basic frequency breakdowns
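The same three checks can be expressed in plain Python, which is handy for spot-checking a CSV before it ever reaches the database. This sketch runs against a tiny in-memory sample shaped like SALES_RAW; the rows and values are invented for illustration.

```python
from collections import Counter

# Tiny sample shaped like SALES_RAW (values are made up).
rows = [
    {"order_id": 1, "customer_id": "C1", "region": "east", "status": "SHIPPED"},
    {"order_id": 2, "customer_id": "  ",  "region": "west", "status": "NEW"},
    {"order_id": 2, "customer_id": "C3", "region": "west", "status": "NEW"},
]

# 6.1 Null/blank check (mirrors TRIM(customer_id) IS NULL)
null_customer_id = sum(
    1 for r in rows if r["customer_id"] is None or not r["customer_id"].strip()
)

# 6.2 Duplicate detection (mirrors GROUP BY order_id HAVING COUNT(*) > 1)
dupes = {k: c for k, c in Counter(r["order_id"] for r in rows).items() if c > 1}

# 6.3 Distribution by region/status
dist = Counter((r["region"], r["status"]) for r in rows)

print(null_customer_id, dupes, dist.most_common(1))
```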
Step 7: Transform into a curated table (clean + dedupe + typed timestamp)
7.1 Create a curated table
This example:
- Parses ISO timestamps
- Deduplicates by keeping the first row per order_id (simple rule)
- Filters out rows missing customer_id (business rule example)
CREATE TABLE sales_curated AS
WITH typed AS (
SELECT
order_id,
-- "Z" is matched as a quoted literal here; the result uses the session time zone
TO_TIMESTAMP_TZ(order_ts, 'YYYY-MM-DD"T"HH24:MI:SS"Z"') AS order_ts_tz,
TRIM(customer_id) AS customer_id,
LOWER(TRIM(region)) AS region,
TRIM(product) AS product,
qty,
unit_price,
UPPER(TRIM(status)) AS status
FROM sales_raw
),
deduped AS (
SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts_tz) AS rn
FROM typed t
)
SELECT
order_id,
order_ts_tz,
customer_id,
region,
product,
qty,
unit_price,
status,
(qty * unit_price) AS line_amount
FROM deduped
WHERE rn = 1
AND customer_id IS NOT NULL;
Expected outcome: SALES_CURATED exists and is cleaner.
Verification:
SELECT COUNT(*) AS curated_rows FROM sales_curated;
SELECT * FROM sales_curated ORDER BY order_id;
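To make the dedupe-and-clean rule concrete outside SQL, here is the same logic sketched in plain Python: normalize fields, keep the earliest row per order_id, drop rows with a blank customer_id, and derive line_amount. The sample rows are invented for illustration; ISO-8601 timestamp strings sort chronologically, which is what makes the simple sort work.

```python
# Invented sample rows shaped like SALES_RAW.
raw = [
    {"order_id": 1, "order_ts": "2024-01-02T10:00:00Z", "customer_id": "C1",
     "region": " East ", "qty": 2, "unit_price": 5.0},
    {"order_id": 1, "order_ts": "2024-01-01T09:00:00Z", "customer_id": "C1",
     "region": "east", "qty": 2, "unit_price": 5.0},
    {"order_id": 2, "order_ts": "2024-01-03T11:00:00Z", "customer_id": "  ",
     "region": "west", "qty": 1, "unit_price": 9.0},
]

# "typed" CTE equivalent: trim/normalize and derive line_amount.
typed = [
    {**r,
     "customer_id": (r["customer_id"] or "").strip(),
     "region": r["region"].strip().lower(),
     "line_amount": r["qty"] * r["unit_price"]}
    for r in raw
]

# ROW_NUMBER() equivalent: first row per order_id, ordered by timestamp.
first_per_order = {}
for r in sorted(typed, key=lambda r: (r["order_id"], r["order_ts"])):
    first_per_order.setdefault(r["order_id"], r)

# Business rule: drop rows with a blank customer_id.
curated = [r for r in first_per_order.values() if r["customer_id"]]
print(len(curated), curated[0]["order_ts"], curated[0]["line_amount"])
```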
Step 8: “Publish” a consumption-friendly view
In Big Data Discovery-style workflows, publishing often means creating a stable dataset interface for BI/analytics consumers.
CREATE OR REPLACE VIEW sales_summary_v AS
SELECT
region,
status,
COUNT(*) AS orders,
SUM(line_amount) AS revenue
FROM sales_curated
GROUP BY region, status;
Verification:
SELECT * FROM sales_summary_v ORDER BY revenue DESC;
Expected outcome: a stable view for dashboards and reports.
Validation
You should be able to confirm all of the following:
- Object exists in Object Storage: bucket bdd-lab-raw contains sales_raw.csv
- Data loaded: SELECT COUNT(*) FROM sales_raw; returns the expected row count (8 in the sample)
- Curated table created: SELECT COUNT(*) FROM sales_curated; returns fewer rows than raw (because of null removal + dedupe)
- Published summary view works: SELECT * FROM sales_summary_v; returns region/status rollups
Troubleshooting
Error: ORA-20000: ... Unauthorized or 401 when running DBMS_CLOUD.COPY_DATA
Likely causes:
- Wrong auth token (token not copied correctly)
- Wrong username (OCI username mismatch)
- Wrong Object Storage URL/namespace/region
- IAM policy does not allow Object Storage access
Fixes:
- Regenerate the auth token and recreate the credential.
- Copy the Object URL directly from the console.
- Confirm your user/group has permissions for Object Storage in that compartment.
Error: ORA-00942: table or view does not exist when selecting user_credentials
- Query the USER_CREDENTIALS view (as shown) and confirm your privileges.
- Make sure you created the credential in the same schema you’re querying from.
Error: timestamp parsing fails
- Confirm the timestamp format in your CSV.
- Adjust the TO_TIMESTAMP_TZ format mask accordingly.
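One quick way to sanity-check the timestamp format is to parse a sample value outside the database first. This Python sketch handles the trailing "Z" used in the lab's CSV; `datetime.fromisoformat` accepts "Z" directly on Python 3.11+, so the `replace` below keeps it portable to older versions.

```python
from datetime import datetime, timezone

# Sample value in the lab's assumed CSV timestamp format.
ts = "2024-01-15T09:30:00Z"

# Normalize "Z" to an explicit UTC offset for older Python versions.
parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
print(parsed.isoformat())
```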
Data looks duplicated or inconsistent
- Review the dedupe rule (ROW_NUMBER() partitioned by order_id).
- In real datasets, you may need a more robust business key and ordering rule.
Cleanup
To avoid ongoing costs, remove resources when done:
- Drop database objects
DROP VIEW sales_summary_v;
DROP TABLE sales_curated PURGE;
DROP TABLE sales_raw PURGE;
BEGIN
DBMS_CLOUD.DROP_CREDENTIAL('OBJ_STORE_CRED');
END;
/
- Delete Autonomous Database: OCI Console → Autonomous Database → bdd_lab_adw → Terminate
- Delete Object Storage object and bucket: Buckets → bdd-lab-raw → delete sales_raw.csv, then delete the bucket bdd-lab-raw
- Remove IAM auth token: Identity → Users → your user → Auth Tokens → delete bdd-lab-dbms-cloud
- Optionally delete the compartment bdd-lab (only after it is empty).
11. Best Practices
Architecture best practices
- Treat discovery/prep as a repeatable pipeline, not a one-time activity.
- Separate zones:
- Raw zone (immutable, append-only) in Object Storage
- Curated zone (cleaned, standardized)
- Serving zone (database views/tables for BI)
- Prefer open formats (CSV for ingestion, Parquet for curated at scale) when building lake patterns.
IAM/security best practices
- Enforce least privilege:
- Bucket read-only for consumers
- Write access only for pipeline identities
- Use compartments per environment (dev/test/prod).
- Prefer private access paths:
- Autonomous Database private endpoint
- Object Storage via Service Gateway in a VCN (where feasible)
- Rotate credentials (auth tokens) and avoid embedding secrets in scripts.
Cost best practices
- Use Always Free resources for labs where possible.
- Implement Object Storage lifecycle rules (delete/archive old staging outputs).
- Avoid duplicating large datasets into databases unless necessary.
- Monitor egress and cross-region data movement.
Performance best practices
- For large data:
- Do transforms in scalable engines (Spark/SQL) rather than interactive tools
- Partition and compress curated datasets
- In databases:
- Use appropriate indexing/materialized views for BI query patterns
- Avoid SELECT * in production semantic layers
Reliability best practices
- Make transformations idempotent.
- Version curated datasets and keep lineage of rules.
- Automate loads and validation checks.
- Use backups and retention (database and object storage) aligned to RPO/RTO.
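To make "idempotent" concrete: a load that upserts by business key can be re-run safely after a failure, whereas a blind append duplicates rows on every retry. A minimal sketch of the idea, using an in-memory dict as a stand-in for the target table:

```python
# Stand-in for the target table, keyed by the business key (order_id).
target = {}

def load(rows):
    # Upsert: re-running the same batch overwrites rows, never duplicates them.
    for r in rows:
        target[r["order_id"]] = r

batch = [{"order_id": 1, "qty": 2}, {"order_id": 2, "qty": 1}]
load(batch)
load(batch)          # a retry of the same batch is a no-op in effect
print(len(target))   # 2, not 4
```

In SQL terms, this corresponds to using MERGE (or truncate-and-reload) instead of plain INSERT for repeatable pipeline runs.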
Operations best practices
- Centralize logs and metrics:
- OCI Logging for OCI services
- Database auditing for data access
- Use tagging: CostCenter, Environment, Owner, DataDomain
- Define SLOs for data freshness and pipeline success rate.
Governance/tagging/naming best practices
- Standard naming: bdd-<env>-raw, bdd-<env>-curated
- Data cataloging:
- Document dataset purpose, owners, and sensitivity classification.
- Apply consistent tags to buckets, DBs, and networking resources.
12. Security Considerations
Identity and access model
- In OCI, access to Object Storage and databases is governed by IAM policies.
- For discovery workflows:
- Create distinct identities for pipelines vs human users.
- Limit who can access raw sensitive data.
Encryption
- OCI Object Storage encrypts data at rest by default (service-managed keys are typical).
- Autonomous Database encrypts storage at rest by default.
- For stricter requirements, use customer-managed keys via OCI Vault (verify service support and configuration requirements).
Network exposure
- Avoid public Autonomous Database access for production.
- Restrict access with IP allowlists if public access is required.
- Prefer private endpoints and controlled ingress via bastion hosts.
Secrets handling
- Do not store auth tokens in plaintext files or source control.
- Prefer OCI Vault for secret storage and rotation where applicable.
- In this tutorial lab, you used an auth token; in production, design a safer credential strategy.
Audit/logging
- Use OCI Audit to track control-plane actions (bucket creation, DB changes).
- Use database auditing for data access and schema changes.
- Store logs in a central, tamper-resistant logging account if required.
Compliance considerations
- Classify data: PII/PHI/PCI.
- Apply data minimization: do not copy sensitive raw data to too many places.
- Enforce retention and deletion policies.
- Validate region/legal constraints for data residency.
Common security mistakes
- Granting broad “manage all-resources” policies to analysts.
- Leaving public buckets or permissive pre-authenticated requests.
- Using shared credentials across teams.
- No audit trail for who accessed sensitive columns.
Secure deployment recommendations
- Compartmentalize by environment and sensitivity.
- Use private networking for databases and processing.
- Implement a data access review process.
- Standardize dataset publishing with documented schemas and ownership.
13. Limitations and Gotchas
Big Data Discovery product lifecycle and availability
- Big Data Discovery may be legacy or not available as an OCI native managed service in many tenancies.
- Documentation may exist while new deployments are limited.
- Verify in official Oracle docs and with Oracle before committing to it long-term.
Quotas and limits (OCI lab)
- Autonomous Database Always Free has compute/storage limits.
- Object Storage has service limits (objects, requests) at tenancy level.
- Large CSV loads can be slower than parquet-based pipelines.
Regional constraints
- Always Free availability differs by region.
- Some services (like Oracle Analytics Cloud) may not be available in all regions.
Pricing surprises
- Data egress costs can surprise teams if they export large datasets to the internet.
- Storing many curated copies can multiply storage cost.
Compatibility issues
- CSV parsing is error-prone: delimiter issues, quoting, encoding.
- Timestamp formats often break ingestion unless standardized.
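Two of the classic CSV pitfalls mentioned above are quoted fields that contain the delimiter and quoted fields that contain newlines. Naive splitting on "," mishandles both, while a real CSV parser does not, which is why loader options (delimiter, quoting) matter. A small Python demonstration:

```python
import csv
import io

# A quoted field containing the delimiter, and one containing a newline.
data = 'order_id,product\n1,"Widget, large"\n2,"Multi\nline name"\n'

# Naive approach: split the second physical line on commas.
naive = data.splitlines()[1].split(",")

# Correct approach: let the csv module handle quoting and embedded newlines.
rows = list(csv.reader(io.StringIO(data)))[1:]

print(naive)   # the quoted field is broken apart
print(rows)    # both records parse correctly
```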
Operational gotchas
- Ad-hoc “discovery transforms” can become production dependencies. Treat them as code where possible.
- Without governance, multiple “curated” datasets may diverge and confuse consumers.
Migration challenges
- If migrating from a Hadoop-centric Big Data Discovery environment:
- Mapping transformations to Spark/SQL pipelines can require rework.
- Access patterns may change (index-based exploration vs SQL queries).
- Plan for data format conversion and partitioning.
Vendor-specific nuances
- Oracle’s big data and analytics tooling portfolio evolves. What replaces Big Data Discovery depends on your target architecture (data lake, warehouse, or lakehouse pattern).
14. Comparison with Alternatives
Nearest options in Oracle Cloud (OCI)
- Oracle Analytics Cloud (OAC): BI + visualization + data prep features (licensed service).
- OCI Data Integration: managed ETL/ELT orchestration for moving and transforming data (focus on pipelines rather than interactive discovery).
- OCI Data Flow (Apache Spark): scalable processing; requires engineering patterns, not a discovery UI.
- OCI Big Data Service: managed Hadoop ecosystem for customers who need it.
- Autonomous Database + APEX/Database Actions: fast SQL-based profiling and lightweight exploration.
Nearest options in other clouds
- AWS: Glue + Athena + Lake Formation + QuickSight
- Azure: Data Factory + Synapse + Purview + Power BI
- GCP: Dataproc + BigQuery + Dataplex + Looker
Open-source/self-managed alternatives
- Apache Superset (BI exploration)
- Trino/Presto + a metastore + a BI tool
- Jupyter notebooks + Spark
- OpenSearch/Kibana for log/event exploration
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Big Data Discovery (Oracle) | Existing Oracle big data deployments needing interactive discovery | Integrated discovery + prep experience for big data (deployment dependent) | Availability/lifecycle uncertainty for new OCI projects; may be legacy | You already have it and it meets requirements; short-to-mid term use while planning roadmap |
| Oracle Analytics Cloud | BI dashboards + governed analytics | Strong visualization, semantic modeling, enterprise BI capabilities | Licensed cost; may require curated datasets | You need enterprise BI and a supported OCI analytics roadmap |
| OCI Data Integration | ETL/ELT orchestration | Managed pipelines, scheduling, connectors | Not primarily interactive discovery | You need repeatable data movement and transformations |
| OCI Data Flow (Spark) | Large-scale processing | Scales Spark without managing clusters | Engineering-heavy; no “discovery UI” | You need big transformations at scale |
| Autonomous Database + SQL tools | SQL-centric profiling + curated serving | Fast iteration with SQL, strong governance controls | Not a specialized discovery UI | You want a cost-effective, governable curated layer on OCI |
| AWS Glue + Athena + QuickSight | AWS-native lake analytics | Strong managed lake query and BI ecosystem | AWS lock-in; cost can grow with query volume | Your platform is AWS and you want managed lake analytics |
| Open-source (Trino/Superset) | Custom lakehouse stacks | Flexibility, open formats | Ops burden, security/integration work | You have strong platform engineering and want maximum portability |
15. Real-World Example
Enterprise example: telecom data quality and churn analytics
- Problem: A telecom company ingests billions of network events and customer interactions. Analysts struggle to explore raw data and detect schema drift and quality issues before reporting.
- Proposed architecture:
- Raw events land in OCI Object Storage (raw zone).
- Transformations run via Spark (OCI Data Flow) to generate curated customer-event aggregates.
- Curated datasets stored in Autonomous Data Warehouse for high-concurrency BI.
- BI dashboards in Oracle Analytics Cloud.
- Governance with IAM policies, auditing, and data classification tags.
- Why Big Data Discovery was chosen (or evaluated):
- Historically, it offered a self-service discovery layer on big-data stores for analysts.
- In a modernization effort, the company maps those workflows to OCI-native services for long-term supportability.
- Expected outcomes:
- Faster detection of bad data (null spikes, outliers).
- Reduced time to publish curated datasets for churn models.
- Stronger access controls and auditability.
Startup/small-team example: e-commerce order analytics
- Problem: A small team needs quick insight into order data quality and basic revenue reporting without hiring a full data platform team.
- Proposed architecture:
- Store raw CSV exports in OCI Object Storage.
- Load into Autonomous Database Always Free (for small volumes).
- Use SQL views as the published semantic layer.
- Lightweight exploration via Database Actions/APEX; later add Oracle Analytics Cloud if needed.
- Why Big Data Discovery-style approach:
- The team needs the workflow outcome (discover → clean → publish) more than the specific legacy product.
- Expected outcomes:
- Clean, deduplicated order dataset.
- Simple dashboards and reports with minimal operational overhead.
- Low costs and easy scaling path.
16. FAQ
1) Is Big Data Discovery an OCI native managed service?
In many OCI tenancies, Big Data Discovery does not appear as a standard OCI managed service. It has historically been delivered as part of Oracle’s broader big data product stack. Verify current availability in official Oracle documentation and your OCI tenancy/service catalog.
2) What is Big Data Discovery primarily used for?
Interactive data discovery, profiling, preparation, and publishing curated datasets—often for big-data sources like Hadoop ecosystems.
3) Is Big Data Discovery the same as Oracle Analytics Cloud?
No. Oracle Analytics Cloud is an enterprise BI/analytics service. Big Data Discovery focuses more on discovery and preparation over big data sources (though there is conceptual overlap).
4) What replaced Big Data Discovery on OCI?
There isn’t always a single 1:1 replacement. Many teams implement the workflow using Object Storage + Data Flow + Autonomous Database + Oracle Analytics Cloud. Choose based on your needs and Oracle’s current roadmap.
5) Can I still follow this tutorial if I don’t have Big Data Discovery?
Yes. The hands-on lab is designed to be executable using common OCI services and recreates Big Data Discovery-style outcomes.
6) Does the lab require Oracle Analytics Cloud?
No. The lab uses Autonomous Database SQL and views for exploration. You can optionally connect OAC or another BI tool later.
7) What dataset size is appropriate for the lab?
Start small (MBs to a few GB). Always Free Autonomous Database is limited, and CSV loads are not optimized for huge datasets.
8) What’s the best storage format for curated big data on OCI?
For large-scale lake analytics, columnar formats like Parquet are common. For simple ingestion, CSV is fine but less efficient.
9) How do I control who can access raw vs curated data?
Use IAM policies and separate buckets/compartments. In the database, use schemas/roles and views to restrict columns and rows.
10) How do I avoid copying sensitive data into too many places?
Apply data minimization: keep raw data in one controlled zone, publish only necessary curated datasets, and enforce retention policies.
11) What’s the biggest operational risk with discovery tools?
Un-governed transformations can become “shadow production” logic. Treat curated outputs as products: versioning, testing, and ownership.
12) How do I monitor this pipeline?
Use OCI Audit for control-plane actions, database auditing for data access, and (if you add Spark/ETL) use the service’s logging + OCI Logging.
13) Can Autonomous Database load directly from Object Storage securely?
Yes, using DBMS_CLOUD with credentials. For more secure designs, evaluate private networking and approved credential storage patterns.
14) What’s the most common ingestion error?
Incorrect Object Storage URL/namespace or invalid credentials, leading to authorization failures.
15) Should I build a new long-term platform around Big Data Discovery today?
Only after confirming lifecycle status and supportability for your organization. For many OCI-first projects, OCI-native services provide a clearer forward path.
17. Top Online Resources to Learn Big Data Discovery
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation (search) | Oracle Help Center search for “Big Data Discovery”: https://docs.oracle.com/en/search/?q=Big%20Data%20Discovery | Safest starting point to find the correct versioned docs and guides |
| Official documentation (platform context) | Oracle Big Data Appliance documentation (Oracle Help Center): https://docs.oracle.com/en/ | Big Data Discovery is often discussed in the context of Oracle’s big data platform; use this to locate install/admin context |
| Official pricing | OCI Pricing: https://www.oracle.com/cloud/pricing/ | For pricing of OCI services used as alternatives (Object Storage, Data Flow, etc.) |
| Official cost estimator | OCI Cost Estimator: https://www.oracle.com/cloud/costestimator.html | Build region-specific estimates without guessing |
| Official docs (Object Storage) | OCI Object Storage docs: https://docs.oracle.com/en-us/iaas/Content/Object/home.htm | Required for secure bucket design, URLs, lifecycle policies |
| Official docs (Autonomous Database) | Autonomous Database docs: https://docs.oracle.com/en/cloud/paas/autonomous-database/ | Covers provisioning, security, Database Actions, connectivity |
| Official docs (DBMS_CLOUD) | DBMS_CLOUD documentation (in Autonomous DB docs): https://docs.oracle.com/en/cloud/paas/autonomous-database/adbsa/ | Authoritative reference for loading from Object Storage |
| Official docs (IAM) | OCI IAM docs: https://docs.oracle.com/en-us/iaas/Content/Identity/home.htm | Correct policy patterns and least privilege guidance |
| Official videos | Oracle Cloud YouTube channel: https://www.youtube.com/@OracleCloud | Often includes practical demos for OCI data services (verify availability for specific topics) |
| Community learning | Oracle Cloud Customer Connect: https://community.oracle.com/customerconnect/ | Practical Q&A with Oracle community and product teams (verify advice against docs) |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | Engineers, DevOps, architects | Cloud/DevOps fundamentals, automation, CI/CD; check for OCI data tracks | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | DevOps learners, build/release teams | SCM, DevOps, tooling foundations; may complement cloud labs | Check website | https://www.scmgalaxy.com/ |
| CLoudOpsNow.in | Cloud ops teams, SREs | Cloud operations practices, monitoring, reliability | Check website | https://cloudopsnow.in/ |
| SreSchool.com | SREs, platform engineers | SRE practices, observability, incident management | Check website | https://sreschool.com/ |
| AiOpsSchool.com | Ops + data/AI practitioners | AIOps concepts, automation, operational analytics | Check website | https://aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | DevOps/cloud training content (verify specific OCI coverage) | Beginners to intermediate engineers | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training and mentoring | DevOps practitioners | https://devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps/services platform | Teams seeking hands-on help or coaching | https://devopsfreelancer.com/ |
| devopssupport.in | DevOps support/training resources | Ops teams needing implementation support | https://devopssupport.in/ |
20. Top Consulting Companies
| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps consulting (verify offerings) | Architecture reviews, implementation help, operations | Designing OCI landing zones; setting up CI/CD for data pipelines; cost optimization reviews | https://cotocus.com/ |
| DevOpsSchool.com | Training + consulting services (verify scope) | Upskilling teams; implementing DevOps practices around cloud workloads | Building automation for OCI deployments; operational best practices for data platforms | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify offerings) | DevOps transformations, tooling, operational maturity | Implementing observability; infrastructure automation; governance processes | https://devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Big Data Discovery-style work
- Data fundamentals: CSV/JSON/Parquet, schemas, partitioning
- SQL fundamentals: joins, aggregates, window functions
- Cloud basics: IAM, networking, compartments/projects, encryption
- Object storage concepts: buckets, prefixes, lifecycle rules
- Basic data governance: data classification and least privilege
What to learn after
- Spark fundamentals (especially if using OCI Data Flow)
- Data modeling for analytics (star schema, slowly changing dimensions)
- CI/CD for data pipelines (testing, versioning transformations)
- Observability for data platforms (data quality checks, pipeline SLAs)
- BI semantic modeling (Oracle Analytics Cloud or equivalent)
Job roles that use it (or equivalent workflows)
- Data Engineer
- Analytics Engineer
- BI Developer / BI Engineer
- Cloud Data Architect
- Platform Engineer (data platform)
- Data Governance / Data Quality Engineer
Certification path (if available)
- For Big Data Discovery specifically, certification availability may be limited depending on lifecycle.
- For OCI pathways, consider Oracle’s OCI certifications (associate/professional) relevant to:
- Cloud infrastructure
- Data management
- Analytics
Verify the current catalog on Oracle University: https://education.oracle.com/
Project ideas for practice
- Build a raw-to-curated pipeline for web logs and publish a KPI view.
- Implement data quality checks (null thresholds, uniqueness constraints) with automated alerts.
- Create a curated dataset with PII masking and column-level access controls.
- Cost-optimization exercise: lifecycle policies + partition strategies.
22. Glossary
- Big Data Discovery: Oracle product for interactive discovery, profiling, and preparation of large datasets (availability and packaging vary).
- OCI (Oracle Cloud Infrastructure): Oracle’s public cloud platform.
- Object Storage: Durable blob storage for files/objects in buckets.
- Autonomous Database: Managed Oracle database with automated operations; includes ADW/ATP.
- ADW (Autonomous Data Warehouse): Autonomous Database workload optimized for analytics.
- DBMS_CLOUD: Oracle-supplied PL/SQL package commonly used to load data into Autonomous Database from Object Storage and other cloud locations.
- Raw zone: Storage location for unmodified ingested data.
- Curated zone: Storage or tables with cleaned, standardized, analysis-ready data.
- Serving layer: Optimized data structures (tables/views) used by BI tools and applications.
- Faceted search: Exploration approach where users filter by attribute “facets” (e.g., region, status).
- Schema drift: Changes in incoming data schema over time (new columns, changed types).
- Least privilege: Security principle of granting only the permissions required to perform a task.
- Egress: Outbound data transfer from a cloud to the internet or another region.
23. Summary
Big Data Discovery is an Oracle offering aimed at interactive exploration, profiling, preparation, and publishing of large datasets, historically aligned with Oracle’s big data platform deployments. It matters because it addresses a consistent pain point in analytics programs: turning raw, messy data into trusted, consumable datasets quickly.
In Oracle Cloud (OCI), Big Data Discovery may not be present as a standard managed service in many tenancies, so treat its lifecycle and availability as something to verify in official Oracle documentation and with Oracle Sales/Support. For most new OCI projects, the practical approach is to implement the same workflow using OCI-native building blocks: Object Storage for the lake, Autonomous Database for curated/serving datasets, and (optionally) Spark-based processing and enterprise BI.
Key cost and security points:
- Costs come mainly from storage growth, compute for transformations, and BI licensing.
- Secure designs rely on compartments, least-privilege IAM policies, private networking for databases, encryption, and auditing.
When to use it:
- Use Big Data Discovery if you already have it and it aligns with your platform.
- Use OCI-native services to achieve the same outcomes when building forward-looking architectures on Oracle Cloud.
Next step: run the hands-on lab in this guide, then expand it by adding a scalable processing layer (OCI Data Flow) and a governed BI layer (Oracle Analytics Cloud) as your requirements grow.