Oracle Cloud Big Data Preparation Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Other Services

Category

Other Services

1. Introduction

Big Data Preparation in Oracle Cloud refers to the practices and cloud building blocks you use to ingest, profile, cleanse, transform, enrich, and publish large datasets so they are reliable for analytics, reporting, and machine learning.

A simple way to think about Big Data Preparation is: turn messy raw data (CSV/JSON/logs/exports) into curated datasets (often Parquet/Delta-like layouts, clean tables, or feature sets) that downstream systems can trust.

Technically, Big Data Preparation in Oracle Cloud is typically implemented by combining OCI services such as Object Storage (data lake landing zone), OCI Data Flow (Apache Spark) (distributed transformation at scale), OCI Data Integration (ETL/ELT orchestration and connectors), OCI Data Catalog (metadata and governance), and OCI Logging/Audit (operational visibility). In many environments, Oracle Analytics Cloud also provides interactive data preparation features for analysts.

Important note about service naming and availability: Oracle has historically used names like “Big Data Preparation” in different contexts (including legacy Oracle Cloud services and analytics data-prep capabilities). In current Oracle Cloud Infrastructure (OCI) service lists, there may not be a single standalone service tile named exactly Big Data Preparation in every tenancy/region. This tutorial therefore focuses on how to implement Big Data Preparation outcomes on Oracle Cloud using currently documented OCI services. If your organization has a specific Oracle service formally named “Big Data Preparation” enabled in your tenancy, verify the exact product documentation and console workflow in official docs and adapt the lab steps accordingly.

What problem does Big Data Preparation solve?

  • Raw data is inconsistent (missing values, wrong types, duplicates).
  • Data arrives from many sources with different schemas and quality.
  • Downstream analytics/ML fails or becomes expensive without curated, optimized layouts.
  • Teams need repeatable, auditable pipelines with access control and governance.

2. What is Big Data Preparation?

Official purpose (practical OCI interpretation)

In Oracle Cloud, Big Data Preparation is the end-to-end capability to:

  1. Land data reliably (often into Object Storage).
  2. Profile and validate quality.
  3. Transform and enrich at scale (Spark/ETL).
  4. Publish curated datasets for analytics and ML.
  5. Govern and operate these pipelines (metadata, logging, IAM, auditing).

Because Oracle Cloud may not expose “Big Data Preparation” as a single monolithic service in all tenancies, you should treat it as a solution capability implemented through OCI-native services.

Core capabilities

Common Big Data Preparation capabilities in Oracle Cloud architectures include:

  • Ingestion/landing: batch files, exports, logs, event data (Object Storage).
  • Schema handling: infer/define schemas; handle drift and evolution.
  • Data cleaning: null handling, standardization, type casting, de-duplication.
  • Data enrichment: joins to reference data, geocoding (if applicable), mapping codes to names.
  • Data quality checks: row counts, uniqueness, range checks, “bad record” quarantine.
  • Output optimization: columnar formats (Parquet), partitioning, compression.
  • Orchestration: schedules, dependencies, retries, parameterization.
  • Governance: cataloging, lineage (where available), tagging, access policies.
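To make the cleaning and quality-check capabilities concrete, here is a minimal plain-Python sketch of the same kinds of rules (the field names such as transaction_id and the reject reasons are illustrative; a production pipeline would express them in Spark, as in the hands-on lab later in this tutorial):

```python
from datetime import datetime

def clean_record(rec):
    """Standardize and validate one raw record.
    Returns (cleaned_record, None) on success or (None, reject_reason) on failure."""
    # Standardize: trim whitespace on all string fields
    rec = {k: (v.strip() if isinstance(v, str) else v) for k, v in rec.items()}
    if not rec.get("transaction_id"):
        return None, "missing_transaction_id"
    try:
        rec["amount"] = float(rec["amount"])  # type cast
    except (TypeError, ValueError):
        return None, "bad_amount"
    try:
        # Parse ISO-8601 timestamps; tolerate a trailing 'Z'
        rec["txn_ts"] = datetime.fromisoformat(rec["txn_ts"].replace("Z", "+00:00"))
    except (AttributeError, ValueError):
        return None, "bad_timestamp"
    return rec, None

def prepare(records):
    """Split records into valid (de-duplicated by transaction_id) and quarantined."""
    valid, quarantine, seen = [], [], set()
    for raw in records:
        rec, reason = clean_record(raw)
        if reason:
            quarantine.append({**raw, "reject_reason": reason})
        elif rec["transaction_id"] not in seen:  # de-duplicate, keep first
            seen.add(rec["transaction_id"])
            valid.append(rec)
    return valid, quarantine
```

The quarantine path matters as much as the happy path: rejected rows keep their original values plus a reject_reason, so they can be inspected and reprocessed.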

Major components (OCI building blocks)

A typical Oracle Cloud Big Data Preparation implementation uses:

  • OCI Object Storage for raw/curated zones
    Docs: https://docs.oracle.com/en-us/iaas/Content/Object/home.htm
  • OCI Data Flow (Apache Spark) for distributed preparation at scale
    Docs: https://docs.oracle.com/en-us/iaas/data-flow/using/overview.htm
  • OCI Data Integration for managed ETL/ELT, connectivity, and orchestration
    Docs: https://docs.oracle.com/en-us/iaas/data-integration/using/overview.htm
  • OCI Data Catalog for metadata management (where it fits your governance model)
    Docs: https://docs.oracle.com/en-us/iaas/data-catalog/home.htm
  • OCI Identity and Access Management (IAM) for least-privilege access
    Docs: https://docs.oracle.com/en-us/iaas/Content/Identity/home.htm
  • OCI Logging + Audit for observability and traceability
    Logging: https://docs.oracle.com/en-us/iaas/Content/Logging/home.htm
    Audit: https://docs.oracle.com/en-us/iaas/Content/Audit/home.htm

Service type

In practice, Big Data Preparation in Oracle Cloud is a managed data engineering capability implemented with managed storage + managed processing + managed orchestration/governance.

Scope: regional vs global; project/tenancy scoping

In OCI:

  • Most resources (Object Storage buckets, Data Flow applications/runs, Data Integration workspaces) are regional.
  • Access control is tenancy-wide via IAM, and resources are organized by compartments.
  • Data access is governed by policies and, where used, by encryption keys (Vault/KMS).

Verify in official docs for the exact scope of each service you use in your region, because OCI services sometimes have region-specific constraints or feature parity differences.

How it fits into the Oracle Cloud ecosystem

Big Data Preparation commonly sits between:

  • Source systems (SaaS apps, databases, on-prem exports, IoT/log streams)
  • Consumers (Oracle Analytics Cloud, Autonomous Database/ADW, Data Science models, downstream data products)

In the “Other Services” category context, Big Data Preparation often shows up as a cross-cutting capability that touches analytics, integration, governance, and operations rather than a single product silo.


3. Why use Big Data Preparation?

Business reasons

  • Faster insights: curated datasets reduce analyst time spent cleaning data.
  • Consistent KPIs: standardized definitions enable reliable reporting.
  • Better ML outcomes: higher-quality features improve model performance.
  • Reduced operational risk: repeatable pipelines reduce manual errors.

Technical reasons

  • Scalable transformations: distributed processing (Spark) handles large volumes.
  • Format optimization: Parquet + partitioning reduces query time and cost.
  • Schema governance: controlled evolution avoids breaking downstream jobs.
  • Automation: schedules, retries, and parameterized pipelines.

Operational reasons

  • Observability: logs/metrics for each run, failure visibility, and auditability.
  • Separation of environments: dev/test/prod compartments and policies.
  • Standardized deployment: Infrastructure-as-Code and reproducible jobs.

Security/compliance reasons

  • Least-privilege IAM for data access.
  • Encryption at rest and in transit (service-managed + customer-managed keys).
  • Audit trails for access and administrative changes.
  • Data residency controls through region selection.

Scalability/performance reasons

  • Elastic compute: spin up processing only when needed (batch).
  • Parallelism: distribute tasks across executors.
  • Incremental processing: partition by time and process only new data.
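The incremental pattern usually reduces to computing the partition prefix for the data you have not yet processed and reading only that prefix. A small sketch, assuming a date=YYYY-MM-DD layout; the bucket path in `base` is an illustrative placeholder:

```python
from datetime import date, timedelta

def partition_prefix(run_date, base="oci://bdp-lab-bucket@namespace/raw/transactions"):
    """Build the object-storage prefix for a single daily partition.
    The bucket/namespace in `base` are placeholders, not real resources."""
    return f"{base}/date={run_date.isoformat()}"

def yesterdays_prefix(today):
    """A daily batch job typically processes the previous day's partition."""
    return partition_prefix(today - timedelta(days=1))
```

A scheduled job can then pass this prefix as the Spark input path, so each run scans one day of data instead of the full history.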

When teams should choose Big Data Preparation (Oracle Cloud approach)

Choose it when you need:

  • A data lake landing zone plus scalable transformations.
  • Repeatable pipelines from raw to curated.
  • Governance and IAM integrated with OCI.
  • A path to analytics (OAC/ADW) and ML (OCI Data Science).

When they should not choose it

Avoid or reconsider if:

  • Your data volumes are tiny and a simple database ETL is enough.
  • You need sub-second interactive transformations for small datasets only (a local tool may be faster).
  • Your organization cannot operate data pipelines yet (no ownership, no monitoring, no runbooks).
  • Your data is highly regulated and you lack a clear governance model (you must solve governance first).


4. Where is Big Data Preparation used?

Industries

  • Financial services (risk, fraud, regulatory reporting datasets)
  • Retail/e-commerce (clickstream + orders + inventory curation)
  • Healthcare/life sciences (claims, encounters, research datasets)
  • Manufacturing/IoT (sensor telemetry normalization and aggregation)
  • Telecom (CDRs/logs aggregation and enrichment)
  • Public sector (data standardization across departments)
  • SaaS companies (product analytics event preparation)

Team types

  • Data engineering teams building curated data products
  • Analytics engineering teams defining semantic models and KPIs
  • Platform teams offering “data prep as a platform”
  • ML engineering teams generating feature datasets
  • Security/compliance teams enforcing access control and auditing

Workloads

  • Batch ETL/ELT from files to curated lake or warehouse
  • Data quality validation and anomaly detection
  • Sessionization and behavioral aggregation
  • Building slowly changing dimensions (SCD) and reference tables
  • Feature engineering for ML

Architectures

  • Data lake on Object Storage + Spark transformations
  • Lakehouse-style patterns (object storage + query engines)
  • Warehouse-centric patterns (curate into ADW)
  • Hybrid: on-prem sources to OCI landing zone

Real-world deployment contexts

  • Multi-account/compartment segmentation by domain (finance, marketing, product)
  • Central “shared services” data platform
  • Regulated environments requiring encryption keys and auditing

Production vs dev/test usage

  • Dev/test: smaller buckets, limited retention, fewer executors, relaxed schedules
  • Production: strict IAM, key management, runbooks, alerting, SLAs, retention policies, data contracts

5. Top Use Cases and Scenarios

Below are realistic Big Data Preparation use cases implemented on Oracle Cloud.

1) Raw CSV exports → curated Parquet in a data lake

  • Problem: CSV exports are slow and expensive to query; schemas are inconsistent.
  • Why this fits: Data Flow (Spark) transforms CSV into partitioned Parquet.
  • Example: Nightly ERP exports land in Object Storage; Spark job writes curated Parquet by date=YYYY-MM-DD.

2) Data quality gate before loading to a warehouse

  • Problem: Bad records break downstream loads and reports.
  • Why this fits: Spark/ETL can quarantine bad rows and publish quality metrics.
  • Example: Validate customer_id uniqueness and required fields; write “rejected” rows to a quarantine prefix.

3) PII masking/tokenization before broad analytics access

  • Problem: Analysts need trends, but not raw PII.
  • Why this fits: Data preparation step can hash/tokenize PII fields and control access.
  • Example: Hash emails, remove names, keep geographic region for analysis.

4) Join enrichment with reference data

  • Problem: Events contain codes that need human-readable dimensions.
  • Why this fits: Data prep joins event streams with reference tables stored in Object Storage or a database.
  • Example: Join country_code to country_name dimension.
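In Spark this is typically a join (broadcast when the dimension is small). The lookup logic itself is simple, as this plain-Python sketch shows; the dimension table contents are illustrative:

```python
# Illustrative reference dimension; in practice this is a file or table
COUNTRY_DIM = {"US": "United States", "CA": "Canada", "DE": "Germany"}

def enrich(events, dim=COUNTRY_DIM):
    """Attach a human-readable country_name to each event.
    Unknown codes are kept and flagged rather than silently dropped."""
    return [{**e, "country_name": dim.get(e.get("country_code"), "UNKNOWN")}
            for e in events]
```

Keeping unmatched codes visible (here as "UNKNOWN") is a deliberate data-prep choice: it surfaces reference-data gaps instead of hiding them.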

5) Sessionization of clickstream logs

  • Problem: Raw event logs are too granular; you need sessions for funnels.
  • Why this fits: Spark can group by user and session windows at scale.
  • Example: Create sessions with 30-minute inactivity threshold; publish session table for BI.
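The 30-minute inactivity rule can be sketched in a few lines of plain Python; in Spark the same logic is expressed with window functions over user and event time:

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps, gap=SESSION_GAP):
    """Assign a session index to each event timestamp of a single user.
    A new session starts whenever the gap since the previous event exceeds `gap`.
    `timestamps` must be sorted ascending."""
    sessions = []
    session_id = 0
    prev = None
    for ts in timestamps:
        if prev is not None and ts - prev > gap:
            session_id += 1  # inactivity gap exceeded: start a new session
        sessions.append(session_id)
        prev = ts
    return sessions
```

For example, events at 10:00 and 10:10 fall into one session, while an event at 11:00 (50 minutes later) starts a new one.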

6) IoT telemetry normalization + downsampling

  • Problem: High-frequency telemetry is huge and inconsistent.
  • Why this fits: Spark can standardize units, remove noise, and aggregate.
  • Example: Convert Celsius/Fahrenheit to a standard unit and aggregate to 1-minute averages.
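A minimal sketch of the normalize-then-downsample step, assuming readings tagged 'C' or 'F' and 1-minute buckets (plain Python; a Spark job would express the same thing as a groupBy over a truncated timestamp):

```python
from collections import defaultdict

def to_celsius(value, unit):
    """Standardize temperature readings to Celsius; unit is 'C' or 'F'."""
    return value if unit == "C" else (value - 32.0) * 5.0 / 9.0

def minute_averages(readings):
    """Downsample (timestamp, value, unit) readings to 1-minute averages in Celsius."""
    buckets = defaultdict(list)
    for ts, value, unit in readings:
        # Truncate the timestamp to the minute to form the aggregation bucket
        buckets[ts.replace(second=0, microsecond=0)].append(to_celsius(value, unit))
    return {minute: sum(vals) / len(vals) for minute, vals in sorted(buckets.items())}
```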

7) CDC landing → curated incremental datasets

  • Problem: Change data capture lands as many small files and requires compaction/curation.
  • Why this fits: Data prep can merge/compact and publish an optimized layout.
  • Example: Daily compaction job merges small files into partitioned Parquet.

8) Multi-source reconciliation for finance reporting

  • Problem: Numbers differ across systems; you need consistent reconciled datasets.
  • Why this fits: Controlled transformations + auditing produce traceable curated outputs.
  • Example: Join GL exports, payment processor data, and subscription system data; reconcile and output “gold” dataset.

9) ML feature store-style feature dataset builds

  • Problem: ML needs repeatable, versioned features with time-based backfills.
  • Why this fits: Parameterized Spark jobs build feature snapshots.
  • Example: Daily feature build with lookback windows; store features partitioned by feature_date.

10) Data cataloging and governance of lake datasets

  • Problem: Teams don’t know what data exists or who owns it.
  • Why this fits: Catalog + tags + naming standards make datasets discoverable.
  • Example: Register curated prefixes in OCI Data Catalog and apply business glossary terms.

11) Log data preparation for security analytics

  • Problem: Raw logs vary by source and are hard to query.
  • Why this fits: Normalize logs into a common schema; partition by source and date.
  • Example: Parse web server logs into structured fields; publish normalized security events.

12) Migration preparation (on-prem Hadoop exports to OCI)

  • Problem: Legacy Hadoop datasets need cleanup and format conversion.
  • Why this fits: Spark-based prep converts formats and enforces schema contracts.
  • Example: Read legacy TSV exports, standardize schema, write Parquet to Object Storage.

6. Core Features

Because “Big Data Preparation” may be implemented with multiple OCI services, this section describes the core features you should expect in an Oracle Cloud Big Data Preparation solution, mapped to common OCI components. Always verify feature availability in the official docs for the exact services you choose.

Feature 1: Data landing zones (raw/clean/curated)

  • What it does: Separates data into well-defined zones (often raw/, clean/, curated/ prefixes in Object Storage).
  • Why it matters: Prevents accidental mixing of raw and governed data; supports reproducibility.
  • Practical benefit: Simplifies debugging and backfills.
  • Caveat: Requires lifecycle/retention policies and naming conventions to avoid sprawl.

Feature 2: Schema enforcement and evolution handling

  • What it does: Defines schemas, casts types, and manages schema drift over time.
  • Why it matters: Downstream failures commonly come from unexpected columns or type changes.
  • Practical benefit: Stable contracts for BI and ML.
  • Caveat: Spark schema inference can be dangerous at scale; prefer explicit schemas for critical pipelines.

Feature 3: Distributed transformation at scale (Spark)

  • What it does: Uses a distributed engine (OCI Data Flow) for large joins, aggregations, and reshaping.
  • Why it matters: Single-node ETL breaks on volume and takes too long.
  • Practical benefit: Parallel processing, elastically provisioned.
  • Caveat: Requires careful partitioning, shuffle tuning, and file sizing to avoid slow jobs.

Feature 4: Data quality checks and quarantining

  • What it does: Validates constraints (null checks, ranges, uniqueness) and routes bad records.
  • Why it matters: Prevents “silent corruption” where reports look plausible but wrong.
  • Practical benefit: Quality dashboards and trust.
  • Caveat: Quality checks must be versioned and treated as code; avoid ad-hoc manual rules.

Feature 5: Output optimization (Parquet + partitioning + compression)

  • What it does: Writes curated datasets in columnar formats and partitions by common filters (date, region, tenant).
  • Why it matters: Query engines and downstream processing become faster and cheaper.
  • Practical benefit: Lower IO, improved scan efficiency.
  • Caveat: Over-partitioning creates too many small files; under-partitioning causes huge scans.
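One way to reason about the small-files caveat is a quick back-of-the-envelope calculation. The ~128 MB default target below is a common rule of thumb, not an Oracle recommendation, so validate it against your query engines:

```python
import math

def output_file_plan(daily_bytes, partitions_per_day, target_file_bytes=128 * 1024**2):
    """Estimate files per partition and flag a likely small-file problem.
    The target size and the risk threshold are illustrative heuristics."""
    bytes_per_partition = daily_bytes / partitions_per_day
    files = max(1, math.ceil(bytes_per_partition / target_file_bytes))
    avg_file_bytes = bytes_per_partition / files
    return {
        "files_per_partition": files,
        "avg_file_mb": round(avg_file_bytes / 1024**2, 1),
        # Heuristic: files far below target size suggest over-partitioning
        "small_file_risk": avg_file_bytes < target_file_bytes / 8,
    }
```

For example, 10 GB/day split into 24 hourly partitions yields a few ~100 MB files per partition (fine), while 1 MB/day split into 1000 partitions produces thousands of tiny files (over-partitioned).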

Feature 6: Orchestration and scheduling

  • What it does: Runs preparation pipelines on a schedule or event, with dependencies and retries.
  • Why it matters: Manual runs don’t scale operationally.
  • Practical benefit: Reliable daily/hourly pipelines.
  • Caveat: Choose one orchestration layer (Data Integration, external orchestrator, CI/CD) and standardize.

Feature 7: Metadata, cataloging, and discoverability

  • What it does: Registers datasets, owners, descriptions, and tags in a catalog system.
  • Why it matters: Data lakes without metadata become unusable.
  • Practical benefit: Faster onboarding, less tribal knowledge.
  • Caveat: Metadata requires stewardship; assign owners and enforce “definition of done”.

Feature 8: IAM-based access control and compartment isolation

  • What it does: Controls who can read/write buckets, run jobs, and view logs.
  • Why it matters: Data is often sensitive; least privilege is mandatory.
  • Practical benefit: Reduces blast radius and compliance risk.
  • Caveat: Misconfigured policies are a top cause of failed runs and data leaks.

Feature 9: Encryption and key management

  • What it does: Uses encryption at rest/in transit; optionally customer-managed keys via OCI Vault.
  • Why it matters: Regulatory controls and internal security standards.
  • Practical benefit: Strong data protection posture.
  • Caveat: Key rotation and access policies must be tested; losing key access can block pipelines.

Feature 10: Logging, auditing, and run traceability

  • What it does: Captures run logs, metrics, and administrative audits.
  • Why it matters: You need root cause analysis and compliance evidence.
  • Practical benefit: Faster incident response; reliable operations.
  • Caveat: Logs cost money to store; define retention and filter noise.

7. Architecture and How It Works

High-level architecture

A practical Oracle Cloud Big Data Preparation architecture uses:

  • Object Storage as the landing and curated repository.
  • Data Flow (Spark) to read raw data, transform it, and write curated outputs.
  • IAM to authorize the job to access storage and logs.
  • Logging/Audit for operations and compliance.
  • Optional: Data Integration for orchestration and connectors, Data Catalog for metadata.

Request/data/control flow (typical batch pipeline)

  1. Data lands in raw/ (Object Storage) via upload, export, or integration.
  2. A scheduled job (or orchestrator) triggers a Data Flow run.
  3. Spark reads raw objects, applies transformations and validations.
  4. Spark writes curated datasets (often Parquet) to curated/ and rejects to quarantine/.
  5. Logs and metrics are emitted to OCI Logging.
  6. Optional: a catalog scan registers curated datasets; consumers query curated data.

Integrations with related services (common)

  • OCI Data Integration: orchestrate workflows, manage connectivity, parameterization.
  • Oracle Autonomous Data Warehouse (ADW): load curated data into relational warehouse for BI.
  • Oracle Analytics Cloud: consume curated datasets for dashboards; may also do interactive data prep.
  • OCI Data Science: build features and train models from curated data.
  • OCI Vault: customer-managed keys and secrets.
  • OCI Events/Notifications: alert on failures (verify current integration patterns in docs).

Dependency services

  • Object Storage (data)
  • IAM (policies, dynamic groups if using resource principals)
  • Networking (VCN if using private endpoints; otherwise service endpoints)
  • Logging/Audit

Security/authentication model (typical for Data Flow)

OCI Data Flow supports resource principals: a run authenticates as a resource matched by a dynamic group, and IAM policies grant that dynamic group access to Object Storage and Logging.

  • Dynamic group: matches Data Flow run resources.
  • IAM policy: allows that dynamic group to read/write objects in specific compartments/buckets.

Verify exact policy statements in official docs, because policy grammar and required verbs vary by service.
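As a sketch only (verify the exact matching-rule keys, verbs, and resource types in the IAM and Data Flow documentation; the group, compartment, and bucket names below are placeholders), the pattern typically looks like:

```text
# Dynamic group matching rule (illustrative; confirm the resource.type value in the Data Flow docs)
ALL {resource.type = 'dataflowrun', resource.compartment.id = '<compartment_ocid>'}

# IAM policy statements (illustrative; confirm verbs and resource types in the policy reference)
Allow dynamic-group bdp-df-runs to read objects in compartment data-labs
Allow dynamic-group bdp-df-runs to manage objects in compartment data-labs where target.bucket.name = 'bdp-lab-bucket'
```

The second statement shows the common least-privilege refinement: scope write access to a specific bucket with a where clause rather than granting it compartment-wide.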

Networking model

  • Simplest: jobs access Object Storage through OCI service endpoints, keeping traffic on the Oracle network rather than routing it through generic public internet endpoints.
  • More controlled: configure private networking (VCN, private endpoints) depending on service support and organizational policy.

Verify in official docs if you require private endpoints for your chosen services and region.

Monitoring/logging/governance considerations

  • Define standardized log groups per environment (dev/test/prod).
  • Emit job-level metrics (row counts, reject counts, input/output bytes).
  • Use tags for cost allocation (cost-center, data-domain, environment).
  • Runbooks and on-call: define what constitutes failure and how to recover.

Simple architecture diagram

flowchart LR
  A[Source data exports\nCSV/JSON/logs] --> B[OCI Object Storage\nraw/]
  B --> C[OCI Data Flow (Spark)\nBig Data Preparation job]
  C --> D[OCI Object Storage\ncurated/ (Parquet)]
  C --> E[OCI Object Storage\nquarantine/ (bad rows)]
  C --> F[OCI Logging\nrun logs]
  D --> G[Consumers\nAnalytics / ML / BI]

Production-style architecture diagram

flowchart TB
  subgraph Tenancy[OCI Tenancy]
    subgraph CompartmentData[Compartment: data-platform-prod]
      OSRAW[(Object Storage\nraw zone)]
      OSCUR[(Object Storage\ncurated zone)]
      OSQUA[(Object Storage\nquarantine zone)]
      LOG[(Logging: log group)]
      AUD[(Audit)]
      DG[Dynamic Group\n(Data Flow runs)]
      POL[IAM Policies]
      KMS[(Vault/KMS\noptional CMK)]
      DC[(Data Catalog\noptional)]
      DI[(Data Integration\noptional orchestration)]
      DF[Data Flow App + Runs\n(Spark)]
    end
  end

  Sources[On-prem/SaaS/DB exports] --> OSRAW
  DI -->|trigger| DF
  OSRAW --> DF
  DF --> OSCUR
  DF --> OSQUA
  DF --> LOG
  POL --> DG
  DG --> DF
  KMS -.encryption.-> OSRAW
  KMS -.encryption.-> OSCUR
  DC -->|scan/register| OSCUR
  Consumers[OAC / ADW / Data Science] --> OSCUR
  AUD --> CompartmentData

8. Prerequisites

Tenancy/account requirements

  • An Oracle Cloud (OCI) tenancy with permission to create:
    • Object Storage buckets and objects
    • Data Flow applications and runs
    • IAM dynamic groups and policies
    • Logging resources

Permissions / IAM roles

You need IAM privileges to:

  • Create/manage buckets and objects in a target compartment
  • Create Data Flow apps/runs
  • Create dynamic groups and policies (or have an admin do it)
  • View logs in Logging

If you are not a tenancy admin, coordinate with your OCI administrators.

Billing requirements

  • Data Flow runs and storage usage are typically billable.
  • Logging retention and storage can also incur costs.
  • If your tenancy has Free Tier credits or trial credits, you can often complete a small lab with minimal spend.

Tools needed

  • OCI Console access
  • OCI Cloud Shell (recommended) or OCI CLI installed locally
    Cloud Shell docs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cloudshellintro.htm
    OCI CLI docs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cliconcepts.htm

Optional:

  • Git (for versioning your Spark scripts)
  • Python locally (for validation)

Region availability

  • OCI services are region-specific. Confirm Data Flow and Logging are available in your region.
  • If a service is missing, choose another region or use an alternative approach (for example, Data Integration or a compute-based Spark cluster).

Verify in official docs for your specific region’s service availability.

Quotas / limits

Typical constraints you should check:

  • Object Storage bucket limits and object size limits
  • Data Flow run limits (concurrent runs, max executors, etc.)
  • Logging retention/ingestion limits
  • IAM policy limits

See OCI service limits: https://docs.oracle.com/en-us/iaas/Content/General/Concepts/servicelimits.htm

Prerequisite services

  • Object Storage
  • Data Flow
  • IAM
  • Logging

9. Pricing / Cost

Current pricing model (accurate, non-numeric)

Because Big Data Preparation is commonly implemented using multiple OCI services, cost is the sum of the components you use, typically:

  • OCI Data Flow: billed based on Spark compute resources consumed during runs (e.g., OCPU and memory time). The exact meter names and rates are region-dependent.
  • OCI Object Storage: billed by stored GB-month, requests, and retrieval tiers (depending on storage tier).
  • OCI Logging: billed by log ingestion and/or storage/retention (depending on the OCI Logging pricing model in your tenancy/region—verify).
  • Networking: data egress charges can apply when data leaves an OCI region to the internet or another region.

Always confirm with official sources:

  • Oracle Cloud Pricing: https://www.oracle.com/cloud/pricing/
  • OCI Cost Estimator/Calculator: https://www.oracle.com/cloud/costestimator.html
  • OCI price list (region/SKU specific): https://www.oracle.com/cloud/price-list/

Pricing dimensions to watch

  • Data Flow: OCPU time, memory time, run duration, possibly storage/temp usage. Usually the biggest driver for heavy transformations.
  • Object Storage: GB-month, requests, tier (Standard/Archive), lifecycle. Curated Parquet often reduces total storage.
  • Logging: ingestion volume, retention duration. High-verbosity logs can become a hidden cost.
  • Data transfer: egress GB, inter-region transfer. Keep producers and consumers in-region when possible.
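Because total cost is the sum of these components, even a toy model helps compare scenarios. Every rate below is a placeholder you must take from the official price list (this tutorial deliberately quotes no real prices, and meter names vary by region/SKU):

```python
def monthly_cost(ocpu_hours, gb_month, log_gb, egress_gb, rates):
    """Illustrative monthly cost model: sum of the main component meters.
    All values in `rates` are user-supplied placeholders from the price list."""
    return (ocpu_hours * rates["dataflow_per_ocpu_hour"]
            + gb_month * rates["storage_per_gb_month"]
            + log_gb * rates["logging_per_gb"]
            + egress_gb * rates["egress_per_gb"])
```

Plugging in candidate volumes (e.g., doubling executor hours vs. doubling retention) quickly shows which dimension dominates your bill.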

Free tier

Oracle Free Tier offerings vary. Some services include always-free allocations, others do not, and free-tier eligibility can differ by region and tenancy type. Verify in official docs/pricing pages whether Data Flow, Logging, or other services used are included in any free tier.

Cost drivers (direct and indirect)

Direct:

  • Large Spark joins (shuffle-heavy) that extend runtime
  • High executor counts
  • Writing too many small files
  • Storing multiple copies (raw + clean + curated) without lifecycle policies

Indirect/hidden:

  • Long log retention with verbose logs
  • Cross-region replication or copying curated data
  • Egress to external BI tools outside OCI
  • Reprocessing full history instead of incremental loads

Network/data transfer implications

  • Uploading data into OCI is usually not charged as egress, but downloading data out of OCI often is.
  • Cross-region transfers can be charged.
  • Keep your data prep jobs and storage in the same region to reduce latency and cost.

How to optimize cost

  • Use partition pruning: partition by date/tenant so jobs read less data.
  • Prefer incremental processing: process “new partitions only”.
  • Control file sizes: avoid small files; aim for moderately sized output files (exact target depends on query engines; validate with your downstream tools).
  • Reduce shuffles: broadcast small dimensions, pre-aggregate, filter early.
  • Set log level appropriately; shorten retention for debug logs.

Example low-cost starter estimate (non-numeric)

A low-cost starter lab typically includes:

  • One Object Storage bucket with a few MB–GB of sample data
  • One small Data Flow run (short duration) with minimal executors
  • Short log retention (days, not months)

Use the OCI Cost Estimator with:

  1. Object Storage (Standard) for the expected data size and retention.
  2. Data Flow for a small number of OCPUs and a short runtime.
  3. Logging ingestion estimates (keep low by limiting log verbosity).

Example production cost considerations

For production:

  • Model daily volume and expected transformation time.
  • Consider peak concurrency (multiple pipelines).
  • Account for dev/test/prod environments.
  • Include data retention and backup/replication requirements.
  • Include operational tooling: alerts, dashboards, catalog scans.


10. Step-by-Step Hands-On Tutorial

This lab demonstrates a realistic Big Data Preparation workflow on Oracle Cloud using OCI Object Storage and OCI Data Flow (Apache Spark). It is designed to be safe, low-risk, and relatively low-cost, while still being “real” and repeatable.

If your tenancy provides a standalone “Big Data Preparation” product experience, treat this lab as the reference implementation of the same outcome using OCI-native services.

Objective

Build a small Big Data Preparation pipeline that:

  • Lands raw CSV data in Object Storage (raw/)
  • Runs a Spark job to:
    • enforce schema
    • clean/standardize fields
    • quarantine invalid rows
    • write curated Parquet outputs (curated/)
    • write a small CSV “sample” output for quick verification (curated_sample/)
  • Validates outputs and cleans up resources

Lab Overview

You will perform these steps:

  1. Create an Object Storage bucket and upload sample raw data.
  2. Create a Spark (PySpark) script and upload it to Object Storage.
  3. Configure IAM (dynamic group + policy) so Data Flow can access Object Storage.
  4. Create and run an OCI Data Flow application.
  5. Validate the curated output.
  6. Clean up resources to avoid ongoing charges.

Step 1: Choose a compartment and region

  1. In the OCI Console, pick a Region where Data Flow is available.
  2. Choose or create a Compartment for this lab, for example: data-labs

Expected outcome: You know the compartment where you’ll create the bucket, Data Flow app, and IAM resources.


Step 2: Create an Object Storage bucket

You can do this in the Console or via Cloud Shell.

Option A: Console

  1. Go to Storage → Object Storage & Archive Storage → Buckets
  2. Select your compartment
  3. Click Create Bucket
  4. Name: bdp-lab-bucket-<unique-suffix>
  5. Default settings are fine for a lab (verify any org policies)

Option B: OCI Cloud Shell (CLI)

In Cloud Shell, set variables:

export COMPARTMENT_OCID="ocid1.compartment.oc1..replace_me"
export BUCKET_NAME="bdp-lab-bucket-$(date +%Y%m%d%H%M)"
export NAMESPACE=$(oci os ns get --query 'data' --raw-output)

Create bucket:

oci os bucket create \
  --compartment-id "$COMPARTMENT_OCID" \
  --name "$BUCKET_NAME"

Expected outcome: A bucket exists to store raw data, scripts, and curated outputs.


Step 3: Create and upload sample raw data

Create a local CSV file in Cloud Shell:

cat > transactions_raw.csv << 'EOF'
transaction_id,customer_id,amount,currency,txn_ts,country_code,email
t001,c001,19.99,USD,2025-01-01T10:05:00Z,US,alice@example.com
t002,c002,,USD,2025-01-01T10:07:00Z,US,bob@example.com
t003,c003,5.25,USD,not_a_timestamp,CA,charlie@example.com
t004,c001,19.99,USD,2025-01-01T10:05:00Z,US,alice@example.com
t005,,100.00,USD,2025-01-02T12:00:00Z,GB,dana@example.com
t006,c004,250,EUR,2025-01-02T12:05:00Z,DE,eric@example.com
EOF

Upload it to raw/:

oci os object put \
  --bucket-name "$BUCKET_NAME" \
  --file transactions_raw.csv \
  --name raw/transactions/transactions_raw.csv

Expected outcome: The raw dataset is stored at raw/transactions/transactions_raw.csv in your bucket.

Verification:

oci os object list --bucket-name "$BUCKET_NAME" --prefix raw/transactions/

Step 4: Create the PySpark Big Data Preparation script

Create bdp_prepare_transactions.py:

cat > bdp_prepare_transactions.py << 'EOF'
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.sql.functions import col, trim, to_timestamp, lit, sha2, concat_ws, year, month

spark = SparkSession.builder.appName("BigDataPreparation-Lab").getOrCreate()

# Inputs/outputs passed as Spark configs for simplicity
input_path = spark.conf.get("spark.bdp.input_path")
curated_out = spark.conf.get("spark.bdp.curated_out")
quarantine_out = spark.conf.get("spark.bdp.quarantine_out")
sample_out = spark.conf.get("spark.bdp.sample_out")

# Explicit schema: treat most IDs as string; parse amount separately
schema = StructType([
    StructField("transaction_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("amount", StringType(), True),      # parse to double after cleaning
    StructField("currency", StringType(), True),
    StructField("txn_ts", StringType(), True),      # parse to timestamp after cleaning
    StructField("country_code", StringType(), True),
    StructField("email", StringType(), True)
])

raw = (
    spark.read
    .option("header", "true")
    .schema(schema)
    .csv(input_path)
)

# Standardize strings
df = (raw
      .withColumn("transaction_id", trim(col("transaction_id")))
      .withColumn("customer_id", trim(col("customer_id")))
      .withColumn("currency", trim(col("currency")))
      .withColumn("country_code", trim(col("country_code")))
      .withColumn("email", trim(col("email")))
)

# Parse timestamp and amount
df = df.withColumn("txn_ts_parsed", to_timestamp(col("txn_ts"))) \
       .withColumn("amount_num", col("amount").cast(DoubleType()))

# Basic quality rules:
# - transaction_id must be present
# - customer_id must be present
# - txn_ts must parse
# - amount must be numeric and non-negative
valid = df.where(
    col("transaction_id").isNotNull() &
    (col("transaction_id") != "") &
    col("customer_id").isNotNull() &
    (col("customer_id") != "") &
    col("txn_ts_parsed").isNotNull() &
    col("amount_num").isNotNull() &
    (col("amount_num") >= 0)
)

# Note: subtract() compares whole rows and also collapses exact duplicates
# among the rejects; acceptable for this lab.
invalid = df.subtract(valid).withColumn("reject_reason", lit("failed_basic_validation"))

# De-duplicate exact duplicates by transaction_id (keep first occurrence)
# For real pipelines, choose deterministic ordering (e.g., ingestion time). Here, dropDuplicates is enough.
valid_dedup = valid.dropDuplicates(["transaction_id"])

# Minimal PII protection example: hash email (still treat this as sensitive depending on your policy)
curated = (valid_dedup
           .withColumn("email_sha256", sha2(concat_ws("||", col("email")), 256))
           .drop("email")
           .drop("txn_ts")
           .withColumnRenamed("txn_ts_parsed", "txn_ts")
           .drop("amount")
           .withColumnRenamed("amount_num", "amount")
           .withColumn("year", year(col("txn_ts")))
           .withColumn("month", month(col("txn_ts")))
)

# Write curated Parquet partitioned by year/month (common pattern)
(curated
 .write
 .mode("overwrite")
 .partitionBy("year", "month")
 .parquet(curated_out)
)

# Write quarantine rows as CSV for easy inspection
(invalid
 .write
 .mode("overwrite")
 .option("header", "true")
 .csv(quarantine_out)
)

# Write a small curated CSV sample (for quick validation without Parquet readers)
(curated
 .limit(50)
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .csv(sample_out)
)

print("Big Data Preparation job completed.")
print(f"Curated output: {curated_out}")
print(f"Quarantine output: {quarantine_out}")
print(f"Curated sample: {sample_out}")

spark.stop()
EOF

Upload the script to Object Storage:

oci os object put \
  --bucket-name "$BUCKET_NAME" \
  --file bdp_prepare_transactions.py \
  --name scripts/bdp_prepare_transactions.py

Expected outcome: Your Spark script is stored at scripts/bdp_prepare_transactions.py.
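As a cheap pre-flight check before paying for a Data Flow run, you can mirror the job's basic quality rules in plain Python against a handful of rows. This is a sketch, not the Spark logic itself; `is_valid` and the hard-coded rows are illustrative:

```python
from datetime import datetime

def is_valid(row: dict) -> bool:
    """Pure-Python mirror of the job's basic quality rules (sketch)."""
    if not row.get("transaction_id", "").strip():
        return False
    if not row.get("customer_id", "").strip():
        return False
    # txn_ts must parse as ISO-8601 (the sample uses a trailing 'Z')
    try:
        datetime.strptime(row["txn_ts"], "%Y-%m-%dT%H:%M:%SZ")
    except (KeyError, ValueError):
        return False
    # amount must be numeric and non-negative
    try:
        return float(row["amount"]) >= 0
    except (KeyError, ValueError):
        return False

rows = [
    {"transaction_id": "t004", "customer_id": "c001", "amount": "19.99",
     "txn_ts": "2025-01-01T10:05:00Z"},
    {"transaction_id": "t003", "customer_id": "c003", "amount": "5.25",
     "txn_ts": "not_a_timestamp"},       # rejected: bad timestamp
    {"transaction_id": "t005", "customer_id": "", "amount": "100.00",
     "txn_ts": "2025-01-02T12:00:00Z"},  # rejected: missing customer_id
]
print([is_valid(r) for r in rows])  # [True, False, False]
```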


Step 5: Configure IAM so Data Flow can access Object Storage

OCI Data Flow typically uses resource principals + dynamic groups.

You will:
  1. Create a dynamic group matching Data Flow run resources.
  2. Create an IAM policy granting that dynamic group access to the bucket/objects.

Because IAM policy grammar and the exact resource type names can vary, verify the latest Data Flow IAM documentation for the precise dynamic-group matching rule and policy statements:
  • Data Flow docs: https://docs.oracle.com/en-us/iaas/data-flow/using/overview.htm
  • IAM policies: https://docs.oracle.com/en-us/iaas/Content/Identity/Concepts/policies.htm
  • Dynamic groups: https://docs.oracle.com/en-us/iaas/Content/Identity/Tasks/managingdynamicgroups.htm

5.1 Create a dynamic group (Console)

  1. Go to Identity & Security → Identity → Dynamic Groups
  2. Click Create Dynamic Group
  3. Name: dg-dataflow-bdp-lab
  4. Description: Dynamic group for Data Flow runs in the lab compartment
  5. Matching rule (example pattern): limit to the lab compartment.

A common approach is to match Data Flow runs by compartment. The exact resource type string must be verified in docs. Example (verify before using):

  • ALL {resource.compartment.id = 'ocid1.compartment.oc1..replace_me'}

Some OCI services require more specific matching (resource type). Use the official Data Flow IAM example for correctness.

5.2 Create an IAM policy (Console)

  1. Go to Identity & Security → Identity → Policies
  2. Select the root compartment or an appropriate parent compartment based on your org model.
  3. Click Create Policy
  4. Name: policy-dataflow-bdp-lab
  5. Statements (example—verify exact verbs/resources in docs):

Typical required permissions include reading buckets and managing objects in the compartment that contains your bucket:

  • Allow dynamic-group dg-dataflow-bdp-lab to read buckets in compartment data-labs
  • Allow dynamic-group dg-dataflow-bdp-lab to manage objects in compartment data-labs

If you store logs in OCI Logging and need access, add Logging permissions as required by your observability design (the job itself often does not need to view or use log groups; the platform emits logs on its behalf). Verify in docs.

Expected outcome: Data Flow runs can read the script and input data and write outputs to your bucket.


Step 6: Create an OCI Data Flow application

You can do this in the Console.

  1. Go to Analytics & AI → Data Flow
  2. Click Create Application
  3. Name: bdp-lab-prepare-transactions
  4. Choose your compartment (data-labs)
  5. Choose Language/Type appropriate for Spark application (PySpark).
  6. For “Application file” or “Script path”, point to the script in Object Storage: oci://<bucket>@<namespace>/scripts/bdp_prepare_transactions.py

Verify the exact Data Flow UI fields (they can evolve). In some workflows you create an application and then configure runs; in others you submit runs with parameters.

Expected outcome: You have a Data Flow application ready to run.


Step 7: Run the Data Flow job with parameters

You must provide:
  • input path: oci://bucket@namespace/raw/transactions/transactions_raw.csv
  • curated output path: oci://bucket@namespace/curated/transactions_parquet/
  • quarantine output path: oci://bucket@namespace/quarantine/transactions_invalid/
  • sample output path: oci://bucket@namespace/curated_sample/transactions_csv/

In the Data Flow run configuration, set Spark config properties:

  • spark.bdp.input_path=oci://...
  • spark.bdp.curated_out=oci://...
  • spark.bdp.quarantine_out=oci://...
  • spark.bdp.sample_out=oci://...

Also choose a small shape/executor count for a low-cost lab.

Start the run.

Expected outcome: The run transitions through statuses (Accepted → In Progress → Succeeded). You can view driver/executor logs in the run details.


Validation

Validate objects were written

List the curated prefixes:

oci os object list --bucket-name "$BUCKET_NAME" --prefix curated/
oci os object list --bucket-name "$BUCKET_NAME" --prefix quarantine/
oci os object list --bucket-name "$BUCKET_NAME" --prefix curated_sample/

You should see:
  • curated/transactions_parquet/ with partition folders like year=2025/month=1/
  • quarantine/transactions_invalid/ with CSV parts
  • curated_sample/transactions_csv/ with a single CSV part (because the job coalesces to 1 file)

Download and inspect the curated sample CSV

Find the CSV part object name under curated_sample/transactions_csv/ and download it:

# List objects to find the part file name
oci os object list --bucket-name "$BUCKET_NAME" --prefix curated_sample/transactions_csv/

# Replace OBJECT_NAME with the part file name you found
export OBJECT_NAME="curated_sample/transactions_csv/part-00000-....csv"

oci os object get \
  --bucket-name "$BUCKET_NAME" \
  --name "$OBJECT_NAME" \
  --file curated_sample_out.csv

Inspect:

head -n 50 curated_sample_out.csv

What you should observe:
  • Rows with missing amount, missing customer_id, or invalid timestamp are excluded from curated output.
  • Duplicate transaction_id values are removed.
  • The email column is replaced by email_sha256.
  • amount is numeric.
  • year and month columns exist.
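Those spot checks can be automated with a small checker. The sketch below mirrors the job's email hashing with `hashlib` (for a single column, `sha2(concat_ws("||", email), 256)` reduces to a SHA-256 of the raw email string) and flags violations in a curated row; `check_curated_row` is a hypothetical helper, not part of the job:

```python
import hashlib

def curate_email(email: str) -> str:
    """Mirror of the job's sha2(..., 256) hashing, for local checks."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def check_curated_row(row: dict) -> list:
    """Return a list of violations for one curated row (sketch)."""
    problems = []
    if "email" in row:
        problems.append("raw email leaked into curated output")
    h = row.get("email_sha256", "")
    if len(h) != 64 or any(c not in "0123456789abcdef" for c in h):
        problems.append("email_sha256 is not a 64-char hex digest")
    if not isinstance(row.get("amount"), float):
        problems.append("amount is not numeric")
    if "year" not in row or "month" not in row:
        problems.append("partition columns missing")
    return problems

row = {"transaction_id": "t004", "customer_id": "c001", "amount": 19.99,
       "currency": "USD", "txn_ts": "2025-01-01T10:05:00Z",
       "country_code": "US", "email_sha256": curate_email("alice@example.com"),
       "year": 2025, "month": 1}
print(check_curated_row(row))  # [] -- no violations
```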


Troubleshooting

Common issues and fixes:

  1. 403 / NotAuthorizedOrNotFound when reading/writing Object Storage
     Cause: dynamic group or policy is wrong/missing.
     Fix:

    • Confirm the Data Flow run is in the compartment matched by the dynamic group.
    • Confirm policy statements grant access to the compartment/bucket.
    • Use official Data Flow IAM policy examples (verify in docs).
  2. Script not found
     Cause: incorrect oci:// URI, wrong namespace, wrong bucket/object path.
     Fix:

    • Confirm NAMESPACE=$(oci os ns get ...)
    • Confirm object exists: oci os object head --bucket-name ... --name scripts/bdp_prepare_transactions.py
  3. Job succeeds but outputs are empty
     Cause: validation rules filtered all rows, or input schema mismatch.
     Fix:

    • Relax rules, print counts in the script, and re-run.
    • Confirm the CSV header matches expected columns.
  4. Spark parsing issues
     Cause: timestamp format not recognized.
     Fix:

    • Use to_timestamp(col("txn_ts"), "yyyy-MM-dd'T'HH:mm:ssX") or a custom format.
    • Add explicit parsing formats and fallback logic.
  5. Performance issues / slow run
     Cause: shuffle-heavy transformations, too few executors, or too many small files.
     Fix:

    • Filter early.
    • Avoid wide joins for the lab.
    • Increase resources (cost tradeoff).
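For item 4, the "explicit formats with fallback logic" idea can be prototyped in pure Python before translating it into explicit to_timestamp formats in the Spark job. `FORMATS` and `parse_ts` are illustrative names:

```python
from datetime import datetime, timezone

# Candidate formats, tried in order; extend the list for your sources.
FORMATS = ["%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%d %H:%M:%S"]

def parse_ts(value: str):
    """Try each known timestamp format; return None when all fail."""
    for fmt in FORMATS:
        try:
            ts = datetime.strptime(value, fmt)
            # Treat values without an offset (including a literal 'Z') as UTC
            return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    return None

print(parse_ts("2025-01-01T10:05:00Z"))  # parses as UTC
print(parse_ts("not_a_timestamp"))       # None -> route to quarantine
```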

Cleanup

To avoid ongoing charges:

  1. Terminate/delete Data Flow resources
     • Ensure the run is finished.
     • Delete the Data Flow application if it’s only for the lab.

  2. Delete objects and bucket
     • Buckets must be empty before deletion.

CLI example:

# Delete all objects in the bucket (be careful!)
oci os object bulk-delete --bucket-name "$BUCKET_NAME" --force

# Delete the bucket
oci os bucket delete --bucket-name "$BUCKET_NAME" --force

  3. Remove IAM policy and dynamic group
     • Delete policy-dataflow-bdp-lab
     • Delete dg-dataflow-bdp-lab

Expected outcome: No remaining lab resources, minimizing cost.


11. Best Practices

Architecture best practices

  • Use a zoned data lake approach: raw/, clean/, curated/, quarantine/.
  • Use immutable raw data when possible; curate into new paths instead of overwriting raw.
  • Prefer Parquet for curated datasets and partition by common filters (often date).
  • Implement incremental processing: process only new partitions to control cost.
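Incremental processing usually reduces to a set difference between partitions present under raw/ and partitions already curated. A minimal sketch, assuming you obtain `available` by listing prefixes and track `processed` in a manifest or state store (both hypothetical here):

```python
def new_partitions(available: set, processed: set) -> list:
    """Daily partitions present in raw/ but not yet curated (sketch)."""
    return sorted(available - processed)

# In practice: available comes from listing raw/ prefixes,
# processed from a manifest written after each successful run.
available = {"date=2025-01-01", "date=2025-01-02", "date=2025-01-03"}
processed = {"date=2025-01-01", "date=2025-01-02"}
print(new_partitions(available, processed))  # ['date=2025-01-03']
```

Only the returned partitions are fed to the next Data Flow run, which keeps compute cost proportional to new data.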

IAM/security best practices

  • Use least privilege: restrict Data Flow runs to only required buckets/prefixes (where possible).
  • Separate dev/test/prod into different compartments and policies.
  • Prefer resource principals over embedding user API keys in scripts.
  • Use Vault for customer-managed keys if required by compliance (verify service support).

Cost best practices

  • Right-size Spark resources and set max executor counts for predictable cost.
  • Avoid small files; compact outputs.
  • Set lifecycle policies on raw and intermediate zones.
  • Control Logging verbosity and retention.

Performance best practices

  • Filter early, project only needed columns.
  • Use broadcast joins for small dimension tables (validate size).
  • Partition wisely; avoid over-partitioning.
  • Monitor shuffle spill and executor memory behavior in logs/metrics.

Reliability best practices

  • Make jobs idempotent: re-runs should not corrupt curated outputs.
  • Write to staging then promote (rename/move) for critical pipelines (Object Storage rename semantics differ; verify best pattern).
  • Emit checkpoints and run metadata (input version, code version, row counts).
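Run metadata can be as simple as a JSON document written next to the curated output. A sketch with illustrative field names:

```python
import json
from datetime import datetime, timezone

def run_metadata(input_path, code_version, rows_in, rows_valid, rows_quarantined):
    """Assemble a run-metadata record to store alongside the curated output."""
    return {
        "input_path": input_path,
        "code_version": code_version,
        "run_at": datetime.now(timezone.utc).isoformat(),
        "rows_in": rows_in,
        "rows_valid": rows_valid,
        "rows_quarantined": rows_quarantined,
        "quarantine_rate": round(rows_quarantined / rows_in, 4) if rows_in else None,
    }

meta = run_metadata("oci://bucket@ns/raw/transactions/", "v1.2.0",
                    rows_in=1000, rows_valid=968, rows_quarantined=32)
print(json.dumps(meta, indent=2))
```

Comparing these records across runs is what makes the alerting rules in the next subsection possible.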

Operations best practices

  • Standardize:
    • naming (bdp-<domain>-<env>-<pipeline>)
    • tagging (cost-center, owner, data-classification)
    • runbooks (common failure modes)
  • Alert on:
    • run failures
    • abnormal row count deltas
    • increasing quarantine rate
  • Track SLAs: data freshness and completeness.
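The alerting rules above can be expressed as simple threshold checks on per-run metrics. A sketch with illustrative thresholds (tune them to your pipelines):

```python
def alerts(prev_rows: int, curr_rows: int, quarantine_rate: float,
           delta_threshold: float = 0.5, quarantine_threshold: float = 0.05) -> list:
    """Flag abnormal row-count deltas and high quarantine rates (sketch)."""
    found = []
    # Row count changed by more than delta_threshold vs the previous run
    if prev_rows and abs(curr_rows - prev_rows) / prev_rows > delta_threshold:
        found.append("row_count_delta")
    # Too large a share of rows routed to quarantine
    if quarantine_rate > quarantine_threshold:
        found.append("quarantine_rate")
    return found

print(alerts(prev_rows=1000, curr_rows=1020, quarantine_rate=0.01))  # []
print(alerts(prev_rows=1000, curr_rows=300, quarantine_rate=0.12))
# ['row_count_delta', 'quarantine_rate']
```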

Governance/tagging/naming best practices

  • Maintain dataset owners and data contracts (schema + SLA + allowed use).
  • Use a catalog where feasible and keep metadata updated.
  • Use consistent prefixes and avoid “misc/” dumping grounds.

12. Security Considerations

Identity and access model

  • OCI uses IAM policies scoped to compartments and resource types.
  • For Big Data Preparation jobs (Data Flow), prefer resource principals:
    • the Data Flow run joins a dynamic group
    • a policy grants that group access to required storage and services

This avoids distributing user credentials.

Encryption

  • OCI services typically encrypt data at rest by default (service-managed keys).
  • For regulated environments, use customer-managed keys (CMK) in OCI Vault where supported.
  • Ensure encryption in transit (TLS) for all service calls.

Verify encryption options for each service in official docs.

Network exposure

  • Prefer private connectivity patterns if required by your security posture:
    • private endpoints / VCN integration (service-dependent)
  • Restrict bucket access:
    • avoid public buckets unless explicitly required
    • use pre-authenticated requests only for controlled sharing, if at all (verify best practices)

Secrets handling

  • Do not hard-code passwords or API keys in Spark scripts.
  • Use resource principals; store secrets in Vault if external credentials are needed.
  • Restrict who can read job scripts (Object Storage policies).

Audit/logging

  • OCI Audit captures API calls and changes to many resources.
  • Centralize logs for pipeline operations (Data Flow logs, orchestrator logs).
  • Restrict log access because logs can contain sensitive data (avoid logging raw rows).

Compliance considerations

  • Data classification: define what is PII, PCI, PHI, etc.
  • Apply masking/tokenization as part of preparation if needed.
  • Enforce retention and deletion policies aligned to compliance.

Common security mistakes

  • Overly broad policies like “manage all-resources in tenancy”
  • Storing curated sensitive datasets in the same bucket/prefix as raw data without controls
  • Leaving buckets public
  • Logging raw records with PII

Secure deployment recommendations

  • Use separate compartments for raw vs curated if you need stronger separation.
  • Use CMKs for sensitive datasets and restrict key usage.
  • Use CI/CD to deploy scripts; enforce code review on transformation logic.
  • Implement data access approvals for curated “gold” datasets.

13. Limitations and Gotchas

Because Big Data Preparation on Oracle Cloud is usually composed of multiple services, limitations are often service-specific. Common gotchas include:

  • IAM complexity: Dynamic group matching rules and policies are frequent sources of run failures.
  • Region constraints: Not all OCI services/features are available in all regions.
  • Small files problem: Naive Spark writes can create many small files, hurting performance and raising costs.
  • Schema inference pitfalls: Inference can change types between runs; explicit schemas are safer.
  • Logging costs: High-volume logs or long retention can cost more than expected.
  • Data egress surprises: Downloading curated data to the public internet can incur charges.
  • Operational maturity requirement: Pipelines need monitoring, alerting, and ownership; otherwise they become brittle.
  • Eventual consistency assumptions: Always design for retry and idempotency; verify Object Storage behavior in your region and workflow.
  • Upgrades/compatibility: Spark runtime versions and libraries matter; pin versions and test.
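
The schema-drift gotcha is cheap to guard against: compare each batch's observed schema to the expected one and classify the changes. A minimal sketch with an illustrative expected schema:

```python
# Expected schema of the curated dataset (illustrative, mirrors the lab job).
EXPECTED = {"transaction_id": "string", "customer_id": "string",
            "amount": "double", "currency": "string", "txn_ts": "timestamp",
            "country_code": "string", "email_sha256": "string"}

def schema_drift(observed: dict) -> dict:
    """Classify drift: added columns are usually safe; missing or
    type-changed columns are breaking and should fail the run."""
    return {
        "added": sorted(set(observed) - set(EXPECTED)),
        "missing": sorted(set(EXPECTED) - set(observed)),
        "type_changed": sorted(k for k in EXPECTED
                               if k in observed and observed[k] != EXPECTED[k]),
    }

observed = dict(EXPECTED, amount="string", channel="string")
print(schema_drift(observed))
# {'added': ['channel'], 'missing': [], 'type_changed': ['amount']}
```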

For exact quotas and service limits, see: https://docs.oracle.com/en-us/iaas/Content/General/Concepts/servicelimits.htm


14. Comparison with Alternatives

Big Data Preparation can be delivered via multiple OCI-native options and external alternatives.

Comparison table

| Option | Best For | Strengths | Weaknesses | When to Choose |
| --- | --- | --- | --- | --- |
| OCI Data Flow (Spark) | Large-scale transformations, batch prep, heavy joins/aggregations | Elastic Spark, good for big data patterns, integrates with OCI | Requires Spark skills; tuning needed; cost scales with runtime | When data volume is large or transformations are complex |
| OCI Data Integration | Managed ETL/ELT orchestration, connectors, pipeline management | Visual design, scheduling, connectors, governance hooks | May not match Spark flexibility/performance for very large jobs | When you need managed orchestration and standardized ETL |
| Oracle Analytics Cloud (data preparation features) | Analyst-led interactive prep for BI datasets | Self-service prep, quick iteration for BI | Not intended for massive batch pipelines; licensing cost | When business users need interactive prep for reporting |
| Autonomous Database (SQL-based prep) | Structured data prep, ELT within the database | Strong SQL engine, governance, transactional consistency | Not ideal for raw file lakes or huge semi-structured workloads without staging | When data is already in ADW/ATP and SQL is enough |
| Self-managed Spark on Compute (or Kubernetes) | Maximum control over Spark runtime and dependencies | Full flexibility | Higher ops burden; patching, scaling, security are on you | When you need custom Spark builds or strict control |
| AWS Glue / Glue Studio / DataBrew | AWS-centric ETL and data prep | Deep AWS integration | Not OCI-native; cross-cloud data movement adds complexity/cost | When your data platform is primarily on AWS |
| Azure Data Factory / Synapse pipelines | Azure-centric orchestration and ETL | Deep Azure integration | Not OCI-native | When your data platform is primarily on Azure |
| GCP Dataflow / Dataproc | GCP-centric data processing | Mature data processing ecosystem | Not OCI-native | When your data platform is primarily on GCP |

Notes:
  • Some “Dataprep”-branded products in other clouds have changed over time (for example, Google Cloud Dataprep was discontinued). Always verify current product availability and roadmaps in official vendor docs.


15. Real-World Example

Enterprise example: Retail analytics and forecasting

Problem
A retail enterprise has:
  • Daily sales transactions (POS)
  • E-commerce events
  • Inventory updates
All arrive as files from multiple systems. Reports disagree due to inconsistent product IDs and missing timestamps.

Proposed architecture (Oracle Cloud)
  • Object Storage zones:
    • raw/pos/, raw/ecom/, raw/inventory/
    • curated/sales/, curated/sessions/, curated/inventory/
  • Data Flow (Spark):
    • Standardize product IDs
    • Validate timestamps and amounts
    • Join product master reference
    • Create curated Parquet partitioned by business date
  • Data Integration:
    • Orchestrate daily pipelines and dependencies
  • Data Catalog:
    • Register curated datasets and owners
  • Consumers:
    • ADW for executive reporting
    • OAC for dashboards
    • Data Science for demand forecasting features

Why Big Data Preparation was chosen
  • Needed scalable processing for large datasets
  • Required reproducible, governed curated outputs
  • Preferred OCI-native IAM, audit, and integration

Expected outcomes
  • Consistent KPIs across channels
  • Faster dashboard refresh and lower query cost (Parquet, partitions)
  • Reduced incident rate due to quality gates and quarantining

Startup/small-team example: SaaS product telemetry

Problem
A startup collects product events and application logs. Data is messy (duplicate events, missing user IDs). They need weekly retention dashboards and churn signals.

Proposed architecture (Oracle Cloud)
  • Object Storage:
    • raw/events/ landed from the ingestion service
    • curated/events/ cleaned and deduped
  • Data Flow:
    • Deduplicate by event ID
    • Normalize event schema
    • Produce weekly aggregates and user activity tables
  • Lightweight governance:
    • tags for ownership and environment
    • simple runbooks and alerts

Why Big Data Preparation was chosen
  • They needed a cost-controlled batch pipeline without running permanent clusters.
  • Spark batch jobs were a straightforward fit.

Expected outcomes
  • Reliable event tables for product analytics
  • Predictable weekly pipelines with clear failure modes
  • A foundation to expand into ML features later


16. FAQ

1) Is “Big Data Preparation” a standalone OCI service?
In many OCI environments, Big Data Preparation is implemented as a capability using services like Object Storage, Data Flow, and Data Integration rather than a single standalone service tile. Verify in official Oracle Cloud documentation and your tenancy console if a product explicitly named “Big Data Preparation” is available to you.

2) What’s the simplest Oracle Cloud setup for Big Data Preparation?
Object Storage for raw/curated data + OCI Data Flow (Spark) for transformations + IAM policies for secure access.

3) Do I need Spark to do Big Data Preparation on Oracle Cloud?
Not always. For structured datasets already in a database, SQL-based ELT may be sufficient. Spark is most useful for large-scale file-based datasets and complex transformations.

4) How should I organize my Object Storage paths?
Use zones and consistent prefixes such as raw/<source>/, curated/<domain>/, and quarantine/<pipeline>/. Add partitioning folders like date=YYYY-MM-DD/ when appropriate.

5) What format should I use for curated datasets?
Parquet is a common default for analytics due to columnar storage and compression. Validate compatibility with your query/BI tools.

6) How do I prevent bad data from reaching dashboards?
Implement data quality gates: validate required fields, types, ranges, and uniqueness. Quarantine failures and publish quality metrics per run.

7) How do I handle schema drift?
Use explicit schemas where possible, version schemas, and add controlled evolution logic (new columns allowed, breaking changes flagged). Treat schema changes as a change-management process.

8) How do I secure PII during preparation?
Mask, tokenize, or hash PII fields in curated outputs; restrict access to raw zones; enforce least privilege; avoid logging raw records.

9) What IAM pattern should I use for Data Flow accessing Object Storage?
Use resource principals via dynamic groups and compartment-scoped policies. Verify the exact policy statements in Data Flow’s IAM documentation.

10) How do I monitor Big Data Preparation pipelines?
Use Data Flow run status and logs, OCI Logging, and (optionally) alarms/notifications. Track row counts, reject rates, and runtime.

11) How do I make pipelines idempotent?
Write outputs to versioned or partitioned paths, use overwrite carefully, and ensure re-runs produce the same result for the same input.
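On a local filesystem, the write-to-staging-then-promote idea behind careful overwrites looks like a temp-file write followed by an atomic rename. This is a sketch of the pattern only, not an Object Storage implementation (Object Storage has no true rename; there you would write to a staging prefix and then copy/commit):

```python
import json
import os
import tempfile

def write_atomically(path: str, records: list) -> None:
    """Write to a temp file, then promote with an atomic rename.

    A re-run that fails midway never leaves a half-written file at `path`,
    and repeating the call with the same input yields the same result.
    """
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(records, f)
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise

records = [{"transaction_id": "t004", "amount": 19.99}]
write_atomically("curated_demo.json", records)
write_atomically("curated_demo.json", records)  # re-run: same result, no corruption
print(json.load(open("curated_demo.json")))
```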

12) What are common causes of slow Spark jobs?
Large shuffles from joins, data skew, too many small files, insufficient executor memory, and reading too much unfiltered data.

13) How do I reduce cost without sacrificing reliability?
Process incrementally, partition data, right-size compute, compact files, and apply lifecycle policies to limit retention of intermediates.

14) Can I use Big Data Preparation outputs in ADW and OAC?
Yes. A common pattern is to curate in Object Storage and then load into ADW for BI, or have OAC consume curated datasets depending on your architecture and data governance.

15) What’s the right way to test Big Data Preparation pipelines?
Use unit tests for transformation functions, integration tests on small sample datasets, and data-quality regression checks (expected row counts, distribution checks).


17. Top Online Resources to Learn Big Data Preparation

These resources focus on Oracle Cloud services typically used to build Big Data Preparation solutions. Verify the latest product positioning and feature availability in official documentation.

| Resource Type | Name | Why It Is Useful |
| --- | --- | --- |
| Official docs | OCI Data Flow Documentation | Core service for Spark-based big data preparation on Oracle Cloud: https://docs.oracle.com/en-us/iaas/data-flow/using/overview.htm |
| Official docs | OCI Object Storage Documentation | Data lake storage foundation: https://docs.oracle.com/en-us/iaas/Content/Object/home.htm |
| Official docs | OCI Data Integration Documentation | Managed ETL/ELT and orchestration: https://docs.oracle.com/en-us/iaas/data-integration/using/overview.htm |
| Official docs | OCI IAM Documentation | Policies, dynamic groups, least privilege: https://docs.oracle.com/en-us/iaas/Content/Identity/home.htm |
| Official docs | OCI Logging Documentation | Operational logs and troubleshooting: https://docs.oracle.com/en-us/iaas/Content/Logging/home.htm |
| Official docs | OCI Audit Documentation | Compliance and change traceability: https://docs.oracle.com/en-us/iaas/Content/Audit/home.htm |
| Official docs | OCI Data Catalog Documentation | Metadata management and discovery: https://docs.oracle.com/en-us/iaas/data-catalog/home.htm |
| Pricing | Oracle Cloud Pricing | Entry point for pricing and SKUs: https://www.oracle.com/cloud/pricing/ |
| Pricing tool | OCI Cost Estimator | Model cost drivers for Data Flow, storage, logging: https://www.oracle.com/cloud/costestimator.html |
| Reference architectures | Oracle Architecture Center | Find data lake / analytics reference architectures (search within): https://docs.oracle.com/en/solutions/ |
| Tutorials/labs | OCI Cloud Shell | Practical CLI-driven labs: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cloudshellintro.htm |
| Tutorials | OCI CLI Getting Started | Automate setup and validation: https://docs.oracle.com/en-us/iaas/Content/API/Concepts/cliconcepts.htm |
| Community (verify) | Oracle Cloud blogs and examples | Practical patterns; validate against docs: https://blogs.oracle.com/cloud-infrastructure/ |
| Code examples (verify) | Oracle samples on GitHub | Search for OCI Data Flow samples; validate repository trust and freshness: https://github.com/oracle |

18. Training and Certification Providers

The following training providers may offer courses relevant to Oracle Cloud and Big Data Preparation (data engineering, DevOps, SRE, cloud operations). Verify current course catalogs and delivery modes on their websites.

| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
| --- | --- | --- | --- | --- |
| DevOpsSchool.com | DevOps engineers, cloud engineers, platform teams | DevOps + cloud automation foundations; may include OCI tooling | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate DevOps learners | SCM, CI/CD, cloud basics | Check website | https://www.scmgalaxy.com/ |
| CLoudOpsNow.in | Cloud operations teams | Cloud operations, monitoring, reliability practices | Check website | https://cloudopsnow.in/ |
| SreSchool.com | SREs, operations, platform teams | Reliability engineering, incident response, observability | Check website | https://sreschool.com/ |
| AiOpsSchool.com | Ops + data/AI practitioners | AIOps concepts, monitoring automation | Check website | https://aiopsschool.com/ |

19. Top Trainers

These sites are presented as training resources/platforms. Verify the specific trainer profiles, OCI coverage, and current offerings directly on each site.

| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
| --- | --- | --- | --- |
| RajeshKumar.xyz | DevOps/cloud training (verify OCI coverage) | Engineers seeking hands-on mentoring | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps tooling and practices | Beginners to intermediate DevOps learners | https://devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps/consulting/training (verify) | Teams needing short-term enablement | https://devopsfreelancer.com/ |
| devopssupport.in | DevOps support and training resources (verify) | Ops/DevOps teams looking for guided support | https://devopssupport.in/ |

20. Top Consulting Companies

These companies may provide consulting services relevant to Oracle Cloud, data platforms, and operationalization. Descriptions below are general; verify exact service offerings and references on their websites.

| Company | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
| --- | --- | --- | --- | --- |
| cotocus.com | Cloud/DevOps consulting (verify specialties) | Architecture, automation, operational practices | Setting up CI/CD for data pipelines; governance and IAM design | https://cotocus.com/ |
| DevOpsSchool.com | DevOps enablement and consulting | Training + implementation support | Building runbooks and monitoring for OCI-based pipelines; IaC practices | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify portfolio) | DevOps transformations, cloud operations | Standardizing environments, IAM, logging and alerting around data workloads | https://devopsconsulting.in/ |

21. Career and Learning Roadmap

What to learn before Big Data Preparation (Oracle Cloud)

  • OCI fundamentals:
    • compartments, VCN basics, IAM policies, tagging
  • Object Storage concepts:
    • buckets, prefixes, lifecycle, access control
  • Data engineering fundamentals:
    • schemas, partitioning, file formats (CSV vs Parquet)
  • Basic SQL and data modeling
  • Basic Python (for PySpark) or Spark fundamentals

What to learn after

  • Advanced Spark:
    • joins, shuffles, partitioning strategies, performance tuning
  • Orchestration and CI/CD:
    • parameterization, environment promotion, automated testing
  • Data governance:
    • catalogs, lineage, data contracts, stewardship
  • Observability:
    • metrics, alerting, SLOs for data freshness and quality
  • Security specialization:
    • CMK design, auditing, PII handling patterns

Job roles that use it

  • Data Engineer / Senior Data Engineer
  • Analytics Engineer
  • Cloud Engineer (Data Platform)
  • Platform Engineer (Data)
  • SRE/Operations for Data Platforms
  • ML Engineer (feature pipelines)

Certification path (if available)

Oracle certification offerings change over time and by product line. For an OCI-focused path:
  • Start with OCI foundations certifications (if relevant)
  • Add data-focused OCI certifications where applicable

Verify current Oracle certification paths on Oracle University: https://education.oracle.com/

Project ideas for practice

  • Build a multi-zone data lake on Object Storage with curated Parquet outputs.
  • Add a data quality framework: publish quality metrics and alert on regressions.
  • Implement incremental partition processing (daily partitions).
  • Add a cataloging step and enforce dataset metadata requirements.
  • Secure a dataset with CMKs and compartment-based segmentation.

22. Glossary

  • Big Data Preparation: The process of turning raw, messy datasets into clean, governed, analytics-ready datasets.
  • OCI (Oracle Cloud Infrastructure): Oracle Cloud’s infrastructure platform for compute, storage, networking, and cloud services.
  • Compartment: OCI’s logical isolation boundary for organizing and securing resources.
  • IAM Policy: A set of statements defining who can access which OCI resources and how.
  • Dynamic Group: A group of OCI resources (not users) that can be granted permissions via policies.
  • Resource Principal: An OCI authentication method where a resource (like a Data Flow run) authenticates without user credentials.
  • Object Storage: Durable, scalable storage for unstructured data such as files and logs.
  • Data Flow: OCI’s managed Apache Spark service for distributed data processing.
  • Spark (Apache Spark): A distributed processing engine commonly used for large-scale transformations and analytics.
  • Parquet: A columnar storage format optimized for analytics workloads.
  • Partitioning: Organizing data into folders/segments by keys (like date) to speed up reads and reduce cost.
  • Quarantine (data): A location where invalid or rejected records are stored for investigation.
  • Data drift / schema drift: Changes in incoming data structure or distributions over time.
  • Idempotent pipeline: A pipeline that can be re-run safely without producing incorrect duplicates or corruption.
  • PII: Personally Identifiable Information (e.g., name, email, phone).
  • CMK: Customer-Managed Key (encryption key managed by the customer, typically via OCI Vault).
  • Data contract: Agreed schema, semantics, SLAs, and usage constraints between producers and consumers.

23. Summary

Big Data Preparation in Oracle Cloud is the practical capability of transforming raw, inconsistent data into curated, trusted datasets for analytics and machine learning. In many OCI environments, it is implemented by combining Object Storage (data lake zones), OCI Data Flow (Spark) (scalable transformations), and optionally OCI Data Integration and OCI Data Catalog (orchestration and governance).

It matters because it improves data quality, reduces downstream failures, accelerates reporting/ML, and creates repeatable pipelines with strong IAM and audit controls. Cost is primarily driven by Data Flow compute runtime, Object Storage footprint, Logging retention, and data egress—optimize via incremental processing, partitioning, and careful log retention.

Use Oracle Cloud Big Data Preparation patterns when you need reliable, scalable data curation integrated with OCI security and operations. Next, deepen your skills in Spark tuning, IAM policy design, and production-grade observability so your data pipelines are not just functional, but dependable.