{"id":885,"date":"2026-04-16T13:32:04","date_gmt":"2026-04-16T13:32:04","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/oracle-cloud-data-flow-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-data-management\/"},"modified":"2026-04-16T13:32:04","modified_gmt":"2026-04-16T13:32:04","slug":"oracle-cloud-data-flow-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-data-management","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/oracle-cloud-data-flow-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-data-management\/","title":{"rendered":"Oracle Cloud Data Flow Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Data Management"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>Data Management<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Oracle Cloud <strong>Data Flow<\/strong> is a managed, serverless <strong>Apache Spark<\/strong> service on Oracle Cloud Infrastructure (OCI) designed for large-scale batch data processing and ETL\/ELT workloads.<\/p>\n\n\n\n<p>In simple terms: you upload (or reference) your Spark application, point it at your data (often in OCI Object Storage), and Data Flow runs the job for you\u2014without you provisioning, patching, or scaling a Spark cluster.<\/p>\n\n\n\n<p>Technically, Data Flow orchestrates Spark drivers and executors as an OCI-managed service. You define an <strong>Application<\/strong> (code + defaults) and execute it as a <strong>Run<\/strong>. 
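<\/p>

<p>To make the Application\/Run model concrete, the sketch below shows a minimal PySpark script of the kind you would upload to Object Storage and register as an Application. The bucket paths, column names, and argument order are illustrative assumptions, not service requirements:<\/p>

```python
# sales_etl.py -- a minimal PySpark application for OCI Data Flow (sketch).
# Paths arrive as run arguments, for example:
#   oci://dataflow-lab-input@<namespace>/sales.csv
#   oci://dataflow-lab-output@<namespace>/sales_parquet/
import sys

def main(input_path, output_path):
    # pyspark is provided by the Data Flow runtime; imported lazily so the
    # file can be read without a local Spark installation.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName('sales-etl').getOrCreate()
    df = (spark.read.option('header', 'true').csv(input_path)
          .withColumn('amount', F.col('amount').cast('double')))
    df.filter(F.col('amount') > 0).write.mode('overwrite').parquet(output_path)
    spark.stop()

# Guarded so nothing executes unless both arguments are supplied at runtime.
if __name__ == '__main__' and len(sys.argv) > 2:
    main(sys.argv[1], sys.argv[2])
```

<p>A Run then supplies the two paths as arguments, so one Application can serve dev and prod buckets unchanged.<\/p>

<p>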
The service integrates with OCI Identity and Access Management (IAM), Object Storage, Logging\/Monitoring, and (optionally) private networking so Spark can read and write data at scale securely.<\/p>\n\n\n\n<p>The core problem Data Flow solves is operational overhead and time-to-value: teams want Spark for distributed processing, but do not want to build and manage Spark clusters (capacity planning, upgrades, failures, autoscaling, and tuning) for intermittent or bursty workloads.<\/p>\n\n\n\n<blockquote>\n<p>Service status\/naming: <strong>Data Flow<\/strong> is the current, active OCI service name. It is commonly referred to in documentation as <strong>OCI Data Flow<\/strong> or <strong>Oracle Cloud Infrastructure Data Flow<\/strong>. Verify the latest features, supported Spark versions, and limits in the official docs linked in the Resources section.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Data Flow?<\/h2>\n\n\n\n<p><strong>Official purpose (scope):<\/strong> Data Flow is OCI\u2019s serverless service for running <strong>Apache Spark<\/strong> applications to process data at scale. 
It is positioned in <strong>Data Management<\/strong> because it performs distributed transformation\/ETL, analytics preparation, and batch processing over large datasets.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Run <strong>Spark applications<\/strong> without managing a Spark cluster.<\/li>\n<li>Execute jobs on-demand and scale resources to match workload needs (within OCI service limits).<\/li>\n<li>Use OCI-native integrations for:\n<ul>\n<li><strong>Object Storage<\/strong> as a common data lake layer (inputs, outputs, logs).<\/li>\n<li><strong>IAM<\/strong> for secure access to OCI resources.<\/li>\n<li><strong>Monitoring\/Logging\/Audit<\/strong> for operations and governance.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (mental model)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Application<\/strong>: A reusable definition that points to Spark code (for example, a Python file, JAR, or other Spark artifact) plus defaults such as arguments and log locations.<\/li>\n<li><strong>Run<\/strong>: An execution of an Application with a specific set of parameters, resource sizing, and runtime behavior. Runs produce logs and outputs.<\/li>\n<li><strong>Artifacts &amp; dependencies<\/strong>: Files in Object Storage (Spark scripts, JARs, configuration, dependency archives) referenced at runtime.<\/li>\n<li><strong>Networking attachment (optional)<\/strong>: A configuration that lets your Spark run access private OCI resources in a VCN (for example, private databases or private endpoints).<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>Terminology and exact resource names can evolve. 
Use the Data Flow documentation for the definitive list of resource types and fields.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Service type and scope<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Service type:<\/strong> Managed serverless big data processing (Apache Spark).<\/li>\n<li><strong>Scope:<\/strong> Data Flow resources (applications\/runs) are typically created within an <strong>OCI region<\/strong> and <strong>compartment<\/strong> in your tenancy.<\/li>\n<li><strong>How it fits the Oracle Cloud ecosystem:<\/strong>\n<ul>\n<li>Often paired with <strong>Object Storage<\/strong> (data lake), <strong>Autonomous Database<\/strong> (serving\/warehouse), <strong>Data Integration<\/strong> (orchestration), and <strong>Data Catalog<\/strong> (governance).<\/li>\n<li>Used by platform teams to provide a standardized, secure Spark execution environment without persistent clusters.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Data Flow?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Faster time to deliver ETL and analytics pipelines (no cluster procurement and lifecycle management).<\/li>\n<li>Pay for execution rather than owning always-on infrastructure for batch processing.<\/li>\n<li>Standardize data processing across teams with reusable applications and policies.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark is a mature ecosystem for distributed compute (SQL, DataFrames, ML pipelines).<\/li>\n<li>Serverless execution is ideal for <strong>burst workloads<\/strong> (daily ETL, periodic backfills, event-driven batch).<\/li>\n<li>Strong fit for <strong>data lake<\/strong> patterns using Object Storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reduced operational burden (no manual scaling, patching, cluster recovery).<\/li>\n<li>Centralized observability through OCI logging\/monitoring primitives.<\/li>\n<li>Easier environment consistency: packaged code + defined runtime behavior.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>IAM-driven access to Object Storage and other OCI resources.<\/li>\n<li>Encryption at rest (OCI-managed) for stored artifacts and logs in Object Storage.<\/li>\n<li>Auditability through OCI <strong>Audit<\/strong> and policy controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Parallel processing across distributed executors for large datasets.<\/li>\n<li>Ability to tune Spark configuration per job (within supported parameters).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Data Flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You want Apache 
Spark processing but don\u2019t want to run Spark infrastructure.<\/li>\n<li>Workloads are batch or micro-batch and can tolerate job startup time.<\/li>\n<li>Data is already in Object Storage or can be staged there.<\/li>\n<li>You want a managed service integrated with OCI IAM and compartments.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose Data Flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You need <strong>always-on, low-latency<\/strong> interactive Spark clusters (consider managed cluster services instead).<\/li>\n<li>Workloads are highly stateful streaming with tight SLAs (evaluate dedicated streaming + processing designs).<\/li>\n<li>You require deep OS-level control or custom cluster daemons (serverless constraints may apply).<\/li>\n<li>You depend on specific Spark versions or system libraries not supported (verify supported runtime versions in official docs).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Data Flow used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Financial services: batch risk calculations, AML feature generation, regulatory reporting prep.<\/li>\n<li>Retail\/e-commerce: clickstream\/sessionization, product catalog enrichment, demand analytics.<\/li>\n<li>Healthcare\/life sciences: de-identification transforms, cohort preparation, claims analytics.<\/li>\n<li>Telecom: CDR aggregation, churn features, network performance batch analytics.<\/li>\n<li>Manufacturing\/IoT: sensor data aggregation, quality analytics, predictive maintenance feature prep.<\/li>\n<li>Public sector: periodic data integration, analytics datasets, open data publishing.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data engineering teams building ETL\/ELT pipelines.<\/li>\n<li>Analytics engineering teams preparing curated datasets.<\/li>\n<li>Platform teams offering \u201cSpark as a service\u201d internally.<\/li>\n<li>DevOps\/SRE teams standardizing deployment, IAM, network controls, and monitoring.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads and architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data lake \u2192 curated zone transformations (bronze\/silver\/gold).<\/li>\n<li>Batch enrichment joining multiple datasets at scale.<\/li>\n<li>Periodic reprocessing\/backfills.<\/li>\n<li>Extract and staging jobs feeding a warehouse or serving store.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production:<\/strong> scheduled ETL pipelines, controlled IAM policies, private network access to databases, rigorous cost controls, logging retention policies.<\/li>\n<li><strong>Dev\/Test:<\/strong> ad-hoc dataset exploration, smaller job sizes, ephemeral runs, sandbox compartments.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" 
\/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic scenarios where <strong>Oracle Cloud Data Flow<\/strong> is commonly a strong fit.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Data lake ETL (CSV\/JSON \u2192 Parquet)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Raw data is slow to query and expensive to process repeatedly.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark transforms at scale and writes columnar formats to Object Storage.<\/li>\n<li><strong>Scenario:<\/strong> Convert daily CSV drops into partitioned Parquet in a curated bucket.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Daily incremental aggregations (fact tables)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> BI dashboards need daily rollups across large event datasets.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark group-bys and window functions scale across partitions.<\/li>\n<li><strong>Scenario:<\/strong> Aggregate daily sessions and revenue metrics for a warehouse load.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Backfill\/reprocessing after logic changes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A new business rule requires reprocessing months of data.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Burst compute without long-lived cluster costs.<\/li>\n<li><strong>Scenario:<\/strong> Recompute \u201ccustomer lifetime value\u201d features for the last 365 days.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Data quality checks at scale<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> You need to validate schema, null rates, and referential integrity across large tables\/files.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark can compute validation metrics quickly and write reports.<\/li>\n<li><strong>Scenario:<\/strong> Generate data quality scorecards and store results 
in Object Storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Feature engineering for ML<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> ML requires large feature tables created from raw logs and reference data.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark joins and feature transforms are a standard pattern.<\/li>\n<li><strong>Scenario:<\/strong> Build user-level feature vectors weekly for model training.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Log\/event enrichment and sessionization<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Raw logs need IP-to-geo lookup, bot filtering, and session windows.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark window functions and UDFs handle enrichment at scale.<\/li>\n<li><strong>Scenario:<\/strong> Produce session-level datasets from clickstream events in Object Storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Data masking or tokenization batch transforms<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Sensitive fields must be masked before analytics access.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark transformations can apply masking consistently across large data volumes.<\/li>\n<li><strong>Scenario:<\/strong> Replace PII fields with irreversible hashes in curated datasets.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Export from databases to Object Storage (staging for analytics)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Teams want to stage data extracts into a lake for processing.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark can read from JDBC sources (with network\/IAM set correctly) and write to Object Storage.<\/li>\n<li><strong>Scenario:<\/strong> Nightly export and transform from a database into partitioned lake data.<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>JDBC sources and database connectivity require correct networking 
and credentials handling. Verify supported patterns in official docs.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">9) Multi-dataset joins for master data enrichment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Customer\/product master data comes from multiple systems with different identifiers.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark scales joins and deduplication for large records.<\/li>\n<li><strong>Scenario:<\/strong> Build a golden customer dimension dataset.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Cost-optimized batch compute for intermittent jobs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Cluster-based solutions are underutilized most of the time.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Serverless model matches intermittent processing patterns.<\/li>\n<li><strong>Scenario:<\/strong> Run 30-minute ETL jobs hourly without paying for idle cluster time.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) Curated dataset publishing to downstream teams<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Downstream teams need consistent, versioned datasets.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Spark can enforce schema, partitioning, and write consistent outputs.<\/li>\n<li><strong>Scenario:<\/strong> Publish a daily \u201canalytics-ready\u201d dataset to a shared Object Storage bucket.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Compliance reporting dataset preparation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Regulatory reporting needs consistent transformations and auditability.<\/li>\n<li><strong>Why Data Flow fits:<\/strong> Repeatable runs, logs, and IAM policies help enforce governance.<\/li>\n<li><strong>Scenario:<\/strong> Produce monthly compliance extracts with traceable job runs and outputs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 
class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Feature availability can vary by region and by current OCI release. Verify the latest features and supported Spark versions in the official Data Flow documentation.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Serverless Apache Spark execution<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Runs Spark jobs without requiring you to provision a cluster.<\/li>\n<li><strong>Why it matters:<\/strong> Removes cluster lifecycle tasks (creation, scaling, patching).<\/li>\n<li><strong>Practical benefit:<\/strong> Faster onboarding; focus on data logic and performance tuning.<\/li>\n<li><strong>Caveats:<\/strong> There can be job startup overhead compared to warm, always-on clusters.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Applications and Runs (repeatable job definitions)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Separates a reusable job definition (<strong>Application<\/strong>) from each execution (<strong>Run<\/strong>).<\/li>\n<li><strong>Why it matters:<\/strong> Encourages standardized, repeatable pipelines and parameterization.<\/li>\n<li><strong>Practical benefit:<\/strong> Use one application for multiple environments\/inputs via runtime arguments.<\/li>\n<li><strong>Caveats:<\/strong> You must manage versioning of code artifacts in Object Storage (for example, via paths or object versioning).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Object Storage integration (data, artifacts, logs)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Uses OCI Object Storage as a common location for input data, output data, code artifacts, and logs.<\/li>\n<li><strong>Why it matters:<\/strong> Object Storage is durable, scalable, and commonly used as a data lake foundation.<\/li>\n<li><strong>Practical benefit:<\/strong> Simple and cost-effective storage layer for large 
datasets.<\/li>\n<li><strong>Caveats:<\/strong> Costs accrue for stored data, requests, and any cross-region data transfer.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM integration and \u201cresource principal\u201d-style access patterns<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Enables Data Flow runs to access OCI resources via IAM policies (commonly through dynamic groups\/policies).<\/li>\n<li><strong>Why it matters:<\/strong> Avoids embedding user API keys in jobs; supports least-privilege.<\/li>\n<li><strong>Practical benefit:<\/strong> Safer automation with compartment-scoped policies.<\/li>\n<li><strong>Caveats:<\/strong> Misconfigured dynamic groups\/policies are a top cause of failures (403 access to Object Storage).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Parameterized runs and Spark configuration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Allows passing arguments and Spark properties per run.<\/li>\n<li><strong>Why it matters:<\/strong> Supports multiple datasets and workload profiles with one codebase.<\/li>\n<li><strong>Practical benefit:<\/strong> Same job can run daily increments or full backfills by changing parameters.<\/li>\n<li><strong>Caveats:<\/strong> Not every Spark property may be supported or safe to override; verify allowed configurations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Observability: logs, run status, and operational visibility<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Provides run state tracking and access to application logs (commonly stored in Object Storage and visible via console).<\/li>\n<li><strong>Why it matters:<\/strong> Batch pipelines need reliable debugging, auditability, and operational support.<\/li>\n<li><strong>Practical benefit:<\/strong> Standard workflow to investigate failures and performance issues.<\/li>\n<li><strong>Caveats:<\/strong> Log retention is usually your 
responsibility (Object Storage lifecycle policies).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Optional private networking for data sources in a VCN<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Allows Spark jobs to access private endpoints (for example, private databases) through VCN configuration.<\/li>\n<li><strong>Why it matters:<\/strong> Enterprises often restrict data sources to private networks.<\/li>\n<li><strong>Practical benefit:<\/strong> Secure connectivity without exposing databases publicly.<\/li>\n<li><strong>Caveats:<\/strong> Requires correct subnets, route tables, security lists\/NSGs, and policies. Misconfigurations are common.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level service architecture<\/h3>\n\n\n\n<p>At a high level, Data Flow:\n1. Reads your <strong>Application<\/strong> definition (code location, default arguments, logging configuration).\n2. Allocates Spark driver and executors for the <strong>Run<\/strong> (serverless).\n3. Uses OCI IAM authorization for access to Object Storage and other OCI resources.\n4. Writes logs (and your output datasets) back to Object Storage.\n5. Exposes run state in the Console\/CLI\/SDK and emits operational telemetry through OCI observability services (verify exact metrics\/events in docs).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Control flow vs data flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane:<\/strong> You create applications and submit runs via OCI Console\/CLI\/SDK. IAM controls who can submit and manage runs.<\/li>\n<li><strong>Data plane:<\/strong> Spark reads\/writes datasets (commonly in Object Storage). 
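<\/li>
<\/ul>

<p>Control-plane calls are ordinary OCI API calls. As one illustration (the Console and CLI work equally well), a Run for an existing Application can be started with the OCI Python SDK; the OCIDs and display name below are placeholders you would supply from your tenancy:<\/p>

```python
# Sketch: start a Data Flow Run for an existing Application with the
# OCI Python SDK (pip install oci). OCIDs below are placeholders.
def submit_run(application_ocid, compartment_ocid, arguments):
    import oci  # imported lazily so the sketch is readable without the SDK

    client = oci.data_flow.DataFlowClient(oci.config.from_file())
    details = oci.data_flow.models.CreateRunDetails(
        application_id=application_ocid,
        compartment_id=compartment_ocid,
        display_name='sales-etl-manual-run',
        arguments=arguments,  # e.g. ['oci://in@ns/sales.csv', 'oci://out@ns/pq/']
    )
    run = client.create_run(details).data
    return run.id  # poll client.get_run(run.id) for the lifecycle state
```

<p>Automation typically polls <code>get_run<\/code> until the run reaches a terminal state before triggering downstream steps.<\/p>

<ul class=\"wp-block-list\">
<li>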
Networking rules determine whether it can reach private resources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related OCI services<\/h3>\n\n\n\n<p>Common integrations include:\n&#8211; <strong>OCI Object Storage<\/strong>: primary lake storage and log storage.\n&#8211; <strong>OCI IAM<\/strong>: dynamic groups\/policies for service-run access.\n&#8211; <strong>OCI Monitoring\/Logging<\/strong>: operational telemetry and logs (verify exact integration steps for your region).\n&#8211; <strong>OCI Events<\/strong>: react to run state changes (for example, trigger downstream steps).\n&#8211; <strong>OCI Vault<\/strong>: store secrets for external systems (recommended pattern; verify supported injection patterns for your jobs).\n&#8211; <strong>Databases (Autonomous Database \/ other DBs)<\/strong>: typically via JDBC, requiring networking and secure credentials handling.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p>A realistic Data Flow deployment nearly always depends on:\n&#8211; Object Storage (code + data + logs)\n&#8211; IAM policies (users + dynamic groups)\n&#8211; Optionally VCN\/subnets (if connecting to private resources)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model (practical summary)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Human users authenticate via OCI IAM and are authorized by policies to manage Data Flow resources.<\/li>\n<li>Data Flow runs commonly access Object Storage using OCI service identity mechanisms authorized by dynamic group policies (do not hardcode keys).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If your job only reads\/writes Object Storage, it may not require custom networking (depending on your tenancy policies and regional defaults).<\/li>\n<li>If your job needs to reach private endpoints in a VCN, configure Data Flow networking appropriately (subnets, routing, 
security lists\/NSGs, DNS).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure logs are captured to a controlled bucket with retention and lifecycle policies.<\/li>\n<li>Tag applications\/runs for cost allocation and governance.<\/li>\n<li>Track quotas\/service limits to avoid failed submissions under load.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  Dev[Engineer \/ CI Pipeline] --&gt;|Submit Run| DF[OCI Data Flow]\n  DF --&gt;|Read code + input| OS[(OCI Object Storage)]\n  DF --&gt;|Write output| OS\n  DF --&gt;|Write logs| LOGS[(Object Storage Logs Bucket)]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph Tenancy[OCI Tenancy]\n    subgraph Comp[Compartment: data-platform]\n      CICD[CI\/CD Pipeline] --&gt; IAM[OCI IAM Policies]\n      CICD --&gt;|Create\/Update Application| DFApp[Data Flow Application]\n      CICD --&gt;|Start Run| DFRun[Data Flow Run]\n\n      DFRun --&gt;|Read artifacts| Art[(Object Storage: artifacts bucket)]\n      DFRun --&gt;|Read raw data| Raw[(Object Storage: raw bucket)]\n      DFRun --&gt;|Write curated data| Cur[(Object Storage: curated bucket)]\n      DFRun --&gt;|Write logs| Logs[(Object Storage: logs bucket)]\n\n      DFRun --&gt; Mon[OCI Monitoring\/Alarms]\n      DFRun --&gt; Aud[OCI Audit]\n      DFRun --&gt; Evt[OCI Events]\n    end\n\n    subgraph VCN[\"VCN (optional)\"]\n      DFRun --&gt;|Private access| DB[(Private Database Endpoint)]\n    end\n  end\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. 
Prerequisites<\/h2>\n\n\n\n<p>Before starting the hands-on lab, you need:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">OCI tenancy and compartment<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An active <strong>Oracle Cloud<\/strong> (OCI) tenancy.<\/li>\n<li>A <strong>compartment<\/strong> where you can create:<\/li>\n<li>Object Storage buckets\/objects<\/li>\n<li>Data Flow applications and runs<\/li>\n<li>IAM policies\/dynamic groups (if you manage IAM)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM<\/h3>\n\n\n\n<p>At minimum:\n&#8211; Permissions to manage <strong>Data Flow<\/strong> resources in your compartment.\n&#8211; Permissions to manage or use <strong>Object Storage<\/strong> buckets in your compartment.\n&#8211; For the run-time identity (Data Flow run) to access buckets, you typically configure:\n  &#8211; A <strong>dynamic group<\/strong>\n  &#8211; Policies that allow that dynamic group to read\/write objects in the relevant compartments\/buckets<\/p>\n\n\n\n<blockquote>\n<p>IAM policy syntax and resource types are exacting. Follow the official Data Flow IAM documentation and verify the latest recommended policies.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Flow is usage-billed. Ensure your tenancy has a valid payment method or credits.<\/li>\n<li>Additional costs may come from Object Storage, logging retention, and data transfer.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tools (optional but recommended)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OCI Console access (required for the easiest start).<\/li>\n<li><strong>OCI CLI<\/strong> (optional) for scripting and repeatable automation. Install guide:\n  https:\/\/docs.oracle.com\/en-us\/iaas\/Content\/API\/SDKDocs\/cliinstall.htm<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Flow is region-specific. 
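<\/li>
<\/ul>

<p>Once the CLI or SDK is configured, you can sanity-check authentication and bucket access before starting the lab. This optional helper assumes a standard <code>~\/.oci\/config<\/code> profile and the lab bucket names used later in this tutorial:<\/p>

```python
# Optional pre-flight check: confirm the OCI SDK can authenticate and see
# the lab buckets (bucket names match the hands-on lab; adjust as needed).
def preflight(bucket_names=('dataflow-lab-input',
                            'dataflow-lab-output',
                            'dataflow-lab-logs')):
    import oci  # pip install oci; reads ~/.oci/config by default

    client = oci.object_storage.ObjectStorageClient(oci.config.from_file())
    namespace = client.get_namespace().data
    missing = []
    for name in bucket_names:
        try:
            client.get_bucket(namespace, name)
        except oci.exceptions.ServiceError:
            missing.append(name)
    return namespace, missing  # an empty 'missing' list means all reachable
```

<ul class=\"wp-block-list\">
<li>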
Verify Data Flow availability in your target OCI region:<\/li>\n<li>OCI Regions list: https:\/\/www.oracle.com\/cloud\/public-cloud-regions\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Flow is governed by <strong>service limits<\/strong> (for example, concurrent runs, max resources per run). Check:<\/li>\n<li>OCI Service Limits overview: https:\/\/docs.oracle.com\/en-us\/iaas\/Content\/General\/Concepts\/servicelimits.htm<\/li>\n<li>Data Flow-specific limits: verify in the Data Flow docs and the Limits page in the Console for your tenancy.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Object Storage<\/strong> bucket(s) for input and output.<\/li>\n<li>A log bucket for Data Flow run logs (recommended to separate logs from data).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<blockquote>\n<p>Pricing varies by region and is subject to change. 
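<\/p>
<\/blockquote>

<p>Because billing is effectively resource-hours, a simple model is enough for rough planning. The rates below are deliberately fake placeholders, not Oracle prices; substitute current regional rates from the official price list:<\/p>

```python
# Illustrative run-cost model: resources x duration x rate.
# RATES ARE PLACEHOLDERS, NOT ORACLE PRICES -- use the official price list.
def estimate_run_cost(total_ocpus, total_memory_gb, hours,
                      rate_per_ocpu_hour, rate_per_gb_hour):
    compute = total_ocpus * hours * rate_per_ocpu_hour
    memory = total_memory_gb * hours * rate_per_gb_hour
    return round(compute + memory, 4)

# Example sizing: 1 driver (1 OCPU, 16 GB) + 4 executors (2 OCPU, 32 GB each)
ocpus = 1 + 4 * 2          # 9 OCPUs
memory_gb = 16 + 4 * 32    # 144 GB
cost = estimate_run_cost(ocpus, memory_gb, hours=0.5,
                         rate_per_ocpu_hour=0.05, rate_per_gb_hour=0.005)
```

<p>Re-running the model with different executor counts or runtimes shows how tuning (shorter runs, fewer executors) maps directly to cost.<\/p>

<blockquote>
<p>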
Use official pricing pages and the OCI cost estimator for authoritative numbers.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Current pricing model (how you are billed)<\/h3>\n\n\n\n<p>Data Flow is typically billed based on the compute resources used during a run, such as:\n&#8211; <strong>OCPU consumption<\/strong> (driver + executors)\n&#8211; <strong>Memory consumption<\/strong>\n&#8211; <strong>Run duration<\/strong> (how long resources are allocated)<\/p>\n\n\n\n<p>In addition, you pay for:\n&#8211; <strong>Object Storage<\/strong> (input, output, artifacts, logs)\n&#8211; Requests and retrieval patterns (depending on storage tier)\n&#8211; <strong>Network egress<\/strong> (for data leaving a region or to the public internet), if applicable<\/p>\n\n\n\n<p>Official starting points:\n&#8211; OCI Price List: https:\/\/www.oracle.com\/cloud\/price-list\/\n&#8211; OCI Cost Estimator: https:\/\/www.oracle.com\/cloud\/costestimator.html (or the current OCI cost estimator page; verify if URL changes)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions to understand<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Job sizing:<\/strong> driver\/executor shape and count (or equivalent resource selection)<\/li>\n<li><strong>Runtime:<\/strong> longer runs cost more; performance tuning can reduce cost<\/li>\n<li><strong>Data volume:<\/strong> large reads\/writes and shuffle-heavy workloads can increase runtime<\/li>\n<li><strong>Storage footprint:<\/strong> curated datasets (Parquet) may reduce long-term costs vs raw formats but increase total stored data if you keep multiple versions<\/li>\n<li><strong>Logging volume:<\/strong> verbose Spark logs retained for long periods can become a noticeable storage cost<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier considerations<\/h3>\n\n\n\n<p>Oracle Cloud has an Always Free and Free Trial program, but eligibility and included services change over time.\n&#8211; Verify whether 
<strong>Data Flow<\/strong> has free-tier usage in your region\/tenancy:\n  https:\/\/www.oracle.com\/cloud\/free\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Object Storage logs<\/strong>: driver\/executor logs stored and retained.<\/li>\n<li><strong>Data transfer<\/strong>: cross-region reads\/writes or internet egress.<\/li>\n<li><strong>Downstream services<\/strong>: if you write into a database or trigger other jobs, those have their own costs.<\/li>\n<li><strong>Retries and failed runs<\/strong>: repeated failures can quietly accumulate cost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost (practical tactics)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Right-size resources: don\u2019t allocate more executors than the job can efficiently use.<\/li>\n<li>Use efficient formats: Parquet\/ORC and partitioning reduce scan costs and runtime.<\/li>\n<li>Avoid small files: compact outputs to reduce overhead.<\/li>\n<li>Tune Spark:\n<ul>\n<li>Partition sizes<\/li>\n<li>Broadcast joins (when safe)<\/li>\n<li>Avoid wide shuffles where possible<\/li>\n<\/ul>\n<\/li>\n<li>Use Object Storage lifecycle policies for logs and intermediate datasets.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (model, not numbers)<\/h3>\n\n\n\n<p>A small lab run typically involves:\n&#8211; A short duration run (minutes)\n&#8211; Minimal driver\/executor resources\n&#8211; A few MB to GB of Object Storage<\/p>\n\n\n\n<p>To estimate:\n1. Determine driver\/executor sizing and expected runtime.\n2. Multiply resource-hours by regional Data Flow rates.\n3. 
Add Object Storage GB-month and request costs (usually small for labs).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>For a daily ETL pipeline:\n&#8211; The dominant cost is often compute time (resource-hours) plus storage growth.\n&#8211; Costs scale with:\n  &#8211; Input volume (GB\/TB per day)\n  &#8211; Shuffle intensity (joins, window functions)\n  &#8211; Output retention (how many days\/months of curated data you retain)\n&#8211; Consider building a FinOps model:\n  &#8211; Cost per dataset per day\n  &#8211; Cost per pipeline run\n  &#8211; Cost per TB processed<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>This lab runs a real PySpark ETL job on <strong>Oracle Cloud Data Flow<\/strong>:\n&#8211; Read a CSV from OCI Object Storage\n&#8211; Transform it (simple derived column + filtering)\n&#8211; Write Parquet output back to Object Storage<\/p>\n\n\n\n<p>This is designed to be beginner-friendly and low-cost.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Deploy and run a serverless Spark job using <strong>Data Flow<\/strong> that transforms data in <strong>Object Storage<\/strong>, with correct IAM permissions, logs, validation, and cleanup.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. Create buckets for input, output, and logs.\n2. Upload a sample CSV and a PySpark script.\n3. Configure IAM so Data Flow runs can read\/write Object Storage.\n4. Create a Data Flow Application referencing the script in Object Storage.\n5. Start a Run and monitor it.\n6. Validate outputs and review logs.\n7. Clean up all resources.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create Object Storage buckets (input, output, logs)<\/h3>\n\n\n\n<p><strong>In the OCI Console:<\/strong>\n1. 
Go to <strong>Storage \u2192 Object Storage &amp; Archive Storage \u2192 Buckets<\/strong>\n2. Select your compartment (use a dedicated lab compartment if possible).\n3. Create three buckets (bucket names must be unique within your Object Storage namespace):\n   &#8211; <code>dataflow-lab-input<\/code>\n   &#8211; <code>dataflow-lab-output<\/code>\n   &#8211; <code>dataflow-lab-logs<\/code><\/p>\n\n\n\n<p>Recommended settings:\n&#8211; Use the default storage tier unless you have specific requirements.\n&#8211; Consider enabling bucket-level encryption defaults (OCI manages encryption by default; verify your security posture requirements).<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; Three buckets exist in your compartment and region.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Open each bucket and confirm it is empty and accessible.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Upload sample input data<\/h3>\n\n\n\n<p>Create a small local file <code>sales.csv<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-csv\">order_id,customer_id,amount,region\n1001,C001,120.50,APAC\n1002,C002,19.99,EMEA\n1003,C001,220.00,APAC\n1004,C003,5.00,NA\n1005,C004,89.00,EMEA\n<\/code><\/pre>\n\n\n\n<p>Upload it:\n1. Open bucket <code>dataflow-lab-input<\/code>\n2. Click <strong>Upload<\/strong>\n3. 
Upload <code>sales.csv<\/code> to the root of the bucket (or a folder like <code>raw\/<\/code>)<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; <code>sales.csv<\/code> is available in the input bucket.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Confirm the object appears in the bucket object list.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create a PySpark ETL script and upload it<\/h3>\n\n\n\n<p>Create <code>etl_sales.py<\/code> locally:<\/p>\n\n\n\n<pre><code class=\"language-python\">import sys\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.functions import col, when\n\ndef main():\n    if len(sys.argv) != 3:\n        print(\"Usage: etl_sales.py &lt;input_path&gt; &lt;output_path&gt;\")\n        sys.exit(2)\n\n    input_path = sys.argv[1]\n    output_path = sys.argv[2]\n\n    spark = SparkSession.builder.appName(\"dataflow-sales-etl\").getOrCreate()\n\n    df = spark.read.option(\"header\", \"true\").option(\"inferSchema\", \"true\").csv(input_path)\n\n    # Simple transform:\n    # - Add a derived column \"amount_bucket\"\n    # - Filter out tiny orders\n    out = (\n        df.withColumn(\n            \"amount_bucket\",\n            when(col(\"amount\") &gt;= 100, \"high\")\n            .when(col(\"amount\") &gt;= 20, \"medium\")\n            .otherwise(\"low\")\n        )\n        .filter(col(\"amount\") &gt;= 10)\n    )\n\n    # Write Parquet output\n    out.write.mode(\"overwrite\").parquet(output_path)\n\n    spark.stop()\n\nif __name__ == \"__main__\":\n    main()\n<\/code><\/pre>\n\n\n\n<p>Upload the script to Object Storage:\n1. Open (or create) a folder\/prefix in <code>dataflow-lab-input<\/code> called <code>scripts\/<\/code>\n2. 
Upload <code>etl_sales.py<\/code> to <code>scripts\/etl_sales.py<\/code><\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; The PySpark script is stored in Object Storage.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Confirm you can see <code>scripts\/etl_sales.py<\/code> in the bucket.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Collect your Object Storage namespace and object URIs<\/h3>\n\n\n\n<p>Data Flow commonly references Object Storage using an OCI\/Hadoop connector URI format.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In Console, go to <strong>Profile \u2192 Tenancy: &lt;your tenancy&gt;<\/strong><\/li>\n<li>Find <strong>Object Storage Namespace<\/strong> (copy it)<\/li>\n<\/ol>\n\n\n\n<p>Construct URIs (adjust to your bucket names and paths). A common pattern is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Input CSV: <code>oci:\/\/dataflow-lab-input@&lt;namespace&gt;\/sales.csv<\/code><\/li>\n<li>Script: <code>oci:\/\/dataflow-lab-input@&lt;namespace&gt;\/scripts\/etl_sales.py<\/code><\/li>\n<li>Output folder (prefix): <code>oci:\/\/dataflow-lab-output@&lt;namespace&gt;\/curated\/sales_parquet\/<\/code><\/li>\n<li>Logs bucket (configured in the application\/run settings in console)<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>URI schemes and exact formats are sensitive. 
If you encounter \u201cfile not found\u201d or permission errors, verify the correct URI format in the official Data Flow documentation for Object Storage access.<\/p>\n<\/blockquote>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; You have the namespace and correct URIs for script\/input\/output.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Configure IAM so Data Flow runs can read\/write Object Storage<\/h3>\n\n\n\n<p>This is the most important step for a successful run.<\/p>\n\n\n\n<p>Data Flow runs need permission to access:\n&#8211; Script artifact in <code>dataflow-lab-input<\/code>\n&#8211; Input data in <code>dataflow-lab-input<\/code>\n&#8211; Output location in <code>dataflow-lab-output<\/code>\n&#8211; Logs bucket <code>dataflow-lab-logs<\/code><\/p>\n\n\n\n<p>A common OCI pattern is:\n1. Create a <strong>Dynamic Group<\/strong> that matches Data Flow run resources.\n2. Create IAM <strong>policies<\/strong> granting that dynamic group access to Object Storage.<\/p>\n\n\n\n<p><strong>In the OCI Console:<\/strong>\n1. Go to <strong>Identity &amp; Security \u2192 Identity \u2192 Dynamic Groups<\/strong>\n2. Create a dynamic group (example name): <code>dg-dataflow-runs-lab<\/code><\/p>\n\n\n\n<p><strong>Matching rule<\/strong>\nThe exact rule depends on OCI\u2019s current resource type identifiers for Data Flow runs. A commonly used rule pattern matches the resource type and compartment. <strong>Verify the correct rule in the official Data Flow IAM documentation<\/strong>.<\/p>\n\n\n\n<p>Example pattern (verify before using):\n&#8211; Match Data Flow runs in a compartment:\n  &#8211; <code>ALL {resource.type = 'dataflowrun', resource.compartment.id = '&lt;your_compartment_ocid&gt;'}<\/code><\/p>\n\n\n\n<p><strong>Create a policy<\/strong>\n1. Go to <strong>Identity &amp; Security \u2192 Identity \u2192 Policies<\/strong>\n2. 
Create a policy in your compartment (or tenancy, depending on your org model)\n3. Add statements that allow the dynamic group to access Object Storage.<\/p>\n\n\n\n<p>Example policy statements (scope down to your compartment; verify in docs):\n&#8211; Allow listing buckets:\n  &#8211; <code>Allow dynamic-group dg-dataflow-runs-lab to read buckets in compartment &lt;compartment-name&gt;<\/code>\n&#8211; Allow read\/write objects:\n  &#8211; <code>Allow dynamic-group dg-dataflow-runs-lab to manage objects in compartment &lt;compartment-name&gt;<\/code><\/p>\n\n\n\n<p>If you want to be stricter, use bucket-specific policies where supported, or separate compartments for logs\/data.<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; Data Flow runs are authorized to access buckets\/objects needed for the job.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; There\u2019s no direct \u201ctest\u201d button; validation happens when the run starts.\n&#8211; If the run fails with 403\/authorization errors, return here first.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Create a Data Flow Application<\/h3>\n\n\n\n<p><strong>In the OCI Console:<\/strong>\n1. Go to <strong>Analytics &amp; AI \u2192 Data Flow<\/strong>\n2. Select your compartment.\n3. Click <strong>Create application<\/strong>.<\/p>\n\n\n\n<p>Fill in key fields (names vary slightly across console versions):\n&#8211; <strong>Name:<\/strong> <code>dataflow-sales-etl<\/code>\n&#8211; <strong>Application type:<\/strong> Spark application (serverless Spark)\n&#8211; <strong>Language\/framework:<\/strong> PySpark (Python)\n&#8211; <strong>Main file URI:<\/strong> the Object Storage URI to <code>etl_sales.py<\/code>\n&#8211; <strong>Arguments:<\/strong> add two arguments:\n  1. Input CSV URI<br\/>\n  2. 
Output Parquet URI<br\/>\n  Example:\n  &#8211; <code>oci:\/\/dataflow-lab-input@&lt;namespace&gt;\/sales.csv<\/code>\n  &#8211; <code>oci:\/\/dataflow-lab-output@&lt;namespace&gt;\/curated\/sales_parquet\/<\/code><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Logs:<\/strong> choose the <code>dataflow-lab-logs<\/code> bucket (or the console\u2019s log destination option)<\/li>\n<\/ul>\n\n\n\n<p>Keep resource sizing modest for a lab. Choose the smallest supported configuration that can run Spark (the console usually provides presets).<\/p>\n\n\n\n<blockquote>\n<p>Exact sizing knobs and defaults can change. Start small, then scale if needed.<\/p>\n<\/blockquote>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; A Data Flow Application is created and visible in the Data Flow Applications list.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Open the application details and confirm:\n  &#8211; Main file URI points to your script\n  &#8211; Arguments are set correctly\n  &#8211; Logs destination is configured<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Start a Run<\/h3>\n\n\n\n<p>From the application page:\n1. Click <strong>Run<\/strong>\n2. 
Confirm or override:\n   &#8211; Arguments (input\/output URIs)\n   &#8211; Resource sizing (keep small for lab)\n   &#8211; Logs bucket\n   &#8211; Networking (leave default unless you need VCN access)<\/p>\n\n\n\n<p>Start the run.<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; A Run is created with state such as \u201cAccepted\/Starting\/Running\u201d.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Open the Run details page:\n  &#8211; Confirm state transitions toward Running\n  &#8211; Confirm it eventually reaches <strong>Succeeded<\/strong> (ideal) or <strong>Failed<\/strong> (troubleshoot)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: Validate output data in Object Storage<\/h3>\n\n\n\n<p>After the run succeeds:\n1. Open bucket <code>dataflow-lab-output<\/code>\n2. Navigate to <code>curated\/sales_parquet\/<\/code><\/p>\n\n\n\n<p>You should see Spark output files such as:\n&#8211; One or more <code>part-...snappy.parquet<\/code> files\n&#8211; <code>_SUCCESS<\/code> marker file (common for Spark jobs)<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; Parquet output is present in the output bucket.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Confirm <code>_SUCCESS<\/code> exists.\n&#8211; Confirm the output folder contains Parquet part files.<\/p>\n\n\n\n<p>Optional validation:\n&#8211; Download a Parquet file and inspect it locally with a Parquet viewer\/library.\n&#8211; Or run a small Spark\/duckdb job locally to read the Parquet.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 9: Review logs<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open bucket <code>dataflow-lab-logs<\/code><\/li>\n<li>Find the log prefix for your run (the console often provides a direct link from the run details)<\/li>\n<\/ol>\n\n\n\n<p>Review:\n&#8211; Driver logs (most useful for application errors)\n&#8211; Executor logs (for distributed task 
issues)<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>\n&#8211; Logs are stored and accessible for debugging\/audit.<\/p>\n\n\n\n<p><strong>Verification<\/strong>\n&#8211; Confirm you can read the driver log and see messages like reading CSV and writing output.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>Use this checklist:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Application exists and references the correct <code>etl_sales.py<\/code> URI.<\/li>\n<li>Run status is <strong>Succeeded<\/strong>.<\/li>\n<li>Output bucket contains:\n<ul>\n<li><code>_SUCCESS<\/code><\/li>\n<li>Parquet part files<\/li>\n<\/ul>\n<\/li>\n<li>Logs bucket contains logs for the run.<\/li>\n<li>Data looks correct:\n<ul>\n<li>Orders with <code>amount &lt; 10<\/code> are filtered out (order <code>1004<\/code> should be removed)<\/li>\n<li><code>amount_bucket<\/code> exists with values <code>low<\/code> \/ <code>medium<\/code> \/ <code>high<\/code> (order <code>1002<\/code> at 19.99 is kept and bucketed <code>low<\/code>)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p>Common errors and fixes:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>403 \/ NotAuthorizedOrNotFound \/ AccessDenied when reading script or input<\/strong>\n   &#8211; Cause: Missing IAM dynamic group\/policy for Data Flow runs.\n   &#8211; Fix:<\/p>\n<ul>\n<li>Re-check dynamic group matching rule (resource type and compartment).<\/li>\n<li>Ensure policy allows reading objects in the compartment\/bucket.<\/li>\n<li>Confirm the run is in the same compartment you scoped.<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>FileNotFoundException \/ Path does not exist<\/strong>\n   &#8211; Cause: Incorrect Object Storage URI format or wrong namespace\/bucket\/path.\n   &#8211; Fix:<\/p>\n<ul>\n<li>Re-copy namespace from tenancy details.<\/li>\n<li>Confirm object path and letter casing.<\/li>\n<li>Verify the correct URI format in official docs.<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Run 
stays in Starting too long<\/strong>\n   &#8211; Cause: Capacity constraints, service limits, or transient OCI issues.\n   &#8211; Fix:<\/p>\n<ul>\n<li>Wait a few minutes.<\/li>\n<li>Check service limits and quotas.<\/li>\n<li>Try smaller resource sizing or a different region (if possible).<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Job failed due to memory errors or executor failures<\/strong>\n   &#8211; Cause: Insufficient resources or inefficient transformations.\n   &#8211; Fix:<\/p>\n<ul>\n<li>Increase executor memory\/resources.<\/li>\n<li>Reduce shuffle (optimize joins\/partitions).<\/li>\n<li>Start by testing on a smaller dataset.<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>No output written<\/strong>\n   &#8211; Cause: Output URI points to a file instead of a folder prefix, or permission issues on output bucket.\n   &#8211; Fix:<\/p>\n<ul>\n<li>Ensure output path ends with a folder-like prefix (e.g., <code>\/curated\/sales_parquet\/<\/code>).<\/li>\n<li>Confirm Data Flow run has write permissions to the output bucket.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid ongoing costs and clutter:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Stop\/delete runs<\/strong>\n   &#8211; Completed runs stop consuming compute, but logs remain in Object Storage.\n   &#8211; If a run is still running, terminate it from the run details page.<\/p>\n<\/li>\n<li>\n<p><strong>Delete the Data Flow application<\/strong>\n   &#8211; Data Flow \u2192 Applications \u2192 delete <code>dataflow-sales-etl<\/code><\/p>\n<\/li>\n<li>\n<p><strong>Delete Object Storage objects and buckets<\/strong>\n   &#8211; Delete objects in:<\/p>\n<ul>\n<li><code>dataflow-lab-input<\/code><\/li>\n<li><code>dataflow-lab-output<\/code><\/li>\n<li><code>dataflow-lab-logs<\/code><\/li>\n<li>Then delete the buckets (buckets must be empty first).<\/li>\n<\/ul>\n<\/li>\n<li>\n<p><strong>Remove IAM policy 
and dynamic group (if created solely for this lab)<\/strong>\n   &#8211; Delete the policy statements and dynamic group <code>dg-dataflow-runs-lab<\/code><\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Object Storage as the durable lake layer and write curated datasets in efficient columnar formats (Parquet\/ORC).<\/li>\n<li>Separate buckets or prefixes by zone:<\/li>\n<li><code>raw\/<\/code>, <code>staged\/<\/code>, <code>curated\/<\/code>, <code>logs\/<\/code><\/li>\n<li>Parameterize your jobs (input\/output prefixes, processing dates) so the same application can handle multiple runs.<\/li>\n<li>For production pipelines, design idempotent outputs:<\/li>\n<li>Write to a run-specific path and atomically update \u201clatest\u201d pointers, or use overwrite carefully.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer dynamic groups + least-privilege policies for run-time access.<\/li>\n<li>Separate duties:<\/li>\n<li>Developers can manage applications\/runs<\/li>\n<li>Security\/platform team controls IAM policies and compartments<\/li>\n<li>Restrict write access to curated zones (only pipeline identities should write).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with small sizing; scale based on measured runtime and bottlenecks.<\/li>\n<li>Optimize Spark jobs to reduce runtime:<\/li>\n<li>Partitioning and file sizing<\/li>\n<li>Avoid repeated scans<\/li>\n<li>Apply Object Storage lifecycle policies to:<\/li>\n<li>Delete old logs after a retention period<\/li>\n<li>Archive or delete intermediate datasets<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Reduce small files:<\/li>\n<li>Coalesce\/repartition before writing output<\/li>\n<li>Use partitioning aligned to query patterns:<\/li>\n<li>Example: partition by <code>date=YYYY-MM-DD<\/code> for time-series datasets<\/li>\n<li>Be cautious with <code>inferSchema<\/code> on huge CSVs\u2014consider explicit schemas.<\/li>\n<li>Use broadcast joins only when safe and sized appropriately.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Treat each run as ephemeral: write checkpoints and outputs to durable storage.<\/li>\n<li>Implement retry logic at the orchestration layer (not uncontrolled retries inside Spark).<\/li>\n<li>Validate inputs early and fail fast with clear error messages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardize log locations and naming conventions.<\/li>\n<li>Use OCI tags:<\/li>\n<li><code>cost-center<\/code>, <code>env<\/code>, <code>owner<\/code>, <code>dataset<\/code>, <code>pipeline<\/code><\/li>\n<li>Use alarms\/notifications (via Monitoring\/Events) on failures and long runtimes (verify exact events\/metrics for Data Flow runs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Naming:<\/li>\n<li>Applications: <code>env-domain-pipeline<\/code> (e.g., <code>prod-sales-curation<\/code>)<\/li>\n<li>Buckets\/prefixes: <code>env\/domain\/zone\/...<\/code><\/li>\n<li>Use compartments to separate:<\/li>\n<li>dev\/test\/prod<\/li>\n<li>raw vs curated (optional but common)<\/li>\n<li>Keep a simple data product catalog: dataset owner, SLA, retention, schema version.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>User access:<\/strong> Controlled by OCI IAM policies granting permission to manage Data Flow applications and runs.<\/li>\n<li><strong>Run-time access:<\/strong> Typically controlled via <strong>dynamic groups<\/strong> and policies that allow the Data Flow run identity to read\/write Object Storage and reach other services.<\/li>\n<\/ul>\n\n\n\n<p>Key recommendation:\n&#8211; Do not embed long-lived OCI API keys in Spark code.\n&#8211; Use OCI\u2019s recommended identity patterns for service-to-service access.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Object Storage encrypts data at rest by default using OCI-managed keys.<\/li>\n<li>For stricter control, use customer-managed keys (OCI Vault) where supported by Object Storage and your org policies (verify requirements and supportability).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If accessing only Object Storage, you may avoid VCN complexity (subject to tenancy policy).<\/li>\n<li>If accessing private endpoints:<\/li>\n<li>Use private subnets and restrictive NSGs\/security lists.<\/li>\n<li>Ensure correct routing (NAT\/Service Gateway as needed; verify with your network architecture).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Never hardcode database passwords, tokens, or API keys in scripts stored in Object Storage.<\/li>\n<li>Recommended patterns:<\/li>\n<li>Store secrets in <strong>OCI Vault<\/strong> and fetch them securely at runtime (verify supported approach for Spark jobs).<\/li>\n<li>Use short-lived credentials where possible.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>OCI 
Audit<\/strong> to track changes to Data Flow resources and IAM policies.<\/li>\n<li>Keep run logs in a dedicated logs bucket with retention controls.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: run jobs in regions aligned to your compliance requirements.<\/li>\n<li>Least privilege: restrict access to sensitive buckets and datasets.<\/li>\n<li>Retention: ensure logs and intermediate data meet policy requirements.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Overbroad policies like \u201cmanage all-resources\u201d at tenancy scope for convenience.<\/li>\n<li>Writing logs to the same bucket as curated data without access separation.<\/li>\n<li>Public buckets or overly permissive Object Storage policies.<\/li>\n<li>Using public database endpoints when private connectivity is required.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Separate compartments for dev\/test\/prod and enforce policy boundaries.<\/li>\n<li>Restrict who can create\/modify applications vs who can start runs.<\/li>\n<li>Use object versioning for scripts and artifacts to support traceability and rollback.<\/li>\n<li>Apply consistent tagging for ownership and incident response.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. Limitations and Gotchas<\/h2>\n\n\n\n<blockquote>\n<p>This section focuses on common real-world issues. 
For authoritative limits (max resources, concurrency, supported runtimes), check official Data Flow docs and OCI Service Limits.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Known limitations (categories)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Service limits:<\/strong> concurrency, maximum resources per run, and per-tenancy quotas can block runs.<\/li>\n<li><strong>Runtime constraints:<\/strong> only certain Spark\/runtime versions and configurations are supported.<\/li>\n<li><strong>Job startup time:<\/strong> serverless jobs often have a cold-start delay compared to warm clusters.<\/li>\n<li><strong>Dependency packaging:<\/strong> non-trivial Python\/JVM dependencies require careful packaging and distribution to executors.<\/li>\n<li><strong>Networking complexity:<\/strong> private access to databases requires correct VCN configuration and may be the most time-consuming part.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas and throttling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Runs may fail to start if you hit:<\/li>\n<li>Data Flow run limits<\/li>\n<li>Underlying compute quotas (depending on OCI internals and tenancy configuration)<\/li>\n<li>Track limits via the OCI Console\u2019s Limits pages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Regional constraints<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Features and availability can vary by region.<\/li>\n<li>Cross-region access can add latency and cost (egress).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing surprises<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Verbose logs retained for months in Object Storage.<\/li>\n<li>Repeated failed runs during debugging.<\/li>\n<li>Cross-region reads\/writes or internet egress.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compatibility issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark version mismatches with your code or libraries (verify supported versions).<\/li>\n<li>Python 
dependencies that work locally but fail on distributed executors if not packaged correctly.<\/li>\n<li>Reading\/writing certain file formats may require extra Spark packages (verify supported packaging approaches).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational gotchas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Output paths: Spark writes multiple part files; downstream systems must handle this.<\/li>\n<li>Small files: naive writes can produce thousands of small objects.<\/li>\n<li>Schema inference on big CSVs: slow and memory-heavy.<\/li>\n<li>Permissions: the #1 failure mode is IAM misconfiguration for Object Storage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Migration challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Migrating from self-managed Spark often requires changes to:<\/li>\n<li>Dependency distribution<\/li>\n<li>Logging paths<\/li>\n<li>IAM access model<\/li>\n<li>Networking assumptions<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p>Data Flow is one option in a broader data processing ecosystem. 
Selection depends on latency needs, operational model, and ecosystem alignment.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>OCI Data Flow (serverless Spark)<\/strong><\/td>\n<td>Batch ETL at scale without cluster management<\/td>\n<td>Serverless ops, Spark ecosystem, strong OCI integration<\/td>\n<td>Startup latency, runtime\/version constraints, IAM\/networking learning curve<\/td>\n<td>You want Spark but not cluster ops; bursty batch workloads<\/td>\n<\/tr>\n<tr>\n<td><strong>OCI Big Data Service (managed Hadoop\/Spark cluster)<\/strong><\/td>\n<td>Long-running clusters, more control, steady workloads<\/td>\n<td>Persistent cluster, deeper control, potentially better for always-on workloads<\/td>\n<td>You manage cluster lifecycle and costs of idle capacity<\/td>\n<td>You need persistent HDFS\/YARN-style environment or always-on clusters<\/td>\n<\/tr>\n<tr>\n<td><strong>OCI Data Integration<\/strong><\/td>\n<td>Graphical\/managed ETL orchestration and connectors<\/td>\n<td>Low-code pipelines, scheduling\/orchestration patterns<\/td>\n<td>Not a Spark replacement; different execution model<\/td>\n<td>You need managed integration\/orchestration; use Data Flow for heavy Spark transforms<\/td>\n<\/tr>\n<tr>\n<td><strong>Oracle Autonomous Data Warehouse (SQL ELT)<\/strong><\/td>\n<td>ELT inside a database\/warehouse<\/td>\n<td>Strong SQL engine, governance, BI integration<\/td>\n<td>Not ideal for all file-based lake processing<\/td>\n<td>Your transformations fit SQL and data already lands in ADW<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS Glue (serverless Spark)<\/strong><\/td>\n<td>Serverless Spark on AWS<\/td>\n<td>Tight AWS integration<\/td>\n<td>Cloud lock-in to AWS<\/td>\n<td>You\u2019re standardized on 
AWS<\/td>\n<\/tr>\n<tr>\n<td><strong>GCP Dataproc Serverless<\/strong><\/td>\n<td>Serverless Spark on Google Cloud<\/td>\n<td>Tight GCP integration<\/td>\n<td>Cloud lock-in to GCP<\/td>\n<td>You\u2019re standardized on GCP<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Synapse \/ Fabric Spark<\/strong><\/td>\n<td>Spark in Microsoft analytics ecosystem<\/td>\n<td>Integration with Microsoft data stack<\/td>\n<td>Ecosystem-specific<\/td>\n<td>You\u2019re standardized on Azure\/Microsoft<\/td>\n<\/tr>\n<tr>\n<td><strong>Self-managed Spark on Kubernetes<\/strong><\/td>\n<td>Maximum control and portability<\/td>\n<td>Full control, portable patterns<\/td>\n<td>Highest ops overhead<\/td>\n<td>You need custom runtimes and are willing to run the platform<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: Retail analytics lakehouse batch pipelines<\/h3>\n\n\n\n<p><strong>Problem<\/strong>\nA retailer collects large clickstream logs and transactional data daily. 
They need reliable daily transformations (sessionization, enrichment, aggregations) and curated datasets for BI and ML, without managing big data clusters.<\/p>\n\n\n\n<p><strong>Proposed architecture<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Object Storage buckets:\n<ul class=\"wp-block-list\">\n<li><code>raw\/<\/code> for incoming logs<\/li>\n<li><code>curated\/<\/code> for Parquet datasets partitioned by date<\/li>\n<li><code>logs\/<\/code> for Data Flow logs with lifecycle retention<\/li>\n<\/ul>\n<\/li>\n<li>Data Flow:\n<ul class=\"wp-block-list\">\n<li>One application per domain pipeline (sessions, orders, products)<\/li>\n<li>Runs scheduled daily by an orchestrator (could be OCI services or external CI\/CD; verify your toolchain)<\/li>\n<\/ul>\n<\/li>\n<li>IAM:\n<ul class=\"wp-block-list\">\n<li>Dynamic group for Data Flow runs<\/li>\n<li>Least-privilege bucket access policies<\/li>\n<\/ul>\n<\/li>\n<li>Observability:\n<ul class=\"wp-block-list\">\n<li>Alerts on failed runs and runtime anomalies<\/li>\n<li>Audit tracking for changes<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><strong>Why Data Flow was chosen<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark-based workloads at scale.<\/li>\n<li>Avoid operating Spark clusters for daily batch.<\/li>\n<li>Integrates with OCI IAM and compartment governance.<\/li>\n<\/ul>\n\n\n\n<p><strong>Expected outcomes<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reduced platform ops work (no cluster patching\/scaling).<\/li>\n<li>Repeatable ETL runs with consistent logging and governance.<\/li>\n<li>Faster backfills by scaling execution during reprocessing windows.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: SaaS product usage metrics pipeline<\/h3>\n\n\n\n<p><strong>Problem<\/strong><br\/>\nA small SaaS company stores product event logs in Object Storage and needs daily rollups and customer usage metrics for billing and dashboards, but has no dedicated data platform team.<\/p>\n\n\n\n<p><strong>Proposed architecture<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Object Storage:\n<ul class=\"wp-block-list\">\n<li><code>events\/<\/code> raw JSON\/CSV<\/li>\n<li><code>metrics\/<\/code> curated Parquet outputs<\/li>\n<\/ul>\n<\/li>\n<li>Data Flow:\n<ul class=\"wp-block-list\">\n<li>Single PySpark application that reads yesterday\u2019s events and writes daily metrics<\/li>\n<\/ul>\n<\/li>\n<li>Lightweight governance:\n<ul class=\"wp-block-list\">\n<li>Tags for cost attribution<\/li>\n<li>Logs bucket with 14\u201330 day retention<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><strong>Why Data Flow was chosen<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Minimal operational overhead.<\/li>\n<li>Pay-per-run aligns to daily workload.<\/li>\n<li>Spark provides flexibility for evolving metrics logic.<\/li>\n<\/ul>\n\n\n\n<p><strong>Expected outcomes<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Daily metrics produced reliably with minimal infrastructure management.<\/li>\n<li>Cost scales with usage rather than idle cluster time.<\/li>\n<li>Straightforward path to expand pipelines as product grows.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>What is Oracle Cloud Data Flow?<\/strong><br\/>\nData Flow is OCI\u2019s managed, serverless service for running Apache Spark applications for batch processing and ETL\/ELT at scale.<\/p>\n\n\n\n<p>2) <strong>Do I need to manage a Spark cluster?<\/strong><br\/>\nNo. Data Flow is serverless\u2014you submit an application\/run, and OCI manages the underlying Spark execution environment.<\/p>\n\n\n\n<p>3) <strong>Is Data Flow good for streaming?<\/strong><br\/>\nData Flow is commonly used for batch processing. For streaming use cases, evaluate whether the serverless execution model and runtime constraints meet your needs, and consider OCI\u2019s streaming services plus appropriate processing designs.<\/p>\n\n\n\n<p>4) <strong>Where does Data Flow store logs?<\/strong><br\/>\nA common pattern is storing logs in OCI Object Storage (often in a dedicated logs bucket). The console run details typically link to logs. 
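As a toy illustration, once run logs have been downloaded locally you can scan them for failures; the directory layout, file naming, and <code>ERROR<\/code> keyword below are assumptions for the sketch, not documented Data Flow log formats:

```python
# Toy sketch: scan locally downloaded Data Flow run logs for error lines.
# The directory layout and .log file naming are assumptions for illustration.
from pathlib import Path


def find_error_lines(log_dir, keyword="ERROR"):
    """Collect lines containing `keyword` from every .log file under log_dir."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*.log")):
        for line in path.read_text(errors="replace").splitlines():
            if keyword in line:
                hits.append(f"{path.name}: {line.strip()}")
    return hits
```
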
Verify current logging behavior in official docs.<\/p>\n\n\n\n<p>5) <strong>How does Data Flow access Object Storage securely?<\/strong><br\/>\nTypically through OCI IAM dynamic groups and policies that authorize Data Flow run identities to read\/write specific buckets\/objects.<\/p>\n\n\n\n<p>6) <strong>What\u2019s the difference between an Application and a Run?<\/strong><br\/>\nAn Application is a reusable job definition (code + defaults). A Run is an execution instance with specific parameters and resource sizing.<\/p>\n\n\n\n<p>7) <strong>Can I run Python (PySpark) jobs?<\/strong><br\/>\nYes\u2014Data Flow supports Spark applications, including PySpark. Verify supported versions and packaging methods in the official docs.<\/p>\n\n\n\n<p>8) <strong>Can Data Flow connect to private databases?<\/strong><br\/>\nYes, commonly via VCN\/private networking configuration, plus secure credentials handling. This requires careful network and IAM setup.<\/p>\n\n\n\n<p>9) <strong>How do I pass parameters to my Spark job?<\/strong><br\/>\nYou provide arguments in the application definition and\/or override them at run submission time.<\/p>\n\n\n\n<p>10) <strong>How do I version my ETL code?<\/strong><br\/>\nCommon approaches include storing scripts\/JARs in Object Storage with versioned paths, enabling object versioning, and referencing immutable artifact URIs in applications\/runs.<\/p>\n\n\n\n<p>11) <strong>What are the biggest causes of run failures?<\/strong><br\/>\nMost commonly: IAM permission issues to Object Storage, wrong Object Storage URIs, missing dependencies, and insufficient resources for the dataset size.<\/p>\n\n\n\n<p>12) <strong>How do I reduce costs?<\/strong><br\/>\nReduce runtime and resource allocation, write efficient formats (Parquet), avoid small files, use lifecycle policies for logs, and minimize failed\/retry runs.<\/p>\n\n\n\n<p>13) <strong>Is Data Flow regional?<\/strong><br\/>\nYes. 
You create and run resources in a specific OCI region and compartment. Plan data residency and region placement accordingly.<\/p>\n\n\n\n<p>14) <strong>Can I automate Data Flow with CI\/CD?<\/strong><br\/>\nYes\u2014use OCI CLI\/SDK\/Terraform patterns to manage applications and submit runs. Verify current APIs and best practices in official docs.<\/p>\n\n\n\n<p>15) <strong>Is Data Flow the same as OCI Data Integration?<\/strong><br\/>\nNo. Data Integration is oriented around managed integration\/orchestration patterns, while Data Flow is a serverless Spark execution service. They can complement each other.<\/p>\n\n\n\n<p>16) <strong>How do I monitor Data Flow runs?<\/strong><br\/>\nUse the Data Flow run status, logs, and OCI observability services (Monitoring\/Events\/Logging) where supported. Verify exact metrics\/events available.<\/p>\n\n\n\n<p>17) <strong>Do I pay when nothing is running?<\/strong><br\/>\nCompute charges are typically tied to runs, but you still pay for stored artifacts\/logs in Object Storage and any other always-on services you use.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. Top Online Resources to Learn Data Flow<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>OCI Data Flow Documentation<\/td>\n<td>Primary source for concepts, IAM, networking, jobs, and API references. https:\/\/docs.oracle.com\/en-us\/iaas\/data-flow\/using\/home.htm<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>OCI Price List<\/td>\n<td>Authoritative pricing by region\/service; search for Data Flow. https:\/\/www.oracle.com\/cloud\/price-list\/<\/td>\n<\/tr>\n<tr>\n<td>Official cost estimator<\/td>\n<td>OCI Cost Estimator<\/td>\n<td>Helps model compute + storage + data transfer. 
https:\/\/www.oracle.com\/cloud\/costestimator.html<\/td>\n<\/tr>\n<tr>\n<td>Official free tier info<\/td>\n<td>Oracle Cloud Free Tier<\/td>\n<td>Verify if Data Flow has free-tier usage in your tenancy\/region. https:\/\/www.oracle.com\/cloud\/free\/<\/td>\n<\/tr>\n<tr>\n<td>Official CLI install<\/td>\n<td>OCI CLI Installation Guide<\/td>\n<td>Enables repeatable automation and scripting. https:\/\/docs.oracle.com\/en-us\/iaas\/Content\/API\/SDKDocs\/cliinstall.htm<\/td>\n<\/tr>\n<tr>\n<td>Official architecture center<\/td>\n<td>Oracle Architecture Center<\/td>\n<td>Reference architectures and best practices for OCI deployments. https:\/\/docs.oracle.com\/en\/solutions\/<\/td>\n<\/tr>\n<tr>\n<td>Official tutorials\/labs<\/td>\n<td>Oracle LiveLabs<\/td>\n<td>Hands-on labs (filter for \u201cData Flow\u201d and Spark). https:\/\/apexapps.oracle.com\/pls\/apex\/r\/dbpm\/livelabs\/home<\/td>\n<\/tr>\n<tr>\n<td>Official region info<\/td>\n<td>OCI Regions<\/td>\n<td>Validate service availability by region. https:\/\/www.oracle.com\/cloud\/public-cloud-regions\/<\/td>\n<\/tr>\n<tr>\n<td>Official service limits overview<\/td>\n<td>OCI Service Limits<\/td>\n<td>Understand quotas and request increases. https:\/\/docs.oracle.com\/en-us\/iaas\/Content\/General\/Concepts\/servicelimits.htm<\/td>\n<\/tr>\n<tr>\n<td>Community learning<\/td>\n<td>Oracle Cloud community\/blog<\/td>\n<td>Practical patterns and announcements; validate against official docs. https:\/\/blogs.oracle.com\/cloud\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. 
Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, cloud engineers, architects<\/td>\n<td>OCI fundamentals, DevOps practices, cloud operations (verify exact Data Flow coverage on site)<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate IT professionals<\/td>\n<td>DevOps, SCM, CI\/CD foundations that support data platform automation<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CloudOpsNow.in<\/td>\n<td>Cloud operations teams, SRE\/ops<\/td>\n<td>Cloud operations, monitoring, reliability practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, platform engineers<\/td>\n<td>SRE principles, incident response, observability<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams adopting AIOps<\/td>\n<td>Monitoring automation, AIOps concepts<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>Cloud\/DevOps training (verify current offerings)<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps and cloud training (verify OCI coverage)<\/td>\n<td>DevOps engineers, cloud engineers<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps support\/training resources<\/td>\n<td>Teams needing hands-on guidance<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and learning resources<\/td>\n<td>Ops\/DevOps teams<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify service catalog)<\/td>\n<td>Platform engineering, automation, cloud migration<\/td>\n<td>CI\/CD for data pipelines, IaC for OCI, operational runbooks<\/td>\n<td>https:\/\/www.cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps consulting and training (verify offerings)<\/td>\n<td>DevOps transformations, tooling, delivery practices<\/td>\n<td>Building automated deployment for Data Flow apps, IAM\/policy guidance, observability setup<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify offerings)<\/td>\n<td>DevOps processes and operational improvements<\/td>\n<td>Standardizing environments, pipeline automation, incident response processes<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Data Flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OCI fundamentals:\n<ul class=\"wp-block-list\">\n<li>Compartments, regions, VCN basics<\/li>\n<li>Object Storage (buckets, prefixes, lifecycle policies)<\/li>\n<li>IAM (groups, policies, dynamic groups)<\/li>\n<\/ul>\n<\/li>\n<li>Data engineering fundamentals:\n<ul class=\"wp-block-list\">\n<li>Data lake zones (raw\/curated)<\/li>\n<li>Partitioning, file formats (CSV vs Parquet)<\/li>\n<\/ul>\n<\/li>\n<li>Spark basics:\n<ul class=\"wp-block-list\">\n<li>DataFrames, transformations vs actions<\/li>\n<li>Joins, shuffles, partitioning<\/li>\n<li>Reading\/writing to object stores<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after Data Flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Orchestration:\n<ul class=\"wp-block-list\">\n<li>Scheduling, dependency management, event-driven pipelines (tooling varies; choose what your org uses)<\/li>\n<\/ul>\n<\/li>\n<li>Data governance:\n<ul class=\"wp-block-list\">\n<li>Cataloging, lineage, data quality frameworks<\/li>\n<\/ul>\n<\/li>\n<li>Advanced Spark optimization:\n<ul class=\"wp-block-list\">\n<li>Join strategies, skew handling, adaptive query execution (if supported)<\/li>\n<\/ul>\n<\/li>\n<li>Security hardening:\n<ul class=\"wp-block-list\">\n<li>Private networking patterns, Vault-based secrets, least privilege policies<\/li>\n<\/ul>\n<\/li>\n<li>IaC:\n<ul class=\"wp-block-list\">\n<li>Terraform for repeatable Data Flow + Object Storage + IAM setup (verify official modules\/providers)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use Data Flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Engineer<\/li>\n<li>Cloud Engineer (Data Platform)<\/li>\n<li>Solutions Architect<\/li>\n<li>DevOps Engineer \/ Platform Engineer supporting data workloads<\/li>\n<li>SRE\/Operations for data platforms<\/li>\n<li>Analytics Engineer (depending on organization)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p>Oracle\u2019s certification offerings change over time. 
For OCI certification paths, start at the Oracle Cloud Infrastructure training\/certification pages and select the latest role-based track: https:\/\/education.oracle.com\/<\/p>\n\n\n\n<p>If you need a Data Flow-specific credential, verify current availability on Oracle Education.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Build a bronze\/silver\/gold pipeline using Object Storage prefixes and daily partitions.<\/li>\n<li>Implement a backfill runner: same application, different date range arguments.<\/li>\n<li>Add data quality metrics output (row counts, null counts) as a separate dataset.<\/li>\n<li>Create a cost dashboard:\n<ul class=\"wp-block-list\">\n<li>Track run durations and output sizes by tags\/datasets.<\/li>\n<\/ul>\n<\/li>\n<li>Secure private DB ingestion:\n<ul class=\"wp-block-list\">\n<li>Read from a private database endpoint and write curated datasets to Object Storage (requires networking and secrets strategy).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. 
Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Apache Spark<\/strong>: Distributed compute engine for large-scale data processing.<\/li>\n<li><strong>Application (Data Flow)<\/strong>: A reusable definition of a Spark job (artifact + defaults).<\/li>\n<li><strong>Run (Data Flow)<\/strong>: An execution instance of an application with specific parameters.<\/li>\n<li><strong>Driver<\/strong>: The Spark process coordinating the job (scheduling tasks, maintaining context).<\/li>\n<li><strong>Executor<\/strong>: Spark worker process that runs tasks and stores data partitions.<\/li>\n<li><strong>OCPU<\/strong>: Oracle CPU unit used for OCI compute billing\/quotas.<\/li>\n<li><strong>Object Storage<\/strong>: OCI service for storing unstructured data (buckets\/objects).<\/li>\n<li><strong>Namespace (Object Storage)<\/strong>: A tenancy-scoped identifier used in Object Storage URIs.<\/li>\n<li><strong>Compartment<\/strong>: OCI logical isolation boundary for organizing and controlling access to resources.<\/li>\n<li><strong>IAM Policy<\/strong>: Rules that grant permissions to users\/groups\/dynamic groups in OCI.<\/li>\n<li><strong>Dynamic Group<\/strong>: OCI IAM feature that groups resources (like service-run identities) so policies can apply to them.<\/li>\n<li><strong>VCN (Virtual Cloud Network)<\/strong>: OCI virtual network for private IP space, subnets, routing, security controls.<\/li>\n<li><strong>Parquet<\/strong>: Columnar file format optimized for analytics and compression.<\/li>\n<li><strong>Partitioning<\/strong>: Organizing data by one or more columns (commonly date) to reduce scan costs.<\/li>\n<li><strong>Shuffle<\/strong>: Spark data redistribution across partitions; often the biggest performance and cost driver.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. 
Summary<\/h2>\n\n\n\n<p><strong>Oracle Cloud Data Flow<\/strong> is OCI\u2019s serverless <strong>Apache Spark<\/strong> service in the <strong>Data Management<\/strong> category. It lets you run scalable batch ETL and data processing workloads without managing Spark clusters.<\/p>\n\n\n\n<p>It matters because it reduces operational burden while keeping the flexibility of Spark for large transformations, feature engineering, and lakehouse-style pipelines\u2014especially when your data lives in <strong>OCI Object Storage<\/strong>.<\/p>\n\n\n\n<p>Key points to remember:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cost<\/strong> is driven primarily by run-time compute resource usage (plus Object Storage\/log retention and any data transfer).<\/li>\n<li><strong>Security<\/strong> hinges on correct IAM design (dynamic groups + least-privilege policies) and careful handling of private networking and secrets.<\/li>\n<li>Use Data Flow when you want Spark at scale with a serverless model; avoid it when you need always-on interactive clusters or unsupported runtime dependencies.<\/li>\n<\/ul>\n\n\n\n<p>Next step: run the hands-on lab again with a larger dataset and add production practices\u2014partitioned outputs, lifecycle policies for logs, and automated run submission via OCI CLI\/SDK or your CI\/CD system.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data 
Management<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[68,62],"tags":[],"class_list":["post-885","post","type-post","status-publish","format-standard","hentry","category-data-management","category-oracle-cloud"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/885","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=885"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/885\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=885"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=885"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=885"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}