{"id":749,"date":"2026-04-15T10:34:25","date_gmt":"2026-04-15T10:34:25","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/oracle-cloud-big-data-discovery-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-other-services\/"},"modified":"2026-04-15T10:34:25","modified_gmt":"2026-04-15T10:34:25","slug":"oracle-cloud-big-data-discovery-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-other-services","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/oracle-cloud-big-data-discovery-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-other-services\/","title":{"rendered":"Oracle Cloud Big Data Discovery Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Other Services"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>Other Services<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p><strong>Big Data Discovery<\/strong> is an Oracle product designed to help people explore, profile, transform, and visualize large datasets\u2014especially data stored in Hadoop ecosystems\u2014without requiring every user to write code.<\/p>\n\n\n\n<p>In simple terms: Big Data Discovery is a \u201cdata exploration and preparation\u201d tool for big data. It lets analysts and engineers ingest data (often from Hadoop), understand it quickly (profiling and sampling), clean it (enrichment and transforms), and publish curated datasets for downstream analytics.<\/p>\n\n\n\n<p>In more technical terms: Big Data Discovery combines a browser-based exploration experience with a backend processing and indexing layer that can work with big-data storage and query engines. It was commonly positioned alongside Oracle Big Data Appliance and related Oracle big-data components. 
It overlaps conceptually with modern \u201cdata prep + exploration + visualization\u201d workflows that today are frequently implemented using cloud-native services (data lake, Spark, SQL engines, and BI).<\/p>\n\n\n\n<p><strong>What problem it solves:<\/strong> teams often have large, messy datasets with unknown schema quality, missing values, inconsistent formatting, and unclear distributions. Big Data Discovery addresses the \u201ctime-to-first-insight\u201d gap by providing interactive discovery, profiling, and transformation workflows so teams can build trusted, analysis-ready datasets faster.<\/p>\n\n\n\n<blockquote>\n<p>Important lifecycle note (read first): Oracle\u2019s \u201cBig Data Discovery\u201d has historically existed as a product integrated with Oracle\u2019s big data stack (commonly associated with Oracle Big Data Appliance) and was also offered in some form in older Oracle Cloud environments. In many Oracle Cloud Infrastructure (OCI) tenancies today, <strong>Big Data Discovery does not appear as a native OCI managed service in the console<\/strong>. Availability, lifecycle status (active vs. legacy), and procurement\/licensing can vary by Oracle program and contract. <strong>Verify current availability and lifecycle status in official Oracle documentation and with Oracle Sales\/Support<\/strong> before designing new long-term architectures around it.<\/p>\n<\/blockquote>\n\n\n\n<p>Because of that reality, this tutorial does two things:\n1. Teaches Big Data Discovery accurately as a product and where it fits.\n2. Provides a <strong>practical, executable OCI lab<\/strong> that recreates a \u201cBig Data Discovery-style\u201d workflow using current Oracle Cloud services (Object Storage + Autonomous Database + built-in analytics tools). This is often the most practical approach for new projects on OCI.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
What is Big Data Discovery?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose (product intent)<\/h3>\n\n\n\n<p>Big Data Discovery is intended to help users:\n&#8211; Connect to large data sources (commonly Hadoop\/Hive\/HDFS in Oracle big data deployments).\n&#8211; Explore datasets interactively (search, filter, facet, profile).\n&#8211; Perform data preparation (cleaning, enrichment, transformations).\n&#8211; Publish curated datasets for analytics and reporting.<\/p>\n\n\n\n<p>If you have Big Data Discovery in your environment, consult the <strong>Oracle Big Data Discovery documentation set<\/strong> available through Oracle Help Center (see resources at the end) for the exact version and supported integrations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities (what it typically provides)<\/h3>\n\n\n\n<p>Big Data Discovery capabilities, as described in Oracle materials for the product, typically include:\n&#8211; <strong>Dataset ingestion<\/strong> from big-data repositories and structured sources (implementation depends on deployment).\n&#8211; <strong>Data profiling<\/strong> (type inference, cardinality, distributions, outliers).\n&#8211; <strong>Search and faceted exploration<\/strong> for fast slicing\/dicing of large datasets.\n&#8211; <strong>Data preparation<\/strong> workflows (standardization, parsing, filtering, joining, deriving fields).\n&#8211; <strong>Publishing\/export<\/strong> of curated outputs for downstream BI or data science workflows.<\/p>\n\n\n\n<blockquote>\n<p>Caveat: exact connectors, processing engines, and export targets are <strong>version- and deployment-dependent<\/strong>. 
Verify in the documentation for your Big Data Discovery version.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (conceptual)<\/h3>\n\n\n\n<p>A typical Big Data Discovery deployment historically included:\n&#8211; A <strong>web-based \u201cStudio\u201d<\/strong> experience for interactive discovery and preparation.\n&#8211; A <strong>processing layer<\/strong> to run transformations at scale (often integrated with big-data processing frameworks in the environment).\n&#8211; An <strong>indexing\/search layer<\/strong> enabling fast interactive filtering and faceting.\n&#8211; Admin\/configuration components for connectivity, security, and operations.<\/p>\n\n\n\n<blockquote>\n<p>Names and internal architecture details vary by release. Use the official admin and installation guides for your exact build.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<p>In the Oracle Cloud \u201cOther Services\u201d category context, it\u2019s best to think of Big Data Discovery as:\n&#8211; A <strong>product\/workload<\/strong> that you run as part of a broader big-data platform,\nnot necessarily a first-class \u201cOCI native managed service\u201d (like Object Storage or Autonomous Database).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Scope: regional\/global\/zonal?<\/h3>\n\n\n\n<p>Because Big Data Discovery is not universally exposed as a native OCI resource, \u201cscope\u201d is generally:\n&#8211; <strong>Deployment-scoped<\/strong>: it runs where you deploy it (on-prem, appliance, or customer-managed compute).\n&#8211; Its effective availability is determined by your infrastructure and licensing rather than OCI region catalogs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Oracle Cloud ecosystem<\/h3>\n\n\n\n<p>In modern OCI architectures, Big Data Discovery\u2019s role is often fulfilled by a combination of:\n&#8211; <strong>Oracle Cloud Infrastructure Object Storage<\/strong> (data lake 
storage),\n&#8211; <strong>OCI Data Flow (Apache Spark)<\/strong> or <strong>OCI Big Data Service<\/strong> (processing),\n&#8211; <strong>Autonomous Database<\/strong> (curated\/serving layer),\n&#8211; <strong>Oracle Analytics Cloud<\/strong> (BI\/visualization),\n&#8211; <strong>OCI Data Integration<\/strong> (ETL\/ELT orchestration).<\/p>\n\n\n\n<p>So even when Big Data Discovery itself is not used, the <em>workflow<\/em> it represents remains a common requirement: interactive discovery + preparation + publishing trusted datasets.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. Why use Big Data Discovery?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster insight from large datasets:<\/strong> reduce the time spent just understanding data shape and quality.<\/li>\n<li><strong>Self-service discovery:<\/strong> analysts can explore data without waiting for custom engineering pipelines for every question.<\/li>\n<li><strong>Improved data trust:<\/strong> profiling and preparation steps help produce cleaner datasets for decision-making.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Interactive exploration over large data:<\/strong> supports ad-hoc discovery patterns that can be slower in SQL-only workflows (depending on indexing\/engine).<\/li>\n<li><strong>Repeatable transformations:<\/strong> data prep can be standardized and reused.<\/li>\n<li><strong>Bridge between raw data and analytics:<\/strong> publish curated outputs for BI, ML, or downstream reporting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Standardized discovery tooling:<\/strong> reduces \u201cspreadsheet chaos\u201d and inconsistent local scripts.<\/li>\n<li><strong>Governance alignment:<\/strong> centralized 
platform is easier to govern than scattered personal scripts (when deployed and managed properly).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Centralized access control and auditability (deployment-dependent).<\/li>\n<li>Reduced need to copy data to unmanaged endpoints for exploration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Designed for large datasets and big data ecosystems (particularly where deployed with Hadoop-related storage\/query engines).<\/li>\n<li>Supports sampling and summary-based exploration patterns to keep UIs responsive.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p>Choose Big Data Discovery when:\n&#8211; You already have it licensed\/deployed (or part of an Oracle big data platform) and it matches your sources.\n&#8211; You need an interactive data prep and discovery experience tightly integrated with your big data environment.\n&#8211; You have operational maturity to maintain the platform (patching, scaling, governance).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p>Avoid Big Data Discovery when:\n&#8211; You\u2019re starting greenfield on OCI and need a <strong>fully managed, roadmap-forward<\/strong> cloud service (Big Data Discovery may be legacy for many customers).\n&#8211; Your main need is BI dashboards over curated data (Oracle Analytics Cloud may be a simpler fit).\n&#8211; You want open lakehouse formats (Parquet\/Iceberg\/Delta) with modern query engines and minimal proprietary dependencies\u2014evaluate OCI Data Flow + Trino\/Presto patterns instead.\n&#8211; You cannot staff platform operations (customer-managed software can be operationally heavy).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Big Data Discovery used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Financial services (risk analytics, fraud exploration, compliance datasets)<\/li>\n<li>Retail\/e-commerce (clickstream exploration, product analytics)<\/li>\n<li>Telecom (CDR exploration, network event analytics)<\/li>\n<li>Manufacturing\/IoT (sensor data quality and anomaly exploration)<\/li>\n<li>Healthcare (claims analytics, operational reporting; subject to strict compliance)<\/li>\n<li>Public sector (case analytics, citizen service data quality)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data analysts and BI teams doing exploratory work<\/li>\n<li>Data engineering teams preparing curated datasets<\/li>\n<li>Data science teams validating features and distributions<\/li>\n<li>Platform teams standardizing data exploration tooling<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Exploratory data analysis (EDA) over big-data repositories<\/li>\n<li>Data quality and profiling at scale<\/li>\n<li>Building curated datasets from raw lakes<\/li>\n<li>Publishing \u201canalysis-ready\u201d datasets for BI\/ML<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Hadoop-centric environments (historically common)<\/li>\n<li>Data lake + processing + serving layer patterns<\/li>\n<li>Hybrid: on-prem big data + cloud analytics serving<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Existing Oracle big data platforms where Big Data Discovery is already part of the stack<\/li>\n<li>Migration scenarios: using Big Data Discovery outputs to transition to OCI analytics services<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Production vs dev\/test usage<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li><strong>Production<\/strong>: curated dataset publishing, governed exploration, standardized transformations.<\/li>\n<li><strong>Dev\/test<\/strong>: discovery of new sources, profiling, POCs for analytics use cases.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic scenarios where Big Data Discovery (or a Big Data Discovery-style workflow) is commonly applied.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Data lake profiling before onboarding to analytics<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> raw data arrives with unknown schema drift and inconsistent quality.<\/li>\n<li><strong>Why Big Data Discovery fits:<\/strong> profiling + interactive exploration helps teams understand distributions, nulls, and anomalies quickly.<\/li>\n<li><strong>Example:<\/strong> a retail team receives daily CSV\/JSON dumps of transactions and needs to validate fields and detect missing store IDs before building reports.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Self-service exploration for analysts on big datasets<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> analysts are blocked waiting for engineering to build custom extracts.<\/li>\n<li><strong>Why it fits:<\/strong> interactive filtering\/faceting reduces dependence on ad-hoc pipelines.<\/li>\n<li><strong>Example:<\/strong> marketing analysts explore clickstream attributes to identify top referral sources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Preparing curated datasets for BI dashboards<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> BI dashboards fail due to inconsistent formats and dirty dimensions.<\/li>\n<li><strong>Why it fits:<\/strong> standardized transforms create consistent columns (dates, categories, IDs).<\/li>\n<li><strong>Example:<\/strong> telecom team 
standardizes device model strings and publishes a clean \u201csubscriber_device_dim\u201d.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Joining heterogeneous sources into a unified dataset<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> data lives in multiple sources (events + reference data) with inconsistent keys.<\/li>\n<li><strong>Why it fits:<\/strong> preparation steps can include joins\/derivations (capability depends on version).<\/li>\n<li><strong>Example:<\/strong> manufacturer joins sensor readings with equipment metadata for plant-level KPIs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Detecting outliers and data quality issues early<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> pipelines silently ingest bad data leading to wrong decisions.<\/li>\n<li><strong>Why it fits:<\/strong> profiling and distributions reveal outliers and breaks.<\/li>\n<li><strong>Example:<\/strong> finance sees a sudden spike in \u201ctransaction_amount\u201d due to a unit conversion bug.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Publishing datasets for downstream ML feature engineering<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> data scientists spend too long cleaning data before modeling.<\/li>\n<li><strong>Why it fits:<\/strong> curated, standardized datasets reduce duplicated cleanup.<\/li>\n<li><strong>Example:<\/strong> fraud modelers receive a prepared dataset with consistent merchant categories and cleaned timestamps.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Investigating operational incidents with fast filtering<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> SRE\/ops teams need to explore event logs at scale.<\/li>\n<li><strong>Why it fits:<\/strong> faceted exploration supports quick narrowing by host, error code, region (depending on ingestion).<\/li>\n<li><strong>Example:<\/strong> ops team investigates 
\u201cpayment_timeouts\u201d across services after a deployment.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Compliance reporting dataset preparation<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> compliance needs repeatable datasets with lineage and consistent logic.<\/li>\n<li><strong>Why it fits:<\/strong> repeatable transformations reduce one-off spreadsheet manipulation.<\/li>\n<li><strong>Example:<\/strong> bank prepares monthly AML case dataset with standardized customer identifiers.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Enrichment and standardization of semi-structured fields<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> addresses, names, product codes are messy.<\/li>\n<li><strong>Why it fits:<\/strong> transformations can parse and standardize values (exact enrichment varies).<\/li>\n<li><strong>Example:<\/strong> e-commerce standardizes shipping address fields and extracts postal code.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Migration discovery: understanding Hadoop datasets before moving to OCI<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> organizations want to migrate but don\u2019t know which datasets are important or clean.<\/li>\n<li><strong>Why it fits:<\/strong> discovery identifies high-value datasets and quality issues to plan migration.<\/li>\n<li><strong>Example:<\/strong> enterprise profiles Hive tables, identifies top-used columns, and prioritizes migration into OCI lakehouse patterns.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<p>Because Big Data Discovery\u2019s availability and packaging can vary, this section describes commonly documented feature categories. 
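As a concrete, tool-agnostic illustration of what these preparation features do, here is a minimal Python sketch that standardizes a messy device-model field and derives a date column (echoing the telecom example above). All values, field names, and rules are invented for this example; they are not Big Data Discovery APIs.

```python
# Illustrative preparation transforms: standardize a messy categorical
# field and derive an event_date column. Input values are invented.
import datetime

raw = [
    {"device": " iphone-13 ", "event_ts": "2026-04-15T10:34:25"},
    {"device": "IPHONE 13",   "event_ts": "2026-04-14T08:00:00"},
    {"device": "galaxy_s22",  "event_ts": "2026-04-13T23:59:59"},
]

def standardize_device(value):
    """Normalize whitespace, case, and separators in a device string."""
    return value.strip().lower().replace("-", " ").replace("_", " ")

def prepare(rows):
    """Apply standardization and derive an event_date column."""
    out = []
    for r in rows:
        ts = datetime.datetime.fromisoformat(r["event_ts"])
        out.append({
            "device": standardize_device(r["device"]),
            "event_date": ts.date().isoformat(),
        })
    return out

curated = prepare(raw)
```

In Big Data Discovery itself, equivalent steps would typically be defined interactively in Studio and executed by the processing layer rather than hand-coded.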
<strong>Confirm exact feature availability in your Big Data Discovery version docs.<\/strong><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 1: Interactive data exploration (search\/filter\/facets)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> lets users explore datasets by filtering, searching, and slicing attributes interactively.<\/li>\n<li><strong>Why it matters:<\/strong> reduces time spent writing exploratory queries; accelerates understanding of data.<\/li>\n<li><strong>Practical benefit:<\/strong> faster ad-hoc investigations and quicker iteration with stakeholders.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> responsiveness depends on indexing\/engine configuration and dataset size; some transformations may require batch processing.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 2: Data profiling and summary statistics<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> provides distributions, cardinality, missing values, and type inference.<\/li>\n<li><strong>Why it matters:<\/strong> data quality issues are common in lakes; profiling surfaces them early.<\/li>\n<li><strong>Practical benefit:<\/strong> improves downstream pipeline reliability and reduces \u201csilent failures.\u201d<\/li>\n<li><strong>Limitations\/caveats:<\/strong> profiling large datasets may rely on sampling; verify how sampling is configured.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 3: Data preparation \/ transformation workflows<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> supports common transforms like filtering rows, deriving columns, parsing strings\/dates, and standardizing values.<\/li>\n<li><strong>Why it matters:<\/strong> most analytics value comes after cleaning\/curation.<\/li>\n<li><strong>Practical benefit:<\/strong> repeatable prep steps reduce spreadsheet-based manipulation and duplicated 
scripts.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> built-in transforms are not a full ETL replacement; complex workflows may still require Spark\/SQL pipelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 4: Sampling to keep exploration responsive<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> allows users to work on representative subsets of huge datasets.<\/li>\n<li><strong>Why it matters:<\/strong> exploration UIs can\u2019t always operate on full-scale data interactively.<\/li>\n<li><strong>Practical benefit:<\/strong> quick iteration on cleaning logic before applying it at full scale.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> sampling can mislead if data is highly skewed; validate with full runs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 5: Publishing curated datasets<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> exports\/publishes prepared datasets for consumption by BI or other systems.<\/li>\n<li><strong>Why it matters:<\/strong> turns exploratory work into reusable assets.<\/li>\n<li><strong>Practical benefit:<\/strong> consistent curated datasets across teams.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> export targets and formats depend on environment integration; confirm supported sinks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 6: Collaboration and project organization<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> organizes work into projects\/datasets with saved steps and shareable assets.<\/li>\n<li><strong>Why it matters:<\/strong> prevents knowledge loss and \u201ctribal scripts.\u201d<\/li>\n<li><strong>Practical benefit:<\/strong> repeatable prep pipelines and easier onboarding.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> governance depends on how access control is implemented.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 7: Security integration 
(authentication\/authorization)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> controls who can access datasets and features.<\/li>\n<li><strong>Why it matters:<\/strong> discovery tools often surface sensitive columns; least privilege is critical.<\/li>\n<li><strong>Practical benefit:<\/strong> safer self-service analytics.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> integration method depends on deployment (e.g., enterprise identity providers); verify supported modes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Feature 8: Administrative controls and monitoring hooks<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> provides ways to configure sources, manage users, and monitor health.<\/li>\n<li><strong>Why it matters:<\/strong> discovery platforms need operational oversight to stay stable.<\/li>\n<li><strong>Practical benefit:<\/strong> more predictable performance and better troubleshooting.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> monitoring integrations vary; you may need external monitoring stacks.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p>A Big Data Discovery-style platform typically has:\n1. <strong>Data sources<\/strong> (HDFS\/Hive tables, object storage, databases\u2014depending on connectors).\n2. <strong>Ingestion\/metadata layer<\/strong> to register datasets.\n3. <strong>Indexing and exploration layer<\/strong> to support fast interactive filtering and profiling.\n4. <strong>Processing layer<\/strong> to run transformations at scale.\n5. <strong>Publishing layer<\/strong> to produce curated outputs.\n6. 
<strong>Access control<\/strong> integrated with enterprise identity\/IAM.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow (conceptual)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A user logs into the Studio\/UI.<\/li>\n<li>The user selects or ingests a dataset.<\/li>\n<li>The system profiles the dataset and builds indexes\/metadata to enable interactive exploration.<\/li>\n<li>The user defines transformations (cleaning\/derivations).<\/li>\n<li>The system executes transforms (possibly using a cluster compute engine).<\/li>\n<li>Results are published to downstream destinations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services (OCI context)<\/h3>\n\n\n\n<p>If you are implementing this workflow on OCI today (even without Big Data Discovery), typical integrations are:\n&#8211; <strong>OCI Object Storage<\/strong> for raw and curated data.\n&#8211; <strong>Autonomous Database (ADW\/ATP)<\/strong> for curated serving datasets.\n&#8211; <strong>OCI Data Flow (Spark)<\/strong> for transformation at scale.\n&#8211; <strong>OCI Logging<\/strong> for centralized logs (for OCI-native services).\n&#8211; <strong>OCI IAM<\/strong> for least privilege access to buckets and databases.\n&#8211; <strong>Oracle Analytics Cloud<\/strong> for visualization (licensed service).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Identity: <strong>OCI IAM<\/strong> (OCI-native workflows), or enterprise IdP for legacy deployments.<\/li>\n<li>Storage: Object Storage \/ HDFS \/ database storage depending on deployment.<\/li>\n<li>Compute\/processing: Spark\/Hadoop\/YARN or OCI Data Flow.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>OCI-native pattern:<\/strong> users and services authenticate via IAM policies, dynamic groups, and resource principals (for services like Data 
Flow).<\/li>\n<li><strong>Legacy\/platform pattern:<\/strong> user auth may be integrated with LDAP\/SSO depending on deployment.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>OCI-native:<\/strong> private endpoints for Autonomous Database; private access to Object Storage via Service Gateway; limit public exposure.<\/li>\n<li><strong>Legacy:<\/strong> depends on how the platform is deployed (on-prem network segmentation, firewalls).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Define ownership for datasets and transformations (data product mindset).<\/li>\n<li>Centralize logs:\n<ul class=\"wp-block-list\">\n<li>OCI Logging for OCI services.<\/li>\n<li>Database audit logs for Autonomous Database.<\/li>\n<\/ul>\n<\/li>\n<li>Set tagging standards and cost tracking in OCI.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Simple architecture diagram (conceptual)<\/h4>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[User \/ Analyst] --&gt; UI[Big Data Discovery Studio&lt;br\/&gt;or Discovery UI]\n  UI --&gt; META[Metadata + Profiling]\n  META --&gt; IDX[Index\/Search Layer]\n  UI --&gt; PROC[Processing\/Transform Layer]\n  PROC --&gt; SRC[(Raw Data Store&lt;br\/&gt;HDFS \/ Object Storage)]\n  PROC --&gt; CUR[(Curated Output&lt;br\/&gt;DB \/ Object Storage)]\n  CUR --&gt; BI[BI \/ Analytics Consumers]\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Production-style architecture diagram (OCI-native replacement pattern)<\/h4>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph OCI[\"Oracle Cloud Infrastructure (OCI)\"]\n    subgraph Net[\"VCN (private)\"]\n      ADB[(Autonomous Database&lt;br\/&gt;Private Endpoint)]\n      BAST[Admin Bastion \/ Private Admin Host]\n    end\n\n    OS[(Object Storage Buckets&lt;br\/&gt;Raw + Curated)]\n    DF[\"OCI Data Flow (Spark Jobs)\"]\n    LOG[OCI Logging]\n    
AUD[Audit Logs]\n    IAM[IAM Policies&lt;br\/&gt;+ Dynamic Groups]\n  end\n\n  EXT[Enterprise Users] --&gt;|SSO\/IAM| IAM\n  EXT --&gt;|SQL\/Web| ADB\n  EXT --&gt;|Console\/API| OS\n\n  DF --&gt;|Read\/Write| OS\n  DF --&gt;|Load curated| ADB\n  DF --&gt; LOG\n  ADB --&gt; AUD\n  OS --&gt; AUD\n\n  BAST --&gt;|Private admin| ADB\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<p>Because Big Data Discovery itself may not be a universally available OCI service, prerequisites are split into two parts:\n&#8211; <strong>A) If you already have Big Data Discovery<\/strong> (legacy\/product environment)\n&#8211; <strong>B) If you will follow the OCI hands-on lab<\/strong> (recommended for most new OCI users)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A) Big Data Discovery (product) prerequisites (verify in your version docs)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Access to a Big Data Discovery environment (often part of an Oracle big data platform deployment).<\/li>\n<li>Admin-provisioned connectivity to your data sources (Hive\/HDFS\/etc., depending on your environment).<\/li>\n<li>User authentication set up (SSO\/LDAP\/IAM depending on deployment).<\/li>\n<li>Permissions to create projects\/datasets and run transformations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">B) OCI hands-on lab prerequisites (Big Data Discovery-style workflow on Oracle Cloud)<\/h3>\n\n\n\n<p><strong>Account\/tenancy<\/strong>\n&#8211; An active <strong>Oracle Cloud (OCI) tenancy<\/strong> with billing enabled (or free trial).\n&#8211; Permission to create and manage:\n  &#8211; Object Storage buckets\/objects\n  &#8211; Autonomous Database (Always Free if available in your region)\n  &#8211; IAM policies (or an admin who can create required policies)<\/p>\n\n\n\n<p><strong>IAM permissions<\/strong>\n&#8211; You need a group with policies similar to:\n  &#8211; Manage Object Storage in a compartment\n 
 &#8211; Manage Autonomous Database in a compartment\n  &#8211; Use Cloud Shell (optional)\n&#8211; If you are not an admin, ask your OCI admin to grant least-privilege permissions.<\/p>\n\n\n\n<p><strong>Tools<\/strong>\n&#8211; OCI Console access.\n&#8211; Optionally:\n  &#8211; <strong>OCI Cloud Shell<\/strong> (recommended) or OCI CLI installed locally.\n  &#8211; A SQL client: SQL Developer, SQLcl, or the built-in Autonomous Database SQL tools.<\/p>\n\n\n\n<p><strong>Region availability<\/strong>\n&#8211; Object Storage and Autonomous Database are widely available, but Always Free availability can vary.\n&#8211; If a service isn\u2019t available in your region, select a different OCI region (if your tenancy allows) or use paid resources.<\/p>\n\n\n\n<p><strong>Quotas\/limits<\/strong>\n&#8211; Autonomous Database Always Free has resource limits.\n&#8211; Object Storage has tenancy-level service limits.\n&#8211; If you hit a limit error, request a service limit increase (paid accounts) or use a smaller dataset.<\/p>\n\n\n\n<p><strong>Prerequisite services<\/strong>\n&#8211; OCI Object Storage\n&#8211; Oracle Autonomous Database (ADW or ATP)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Big Data Discovery pricing model (important reality)<\/h3>\n\n\n\n<p>Big Data Discovery is not typically priced like a modern OCI consumption service with a public \u201cper-GB\/per-OCPU\u201d meter in the OCI price list. 
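The OCI services used in the hands-on lab, by contrast, are consumption-metered, so rough estimates reduce to simple arithmetic. The sketch below uses purely hypothetical placeholder rates and workload numbers; none of them are Oracle prices.

```python
# Rough monthly cost arithmetic for metered OCI services.
# ALL rates below are HYPOTHETICAL placeholders, not Oracle prices;
# substitute real figures from the official OCI price list.
STORAGE_RATE_PER_GB_MONTH = 0.025  # hypothetical $/GB-month (Object Storage)
OCPU_RATE_PER_HOUR = 0.35          # hypothetical $/OCPU-hour (Spark/DB compute)

def monthly_estimate(storage_gb, ocpus, hours_per_month):
    """Estimate monthly cost as storage plus compute runtime."""
    storage = storage_gb * STORAGE_RATE_PER_GB_MONTH
    compute = ocpus * hours_per_month * OCPU_RATE_PER_HOUR
    return round(storage + compute, 2)

# Example workload: 500 GB raw+curated data, 2 OCPUs for 10 hours/month.
estimate = monthly_estimate(500, 2, 10)
print(estimate)  # 19.5
```

Replace the placeholder rates with current figures from the OCI price list and your Oracle contract before budgeting; Big Data Discovery itself is not metered this way.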
Instead, it has historically been:\n&#8211; Included in certain Oracle big-data platform offerings, or\n&#8211; Licensed as software (terms vary)<\/p>\n\n\n\n<p><strong>What to do:<\/strong>\n&#8211; <strong>Verify Big Data Discovery commercial and licensing terms<\/strong> with Oracle Sales\/Account team.\n&#8211; Check your support contract and product availability for your environment.<\/p>\n\n\n\n<p>Because public, region-based OCI pricing pages may not list Big Data Discovery explicitly, you should not assume it behaves like a pay-as-you-go OCI native service.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cost model for the OCI lab (Big Data Discovery-style workflow)<\/h3>\n\n\n\n<p>The lab in this tutorial uses common OCI services with published pricing:\n&#8211; <strong>OCI Object Storage<\/strong>: billed by stored GB-month and requests (and data egress if applicable).\n&#8211; <strong>Autonomous Database<\/strong>: Always Free option may cost $0 within limits; paid tiers bill by OCPU and storage.\n&#8211; Optional additions (not required):\n  &#8211; <strong>OCI Data Flow<\/strong>: billed by OCPU time (Spark job runtime) and possibly other dimensions depending on SKU.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier notes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Autonomous Database has an <strong>Always Free<\/strong> option in many regions\/tenancies (verify in your OCI console).<\/li>\n<li>Object Storage has limited \u201cfree\u201d components depending on promotions; assume storage is billed unless your tenancy offers credits.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers<\/h3>\n\n\n\n<p>Direct cost drivers:\n&#8211; Object Storage data volume (raw + curated + logs\/exports).\n&#8211; Autonomous Database size and compute (if not Always Free).\n&#8211; Any optional Spark processing (Data Flow) runtime.<\/p>\n\n\n\n<p>Indirect\/hidden costs:\n&#8211; Data transfer\/egress if you move data out of OCI regions.\n&#8211; Operational 
overhead (time) if you maintain self-managed tooling.\n&#8211; Backups and retention if you store many copies.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ingress<\/strong> to OCI is typically not billed, but <strong>egress<\/strong> to the internet is usually billed. Verify OCI data transfer pricing for your region.<\/li>\n<li>Cross-region replication and reads may incur additional costs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep raw data in Object Storage and only curate what you need into the database.<\/li>\n<li>Use compressed columnar formats (Parquet) for curated lake data when possible.<\/li>\n<li>Use Always Free Autonomous Database for small labs and prototypes.<\/li>\n<li>Set lifecycle policies on buckets to archive or delete older objects.<\/li>\n<li>Tag resources for cost tracking and shut down\/delete unused resources.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (no fabricated numbers)<\/h3>\n\n\n\n<p>A low-cost starter design usually includes:\n&#8211; One Object Storage bucket with a small dataset (&lt; a few GB).\n&#8211; One Autonomous Database Always Free instance.\n&#8211; Optional: no Data Flow jobs.<\/p>\n\n\n\n<p>Cost should be minimal (often near $0 if Always Free is used and storage is small), but <strong>verify in the OCI Cost Estimator<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>For production-scale discovery\/prep workflows:\n&#8211; Expect significant Object Storage growth (raw + curated + historical).\n&#8211; Autonomous Database paid tiers if you need larger compute\/storage and higher concurrency.\n&#8211; Spark processing costs (OCI Data Flow) if you run frequent large jobs.\n&#8211; Monitoring\/log retention costs and security services.<\/p>\n\n\n\n<h3 
class=\"wp-block-heading\">Official pricing references (start here)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OCI pricing overview and cost estimator: https:\/\/www.oracle.com\/cloud\/costestimator.html<\/li>\n<li>OCI service pricing, including Object Storage (navigate to \u201cStorage\u201d) and Data Flow (if used): https:\/\/www.oracle.com\/cloud\/pricing\/<\/li>\n<li>Autonomous Database pricing: https:\/\/www.oracle.com\/autonomous-database\/pricing\/<\/li>\n<\/ul>\n\n\n\n<blockquote>\n<p>Pricing pages can be reorganized over time. If a link changes, start from the OCI pricing page and drill down by service.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>Because Big Data Discovery may not be available as a native OCI managed service in your tenancy, this lab provides a <strong>Big Data Discovery-style workflow<\/strong> on Oracle Cloud using commonly available OCI services. The end result is the same outcome Big Data Discovery is typically used for: <strong>ingest \u2192 profile \u2192 clean\/transform \u2192 publish \u2192 explore<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Build a small, realistic \u201cdiscovery and preparation\u201d pipeline on Oracle Cloud:\n1. Store a raw CSV dataset in <strong>OCI Object Storage<\/strong>.\n2. Load it into an <strong>Autonomous Database (Always Free where available)<\/strong> using <code>DBMS_CLOUD<\/code>.\n3. Profile and transform the dataset with SQL.\n4. 
Explore results with built-in Autonomous Database tools (and optionally connect a BI tool later).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will create:\n&#8211; 1 compartment (optional but recommended)\n&#8211; 1 Object Storage bucket + uploaded dataset\n&#8211; 1 Autonomous Database (ADW or ATP)\n&#8211; 1 database user + credential to read from Object Storage\n&#8211; 1 raw table + 1 curated table\n&#8211; Simple profiling queries and a \u201cpublish\u201d view<\/p>\n\n\n\n<p><strong>Estimated time:<\/strong> 60\u2013120 minutes<br\/>\n<strong>Cost:<\/strong> Low. Potentially $0 if you use Autonomous Database Always Free and a small dataset.<br\/>\n<strong>Skill level:<\/strong> Beginner-friendly; includes IAM and SQL fundamentals.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a compartment (recommended)<\/h3>\n\n\n\n<p><strong>Why:<\/strong> compartments help isolate access and costs for labs.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the OCI Console, open the navigation menu \u2192 <strong>Identity &amp; Security<\/strong> \u2192 <strong>Compartments<\/strong>.<\/li>\n<li>Click <strong>Create Compartment<\/strong>.<\/li>\n<li>Name: <code>bdd-lab<\/code><br\/>\n   Description: <code>Big Data Discovery style lab resources<\/code> <\/li>\n<li>Click <strong>Create Compartment<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> You have a <code>bdd-lab<\/code> compartment to place all resources.<\/p>\n\n\n\n<p><strong>Verification:<\/strong> You can select <code>bdd-lab<\/code> in the compartment picker.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create an Object Storage bucket and upload a dataset<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">2.1 Create a bucket<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Go to <strong>Storage<\/strong> \u2192 <strong>Buckets<\/strong>.<\/li>\n<li>Ensure the 
compartment is <code>bdd-lab<\/code>.<\/li>\n<li>Click <strong>Create Bucket<\/strong>.<\/li>\n<li>Name: <code>bdd-lab-raw<\/code><\/li>\n<li>Keep defaults (Standard storage tier is fine for the lab).<\/li>\n<li>Click <strong>Create<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> bucket <code>bdd-lab-raw<\/code> exists.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">2.2 Upload a sample CSV<\/h4>\n\n\n\n<p>Pick a small dataset you can legally use. Two good options:\n&#8211; A public dataset from a government open data portal\n&#8211; A synthetic dataset you generate yourself<\/p>\n\n\n\n<p>For a quick lab, you can generate a synthetic CSV locally:<\/p>\n\n\n\n<pre><code class=\"language-bash\">cat &gt; sales_raw.csv &lt;&lt;'EOF'\norder_id,order_ts,customer_id,region,product,qty,unit_price,status\n1,2026-01-05T10:15:00Z,C001,us-phx,keyboard,1,45.00,SHIPPED\n2,2026-01-05T11:02:00Z,C002,us-ashburn,mouse,2,18.50,SHIPPED\n3,2026-01-06T09:41:00Z,C003,eu-frankfurt,monitor,1,199.99,PENDING\n4,2026-01-06T09:41:00Z,C003,eu-frankfurt,monitor,1,199.99,PENDING\n5,2026-01-07T14:20:00Z,,us-phx,usb-c cable,3,9.99,CANCELLED\n6,2026-01-08T08:05:00Z,C004,us-phx,laptop,1,899.00,SHIPPED\n7,2026-01-08T08:07:00Z,C004,us-phx,laptop,1,899.00,SHIPPED\n8,2026-01-10T16:55:00Z,C005,ap-tokyo,headset,2,59.90,SHIPPED\nEOF\n<\/code><\/pre>\n\n\n\n<p>Upload it:\n&#8211; Buckets \u2192 <code>bdd-lab-raw<\/code> \u2192 <strong>Upload<\/strong> \u2192 select <code>sales_raw.csv<\/code><\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> <code>sales_raw.csv<\/code> is stored in Object Storage.<\/p>\n\n\n\n<p><strong>Verification:<\/strong> You can see the object in the bucket listing.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create an Autonomous Database (Always Free if available)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Go to <strong>Oracle Database<\/strong> \u2192 <strong>Autonomous Database<\/strong>.<\/li>\n<li>Select 
compartment: <code>bdd-lab<\/code>.<\/li>\n<li>Click <strong>Create Autonomous Database<\/strong>.<\/li>\n<li>Choose a workload:\n   &#8211; <strong>Autonomous Data Warehouse (ADW)<\/strong> is often a good fit for analytics labs.<\/li>\n<li>Display name: <code>bdd_lab_adw<\/code><\/li>\n<li>Database name: <code>BDDLAB<\/code><\/li>\n<li>Choose <strong>Always Free<\/strong> if available.<\/li>\n<li>Set admin password (store it securely).<\/li>\n<li>Networking:\n   &#8211; For the simplest lab, you can use public access with allowed IPs.\n   &#8211; For more secure setups, use private endpoint in a VCN (adds complexity).<\/li>\n<li>Click <strong>Create<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> an Autonomous Database instance is provisioned.<\/p>\n\n\n\n<p><strong>Verification:<\/strong> status becomes <strong>Available<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create an Object Storage auth token and database credential<\/h3>\n\n\n\n<p>Autonomous Database uses <code>DBMS_CLOUD<\/code> to access Object Storage. 
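<\/p>\n\n\n\n<p>Once a credential exists (created in Step 4.3 below), <code>DBMS_CLOUD<\/code> calls take that credential name plus an Object Storage URI. A useful sanity check, sketched here with placeholder values, is to list the lab bucket from inside the database with <code>DBMS_CLOUD.LIST_OBJECTS<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-sql\">-- Run after Step 4.3. Replace &lt;region&gt; and &lt;namespace&gt; with your tenancy's values.\nSELECT object_name, bytes\nFROM   DBMS_CLOUD.LIST_OBJECTS(\n         'OBJ_STORE_CRED',\n         'https:\/\/objectstorage.&lt;region&gt;.oraclecloud.com\/n\/&lt;namespace&gt;\/b\/bdd-lab-raw\/o\/'\n       );\n<\/code><\/pre>\n\n\n\n<p>If this returns <code>sales_raw.csv<\/code>, your credential and URL are correct, which rules out the most common <code>COPY_DATA<\/code> failures before Step 5.<\/p>\n\n\n\n<p>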
The common approach is:\n&#8211; Create an OCI user <strong>Auth Token<\/strong>\n&#8211; Store it as a database credential<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4.1 Create an Auth Token (OCI user)<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In OCI Console: <strong>Identity &amp; Security<\/strong> \u2192 <strong>Users<\/strong> \u2192 your user.<\/li>\n<li>Open <strong>Auth Tokens<\/strong>.<\/li>\n<li>Click <strong>Generate Token<\/strong>.<\/li>\n<li>Description: <code>bdd-lab-dbms-cloud<\/code><\/li>\n<li>Copy the token value (you won\u2019t see it again).<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> you have an auth token.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">4.2 Create a database user (optional but recommended)<\/h4>\n\n\n\n<p>In Autonomous Database, open <strong>Database Actions<\/strong> (or your SQL tool) and run as <code>ADMIN<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-sql\">CREATE USER bdd_lab IDENTIFIED BY \"UseAStrongPassword#1\";\nGRANT CONNECT, RESOURCE TO bdd_lab;\n-- Without a tablespace quota, loads into the user's tables fail with ORA-01950:\nALTER USER bdd_lab QUOTA UNLIMITED ON DATA;\n-- For DBMS_CLOUD usage:\nGRANT EXECUTE ON DBMS_CLOUD TO bdd_lab;\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> user <code>bdd_lab<\/code> exists.<\/p>\n\n\n\n<p><strong>Verification:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT username FROM all_users WHERE username = 'BDD_LAB';\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">4.3 Create a DBMS_CLOUD credential<\/h4>\n\n\n\n<p>Connect as <code>bdd_lab<\/code> and run:<\/p>\n\n\n\n<pre><code class=\"language-sql\">BEGIN\n  DBMS_CLOUD.CREATE_CREDENTIAL(\n    credential_name =&gt; 'OBJ_STORE_CRED',\n    username        =&gt; '&lt;your_oci_username&gt;',\n    password        =&gt; '&lt;your_auth_token&gt;'\n  );\nEND;\n\/\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> credential is created.<\/p>\n\n\n\n<p><strong>Verification:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT credential_name FROM user_credentials WHERE credential_name = 
'OBJ_STORE_CRED';\n<\/code><\/pre>\n\n\n\n<blockquote>\n<p>If you can\u2019t use an auth token due to org policy, use an approved pattern (for example, resource principals in some OCI services). Follow your security team\u2019s guidance.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Load the CSV from Object Storage into a raw table<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">5.1 Create a raw staging table<\/h4>\n\n\n\n<pre><code class=\"language-sql\">CREATE TABLE sales_raw (\n  order_id     NUMBER,\n  order_ts     VARCHAR2(30),\n  customer_id  VARCHAR2(20),\n  region       VARCHAR2(50),\n  product      VARCHAR2(100),\n  qty          NUMBER,\n  unit_price   NUMBER(10,2),\n  status       VARCHAR2(20)\n);\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">5.2 Identify the Object Storage URL<\/h4>\n\n\n\n<p>In the bucket object details, find the <strong>Object URL<\/strong>. OCI also provides a \u201cURI\u201d format you can use.<\/p>\n\n\n\n<p>A common pattern is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Object Storage endpoint:\n  <code>https:\/\/objectstorage.&lt;region&gt;.oraclecloud.com<\/code><\/li>\n<li>Namespace + bucket + object:\n  <code>\/n\/&lt;namespace&gt;\/b\/&lt;bucket&gt;\/o\/&lt;object&gt;<\/code><\/li>\n<\/ul>\n\n\n\n<p>So the full URL looks like:<\/p>\n\n\n\n<p><code>https:\/\/objectstorage.&lt;region&gt;.oraclecloud.com\/n\/&lt;namespace&gt;\/b\/bdd-lab-raw\/o\/sales_raw.csv<\/code><\/p>\n\n\n\n<blockquote>\n<p>Use the exact URL from your console to avoid mistakes.<\/p>\n<\/blockquote>\n\n\n\n<h4 class=\"wp-block-heading\">5.3 Load using DBMS_CLOUD<\/h4>\n\n\n\n<pre><code class=\"language-sql\">BEGIN\n  DBMS_CLOUD.COPY_DATA(\n    table_name      =&gt; 'SALES_RAW',\n    credential_name =&gt; 'OBJ_STORE_CRED',\n    file_uri_list   =&gt; 'https:\/\/objectstorage.&lt;region&gt;.oraclecloud.com\/n\/&lt;namespace&gt;\/b\/bdd-lab-raw\/o\/sales_raw.csv',\n    format          =&gt; 
JSON_OBJECT(\n      'type' VALUE 'csv',\n      'skipheaders' VALUE '1',\n      'delimiter' VALUE ',',\n      'ignoremissingcolumns' VALUE 'true'\n    )\n  );\nEND;\n\/\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> rows are loaded into <code>SALES_RAW<\/code>.<\/p>\n\n\n\n<p><strong>Verification:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT COUNT(*) AS row_count FROM sales_raw;\n\nSELECT * FROM sales_raw FETCH FIRST 5 ROWS ONLY;\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Profile the data (Big Data Discovery-style checks)<\/h3>\n\n\n\n<p>Run quick profiling queries similar to what Big Data Discovery would surface:<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">6.1 Null checks<\/h4>\n\n\n\n<pre><code class=\"language-sql\">SELECT\n  SUM(CASE WHEN customer_id IS NULL OR TRIM(customer_id) IS NULL THEN 1 ELSE 0 END) AS null_customer_id,\n  COUNT(*) AS total_rows\nFROM sales_raw;\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">6.2 Duplicate detection<\/h4>\n\n\n\n<pre><code class=\"language-sql\">SELECT order_id, COUNT(*) AS cnt\nFROM sales_raw\nGROUP BY order_id\nHAVING COUNT(*) &gt; 1\nORDER BY cnt DESC;\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">6.3 Distribution by region\/status<\/h4>\n\n\n\n<pre><code class=\"language-sql\">SELECT region, status, COUNT(*) AS cnt\nFROM sales_raw\nGROUP BY region, status\nORDER BY cnt DESC;\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> you identify:\n&#8211; Missing <code>customer_id<\/code> rows\n&#8211; Duplicate <code>order_id<\/code> rows\n&#8211; Basic frequency breakdowns<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Transform into a curated table (clean + dedupe + typed timestamp)<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">7.1 Create a curated table<\/h4>\n\n\n\n<p>This example:\n&#8211; Parses ISO timestamps\n&#8211; Deduplicates by keeping the first 
row per <code>order_id<\/code> (simple rule)\n&#8211; Filters out rows missing <code>customer_id<\/code> (business rule example)<\/p>\n\n\n\n<pre><code class=\"language-sql\">CREATE TABLE sales_curated AS\nWITH typed AS (\n  SELECT\n    order_id,\n    TO_TIMESTAMP_TZ(order_ts, 'YYYY-MM-DD\"T\"HH24:MI:SS\"Z\"') AS order_ts_tz,\n    TRIM(customer_id) AS customer_id,\n    LOWER(TRIM(region)) AS region,\n    TRIM(product) AS product,\n    qty,\n    unit_price,\n    UPPER(TRIM(status)) AS status\n  FROM sales_raw\n),\ndeduped AS (\n  SELECT t.*,\n         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY order_ts_tz) AS rn\n  FROM typed t\n)\nSELECT\n  order_id,\n  order_ts_tz,\n  customer_id,\n  region,\n  product,\n  qty,\n  unit_price,\n  status,\n  (qty * unit_price) AS line_amount\nFROM deduped\nWHERE rn = 1\n  AND customer_id IS NOT NULL;\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> <code>SALES_CURATED<\/code> exists and is cleaner.<\/p>\n\n\n\n<p><strong>Verification:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT COUNT(*) AS curated_rows FROM sales_curated;\n\nSELECT * FROM sales_curated ORDER BY order_id;\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: \u201cPublish\u201d a consumption-friendly view<\/h3>\n\n\n\n<p>In Big Data Discovery-style workflows, publishing often means creating a stable dataset interface for BI\/analytics consumers.<\/p>\n\n\n\n<pre><code class=\"language-sql\">CREATE OR REPLACE VIEW sales_summary_v AS\nSELECT\n  region,\n  status,\n  COUNT(*) AS orders,\n  SUM(line_amount) AS revenue\nFROM sales_curated\nGROUP BY region, status;\n<\/code><\/pre>\n\n\n\n<p><strong>Verification:<\/strong><\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT * FROM sales_summary_v ORDER BY revenue DESC;\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> a stable view for dashboards and reports.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 
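class=\"wp-block-heading\">Optional: Export the curated dataset back to Object Storage<\/h3>\n\n\n\n<p>In lake-style architectures, a published dataset is often also written back to a \u201ccurated zone\u201d in Object Storage so other engines can consume it. A hedged sketch using <code>DBMS_CLOUD.EXPORT_DATA<\/code> (for simplicity it reuses the lab bucket under a <code>curated\/<\/code> prefix; supported export formats vary by database version, so verify in your <code>DBMS_CLOUD<\/code> documentation):<\/p>\n\n\n\n<pre><code class=\"language-sql\">-- Optional: write the curated result back to the lake.\n-- Replace &lt;region&gt; and &lt;namespace&gt; with your tenancy's values.\nBEGIN\n  DBMS_CLOUD.EXPORT_DATA(\n    credential_name =&gt; 'OBJ_STORE_CRED',\n    file_uri_list   =&gt; 'https:\/\/objectstorage.&lt;region&gt;.oraclecloud.com\/n\/&lt;namespace&gt;\/b\/bdd-lab-raw\/o\/curated\/sales_curated.csv',\n    query           =&gt; 'SELECT * FROM sales_curated',\n    format          =&gt; JSON_OBJECT('type' VALUE 'csv')\n  );\nEND;\n\/\n<\/code><\/pre>\n\n\n\n<p>This step is optional for the lab; it simply closes the loop of the raw \u2192 curated zone pattern described in the Best Practices section.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 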
class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>You should be able to confirm all of the following:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p>Object exists in Object Storage:\n   &#8211; Bucket <code>bdd-lab-raw<\/code> contains <code>sales_raw.csv<\/code><\/p>\n<\/li>\n<li>\n<p>Data loaded:\n   &#8211; <code>SELECT COUNT(*) FROM sales_raw;<\/code> returns expected row count (8 in the sample)<\/p>\n<\/li>\n<li>\n<p>Curated table created:\n   &#8211; <code>SELECT COUNT(*) FROM sales_curated;<\/code> returns fewer rows than raw (because of null removal + dedupe)<\/p>\n<\/li>\n<li>\n<p>Published summary view works:\n   &#8211; <code>SELECT * FROM sales_summary_v;<\/code> returns region\/status rollups<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Error: <code>ORA-20000: ... Unauthorized<\/code> or <code>401<\/code> when running <code>DBMS_CLOUD.COPY_DATA<\/code><\/h4>\n\n\n\n<p>Likely causes:\n&#8211; Wrong auth token (token not copied correctly)\n&#8211; Wrong username (OCI username mismatch)\n&#8211; Wrong Object Storage URL\/namespace\/region\n&#8211; IAM policy does not allow Object Storage access<\/p>\n\n\n\n<p>Fixes:\n&#8211; Regenerate the auth token and recreate the credential.\n&#8211; Copy the Object URL directly from the console.\n&#8211; Confirm your user\/group has permissions for Object Storage in that compartment.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Error: <code>ORA-00942: table or view does not exist<\/code> when selecting <code>user_credentials<\/code><\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <code>USER_CREDENTIALS<\/code> view (as shown) and confirm your privileges.<\/li>\n<li>Make sure you created the credential in the same schema you\u2019re querying from.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Error: timestamp parsing fails<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Confirm the 
timestamp format in your CSV.<\/li>\n<li>Adjust <code>TO_TIMESTAMP_TZ<\/code> format mask accordingly.<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Data looks duplicated or inconsistent<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Review the dedupe rule (<code>ROW_NUMBER()<\/code> over <code>order_id<\/code>).<\/li>\n<li>In real datasets, you may need a more robust business key and ordering rule.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid ongoing costs, remove resources when done:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Drop database objects<\/strong><\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-sql\">DROP VIEW sales_summary_v;\nDROP TABLE sales_curated PURGE;\nDROP TABLE sales_raw PURGE;\n\nBEGIN\n  DBMS_CLOUD.DROP_CREDENTIAL('OBJ_STORE_CRED');\nEND;\n\/\n<\/code><\/pre>\n\n\n\n<ol class=\"wp-block-list\" start=\"2\">\n<li>\n<p><strong>Delete Autonomous Database<\/strong>\n&#8211; OCI Console \u2192 Autonomous Database \u2192 <code>bdd_lab_adw<\/code> \u2192 <strong>Terminate<\/strong><\/p>\n<\/li>\n<li>\n<p><strong>Delete Object Storage object and bucket<\/strong>\n&#8211; Buckets \u2192 <code>bdd-lab-raw<\/code> \u2192 delete <code>sales_raw.csv<\/code>\n&#8211; Delete bucket <code>bdd-lab-raw<\/code><\/p>\n<\/li>\n<li>\n<p><strong>Remove IAM auth token<\/strong>\n&#8211; Identity \u2192 Users \u2192 your user \u2192 Auth Tokens \u2192 delete <code>bdd-lab-dbms-cloud<\/code><\/p>\n<\/li>\n<li>\n<p>Optionally delete the compartment <code>bdd-lab<\/code> (only after it\u2019s empty).<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. 
Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Treat discovery\/prep as a <strong>repeatable pipeline<\/strong>, not a one-time activity.<\/li>\n<li>Separate zones:\n<ul class=\"wp-block-list\">\n<li><strong>Raw zone<\/strong> (immutable, append-only) in Object Storage<\/li>\n<li><strong>Curated zone<\/strong> (cleaned, standardized)<\/li>\n<li><strong>Serving zone<\/strong> (database views\/tables for BI)<\/li>\n<\/ul>\n<\/li>\n<li>Prefer <strong>open formats<\/strong> (CSV for ingestion, Parquet for curated at scale) when building lake patterns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enforce <strong>least privilege<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Bucket read-only for consumers<\/li>\n<li>Write access only for pipeline identities<\/li>\n<\/ul>\n<\/li>\n<li>Use compartments per environment (dev\/test\/prod).<\/li>\n<li>Prefer private access paths:\n<ul class=\"wp-block-list\">\n<li>Autonomous Database private endpoint<\/li>\n<li>Object Storage via Service Gateway in a VCN (where feasible)<\/li>\n<\/ul>\n<\/li>\n<li>Rotate credentials (auth tokens) and avoid embedding secrets in scripts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Always Free resources for labs where possible.<\/li>\n<li>Implement Object Storage lifecycle rules (delete\/archive old staging outputs).<\/li>\n<li>Avoid duplicating large datasets into databases unless necessary.<\/li>\n<li>Monitor egress and cross-region data movement.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For large data:\n<ul class=\"wp-block-list\">\n<li>Do transforms in scalable engines (Spark\/SQL) rather than interactive tools<\/li>\n<li>Partition and compress curated datasets<\/li>\n<\/ul>\n<\/li>\n<li>In databases:\n<ul class=\"wp-block-list\">\n<li>Use appropriate indexing\/materialized views for BI query patterns<\/li>\n<li>Avoid SELECT * in production semantic layers<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Make transformations idempotent.<\/li>\n<li>Version curated datasets and keep lineage of rules.<\/li>\n<li>Automate loads and validation checks.<\/li>\n<li>Use backups and retention (database and object storage) aligned to RPO\/RTO.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Centralize logs and metrics:\n<ul class=\"wp-block-list\">\n<li>OCI Logging for OCI services<\/li>\n<li>Database auditing for data access<\/li>\n<\/ul>\n<\/li>\n<li>Use tagging: <code>CostCenter<\/code>, <code>Environment<\/code>, <code>Owner<\/code>, <code>DataDomain<\/code><\/li>\n<li>Define SLOs for data freshness and pipeline success rate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standard naming: <code>bdd-&lt;env&gt;-raw<\/code>, <code>bdd-&lt;env&gt;-curated<\/code><\/li>\n<li>Data cataloging: document dataset purpose, owners, and sensitivity classification.<\/li>\n<li>Apply consistent tags to buckets, DBs, and networking resources.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In OCI, access to Object Storage and databases is governed by <strong>IAM policies<\/strong>.<\/li>\n<li>For discovery workflows:\n<ul class=\"wp-block-list\">\n<li>Create distinct identities for <strong>pipelines<\/strong> vs <strong>human users<\/strong>.<\/li>\n<li>Limit who can access raw sensitive data.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OCI Object Storage encrypts data at rest by default (service-managed keys are typical).<\/li>\n<li>Autonomous Database encrypts storage at rest by default.<\/li>\n<li>For stricter requirements, use <strong>customer-managed keys<\/strong> via OCI Vault (verify service support and configuration requirements).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid public Autonomous Database access for production.<\/li>\n<li>Restrict access with IP allowlists if public access is required.<\/li>\n<li>Prefer private endpoints and controlled ingress via bastion hosts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do not store auth tokens in plaintext files or source control.<\/li>\n<li>Prefer OCI Vault for secret storage and rotation where applicable.<\/li>\n<li>In this tutorial lab, you used an auth token; in production, design a safer credential strategy.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use OCI Audit to track control-plane actions (bucket creation, DB changes).<\/li>\n<li>Use database auditing for data access and schema changes.<\/li>\n<li>Store logs in a central, tamper-resistant logging account if required.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Classify data: PII\/PHI\/PCI.<\/li>\n<li>Apply data minimization: do not copy sensitive raw data to too many places.<\/li>\n<li>Enforce retention and deletion policies.<\/li>\n<li>Validate region\/legal constraints for data residency.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Granting broad \u201cmanage all-resources\u201d policies to analysts.<\/li>\n<li>Leaving public buckets or permissive pre-authenticated requests.<\/li>\n<li>Using shared credentials across teams.<\/li>\n<li>No audit trail for who accessed sensitive columns.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Compartmentalize by environment and sensitivity.<\/li>\n<li>Use private networking for databases and processing.<\/li>\n<li>Implement a data access review process.<\/li>\n<li>Standardize dataset publishing with documented schemas and ownership.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Big Data Discovery product lifecycle and availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Big Data Discovery may be <strong>legacy or not available<\/strong> as an OCI native managed service in many tenancies.<\/li>\n<li>Documentation may exist while new deployments are limited.<\/li>\n<li><strong>Verify in official Oracle docs and with Oracle<\/strong> before committing to it long-term.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas and limits (OCI lab)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Autonomous Database Always Free has compute\/storage limits.<\/li>\n<li>Object Storage has service limits (objects, requests) at tenancy level.<\/li>\n<li>Large CSV loads can be slower than Parquet-based pipelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Regional constraints<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Always Free availability differs by region.<\/li>\n<li>Some services (like Oracle Analytics Cloud) may not be available in all regions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing surprises<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data egress costs can surprise teams if they export large datasets to the internet.<\/li>\n<li>Storing many curated copies can multiply storage cost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compatibility issues<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>CSV parsing is error-prone: delimiter issues, quoting, encoding.<\/li>\n<li>Timestamp formats often break ingestion unless standardized.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational gotchas<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ad-hoc \u201cdiscovery transforms\u201d can become production dependencies. Treat them as code where possible.<\/li>\n<li>Without governance, multiple \u201ccurated\u201d datasets may diverge and confuse consumers.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Migration challenges<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If migrating from a Hadoop-centric Big Data Discovery environment:\n<ul class=\"wp-block-list\">\n<li>Mapping transformations to Spark\/SQL pipelines can require rework.<\/li>\n<li>Access patterns may change (index-based exploration vs SQL queries).<\/li>\n<li>Plan for data format conversion and partitioning.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Vendor-specific nuances<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Oracle\u2019s big data and analytics tooling portfolio evolves. What replaces Big Data Discovery depends on your target architecture (data lake, warehouse, or lakehouse pattern).<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Nearest options in Oracle Cloud (OCI)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Oracle Analytics Cloud (OAC):<\/strong> BI + visualization + data prep features (licensed service).<\/li>\n<li><strong>OCI Data Integration:<\/strong> managed ETL\/ELT orchestration for moving and transforming data (focus on pipelines rather than interactive discovery).<\/li>\n<li><strong>OCI Data Flow (Apache Spark):<\/strong> scalable processing; requires engineering patterns, not a discovery UI.<\/li>\n<li><strong>OCI Big Data Service:<\/strong> managed Hadoop ecosystem for customers who need it.<\/li>\n<li><strong>Autonomous Database + APEX\/Database Actions:<\/strong> fast SQL-based profiling and lightweight exploration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Nearest options in other clouds<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>AWS: Glue + Athena + Lake Formation + QuickSight<\/li>\n<li>Azure: Data Factory + Synapse + Purview + Power 
BI<\/li>\n<li>GCP: Dataproc + BigQuery + Dataplex + Looker<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Open-source\/self-managed alternatives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apache Superset (BI exploration)<\/li>\n<li>Trino\/Presto + a metastore + a BI tool<\/li>\n<li>Jupyter notebooks + Spark<\/li>\n<li>OpenSearch\/Kibana for log\/event exploration<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Comparison table<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Big Data Discovery (Oracle)<\/td>\n<td>Existing Oracle big data deployments needing interactive discovery<\/td>\n<td>Integrated discovery + prep experience for big data (deployment dependent)<\/td>\n<td>Availability\/lifecycle uncertainty for new OCI projects; may be legacy<\/td>\n<td>You already have it and it meets requirements; short-to-mid term use while planning roadmap<\/td>\n<\/tr>\n<tr>\n<td>Oracle Analytics Cloud<\/td>\n<td>BI dashboards + governed analytics<\/td>\n<td>Strong visualization, semantic modeling, enterprise BI capabilities<\/td>\n<td>Licensed cost; may require curated datasets<\/td>\n<td>You need enterprise BI and a supported OCI analytics roadmap<\/td>\n<\/tr>\n<tr>\n<td>OCI Data Integration<\/td>\n<td>ETL\/ELT orchestration<\/td>\n<td>Managed pipelines, scheduling, connectors<\/td>\n<td>Not primarily interactive discovery<\/td>\n<td>You need repeatable data movement and transformations<\/td>\n<\/tr>\n<tr>\n<td>OCI Data Flow (Spark)<\/td>\n<td>Large-scale processing<\/td>\n<td>Scales Spark without managing clusters<\/td>\n<td>Engineering-heavy; no \u201cdiscovery UI\u201d<\/td>\n<td>You need big transformations at scale<\/td>\n<\/tr>\n<tr>\n<td>Autonomous Database + SQL tools<\/td>\n<td>SQL-centric profiling + curated serving<\/td>\n<td>Fast iteration with SQL, strong governance 
controls<\/td>\n<td>Not a specialized discovery UI<\/td>\n<td>You want a cost-effective, governable curated layer on OCI<\/td>\n<\/tr>\n<tr>\n<td>AWS Glue + Athena + QuickSight<\/td>\n<td>AWS-native lake analytics<\/td>\n<td>Strong managed lake query and BI ecosystem<\/td>\n<td>AWS lock-in; cost can grow with query volume<\/td>\n<td>Your platform is AWS and you want managed lake analytics<\/td>\n<\/tr>\n<tr>\n<td>Open-source (Trino\/Superset)<\/td>\n<td>Custom lakehouse stacks<\/td>\n<td>Flexibility, open formats<\/td>\n<td>Ops burden, security\/integration work<\/td>\n<td>You have strong platform engineering and want maximum portability<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: telecom data quality and churn analytics<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A telecom company ingests billions of network events and customer interactions. 
Analysts struggle to explore raw data and detect schema drift and quality issues before reporting.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Raw events land in OCI Object Storage (raw zone).<\/li>\n<li>Transformations run via Spark (OCI Data Flow) to generate curated customer-event aggregates.<\/li>\n<li>Curated datasets stored in Autonomous Data Warehouse for high-concurrency BI.<\/li>\n<li>BI dashboards in Oracle Analytics Cloud.<\/li>\n<li>Governance with IAM policies, auditing, and data classification tags.<\/li>\n<li><strong>Why Big Data Discovery was chosen (or evaluated):<\/strong><\/li>\n<li>Historically, it offered a self-service discovery layer on big-data stores for analysts.<\/li>\n<li>In a modernization effort, the company maps those workflows to OCI-native services for long-term supportability.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Faster detection of bad data (null spikes, outliers).<\/li>\n<li>Reduced time to publish curated datasets for churn models.<\/li>\n<li>Stronger access controls and auditability.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: e-commerce order analytics<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A small team needs quick insight into order data quality and basic revenue reporting without hiring a full data platform team.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Store raw CSV exports in OCI Object Storage.<\/li>\n<li>Load into Autonomous Database Always Free (for small volumes).<\/li>\n<li>Use SQL views as the published semantic layer.<\/li>\n<li>Lightweight exploration via Database Actions\/APEX; later add Oracle Analytics Cloud if needed.<\/li>\n<li><strong>Why Big Data Discovery-style approach:<\/strong><\/li>\n<li>The team needs the <em>workflow outcome<\/em> (discover \u2192 clean \u2192 publish) more than the specific legacy product.<\/li>\n<li><strong>Expected 
outcomes:<\/strong><\/li>\n<li>Clean, deduplicated order dataset.<\/li>\n<li>Simple dashboards and reports with minimal operational overhead.<\/li>\n<li>Low costs and easy scaling path.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>Is Big Data Discovery an OCI native managed service?<\/strong><br\/>\nIn many OCI tenancies, Big Data Discovery does not appear as a standard OCI managed service. It has historically been delivered as part of Oracle\u2019s broader big data product stack. <strong>Verify current availability in official Oracle documentation and your OCI tenancy\/service catalog.<\/strong><\/p>\n\n\n\n<p>2) <strong>What is Big Data Discovery primarily used for?<\/strong><br\/>\nInteractive data discovery, profiling, preparation, and publishing curated datasets\u2014often for big-data sources like Hadoop ecosystems.<\/p>\n\n\n\n<p>3) <strong>Is Big Data Discovery the same as Oracle Analytics Cloud?<\/strong><br\/>\nNo. Oracle Analytics Cloud is an enterprise BI\/analytics service. Big Data Discovery focuses more on discovery and preparation over big data sources (though there is conceptual overlap).<\/p>\n\n\n\n<p>4) <strong>What replaced Big Data Discovery on OCI?<\/strong><br\/>\nThere isn\u2019t always a single 1:1 replacement. Many teams implement the workflow using Object Storage + Data Flow + Autonomous Database + Oracle Analytics Cloud. Choose based on your needs and Oracle\u2019s current roadmap.<\/p>\n\n\n\n<p>5) <strong>Can I still follow this tutorial if I don\u2019t have Big Data Discovery?<\/strong><br\/>\nYes. The hands-on lab is designed to be executable using common OCI services and recreates Big Data Discovery-style outcomes.<\/p>\n\n\n\n<p>6) <strong>Does the lab require Oracle Analytics Cloud?<\/strong><br\/>\nNo. The lab uses Autonomous Database SQL and views for exploration. 
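As an illustration of that SQL-and-views style, the sketch below profiles a hypothetical staging table in one scan and then publishes a curated view; ORDERS_RAW and all column names are invented for this example and are not part of the lab's required schema:

```sql
-- Profile a hypothetical staging table: row count, null rate for a
-- key column, distinct keys, and value range in a single pass.
SELECT COUNT(*)                                   AS total_rows,
       ROUND(100 * SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END)
             / NULLIF(COUNT(*), 0), 2)            AS pct_null_customer,
       COUNT(DISTINCT order_id)                   AS distinct_orders,
       MIN(order_total)                           AS min_total,
       MAX(order_total)                           AS max_total
FROM   orders_raw;

-- Publish a curated view so downstream tools never read raw data directly.
CREATE OR REPLACE VIEW orders_curated AS
SELECT order_id,
       customer_id,
       TRUNC(order_ts) AS order_day,
       order_total
FROM   orders_raw
WHERE  customer_id IS NOT NULL;
```

Views like ORDERS_CURATED act as the published "semantic layer" described elsewhere in this guide: consumers query the view, and the cleaning rules stay in one governed place.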
You can optionally connect OAC or another BI tool later.<\/p>\n\n\n\n<p>7) <strong>What dataset size is appropriate for the lab?<\/strong><br\/>\nStart small (MBs to a few GB). Always Free Autonomous Database is limited, and CSV loads are not optimized for huge datasets.<\/p>\n\n\n\n<p>8) <strong>What\u2019s the best storage format for curated big data on OCI?<\/strong><br\/>\nFor large-scale lake analytics, columnar formats like <strong>Parquet<\/strong> are common. For simple ingestion, CSV is fine but less efficient.<\/p>\n\n\n\n<p>9) <strong>How do I control who can access raw vs curated data?<\/strong><br\/>\nUse IAM policies and separate buckets\/compartments. In the database, use schemas\/roles and views to restrict columns and rows.<\/p>\n\n\n\n<p>10) <strong>How do I avoid copying sensitive data into too many places?<\/strong><br\/>\nApply data minimization: keep raw data in one controlled zone, publish only necessary curated datasets, and enforce retention policies.<\/p>\n\n\n\n<p>11) <strong>What\u2019s the biggest operational risk with discovery tools?<\/strong><br\/>\nUn-governed transformations can become \u201cshadow production\u201d logic. Treat curated outputs as products: versioning, testing, and ownership.<\/p>\n\n\n\n<p>12) <strong>How do I monitor this pipeline?<\/strong><br\/>\nUse OCI Audit for control-plane actions, database auditing for data access, and (if you add Spark\/ETL) use the service\u2019s logging + OCI Logging.<\/p>\n\n\n\n<p>13) <strong>Can Autonomous Database load directly from Object Storage securely?<\/strong><br\/>\nYes, using <code>DBMS_CLOUD<\/code> with credentials. 
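A minimal sketch of that load path, assuming the target table already exists; the credential name, user, auth-token placeholder, region, namespace, bucket, and object below are all example values to replace with your own:

```sql
-- One-time setup: store an OCI auth token as a database credential.
-- (All names and the token value are placeholders.)
BEGIN
  DBMS_CLOUD.CREATE_CREDENTIAL(
    credential_name => 'OBJ_STORE_CRED',
    username        => 'cloud_user@example.com',
    password        => 'paste-auth-token-here'
  );
END;
/

-- Copy a CSV object from Object Storage into an existing table.
-- URL pattern: https://objectstorage.{region}.oraclecloud.com/n/{namespace}/b/{bucket}/o/{object}
BEGIN
  DBMS_CLOUD.COPY_DATA(
    table_name      => 'ORDERS_RAW',
    credential_name => 'OBJ_STORE_CRED',
    file_uri_list   => 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/mynamespace/b/raw-zone/o/orders.csv',
    format          => JSON_OBJECT('type' VALUE 'csv', 'skipheaders' VALUE '1')
  );
END;
/
```

If a load fails, Autonomous Database records diagnostic details you can inspect (for example, via the USER_LOAD_OPERATIONS view), which helps with the URL/namespace and credential errors discussed in the FAQ.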
For more secure designs, evaluate private networking and approved credential storage patterns.<\/p>\n\n\n\n<p>14) <strong>What\u2019s the most common ingestion error?<\/strong><br\/>\nIncorrect Object Storage URL\/namespace or invalid credentials, leading to authorization failures.<\/p>\n\n\n\n<p>15) <strong>Should I build a new long-term platform around Big Data Discovery today?<\/strong><br\/>\nOnly after confirming lifecycle status and supportability for your organization. For many OCI-first projects, OCI-native services provide a clearer forward path.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. Top Online Resources to Learn Big Data Discovery<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation (search)<\/td>\n<td>Oracle Help Center search for \u201cBig Data Discovery\u201d: https:\/\/docs.oracle.com\/en\/search\/?q=Big%20Data%20Discovery<\/td>\n<td>Safest starting point to find the correct versioned docs and guides<\/td>\n<\/tr>\n<tr>\n<td>Official documentation (platform context)<\/td>\n<td>Oracle Big Data Appliance documentation (Oracle Help Center): https:\/\/docs.oracle.com\/en\/<\/td>\n<td>Big Data Discovery is often discussed in the context of Oracle\u2019s big data platform; use this to locate install\/admin context<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>OCI Pricing: https:\/\/www.oracle.com\/cloud\/pricing\/<\/td>\n<td>For pricing of OCI services used as alternatives (Object Storage, Data Flow, etc.)<\/td>\n<\/tr>\n<tr>\n<td>Official cost estimator<\/td>\n<td>OCI Cost Estimator: https:\/\/www.oracle.com\/cloud\/costestimator.html<\/td>\n<td>Build region-specific estimates without guessing<\/td>\n<\/tr>\n<tr>\n<td>Official docs (Object Storage)<\/td>\n<td>OCI Object Storage docs: 
https:\/\/docs.oracle.com\/en-us\/iaas\/Content\/Object\/home.htm<\/td>\n<td>Required for secure bucket design, URLs, lifecycle policies<\/td>\n<\/tr>\n<tr>\n<td>Official docs (Autonomous Database)<\/td>\n<td>Autonomous Database docs: https:\/\/docs.oracle.com\/en\/cloud\/paas\/autonomous-database\/<\/td>\n<td>Covers provisioning, security, Database Actions, connectivity<\/td>\n<\/tr>\n<tr>\n<td>Official docs (DBMS_CLOUD)<\/td>\n<td>DBMS_CLOUD documentation (in Autonomous DB docs): https:\/\/docs.oracle.com\/en\/cloud\/paas\/autonomous-database\/adbsa\/<\/td>\n<td>Authoritative reference for loading from Object Storage<\/td>\n<\/tr>\n<tr>\n<td>Official docs (IAM)<\/td>\n<td>OCI IAM docs: https:\/\/docs.oracle.com\/en-us\/iaas\/Content\/Identity\/home.htm<\/td>\n<td>Correct policy patterns and least privilege guidance<\/td>\n<\/tr>\n<tr>\n<td>Official videos<\/td>\n<td>Oracle Cloud YouTube channel: https:\/\/www.youtube.com\/@OracleCloud<\/td>\n<td>Often includes practical demos for OCI data services (verify availability for specific topics)<\/td>\n<\/tr>\n<tr>\n<td>Community learning<\/td>\n<td>Oracle Cloud Customer Connect: https:\/\/community.oracle.com\/customerconnect\/<\/td>\n<td>Practical Q&amp;A with Oracle community and product teams (verify advice against docs)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. 
Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>Engineers, DevOps, architects<\/td>\n<td>Cloud\/DevOps fundamentals, automation, CI\/CD; check for OCI data tracks<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>DevOps learners, build\/release teams<\/td>\n<td>SCM, DevOps, tooling foundations; may complement cloud labs<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud ops teams, SREs<\/td>\n<td>Cloud operations practices, monitoring, reliability<\/td>\n<td>Check website<\/td>\n<td>https:\/\/cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, platform engineers<\/td>\n<td>SRE practices, observability, incident management<\/td>\n<td>Check website<\/td>\n<td>https:\/\/sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops + data\/AI practitioners<\/td>\n<td>AIOps concepts, automation, operational analytics<\/td>\n<td>Check website<\/td>\n<td>https:\/\/aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content (verify specific OCI coverage)<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training and mentoring<\/td>\n<td>DevOps practitioners<\/td>\n<td>https:\/\/devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps\/services platform<\/td>\n<td>Teams seeking hands-on help or coaching<\/td>\n<td>https:\/\/devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support\/training resources<\/td>\n<td>Ops teams needing implementation support<\/td>\n<td>https:\/\/devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify offerings)<\/td>\n<td>Architecture reviews, implementation help, operations<\/td>\n<td>Designing OCI landing zones; setting up CI\/CD for data pipelines; cost optimization reviews<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>Training + consulting services (verify scope)<\/td>\n<td>Upskilling teams; implementing DevOps practices around cloud workloads<\/td>\n<td>Building automation for OCI deployments; operational best practices for data platforms<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify offerings)<\/td>\n<td>DevOps transformations, tooling, operational maturity<\/td>\n<td>Implementing observability; infrastructure automation; governance processes<\/td>\n<td>https:\/\/devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Big Data Discovery-style work<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data fundamentals: CSV\/JSON\/Parquet, schemas, partitioning<\/li>\n<li>SQL fundamentals: joins, aggregates, window functions<\/li>\n<li>Cloud basics: IAM, networking, compartments\/projects, encryption<\/li>\n<li>Object storage concepts: buckets, prefixes, lifecycle rules<\/li>\n<li>Basic data governance: data classification and least privilege<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Spark fundamentals (especially if using OCI Data Flow)<\/li>\n<li>Data modeling for analytics (star schema, slowly changing dimensions)<\/li>\n<li>CI\/CD for data pipelines (testing, versioning transformations)<\/li>\n<li>Observability for data platforms (data quality checks, pipeline SLAs)<\/li>\n<li>BI semantic modeling (Oracle Analytics Cloud or equivalent)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it (or equivalent workflows)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Engineer<\/li>\n<li>Analytics Engineer<\/li>\n<li>BI Developer \/ BI Engineer<\/li>\n<li>Cloud Data Architect<\/li>\n<li>Platform Engineer (data platform)<\/li>\n<li>Data Governance \/ Data Quality Engineer<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For Big Data Discovery specifically, certification availability may be limited depending on lifecycle.<\/li>\n<li>For OCI pathways, consider Oracle\u2019s OCI certifications (associate\/professional) relevant to:<\/li>\n<li>Cloud infrastructure<\/li>\n<li>Data management<\/li>\n<li>Analytics<br\/>\nVerify the current catalog on Oracle University: https:\/\/education.oracle.com\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Build a raw-to-curated pipeline for web logs and publish a KPI view.<\/li>\n<li>Implement data quality checks (null thresholds, uniqueness constraints) with automated alerts.<\/li>\n<li>Create a curated dataset with PII masking and column-level access controls.<\/li>\n<li>Cost-optimization exercise: lifecycle policies + partition strategies.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Big Data Discovery:<\/strong> Oracle product for interactive discovery, profiling, and preparation of large datasets (availability and packaging vary).<\/li>\n<li><strong>OCI (Oracle Cloud Infrastructure):<\/strong> Oracle\u2019s public cloud platform.<\/li>\n<li><strong>Object Storage:<\/strong> Durable blob storage for files\/objects in buckets.<\/li>\n<li><strong>Autonomous Database:<\/strong> Managed Oracle database with automated operations; includes ADW\/ATP.<\/li>\n<li><strong>ADW (Autonomous Data Warehouse):<\/strong> Autonomous Database workload optimized for analytics.<\/li>\n<li><strong>DBMS_CLOUD:<\/strong> Oracle-supplied PL\/SQL package commonly used to load data into Autonomous Database from Object Storage and other cloud locations.<\/li>\n<li><strong>Raw zone:<\/strong> Storage location for unmodified ingested data.<\/li>\n<li><strong>Curated zone:<\/strong> Storage or tables with cleaned, standardized, analysis-ready data.<\/li>\n<li><strong>Serving layer:<\/strong> Optimized data structures (tables\/views) used by BI tools and applications.<\/li>\n<li><strong>Faceted search:<\/strong> Exploration approach where users filter by attribute \u201cfacets\u201d (e.g., region, status).<\/li>\n<li><strong>Schema drift:<\/strong> Changes in incoming data schema over time (new columns, changed types).<\/li>\n<li><strong>Least privilege:<\/strong> Security principle of granting only the permissions required to perform a 
task.<\/li>\n<li><strong>Egress:<\/strong> Outbound data transfer from a cloud to the internet or another region.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Big Data Discovery is an Oracle offering aimed at <strong>interactive exploration, profiling, preparation, and publishing of large datasets<\/strong>, historically aligned with Oracle\u2019s big data platform deployments. It matters because it addresses a consistent pain point in analytics programs: turning raw, messy data into trusted, consumable datasets quickly.<\/p>\n\n\n\n<p>In Oracle Cloud (OCI), Big Data Discovery may not be present as a standard managed service in many tenancies, so treat its lifecycle and availability as something to <strong>verify in official Oracle documentation and with Oracle Sales\/Support<\/strong>. For most new OCI projects, the practical approach is to implement the same workflow using OCI-native building blocks: <strong>Object Storage for the lake, Autonomous Database for curated\/serving datasets, and (optionally) Spark-based processing and enterprise BI<\/strong>.<\/p>\n\n\n\n<p>Key cost and security points:\n&#8211; Costs come mainly from storage growth, compute for transformations, and BI licensing.\n&#8211; Secure designs rely on compartments, least privilege IAM policies, private networking for databases, encryption, and auditing.<\/p>\n\n\n\n<p>When to use it:\n&#8211; Use Big Data Discovery if you already have it and it aligns with your platform.\n&#8211; Use OCI-native services to achieve the same outcomes when building forward-looking architectures on Oracle Cloud.<\/p>\n\n\n\n<p>Next step: run the hands-on lab in this guide, then expand it by adding a scalable processing layer (OCI Data Flow) and a governed BI layer (Oracle Analytics Cloud) as your requirements grow.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Other 
Services<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[62,63],"tags":[],"class_list":["post-749","post","type-post","status-publish","format-standard","hentry","category-oracle-cloud","category-other-services"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/749","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=749"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/749\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=749"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=749"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=749"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}