Category
Data analytics and pipelines
1. Introduction
Cloud Data Fusion is Google Cloud’s managed, visual data integration service for building and running data pipelines without writing a lot of code. It is commonly used for ETL/ELT-style pipelines that move and transform data between systems such as Cloud Storage, BigQuery, relational databases (via JDBC), and streaming sources.
In simple terms: Cloud Data Fusion lets you drag, drop, configure, and run data pipelines—like “read CSV files from Cloud Storage, clean them up, and load them into BigQuery”—with built-in connectors and a graphical interface.
In technical terms: Cloud Data Fusion is a fully managed service based on the open-source CDAP (Cask Data Application Platform). You create and manage a Cloud Data Fusion instance in a Google Cloud project and region, design pipelines in the Data Fusion Studio UI, and execute them on managed compute (commonly Dataproc/Spark-based execution managed through Data Fusion “compute profiles”). The service integrates with Google Cloud IAM, Cloud Logging, and Cloud Monitoring for governance and operations.
Cloud Data Fusion solves the problem of building reliable, repeatable data pipelines across heterogeneous sources and sinks, especially when you want:
- A visual authoring experience (including interactive data preparation)
- A connector ecosystem and plugin-based architecture
- Operational features such as pipeline deployment, runtime configuration, and monitoring hooks
- A managed alternative to self-hosting data integration tools
2. What is Cloud Data Fusion?
Official purpose
Cloud Data Fusion is a managed data integration service on Google Cloud designed to help teams build, deploy, and manage data pipelines that ingest, transform, and deliver data for analytics and downstream applications.
Cloud Data Fusion is built on CDAP, which provides the underlying pipeline framework, metadata services, and extensibility model (plugins).
Core capabilities
Cloud Data Fusion typically provides:
- Visual pipeline design (Studio) for batch and (where supported) streaming-style pipelines
- Pre-built connectors (plugins) for common sources/sinks and transformation steps
- Interactive data preparation (often called “Wrangler”) for cleaning and shaping data
- Runtime execution on managed compute via compute profiles (commonly Dataproc-based execution; verify available runtime options in official docs for your edition/region)
- Operational visibility through logs/metrics integrations (Cloud Logging/Monitoring) and pipeline run history
- Extensibility via custom plugins (packaged artifacts) and reusable pipeline patterns
Major components (how to think about the service)
- Cloud Data Fusion Instance: The managed control plane you create in your project/region. The instance hosts the UI and management services.
- Data Fusion Studio: Browser-based UI to design pipelines with sources, transforms, and sinks.
- Wrangler: Interactive transformation tool for profiling and cleaning data.
- Plugins (Connectors & Transforms): Packaged steps used in pipelines (sources, sinks, joins, aggregations, lookups, etc.). Many are built-in; you can add custom plugins.
- Namespaces: Logical separation within an instance (useful for multi-team segmentation).
- Compute Profiles: Configuration for pipeline execution environments (often Dataproc-based). Compute profiles govern where/how pipelines run.
- Artifacts: Versioned plugin bundles and pipeline assets deployed into an instance.
Service type
- Managed data integration / pipeline authoring and management service.
- Not “serverless per query” like BigQuery; Cloud Data Fusion involves an instance plus pipeline runtime compute.
Scope (regional / project-scoped)
- Project-scoped: You create instances inside a specific Google Cloud project.
- Regional: Instances are created in a chosen region. Data locality matters (for performance, cost, and compliance).
- Networking mode can be public or private depending on configuration (verify current options in official docs).
Fit in the Google Cloud ecosystem (Data analytics and pipelines)
Cloud Data Fusion often sits in the middle of Google Cloud’s data analytics and pipelines ecosystem:
- Storage & landing: Cloud Storage
- Warehouse: BigQuery
- Execution engines: Dataproc (commonly), plus pushdown to BigQuery when applicable (depends on plugins/transforms)
- Orchestration: Built-in scheduling for certain workflows, and/or external orchestration with Cloud Composer (Apache Airflow) or Workflows
- Streaming ingestion: Pub/Sub (often as a source), with downstream analytics in BigQuery
- Operations: Cloud Logging, Cloud Monitoring
- Security: IAM, VPC networking (including private patterns), Cloud KMS (for platform encryption controls—verify per component)
3. Why use Cloud Data Fusion?
Business reasons
- Faster time-to-value: Visual pipelines reduce development time for common ETL/ELT tasks.
- Lower integration friction: Built-in connectors reduce the need to hand-code ingestion and parsing.
- Standardization: Centralize data pipeline patterns across teams, improving governance and reducing “one-off scripts.”
Technical reasons
- Visual + extensible: Start with built-in plugins, extend with custom plugins when needed.
- Separation of concerns: Control plane (authoring/management) is handled by Cloud Data Fusion; data processing runs on configured compute.
- Composable design: Pipelines are built from reusable steps; transformations are explicit and auditable.
Operational reasons
- Managed instance lifecycle: Easier than running your own CDAP cluster.
- Monitoring and troubleshooting: Run history and integration with Google Cloud logging/monitoring simplify operations.
- Repeatability: Deployed pipelines are runnable artifacts, not ad-hoc notebooks or manually executed scripts.
Security/compliance reasons
- IAM-based access control: Manage who can administer instances and author/run pipelines.
- Network controls: Private deployment patterns can reduce public exposure (verify current private connectivity patterns and constraints for your region/edition).
- Auditability: Admin actions and runtime logs can be captured in Cloud Logging.
Scalability/performance reasons
- Scale-out execution: Pipelines can run on distributed compute (often Spark on Dataproc) for larger transformations.
- Separation of UI/control from runtime: Helps scale workloads by scaling execution compute rather than overloading authoring nodes.
When teams should choose it
Choose Cloud Data Fusion when:
- You want visual pipeline development with a strong plugin ecosystem.
- Your team needs to integrate multiple data sources and sinks quickly.
- You want a managed service rather than self-hosting a data integration platform.
- You need to operationalize pipelines with consistent run history and centralized management.
When teams should not choose it
Avoid or reconsider Cloud Data Fusion when:
- Your pipelines are mostly SQL transformations inside BigQuery (BigQuery + Dataform or scheduled queries may be simpler).
- You want fully serverless per-job pricing without a long-running instance component (Dataflow may fit better for certain patterns).
- You primarily need CDC replication at scale (Datastream and purpose-built replication tools may be more appropriate; verify requirements).
- Your organization prohibits the networking model required for instances (private connectivity constraints can be non-trivial).
- You have very custom transformations and already have strong engineering investment in Spark/Dataflow code.
4. Where is Cloud Data Fusion used?
Industries
- Retail and e-commerce (sales, inventory, customer analytics)
- Financial services (risk analytics, reporting pipelines, regulatory data aggregation)
- Healthcare and life sciences (claims data aggregation, research datasets; ensure compliance)
- Media and gaming (event ingestion and aggregation)
- Manufacturing and IoT (plant telemetry ingestion, quality analytics)
- SaaS companies (product analytics, operational reporting)
Team types
- Data engineering teams standardizing ingestion patterns
- Analytics engineering teams preparing curated datasets for BI
- Platform teams offering “data pipeline as a service” to internal consumers
- DevOps/SRE teams supporting data pipeline operations
- Hybrid teams migrating from on-prem ETL tools to cloud-managed services
Workloads
- Batch ingestion from files (CSV/JSON/Avro/Parquet depending on plugins and design)
- Data warehouse loading into BigQuery
- Data cleaning, normalization, enrichment, joins
- PII handling workflows (tokenization/masking typically done with transforms or external services—verify your approach)
- Multi-step staging (raw → cleaned → curated)
Architectures
- Landing zone in Cloud Storage → transform → load into BigQuery
- Database exports → transform → BigQuery datasets per domain
- Pub/Sub event stream → transform → analytics store (streaming patterns depend on runtime and plugin support—verify)
Real-world deployment contexts
- Centralized shared instance with namespaces and policies for multiple teams
- Per-domain instances for isolation (cost and governance trade-off)
- Dev/test/prod separation across projects (recommended for larger orgs)
Production vs dev/test usage
- In dev/test: shorter-lived instances, smaller compute profiles, frequent iteration
- In production: strict IAM, controlled plugin promotion, predictable scheduling, quotas, monitoring, and cost guardrails
5. Top Use Cases and Scenarios
Below are realistic use cases aligned with Google Cloud “Data analytics and pipelines” needs.
1) Cloud Storage CSV → BigQuery curated table
- Problem: Analysts need curated tables from daily CSV drops.
- Why Cloud Data Fusion fits: Visual ingestion + Wrangler cleanup + BigQuery sink.
- Scenario: A vendor drops orders_YYYYMMDD.csv into a bucket daily; the pipeline cleans types and loads partitioned BigQuery tables.
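The daily-drop pattern can be sketched as a tiny wrapper that derives the input object and the target BigQuery partition from a run date. The bucket and table names here are hypothetical placeholders, not anything Data Fusion defines:

```shell
#!/usr/bin/env bash
# Sketch only: map a run date to the vendor's daily object and the
# matching BigQuery partition decorator. Names are placeholders.
RUN_DATE="${1:-$(date +%Y%m%d)}"                  # e.g. 20240105
SRC="gs://vendor-drops/input/orders_${RUN_DATE}.csv"
DEST="sales.orders\$${RUN_DATE}"                  # $-decorator = one day partition
echo "source:      ${SRC}"
echo "destination: ${DEST}"
```

In a Data Fusion pipeline the same idea is usually expressed with runtime arguments/macros in the source path and sink table rather than a shell script.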
2) Multi-source enrichment (GCS + reference table join)
- Problem: Raw events need enrichment from reference data.
- Why it fits: Join and lookup transforms in a single pipeline.
- Scenario: Web events from GCS are enriched with a BigQuery product catalog table before writing to analytics.
3) JDBC database extract → BigQuery
- Problem: Need recurring extracts from a relational database for reporting.
- Why it fits: JDBC connectors + managed scheduling patterns.
- Scenario: Nightly extract from PostgreSQL read replicas into BigQuery for executive dashboards.
4) Standardize raw → clean → curated layers
- Problem: Inconsistent transformations across teams cause conflicting metrics.
- Why it fits: Reusable pipelines/plugins, centralized governance.
- Scenario: Platform team publishes canonical pipelines for each domain’s raw-to-clean transformations.
5) Data quality checks during ingestion
- Problem: Bad records break dashboards and trust.
- Why it fits: Add validation steps and route errors to quarantine sinks.
- Scenario: Records with missing keys are written to a “rejects” table and alerted on.
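To make the quarantine idea concrete, here is a minimal local sketch of the routing logic (file names are made up). Inside Data Fusion you would model this with a validation transform and an error/rejects sink rather than awk:

```shell
# Sketch: rows with a missing customer_id go to a rejects file,
# the rest to a clean file. Purely illustrative, local-only.
cat > events.csv <<'EOF'
customer_id,amount
1,10.50
,99.99
3,7.25
EOF
awk -F',' '
  NR == 1      { next }                       # skip header
  $1 == ""     { print > "rejects.csv"; next } # quarantine bad rows
               { print > "clean.csv" }         # pass good rows through
' events.csv
```

The "rejects" table in the scenario plays the role of `rejects.csv` here, and alerting hangs off its row count.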
6) Metadata and lineage visibility (within the platform)
- Problem: Teams can’t track “where a metric came from.”
- Why it fits: Pipeline definitions provide documented flow and runtime context.
- Scenario: Data engineers trace curated tables back to the raw bucket and transformation steps for audits.
7) Migration from legacy ETL tooling to Google Cloud
- Problem: Existing ETL tool is expensive and hard to operate.
- Why it fits: Managed service, CDAP-based, plugin ecosystem.
- Scenario: Replace on-prem batch ETL jobs with cloud pipelines feeding BigQuery.
8) Rapid prototyping for new data products
- Problem: Need to test feasibility quickly without building a full codebase.
- Why it fits: Drag-and-drop authoring and interactive wrangling.
- Scenario: Prototype new customer segmentation dataset in days, then productionize.
9) Central ingestion for BI tools
- Problem: BI tools depend on consistent, timely datasets.
- Why it fits: Repeatable pipelines and operational visibility.
- Scenario: Daily curated BigQuery marts maintained by Data Fusion, powering Looker/BI.
10) Compliance-aware processing pipeline (segmented access)
- Problem: Sensitive data must be processed under strict access control.
- Why it fits: IAM + private networking patterns + controlled execution identity.
- Scenario: PII dataset pipelines run in a restricted project with limited access; outputs are tokenized datasets in another project.
11) Hybrid ingestion from on-prem to Google Cloud
- Problem: On-prem sources need integration during migration.
- Why it fits: Network connectivity + JDBC/file ingestion patterns.
- Scenario: On-prem database exports land in Cloud Storage via VPN/Interconnect; Data Fusion transforms and loads to BigQuery.
12) Controlled plugin-based extensibility for enterprise standards
- Problem: Teams need custom connectors but must standardize operations.
- Why it fits: Custom plugin artifacts with governance and versioning.
- Scenario: Build a custom plugin for a proprietary API and distribute it across namespaces.
6. Core Features
This section focuses on important, commonly used Cloud Data Fusion capabilities. For exact availability by edition/region, verify in the official docs.
1) Managed Cloud Data Fusion instances
- What it does: Provides a managed control plane you create per project/region.
- Why it matters: Avoids managing CDAP clusters yourself.
- Practical benefit: Faster setup, managed upgrades/maintenance (within service constraints).
- Caveats: Costs accrue based on instance pricing model while the instance is running; plan lifecycle management.
2) Visual pipeline designer (Studio)
- What it does: Drag-and-drop UI to build pipelines with sources, transforms, and sinks.
- Why it matters: Improves developer productivity and consistency.
- Practical benefit: Less boilerplate code; easier onboarding.
- Caveats: Complex logic may still require custom plugins or off-platform processing.
3) Interactive data preparation (Wrangler)
- What it does: Helps profile, clean, and transform datasets interactively.
- Why it matters: Data cleanup is a large portion of real ETL work.
- Practical benefit: Quickly fix schema issues (types, splits, trims, parsing).
- Caveats: Wrangler steps must be validated for production scale; some transformations may behave differently with edge cases.
4) Plugin ecosystem (sources, sinks, transforms)
- What it does: Uses plugins for connectivity and transformations (including JDBC-based connectors).
- Why it matters: Reduces custom code and integration effort.
- Practical benefit: Faster integration across common systems.
- Caveats: Plugin capabilities and compatibility depend on versions; validate in a staging environment.
5) Extensibility with custom plugins (artifacts)
- What it does: Lets you package and deploy custom connectors/transforms.
- Why it matters: Enables integration with non-standard systems.
- Practical benefit: Standardize custom logic across teams.
- Caveats: You own plugin lifecycle (build, security reviews, compatibility testing).
6) Compute profiles (runtime configuration)
- What it does: Separates pipeline design from execution environment configuration.
- Why it matters: You can run similar pipelines on different compute configurations.
- Practical benefit: Dev uses smaller profiles; prod uses larger profiles.
- Caveats: Execution depends on underlying compute quotas (often Dataproc quotas and VPC constraints).
7) Namespace-based organization
- What it does: Provides logical separation for assets within an instance.
- Why it matters: Helps multi-team governance and segmentation.
- Practical benefit: Reduce collisions, separate artifacts/pipelines.
- Caveats: Namespaces are not a substitute for project-level isolation for strict compliance boundaries.
8) Operational visibility (run history, logs integration)
- What it does: Tracks pipeline deployments and executions; integrates with Cloud Logging/Monitoring.
- Why it matters: Data pipeline reliability depends on observability.
- Practical benefit: Faster incident response; easier audit of job runs.
- Caveats: Log volume can be significant (cost and noise). Apply retention and filtering strategies.
9) Integration with Google Cloud storage and analytics services
- What it does: Works naturally with Cloud Storage and BigQuery patterns.
- Why it matters: These are core building blocks in Google Cloud data analytics.
- Practical benefit: Common landing-to-warehouse flows are straightforward.
- Caveats: Cross-region data movement increases cost and latency; co-locate resources.
10) Versioning and promotion patterns (pipeline export/import)
- What it does: Supports exporting/importing pipeline definitions and managing artifacts.
- Why it matters: Enables CI/CD-style promotion from dev → test → prod.
- Practical benefit: Repeatable deployments and reduced manual configuration drift.
- Caveats: You must design your own promotion workflow (Git, approvals, environment-specific configuration, secrets handling).
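A promotion step might look like the sketch below, which builds the export URL and prints (rather than runs) the commands. The endpoint value and pipeline name are placeholders, and the `/v3/namespaces/.../apps/...` route follows the CDAP-style REST convention—verify the exact path and authentication flow for your instance before relying on it:

```shell
# Sketch: export a deployed pipeline's JSON and commit it to Git.
# ENDPOINT and PIPELINE are hypothetical placeholders.
ENDPOINT="https://EXAMPLE-datafusion.googleusercontent.com/api"
PIPELINE="orders_daily"
EXPORT_URL="${ENDPOINT}/v3/namespaces/default/apps/${PIPELINE}"

# Printed instead of executed in this sketch:
echo "curl -s -H \"Authorization: Bearer \$(gcloud auth print-access-token)\" ${EXPORT_URL} > ${PIPELINE}.json"
echo "git add ${PIPELINE}.json && git commit -m 'Export ${PIPELINE}'"
```

The committed JSON can then be imported into the test/prod instance as part of an approval-gated deployment job.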
7. Architecture and How It Works
High-level service architecture
Cloud Data Fusion typically has:
- A managed control plane (the instance/UI and management services)
- A data plane where pipelines execute on compute resources configured by compute profiles (commonly Dataproc clusters)
The control plane manages:
- Pipeline definitions
- Plugin artifacts
- Metadata about runs
- UI, namespaces, connection configuration
The data plane performs:
- Actual reading, transforming, and writing of data
- Distributed compute for heavy transformations
Request/data/control flow (typical)
- User authenticates to Google Cloud and opens the Cloud Data Fusion instance UI.
- User designs a pipeline in Studio and configures plugins (source/transform/sink).
- User deploys and runs the pipeline.
- Cloud Data Fusion provisions or selects runtime compute per compute profile (often a Dataproc cluster).
- Runtime job reads from source (e.g., Cloud Storage), transforms data, and writes to sink (e.g., BigQuery).
- Execution logs and metrics flow to Cloud Logging/Monitoring (depending on configuration).
- User monitors status in the Data Fusion UI and/or Cloud Monitoring.
Integrations with related services (common)
- Cloud Storage: Landing zone for files; staging and intermediate storage.
- BigQuery: Data warehouse sink; also can be a source.
- Dataproc: Pipeline execution runtime (commonly).
- Pub/Sub: Streaming ingestion source patterns (verify runtime compatibility).
- Cloud Logging / Cloud Monitoring: Logs and metrics for operations.
- IAM: Access control for instances and underlying resources.
Dependency services (what you should plan for)
- Data Fusion API
- Dataproc API (commonly required for runtime execution)
- BigQuery API and Storage APIs (depending on pipelines)
- VPC networking configuration (especially for private instances)
- Service accounts and IAM bindings for runtime access
Security/authentication model (practical view)
- Human access: Controlled via Google Cloud IAM roles on the project and Data Fusion instance.
- Service identity: Cloud Data Fusion uses service identities (service agents) and/or configured runtime service accounts to access resources (buckets, BigQuery datasets, databases).
- Best practice: Use least privilege and separate service accounts per environment.
Networking model (public vs private patterns)
- Public instances: UI is accessible via Google-managed endpoints with IAM gating (subject to org policies).
- Private instances: Designed for restricted environments, using private connectivity between your VPC and the managed service. This often involves VPC peering or other private connectivity mechanisms depending on current product design. Verify the current private networking model in official docs because implementation details can evolve.
Monitoring/logging/governance considerations
- Standardize on:
- Pipeline naming conventions
- Labels/tags at the Google Cloud project level
- Log-based metrics and alerting for failures
- Decide whether:
- You monitor from the Data Fusion UI, Cloud Monitoring dashboards, or both
- You forward logs to a SIEM
- Govern plugin promotion and pipeline changes:
- Store pipeline exports in Git
- Use change approvals
- Maintain dev/test/prod isolation
Simple architecture diagram (Mermaid)
flowchart LR
U[Engineer] -->|Design & Run| DF[Cloud Data Fusion Instance]
DF -->|Launch pipeline job| DP["Dataproc runtime (Spark)"]
GCS[(Cloud Storage: raw files)] --> DP
DP --> BQ[(BigQuery: curated tables)]
DP --> LOG[Cloud Logging]
DP --> MON[Cloud Monitoring]
Production-style architecture diagram (Mermaid)
flowchart TB
subgraph OnPrem[On-prem / External]
DB[(Relational DB)]
Files[(File drops)]
end
subgraph VPC["Customer VPC (Google Cloud)"]
VPN[Cloud VPN / Interconnect]
PSC["Private connectivity pattern (verify: VPC peering/PSC per current docs)"]
end
subgraph GCP["Google Cloud Project(s)"]
DF["Cloud Data Fusion instance (private)"]
GCSraw[(Cloud Storage raw zone)]
GCSstage[(Cloud Storage staging/quarantine)]
BQcur[(BigQuery curated datasets)]
KMS["Cloud KMS (keys, if used)"]
IAM["IAM (service accounts & roles)"]
LOG[Cloud Logging]
MON[Cloud Monitoring]
end
DB --> VPN
Files --> VPN
VPN --> PSC
PSC --> DF
DF -->|Exec via compute profile| DP["Dataproc runtime (regional, ephemeral or persistent)"]
DP --> GCSraw
DP --> GCSstage
DP --> BQcur
DF --> LOG
DP --> LOG
LOG --> MON
IAM -.controls.-> DF
IAM -.controls.-> DP
KMS -.->|"encryption controls (service-dependent)"| GCSraw
KMS -.-> BQcur
8. Prerequisites
Before you start, ensure the following are in place.
Google Cloud account/project requirements
- A Google Cloud billing account attached to your project
- A Google Cloud project where you can create:
- Cloud Data Fusion instances
- Cloud Storage buckets
- BigQuery datasets
- Dataproc clusters (or allow Data Fusion to create ephemeral clusters)
Permissions / IAM roles
At minimum, you typically need:
- Permissions to create/manage Cloud Data Fusion instances (e.g., the Data Fusion Admin role in the project)
- Permissions to create/read/write:
  - Cloud Storage objects (for source files)
  - BigQuery datasets/tables (for the sink)
- Permissions for Dataproc usage if your pipelines run on Dataproc
Because exact roles can vary by org policy and design, confirm required roles in the official docs:
- Cloud Data Fusion IAM overview: https://cloud.google.com/data-fusion/docs/concepts/iam
A practical least-privilege approach:
- Human users: Viewer/Developer roles for authoring; Admin only for the platform team
- Runtime service account(s): only the Cloud Storage and BigQuery permissions required by the pipelines
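As one illustration of the runtime-service-account approach, the sketch below prints (rather than runs) the IAM bindings for a hypothetical dev runtime account. The account name and the exact role set are assumptions—align them with your own pipeline needs and org IAM design:

```shell
# Sketch: least-privilege bindings for a per-environment runtime SA.
# The SA name and role list are illustrative assumptions.
PROJECT_ID="YOUR_PROJECT_ID"
SA="df-runtime-dev@${PROJECT_ID}.iam.gserviceaccount.com"
for ROLE in roles/storage.objectViewer roles/bigquery.dataEditor roles/bigquery.jobUser; do
  echo "gcloud projects add-iam-policy-binding ${PROJECT_ID} \\"
  echo "  --member=serviceAccount:${SA} --role=${ROLE}"
done
```

A prod account would get a separate identity (and often narrower, dataset-level grants instead of project-level roles).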
Billing requirements
- Billing must be enabled.
- Expect charges for:
- Cloud Data Fusion instance (edition-based)
- Runtime compute (commonly Dataproc)
- BigQuery usage (storage + queries/loads)
- Cloud Storage
- Logging/Monitoring ingestion and retention (indirect but real)
CLI/SDK/tools needed
- Google Cloud Console access
- Optional but recommended:
  - gcloud CLI: https://cloud.google.com/sdk/docs/install
  - bq CLI (bundled with the Cloud SDK)
  - gsutil (bundled with the Cloud SDK)
Region availability
- Cloud Data Fusion is regional. Choose a region where:
- Cloud Data Fusion is available
- BigQuery dataset location is compatible with your design (BigQuery uses multi-region or region locations)
- Dataproc is available
- Verify current locations in official docs:
- Locations: https://cloud.google.com/data-fusion/docs/concepts/locations
Quotas/limits
Common quota considerations:
- Dataproc CPU quotas in your region
- Cloud Storage request limits (rarely blocking for small labs)
- BigQuery dataset/table quotas (unlikely to matter for a small lab)
- Cloud Logging ingestion quotas/cost controls
- Data Fusion instance limits per project/region (verify in official docs)
Prerequisite services/APIs
Enable these APIs in your project:
- Cloud Data Fusion API
- Dataproc API (commonly required for runtime)
- BigQuery API
- Cloud Storage APIs
You can enable via Console or CLI (shown later in the lab).
9. Pricing / Cost
Cloud Data Fusion pricing changes over time and can vary by region and edition. Do not rely on cached numbers—always confirm using official sources.
Official pricing sources
- Cloud Data Fusion pricing page: https://cloud.google.com/data-fusion/pricing
- Google Cloud Pricing Calculator: https://cloud.google.com/products/calculator
Pricing dimensions (what you pay for)
Cloud Data Fusion costs usually include:
1) Cloud Data Fusion instance charges
- Typically depends on:
  - Edition (for example, Developer/Basic/Enterprise—verify current editions and names)
  - Instance size/capacity (where applicable)
  - Running time (many teams control cost by stopping instances when not needed—verify instance stop/start behavior and billing rules in current docs)
2) Pipeline execution compute
- Dataproc compute charges commonly apply when pipelines run:
  - VM compute (vCPU/RAM)
  - Persistent disks
  - Possible charges for ephemeral clusters during job runtime
- Dataproc pricing varies by region and VM family:
  - Dataproc pricing: https://cloud.google.com/dataproc/pricing
3) Storage
- Cloud Storage for raw/staging data and artifacts:
  - Storage class, GB-month, operations, egress
- BigQuery storage for tables:
  - Active/long-term storage (pricing depends on model)
4) BigQuery processing
- Loads are typically inexpensive, but queries can be a major cost driver depending on your downstream usage.
- If your pipeline triggers transformations inside BigQuery, the query processing model matters.
5) Networking
- Same-region traffic is usually cheaper and faster.
- Cross-region data movement can incur egress charges and latency.
- If you integrate with on-prem systems, VPN/Interconnect costs may apply.
6) Logging & monitoring
- Cloud Logging ingestion and retention can become a meaningful cost for high-volume pipelines.
Cost drivers (what increases your bill)
- Leaving instances running 24/7 when you only need them during working hours
- Large Dataproc clusters for transformations that could be pushed down to BigQuery (when appropriate)
- Cross-region data movement between Cloud Storage, runtime, and BigQuery
- High-volume verbose logging
- Reprocessing full datasets instead of incremental loads
Hidden or indirect costs to plan for
- Dataproc quotas causing upscaling to larger regions or project changes
- Operational overhead: running multiple instances for environment separation
- BigQuery downstream costs: curated datasets often increase analytics query volume
- Service account sprawl: managing least privilege takes time (but is worth it)
How to optimize cost
- Choose the smallest appropriate Cloud Data Fusion edition and instance sizing for your workload.
- Stop non-production instances when not in use (verify exact behavior and automation options).
- Use ephemeral compute for batch pipelines when possible.
- Co-locate Cloud Storage buckets, Dataproc region, and BigQuery dataset location strategy to reduce latency/egress.
- Implement incremental processing (date partitions, watermarking patterns).
- Reduce log volume:
- Don’t log entire records
- Use structured logs with sampling
- Set retention appropriately
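The incremental-processing point is easiest to see with a watermark. Below is a minimal local sketch using a file-based watermark; in production the watermark would typically live in BigQuery or be passed as a pipeline runtime argument, and the file name here is made up:

```shell
# Sketch: process only data newer than the last successful run.
# A local file stands in for a real watermark store.
WATERMARK_FILE="last_success.txt"
[ -f "$WATERMARK_FILE" ] || echo "1970-01-01" > "$WATERMARK_FILE"
SINCE=$(cat "$WATERMARK_FILE")
echo "would process partitions where date > ${SINCE}"
# ...trigger the pipeline with SINCE as a runtime argument; on success:
date +%F > "$WATERMARK_FILE"
```

Re-running only advances the watermark after a successful load, so failures naturally retry the same window instead of skipping data.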
Example low-cost starter estimate (conceptual, no fabricated numbers)
A low-cost learning setup typically looks like:
- One Cloud Data Fusion Developer (or lowest-cost) edition instance
- A small Cloud Storage bucket with a few MB of CSV data
- A BigQuery dataset with a single small table
- A small ephemeral Dataproc job run once or twice

Your main costs will be:
- Instance runtime (hours)
- Dataproc VM runtime (minutes to hours)
- Minimal storage

Use the Pricing Calculator with:
- Your region
- Your expected number of pipeline runs
- Expected Dataproc cluster size and job duration
Example production cost considerations (what to model)
For production, model:
- Instance running time (24/7 or business hours)
- Number of pipelines and schedules
- Peak concurrency (multiple pipelines running simultaneously)
- Dataproc cluster sizes and run durations
- Data volume growth (GB/day)
- Logging volume and retention
- BigQuery query volume (often larger than ingestion)
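Before opening the calculator, peak runtime compute can be approximated with simple arithmetic. The numbers below are placeholder assumptions, and the sketch deliberately stops at vCPU-hours rather than guessing at prices:

```shell
# Back-of-envelope sketch: monthly Dataproc vCPU-hours for one schedule.
# All inputs are placeholder assumptions; feed the result into the
# Pricing Calculator with current rates for your region.
RUNS_PER_DAY=4
WORKERS=3
VCPUS_PER_WORKER=4
RUN_MINUTES=30
VCPU_HOURS_MONTH=$(( RUNS_PER_DAY * 30 * WORKERS * VCPUS_PER_WORKER * RUN_MINUTES / 60 ))
echo "${VCPU_HOURS_MONTH} vCPU-hours/month"
```

With these placeholder inputs the sketch prints 720 vCPU-hours/month; disk, master-node, and instance-hour costs would be modeled separately.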
10. Step-by-Step Hands-On Tutorial
This lab builds a real, small pipeline: Cloud Storage CSV → Cloud Data Fusion transformations (Wrangler) → BigQuery table.
Objective
Create a Cloud Data Fusion pipeline that:
1. Reads a CSV file from Cloud Storage
2. Applies simple cleanup transformations (trim, type conversions)
3. Loads the transformed data into a BigQuery table
4. Validates results with a BigQuery query
5. Cleans up all created resources to avoid ongoing cost
Lab Overview
You will:
1. Create a Cloud Storage bucket and upload a sample CSV file
2. Create a BigQuery dataset
3. Create a Cloud Data Fusion instance
4. Build and run a batch pipeline in Data Fusion Studio
5. Validate the loaded data in BigQuery
6. Troubleshoot common issues
7. Clean up resources
Expected time: 60–120 minutes (instance creation can take time).
Cost note: Cloud Data Fusion instances and Dataproc runtime can generate charges. Use a dev/learning project and clean up afterward.
Step 1: Prepare your project and enable required APIs
Actions (Console)
- In the Google Cloud Console, select your project.
- Go to APIs & Services → Library.
- Enable:
  - Cloud Data Fusion API
  - Dataproc API
  - BigQuery API
  - Cloud Storage (usually enabled by default, but confirm)
Actions (CLI)
Set your project and enable services:
PROJECT_ID="YOUR_PROJECT_ID"
gcloud config set project "$PROJECT_ID"
gcloud services enable \
datafusion.googleapis.com \
dataproc.googleapis.com \
bigquery.googleapis.com \
storage.googleapis.com \
logging.googleapis.com \
monitoring.googleapis.com
Expected outcome: APIs show as enabled in APIs & Services → Enabled APIs & services.
Verification
gcloud services list --enabled --filter="name:(datafusion.googleapis.com OR dataproc.googleapis.com OR bigquery.googleapis.com)"
Step 2: Create a Cloud Storage bucket and upload sample data
Choose a region where you plan to create the Cloud Data Fusion instance.
REGION="us-central1" # change if needed
BUCKET="df-lab-${PROJECT_ID}-$(date +%Y%m%d%H%M%S)"
gsutil mb -l "$REGION" "gs://${BUCKET}"
Create a small CSV file locally:
cat > customers.csv <<'EOF'
customer_id,full_name,email,signup_date,total_spend
1, Ada Lovelace ,ada@example.com,2024-01-05,123.45
2,Grace Hopper, grace.hopper@example.com ,2024-02-12,987.65
3,Alan Turing,alan.turing@example.com,2024-03-20,42.00
EOF
Upload it:
gsutil cp customers.csv "gs://${BUCKET}/input/customers.csv"
gsutil ls "gs://${BUCKET}/input/"
Expected outcome: You can see customers.csv in the bucket path.
Verification (Console)
- Go to Cloud Storage → Buckets → your bucket → input/
- Confirm customers.csv exists
Step 3: Create a BigQuery dataset for the output
Pick a BigQuery dataset location consistent with your data governance and performance needs. For simple labs, pick a dataset location that aligns with your region strategy.
Create a dataset:
BQ_DATASET="df_lab"
bq --location="$REGION" mk --dataset "${PROJECT_ID}:${BQ_DATASET}"
Expected outcome: Dataset exists in BigQuery.
Verification
bq ls "${PROJECT_ID}:${BQ_DATASET}"
Step 4: Create a Cloud Data Fusion instance
This is typically easiest in the Console because it guides networking and edition choices.
Actions (Console)
- Go to Navigation menu → Data Fusion (direct link: https://console.cloud.google.com/data-fusion)
- Click Create instance
- Configure:
  - Instance name: df-lab
  - Region: us-central1 (or your chosen region)
  - Edition: choose a lower-cost option suitable for learning (often “Developer” if available in your org—verify current edition availability and pricing)
  - Networking: for a first lab, use the simplest configuration allowed by your org policies (public vs private may be constrained by policy)
- Click Create
Instance creation can take several minutes.
Expected outcome: Instance status becomes Running.
Verification
- Open the instance details page and confirm it is running.
- Click View instance (or equivalent) to open the Data Fusion UI.
Common blockers – Organization Policy denies public IP access or external connectivity. – Insufficient IAM permissions to create the instance or required service identities.
If blocked, review:
– IAM: https://cloud.google.com/data-fusion/docs/concepts/iam
– Private instances/networking: https://cloud.google.com/data-fusion/docs/concepts/private-ip
Step 5: Open Data Fusion Studio and create a pipeline
Actions (Data Fusion UI)
- In your instance page, click View instance to open the Cloud Data Fusion UI.
- Go to Studio.
- Click Create a pipeline.
- Choose Batch pipeline (for CSV file ingestion).
Expected outcome: You see the pipeline canvas.
Step 6: Configure the source (Cloud Storage / GCS)
- On the left panel, find a Cloud Storage (GCS) source plugin. – Plugin names can vary (for example “GCS”).
- Drag the source onto the canvas.
- Configure the source:
– Reference name: gcs_customers
– Path: gs://YOUR_BUCKET/input/customers.csv
– Format: CSV
– Ensure header handling is enabled if your plugin requires it.
If the plugin requires a schema:
– Use Infer schema if available, or manually define:
– customer_id (integer)
– full_name (string)
– email (string)
– signup_date (date or string depending on plugin support)
– total_spend (double/decimal)
Expected outcome: Source is configured and validates successfully.
Verification – Use the plugin’s Preview or Get schema option (if available). – Confirm it can read sample rows.
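If your plugin requires a manually defined schema, it can help to draft it as JSON first and keep it alongside the pipeline. CDAP pipelines generally use Avro-style schemas, but the exact JSON shape your plugin version accepts is an assumption here; treat this as a sketch and validate in the plugin UI:

```shell
# Draft an Avro-style schema for the lab CSV (types follow Step 6).
# The exact JSON shape your plugin expects is an assumption.
cat > /tmp/customers_schema.json <<'EOF'
{
  "type": "record",
  "name": "customers",
  "fields": [
    { "name": "customer_id", "type": "int" },
    { "name": "full_name",   "type": "string" },
    { "name": "email",       "type": "string" },
    { "name": "signup_date", "type": "string" },
    { "name": "total_spend", "type": "double" }
  ]
}
EOF

# Confirm the draft is at least valid JSON before pasting it anywhere.
python3 -m json.tool /tmp/customers_schema.json >/dev/null && echo "schema JSON is valid"
```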
Step 7: Add transformations (Wrangler) to clean up fields
- Drag a Wrangler transform onto the canvas.
- Connect the GCS source to Wrangler.
- Open Wrangler and apply typical cleaning steps:
– Trim whitespace in full_name
– Trim whitespace in email
– Ensure customer_id is numeric
– Ensure total_spend is numeric
– Parse signup_date as a date if supported; otherwise keep as string and parse downstream
Wrangler has a “recipe” style UI. Exact operations depend on the current Wrangler UI/version. Use operations such as: – Trim – Change data type – Parse date
Expected outcome: Preview shows cleaned values:
– Ada Lovelace (no extra spaces)
– grace.hopper@example.com (trimmed)
– total_spend numeric
Verification – Preview the output rows (your sample file has only 3). – Confirm no nulls were introduced unexpectedly.
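The effect of the trims can also be previewed outside the UI with a rough awk equivalent. This is only a local approximation (Wrangler’s recipe is the source of truth, and the awk version assumes no commas inside fields):

```shell
# Recreate the Step 2 sample, then approximate the Wrangler trims with awk:
# strip whitespace around full_name (field 2) and email (field 3).
cat > /tmp/customers_raw.csv <<'EOF'
customer_id,full_name,email,signup_date,total_spend
1, Ada Lovelace ,ada@example.com,2024-01-05,123.45
2,Grace Hopper, grace.hopper@example.com ,2024-02-12,987.65
3,Alan Turing,alan.turing@example.com,2024-03-20,42.00
EOF

awk -F',' 'BEGIN { OFS="," } {
  gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim full_name
  gsub(/^[ \t]+|[ \t]+$/, "", $3)   # trim email
  print
}' /tmp/customers_raw.csv > /tmp/customers_trimmed.csv

grep '^1,' /tmp/customers_trimmed.csv
# → 1,Ada Lovelace,ada@example.com,2024-01-05,123.45
```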
Step 8: Configure the sink (BigQuery)
- Drag a BigQuery sink plugin onto the canvas.
- Connect Wrangler to the BigQuery sink.
- Configure:
– Reference name: bq_customers
– Dataset: df_lab
– Table: customers_clean
– Write mode: truncate (for a lab) or append (for incremental patterns)
– If available, enable Create table automatically based on schema
Expected outcome: Sink is configured and validates successfully.
Verification – Use the plugin’s validation option. – Ensure the dataset name is correct and exists.
Step 9: Set runtime/compute profile and run the pipeline
- Click Configure or Pipeline settings (UI naming varies).
- Select a Compute profile. – For labs, choose the default profile that runs on ephemeral managed compute (commonly Dataproc). – If your org requires a custom profile (VPC, service account, network tags), choose the approved profile.
- Click Deploy (if required by the UI).
- Click Run.
Expected outcome – The pipeline transitions to a Running state. – You see stages execute in order. – On success, the run ends with Succeeded/Completed.
Verification – Check the run details page. – Confirm each stage completed successfully. – Open logs for the run if needed.
Step 10: Validate results in BigQuery
List tables:
bq ls "${PROJECT_ID}:${BQ_DATASET}"
Query the data:
bq query --use_legacy_sql=false "
SELECT customer_id, full_name, email, signup_date, total_spend
FROM \`${PROJECT_ID}.${BQ_DATASET}.customers_clean\`
ORDER BY customer_id;"
Expected outcome: You see three rows with cleaned full_name and email fields and numeric total_spend.
Validation
Use this checklist:
- Cloud Storage contains the input file: gs://BUCKET/input/customers.csv
- Cloud Data Fusion instance is running and accessible
- Pipeline run status is Succeeded
- BigQuery table exists: PROJECT_ID.df_lab.customers_clean
- Query returns cleaned data
If any step fails, move to Troubleshooting below.
Troubleshooting
Below are common issues and practical fixes.
Issue 1: “Permission denied” reading from Cloud Storage
Symptoms – Source stage fails with 403 errors
Fix
– Confirm the runtime identity used by pipeline execution has storage.objects.get and storage.objects.list on the bucket.
– Prefer granting permissions to a service account rather than broad user roles.
– Verify bucket-level IAM and any org policies.
Issue 2: BigQuery permission errors (create table / write)
Symptoms – Sink stage fails with access denied for dataset/table operations
Fix – Ensure runtime identity has: – Permission to create tables (if auto-create is enabled) – Permission to write data to the dataset – Check dataset IAM bindings.
Issue 3: Dataproc cluster creation fails / quota issues
Symptoms – Pipeline fails before processing starts – Errors mention quotas, regions, or VM provisioning
Fix – Check Dataproc and Compute Engine quotas in the chosen region. – Use a smaller compute profile for the lab. – Try a different region if allowed. – Confirm the Dataproc API is enabled.
Issue 4: Can’t open Data Fusion UI
Symptoms – UI doesn’t load or access is blocked
Fix – Verify you have IAM permissions for the instance. – If using a private instance, confirm you are on the right network path (VPN, bastion, authorized access method). – Check org policy constraints for external access.
Issue 5: Schema/type errors in Wrangler or BigQuery sink
Symptoms – Pipeline fails when converting types
Fix
– Keep signup_date as a string in the lab if date parsing fails; parse later in BigQuery with PARSE_DATE.
– Ensure decimal parsing matches your locale and format.
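Before digging into plugin settings, a quick local check can confirm whether the numeric column itself is clean. This sketch assumes plain "digits.digits" values with no thousands separators:

```shell
# Recreate the sample and flag any total_spend value (field 5) that is not
# a plain decimal number such as 42.00.
cat > /tmp/customers_check.csv <<'EOF'
customer_id,full_name,email,signup_date,total_spend
1, Ada Lovelace ,ada@example.com,2024-01-05,123.45
2,Grace Hopper, grace.hopper@example.com ,2024-02-12,987.65
3,Alan Turing,alan.turing@example.com,2024-03-20,42.00
EOF

awk -F',' 'NR > 1 && $5 !~ /^[0-9]+(\.[0-9]+)?$/ { bad++; print "bad value:", $5 }
           END { print "non-numeric total_spend values:", bad+0 }' /tmp/customers_check.csv
# → non-numeric total_spend values: 0
```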
Cleanup
To avoid ongoing charges, delete resources created in this lab.
1) Delete the Cloud Data Fusion instance
Console:
– Go to Data Fusion → Instances
– Select df-lab → Delete
Important: Instance deletion can take time.
2) Delete the BigQuery dataset (and tables)
bq rm -r -f "${PROJECT_ID}:${BQ_DATASET}"
3) Delete the Cloud Storage bucket
gsutil -m rm -r "gs://${BUCKET}"
4) (Optional) Review logs and remove any extra resources
- Check Dataproc clusters (if any were created and left behind)
- Check service accounts created for the lab (if you created any manually)
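A guarded CLI sweep can help confirm nothing was left behind. The command below is a standard gcloud call, but double-check the region you used for the lab:

```shell
# List any leftover Dataproc clusters in the lab region, if gcloud exists.
REGION="us-central1"   # the region used for the lab
if command -v gcloud >/dev/null 2>&1; then
  gcloud dataproc clusters list --region="$REGION" || echo "listing failed; check auth/permissions"
else
  echo "gcloud not installed; check Dataproc in the Console instead"
fi
```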
11. Best Practices
Architecture best practices
- Co-locate resources: Keep Cloud Storage, Dataproc runtime region, and your BigQuery location strategy aligned to reduce latency and egress.
- Use layered data zones:
- Raw (immutable)
- Clean (validated and standardized)
- Curated (business-ready marts)
- Design for reprocessing: Store raw data and pipeline configs so you can rebuild curated datasets.
IAM/security best practices
- Least privilege:
- Separate human authoring permissions from runtime execution permissions.
- Prefer dedicated runtime service accounts per environment.
- Separation of environments:
- Use separate projects for dev/test/prod if you have strong governance needs.
- Review service agent permissions: Cloud Data Fusion uses managed identities; ensure they have only required access.
Cost best practices
- Stop non-prod instances when not used (verify stop/start and billing behavior).
- Use small compute profiles for development.
- Avoid over-logging; set retention policies.
Performance best practices
- Use distributed compute for heavy transforms, but:
- Push down transformations to BigQuery where it makes sense (for SQL-friendly transforms).
- Use partitioning and clustering in BigQuery sinks for query performance.
- Avoid reading massive unpartitioned files repeatedly; use partitioned file layouts (date-based prefixes).
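A date-based prefix layout is simple to generate at load time. A minimal sketch, with placeholder bucket and dataset names:

```shell
# Construct a date-partitioned object prefix so each day's files land under
# their own path (names here are placeholders).
BUCKET="my-landing-bucket"
LOAD_DATE=$(date +%Y-%m-%d)

PREFIX="gs://${BUCKET}/raw/customers/dt=${LOAD_DATE}/"
echo "upload target: ${PREFIX}"

# Uploads then become, for example:
#   gsutil cp customers.csv "${PREFIX}customers.csv"
```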
Reliability best practices
- Build idempotent pipelines:
- Use deterministic output paths/tables
- Use truncate+load for small tables, or partition overwrite for incremental loads
- Add data validation and quarantine outputs for bad records.
- Use retries thoughtfully; not all failures should auto-retry (e.g., schema errors).
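For the partition-overwrite pattern, BigQuery's partition decorator scopes a truncate-and-load to a single day. The command below is only printed, not executed, and the table/file names are placeholders; verify decorator support for your load path in the BigQuery docs:

```shell
# Sketch: overwrite one daily partition instead of the whole table.
# $20240105 is BigQuery's partition decorator for a DAY-partitioned table.
DS="df_lab"; TABLE="customers_clean"; DAY="20240105"   # placeholders

CMD="bq load --replace --source_format=CSV ${DS}.${TABLE}\$${DAY} gs://my-bucket/input/2024-01-05/*.csv"
echo "$CMD"   # run against a live project; shown here for illustration
```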
Operations best practices
- Standardize:
- Pipeline naming: domain_source_to_sink_purpose
- Labels: env, owner, cost-center
- Set up alerting on:
- Pipeline failures
- Missing data (expected daily loads)
- Data volume anomalies
- Keep a runbook per pipeline:
- Inputs, outputs, SLAs, dependencies, rollback steps
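A naming convention such as domain_source_to_sink_purpose can be enforced mechanically, for example in CI before deployment. The regex below encodes one possible reading of the convention and is an assumption, not an official rule:

```shell
# Validate pipeline names against domain_source_to_sink_purpose:
# four lowercase tokens joined by underscores, with "to" in the middle.
valid_name() {
  case "$1" in
    *[!a-z0-9_]*) return 1 ;;   # only lowercase alphanumerics and underscores
  esac
  echo "$1" | grep -Eq '^[a-z0-9]+_[a-z0-9]+_to_[a-z0-9]+_[a-z0-9]+$'
}

valid_name "sales_gcs_to_bigquery_daily" && echo "ok"   # matches the convention
valid_name "Sales Pipeline #1" || echo "rejected"       # spaces/caps rejected
```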
Governance/tagging/naming best practices
- Use consistent dataset/table naming in BigQuery: raw_*, clean_*, cur_*
- Use consistent bucket prefixes: raw/, staging/, curated/, quarantine/
- Store exported pipeline definitions in Git with code review.
12. Security Considerations
Identity and access model
- Cloud Data Fusion uses Google Cloud IAM for:
- Instance administration
- User access to UI and operations
- Pipeline runtime access depends on:
- The configured runtime identity/service account and permissions
- Permissions for underlying services (GCS, BigQuery, Dataproc, external DBs)
Recommendations:
– Use separate service accounts for instance administration automation (if any) and for pipeline execution (runtime).
– Grant runtime accounts access only to the buckets and datasets they need, with only the required permissions (read vs write).
Encryption
- Google Cloud services encrypt data at rest by default.
- If you require customer-managed encryption keys (CMEK), evaluate CMEK support for each dependent service:
- Cloud Storage CMEK
- BigQuery CMEK
- Dataproc disk encryption
- Cloud Data Fusion’s internal encryption controls and CMEK options may vary; verify in official docs for your edition and region.
Network exposure
- Prefer private connectivity patterns for regulated environments (verify current private instance architecture and constraints).
- Control egress:
- If pipelines access external systems, restrict egress using VPC controls, NAT, and firewall rules where applicable.
- Avoid placing sensitive sources behind public IPs without strict access controls.
Secrets handling
Common patterns:
– Prefer IAM-based auth where possible (e.g., access to GCS and BigQuery via service accounts).
– For database passwords/API keys, store secrets in Secret Manager and inject them at runtime where supported by your runtime environment.
– If storing credentials in connection configs, restrict who can view/edit connections and audit changes.
Because secret injection mechanisms vary by plugin and runtime, verify recommended patterns in official docs and test in a non-production environment.
Audit/logging
- Ensure Cloud Audit Logs are enabled for admin activity.
- Centralize runtime logs in Cloud Logging and forward to your SIEM if required.
- Avoid logging PII in plaintext.
Compliance considerations
- Choose region(s) matching data residency requirements.
- Enforce environment separation for sensitive datasets.
- Document lineage and transformations (pipeline definitions are part of compliance evidence).
Common security mistakes
- Using overly broad roles (Owner/Editor) for pipeline runtime service accounts
- Storing secrets in pipeline parameters or exported pipeline JSON in Git
- Running pipelines with public networking when private is required
- Not restricting access to production instances and namespaces
Secure deployment recommendations
- Use private instance patterns for production where required.
- Use least-privileged service accounts and separate projects.
- Implement CI/CD promotion with approvals and artifact scanning for custom plugins.
13. Limitations and Gotchas
The following are common practical limitations and “gotchas.” Always validate against current docs for your edition and region.
Instance lifecycle and cost
- Cloud Data Fusion is instance-based; leaving instances running can create ongoing cost.
- Instance creation/deletion can take several minutes.
Regional constraints
- Instances are regional; cross-region sources/sinks can increase latency and cost.
- BigQuery dataset location constraints can complicate multi-region designs.
Runtime quotas and dependencies
- Pipeline execution often depends on Dataproc capacity and quotas.
- If quotas are low, pipelines fail before running transforms.
Networking complexity (especially private instances)
- Private connectivity can require specific VPC design and permissions.
- DNS/firewall/VPC peering constraints can block UI access or runtime access to sources.
Plugin compatibility and driver management
- JDBC and external connectors may require correct driver versions.
- Plugin upgrades can introduce behavior changes; test before production promotion.
Operational surprises
- Logging volume can grow quickly and cost money.
- Retries can duplicate loads if pipelines aren’t idempotent.
Migration challenges
- Migrating from legacy ETL tools often reveals hidden transformation logic and edge cases.
- Rebuilding pipelines requires careful validation of business logic and data reconciliation.
Vendor-specific nuances
- Cloud Data Fusion is built on CDAP; understanding CDAP concepts (artifacts, namespaces) helps troubleshooting.
- Some advanced patterns require custom plugins or external orchestration.
14. Comparison with Alternatives
Cloud Data Fusion is one option in a broader ecosystem of Google Cloud and third-party data analytics and pipelines tools.
Comparison table
| Option | Best For | Strengths | Weaknesses | When to Choose |
|---|---|---|---|---|
| Cloud Data Fusion (Google Cloud) | Visual ETL/ELT pipelines with plugins | Visual authoring, managed instance, plugin ecosystem, extensible via custom plugins | Instance-based cost, networking complexity for private setups, runtime depends on separate compute | When you want managed, visual pipelines and standard connectors |
| Dataflow (Google Cloud) | Serverless stream/batch processing (Apache Beam) | Fully managed execution, strong streaming, autoscaling, per-job model | More code-centric, learning curve for Beam | When you need streaming-first or code-based pipelines with serverless ops |
| Dataproc (Google Cloud) | Managed Spark/Hadoop clusters | Full control for Spark jobs, notebooks, custom runtimes | You manage more operational details; not a visual ETL tool | When you want custom Spark and cluster-level control |
| Cloud Composer (Google Cloud) | Orchestration (Airflow) | Great for scheduling/coordination across many services | Not an ETL engine by itself | When you need orchestration over many tools including Data Fusion |
| BigQuery + Dataform (Google Cloud) | SQL-first transformations in the warehouse | Strong governance for SQL transformations, fewer moving parts | Not for complex non-SQL ingestion or heavy non-SQL transforms | When most transforms are SQL and data is already in BigQuery |
| BigQuery Data Transfer Service | Managed transfers from supported sources | Simple managed transfers | Limited transformation capability | When you just need supported source → BigQuery loads |
| AWS Glue (AWS) | Visual/catalog-based ETL | Serverless ETL, deep AWS integration | Different ecosystem, migration effort to Google Cloud | When you are standardized on AWS |
| Azure Data Factory (Azure) | Visual data integration | Wide connector ecosystem, orchestration | Different ecosystem, migration effort to Google Cloud | When you are standardized on Azure |
| Apache NiFi (self-managed) | Flow-based integration with fine-grained control | Real-time flow control, great UI | You operate it; scaling and HA are your responsibility | When you need on-prem/hybrid flow control and can operate NiFi |
| Airbyte/Fivetran (managed) | ELT ingestion from SaaS/apps | Fast SaaS ingestion, managed connectors | Less flexible transforms; cost can scale with volume | When you primarily ingest SaaS/app data into a warehouse |
15. Real-World Example
Enterprise example (regulated analytics platform)
Problem A financial services company needs repeatable ingestion pipelines from multiple internal systems into BigQuery with strict access control and auditability.
Proposed architecture – Separate projects for dev/test/prod – Private Cloud Data Fusion instance in prod project – Runtime service accounts per domain with least privilege – Landing zone in Cloud Storage (raw) – Transformation pipelines in Cloud Data Fusion writing to curated BigQuery datasets – Centralized Cloud Logging with alerting for pipeline failures
Why Cloud Data Fusion was chosen – Visual pipeline authoring accelerates onboarding for multiple teams. – Managed service reduces operational burden compared to self-hosting CDAP. – Plugin model supports JDBC ingestion and standardized transformations.
Expected outcomes – Reduced time to build new ingestion pipelines (days instead of weeks) – Improved auditability via centralized pipeline definitions and logs – Better reliability through standardized runbooks and monitoring
Startup/small-team example (lean analytics stack)
Problem A startup receives daily CSV exports from partners and needs a reliable way to clean and load data into BigQuery for product analytics.
Proposed architecture – One Cloud Data Fusion instance (dev/prod may be separate later) – Cloud Storage bucket for partner drops – Simple batch pipelines: GCS → Wrangler cleanup → BigQuery – Basic alerting on failures (email/Chat via Cloud Monitoring notifications)
Why Cloud Data Fusion was chosen – Minimal custom code required – Fast iteration with Wrangler – Easy integration with BigQuery
Expected outcomes – Repeatable data loads without ad-hoc scripts – Faster analytics availability each morning – A clear path to productionization with better IAM and environment separation later
16. FAQ
1) Is Cloud Data Fusion still an active Google Cloud service?
Yes—Cloud Data Fusion is an active Google Cloud service. Always verify the latest product status and release notes in official documentation if you are planning a long-term platform decision.
2) What is the relationship between Cloud Data Fusion and CDAP?
Cloud Data Fusion is built on the open-source CDAP platform. Many concepts (artifacts, namespaces, plugins) come from CDAP.
3) Do I need to run servers to use Cloud Data Fusion?
No. The control plane is managed by Google Cloud. Your pipelines execute on managed runtime compute configured through compute profiles (often Dataproc-based); you don’t manage servers there either, but you do pay for that compute and must size it correctly.
4) Is Cloud Data Fusion serverless?
Not in the same sense as Dataflow or BigQuery. Cloud Data Fusion is instance-based, and pipeline runs incur separate runtime compute costs.
5) Can Cloud Data Fusion load data into BigQuery?
Yes, loading into BigQuery is a common use case using BigQuery sink plugins.
6) Can Cloud Data Fusion read from Cloud Storage?
Yes. Cloud Storage (GCS) file ingestion is one of the most common patterns.
7) How do I control who can edit or run pipelines?
Use Google Cloud IAM to control access to Data Fusion instances and related resources. For fine-grained separation, consider namespaces plus project-level environment separation.
8) How do pipelines authenticate to Cloud Storage and BigQuery?
Typically through service accounts and IAM permissions. Ensure the runtime identity has least-privileged access to required buckets and datasets.
9) Can I run Cloud Data Fusion in a private network?
Cloud Data Fusion supports private connectivity patterns. The setup can be more complex and is subject to region/edition constraints. Use: https://cloud.google.com/data-fusion/docs/concepts/private-ip
10) Is Cloud Data Fusion good for streaming pipelines?
Cloud Data Fusion can be used for certain streaming-style patterns depending on available plugins and runtime support. For streaming-first architectures, compare with Dataflow. Verify streaming support details in current docs for your edition.
11) How do I schedule pipelines?
Cloud Data Fusion provides deployment and run management features, and some scheduling/triggering patterns may be available. Many teams use Cloud Composer (Airflow) for orchestration across multiple systems. Verify scheduling features in current docs.
12) How do I promote pipelines from dev to prod?
A common approach is:
– Build in dev
– Export pipeline definitions and store in Git
– Import/apply in prod with controlled variables and service accounts
Exact mechanics depend on your governance model.
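One way to script the export step is against the instance's CDAP REST API. The /v3 path below is an assumption based on CDAP's public API, and CDAP_ENDPOINT and PIPELINE are placeholders; verify the endpoint and auth pattern in current docs before relying on it:

```shell
# Sketch: export a deployed pipeline's definition as JSON for Git.
# CDAP_ENDPOINT and PIPELINE are placeholders; the /v3 path is an assumption.
PIPELINE="my_pipeline"

if [ -n "${CDAP_ENDPOINT:-}" ]; then
  TOKEN=$(gcloud auth print-access-token)
  curl -s -H "Authorization: Bearer ${TOKEN}" \
    "${CDAP_ENDPOINT}/v3/namespaces/default/apps/${PIPELINE}" \
    > "${PIPELINE}.json"
  echo "exported ${PIPELINE}.json"
else
  echo "set CDAP_ENDPOINT to your instance API endpoint first"
fi
```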
13) What are common causes of pipeline failures?
– Permissions (GCS/BigQuery)
– Dataproc quotas or cluster provisioning failures
– Schema/type mismatches
– Network connectivity to external databases
14) How can I reduce Cloud Data Fusion cost?
– Stop non-prod instances when not needed (verify billing behavior)
– Use smaller compute profiles
– Avoid unnecessary reprocessing
– Reduce log verbosity and retention
15) Should I choose Cloud Data Fusion or Dataflow?
Choose Cloud Data Fusion when you want visual pipeline building and connectors with managed operations. Choose Dataflow when you need serverless streaming/batch at scale with code-based Apache Beam pipelines.
16) Can I use custom code in Cloud Data Fusion?
Yes, typically via custom plugins/artifacts or specialized transformation steps. The exact development model should be verified in official docs and your organization’s SDLC requirements.
17) How do I monitor pipelines?
Use Data Fusion run history plus Cloud Logging/Monitoring. Create alerts for failures and abnormal runtimes/volumes.
17. Top Online Resources to Learn Cloud Data Fusion
| Resource Type | Name | Why It Is Useful |
|---|---|---|
| Official documentation | Cloud Data Fusion docs — https://cloud.google.com/data-fusion/docs | Primary reference for concepts, how-tos, IAM, networking, plugins |
| Official pricing | Cloud Data Fusion pricing — https://cloud.google.com/data-fusion/pricing | Current pricing model by edition/region |
| Pricing calculator | Google Cloud Pricing Calculator — https://cloud.google.com/products/calculator | Build estimates including Dataproc/BigQuery/Storage |
| Official quickstart | Cloud Data Fusion Quickstart (docs) — https://cloud.google.com/data-fusion/docs/quickstart | Step-by-step initial setup and first pipeline |
| IAM guide | IAM for Cloud Data Fusion — https://cloud.google.com/data-fusion/docs/concepts/iam | Required roles, permission model, service identities |
| Private networking | Private IP instances — https://cloud.google.com/data-fusion/docs/concepts/private-ip | Key reference for private connectivity design and constraints |
| Dataproc pricing | Dataproc pricing — https://cloud.google.com/dataproc/pricing | Understand runtime compute cost drivers |
| BigQuery documentation | BigQuery docs — https://cloud.google.com/bigquery/docs | Best practices for datasets, partitioning, cost control |
| Cloud Skills Boost | Google Cloud Skills Boost — https://www.cloudskillsboost.google/ | Hands-on labs (search for “Data Fusion”) |
| Architecture Center | Google Cloud Architecture Center — https://cloud.google.com/architecture | Reference architectures for analytics and pipelines patterns |
18. Training and Certification Providers
| Institute | Suitable Audience | Likely Learning Focus | Mode | Website URL |
|---|---|---|---|---|
| DevOpsSchool.com | Engineers, DevOps, platform teams, learners | Cloud/DevOps training programs; verify Data Fusion coverage | Check website | https://www.devopsschool.com/ |
| ScmGalaxy.com | Beginners to intermediate engineers | DevOps/SCM learning paths; verify Google Cloud coverage | Check website | https://www.scmgalaxy.com/ |
| CloudOpsNow.in | Cloud ops and DevOps practitioners | Cloud operations and tooling; verify Data Fusion modules | Check website | https://www.cloudopsnow.in/ |
| SreSchool.com | SREs, operations engineers | Reliability, monitoring, incident response; apply to data pipelines ops | Check website | https://www.sreschool.com/ |
| AiOpsSchool.com | Ops teams and engineers | AIOps concepts, observability; relevant for pipeline monitoring | Check website | https://www.aiopsschool.com/ |
19. Top Trainers
| Platform/Site | Likely Specialization | Suitable Audience | Website URL |
|---|---|---|---|
| RajeshKumar.xyz | Trainer profile site (verify exact offerings) | Learners seeking trainer-led coaching | https://rajeshkumar.xyz/ |
| devopstrainer.in | DevOps training platform (verify curricula) | DevOps/cloud learners | https://www.devopstrainer.in/ |
| devopsfreelancer.com | Freelance DevOps/community platform (verify services) | Teams seeking contract training/support | https://www.devopsfreelancer.com/ |
| devopssupport.in | DevOps support/training platform (verify scope) | Ops teams needing guided support | https://www.devopssupport.in/ |
20. Top Consulting Companies
| Company Name | Likely Service Area | Where They May Help | Consulting Use Case Examples | Website URL |
|---|---|---|---|---|
| cotocus.com | Cloud/DevOps/engineering services (verify exact portfolio) | Architecture, implementation support, operationalization | Data pipeline platform setup, IAM hardening, CI/CD for pipelines | https://cotocus.com/ |
| DevOpsSchool.com | Training and consulting (verify exact offerings) | Enablement + advisory | Team enablement on Google Cloud operations and delivery practices | https://www.devopsschool.com/ |
| DEVOPSCONSULTING.IN | DevOps consulting (verify exact offerings) | DevOps and operations practices | Observability setup for data pipelines, deployment process improvements | https://www.devopsconsulting.in/ |
21. Career and Learning Roadmap
What to learn before Cloud Data Fusion
- Google Cloud fundamentals – Projects, billing, IAM, service accounts
- Storage and analytics basics – Cloud Storage buckets, object paths, lifecycle rules – BigQuery datasets, tables, partitioning
- Networking basics – VPC concepts, firewall rules, private access patterns (especially if your org uses private-only)
- Data engineering fundamentals – ETL vs ELT, schemas, data types, incremental loads, data quality
What to learn after Cloud Data Fusion
- Orchestration
- Cloud Composer (Airflow) for multi-step workflows and cross-service orchestration
- Streaming
- Pub/Sub + Dataflow for streaming-first workloads
- Warehouse modeling and governance
- BigQuery performance/cost optimization
- Dataform for SQL-based transformations
- Observability
- Cloud Monitoring dashboards, SLOs for data pipelines, log-based alerts
Job roles that use it
- Data Engineer
- Analytics Engineer (for ingestion/curation workflows)
- Cloud Engineer (data platform)
- DevOps / Platform Engineer supporting data systems
- SRE for data platforms (reliability and operations)
Certification path (if available)
Google Cloud certifications that align well (even if not Data Fusion-specific): – Associate Cloud Engineer – Professional Cloud Architect – Professional Data Engineer (if currently offered—verify current certification catalog)
Certification catalog: – https://cloud.google.com/learn/certification
Project ideas for practice
- Build a raw→clean→curated pipeline framework with standardized naming
- Add a quarantine output for invalid records and build a dashboard for data quality
- Implement incremental loads into partitioned BigQuery tables
- Create a dev/test/prod promotion workflow using exported pipeline definitions in Git
- Compare cost and performance of Spark-based transforms vs BigQuery SQL pushdown (where applicable)
22. Glossary
- Artifact: A packaged bundle in CDAP/Data Fusion containing plugins or applications, often versioned.
- Batch pipeline: A pipeline that processes bounded datasets (files, snapshots) as discrete runs.
- BigQuery: Google Cloud’s serverless data warehouse.
- Cloud Data Fusion instance: Regional managed environment hosting the Data Fusion UI and control plane.
- Compute profile: Configuration that defines where/how pipelines execute (often Dataproc-based runtime configuration).
- Control plane: Management components (UI, metadata, pipeline definitions).
- Data plane: The runtime execution environment where data is processed.
- Dataproc: Managed Spark/Hadoop service commonly used as execution engine for Data Fusion pipelines.
- ELT: Extract, Load, Transform (transformations done after loading, often in warehouse).
- ETL: Extract, Transform, Load (transformations before loading into warehouse).
- IAM: Identity and Access Management; controls permissions.
- Namespace: Logical partition inside a Cloud Data Fusion instance for organizing assets.
- Plugin: A connector or transformation step used in a pipeline (source/sink/transform).
- Runtime identity: Service account/identity used when the pipeline executes and accesses data.
- Wrangler: Interactive data preparation interface for cleaning and shaping datasets.
23. Summary
Cloud Data Fusion is Google Cloud’s managed, visual service for building data integration pipelines in the Data analytics and pipelines category. It combines drag-and-drop pipeline design, interactive data preparation, and a plugin ecosystem to help teams ingest, transform, and load data into platforms like BigQuery.
Architecturally, Cloud Data Fusion separates the instance control plane (authoring and management) from runtime execution (commonly Dataproc-based). This improves manageability but introduces two major cost and operations considerations: instance runtime costs and pipeline execution compute costs.
For security, focus on least-privilege IAM for runtime identities, region and networking decisions (public vs private), and careful secrets handling. For cost, avoid leaving non-prod instances running, right-size compute profiles, co-locate resources, and manage logging volume.
Use Cloud Data Fusion when you want a managed, visual approach to building pipelines quickly with standardized connectors; consider alternatives like Dataflow or BigQuery-native transformation tooling when your needs are streaming-first or SQL-first.
Next step: run the hands-on lab again with a larger dataset, add a quarantine output for invalid records, and implement incremental loading into partitioned BigQuery tables using a dev/test/prod promotion workflow.