{"id":656,"date":"2026-04-14T22:10:30","date_gmt":"2026-04-14T22:10:30","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/google-cloud-datastream-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-data-analytics-and-pipelines\/"},"modified":"2026-04-14T22:10:30","modified_gmt":"2026-04-14T22:10:30","slug":"google-cloud-datastream-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-data-analytics-and-pipelines","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/google-cloud-datastream-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-data-analytics-and-pipelines\/","title":{"rendered":"Google Cloud Datastream Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Data analytics and pipelines"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>Data analytics and pipelines<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Datastream is a managed change data capture (CDC) and replication service on <strong>Google Cloud<\/strong>. It continuously captures changes from supported source databases (for example, PostgreSQL, MySQL, and Oracle) and delivers them to analytics destinations such as <strong>BigQuery<\/strong> or to <strong>Cloud Storage<\/strong> for downstream processing.<\/p>\n\n\n\n<p>In simple terms: Datastream helps you keep a near-real-time copy of your operational database changes flowing into your analytics platform, without you having to run and maintain your own CDC tooling, connectors, or Kafka infrastructure.<\/p>\n\n\n\n<p>Technically, Datastream establishes a secure connection to a source database, performs an optional initial <strong>backfill<\/strong> (historical snapshot), then continuously reads database logs (CDC) to emit insert\/update\/delete changes into a destination. 
You manage <strong>connection profiles<\/strong> and <strong>streams<\/strong> as regional resources, and Google Cloud operates the underlying CDC pipeline and scaling.<\/p>\n\n\n\n<p>Datastream solves a common problem in <strong>Data analytics and pipelines<\/strong>: getting reliable, low-ops, incremental data movement from transactional systems into analytics systems with predictable latency, strong observability, and fewer moving parts than self-managed replication stacks.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Datastream?<\/h2>\n\n\n\n<p><strong>Official purpose (what it\u2019s for)<\/strong><br\/>\nDatastream is Google Cloud\u2019s managed service for <strong>database replication and change data capture<\/strong> into Google Cloud destinations\u2014most commonly <strong>BigQuery<\/strong> for analytics and <strong>Cloud Storage<\/strong> for building data lakes and downstream pipelines. (Always confirm the latest supported sources\/destinations in the official docs, as support expands over time.)<\/p>\n\n\n\n<p><strong>Core capabilities<\/strong>\n&#8211; <strong>Initial backfill<\/strong> of selected schemas\/tables (a consistent snapshot, subject to source engine capabilities and configuration).\n&#8211; <strong>Continuous CDC<\/strong> (capturing ongoing changes using database log mechanisms).\n&#8211; <strong>Object selection<\/strong> so you can choose which databases\/schemas\/tables to replicate.\n&#8211; <strong>Managed connectivity<\/strong> options for public or private networking (private is recommended for production).\n&#8211; <strong>Operational visibility<\/strong> into stream status, throughput\/latency metrics, and errors via Google Cloud observability tools.<\/p>\n\n\n\n<p><strong>Major components<\/strong>\n&#8211; <strong>Connection profiles<\/strong>: Define how Datastream connects to a <em>source<\/em> (database credentials, host, port, SSL settings) or a <em>destination<\/em> (for example, BigQuery or Cloud 
Storage).\n&#8211; <strong>Streams<\/strong>: Define the replication job\u2014source profile + destination profile + selection rules + backfill settings + runtime state.\n&#8211; <strong>Private connectivity resources<\/strong> (if used): Datastream-managed private networking constructs that allow access to sources on private IPs (for example in a VPC or on-prem via VPN\/Interconnect). Exact terminology and setup steps can vary\u2014verify in official docs.<\/p>\n\n\n\n<p><strong>Service type<\/strong>\n&#8211; Fully managed <strong>data replication \/ CDC<\/strong> service (you drive the control plane through configuration; Google Cloud operates the data plane).<\/p>\n\n\n\n<p><strong>Scope and geography<\/strong>\n&#8211; Datastream resources (such as streams and connection profiles) are typically <strong>regional<\/strong> and belong to a <strong>Google Cloud project<\/strong>. You choose a region for Datastream, and you generally place sources\/destinations and networking close to that region for latency and cost reasons.<br\/>\n  Verify the exact regional availability and resource scoping in: https:\/\/cloud.google.com\/datastream\/docs\/locations<\/p>\n\n\n\n<p><strong>How it fits into the Google Cloud ecosystem<\/strong>\nDatastream often sits between:\n&#8211; <strong>Operational databases<\/strong> (Cloud SQL, self-managed databases on Compute Engine, or on-prem databases reachable via Cloud VPN\/Interconnect)<br\/>\nand\n&#8211; <strong>Analytics and storage destinations<\/strong> (BigQuery and Cloud Storage), where you can then use <strong>Dataflow<\/strong>, <strong>Dataproc<\/strong>, <strong>BigQuery SQL<\/strong>, or <strong>Dataplex<\/strong>\/governance tooling to build end-to-end pipelines.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Datastream?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster analytics<\/strong>: Get operational data into BigQuery with lower latency than batch exports.<\/li>\n<li><strong>Lower engineering effort<\/strong>: Reduce the need to build and maintain custom CDC pipelines.<\/li>\n<li><strong>Better data timeliness for decisions<\/strong>: Near-real-time dashboards and operational reporting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Managed CDC<\/strong> avoids running Debezium\/Kafka Connect clusters or homegrown log parsers.<\/li>\n<li><strong>Backfill + CDC in one service<\/strong> reduces multi-tool complexity.<\/li>\n<li><strong>Destination alignment with Google Cloud analytics<\/strong> (BigQuery and Cloud Storage) streamlines architectures.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Less infrastructure to patch and scale<\/strong> compared to self-managed connectors.<\/li>\n<li><strong>Centralized monitoring and logging<\/strong> via Cloud Monitoring and Cloud Logging.<\/li>\n<li><strong>Declarative configuration<\/strong> (streams\/profiles) encourages repeatability.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Supports private networking patterns (recommended) to avoid exposing databases publicly.<\/li>\n<li>Integrates with Google Cloud IAM, audit logs, and (depending on destination) encryption controls.<\/li>\n<li>Helps implement \u201cleast privilege\u201d by using dedicated database users and scoped BigQuery permissions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Designed to handle continuous change volumes without you provisioning ingestion 
clusters.<\/li>\n<li>Scales operationally as a managed service (though you still must plan for source database impact and destination ingestion costs\/limits).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Datastream<\/h3>\n\n\n\n<p>Choose Datastream if you need:\n&#8211; CDC from supported relational sources into BigQuery\/Cloud Storage\n&#8211; A managed replication service with operational simplicity\n&#8211; A pattern that fits Google Cloud-native analytics and governance<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose Datastream<\/h3>\n\n\n\n<p>Avoid (or reconsider) Datastream if:\n&#8211; Your source engine\/version isn\u2019t supported.\n&#8211; You need a destination not supported (for example, direct Kafka topics or arbitrary HTTP sinks). You may need Cloud Storage + Dataflow or a different tool.\n&#8211; You require complex transformations inline during capture (Datastream is primarily replication\/CDC; transformations are typically downstream in Dataflow\/BigQuery).\n&#8211; You need multi-master conflict resolution or bidirectional replication (Datastream is typically one-way replication).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Datastream used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Retail\/e-commerce: orders, inventory, pricing changes into BigQuery<\/li>\n<li>Financial services: transaction events into analytics (with strong security controls)<\/li>\n<li>Healthcare: operational system extracts into governed analytics environments<\/li>\n<li>SaaS: product usage events stored in relational DBs replicated to BigQuery<\/li>\n<li>Gaming\/media: player\/session data replicated for near-real-time insights<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data engineering teams building analytics pipelines<\/li>\n<li>Platform teams standardizing ingestion patterns<\/li>\n<li>SRE\/operations teams reducing bespoke replication tooling<\/li>\n<li>Security teams enforcing secure connectivity and auditability<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Near-real-time BI (BigQuery + Looker)<\/li>\n<li>Operational analytics (freshness in minutes)<\/li>\n<li>Data lake landing zones (Cloud Storage) + downstream processing<\/li>\n<li>Migration\/modernization: run old and new systems in parallel while replicating data<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OLTP \u2192 CDC \u2192 BigQuery (analytics)<\/li>\n<li>OLTP \u2192 CDC \u2192 Cloud Storage \u2192 Dataflow \u2192 BigQuery (custom transforms)<\/li>\n<li>On-prem DB \u2192 CDC \u2192 Google Cloud landing zone (hybrid)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production<\/strong>: private connectivity, least-privilege IAM, strong monitoring, explicit cost controls<\/li>\n<li><strong>Dev\/test<\/strong>: smaller datasets, limited table selection, shorter retention, more frequent 
teardown<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic scenarios where Datastream commonly fits.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Near-real-time operational reporting in BigQuery<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Batch ETL (nightly) is too slow for business users.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Managed CDC keeps BigQuery updated continuously.<\/li>\n<li><strong>Example<\/strong>: Customer support dashboard shows order status changes within minutes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Building a Cloud Storage landing zone (raw CDC lake)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You want raw change logs for replay, audits, or multiple downstream consumers.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Can deliver to Cloud Storage for flexible processing (verify formats supported for your configuration).<\/li>\n<li><strong>Example<\/strong>: Store CDC files in Cloud Storage, then run Dataflow pipelines for multiple marts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Modernizing analytics from on-prem databases<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: On-prem Oracle\/PostgreSQL is hard to scale for analytics workloads.<\/li>\n<li><strong>Why Datastream fits<\/strong>: CDC over hybrid connectivity moves data continuously to Google Cloud.<\/li>\n<li><strong>Example<\/strong>: On-prem Oracle changes stream to BigQuery via Datastream, while apps remain on-prem.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Keeping a serving layer updated<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You need a denormalized dataset for APIs or search, derived from relational DB changes.<\/li>\n<li><strong>Why Datastream fits<\/strong>: CDC provides timely updates you can transform 
downstream.<\/li>\n<li><strong>Example<\/strong>: Datastream \u2192 Cloud Storage \u2192 Dataflow \u2192 (destination of your choice) for serving.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Incremental data ingestion for ML feature pipelines<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: ML features in BigQuery are stale due to batch loads.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Fresh updates to feature tables using CDC ingestion.<\/li>\n<li><strong>Example<\/strong>: Fraud model features update as transactions change.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Reducing load on production databases<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Analytics queries on OLTP cause performance issues.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Replicate changes to BigQuery; query BigQuery instead.<\/li>\n<li><strong>Example<\/strong>: Product managers query BigQuery instead of hitting the primary DB.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Event-driven analytics without rewriting apps<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Legacy apps write only to a relational DB, not to an event bus.<\/li>\n<li><strong>Why Datastream fits<\/strong>: CDC effectively produces an event stream derived from DB logs.<\/li>\n<li><strong>Example<\/strong>: Order updates become CDC events used downstream for metrics and alerting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Data validation during migrations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You need to compare old system data vs new system data continuously.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Replicate to BigQuery, then run validation queries.<\/li>\n<li><strong>Example<\/strong>: Cloud SQL (new) vs on-prem DB (old) differences tracked in BigQuery.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Multi-environment test data refresh 
(subset)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Test environments need recent production-like data without full dumps.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Select only required tables\/schemas and replicate to a non-prod BigQuery dataset.<\/li>\n<li><strong>Example<\/strong>: A small subset of customer and product tables replicated nightly + CDC (if allowed).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Building an audit trail of row-level changes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You need traceability of how records changed over time.<\/li>\n<li><strong>Why Datastream fits<\/strong>: CDC produces change events that can be stored and queried.<\/li>\n<li><strong>Example<\/strong>: Append-only history tables in BigQuery created from CDC.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) Cost-optimized ingestion with downstream batching<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: You want CDC but also want to batch downstream transformations for cost.<\/li>\n<li><strong>Why Datastream fits<\/strong>: Capture continuously, transform on schedule (Dataflow\/BigQuery).<\/li>\n<li><strong>Example<\/strong>: Datastream to Cloud Storage; scheduled BigQuery loads every 15 minutes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Centralized ingestion standard for many app databases<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: Each team built its own connector; operations are inconsistent.<\/li>\n<li><strong>Why Datastream fits<\/strong>: A single managed ingestion pattern using standard IAM, monitoring, and governance.<\/li>\n<li><strong>Example<\/strong>: Platform team offers Datastream streams as a self-service capability.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<p>The exact feature set can evolve. 
Confirm current details in the official docs: https:\/\/cloud.google.com\/datastream\/docs<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Change Data Capture (CDC)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Continuously captures inserts\/updates\/deletes from source DB logs.<\/li>\n<li><strong>Why it matters<\/strong>: Provides near-real-time data freshness for analytics and downstream processing.<\/li>\n<li><strong>Practical benefit<\/strong>: Fewer full reloads; reduced load on sources and destinations.<\/li>\n<li><strong>Caveats<\/strong>: Requires correct source configuration (replication\/logging settings, privileges). CDC can increase WAL\/binlog\/redo log retention needs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Initial Backfill (historical load)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Copies existing table contents to the destination before CDC keeps it updated.<\/li>\n<li><strong>Why it matters<\/strong>: Avoids the \u201cstart from now only\u201d limitation.<\/li>\n<li><strong>Practical benefit<\/strong>: You get complete tables plus ongoing changes.<\/li>\n<li><strong>Caveats<\/strong>: Backfill can be time-consuming and can load the source. Plan maintenance windows and throttle if supported (verify in official docs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Connection profiles (source and destination)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Stores connectivity details (host\/port\/credentials, SSL, etc.) for endpoints.<\/li>\n<li><strong>Why it matters<\/strong>: Separation of concerns\u2014reuse profiles across streams; simplify rotations and changes.<\/li>\n<li><strong>Practical benefit<\/strong>: Standardized onboarding for multiple sources\/destinations.<\/li>\n<li><strong>Caveats<\/strong>: Treat credentials as sensitive. 
Prefer secret management patterns and least privilege.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Stream configuration and object selection<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Defines which schemas\/tables (and sometimes columns\u2014verify) to replicate.<\/li>\n<li><strong>Why it matters<\/strong>: Limits replication to what you actually need, controlling cost and risk.<\/li>\n<li><strong>Practical benefit<\/strong>: Reduce destination clutter and ingestion spend.<\/li>\n<li><strong>Caveats<\/strong>: Changes to selection rules can require stream updates; ensure you understand how backfill behaves when adding objects.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Private connectivity options<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Allows Datastream to reach sources on private networks (VPC, on-prem via VPN\/Interconnect) without exposing them to the public internet.<\/li>\n<li><strong>Why it matters<\/strong>: Stronger security posture and simpler compliance alignment.<\/li>\n<li><strong>Practical benefit<\/strong>: No public IP on databases; fewer firewall exceptions.<\/li>\n<li><strong>Caveats<\/strong>: Requires VPC planning, routing, and sometimes IP range allocations. Setup varies by scenario\u2014verify the current recommended method in official docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) BigQuery destination support<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Writes replicated data into BigQuery datasets\/tables for analytics.<\/li>\n<li><strong>Why it matters<\/strong>: Removes the need to build custom ingestion into BigQuery.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster time to value for analytics.<\/li>\n<li><strong>Caveats<\/strong>: Understand table\/metadata behavior, schema mapping, and how deletes\/updates are represented. 
Verify the exact BigQuery output model in official docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Cloud Storage destination support<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Writes change events and\/or backfill outputs into Cloud Storage.<\/li>\n<li><strong>Why it matters<\/strong>: Cloud Storage is a flexible landing zone for many pipeline patterns.<\/li>\n<li><strong>Practical benefit<\/strong>: You can replay and reprocess CDC data with Dataflow\/Dataproc\/BigQuery external tables.<\/li>\n<li><strong>Caveats<\/strong>: You must design downstream consumption, partitioning, and lifecycle policies. Output formats and file layout can vary\u2014verify in official docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Monitoring, status, and error reporting<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Exposes stream health, lag, throughput, and error conditions.<\/li>\n<li><strong>Why it matters<\/strong>: CDC pipelines must be observable to be trusted.<\/li>\n<li><strong>Practical benefit<\/strong>: Faster incident response and capacity planning.<\/li>\n<li><strong>Caveats<\/strong>: Metrics availability and names can change; confirm what\u2019s emitted in Cloud Monitoring.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) IAM and auditability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does<\/strong>: Uses Google Cloud IAM to control who can create\/modify streams and who can view them, and logs admin activities.<\/li>\n<li><strong>Why it matters<\/strong>: CDC touches sensitive data and production systems.<\/li>\n<li><strong>Practical benefit<\/strong>: Principle of least privilege and governance alignment.<\/li>\n<li><strong>Caveats<\/strong>: You must also secure the source database credentials and the destination datasets\/buckets.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">7. 
Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level service architecture<\/h3>\n\n\n\n<p>Datastream operates as a managed CDC pipeline:\n1. You configure <strong>connection profiles<\/strong> to the source database and destination (BigQuery or Cloud Storage).\n2. You create a <strong>stream<\/strong> that defines selection rules and whether to backfill.\n3. Datastream connects to the source, performs backfill (if enabled), then switches to CDC mode.\n4. Changes are delivered continuously to the destination.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane<\/strong>: You (or CI\/CD) call the Datastream API\/Console to create\/update streams and profiles.<\/li>\n<li><strong>Data plane<\/strong>: Datastream reads from the source database\u2019s replication mechanism and writes to the destination.<\/li>\n<li><strong>Observability plane<\/strong>: Stream state and errors appear in Cloud Logging\/Monitoring.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services<\/h3>\n\n\n\n<p>Common patterns in Google Cloud:\n&#8211; <strong>BigQuery<\/strong>: Analytics destination (dashboards, ad-hoc SQL, ML).\n&#8211; <strong>Cloud Storage<\/strong>: Landing zone for raw CDC files.\n&#8211; <strong>Dataflow<\/strong>: Transform and load CDC outputs into curated tables (especially when using Cloud Storage destination).\n&#8211; <strong>Cloud SQL \/ Compute Engine \/ on-prem<\/strong>: Source database hosting.\n&#8211; <strong>Cloud VPN \/ Cloud Interconnect<\/strong>: Hybrid connectivity to on-prem sources.\n&#8211; <strong>Cloud Monitoring + Cloud Logging<\/strong>: Alerts, dashboards, and audit trails.\n&#8211; <strong>Dataplex<\/strong> (optional): Governance over datasets and storage zones (verify best practices for CDC data).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services (typical)<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Source database and its replication\/logging config<\/li>\n<li>Networking (VPC, firewall rules, routes; possibly Service Networking \/ Private Service Connect depending on connectivity mode)<\/li>\n<li>Destination service (BigQuery dataset permissions, Cloud Storage bucket permissions)<\/li>\n<li>IAM policies for administrators and service agents<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Google Cloud IAM<\/strong> controls management operations (create stream, update profile, view).<\/li>\n<li>Datastream uses a <strong>Google-managed service agent<\/strong> in your project to access destinations (and possibly to manage some resources). You grant permissions to that service agent for BigQuery\/Cloud Storage as required.<\/li>\n<li>Access to the source database is authenticated using database credentials (user\/password) and optionally TLS certificates, depending on the engine and configuration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<p>Two common approaches:\n&#8211; <strong>Private connectivity (recommended for production)<\/strong>: Datastream reaches the source over private IP addressing via VPC\/hybrid connectivity.\n&#8211; <strong>Public IP connectivity<\/strong>: Datastream reaches the source over public internet, typically requiring IP allowlisting and TLS. This is simpler for labs but riskier for production.<\/p>\n\n\n\n<p>The exact connectivity setup depends on source type and location. 
Always follow the current Datastream connectivity guide: https:\/\/cloud.google.com\/datastream\/docs\/configure-connectivity<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Create Cloud Monitoring alerts on stream failure, lag, or throughput drops.<\/li>\n<li>Export logs to a SIEM if needed.<\/li>\n<li>Apply data governance to destination datasets\/buckets (labels, retention, access controls).<\/li>\n<li>Treat CDC as sensitive: changes can include PII and secrets if present in the DB.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  A[(\"Source DB: PostgreSQL \/ MySQL \/ Oracle\")] --&gt;|CDC + optional backfill| B[\"Datastream (regional)\"]\n  B --&gt; C[(BigQuery)]\n  B --&gt; D[(Cloud Storage)]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph OnPrem[\"On\u2011prem \/ Self-managed network\"]\n    DB[(Production DB)]\n  end\n\n  subgraph GCP[\"Google Cloud project\"]\n    VPN[\"Cloud VPN \/ Interconnect\"]\n    VPC[VPC Network]\n    DS[\"Datastream (Region)\"]\n    BQ[(BigQuery Datasets)]\n    GCS[(Cloud Storage Landing Bucket)]\n    DF[\"Dataflow (optional transforms)\"]\n    MON[Cloud Monitoring + Logging]\n    KMS[\"Cloud KMS (dest encryption controls)\"]\n  end\n\n  DB --- VPN --- VPC\n  VPC --&gt;|Private connectivity| DS\n  DS --&gt; BQ\n  DS --&gt; GCS\n  GCS --&gt; DF --&gt; BQ\n  DS --&gt; MON\n  BQ --&gt; MON\n  GCS --&gt; MON\n  KMS -.-&gt; BQ\n  KMS -.-&gt; GCS\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8. 
Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Google Cloud account\/project<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A <strong>Google Cloud project<\/strong> with <strong>billing enabled<\/strong>.<\/li>\n<li>You should choose a region where Datastream is available. Verify regions: https:\/\/cloud.google.com\/datastream\/docs\/locations<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p>You need permissions to:\n&#8211; Enable APIs\n&#8211; Create and manage Datastream resources (streams, connection profiles, private connectivity)\n&#8211; Create\/manage BigQuery datasets and tables (or at least grant permissions to Datastream\u2019s service agent)\n&#8211; Create\/manage Cloud SQL if you use it as a source (for the lab)<\/p>\n\n\n\n<p>Typical roles (verify exact role names and best-practice combinations):\n&#8211; Datastream administration: check IAM roles at https:\/\/cloud.google.com\/datastream\/docs\/access-control\n&#8211; BigQuery permissions: <code>roles\/bigquery.admin<\/code> (broad) or dataset-scoped permissions for least privilege\n&#8211; Cloud SQL admin: <code>roles\/cloudsql.admin<\/code> (for the lab)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<p>Costs may come from:\n&#8211; Datastream usage\n&#8211; BigQuery storage and query costs\n&#8211; Cloud SQL instance\/runtime costs\n&#8211; Cloud Storage (if used)\n&#8211; Dataflow (if used)\n&#8211; Network egress (depending on topology)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tools<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Google Cloud Console<\/strong> access<\/li>\n<li><strong>gcloud CLI<\/strong> installed and authenticated: https:\/\/cloud.google.com\/sdk\/docs\/install<\/li>\n<li>A SQL client (for PostgreSQL: <code>psql<\/code>) for inserting sample data<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">APIs to enable<\/h3>\n\n\n\n<p>Typically:\n&#8211; Datastream API\n&#8211; BigQuery 
API\n&#8211; Cloud SQL Admin API\n&#8211; Compute Engine API (often required for networking\/VPC operations)\n&#8211; Service Networking API (if using Cloud SQL private IP)<\/p>\n\n\n\n<p>Enable only what you use.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<p>Datastream, BigQuery, and Cloud SQL each have quotas (streams per project\/region, connection limits, BigQuery ingestion\/storage limits, etc.). Review:\n&#8211; Datastream quotas\/limits: verify in official docs\n&#8211; BigQuery quotas: https:\/\/cloud.google.com\/bigquery\/quotas\n&#8211; Cloud SQL limits: https:\/\/cloud.google.com\/sql\/quotas<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services (for this tutorial lab)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Cloud SQL for PostgreSQL (source)<\/li>\n<li>BigQuery dataset (destination)<\/li>\n<li>A VPC network (for private connectivity in the lab)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p>Datastream pricing is <strong>usage-based<\/strong> and depends on factors such as:\n&#8211; Amount of data processed for <strong>backfill<\/strong>\n&#8211; Ongoing <strong>CDC change volume<\/strong> (bytes processed)\n&#8211; Potentially other dimensions depending on the product\u2019s current SKUs (for example, regional pricing differences)<\/p>\n\n\n\n<p>Because pricing and SKUs can change and vary by region, always validate on the official pages:\n&#8211; Datastream pricing: https:\/\/cloud.google.com\/datastream\/pricing<br\/>\n&#8211; Google Cloud Pricing Calculator: https:\/\/cloud.google.com\/products\/calculator<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (typical cost drivers)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>CDC volume<\/strong>: The more row changes (and larger rows), the more Datastream processes.<\/li>\n<li><strong>Backfill volume<\/strong>: Initial snapshot size can be large and its cost 
significant.<\/li>\n<li><strong>Region<\/strong>: Pricing can differ by region.<\/li>\n<li><strong>Destination costs<\/strong>:\n   &#8211; <strong>BigQuery<\/strong>: storage, streaming\/ingestion behavior (implementation-specific), and queries by users.\n   &#8211; <strong>Cloud Storage<\/strong>: object storage, operations, lifecycle, and retrieval.<\/li>\n<li><strong>Source costs<\/strong>:\n   &#8211; Source DB CPU\/IO overhead from replication\/log reading.\n   &#8211; WAL\/binlog\/redo retention and storage overhead.<\/li>\n<li><strong>Network costs<\/strong>:\n   &#8211; Cross-region data transfer can add cost.\n   &#8211; Hybrid egress (on-prem to cloud) may incur network charges depending on connectivity.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<p>Datastream free-tier availability can change. <strong>Verify in official pricing docs<\/strong> whether any free tier or trial credits apply.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs to plan for<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>BigQuery query spend (analysts exploring newly replicated datasets can generate costs quickly).<\/li>\n<li>Cloud SQL sizing (logical decoding\/replication overhead can require larger instances or more storage).<\/li>\n<li>Cloud Storage lifecycle misconfiguration (retaining raw CDC files forever).<\/li>\n<li>Cross-region replication (placing Datastream in a different region than source\/destination).<\/li>\n<li>Operational overhead: alerts, dashboards, and incident response time (even managed services need ops).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Replicate only what you need (schemas\/tables; possibly columns\u2014verify).<\/li>\n<li>Avoid replicating huge historical tables unless needed; consider staged backfill.<\/li>\n<li>Place Datastream in the same region as destination and near the source network entry point.<\/li>\n<li>If 
using Cloud Storage as a landing zone, set lifecycle policies (for example, transition\/delete older CDC files).<\/li>\n<li>In BigQuery, partition\/cluster curated tables and control who can run expensive queries.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (conceptual)<\/h3>\n\n\n\n<p>A starter lab typically includes:\n&#8211; A small Cloud SQL instance\n&#8211; A small BigQuery dataset\n&#8211; One Datastream stream replicating one schema with a few small tables and low CDC volume<\/p>\n\n\n\n<p>To estimate:\n1. Use the Pricing Calculator for Cloud SQL hourly cost.\n2. Add Datastream estimated processed GB for backfill + daily changes.\n3. Add BigQuery storage for replicated tables and expected query usage.<\/p>\n\n\n\n<p>Because actual SKUs and volumes vary, <strong>do not rely on fixed numbers<\/strong>\u2014use the calculator and measure real change volume.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>In production, the main drivers are:\n&#8211; Large backfills (TB-scale)\n&#8211; High-velocity CDC (many updates\/sec)\n&#8211; Multiple streams for many databases\n&#8211; BigQuery downstream query usage (often the largest ongoing spend)<\/p>\n\n\n\n<p>A practical approach:\n&#8211; Start with one \u201cpilot\u201d stream, measure bytes processed and freshness, then extrapolate.\n&#8211; Set budgets and alerts in Google Cloud Billing.\n&#8211; Enforce table selection standards to prevent replicating entire databases by default.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">10. 
Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>This lab builds a small but real CDC pipeline:\n&#8211; <strong>Source<\/strong>: Cloud SQL for PostgreSQL (private IP)\n&#8211; <strong>Replication<\/strong>: Datastream stream with backfill + CDC\n&#8211; <strong>Destination<\/strong>: BigQuery dataset<\/p>\n\n\n\n<p>The goal is to keep cost low while using production-aligned patterns (private connectivity, least privilege).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Create a Datastream stream that replicates a PostgreSQL table from Cloud SQL into BigQuery, validate by inserting rows and observing them appear in BigQuery, then clean up all resources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. Create networking required for Cloud SQL private IP.\n2. Create a Cloud SQL for PostgreSQL instance and configure logical replication.\n3. Create a sample database\/table and a dedicated replication user.\n4. Create a BigQuery dataset.\n5. Create Datastream connection profiles (source\/destination).\n6. Create a Datastream stream with backfill + CDC.\n7. Validate data appears in BigQuery.\n8. Troubleshoot common issues.\n9. Clean up resources to avoid ongoing costs.<\/p>\n\n\n\n<blockquote>\n<p>Notes before you begin:\n&#8211; Datastream is regional. Pick one region and keep Cloud SQL and BigQuery dataset in that region\/multi-region as appropriate.\n&#8211; Some flags\/settings differ by PostgreSQL version and Cloud SQL capabilities. 
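The regional scoping shows up in the CLI as well: Datastream list and describe commands require an explicit location. A quick way to confirm your chosen region and see what already exists there (a sketch; assumes the Datastream API is enabled in your project):

```shell
# Datastream resources are regional ("locations" in the API), so every
# list call must be scoped to a region.
gcloud datastream streams list --location="${REGION}"
gcloud datastream connection-profiles list --location="${REGION}"
```
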
If anything conflicts, follow Cloud SQL + Datastream official docs for PostgreSQL sources.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Select project, region, and enable APIs<\/h3>\n\n\n\n<p>1) Set variables (replace placeholders):<\/p>\n\n\n\n<pre><code class=\"language-bash\">export PROJECT_ID=\"YOUR_PROJECT_ID\"\nexport REGION=\"us-central1\"   # choose a Datastream-supported region\nexport ZONE=\"us-central1-a\"\ngcloud config set project \"${PROJECT_ID}\"\n<\/code><\/pre>\n\n\n\n<p>2) Enable required APIs:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services enable \\\n  datastream.googleapis.com \\\n  bigquery.googleapis.com \\\n  sqladmin.googleapis.com \\\n  compute.googleapis.com \\\n  servicenetworking.googleapis.com\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: APIs are enabled successfully.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services list --enabled --filter=\"name:datastream.googleapis.com\"\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create a VPC and subnet (for private connectivity)<\/h3>\n\n\n\n<p>Create a dedicated VPC for the lab:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export VPC_NAME=\"ds-lab-vpc\"\nexport SUBNET_NAME=\"ds-lab-subnet\"\nexport SUBNET_RANGE=\"10.10.0.0\/24\"\n\ngcloud compute networks create \"${VPC_NAME}\" --subnet-mode=custom\n\ngcloud compute networks subnets create \"${SUBNET_NAME}\" \\\n  --network=\"${VPC_NAME}\" \\\n  --region=\"${REGION}\" \\\n  --range=\"${SUBNET_RANGE}\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: VPC and subnet exist in your project.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute networks describe \"${VPC_NAME}\"\ngcloud compute networks subnets describe \"${SUBNET_NAME}\" --region \"${REGION}\"\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Reserve an IP range and 
enable Private Service Access (for Cloud SQL private IP)<\/h3>\n\n\n\n<p>Cloud SQL private IP requires a reserved range and a Service Networking connection.<\/p>\n\n\n\n<p>1) Reserve an internal range for Google-managed services:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export PSA_RANGE_NAME=\"ds-lab-psa-range\"\n\ngcloud compute addresses create \"${PSA_RANGE_NAME}\" \\\n  --global \\\n  --purpose=VPC_PEERING \\\n  --prefix-length=16 \\\n  --network=\"${VPC_NAME}\"\n<\/code><\/pre>\n\n\n\n<p>2) Create the Service Networking connection:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services vpc-peerings connect \\\n  --service=servicenetworking.googleapis.com \\\n  --network=\"${VPC_NAME}\" \\\n  --ranges=\"${PSA_RANGE_NAME}\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: The VPC has Private Service Access configured.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud services vpc-peerings list --network=\"${VPC_NAME}\"\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create a Cloud SQL for PostgreSQL instance (private IP)<\/h3>\n\n\n\n<p>1) Create the instance (choose a small tier for cost). 
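To compare machine tiers before creating the instance, you can list the predefined options (a quick sketch; custom tiers such as the db-custom-* form used below are specified directly rather than listed):

```shell
# List predefined Cloud SQL machine tiers with their RAM and disk quotas
# to sanity-check sizing before creating the instance.
gcloud sql tiers list
```
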
PostgreSQL version selection matters\u2014use a Datastream-supported version (verify in Datastream docs for PostgreSQL support matrix).<\/p>\n\n\n\n<pre><code class=\"language-bash\">export SQL_INSTANCE=\"ds-lab-pg\"\nexport DB_VERSION=\"POSTGRES_15\"   # example; verify supported versions for Datastream\nexport TIER=\"db-custom-1-3840\"    # example small tier; adjust as needed\n\ngcloud sql instances create \"${SQL_INSTANCE}\" \\\n  --database-version=\"${DB_VERSION}\" \\\n  --tier=\"${TIER}\" \\\n  --region=\"${REGION}\" \\\n  --network=\"projects\/${PROJECT_ID}\/global\/networks\/${VPC_NAME}\" \\\n  --no-assign-ip\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: Cloud SQL instance is created with private IP only.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud sql instances describe \"${SQL_INSTANCE}\" --format=\"value(ipAddresses.ipAddress,settings.ipConfiguration.ipv4Enabled)\"\n<\/code><\/pre>\n\n\n\n<p>You should see an IP address and <code>ipv4Enabled<\/code> should indicate public IP is not enabled.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Configure PostgreSQL for logical replication and create sample data<\/h3>\n\n\n\n<p>Datastream needs PostgreSQL logical decoding enabled and a user with suitable privileges.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">5.1 Set Cloud SQL flags for logical decoding (PostgreSQL)<\/h4>\n\n\n\n<p>Cloud SQL uses database flags (exact flags depend on version). 
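One caution before patching flags: in `gcloud sql instances patch`, the `--database-flags` value replaces the instance's entire flag list, so include every flag you still need in a single call. Capturing the current flags first is a safe habit:

```shell
# Record current database flags before patching: --database-flags
# replaces the whole set, so any flag omitted from the patch is cleared.
gcloud sql instances describe "${SQL_INSTANCE}" \
  --format="value(settings.databaseFlags)"
```
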
Common requirements:\n&#8211; Enable logical decoding (Cloud SQL-specific flag)\n&#8211; Ensure adequate replication slots \/ WAL senders<\/p>\n\n\n\n<p>Set flags (verify the exact flag names for your Cloud SQL version in official docs; this step is commonly required):<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud sql instances patch \"${SQL_INSTANCE}\" \\\n  --database-flags=cloudsql.logical_decoding=on\n<\/code><\/pre>\n\n\n\n<p>If your source requires additional flags like <code>max_replication_slots<\/code> or <code>max_wal_senders<\/code>, add them per Cloud SQL guidance (verify in official docs).<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>: Instance is patched; it may require a restart.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud sql instances describe \"${SQL_INSTANCE}\" --format=\"value(settings.databaseFlags)\"\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">5.2 Set a password for the default postgres user (if needed)<\/h4>\n\n\n\n<p>Cloud SQL often creates a default <code>postgres<\/code> user. 
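You can check which users already exist on the instance before setting a password (output columns vary by gcloud version):

```shell
# List database users on the lab instance; the default "postgres" user
# should appear in the output.
gcloud sql users list --instance="${SQL_INSTANCE}"
```
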
Set a password you\u2019ll use temporarily:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export POSTGRES_PASSWORD=\"REPLACE_WITH_A_STRONG_PASSWORD\"\ngcloud sql users set-password postgres \\\n  --instance=\"${SQL_INSTANCE}\" \\\n  --password=\"${POSTGRES_PASSWORD}\"\n<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">5.3 Connect to the instance and create a database\/table<\/h4>\n\n\n\n<p>To connect privately, use one of these:\n&#8211; A VM in the same VPC, or\n&#8211; Cloud Shell with a configured path (Cloud Shell isn\u2019t in your VPC by default), or\n&#8211; A temporary bastion VM<\/p>\n\n\n\n<p>A simple approach is to create a tiny VM in the same VPC and connect from there.<\/p>\n\n\n\n<p>Create a small VM:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export VM_NAME=\"ds-lab-vm\"\n\ngcloud compute instances create \"${VM_NAME}\" \\\n  --zone=\"${ZONE}\" \\\n  --machine-type=\"e2-micro\" \\\n  --network=\"${VPC_NAME}\" \\\n  --subnet=\"${SUBNET_NAME}\" \\\n  --scopes=\"https:\/\/www.googleapis.com\/auth\/cloud-platform\"\n<\/code><\/pre>\n\n\n\n<p>Find the Cloud SQL private IP and note it down:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export SQL_PRIVATE_IP=\"$(gcloud sql instances describe \"${SQL_INSTANCE}\" --format='value(ipAddresses.ipAddress)')\"\necho \"${SQL_PRIVATE_IP}\"\n<\/code><\/pre>\n\n\n\n<p>SSH into the VM:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute ssh \"${VM_NAME}\" --zone=\"${ZONE}\"\n<\/code><\/pre>\n\n\n\n<p>On the VM, install the <code>psql<\/code> client (Debian\/Ubuntu example):<\/p>\n\n\n\n<pre><code class=\"language-bash\">sudo apt-get update\nsudo apt-get install -y postgresql-client\n<\/code><\/pre>\n\n\n\n<p>Connect using <code>psql<\/code>. Variables exported on your workstation are not set on the VM, so replace <code>SQL_PRIVATE_IP<\/code> with the address you noted above:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export PGPASSWORD=\"REPLACE_WITH_A_STRONG_PASSWORD\"\npsql -h SQL_PRIVATE_IP -U postgres -d postgres\n<\/code><\/pre>\n\n\n\n<p>Inside <code>psql<\/code>, create a database and table:<\/p>\n\n\n\n<pre><code class=\"language-sql\">CREATE 
DATABASE ds_lab;\n\\c ds_lab\n\nCREATE TABLE public.customers (\n  customer_id  BIGSERIAL PRIMARY KEY,\n  email        TEXT NOT NULL,\n  created_at   TIMESTAMPTZ NOT NULL DEFAULT now()\n);\n\nINSERT INTO public.customers (email) VALUES\n  ('alice@example.com'),\n  ('bob@example.com');\n\nSELECT * FROM public.customers;\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: You have a <code>ds_lab<\/code> database and a <code>customers<\/code> table with 2 rows.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">5.4 Create a dedicated Datastream user<\/h4>\n\n\n\n<p>Still inside <code>psql<\/code>, create a user for Datastream. Exact permissions can differ; generally it needs:\n&#8211; Ability to connect to the database\n&#8211; SELECT on replicated tables (for backfill)\n&#8211; Replication\/logical decoding privileges<\/p>\n\n\n\n<p>Create the user (verify required roles\/privileges in official docs for the Datastream PostgreSQL source):<\/p>\n\n\n\n<pre><code class=\"language-sql\">CREATE USER datastream_user WITH PASSWORD 'REPLACE_WITH_STRONG_PASSWORD';\n\nGRANT CONNECT ON DATABASE ds_lab TO datastream_user;\nGRANT USAGE ON SCHEMA public TO datastream_user;\nGRANT SELECT ON ALL TABLES IN SCHEMA public TO datastream_user;\nALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO datastream_user;\n\n-- Replication privilege (may be required)\nALTER USER datastream_user WITH REPLICATION;\n\n-- Datastream's PostgreSQL source also uses a publication and a logical\n-- replication slot; the stream setup asks for their names. The names below\n-- are examples (verify exact requirements in official docs).\nCREATE PUBLICATION ds_lab_pub FOR TABLE public.customers;\nSELECT pg_create_logical_replication_slot('ds_lab_slot', 'pgoutput');\n<\/code><\/pre>\n\n\n\n<p>Exit <code>psql<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-sql\">\\q\n<\/code><\/pre>\n\n\n\n<p>Exit the VM SSH session:<\/p>\n\n\n\n<pre><code class=\"language-bash\">exit\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: A least-privilege user exists for Datastream to read and replicate.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Create a BigQuery dataset (destination)<\/h3>\n\n\n\n<p>Create a dataset in BigQuery (choose a location aligned with your region strategy):<\/p>\n\n\n\n<pre><code class=\"language-bash\">export 
BQ_DATASET=\"ds_lab\"\nbq --location=\"${REGION}\" mk -d \\\n  --description \"Datastream lab dataset\" \\\n  \"${PROJECT_ID}:${BQ_DATASET}\"\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: BigQuery dataset exists.<\/p>\n\n\n\n<p><strong>Verification<\/strong>:<\/p>\n\n\n\n<pre><code class=\"language-bash\">bq show \"${PROJECT_ID}:${BQ_DATASET}\"\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Grant BigQuery permissions to the Datastream service agent<\/h3>\n\n\n\n<p>Datastream writes into BigQuery using a Google-managed identity (service agent). You must grant it sufficient permissions on the dataset (or project). The exact service agent format and required roles can vary\u2014verify in official docs:\n&#8211; Datastream IAM\/access control: https:\/\/cloud.google.com\/datastream\/docs\/access-control<\/p>\n\n\n\n<p>Typical pattern:\n&#8211; Find your project number\n&#8211; Identify the Datastream service agent\n&#8211; Grant dataset-level permissions (preferred) instead of project-wide<\/p>\n\n\n\n<p>Get project number:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export PROJECT_NUMBER=\"$(gcloud projects describe \"${PROJECT_ID}\" --format='value(projectNumber)')\"\necho \"${PROJECT_NUMBER}\"\n<\/code><\/pre>\n\n\n\n<p>The Datastream service agent commonly looks like:<\/p>\n\n\n\n<pre><code>service-PROJECT_NUMBER@gcp-sa-datastream.iam.gserviceaccount.com\n<\/code><\/pre>\n\n\n\n<p>Grant dataset access (dataset-level IAM). BigQuery dataset IAM via <code>bq<\/code> can be managed, but many teams do this in the Console for clarity:\n&#8211; BigQuery \u2192 Dataset \u2192 Sharing \u2192 Permissions \u2192 Add principal (service agent) \u2192 assign role<\/p>\n\n\n\n<p>Role guidance:\n&#8211; Minimum required roles depend on how Datastream creates\/updates tables. 
Many labs use a broader role for simplicity, then tighten for production.<br\/>\n&#8211; For least privilege, grant only what\u2019s required to create and write tables in that dataset. <strong>Verify required roles in official docs<\/strong>.<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>: Datastream can create\/write tables in the dataset.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: Create Datastream connection profiles (Console-recommended for accuracy)<\/h3>\n\n\n\n<p>Because CLI flags can change, the most robust beginner path is the Console.<\/p>\n\n\n\n<p>1) Go to Datastream in Google Cloud Console:<br\/>\nhttps:\/\/console.cloud.google.com\/datastream<\/p>\n\n\n\n<p>2) Ensure you\u2019re in the correct <strong>project<\/strong> and <strong>region<\/strong>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">8.1 Create a source connection profile (PostgreSQL)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Datastream \u2192 <strong>Connection profiles<\/strong> \u2192 <strong>Create profile<\/strong><\/li>\n<li>Type: <strong>PostgreSQL<\/strong><\/li>\n<li>Connectivity: choose <strong>Private connectivity<\/strong> (recommended)<\/li>\n<li>Hostname\/IP: Cloud SQL <strong>private IP<\/strong><\/li>\n<li>Port: <code>5432<\/code><\/li>\n<li>Username: <code>datastream_user<\/code><\/li>\n<li>Password: the password you set<\/li>\n<li>Database: <code>ds_lab<\/code><\/li>\n<li>TLS\/SSL: configure per your security posture (for Cloud SQL private IP, TLS is still recommended; verify required settings)<\/li>\n<\/ul>\n\n\n\n<p>Test the connection in the UI (Datastream typically provides a \u201cTest\u201d button).<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>: Connection test succeeds and profile is created.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">8.2 Create a destination connection profile (BigQuery)<\/h4>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Datastream \u2192 <strong>Connection profiles<\/strong> \u2192 <strong>Create profile<\/strong><\/li>\n<li>Type: 
<strong>BigQuery<\/strong><\/li>\n<li>Choose dataset: <code>ds_lab<\/code> (or allow Datastream to create datasets depending on UI options)<\/li>\n<li>Save<\/li>\n<\/ul>\n\n\n\n<p><strong>Expected outcome<\/strong>: Destination profile created.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 9: Configure private connectivity in Datastream (if required by your setup)<\/h3>\n\n\n\n<p>Depending on the current Datastream UI flow, you may need to create a <strong>private connection<\/strong> resource that attaches Datastream to your VPC\/subnet.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Datastream \u2192 <strong>Private connectivity<\/strong> (or similar) \u2192 <strong>Create<\/strong><\/li>\n<li>Select:<\/li>\n<li>VPC: <code>ds-lab-vpc<\/code><\/li>\n<li>Subnet: <code>ds-lab-subnet<\/code><\/li>\n<li>Region: same as your Datastream resources<\/li>\n<\/ul>\n\n\n\n<p><strong>Expected outcome<\/strong>: Private connectivity becomes READY\/ACTIVE.<\/p>\n\n\n\n<p><strong>Common issue<\/strong>: If the private connection cannot be established, check:\n&#8211; IAM permissions\n&#8211; Overlapping IP ranges\n&#8211; VPC peering limits\n&#8211; Regional mismatches<\/p>\n\n\n\n<p>(Exact diagnostics vary; consult Datastream connectivity docs.)<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 10: Create a Datastream stream (backfill + CDC)<\/h3>\n\n\n\n<p>1) Datastream \u2192 <strong>Streams<\/strong> \u2192 <strong>Create stream<\/strong>\n2) Select:\n   &#8211; Source connection profile: your PostgreSQL profile\n   &#8211; Destination connection profile: your BigQuery profile\n3) Configure <strong>object selection<\/strong>:\n   &#8211; Database: <code>ds_lab<\/code>\n   &#8211; Schema: <code>public<\/code>\n   &#8211; Tables: <code>customers<\/code> (only)\n4) Backfill:\n   &#8211; Enable backfill for selected objects (recommended so BigQuery gets the existing two rows)\n5) Review and create the stream.\n6) Start\/Run the stream (depending on UI state 
model).<\/p>\n\n\n\n<p><strong>Expected outcome<\/strong>: Stream transitions to RUNNING and begins backfill, then CDC.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 11: Generate changes in PostgreSQL and observe in BigQuery<\/h3>\n\n\n\n<p>1) SSH to the VM again and insert\/update\/delete rows:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute ssh \"${VM_NAME}\" --zone=\"${ZONE}\"\n<\/code><\/pre>\n\n\n\n<p>Run the following, replacing <code>SQL_PRIVATE_IP<\/code> with the instance\u2019s private IP from Step 5.3:<\/p>\n\n\n\n<pre><code class=\"language-bash\">export PGPASSWORD=\"REPLACE_WITH_A_STRONG_PASSWORD\"\npsql -h SQL_PRIVATE_IP -U postgres -d ds_lab\n<\/code><\/pre>\n\n\n\n<p>In <code>psql<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-sql\">INSERT INTO public.customers (email) VALUES ('carol@example.com');\nUPDATE public.customers SET email='alice+new@example.com' WHERE email='alice@example.com';\nDELETE FROM public.customers WHERE email='bob@example.com';\nSELECT * FROM public.customers ORDER BY customer_id;\n<\/code><\/pre>\n\n\n\n<p>Exit:<\/p>\n\n\n\n<pre><code class=\"language-sql\">\\q\n<\/code><\/pre>\n\n\n\n<p>Exit VM:<\/p>\n\n\n\n<pre><code class=\"language-bash\">exit\n<\/code><\/pre>\n\n\n\n<p>2) In BigQuery, query the replicated table(s).<br\/>\nIn the BigQuery Console: https:\/\/console.cloud.google.com\/bigquery<\/p>\n\n\n\n<p>Run a query against the expected table created by Datastream in dataset <code>ds_lab<\/code>.<\/p>\n\n\n\n<p>Because naming conventions can vary (and may include schema\/database prefixes), locate the table in the dataset and run:<\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT * FROM `YOUR_PROJECT_ID.ds_lab.YOUR_TABLE_NAME` LIMIT 100;\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome<\/strong>: You see rows reflecting the backfill + subsequent inserts\/updates\/deletes, consistent with Datastream\u2019s BigQuery replication model.<\/p>\n\n\n\n<blockquote>\n<p>Important: How updates\/deletes appear depends on Datastream\u2019s BigQuery write semantics (for example, merge vs append\/change tables). 
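One way to see which model applies in your project is to list the tables Datastream created and inspect a table's schema (a sketch; YOUR_TABLE_NAME is a placeholder for the table you find in the dataset):

```shell
# List replicated tables, then dump one table's schema to see how
# Datastream represents change metadata in BigQuery.
bq ls "${PROJECT_ID}:ds_lab"
bq show --schema --format=prettyjson "${PROJECT_ID}:ds_lab.YOUR_TABLE_NAME"
```
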
<strong>Verify the exact BigQuery output model in official docs<\/strong> and align your downstream queries accordingly.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>Use the following checks:<\/p>\n\n\n\n<p>1) <strong>Datastream stream health<\/strong>\n&#8211; Datastream \u2192 Streams \u2192 your stream\n&#8211; Status: RUNNING\n&#8211; Backfill: completed (if enabled)\n&#8211; Errors: none<\/p>\n\n\n\n<p>2) <strong>Source database<\/strong>\n&#8211; Rows changed as expected:<\/p>\n\n\n\n<pre><code class=\"language-sql\">SELECT count(*) FROM public.customers;\n<\/code><\/pre>\n\n\n\n<p>3) <strong>Destination BigQuery<\/strong>\n&#8211; Replicated table exists in dataset <code>ds_lab<\/code>\n&#8211; Query returns expected data with reasonable freshness<\/p>\n\n\n\n<p>4) <strong>Observability<\/strong>\n&#8211; Cloud Logging: filter for Datastream logs (service name) and verify no repeated failures.\n&#8211; Cloud Monitoring: look for Datastream metrics if exposed in your environment (names may vary).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p>Common issues and realistic fixes:<\/p>\n\n\n\n<p>1) <strong>Connection test fails (source)<\/strong>\n&#8211; Cause: VPC\/private connectivity not established, wrong IP\/port, firewall rules, wrong credentials.\n&#8211; Fix:\n  &#8211; Ensure Cloud SQL has private IP and is in the correct VPC.\n  &#8211; Ensure Datastream private connectivity is READY and matches region.\n  &#8211; Re-check username\/password and database name.<\/p>\n\n\n\n<p>2) <strong>PostgreSQL replication\/logical decoding errors<\/strong>\n&#8211; Cause: logical decoding not enabled; missing privileges; insufficient WAL settings.\n&#8211; Fix:\n  &#8211; Confirm Cloud SQL flag for logical decoding is enabled and instance restarted if required.\n  &#8211; Ensure <code>datastream_user<\/code> has required privileges (including REPLICATION if required by Datastream).\n  &#8211; 
Verify PostgreSQL version support in Datastream docs.<\/p>\n\n\n\n<p>3) <strong>Backfill succeeds but CDC doesn\u2019t (or vice versa)<\/strong>\n&#8211; Cause: replication slot\/log retention issues; permissions; stream state.\n&#8211; Fix:\n  &#8211; Check stream details for lag\/errors.\n  &#8211; Confirm source log retention is sufficient and replication slot is healthy.\n  &#8211; Verify no network interruption.<\/p>\n\n\n\n<p>4) <strong>BigQuery permission denied<\/strong>\n&#8211; Cause: Datastream service agent lacks dataset permissions.\n&#8211; Fix:\n  &#8211; Grant required dataset permissions to the Datastream service agent.\n  &#8211; Verify you used the correct service agent principal for the project.<\/p>\n\n\n\n<p>5) <strong>No table appears in BigQuery<\/strong>\n&#8211; Cause: object selection rules exclude table; stream not running; backfill disabled; permission issues.\n&#8211; Fix:\n  &#8211; Confirm selection includes <code>ds_lab.public.customers<\/code>.\n  &#8211; Ensure stream is RUNNING.\n  &#8211; Re-check destination profile and dataset location\/permissions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid ongoing charges, delete all lab resources.<\/p>\n\n\n\n<p>1) Stop and delete the Datastream stream (Console recommended):\n&#8211; Datastream \u2192 Streams \u2192 select stream \u2192 Stop (if required) \u2192 Delete<\/p>\n\n\n\n<p>2) Delete connection profiles and private connectivity resources:\n&#8211; Datastream \u2192 Connection profiles \u2192 Delete source and destination profiles\n&#8211; Datastream \u2192 Private connectivity \u2192 Delete private connection (if created)<\/p>\n\n\n\n<p>3) Delete BigQuery dataset:<\/p>\n\n\n\n<pre><code class=\"language-bash\">bq rm -r -d \"${PROJECT_ID}:${BQ_DATASET}\"\n<\/code><\/pre>\n\n\n\n<p>4) Delete Cloud SQL instance:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud sql instances delete \"${SQL_INSTANCE}\"\n<\/code><\/pre>\n\n\n\n<p>5) Delete 
VM:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute instances delete \"${VM_NAME}\" --zone=\"${ZONE}\"\n<\/code><\/pre>\n\n\n\n<p>6) Delete VPC and related resources:<\/p>\n\n\n\n<pre><code class=\"language-bash\">gcloud compute networks subnets delete \"${SUBNET_NAME}\" --region \"${REGION}\"\n\n# Remove the Private Service Access peering before deleting the reserved\n# range and the network (deleting the network first fails while the\n# peering still exists).\ngcloud services vpc-peerings delete \\\n  --service=servicenetworking.googleapis.com \\\n  --network=\"${VPC_NAME}\"\n\ngcloud compute addresses delete \"${PSA_RANGE_NAME}\" --global\ngcloud compute networks delete \"${VPC_NAME}\"\n<\/code><\/pre>\n\n\n\n<p>7) Optional: disable APIs if this project is only for labs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Choose the right destination<\/strong>:<\/li>\n<li>BigQuery for immediate analytics<\/li>\n<li>Cloud Storage for raw landing + replay + multiple downstreams<\/li>\n<li><strong>Keep region alignment<\/strong>: Place Datastream near the source network and destination to reduce latency and data transfer costs.<\/li>\n<li><strong>Plan for schema evolution<\/strong>: Expect DDL changes; define operational procedures for schema changes and verify how Datastream propagates them for your source\/destination.<\/li>\n<li><strong>Design downstream models<\/strong>: Decide whether you want:<\/li>\n<li>Current-state tables, or<\/li>\n<li>Change-history tables, or<\/li>\n<li>Both (often via downstream transformations)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>dedicated Datastream admin\/operator roles<\/strong> rather than broad project Owner.<\/li>\n<li>Grant <strong>dataset-scoped<\/strong> BigQuery permissions to Datastream\u2019s service agent (avoid project-wide roles when possible).<\/li>\n<li>Use a dedicated DB user with the <strong>minimum privileges<\/strong> required for backfill + CDC.<\/li>\n<li>Rotate DB credentials and document rotation steps.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Replicate only required tables; avoid \u201cwhole database\u201d replication by default.<\/li>\n<li>Plan backfills carefully (time windows, object selection).<\/li>\n<li>If landing to Cloud Storage, apply <strong>lifecycle policies<\/strong> and retention controls.<\/li>\n<li>Set <strong>budgets and alerts<\/strong> for BigQuery query spend and Datastream usage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Ensure the source DB is configured for CDC load: adequate WAL\/binlog retention, IO capacity, and replication settings.<\/li>\n<li>Avoid excessive table selection changes that trigger large re-backfills.<\/li>\n<li>For BigQuery, create curated, partitioned tables downstream for efficient querying.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Monitor stream state and lag; alert on failures.<\/li>\n<li>Use private connectivity for stable networking and security.<\/li>\n<li>Document runbooks for restart\/recreate scenarios.<\/li>\n<li>Test schema changes and failover behavior in staging.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Standardize naming:<\/li>\n<li><code>ds-{env}-{source}-{region}<\/code> for streams<\/li>\n<li><code>cp-{env}-{system}-{role}<\/code> for connection profiles<\/li>\n<li>Apply labels\/tags for cost allocation (env, app, owner, data-domain).<\/li>\n<li>Use least privilege and separation of duties: creators vs viewers vs auditors.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Treat replicated datasets as governed assets:<\/li>\n<li>Data classification labels (PII, PCI, etc.)<\/li>\n<li>Controlled sharing and authorized views in 
BigQuery<\/li>\n<li>Data retention policies aligned with compliance<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">12. Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Google Cloud IAM<\/strong> controls who can create\/modify Datastream resources.<\/li>\n<li>Datastream uses a <strong>service agent<\/strong> to access destinations (BigQuery\/Cloud Storage). Grant it only the permissions it needs.<\/li>\n<li>Source authentication uses database credentials; protect and rotate them.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>In transit: use TLS where supported\/required between Datastream and source.<\/li>\n<li>At rest: BigQuery and Cloud Storage encrypt data by default. For higher control, consider CMEK where supported by the destination services.<br\/>\n  For Datastream-specific CMEK support, <strong>verify in official docs<\/strong> (do not assume it applies to every resource type).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer <strong>private connectivity<\/strong> so databases do not need public IP access.<\/li>\n<li>If public connectivity is unavoidable:<\/li>\n<li>Use TLS<\/li>\n<li>Restrict ingress using allowlists (Datastream egress IPs per region\u2014verify in official docs)<\/li>\n<li>Monitor connections closely<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Avoid hardcoding DB passwords in scripts or repos.<\/li>\n<li>Use Secret Manager to store credentials and enforce access controls; implement a rotation process.<\/li>\n<li>Ensure only CI\/CD service accounts with a need have access to secrets.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable and retain:<\/li>\n<li>Cloud 
Audit Logs for Datastream admin activity<\/li>\n<li>Cloud SQL audit\/logging as needed<\/li>\n<li>BigQuery audit logs (data access logs if required for compliance; note cost\/volume implications)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>CDC often includes sensitive attributes. Replicate only what\u2019s allowed.<\/li>\n<li>Ensure data residency requirements align with chosen regions.<\/li>\n<li>Use dataset-level sharing controls and authorized views in BigQuery to implement least-privilege consumption.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Exposing the source DB to the public internet for convenience.<\/li>\n<li>Granting overly broad BigQuery roles (project-wide admin).<\/li>\n<li>Replicating entire schemas that include secrets\/PII unintentionally.<\/li>\n<li>No monitoring\/alerts on stream failure (silent data freshness issues).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use private connectivity + least privilege + monitoring by default.<\/li>\n<li>Create a security review checklist for each new stream:<\/li>\n<li>Data classification<\/li>\n<li>Object selection<\/li>\n<li>Destination access controls<\/li>\n<li>Logging\/retention<\/li>\n<li>Incident response ownership<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<p>Always review the official \u201cknown limitations\u201d and \u201csupported sources\/destinations\u201d pages for your exact versions and regions.<\/p>\n\n\n\n<p>Common limitations\/gotchas in CDC\/replication projects:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Source support matrix<\/strong>: Not all DB engines\/versions\/editions are supported.<\/li>\n<li><strong>DDL\/schema changes<\/strong>: Handling of ALTER TABLE, renames, and type changes can be nuanced.<\/li>\n<li><strong>Large objects \/ special types<\/strong>: Some data types may map imperfectly to BigQuery or file outputs.<\/li>\n<li><strong>Primary keys<\/strong>: Replication semantics (especially updates\/deletes) often rely on stable keys.<\/li>\n<li><strong>Initial backfill impact<\/strong>: Backfill can generate significant load on the source and network.<\/li>\n<li><strong>Permissions<\/strong>: The source user must have enough privileges for snapshot + CDC.<\/li>\n<li><strong>WAL\/binlog retention<\/strong>: If logs are truncated before Datastream reads them, replication can break.<\/li>\n<li><strong>Latency expectations<\/strong>: \u201cNear-real-time\u201d still requires monitoring; spikes happen during backfill, schema changes, or source load.<\/li>\n<li><strong>Regional placement<\/strong>: Cross-region pipelines can add egress cost and latency.<\/li>\n<li><strong>Destination write model<\/strong>: The BigQuery replication model can differ from an \u201cexact mirror\u201d table. Verify how changes are represented and how to query current state.<\/li>\n<li><strong>Operational ownership<\/strong>: Even managed CDC needs clear ownership for failures, schema issues, and access reviews.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">14. 
Comparison with Alternatives<\/h2>\n\n\n\n<p>Datastream is one option in a broader data ingestion landscape.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Datastream (Google Cloud)<\/strong><\/td>\n<td>Managed CDC from supported DBs to BigQuery\/Cloud Storage<\/td>\n<td>Low ops, CDC + backfill, Google Cloud-native destinations<\/td>\n<td>Limited to supported sources\/destinations; transformations usually downstream<\/td>\n<td>You want managed CDC into BigQuery\/Cloud Storage<\/td>\n<\/tr>\n<tr>\n<td><strong>Dataflow (Google Cloud)<\/strong><\/td>\n<td>Custom streaming\/batch ETL<\/td>\n<td>Powerful transforms, many connectors, flexible pipelines<\/td>\n<td>You must build\/operate pipelines; CDC from DB logs usually needs extra tooling<\/td>\n<td>You need complex transforms and custom sinks<\/td>\n<\/tr>\n<tr>\n<td><strong>BigQuery Data Transfer Service<\/strong><\/td>\n<td>Scheduled transfers from supported SaaS\/data sources<\/td>\n<td>Easy managed scheduled loads<\/td>\n<td>Not a general CDC tool for OLTP DB logs<\/td>\n<td>You need scheduled ingestion from supported transfer sources<\/td>\n<\/tr>\n<tr>\n<td><strong>Database Migration Service (Google Cloud)<\/strong><\/td>\n<td>Database migrations to Cloud SQL\/AlloyDB (and possibly continuous replication for migration)<\/td>\n<td>Migration-focused workflows<\/td>\n<td>Not primarily an analytics CDC to BigQuery<\/td>\n<td>You\u2019re migrating databases rather than building analytics CDC<\/td>\n<\/tr>\n<tr>\n<td><strong>Self-managed Debezium + Kafka<\/strong><\/td>\n<td>Broad CDC to many consumers<\/td>\n<td>Very flexible, many sinks via Kafka ecosystem<\/td>\n<td>High ops cost, scaling and security complexity<\/td>\n<td>You need multi-sink event streaming and accept ops 
burden<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS DMS<\/strong><\/td>\n<td>CDC\/migration in AWS ecosystem<\/td>\n<td>Mature CDC and migration tool<\/td>\n<td>Best integrated with AWS; cross-cloud adds complexity<\/td>\n<td>Your platform is AWS-centric<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Data Factory + CDC patterns<\/strong><\/td>\n<td>Data integration in Azure<\/td>\n<td>Strong integration in Azure<\/td>\n<td>CDC specifics vary; may require extra components<\/td>\n<td>Your platform is Azure-centric<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: Retail analytics modernization<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A retailer runs on-prem Oracle for order processing. Analysts need near-real-time sales analytics in BigQuery and want to retire heavy reporting queries on the OLTP system.<\/li>\n<li><strong>Proposed architecture<\/strong>:\n<ul class=\"wp-block-list\">\n<li>On-prem Oracle \u2192 Datastream (private connectivity via Interconnect\/VPN) \u2192 BigQuery<\/li>\n<li>Downstream: curated marts in BigQuery; dashboards in Looker<\/li>\n<li>Monitoring and alerts for stream lag; strict dataset access controls for PII<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Datastream was chosen<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Managed CDC reduces operational complexity versus self-managed connectors.<\/li>\n<li>Direct alignment with BigQuery analytics.<\/li>\n<li>Private connectivity supports compliance and avoids public database exposure.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Reduced OLTP reporting load<\/li>\n<li>Data freshness improved from daily to minutes<\/li>\n<li>Standardized ingestion and stronger governance<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: SaaS product metrics in BigQuery<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem<\/strong>: A startup\u2019s PostgreSQL database is the system of record. They need near-real-time KPIs and cohort analysis without building a complex streaming platform.<\/li>\n<li><strong>Proposed architecture<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Cloud SQL for PostgreSQL \u2192 Datastream \u2192 BigQuery dataset<\/li>\n<li>Scheduled SQL transformations in BigQuery for curated tables<\/li>\n<li>Basic monitoring alerts on stream failures<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Datastream was chosen<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Fast setup, minimal ops<\/li>\n<li>Fits a small team\u2019s capacity<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Simple, reliable replication pipeline<\/li>\n<li>Faster iteration on analytics models<\/li>\n<li>Clear cost model focused on change volume and BigQuery usage<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>Is Datastream an ETL tool?<\/strong><br\/>\nDatastream is primarily a <strong>CDC\/replication<\/strong> service. Transformations typically happen downstream (for example in Dataflow or BigQuery).<\/p>\n\n\n\n<p>2) <strong>Does Datastream do initial full loads?<\/strong><br\/>\nYes, it supports <strong>backfill<\/strong> (initial snapshot) for selected objects, then continues with CDC (verify backfill behavior for your source engine).<\/p>\n\n\n\n<p>3) <strong>Which databases can Datastream read from?<\/strong><br\/>\nCommon sources include PostgreSQL, MySQL, and Oracle, but support depends on version\/edition. Check the official support matrix: https:\/\/cloud.google.com\/datastream\/docs\/sources<\/p>\n\n\n\n<p>4) <strong>Can Datastream write directly to BigQuery?<\/strong><br\/>\nYes, BigQuery is a common destination. Confirm the current destination capabilities and semantics: https:\/\/cloud.google.com\/datastream\/docs\/destinations<\/p>\n\n\n\n<p>5) <strong>Can Datastream write to Pub\/Sub or Kafka directly?<\/strong><br\/>\nTypically Datastream targets BigQuery or Cloud Storage. 
For Pub\/Sub\/Kafka-style patterns, use Cloud Storage as a landing zone and transform\/forward using Dataflow (or choose a different CDC stack). Verify current destination support in docs.<\/p>\n\n\n\n<p>6) <strong>How fresh is the data in BigQuery?<\/strong><br\/>\nFreshness depends on change volume, source load, network, and stream health. Monitor lag\/freshness metrics rather than assuming a fixed SLA. Verify published SLAs (if any) in official documentation.<\/p>\n\n\n\n<p>7) <strong>Do I need a public IP on the source database?<\/strong><br\/>\nNo\u2014private connectivity is recommended. Public IP is possible in some scenarios but increases security risk and requires allowlisting.<\/p>\n\n\n\n<p>8) <strong>What happens if the stream goes down?<\/strong><br\/>\nDatastream will surface errors and stream state changes; recovery depends on the failure mode (connectivity, permissions, log retention). You should have alerts and runbooks.<\/p>\n\n\n\n<p>9) <strong>Does Datastream handle schema changes automatically?<\/strong><br\/>\nSchema evolution support depends on the source\/destination and change type. Always test DDL changes in staging and verify documented behavior.<\/p>\n\n\n\n<p>10) <strong>Will Datastream impact my production database performance?<\/strong><br\/>\nCDC reads logs and can increase IO\/CPU and log retention. Backfills can add significant read load. Plan capacity and test.<\/p>\n\n\n\n<p>11) <strong>How do deletes appear in BigQuery?<\/strong><br\/>\nRepresentation can vary (for example, tombstones, merge semantics, or change tables). 
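<\/p>\n\n\n\n<p>As a generic illustration only (Datastream\u2019s actual change-record schema is defined in the official docs), the sketch below replays a keyed change feed into current state: deletes remove the row (a \u201ctombstone\u201d effect), which is why stable primary keys matter so much for CDC. The field names used here are hypothetical.<\/p>

```python
# Hypothetical change events -- the field names ('op', 'pk', 'ts', 'row')
# are illustrative, not Datastream's real output schema.
events = [
    {'op': 'INSERT', 'pk': 1, 'ts': 100, 'row': {'id': 1, 'status': 'new'}},
    {'op': 'INSERT', 'pk': 2, 'ts': 150, 'row': {'id': 2, 'status': 'new'}},
    {'op': 'UPDATE', 'pk': 1, 'ts': 200, 'row': {'id': 1, 'status': 'paid'}},
    {'op': 'DELETE', 'pk': 2, 'ts': 300, 'row': None},
]

def replay(events):
    """Collapse a CDC feed into current state, keyed by primary key."""
    state = {}
    # Apply events in commit order; the stable key is what lets
    # updates and deletes land on the right row.
    for ev in sorted(events, key=lambda e: e['ts']):
        if ev['op'] == 'DELETE':
            state.pop(ev['pk'], None)    # tombstone: drop the row
        else:
            state[ev['pk']] = ev['row']  # upsert semantics
    return state

print(replay(events))  # only pk 1 survives, with its latest values
```

<p>A merge-style destination table collapses events in roughly this way for you, while a change-table or file-based landing zone keeps the raw events and leaves the collapse to your queries. 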
Verify the BigQuery destination model in official docs and design queries accordingly.<\/p>\n\n\n\n<p>12) <strong>Can I replicate only some tables?<\/strong><br\/>\nYes, streams support object selection rules so you can include\/exclude objects.<\/p>\n\n\n\n<p>13) <strong>How do I secure database credentials?<\/strong><br\/>\nStore credentials in Secret Manager, restrict access, rotate regularly, and use dedicated least-privilege DB users.<\/p>\n\n\n\n<p>14) <strong>How do I estimate cost?<\/strong><br\/>\nEstimate backfill size + daily change volume, then use the Datastream pricing page and the Google Cloud Pricing Calculator. Remember BigQuery query costs can dominate.<\/p>\n\n\n\n<p>15) <strong>Is Datastream suitable for disaster recovery replication?<\/strong><br\/>\nDatastream is designed for analytics and pipeline replication patterns. DR requirements (RPO\/RTO, failover) may need database-native replication or specialized DR designs. Evaluate carefully.<\/p>\n\n\n\n<p>16) <strong>Can I run multiple streams from the same database?<\/strong><br\/>\nOften yes, but it depends on source limits (replication slots, connections) and Datastream quotas. Verify limits in official docs.<\/p>\n\n\n\n<p>17) <strong>What\u2019s the difference between Datastream and Database Migration Service?<\/strong><br\/>\nDatastream focuses on CDC into analytics\/storage destinations, while Database Migration Service focuses on migrating databases to managed database targets (and may support continuous replication for migration). Choose based on your goal.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Datastream<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Datastream docs https:\/\/cloud.google.com\/datastream\/docs<\/td>\n<td>Primary reference for concepts, configuration, and supported sources\/destinations<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Datastream pricing https:\/\/cloud.google.com\/datastream\/pricing<\/td>\n<td>Current pricing model and SKUs (region-dependent)<\/td>\n<\/tr>\n<tr>\n<td>Pricing calculator<\/td>\n<td>Google Cloud Pricing Calculator https:\/\/cloud.google.com\/products\/calculator<\/td>\n<td>Build estimates including Datastream + BigQuery + Cloud SQL<\/td>\n<\/tr>\n<tr>\n<td>Locations<\/td>\n<td>Datastream locations https:\/\/cloud.google.com\/datastream\/docs\/locations<\/td>\n<td>Verify regional availability and plan deployments<\/td>\n<\/tr>\n<tr>\n<td>Connectivity guide<\/td>\n<td>Configure connectivity (Datastream) https:\/\/cloud.google.com\/datastream\/docs\/configure-connectivity<\/td>\n<td>Canonical steps for private\/public connectivity patterns<\/td>\n<\/tr>\n<tr>\n<td>Access control<\/td>\n<td>Datastream access control https:\/\/cloud.google.com\/datastream\/docs\/access-control<\/td>\n<td>IAM roles, permissions, and service agent guidance<\/td>\n<\/tr>\n<tr>\n<td>Destinations<\/td>\n<td>Datastream destinations https:\/\/cloud.google.com\/datastream\/docs\/destinations<\/td>\n<td>Understand BigQuery\/Cloud Storage behaviors and constraints<\/td>\n<\/tr>\n<tr>\n<td>Sources<\/td>\n<td>Datastream sources https:\/\/cloud.google.com\/datastream\/docs\/sources<\/td>\n<td>Verify supported engines\/versions and required DB settings<\/td>\n<\/tr>\n<tr>\n<td>BigQuery quotas<\/td>\n<td>BigQuery quotas https:\/\/cloud.google.com\/bigquery\/quotas<\/td>\n<td>Plan ingestion\/query limits and avoid 
surprises<\/td>\n<\/tr>\n<tr>\n<td>Cloud SQL docs<\/td>\n<td>Cloud SQL for PostgreSQL docs https:\/\/cloud.google.com\/sql\/docs\/postgres<\/td>\n<td>Required flags and operational guidance for logical decoding\/replication<\/td>\n<\/tr>\n<tr>\n<td>Architecture Center<\/td>\n<td>Google Cloud Architecture Center https:\/\/cloud.google.com\/architecture<\/td>\n<td>Patterns for analytics pipelines, landing zones, and governance (search for Datastream-related references)<\/td>\n<\/tr>\n<tr>\n<td>Videos<\/td>\n<td>Google Cloud Tech (YouTube) https:\/\/www.youtube.com\/@googlecloudtech<\/td>\n<td>Practical demos and explanations (search within channel for \u201cDatastream\u201d)<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>Cloud\/DevOps engineers, SREs, platform teams<\/td>\n<td>Google Cloud operations, pipelines, automation; may include Datastream as part of data engineering<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>DevOps\/SCM fundamentals and adjacent cloud tooling<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud ops practitioners<\/td>\n<td>Cloud operations, monitoring, reliability practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs and operations teams<\/td>\n<td>Reliability engineering, monitoring, incident response (useful for operating Datastream pipelines)<\/td>\n<td>Check 
website<\/td>\n<td>https:\/\/sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops + automation practitioners<\/td>\n<td>AIOps concepts, automation, monitoring analytics<\/td>\n<td>Check website<\/td>\n<td>https:\/\/aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">19. Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>Cloud\/DevOps training content (verify specific offerings)<\/td>\n<td>Engineers seeking guided training<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps and cloud training<\/td>\n<td>Beginners to intermediate DevOps practitioners<\/td>\n<td>https:\/\/devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps services\/training platform (verify offerings)<\/td>\n<td>Teams needing short-term help or coaching<\/td>\n<td>https:\/\/devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and guidance platform (verify offerings)<\/td>\n<td>Ops teams needing troubleshooting support<\/td>\n<td>https:\/\/devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps\/engineering services (verify specific portfolio)<\/td>\n<td>Architecture, implementation, and operations for cloud pipelines<\/td>\n<td>Datastream proof-of-concept, secure connectivity setup, monitoring\/runbooks<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps and cloud consulting\/training organization<\/td>\n<td>Enablement, platform practices, pipeline standardization<\/td>\n<td>Establishing standardized CDC ingestion patterns and operational governance<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting services<\/td>\n<td>CI\/CD, cloud ops, reliability and security practices<\/td>\n<td>Implementing observability, IAM guardrails, and cost controls for data pipelines<\/td>\n<td>https:\/\/devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Datastream<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Core Google Cloud fundamentals: projects, IAM, VPC networking, billing<\/li>\n<li>Database fundamentals: PostgreSQL\/MySQL\/Oracle basics, backups, performance<\/li>\n<li>Analytics basics: BigQuery datasets\/tables, partitioning, clustering, query costs<\/li>\n<li>Security basics: least privilege, secret handling, audit logs<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after Datastream<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data modeling for analytics (dimensional modeling, wide tables vs normalized)<\/li>\n<li>Dataflow for streaming\/batch transformations<\/li>\n<li>Governance: Dataplex, policy tags, data classification workflows (as applicable)<\/li>\n<li>Observability: SLOs for data freshness, alerting and incident management<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Engineer (ingestion and pipeline design)<\/li>\n<li>Cloud Engineer \/ Platform Engineer (standardizing managed services)<\/li>\n<li>Solutions Architect (hybrid + analytics architectures)<\/li>\n<li>SRE\/Operations (monitoring and reliability for pipelines)<\/li>\n<li>Security Engineer (secure connectivity and access controls)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p>Datastream is typically covered as part of broader Google Cloud certifications rather than a standalone certification. 
Consider:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Professional Data Engineer<\/li>\n<li>Professional Cloud Architect<\/li>\n<\/ul>\n\n\n\n<p>Always confirm current Google Cloud certification tracks: https:\/\/cloud.google.com\/learn\/certification<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Replicate a small PostgreSQL schema to BigQuery and build a Looker dashboard with freshness monitoring.<\/li>\n<li>Land CDC to Cloud Storage, then build a Dataflow pipeline to create a curated BigQuery model.<\/li>\n<li>Implement a \u201cschema change test harness\u201d in staging: apply DDL changes and document effects.<\/li>\n<li>Build cost controls: budgets, alerts, and automated teardown for dev streams.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>CDC (Change Data Capture)<\/strong>: Technique to capture row-level changes (insert\/update\/delete) from a database\u2019s logs.<\/li>\n<li><strong>Backfill<\/strong>: Initial copy of existing table data before ongoing CDC begins.<\/li>\n<li><strong>Connection profile<\/strong>: Datastream configuration that stores connectivity details for a source or destination.<\/li>\n<li><strong>Stream<\/strong>: Datastream resource that defines replication behavior (source \u2192 destination, selection rules, backfill, state).<\/li>\n<li><strong>Logical decoding (PostgreSQL)<\/strong>: PostgreSQL mechanism to decode WAL into logical change events for replication\/CDC.<\/li>\n<li><strong>WAL (Write-Ahead Log)<\/strong>: PostgreSQL transaction log used for durability and replication.<\/li>\n<li><strong>Binlog (MySQL)<\/strong>: MySQL binary log used for replication and CDC.<\/li>\n<li><strong>Least privilege<\/strong>: Security principle of granting only the permissions needed to perform a task.<\/li>\n<li><strong>Private IP \/ Private connectivity<\/strong>: Networking pattern where services communicate over internal IPs rather than public 
internet.<\/li>\n<li><strong>Service agent<\/strong>: Google-managed service account used by a managed service to access resources in your project.<\/li>\n<li><strong>Data freshness<\/strong>: How up-to-date the destination data is compared to the source.<\/li>\n<li><strong>Landing zone<\/strong>: A raw ingestion area (often Cloud Storage) where data is first delivered before transformations.<\/li>\n<li><strong>Data egress<\/strong>: Network traffic leaving a region or cloud boundary, often billable.<\/li>\n<li><strong>Quotas<\/strong>: Service limits on resources or requests, enforced per project\/region\/account.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Datastream is Google Cloud\u2019s managed <strong>CDC and replication<\/strong> service in the <strong>Data analytics and pipelines<\/strong> category. It captures database changes (and optionally backfills historical data) and delivers them to <strong>BigQuery<\/strong> or <strong>Cloud Storage<\/strong>, enabling near-real-time analytics without running your own CDC infrastructure.<\/p>\n\n\n\n<p>It matters because it reduces operational complexity for a common enterprise need: keeping analytics data fresh while minimizing load on production databases. Architecturally, it fits best as the ingestion layer between OLTP systems and Google Cloud analytics platforms, often combined with Dataflow\/BigQuery transformations and strong governance.<\/p>\n\n\n\n<p>For cost, plan around <strong>backfill size<\/strong>, <strong>ongoing change volume<\/strong>, and downstream costs (especially <strong>BigQuery queries<\/strong>). 
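<\/p>\n\n\n\n<p>That planning step can be sketched as simple arithmetic. The per-GB rates below are placeholders, not official prices; always take current, region-specific numbers from the Datastream pricing page and add BigQuery storage and query costs separately:<\/p>

```python
def estimate_monthly_cost(backfill_gb, daily_change_gb,
                          backfill_rate_per_gb, cdc_rate_per_gb):
    """Back-of-envelope Datastream estimate: one-time backfill volume
    plus ~30 days of CDC change volume, each priced per GB.

    Rates are placeholders; use the official pricing page for real,
    region-specific numbers. BigQuery costs are billed separately.
    """
    return {
        'one_time_backfill': backfill_gb * backfill_rate_per_gb,
        'monthly_cdc': daily_change_gb * 30 * cdc_rate_per_gb,
    }

# Example: 500 GB initial backfill, 4 GB/day of changes, made-up rates.
print(estimate_monthly_cost(500, 4,
                            backfill_rate_per_gb=1.0,
                            cdc_rate_per_gb=2.0))
```

<p>In this model the ongoing change volume, not the one-time backfill, dominates long-run spend. 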
For security, prioritize <strong>private connectivity<\/strong>, dedicated least-privilege database users, careful dataset permissions for the Datastream service agent, and monitoring\/alerts on stream health and freshness.<\/p>\n\n\n\n<p>Use Datastream when you need managed CDC into Google Cloud-native analytics destinations; choose alternatives when you need unsupported sources\/destinations or heavy inline transformations. Next step: read the official connectivity and destination semantics docs, then extend the lab to Cloud Storage landing + Dataflow transformations for a production-grade pipeline.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data analytics and pipelines<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[59,51],"tags":[],"class_list":["post-656","post","type-post","status-publish","format-standard","hentry","category-data-analytics-and-pipelines","category-google-cloud"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/656","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=656"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/656\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=656"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=656"},{"taxonomy":"post_tag","embeddable":t
rue,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=656"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}