{"id":83,"date":"2026-04-12T18:37:43","date_gmt":"2026-04-12T18:37:43","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/alibaba-cloud-realtime-compute-for-apache-flink-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics-computing\/"},"modified":"2026-04-12T18:37:43","modified_gmt":"2026-04-12T18:37:43","slug":"alibaba-cloud-realtime-compute-for-apache-flink-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics-computing","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/alibaba-cloud-realtime-compute-for-apache-flink-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics-computing\/","title":{"rendered":"Alibaba Cloud Realtime Compute for Apache Flink Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Analytics Computing"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>Analytics Computing<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Realtime Compute for Apache Flink is Alibaba Cloud\u2019s fully managed, production-oriented service for running Apache Flink workloads: real-time streaming analytics, event processing, and stateful stream processing with low latency and high throughput.<\/p>\n\n\n\n<p>In simple terms: you send in streams of events (clicks, transactions, IoT telemetry, logs), write Flink SQL or Flink code to transform\/aggregate\/join those events as they arrive, and continuously output results to downstream systems (databases, data lakes, search engines, dashboards, alerting systems).<\/p>\n\n\n\n<p>Technically, Realtime Compute for Apache Flink provides a managed control plane (job\/deployment lifecycle, scaling, upgrades, integrations, observability) plus managed runtime resources (Flink clusters\/jobs) so teams can focus on pipelines rather than building and operating Flink infrastructure. 
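The "transform/aggregate as events arrive" pattern above can be sketched in plain Python for intuition (illustrative only; every name here is hypothetical, and a real deployment would express this as Flink SQL or a DataStream application with managed state):

```python
# Conceptual sketch of a streaming job: consume events, keep keyed state,
# emit an updated aggregate per input event. Plain Python for illustration;
# in Realtime Compute for Apache Flink this state would be managed/checkpointed.
from collections import defaultdict

def run_pipeline(events):
    """Filter click events and continuously emit per-page running counts."""
    counts = defaultdict(int)   # stands in for Flink managed state
    outputs = []                # stands in for a sink (DB, OLAP store, topic)
    for event in events:        # an unbounded stream in a real job
        if event.get("type") != "click":        # filter step
            continue
        counts[event["page"]] += 1              # stateful aggregation step
        outputs.append((event["page"], counts[event["page"]]))  # emit update
    return outputs

stream = [
    {"type": "click", "page": "/home"},
    {"type": "view",  "page": "/home"},
    {"type": "click", "page": "/cart"},
    {"type": "click", "page": "/home"},
]
print(run_pipeline(stream))
```

The loop body is the whole mental model: the service's value is running this loop continuously, fault-tolerantly, and in parallel, without you operating the cluster.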
You typically integrate it with Alibaba Cloud networking (VPC), identity (Resource Access Management\/RAM), storage (Object Storage Service\/OSS for checkpoints\/savepoints), and observability (Log Service\/SLS and CloudMonitor), plus streaming sources and sinks (for example Kafka-compatible services, databases via JDBC, and other data services). Exact connector availability varies by runtime version and region\u2014verify in the official connector documentation.<\/p>\n\n\n\n<p>The problem it solves: operating Apache Flink reliably is non-trivial. You must manage clusters, upgrades, state backends, checkpoints, fault recovery, scaling, security, and observability\u2014often 24\/7. Realtime Compute for Apache Flink reduces that operational burden while enabling production-grade real-time analytics in the Alibaba Cloud ecosystem.<\/p>\n\n\n\n<blockquote>\n<p>Naming note (verify in official docs): Alibaba historically used product names like \u201cBlink\u201d in the Flink space. Today the managed service is branded as <strong>Realtime Compute for Apache Flink<\/strong>. If you encounter older terms in blogs or screenshots, treat them as legacy and cross-check the current console and documentation.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Realtime Compute for Apache Flink?<\/h2>\n\n\n\n<p><strong>Official purpose (what it is for):<\/strong><br\/>\nRealtime Compute for Apache Flink is a managed service for building and running Apache Flink jobs on Alibaba Cloud. 
It is designed for continuous stream processing, real-time ETL, real-time feature computation, event-driven applications, and live analytics.<\/p>\n\n\n\n<p><strong>Core capabilities (high level):<\/strong>\n&#8211; Run Apache Flink jobs (commonly Flink SQL and Flink DataStream applications) with managed deployment and operations.\n&#8211; Perform stateful stream processing: windowed aggregations, joins, deduplication, pattern detection, enrichment, and routing.\n&#8211; Support fault tolerance through checkpoints\/savepoints (standard Flink concepts) backed by durable storage (commonly OSS).\n&#8211; Integrate with Alibaba Cloud services for networking, security, logging, and monitoring.\n&#8211; Provide a console\/UI and APIs for job lifecycle management, configuration, and observability.<\/p>\n\n\n\n<p><strong>Major components (conceptual):<\/strong>\n&#8211; <strong>Control plane<\/strong>: Alibaba Cloud console, APIs, and service backend that manage environments\/workspaces, job configuration, versions, scaling, and deployment lifecycle.\n&#8211; <strong>Compute runtime<\/strong>: Flink JobManager\/TaskManager processes (managed by the service) that execute your SQL or application code.\n&#8211; <strong>State &amp; durability<\/strong>: Checkpoints and savepoints persisted to durable storage (commonly OSS or another supported storage) for recovery and upgrades.\n&#8211; <strong>Connectors<\/strong>: Integration points to read\/write data (Kafka-compatible sources, databases, data lakes, etc.). Availability depends on runtime version and region\u2014verify in official docs.\n&#8211; <strong>Observability<\/strong>: Logs (often via SLS), metrics (often via CloudMonitor), and the Flink Web UI-equivalent views surfaced through the service.<\/p>\n\n\n\n<p><strong>Service type:<\/strong><br\/>\nManaged analytics computing \/ stream processing (PaaS). 
You bring SQL and\/or Flink code; the platform manages much of the runtime operations.<\/p>\n\n\n\n<p><strong>Scope (regional\/global and tenancy):<\/strong>\n&#8211; Typically <strong>regional<\/strong>: you create resources in a specific Alibaba Cloud region (for data gravity, latency, compliance, and service availability reasons).\n&#8211; Typically <strong>account-scoped<\/strong> under your Alibaba Cloud account, with <strong>project\/workspace\/environment<\/strong> constructs inside the service (names may vary by console version). Access is controlled using RAM users\/roles and policies.\n&#8211; Network access is typically within a <strong>VPC<\/strong> (recommended for production) with optional public endpoints depending on region and configuration.<\/p>\n\n\n\n<p><strong>How it fits into the Alibaba Cloud ecosystem:<\/strong>\n&#8211; <strong>Analytics Computing<\/strong>: complements batch analytics platforms (for example MaxCompute or EMR) by providing low-latency stream processing.\n&#8211; <strong>Data ingestion<\/strong>: pairs with streaming ingestion (Kafka-compatible services, DataHub\u2014verify current product positioning), application logs, or IoT ingestion.\n&#8211; <strong>Storage and serving<\/strong>: outputs to data warehouses, OLAP engines, databases, search engines, and storage services used for dashboards, alerting, and APIs.\n&#8211; <strong>Governance and ops<\/strong>: aligns with RAM, ActionTrail, SLS, CloudMonitor, tagging, and cost management.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Realtime Compute for Apache Flink?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Real-time decisioning<\/strong>: reduce time-to-insight from hours to seconds (fraud detection, inventory updates, personalization).<\/li>\n<li><strong>Continuous data products<\/strong>: build continuously updated aggregates (KPIs, user features, anomaly scores) that power products and operations.<\/li>\n<li><strong>Faster iteration<\/strong>: managed service accelerates PoCs and production rollouts vs. self-managed Flink.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Stateful streaming<\/strong>: Apache Flink is widely adopted for complex event processing with strong state and time semantics.<\/li>\n<li><strong>Exactly-once processing patterns<\/strong>: Flink provides mechanisms (checkpoints + transactional sinks \/ idempotency strategies) to approach exactly-once outcomes when used correctly. 
Final guarantees depend on connectors and sink semantics\u2014verify in official docs and connector notes.<\/li>\n<li><strong>Unified APIs<\/strong>: write pipelines in Flink SQL or code (Java\/Scala; Python support depends on the managed runtime\u2014verify).<\/li>\n<li><strong>Event time<\/strong>: handle out-of-order events with watermarks (core Flink feature).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Reduced platform burden<\/strong>: fewer tasks around cluster provisioning, patching, and scaling.<\/li>\n<li><strong>Standardized observability<\/strong>: central logs\/metrics, job health visibility, and operational controls.<\/li>\n<li><strong>Production lifecycle controls<\/strong>: upgrades, savepoints, rollbacks (where supported), and deployment workflows.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>RAM-based access control<\/strong>: enforce least privilege and separation of duties.<\/li>\n<li><strong>VPC networking<\/strong>: keep traffic private and restrict exposure.<\/li>\n<li><strong>Auditability<\/strong>: integrate with Alibaba Cloud audit trails and logging services (verify specific integration points in your region).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Horizontal scaling<\/strong>: Flink\u2019s parallelism and distributed execution model support scaling out.<\/li>\n<li><strong>Backpressure handling<\/strong>: Flink can manage bursts via backpressure and checkpointing; you must still capacity-plan sources\/sinks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p>Choose Realtime Compute for Apache Flink when you need:\n&#8211; Near real-time analytics (seconds to minutes)\n&#8211; Stateful transformations and joins across 
streams\n&#8211; Continuous pipelines with high availability expectations\n&#8211; Managed operations on Alibaba Cloud, close to your data sources\/sinks in the same region<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should <em>not<\/em> choose it<\/h3>\n\n\n\n<p>Consider alternatives when:\n&#8211; You only need <strong>batch<\/strong> processing (a batch engine may be cheaper\/simpler).\n&#8211; You require <strong>sub-second<\/strong> ultra-low-latency at extreme scale with very specific runtimes\u2014benchmark and validate.\n&#8211; Your workload is small and sporadic and you cannot justify always-on streaming resources (unless the service supports cost-effective scaling to zero\u2014verify).\n&#8211; You have strict requirements to run a custom Flink distribution\/plugins not supported by the managed service (verify supported extension mechanisms).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. Where is Realtime Compute for Apache Flink used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>E-commerce &amp; retail<\/strong>: real-time recommendations, cart abandonment signals, inventory and pricing updates.<\/li>\n<li><strong>FinTech &amp; payments<\/strong>: fraud scoring, AML pattern detection, real-time risk signals.<\/li>\n<li><strong>Gaming<\/strong>: live telemetry, player behavior analytics, anti-cheat signals.<\/li>\n<li><strong>Logistics<\/strong>: package tracking, route optimization signals, ETA prediction features.<\/li>\n<li><strong>Manufacturing\/IoT<\/strong>: anomaly detection on sensor data, predictive maintenance features.<\/li>\n<li><strong>AdTech\/marketing<\/strong>: attribution pipelines, bidding features, real-time audience segmentation.<\/li>\n<li><strong>Media<\/strong>: live content analytics, QoE monitoring, trending detection.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data engineering and platform 
teams building streaming foundations<\/li>\n<li>SRE\/DevOps teams operating pipelines<\/li>\n<li>Application teams embedding real-time features<\/li>\n<li>Security and fraud teams building detections<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Stream ingestion \u2192 enrich \u2192 aggregate \u2192 serve<\/li>\n<li>CDC (change data capture) pipelines (connector-dependent\u2014verify)<\/li>\n<li>Log\/event ETL to real-time OLAP stores<\/li>\n<li>Feature computation for ML online serving<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Event-driven microservices + stream processing layer<\/li>\n<li>Lambda\/Kappa-style architectures (streaming-first)<\/li>\n<li>Streaming backbone feeding a data lake\/warehouse plus real-time serving stores<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production<\/strong>: multi-AZ\/HA requirements (depending on region\/service design), managed checkpoints to OSS, strict IAM, VPC-only, alerting and runbooks.<\/li>\n<li><strong>Dev\/test<\/strong>: smaller compute footprints, reduced retention, sandbox credentials, test topics\/tables, synthetic event generators.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic scenarios for Alibaba Cloud Realtime Compute for Apache Flink. 
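A recurring primitive across these scenarios (KPIs every 10 seconds, per-minute error rates, rolling metrics) is keyed aggregation over tumbling event-time windows. A minimal plain-Python sketch of that primitive, for intuition only (the function and variable names are hypothetical; a real job expresses this declaratively, e.g. with a Flink SQL window aggregation):

```python
# Tumbling-window aggregation: each event lands in exactly one fixed-size
# window based on its event time, and sums are kept per (window, key).
from collections import defaultdict

WINDOW_SECONDS = 10

def tumbling_sums(events):
    """events: iterable of (event_time_seconds, key, amount).
    Returns {(window_start, key): sum_of_amounts}."""
    sums = defaultdict(float)
    for ts, key, amount in events:
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS  # assign window
        sums[(window_start, key)] += amount
    return dict(sums)

orders = [(3, "campaign-a", 20.0), (8, "campaign-a", 5.0),
          (12, "campaign-a", 7.5), (14, "campaign-b", 3.0)]
print(tumbling_sums(orders))   # windows [0,10) and [10,20), per campaign
```

In the managed service, the same logic runs continuously and in parallel across keys, with window state checkpointed for recovery.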
Connector specifics can vary\u2014verify the supported connectors and versions in official docs.<\/p>\n\n\n\n<p>1) <strong>Real-time KPI dashboard<\/strong>\n&#8211; <strong>Problem:<\/strong> Business dashboards update too slowly when computed in batch.\n&#8211; <strong>Why this service fits:<\/strong> Streaming windows compute KPIs continuously with event-time correctness.\n&#8211; <strong>Example:<\/strong> Compute GMV, orders\/min, conversion rate per campaign every 10 seconds and sink to an OLAP store powering dashboards.<\/p>\n\n\n\n<p>2) <strong>Fraud detection stream<\/strong>\n&#8211; <strong>Problem:<\/strong> Fraud decisions must happen before transactions complete.\n&#8211; <strong>Why this service fits:<\/strong> Stateful rules and pattern detection over event streams with enrichment.\n&#8211; <strong>Example:<\/strong> Join card transactions with recent device fingerprints; flag bursts and anomalous geolocation changes.<\/p>\n\n\n\n<p>3) <strong>Clickstream sessionization<\/strong>\n&#8211; <strong>Problem:<\/strong> You need session-level analytics from raw click events.\n&#8211; <strong>Why this service fits:<\/strong> Event-time windows and stateful session windows.\n&#8211; <strong>Example:<\/strong> Build sessions per user with inactivity gaps; output session summaries to a warehouse and a real-time store.<\/p>\n\n\n\n<p>4) <strong>Real-time ETL from Kafka to a data warehouse<\/strong>\n&#8211; <strong>Problem:<\/strong> Data arrives in Kafka but analytics lives in a warehouse.\n&#8211; <strong>Why this service fits:<\/strong> Managed Flink SQL for parsing, cleaning, enrichment, and writing to warehouse sinks.\n&#8211; <strong>Example:<\/strong> Parse JSON events, enforce schemas, add geo\/IP enrichment, and load curated tables.<\/p>\n\n\n\n<p>5) <strong>Operational alerting from logs<\/strong>\n&#8211; <strong>Problem:<\/strong> Detect error spikes and latency regressions immediately.\n&#8211; <strong>Why this service fits:<\/strong> Streaming 
aggregations + threshold alerts.\n&#8211; <strong>Example:<\/strong> Aggregate API error rate per service per minute; output to a topic\/table consumed by alerting.<\/p>\n\n\n\n<p>6) <strong>IoT anomaly detection<\/strong>\n&#8211; <strong>Problem:<\/strong> Sensors produce continuous streams; anomalies should be detected early.\n&#8211; <strong>Why this service fits:<\/strong> Stateful processing with rolling statistics.\n&#8211; <strong>Example:<\/strong> Compute rolling mean\/stddev per sensor and flag deviations; sink to a time-series store.<\/p>\n\n\n\n<p>7) <strong>Inventory and pricing updates<\/strong>\n&#8211; <strong>Problem:<\/strong> Inventory changes must be reflected across channels quickly.\n&#8211; <strong>Why this service fits:<\/strong> Stream joins, deduplication, and ordering by event time.\n&#8211; <strong>Example:<\/strong> Merge inventory changes from multiple warehouses; output canonical inventory snapshots.<\/p>\n\n\n\n<p>8) <strong>Real-time user features for ML<\/strong>\n&#8211; <strong>Problem:<\/strong> Models need up-to-date user behavior features.\n&#8211; <strong>Why this service fits:<\/strong> Continuous feature computation with low-latency sinks.\n&#8211; <strong>Example:<\/strong> Maintain per-user counters (views, purchases last 1h\/24h) and write to a fast key-value store.<\/p>\n\n\n\n<p>9) <strong>CDC-based cache invalidation<\/strong>\n&#8211; <strong>Problem:<\/strong> App caches become stale when DB updates.\n&#8211; <strong>Why this service fits:<\/strong> Streaming pipelines can capture changes and update caches (connector-dependent).\n&#8211; <strong>Example:<\/strong> Consume DB change events, update cache entries, and publish invalidation events.<\/p>\n\n\n\n<p>10) <strong>Data quality checks in motion<\/strong>\n&#8211; <strong>Problem:<\/strong> Bad data propagates quickly; you need guards.\n&#8211; <strong>Why this service fits:<\/strong> Streaming rules, anomaly detection, and side outputs.\n&#8211; 
<strong>Example:<\/strong> Validate required fields and ranges; route invalid events to a quarantine sink.<\/p>\n\n\n\n<p>11) <strong>Real-time order fulfillment monitoring<\/strong>\n&#8211; <strong>Problem:<\/strong> Track SLAs across multiple event streams.\n&#8211; <strong>Why this service fits:<\/strong> Correlate events from ordering, payment, and shipping systems.\n&#8211; <strong>Example:<\/strong> Join streams by order_id; compute time-to-ship and alert on breaches.<\/p>\n\n\n\n<p>12) <strong>Multi-tenant event processing platform<\/strong>\n&#8211; <strong>Problem:<\/strong> Many teams need stream processing without each running their own clusters.\n&#8211; <strong>Why this service fits:<\/strong> Central managed platform with controlled access and standardized ops.\n&#8211; <strong>Example:<\/strong> Platform team provides workspaces\/namespaces, templates, and guardrails.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<p>Feature availability can be runtime-version and region dependent. 
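Several of the features below rest on Flink's event-time model: events may arrive out of order, and a watermark tells the runtime when a window can safely close. A plain-Python sketch of that idea, for intuition only (Flink implements this natively via WATERMARK declarations and allowed lateness; every name in this sketch is hypothetical):

```python
# Event-time windows with a watermark: out-of-order events are buffered,
# and a window only "fires" once the watermark (max event time seen minus
# allowed lateness) has passed the window's end.
ALLOWED_LATENESS = 2   # seconds an event may arrive late
WINDOW = 5             # tumbling window size in seconds

def fire_windows(events):
    """events: (event_time, value) pairs, possibly out of order.
    Returns {window_start: sum} for windows the watermark has closed."""
    buffers, fired, max_ts = {}, {}, float("-inf")
    for ts, value in events:
        start = (ts // WINDOW) * WINDOW
        buffers[start] = buffers.get(start, 0) + value   # buffer into window
        max_ts = max(max_ts, ts)
        watermark = max_ts - ALLOWED_LATENESS
        for s in [s for s in buffers if s + WINDOW <= watermark]:
            fired[s] = buffers.pop(s)                    # window complete: emit
    return fired

# The event at t=4 arrives after t=6 (out of order) but still lands in
# window [0,5); window [5,10) stays open because the watermark never passes 10.
events = [(1, 10), (6, 1), (4, 5), (9, 2)]
print(fire_windows(events))
```

This is why correct results survive late data: emission is driven by watermarks, not by arrival order.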
Verify the exact list in the official documentation for your region and purchased edition.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Managed Apache Flink runtime<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Runs Flink jobs without you managing VM clusters manually.<\/li>\n<li><strong>Why it matters:<\/strong> Reduces ops overhead (provisioning, patching, baseline configs).<\/li>\n<li><strong>Practical benefit:<\/strong> Faster onboarding and more consistent production environments.<\/li>\n<li><strong>Caveat:<\/strong> You must align job design with the managed runtime constraints (supported versions, connectors, and resource model).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Flink SQL development and execution<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Allows authoring streaming pipelines in SQL (DDL\/DML).<\/li>\n<li><strong>Why it matters:<\/strong> Lowers barrier for analysts\/data engineers; faster iterations.<\/li>\n<li><strong>Practical benefit:<\/strong> Rapid ETL and aggregation pipelines with declarative logic.<\/li>\n<li><strong>Caveat:<\/strong> Complex custom logic may require UDFs or DataStream API; UDF support and packaging model must follow the service\u2019s guidelines.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Application (code) deployments (DataStream API)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Run compiled Flink applications (commonly Java\/Scala).<\/li>\n<li><strong>Why it matters:<\/strong> Enables advanced logic beyond SQL (custom state, process functions).<\/li>\n<li><strong>Practical benefit:<\/strong> Full Flink programmability for complex event processing.<\/li>\n<li><strong>Caveat:<\/strong> Packaging dependencies, connector JARs, and version compatibility must match the managed runtime.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Stateful processing with checkpoints\/savepoints<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Persists state periodically for fault tolerance and upgrades.<\/li>\n<li><strong>Why it matters:<\/strong> Stateful streaming is the core of Flink reliability.<\/li>\n<li><strong>Practical benefit:<\/strong> Recovery from failures with minimal data loss; controlled upgrades via savepoints.<\/li>\n<li><strong>Caveat:<\/strong> Checkpoint storage location (often OSS) must be correctly configured and secured; large state increases storage and I\/O costs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scaling and parallelism controls<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Adjust job parallelism and compute resources (exact knobs depend on the service).<\/li>\n<li><strong>Why it matters:<\/strong> Streaming load changes; scaling prevents lag and backpressure.<\/li>\n<li><strong>Practical benefit:<\/strong> Keep latency stable while controlling cost.<\/li>\n<li><strong>Caveat:<\/strong> Rescaling stateful jobs can require savepoints and careful planning; autoscaling behavior (if available) should be tested.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Built-in integrations (connectors)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Connects to streaming sources and sinks (Kafka-compatible, databases, storage, etc.).<\/li>\n<li><strong>Why it matters:<\/strong> Most of the work in streaming is integration.<\/li>\n<li><strong>Practical benefit:<\/strong> Less custom connector engineering; faster delivery.<\/li>\n<li><strong>Caveat:<\/strong> Connector semantics differ (exactly-once vs at-least-once), and connector availability differs by runtime\u2014verify.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Observability: logs, metrics, and job UI<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Provides visibility into job status, failures, backpressure, checkpoints, throughput, and 
logs.<\/li>\n<li><strong>Why it matters:<\/strong> Streaming jobs are long-running; you need continuous monitoring.<\/li>\n<li><strong>Practical benefit:<\/strong> Faster MTTR, better capacity planning.<\/li>\n<li><strong>Caveat:<\/strong> Retention and cost for logs\/metrics can grow; set policies and sampling.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking and private connectivity (VPC)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Runs jobs with access to VPC resources (databases, caches, internal endpoints).<\/li>\n<li><strong>Why it matters:<\/strong> Most production data systems are private.<\/li>\n<li><strong>Practical benefit:<\/strong> Reduced exposure and better security posture.<\/li>\n<li><strong>Caveat:<\/strong> You must plan subnets, security groups, route tables, and DNS; misconfigurations cause timeouts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM (RAM) integration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Controls who can create\/modify\/run jobs and access related resources.<\/li>\n<li><strong>Why it matters:<\/strong> Prevents unauthorized changes to production pipelines.<\/li>\n<li><strong>Practical benefit:<\/strong> Least privilege, auditability, separation of duties.<\/li>\n<li><strong>Caveat:<\/strong> You must understand both \u201ccontrol plane permissions\u201d and \u201cruntime permissions\u201d (e.g., access to OSS checkpoint buckets).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Versioning and upgrades (runtime versions)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Offers supported Flink versions \/ runtimes.<\/li>\n<li><strong>Why it matters:<\/strong> Security patches and compatibility.<\/li>\n<li><strong>Practical benefit:<\/strong> Managed upgrade paths (where supported) reduce risk.<\/li>\n<li><strong>Caveat:<\/strong> Upgrades can impact connectors, serialization, and state 
compatibility\u2014test in staging and use savepoints.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level service architecture<\/h3>\n\n\n\n<p>A typical managed Flink service has:\n1. <strong>User access layer<\/strong>: Alibaba Cloud console\/API where you define jobs, permissions, and configurations.\n2. <strong>Control plane<\/strong>: Validates configs, orchestrates deployments, allocates resources, and manages versions.\n3. <strong>Data plane<\/strong>: Flink runtime executing your job; communicates with sources\/sinks and checkpoint storage.\n4. <strong>Observability plane<\/strong>: logs and metrics pipelines to SLS\/CloudMonitor (and possibly a built-in UI).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Data\/control flow (conceptual)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You author SQL or upload an application.<\/li>\n<li>The control plane creates\/updates the running job.<\/li>\n<li>The job reads from sources (streams), processes events, writes to sinks.<\/li>\n<li>Checkpoints are periodically written to durable storage.<\/li>\n<li>Metrics and logs are continuously emitted for monitoring and troubleshooting.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related Alibaba Cloud services (common patterns)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>OSS<\/strong>: checkpoint\/savepoint storage; also data lake storage for files.<\/li>\n<li><strong>Log Service (SLS)<\/strong>: job logs; sometimes sink\/source for log pipelines (verify).<\/li>\n<li><strong>CloudMonitor<\/strong>: metrics and alerting.<\/li>\n<li><strong>RAM<\/strong>: identity and access policies.<\/li>\n<li><strong>VPC<\/strong>: private networking to databases\/caches\/queues.<\/li>\n<li><strong>Streaming sources<\/strong>: Kafka-compatible services, DataHub, etc. 
(verify current recommended products in your region).<\/li>\n<li><strong>Databases\/warehouses<\/strong>: via JDBC connectors to ApsaraDB services (verify supported engines and versions).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services (you often need)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A <strong>streaming source<\/strong> (Kafka-compatible, etc.) and a <strong>sink<\/strong> (database, warehouse, file storage).<\/li>\n<li><strong>OSS bucket<\/strong> (commonly) for state\/checkpoints and possibly artifacts.<\/li>\n<li><strong>SLS project\/logstore<\/strong> for logs (depends on configuration and defaults).<\/li>\n<li><strong>VPC<\/strong> networking if accessing private endpoints.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model (typical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane access<\/strong>: RAM users\/roles with policies granting permission to manage Flink resources.<\/li>\n<li><strong>Runtime access<\/strong>: the Flink job needs permission to access OSS (for checkpoints) and any other services (sources\/sinks). This is often done via a service role \/ RAM role attached to the service or via credential configuration. 
Exact mechanism is region\/edition-dependent\u2014verify in official docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model (typical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Jobs run inside managed infrastructure with optional VPC attachment.<\/li>\n<li>For private data sources (RDS, Redis, etc.), place them in the same VPC and configure security groups and whitelists.<\/li>\n<li>For public endpoints, ensure egress rules and NAT\/Internet access (if allowed) and consider security implications.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Define SLOs (lag, throughput, end-to-end latency).<\/li>\n<li>Monitor checkpoint success rate\/duration; failing checkpoints often indicate backpressure or storage\/network issues.<\/li>\n<li>Govern configurations via templates and code review.<\/li>\n<li>Use tags and naming conventions to map jobs to cost centers and owners.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (conceptual)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[Engineer \/ Data Engineer] --&gt;|SQL or App| C[Alibaba Cloud Console\/API]\n  C --&gt; P[Realtime Compute for Apache Flink Control Plane]\n  P --&gt; R[\"Flink Runtime (JobManager\/TaskManagers)\"]\n  S[(\"Event Source&lt;br\/&gt;(e.g., Kafka-compatible)\")] --&gt; R\n  R --&gt; K[(\"Sink&lt;br\/&gt;(DB\/OLAP\/OSS\/etc.)\")]\n  R --&gt; O[(\"OSS&lt;br\/&gt;Checkpoints\/Savepoints\")]\n  R --&gt; L[(\"Logs\/Metrics&lt;br\/&gt;SLS\/CloudMonitor\")]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (multi-system)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph VPC[\"VPC (Recommended)\"]\n    subgraph Ingest[Ingestion Layer]\n      K1[(Kafka-compatible Cluster)]\n      APP[Microservices \/ Producers]\n      APP --&gt; K1\n    end\n\n    subgraph Flink[Realtime Compute for Apache Flink]\n      JM[Job Manager]\n      
TM[Task Managers]\n      JM --- TM\n    end\n\n    subgraph Storage[State + Data Stores]\n      OSS[(\"OSS Bucket&lt;br\/&gt;Checkpoints\/Savepoints\")]\n      RDS[(\"ApsaraDB RDS \/ PolarDB&lt;br\/&gt;Operational Tables\")]\n      OLAP[(\"Real-time OLAP Store&lt;br\/&gt;(Verify service choice)\")]\n      KV[(\"Key-Value Cache&lt;br\/&gt;(Verify service choice)\")]\n    end\n\n    subgraph Obs[Observability &amp; Governance]\n      SLS[(Log Service)]\n      CM[(CloudMonitor Alerts)]\n      AT[(ActionTrail)]\n    end\n  end\n\n  K1 --&gt;|events| Flink\n  Flink --&gt;|enriched stream| OLAP\n  Flink --&gt;|features| KV\n  Flink --&gt;|writes\/reads| RDS\n  Flink --&gt;|checkpoints| OSS\n  Flink --&gt; SLS\n  CM &lt;--&gt;|metrics| Flink\n  Flink -.-&gt;|audit events| AT\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<p>Before you start, confirm these items in your target region using official documentation.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Account and billing<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An <strong>Alibaba Cloud account<\/strong> with billing enabled.<\/li>\n<li>Ability to purchase or enable <strong>Realtime Compute for Apache Flink<\/strong> in your chosen region.<\/li>\n<li>A payment method suitable for your organization (pay-as-you-go or subscription availability depends on region\/edition\u2014verify).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions (RAM)<\/h3>\n\n\n\n<p>You typically need RAM permissions to:\n&#8211; Create and manage Realtime Compute for Apache Flink resources (workspaces\/projects, jobs\/deployments).\n&#8211; Create\/read\/write <strong>OSS<\/strong> buckets (for checkpoints\/artifacts).\n&#8211; Create\/read <strong>SLS<\/strong> projects\/logstores (if you configure logging).\n&#8211; Manage <strong>VPC<\/strong> networking (if using VPC access): VPCs, vSwitches, security groups.\n&#8211; Optional: access to sources\/sinks (Kafka service, RDS, etc.).<\/p>\n\n\n\n<p>If you are in an enterprise environment:\n&#8211; 
Use separate roles for platform admins vs. job developers.\n&#8211; Use a dedicated service role for runtime resource access where supported.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Tools<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Alibaba Cloud console access is enough for this tutorial.<\/li>\n<li>Optional: Alibaba Cloud CLI (<code>aliyun<\/code>) is helpful for automating OSS\/SLS\/VPC. CLI installation and authentication steps are documented here (verify current URL):<br\/>\n  https:\/\/www.alibabacloud.com\/help\/en\/alibaba-cloud-cli\/latest\/what-is-alibaba-cloud-cli<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Realtime Compute for Apache Flink is <strong>region-dependent<\/strong>. Confirm your region supports it in the product availability matrix (verify in official docs\/product page).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<p>Common limits to verify:\n&#8211; Maximum number of jobs\/deployments per workspace\/project.\n&#8211; Max parallelism and resource quotas.\n&#8211; Connector-specific limits (e.g., Kafka partitions, sink TPS).\n&#8211; SLS log retention and indexing costs\/limits.\nBecause quotas can change by region\/edition, <strong>verify in official docs<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services (recommended)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>OSS<\/strong> bucket for checkpoints\/savepoints and optionally artifacts.<\/li>\n<li><strong>SLS<\/strong> for logs (if not enabled by default).<\/li>\n<li>A <strong>VPC<\/strong> with at least one vSwitch if you plan to connect to private data sources.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p>Alibaba Cloud pricing is region- and edition-dependent. Do not rely on third-party price tables. 
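To reason about the dominant cost term (always-on compute) before consulting official prices, a back-of-envelope sketch helps. All numbers below are placeholders, not Alibaba Cloud rates; the actual meter (CU vs. vCPU/memory bundle) and unit prices must come from the official pricing page:

```python
# Back-of-envelope monthly cost for an always-on streaming job.
# PLACEHOLDER NUMBERS ONLY: the real billing unit and price are
# region/edition-specific and must be taken from official pricing.
HOURS_PER_MONTH = 730        # roughly 24/7 operation for one month
compute_units = 8            # hypothetical resource allocation for the job
price_per_cu_hour = 0.10     # hypothetical unit price, in USD

compute_cost = compute_units * price_per_cu_hour * HOURS_PER_MONTH
print(f"Estimated monthly compute: ${compute_cost:.2f}")
# On top of compute, budget for OSS checkpoint storage, SLS log
# ingestion/retention, and any cross-zone or Internet data transfer.
```

The multiplication is the point: streaming jobs accrue compute-hours continuously, so parallelism and resource sizing dominate the bill.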
Always validate with the official pricing page and the console purchase flow.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (typical for managed Flink)<\/h3>\n\n\n\n<p>Realtime Compute for Apache Flink commonly charges based on some combination of:\n&#8211; <strong>Compute resources<\/strong> allocated to jobs (for example CPU\/memory bundles, \u201ccompute units\u201d, or resource specifications).\n&#8211; <strong>Running time<\/strong> (per hour\/minute) while jobs are running.\n&#8211; <strong>Storage and I\/O<\/strong> costs for checkpoints\/savepoints (OSS charges separately).\n&#8211; <strong>Data transfer<\/strong>: intra-region traffic may be cheaper than cross-region; Internet egress is usually billable.\n&#8211; <strong>Logs\/metrics<\/strong>: SLS ingestion, indexing, and retention; CloudMonitor custom metrics and alert rules (if applicable).<\/p>\n\n\n\n<p>Because the exact meter (CU, vCPU, memory, etc.) and billing granularity can differ, <strong>verify the current billing model in official pricing<\/strong>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>A permanent free tier is not guaranteed for this class of service. 
Some regions may offer trials, coupons, or promotional credits\u2014<strong>verify in Alibaba Cloud offers<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers (what usually makes bills grow)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Always-on jobs<\/strong>: streaming jobs run 24\/7, which multiplies compute-hours.<\/li>\n<li><strong>High parallelism<\/strong>: more TaskManagers\/slots \u2192 more cost.<\/li>\n<li><strong>Large state<\/strong>: bigger checkpoints and higher checkpoint frequency \u2192 more OSS storage + write I\/O.<\/li>\n<li><strong>High log volume<\/strong>: verbose logging to SLS can become expensive.<\/li>\n<li><strong>Cross-AZ \/ cross-region traffic<\/strong>: if sources\/sinks are in different zones\/regions.<\/li>\n<li><strong>Hot sinks<\/strong>: OLAP\/database sinks that require high write throughput (you pay for those services too).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>OSS<\/strong>: checkpoint retention (old checkpoints\/savepoints), lifecycle policies not set.<\/li>\n<li><strong>SLS<\/strong>: indexing everything by default.<\/li>\n<li><strong>NAT Gateway \/ EIP<\/strong>: if your Flink runtime needs outbound Internet from VPC.<\/li>\n<li><strong>Downstream services<\/strong>: databases, caches, message queues.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer keeping sources, Flink jobs, and sinks in the <strong>same region<\/strong>.<\/li>\n<li>Use VPC endpoints\/private connectivity where possible.<\/li>\n<li>If consuming from Internet-exposed Kafka endpoints, you may pay Internet egress\/ingress depending on architecture\u2014design for private connectivity.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost (practical)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Right-size parallelism 
and resources based on observed lag and CPU utilization.<\/li>\n<li>Use <strong>staging<\/strong> and <strong>production<\/strong> separation with smaller staging footprints.<\/li>\n<li>Tune checkpoint interval and state TTL to reduce state size (without compromising recovery objectives).<\/li>\n<li>Reduce log verbosity in production; set SLS retention and indexing selectively.<\/li>\n<li>Consolidate multiple small pipelines where appropriate, but avoid \u201cmega-jobs\u201d that increase blast radius.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (no fabricated numbers)<\/h3>\n\n\n\n<p>A minimal learning setup usually includes:\n&#8211; 1 small development job running for a short time (minutes to a few hours)\n&#8211; OSS bucket for checkpoints (small footprint)\n&#8211; SLS logs with short retention\nBecause <strong>unit prices vary by region\/edition<\/strong>, get a realistic number by:\n1. Checking the official Realtime Compute for Apache Flink pricing page (or console order page).\n2. Estimating compute-hours for your job runtime.\n3. 
Adding OSS storage for checkpoints and SLS ingestion for logs.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>For a production system, create a cost model per pipeline:\n&#8211; Compute: baseline + peak parallelism (and whether autoscaling is used)\n&#8211; Checkpoint storage: state size \u00d7 checkpoint frequency \u00d7 retention\n&#8211; Source\/sink throughput: consider Kafka partitions, database write capacity, OLAP ingestion\n&#8211; Observability: logs\/metrics volume per job\nThen validate with:\n&#8211; Alibaba Cloud pricing pages for each service\n&#8211; Internal FinOps tagging and monthly budgets<\/p>\n\n\n\n<p><strong>Official pricing references (verify):<\/strong>\n&#8211; Product page (often links to pricing): https:\/\/www.alibabacloud.com\/product\/realtime-compute-for-apache-flink<br\/>\n&#8211; Documentation hub: https:\/\/www.alibabacloud.com\/help\/en\/realtime-compute-for-apache-flink<\/p>\n\n\n\n<p>If you cannot find a dedicated pricing page for your region, use the <strong>console purchase\/billing details<\/strong> for the definitive meter names and unit pricing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p>This lab is designed to be executable with minimal external dependencies. Console labels can vary by region\/runtime version; when the UI differs, follow the closest equivalent steps and cross-check the official \u201cQuick Start\u201d for your region.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Create and run a simple streaming pipeline in <strong>Realtime Compute for Apache Flink<\/strong> using <strong>Flink SQL<\/strong> that generates synthetic events, performs a time-based aggregation, and outputs results to a debug sink (logs) so you can validate the pipeline end-to-end.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. 
Prepare a low-cost environment (region selection, OSS for checkpoints).\n2. Create a Realtime Compute for Apache Flink workspace\/project.\n3. Create a SQL job using a built-in generator source.\n4. Run the job, observe metrics\/logs, and confirm checkpoints.\n5. Clean up resources to avoid ongoing charges.<\/p>\n\n\n\n<blockquote>\n<p>Notes on connectors used in this lab:\n&#8211; The SQL <code>datagen<\/code> connector is a built-in Apache Flink table connector that is commonly available for testing.<br\/>\n&#8211; A \u201cprint\/log\u201d style sink may require a connector JAR depending on the managed runtime. If your runtime does not include a print connector, route output to a supported sink such as OSS filesystem, Log Service, or a database (verify supported connectors in your environment).<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Choose a region and confirm service availability<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Sign in to the Alibaba Cloud console.<\/li>\n<li>Select a region close to you and\/or your data sources.<\/li>\n<li>Confirm <strong>Realtime Compute for Apache Flink<\/strong> is available in that region.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> You can access the Realtime Compute for Apache Flink console and create resources in the selected region.<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; You can open the documentation for the service and see region-specific configuration notes:<br\/>\n  https:\/\/www.alibabacloud.com\/help\/en\/realtime-compute-for-apache-flink<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create an OSS bucket for checkpoints (recommended)<\/h3>\n\n\n\n<p>Even for a toy job, configure durable checkpoint storage so you can learn how production jobs recover.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open <strong>Object Storage Service (OSS)<\/strong> in the same 
region.<\/li>\n<li>Create a bucket with:\n   &#8211; Private access\n   &#8211; A unique name<\/li>\n<li>Create a folder\/prefix such as:\n   &#8211; <code>flink-checkpoints\/<\/code>\n   &#8211; <code>flink-savepoints\/<\/code><\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> An OSS bucket exists for Flink state persistence.<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; You can browse the bucket and see the created prefixes.<\/p>\n\n\n\n<p><strong>Cost note:<\/strong> OSS costs are usually low at small scale, but checkpoint retention can accumulate. You will clean up at the end.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create a workspace\/project in Realtime Compute for Apache Flink<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open <strong>Realtime Compute for Apache Flink<\/strong> in the console.<\/li>\n<li>Create a <strong>Workspace\/Project<\/strong> (name depends on UI), for example:\n   &#8211; Name: <code>flink-lab<\/code>\n   &#8211; Environment: <code>dev<\/code> (if supported)<\/li>\n<li>If prompted:\n   &#8211; Choose <strong>pay-as-you-go<\/strong> for labs (if available) to avoid long commitments.\n   &#8211; Configure default networking (public vs VPC). For this lab, choose the simplest option supported by your region. 
For production, use VPC.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> A workspace\/project exists and you can create jobs.<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; The workspace shows as active and you can access job creation screens.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create a SQL job\/pipeline<\/h3>\n\n\n\n<p>In the Realtime Compute for Apache Flink console:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Create a new <strong>SQL job<\/strong> (names may include \u201cDraft\u201d, \u201cDevelopment\u201d, \u201cSQL Editor\u201d, \u201cSQL Studio\u201d, or \u201cJob\u201d).<\/li>\n<li>Paste the following SQL. It uses:\n   &#8211; A synthetic event generator (<code>datagen<\/code>)\n   &#8211; Event-time attribute and watermark\n   &#8211; Tumbling window aggregation\n   &#8211; A debug sink (shown as <code>print<\/code> below; if unavailable, see alternatives after the code)<\/li>\n<\/ol>\n\n\n\n<pre><code class=\"language-sql\">-- 1) Source: synthetic stream of purchase-like events\nCREATE TABLE source_events (\n  user_id BIGINT,\n  amount DOUBLE,\n  ts TIMESTAMP(3),\n  WATERMARK FOR ts AS ts - INTERVAL '3' SECOND\n) WITH (\n  'connector' = 'datagen',\n  'rows-per-second' = '5',\n  'fields.user_id.kind' = 'random',\n  'fields.user_id.min' = '1',\n  'fields.user_id.max' = '100',\n  'fields.amount.kind' = 'random',\n  'fields.amount.min' = '1',\n  'fields.amount.max' = '200',\n  -- datagen does not support 'sequence' generation for TIMESTAMP columns;\n  -- use the default random kind with 'max-past' to keep event time close\n  -- to the wall clock (verify option support in your runtime version)\n  'fields.ts.max-past' = '5 s'\n);\n\n-- 2) Sink: debug output\n-- If your runtime does NOT support the 'print' connector, use an alternative sink.\nCREATE TABLE sink_out (\n  window_start TIMESTAMP(3),\n  window_end   TIMESTAMP(3),\n  user_id      BIGINT,\n  total_amount DOUBLE,\n  cnt          BIGINT\n) WITH (\n  'connector' = 'print'\n);\n\n-- 3) Streaming aggregation\nINSERT INTO sink_out\nSELECT\n  
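-- window_start \/ window_end are metadata columns produced by the TUMBLE\n  -- table-valued function in the FROM clause (Flink 1.13+ windowing TVF syntax)\n  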
window_start,\n  window_end,\n  user_id,\n  SUM(amount) AS total_amount,\n  COUNT(*) AS cnt\nFROM TABLE(\n  TUMBLE(TABLE source_events, DESCRIPTOR(ts), INTERVAL '10' SECOND)\n)\nGROUP BY window_start, window_end, user_id;\n<\/code><\/pre>\n\n\n\n<p><strong>If <code>connector = 'print'<\/code> is not available (common in some managed runtimes):<\/strong>\n&#8211; Option A (preferred for \u201cno external systems\u201d): Use a sink supported by your runtime that writes to logs or an internal preview tool, if available (verify in the SQL studio docs).\n&#8211; Option B: Write to OSS using the filesystem connector if supported and OSS filesystem integration is configured in your environment (verify exact syntax and supported schemes in Alibaba Cloud docs).\n&#8211; Option C: Write to a small database table (ApsaraDB RDS) via JDBC if you already have one.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> The job is created and passes SQL validation.<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; The SQL editor shows no syntax errors.\n&#8211; The system validates DDL and connector configs (or provides actionable error messages).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Configure checkpoints and job settings<\/h3>\n\n\n\n<p>Before running:\n1. Open the job\u2019s <strong>configuration<\/strong> (job settings\/advanced parameters).\n2. 
Configure:\n   &#8211; <strong>Checkpoint interval<\/strong>: start with something like 30\u201360 seconds for a lab (exact recommended defaults may differ).\n   &#8211; <strong>Checkpoint storage<\/strong>: point to your OSS bucket prefix (the service may provide a UI field for this).\n   &#8211; <strong>Parallelism<\/strong>: choose a small value (e.g., 1\u20132) for low cost.<\/p>\n\n\n\n<p>Because configuration keys differ by managed runtime, follow your console\u2019s fields and verify in official documentation for your runtime version.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> The job has checkpointing enabled and is configured to use OSS (or the managed default).<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; Job config shows checkpointing enabled.\n&#8211; OSS path is accepted (no permission errors).<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Start\/run the job<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Start the job (Run\/Deploy\/Start).<\/li>\n<li>Wait for the job status to become <strong>RUNNING<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> Job transitions to RUNNING and begins generating and aggregating events.<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; In the job overview, you can see:\n  &#8211; Running state\n  &#8211; Throughput metrics (records in\/out)\n  &#8211; Checkpoint status (succeeded\/failed)\n&#8211; If using <code>print<\/code> sink, view task logs and confirm output lines appear periodically.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Observe metrics, checkpoints, and backpressure<\/h3>\n\n\n\n<p>Use the console to review:\n&#8211; <strong>Checkpoint success rate and duration<\/strong>\n&#8211; <strong>Restart count<\/strong> (should be 0 in a stable lab)\n&#8211; <strong>Backpressure<\/strong> indicators (should be low)<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> 
Checkpoints succeed consistently, and the job remains stable.<\/p>\n\n\n\n<p><strong>Verification:<\/strong>\n&#8211; You see successful checkpoints.\n&#8211; No repeated restarts or continuous failures.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>Use at least two of these validation methods:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Job status = RUNNING<\/strong><br\/>\n   &#8211; Confirms scheduling and runtime health.<\/p>\n<\/li>\n<li>\n<p><strong>Checkpoint success<\/strong><br\/>\n   &#8211; Confirms state backend + OSS access works.<\/p>\n<\/li>\n<li>\n<p><strong>Output observed<\/strong>\n   &#8211; If using <code>print<\/code> connector: confirm aggregated records in logs.\n   &#8211; If using OSS sink: confirm output files in OSS.\n   &#8211; If using JDBC sink: confirm rows appear in the target table.<\/p>\n<\/li>\n<li>\n<p><strong>Metrics trend<\/strong>\n   &#8211; Records in\/out should be non-zero and stable.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p>Common errors and realistic fixes:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Connector not found (<code>print<\/code> \/ <code>datagen<\/code> not available)<\/strong>\n   &#8211; <strong>Cause:<\/strong> The managed runtime may not ship certain example connectors.\n   &#8211; <strong>Fix:<\/strong> Use a connector listed as supported in your runtime\u2019s connector docs. 
Verify the official \u201cConnectors\u201d page for your runtime version.<\/p>\n<\/li>\n<li>\n<p><strong>Checkpoint failures (permission denied to OSS)<\/strong>\n   &#8211; <strong>Cause:<\/strong> The job runtime identity lacks OSS write permission.\n   &#8211; <strong>Fix:<\/strong> Attach\/authorize the correct RAM role\/policy to allow <code>oss:PutObject<\/code>, <code>oss:GetObject<\/code>, <code>oss:ListObjects<\/code> on the checkpoint bucket\/prefix. Verify the official IAM setup steps for Realtime Compute for Apache Flink.<\/p>\n<\/li>\n<li>\n<p><strong>Job stuck in STARTING \/ FAILED with network timeouts<\/strong>\n   &#8211; <strong>Cause:<\/strong> VPC\/security group rules block access to endpoints (OSS, SLS, or sinks).\n   &#8211; <strong>Fix:<\/strong> Confirm VPC routing, DNS, security group egress, and whitelists. Keep sources\/sinks in the same VPC\/region.<\/p>\n<\/li>\n<li>\n<p><strong>High checkpoint duration \/ backpressure<\/strong>\n   &#8211; <strong>Cause:<\/strong> Too little compute or too frequent checkpoints.\n   &#8211; <strong>Fix:<\/strong> Increase resources\/parallelism, reduce checkpoint frequency, or optimize state size (TTL, key cardinality).<\/p>\n<\/li>\n<li>\n<p><strong>Frequent restarts<\/strong>\n   &#8211; <strong>Cause:<\/strong> Unhandled exceptions, schema mismatch, sink errors.\n   &#8211; <strong>Fix:<\/strong> Inspect logs, confirm schema compatibility, and add defensive parsing\/validation.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To avoid ongoing costs:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Stop the job<\/strong> in Realtime Compute for Apache Flink.<\/li>\n<li>Delete the job\/deployment if you no longer need it.<\/li>\n<li>Delete OSS objects created for checkpoints\/savepoints:\n   &#8211; Remove <code>flink-checkpoints\/<\/code> and <code>flink-savepoints\/<\/code> prefixes (and any output data if you wrote 
to OSS).<\/li>\n<li>Optionally delete the OSS bucket if it is dedicated to this lab.<\/li>\n<li>Remove SLS logstores\/projects created for the lab (if applicable) or reduce retention.<\/li>\n<li>Delete the workspace\/project if it was created solely for the lab.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> No running jobs remain and storage\/logging artifacts are removed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Co-locate data and compute<\/strong>: keep sources, Flink jobs, and sinks in the same region and (ideally) same VPC.<\/li>\n<li><strong>Design for idempotency<\/strong>: even with strong processing guarantees, downstream sinks often need idempotent writes or upsert semantics to handle retries.<\/li>\n<li><strong>Separate concerns<\/strong>: isolate pipelines by domain or criticality to reduce blast radius.<\/li>\n<li><strong>Use event time correctly<\/strong>: define watermarks and handle late events explicitly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Least privilege<\/strong>: restrict who can start\/stop\/modify production jobs.<\/li>\n<li><strong>Separate roles<\/strong>:<\/li>\n<li>Platform admin (creates workspaces, networking, baseline policies)<\/li>\n<li>Developer (deploys jobs in approved namespaces)<\/li>\n<li>Operator (restart\/rollback permissions without edit permissions, where feasible)<\/li>\n<li><strong>Scope OSS permissions<\/strong> to specific buckets\/prefixes for checkpoints and artifacts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Right-size parallelism<\/strong> using real metrics (CPU, busy time, backpressure, lag).<\/li>\n<li><strong>Limit log volume<\/strong> and set SLS retention policies 
appropriate for compliance needs.<\/li>\n<li><strong>Tune checkpoint interval<\/strong>: too frequent increases overhead; too infrequent increases recovery time.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Avoid hotspots<\/strong>: use keys with balanced cardinality; mitigate skew (salting, repartition strategies).<\/li>\n<li><strong>Use appropriate state TTL<\/strong> to prevent unbounded state growth.<\/li>\n<li><strong>Batch sink writes<\/strong> where supported to improve throughput (connector-dependent).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Use checkpoints and test restores<\/strong>: practice restoring from a savepoint in staging.<\/li>\n<li><strong>Plan upgrades<\/strong>: test new runtime versions, connector versions, and serialization compatibility.<\/li>\n<li><strong>Define SLIs<\/strong>: end-to-end latency, processing lag, checkpoint duration, restart frequency.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Runbooks<\/strong>: create standard procedures for \u201clag increasing\u201d, \u201ccheckpoint failing\u201d, \u201csink errors\u201d.<\/li>\n<li><strong>Alerting<\/strong>:<\/li>\n<li>Job down\/restarting<\/li>\n<li>Checkpoint failures<\/li>\n<li>Lag above threshold<\/li>\n<li>Backpressure sustained<\/li>\n<li><strong>Change management<\/strong>: enforce code review for SQL\/app changes and use CI\/CD where possible.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use consistent naming:<\/li>\n<li><code>env-team-domain-pipeline<\/code> (example: <code>prod-growth-clickstream-sessionize<\/code>)<\/li>\n<li>Tag resources with:<\/li>\n<li><code>Owner<\/code>, <code>CostCenter<\/code>, 
<code>Environment<\/code>, <code>DataClass<\/code> (PII\/non-PII), <code>Criticality<\/code><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">12. Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane<\/strong> access is governed by <strong>RAM<\/strong> permissions:<\/li>\n<li>Who can create\/edit\/start\/stop jobs<\/li>\n<li>Who can access job logs\/metrics<\/li>\n<li><strong>Runtime<\/strong> access must be authorized to reach:<\/li>\n<li>OSS checkpoint locations<\/li>\n<li>Private endpoints in VPC<\/li>\n<li>Any sink\/source services\nThe exact pattern (service role, instance role, credential configuration) depends on the managed service implementation\u2014verify in official docs for your region.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>In transit<\/strong>: Prefer TLS for sources\/sinks (Kafka, JDBC, HTTP) where available.<\/li>\n<li><strong>At rest<\/strong>:<\/li>\n<li>Use OSS server-side encryption (SSE) for checkpoint buckets when required by policy.<\/li>\n<li>Use database encryption features for sinks that store sensitive data.<\/li>\n<li><strong>Secrets<\/strong>: Avoid embedding credentials in SQL or code; use managed secret mechanisms when provided (verify) or RAM roles with short-lived credentials.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer <strong>VPC-only<\/strong> connectivity for production.<\/li>\n<li>Avoid public endpoints for databases and queues unless absolutely necessary.<\/li>\n<li>Restrict security group egress\/ingress; whitelist only required ports and destinations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Store secrets in a dedicated secret manager if available in your stack (Alibaba Cloud has 
services for secrets\u2014verify which is recommended for your region).<\/li>\n<li>Rotate credentials and use least-privileged accounts for sources\/sinks.<\/li>\n<li>For JDBC sinks, use TLS and restricted DB users.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Enable <strong>ActionTrail<\/strong> for audit logs of API actions (create\/update\/start\/stop jobs).<\/li>\n<li>Centralize logs in SLS with retention aligned to compliance requirements.<\/li>\n<li>Monitor and alert on permission changes to roles\/policies used by Flink.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: keep processing in-region when required.<\/li>\n<li>PII handling: minimize PII in streaming where possible; tokenize\/anonymize early.<\/li>\n<li>Retention: set checkpoint\/savepoint retention and logs retention policies to match policy.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Using a single \u201cadmin\u201d RAM user for everything.<\/li>\n<li>Storing plaintext DB passwords inside SQL scripts.<\/li>\n<li>Writing checkpoints\/savepoints into broadly accessible OSS buckets.<\/li>\n<li>Allowing public network access to production sources\/sinks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Dedicated VPC and subnets per environment.<\/li>\n<li>Dedicated OSS buckets per environment, with narrow policies by prefix.<\/li>\n<li>Separate dev\/staging\/prod workspaces and RAM policies.<\/li>\n<li>Use automated policy checks (IaC + policy-as-code) where possible.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<p>Always validate against the official documentation for your runtime version and region.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Connector availability differs<\/strong> by runtime version\/region\/edition. Do not assume every open-source Flink connector is available.<\/li>\n<li><strong>Processing guarantees depend on sinks<\/strong>: exactly-once outcomes require careful sink configuration or idempotency; some sinks are at-least-once.<\/li>\n<li><strong>State growth surprises<\/strong>: high-cardinality keys, long windows, or missing TTL can explode state size and checkpoint cost.<\/li>\n<li><strong>Checkpoint tuning matters<\/strong>: overly frequent checkpoints can reduce throughput; overly infrequent checkpoints increase recovery time and risk.<\/li>\n<li><strong>Schema evolution pitfalls<\/strong>: changing table schemas or serialization can break state compatibility; plan migrations with savepoints.<\/li>\n<li><strong>Network misconfiguration<\/strong>: VPC routing\/security group issues are a frequent cause of timeouts and job failures.<\/li>\n<li><strong>Cost surprises<\/strong>:<\/li>\n<li>Jobs left running in dev<\/li>\n<li>High SLS log ingestion\/indexing<\/li>\n<li>OSS checkpoint retention not managed<\/li>\n<li><strong>Upgrade risk<\/strong>: Flink version upgrades can change planner behavior, connector behavior, or defaults.<\/li>\n<li><strong>Multi-tenant blast radius<\/strong>: too many pipelines in one large deployment can increase the impact of a single issue.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">14. 
Comparison with Alternatives<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Alibaba Cloud alternatives (same cloud)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Self-managed Flink on ECS<\/strong>: maximum control; maximum ops burden.<\/li>\n<li><strong>Flink on EMR (E-MapReduce)<\/strong>: more control over the cluster; still requires operations and sizing.<\/li>\n<li><strong>Batch analytics services<\/strong> (for example MaxCompute): better for batch ETL and large-scale offline analytics; not for low-latency stream processing.<\/li>\n<li><strong>DataWorks<\/strong>: orchestration and data development platform; can orchestrate streaming and batch, but isn\u2019t a Flink runtime itself (verify your region\u2019s integration patterns).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Other cloud providers (nearest equivalents)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Amazon Managed Service for Apache Flink<\/strong> (formerly AWS Kinesis Data Analytics for Apache Flink)<\/li>\n<li><strong>Google Cloud Dataflow<\/strong> (Apache Beam; not Flink but comparable managed streaming)<\/li>\n<li><strong>Azure Stream Analytics<\/strong> (SQL-like streaming; not Flink) and partner-managed Flink offerings<\/li>\n<li><strong>Confluent Cloud for Apache Flink<\/strong> (managed Flink SQL tied to the Confluent Kafka ecosystem)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Open-source\/self-managed alternatives<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apache Flink on Kubernetes<\/li>\n<li>Apache Spark Structured Streaming (different semantics and tradeoffs)<\/li>\n<li>Kafka Streams (library approach; limited for complex state\/time cases)<\/li>\n<\/ul>\n\n\n\n<h4 class=\"wp-block-heading\">Comparison table<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Alibaba Cloud Realtime Compute for Apache Flink<\/strong><\/td>\n<td>Managed Flink 
streaming on Alibaba Cloud<\/td>\n<td>Managed ops, Alibaba ecosystem integration, production lifecycle features<\/td>\n<td>Connector\/runtime constraints; pricing depends on resource model<\/td>\n<td>You want managed Flink near Alibaba Cloud data sources\/sinks<\/td>\n<\/tr>\n<tr>\n<td>Flink on Alibaba Cloud EMR<\/td>\n<td>Teams needing more cluster control<\/td>\n<td>More control over cluster configs; EMR ecosystem<\/td>\n<td>Higher ops effort than managed service<\/td>\n<td>You need custom cluster-level control or specific EMR integrations<\/td>\n<\/tr>\n<tr>\n<td>Self-managed Flink on ECS\/K8s<\/td>\n<td>Platform teams with strong ops maturity<\/td>\n<td>Full control, custom plugins<\/td>\n<td>Highest operational burden<\/td>\n<td>You require unsupported customizations or strict control<\/td>\n<\/tr>\n<tr>\n<td>MaxCompute (batch)<\/td>\n<td>Offline analytics and ETL<\/td>\n<td>Cost-effective for batch at scale<\/td>\n<td>Not real-time; higher latency<\/td>\n<td>Your use case is batch and latency isn\u2019t critical<\/td>\n<\/tr>\n<tr>\n<td>AWS Kinesis Data Analytics for Flink<\/td>\n<td>Managed Flink on AWS<\/td>\n<td>Tight AWS integration<\/td>\n<td>Not Alibaba Cloud; data gravity issues<\/td>\n<td>Your data and apps are primarily on AWS<\/td>\n<\/tr>\n<tr>\n<td>Google Cloud Dataflow<\/td>\n<td>Managed streaming with Beam<\/td>\n<td>Strong managed scaling; unified batch\/stream<\/td>\n<td>Different programming model; not Flink<\/td>\n<td>You prefer Beam and run on GCP<\/td>\n<\/tr>\n<tr>\n<td>Spark Structured Streaming<\/td>\n<td>Mixed workloads, Spark ecosystem<\/td>\n<td>Familiar to Spark shops; good ecosystem<\/td>\n<td>Different time\/state semantics; micro-batch patterns<\/td>\n<td>You\u2019re already standardized on Spark<\/td>\n<\/tr>\n<tr>\n<td>Kafka Streams<\/td>\n<td>App-embedded stream processing<\/td>\n<td>Simple deployment model<\/td>\n<td>Limited for complex event-time\/windowing use cases<\/td>\n<td>You want library-based processing inside 
services<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: Real-time fraud signals for a payments platform<\/h3>\n\n\n\n<p><strong>Problem<\/strong><br\/>\nA payments platform needs to detect suspicious patterns (rapid repeated transactions, device switching, geo anomalies) in seconds and provide risk signals to an authorization service.<\/p>\n\n\n\n<p><strong>Proposed architecture<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Producers publish transaction events to a Kafka-compatible service in Alibaba Cloud.<\/li>\n<li>Realtime Compute for Apache Flink:\n<ul>\n<li>Enriches transactions with user\/device history from a database or cache<\/li>\n<li>Computes rolling aggregates per account\/device<\/li>\n<li>Applies rules and anomaly scoring<\/li>\n<li>Emits risk signals to a low-latency sink (database\/cache\/topic)<\/li>\n<\/ul>\n<\/li>\n<li>OSS stores checkpoints\/savepoints.<\/li>\n<li>SLS\/CloudMonitor provide monitoring and alerting.<\/li>\n<li>RAM enforces strict separation between dev and prod.<\/li>\n<\/ul>\n\n\n\n<p><strong>Why this service was chosen<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Stateful stream processing with event-time support<\/li>\n<li>Managed operations for always-on pipelines<\/li>\n<li>Native alignment with Alibaba Cloud RAM\/networking\/observability<\/li>\n<\/ul>\n\n\n\n<p><strong>Expected outcomes<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reduced fraud losses and faster detection<\/li>\n<li>Better system resiliency through checkpoint-based recovery<\/li>\n<li>Lower operational overhead vs. self-managed clusters<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: Real-time product analytics and activation funnel<\/h3>\n\n\n\n<p><strong>Problem<\/strong><br\/>\nA SaaS startup wants near real-time activation funnel metrics (signup \u2192 onboarding steps \u2192 first key action) without building a large data platform.<\/p>\n\n\n\n<p><strong>Proposed architecture<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Application events flow into a streaming source.<\/li>\n<li>Realtime Compute for Apache Flink (SQL):\n<ul>\n<li>Parses and normalizes events<\/li>\n<li>Computes funnel step counts and user cohorts in time windows<\/li>\n<li>Outputs aggregates to an analytics database\/dashboard store<\/li>\n<\/ul>\n<\/li>\n<li>OSS for checkpoints; SLS for logs.<\/li>\n<\/ul>\n\n\n\n<p><strong>Why this service was chosen<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>SQL-based development reduces engineering time<\/li>\n<li>Managed runtime avoids Kubernetes\/Flink operations<\/li>\n<li>Scale up as the product grows<\/li>\n<\/ul>\n\n\n\n<p><strong>Expected outcomes<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Faster feedback loop for product changes (minutes instead of daily batch)<\/li>\n<li>Low operational burden for a small team<\/li>\n<li>Clear cost model tied to running compute resources<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<p>1) <strong>Is Realtime Compute for Apache Flink the same as open-source Apache Flink?<\/strong><br\/>\nIt runs Apache Flink workloads but adds a managed control plane and Alibaba Cloud integrations. Some open-source connectors\/features may not be available or may be packaged differently\u2014verify supported versions and connectors.<\/p>\n\n\n\n<p>2) <strong>Do I write SQL or Java?<\/strong><br\/>\nTypically both are supported: Flink SQL for many ETL\/analytics pipelines and Java\/Scala DataStream apps for advanced logic. Python support is runtime-dependent\u2014verify in official docs.<\/p>\n\n\n\n<p>3) <strong>Is it batch or streaming?<\/strong><br\/>\nPrimarily streaming (continuous) processing.
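To make the continuous, windowed nature of this processing concrete, the tumbling-window counting that a Flink SQL funnel or dashboard query expresses declaratively can be sketched in a few lines of plain Python (all names here are illustrative, not the Flink API):

```python
from collections import defaultdict

def tumbling_counts(events, window_s):
    """Count events per key in fixed event-time windows.

    Dependency-free sketch of the pattern a Flink SQL TUMBLE
    aggregation expresses declaratively; 'events' is a list of
    (epoch_seconds, key) pairs and 'window_s' the window length.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_s)   # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

# Funnel-style events: [0, 10) has two signups and one onboard; [10, 20) one signup.
clicks = [(0, "signup"), (3, "signup"), (7, "onboard"), (12, "signup")]
print(tumbling_counts(clicks, 10))
# {(0, 'signup'): 2, (0, 'onboard'): 1, (10, 'signup'): 1}
```

A real Flink job differs in that windows are keyed state updated per event and emitted once the watermark passes the window end, rather than being computed over an already-finished list.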
Some runtimes may support bounded\/batch execution patterns, but the primary design focus is real-time streams\u2014verify your runtime capabilities.<\/p>\n\n\n\n<p>4) <strong>How does fault tolerance work?<\/strong><br\/>\nThrough Flink checkpoints and restart strategies. State is periodically persisted (commonly to OSS), enabling recovery after failures.<\/p>\n\n\n\n<p>5) <strong>Does it guarantee exactly-once?<\/strong><br\/>\nFlink can provide strong guarantees, but end-to-end exactly-once depends on source\/sink connector semantics and configuration. Many real-world systems implement idempotency or transactional sinks.<\/p>\n\n\n\n<p>6) <strong>What do I need for checkpoint storage?<\/strong><br\/>\nUsually OSS (or another supported durable store). Configure permissions so the runtime can read\/write checkpoint paths.<\/p>\n\n\n\n<p>7) <strong>Can I run in a VPC only (no public Internet)?<\/strong><br\/>\nTypically yes, and it\u2019s recommended for production. You must configure VPC connectivity and private endpoints to sources\/sinks.<\/p>\n\n\n\n<p>8) <strong>How do I deploy updates safely?<\/strong><br\/>\nUse savepoints and controlled redeployments. Test upgrades in staging. Follow the managed service\u2019s recommended upgrade workflow.<\/p>\n\n\n\n<p>9) <strong>How do I monitor lag and latency?<\/strong><br\/>\nUse built-in job metrics and integrate with CloudMonitor\/SLS. Monitor consumer lag at the source level (Kafka) and end-to-end latency via timestamps.<\/p>\n\n\n\n<p>10) <strong>What is the smallest setup for learning?<\/strong><br\/>\nA single small SQL job with a synthetic source and debug sink, short runtime, OSS for checkpoints, minimal log retention.<\/p>\n\n\n\n<p>11) <strong>Can multiple teams share the service?<\/strong><br\/>\nYes, typically via multiple workspaces\/projects\/namespaces and RAM policies.
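One simple form such sharing can take is an admission check against per-team compute quotas before a deployment is accepted; everything below (team names, CU numbers, the `can_deploy` helper) is a hypothetical sketch, not an Alibaba Cloud API:

```python
# Toy admission check for per-team compute quotas in a shared workspace.
# Quotas and team names are invented for illustration.
TEAM_QUOTA_CUS = {"payments": 16, "analytics": 8}

def can_deploy(team, requested_cus, running_cus):
    """Reject a deployment that would push a team past its CU quota."""
    quota = TEAM_QUOTA_CUS.get(team, 0)          # unknown teams get nothing
    return running_cus.get(team, 0) + requested_cus <= quota

running = {"payments": 12}
print(can_deploy("payments", 4, running))   # True: 12 + 4 <= 16
print(can_deploy("payments", 8, running))   # False: would exceed the quota
print(can_deploy("ml", 1, running))         # False: no quota configured
```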
Implement quotas and guardrails to prevent noisy neighbors.<\/p>\n\n\n\n<p>12) <strong>How do I handle schema evolution?<\/strong><br\/>\nPlan schema changes carefully. Use compatible changes where possible. For stateful jobs, ensure state schema compatibility and use savepoints\/migration strategies.<\/p>\n\n\n\n<p>13) <strong>What happens if my sink is slow?<\/strong><br\/>\nBackpressure propagates upstream; throughput drops and latency increases. Scale sink capacity, optimize sink writes, and adjust parallelism.<\/p>\n\n\n\n<p>14) <strong>Can I use custom connectors or JARs?<\/strong><br\/>\nSome managed services allow uploading custom JARs (UDFs\/connectors). The packaging and approval constraints vary\u2014verify the current extension mechanism in official docs.<\/p>\n\n\n\n<p>15) <strong>How do I estimate cost?<\/strong><br\/>\nModel compute-hours (resources \u00d7 time), plus OSS checkpoint storage and log ingestion. Validate the exact billing meters in your region\u2019s pricing page\/console.<\/p>\n\n\n\n<p>16) <strong>Is there a recommended dev\/staging\/prod setup?<\/strong><br\/>\nYes: separate environments (workspaces\/projects), separate OSS buckets\/prefixes, separate roles\/policies, smaller staging resources, and CI\/CD-driven deployments where possible.<\/p>\n\n\n\n<p>17) <strong>How do I troubleshoot checkpoint failures?<\/strong><br\/>\nStart with logs + checkpoint metrics. Common causes: insufficient resources, slow sinks, OSS permission\/network issues, or state too large.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Realtime Compute for Apache Flink<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Alibaba Cloud Help Center \u2013 Realtime Compute for Apache Flink<\/td>\n<td>Primary source for current features, concepts, connector lists, and region\/runtime specifics: https:\/\/www.alibabacloud.com\/help\/en\/realtime-compute-for-apache-flink<\/td>\n<\/tr>\n<tr>\n<td>Official product page<\/td>\n<td>Realtime Compute for Apache Flink product page<\/td>\n<td>High-level positioning, entry points to pricing and trials (verify region): https:\/\/www.alibabacloud.com\/product\/realtime-compute-for-apache-flink<\/td>\n<\/tr>\n<tr>\n<td>Official CLI docs<\/td>\n<td>Alibaba Cloud CLI documentation<\/td>\n<td>Helpful for automating OSS\/VPC\/SLS setup around Flink pipelines: https:\/\/www.alibabacloud.com\/help\/en\/alibaba-cloud-cli\/latest\/what-is-alibaba-cloud-cli<\/td>\n<\/tr>\n<tr>\n<td>Apache Flink docs<\/td>\n<td>Apache Flink Documentation<\/td>\n<td>Core Flink concepts (event time, checkpoints, state, SQL): https:\/\/flink.apache.org\/<\/td>\n<\/tr>\n<tr>\n<td>Release notes (verify)<\/td>\n<td>Service release notes \/ product updates<\/td>\n<td>Understand runtime upgrades, connector changes, deprecations (find within official docs for your region)<\/td>\n<\/tr>\n<tr>\n<td>Architecture references (verify)<\/td>\n<td>Alibaba Cloud Architecture Center<\/td>\n<td>Reference architectures and best practices; search for Flink\/streaming patterns: https:\/\/www.alibabacloud.com\/architecture<\/td>\n<\/tr>\n<tr>\n<td>Videos\/webinars (verify)<\/td>\n<td>Alibaba Cloud Tech content \/ webinars<\/td>\n<td>Useful for demos and operational guidance (availability varies; check Alibaba Cloud official channels)<\/td>\n<\/tr>\n<tr>\n<td>Samples (verify)<\/td>\n<td>Official or highly trusted GitHub 
examples<\/td>\n<td>Accelerates learning with runnable SQL\/app patterns; verify compatibility with managed runtime<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, data\/platform teams, architects<\/td>\n<td>Cloud + DevOps practices; may include streaming and operations fundamentals<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate engineers<\/td>\n<td>DevOps\/SCM foundations, tooling practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CloudOpsNow.in<\/td>\n<td>Cloud operations teams<\/td>\n<td>Cloud operations, monitoring, reliability practices<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, platform engineers<\/td>\n<td>Reliability engineering, incident response, observability<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams adopting AIOps<\/td>\n<td>Monitoring automation and ops analytics concepts<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">19.
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content (verify offerings)<\/td>\n<td>Engineers seeking hands-on mentoring<\/td>\n<td>https:\/\/www.rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training and mentoring (verify offerings)<\/td>\n<td>Beginners to intermediate DevOps practitioners<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps services\/training platform (verify)<\/td>\n<td>Teams needing short-term coaching\/support<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support\/training (verify offerings)<\/td>\n<td>Ops teams looking for practical support<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps\/engineering consulting (verify exact focus)<\/td>\n<td>Architecture reviews, platform setup, operations<\/td>\n<td>Designing streaming platform guardrails; observability and cost governance<\/td>\n<td>https:\/\/www.cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps and cloud consulting\/training (verify)<\/td>\n<td>Delivery enablement, DevOps process, tooling adoption<\/td>\n<td>CI\/CD for Flink jobs; operational runbooks and SRE practices<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting services (verify)<\/td>\n<td>DevOps transformation, reliability and automation<\/td>\n<td>Monitoring\/alerting setup; infrastructure automation for data platforms<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Streaming basics: topics\/partitions, ordering, consumer groups (Kafka concepts)<\/li>\n<li>Data modeling for events: schemas, evolution, serialization (JSON\/Avro\/Protobuf)<\/li>\n<li>SQL fundamentals: aggregations, joins, windowing<\/li>\n<li>Cloud fundamentals on Alibaba Cloud:\n<ul>\n<li>RAM (users, roles, policies)<\/li>\n<li>VPC networking basics<\/li>\n<li>OSS basics<\/li>\n<li>SLS and CloudMonitor basics<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Advanced Flink:\n<ul>\n<li>State management, timers, process functions<\/li>\n<li>Checkpoint tuning and state backends (conceptually; the managed service abstracts some details)<\/li>\n<li>Exactly-once design patterns<\/li>\n<\/ul>\n<\/li>\n<li>Data governance:\n<ul>\n<li>Data lineage, cataloging, access controls<\/li>\n<\/ul>\n<\/li>\n<li>CI\/CD for streaming:\n<ul>\n<li>Versioning jobs, automated tests, canary deployments<\/li>\n<\/ul>\n<\/li>\n<li>Performance engineering:\n<ul>\n<li>Load testing streaming pipelines, sink capacity planning<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Streaming Data Engineer<\/li>\n<li>Data Platform Engineer<\/li>\n<li>Cloud Solutions Architect (analytics)<\/li>\n<li>DevOps\/SRE supporting data systems<\/li>\n<li>Fraud\/Detection Engineer (stream processing heavy)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p>Alibaba Cloud certification offerings change over time.
If there is a certification explicitly covering streaming analytics or Flink on Alibaba Cloud, follow the current Alibaba Cloud certification portal and official learning paths (verify).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Clickstream sessionization with event-time windows and late-event handling.<\/li>\n<li>Real-time anomaly detection on IoT data with rolling statistics.<\/li>\n<li>CDC \u2192 upsert pipeline into an analytics store (connector-dependent\u2014verify).<\/li>\n<li>Real-time feature store pipeline with TTL-managed state.<\/li>\n<li>Multi-tenant streaming platform design: namespaces, quotas, RAM policies, tagging.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Apache Flink<\/strong>: Open-source framework for stateful stream processing and batch processing.<\/li>\n<li><strong>Flink SQL<\/strong>: SQL interface to define tables, sources\/sinks, and streaming transformations.<\/li>\n<li><strong>JobManager \/ TaskManager<\/strong>: Core Flink components for coordination and distributed execution.<\/li>\n<li><strong>Checkpoint<\/strong>: Periodic snapshot of state for fault tolerance.<\/li>\n<li><strong>Savepoint<\/strong>: Manually triggered, versioned snapshot used for upgrades\/migrations.<\/li>\n<li><strong>State<\/strong>: Data kept by operators across events (e.g., per-user counters).<\/li>\n<li><strong>Event time<\/strong>: Time when an event actually occurred (vs processing time).<\/li>\n<li><strong>Watermark<\/strong>: A mechanism to track event-time progress and handle out-of-order events.<\/li>\n<li><strong>Backpressure<\/strong>: When downstream operators\/sinks cannot keep up, slowing upstream processing.<\/li>\n<li><strong>Parallelism<\/strong>: Degree of concurrent processing (number of subtasks).<\/li>\n<li><strong>Connector<\/strong>: Integration module that reads from or writes to an
external system.<\/li>\n<li><strong>RAM<\/strong>: Alibaba Cloud Resource Access Management (IAM).<\/li>\n<li><strong>VPC<\/strong>: Virtual Private Cloud networking environment.<\/li>\n<li><strong>OSS<\/strong>: Object Storage Service used for durable storage (often checkpoints).<\/li>\n<li><strong>SLS<\/strong>: Log Service used for log collection\/search\/analysis.<\/li>\n<li><strong>CloudMonitor<\/strong>: Alibaba Cloud monitoring and alerting service.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Realtime Compute for Apache Flink is Alibaba Cloud\u2019s managed stream processing service in the <strong>Analytics Computing<\/strong> category, designed to run Apache Flink jobs with reduced operational overhead. It matters because it enables reliable, low-latency, stateful analytics and event processing\u2014powering dashboards, fraud detection, operational alerting, and real-time ML features\u2014while integrating with Alibaba Cloud RAM, VPC networking, OSS-based durability, and monitoring\/logging.<\/p>\n\n\n\n<p>Cost and security are primarily driven by always-on compute resources, state\/checkpoint storage in OSS, and logs\/metrics volume, plus the need to correctly scope RAM permissions and keep pipelines private in VPC.
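Those cost drivers can be turned into a back-of-envelope monthly estimate; all unit prices below are placeholders, so substitute the rates from your region's pricing page:

```python
# Back-of-envelope monthly cost model for an always-on streaming job.
# All unit prices are placeholders, not Alibaba Cloud list prices.
def monthly_streaming_cost(cu_count, price_per_cu_hour,
                           checkpoint_gb, oss_price_per_gb_month,
                           log_gb, log_price_per_gb):
    compute = cu_count * price_per_cu_hour * 24 * 30   # runs 24/7
    storage = checkpoint_gb * oss_price_per_gb_month   # checkpoint state in OSS
    logs = log_gb * log_price_per_gb                   # ingested logs/metrics
    return round(compute + storage + logs, 2)

# 4 CUs at $0.10/CU-hour, 50 GB of checkpoints, 100 GB of logs per month.
print(monthly_streaming_cost(4, 0.10, 50, 0.02, 100, 0.05))
# 294.0
```

Note that the always-on compute term usually dominates, which is why right-sizing parallelism matters more than storage tuning.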
Use it when you need continuous streaming with strong state\/time semantics and want a managed runtime close to Alibaba Cloud data services; avoid it for purely batch workloads or when you require custom unsupported runtime components.<\/p>\n\n\n\n<p>Next step: review the official documentation for your region and runtime version, then extend the lab by connecting to a real source (Kafka-compatible) and a durable sink (database\/OLAP), adding alerting for lag and checkpoint failures.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Analytics Computing<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,4],"tags":[],"class_list":["post-83","post","type-post","status-publish","format-standard","hentry","category-alibaba-cloud","category-analytics-computing"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/83","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=83"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/83\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=83"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=83"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=83"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}
","templated":true}]}}