{"id":378,"date":"2026-04-13T20:44:22","date_gmt":"2026-04-13T20:44:22","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/azure-data-factory-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics\/"},"modified":"2026-04-13T20:44:22","modified_gmt":"2026-04-13T20:44:22","slug":"azure-data-factory-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/azure-data-factory-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics\/","title":{"rendered":"Azure Data Factory Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Analytics"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p>Analytics<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p>Azure Data Factory is Azure\u2019s managed cloud service for building <strong>data integration<\/strong> and <strong>data orchestration<\/strong> workflows. It helps you move data between systems, transform it, and schedule\/monitor the entire process\u2014without having to build and operate your own ETL infrastructure.<\/p>\n\n\n\n<p>In simple terms: <strong>Azure Data Factory is the \u201cpipeline service\u201d for analytics<\/strong>. You define where data comes from (sources), where it goes (sinks), what processing should occur in between (transformations), and when it should run (triggers). Azure Data Factory then executes those workflows reliably and at scale.<\/p>\n\n\n\n<p>Technically, Azure Data Factory is a control-plane service for defining pipelines, plus a runtime layer called <strong>Integration Runtime<\/strong> that performs the actual compute for data movement and certain transformations. 
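<\/p>

<p>This control-plane\/runtime split can be made concrete with a small sketch. The Python below is purely illustrative and is <em>not<\/em> the Azure SDK: the pipeline definition is plain data (what the control plane stores), while a separate <code>execute<\/code> function plays the role of an Integration Runtime, running each activity with a simple retry budget, much as ADF dispatches Copy or Data Flow work to an IR.<\/p>

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Activity:
    """One step in a pipeline (e.g. a copy or transform); hypothetical model."""
    name: str
    run: Callable[[], str]  # stands in for the work an Integration Runtime performs
    retries: int = 2

@dataclass
class Pipeline:
    """Control-plane definition: just data describing the workflow."""
    name: str
    activities: List[Activity] = field(default_factory=list)

def execute(pipeline: Pipeline) -> Dict[str, str]:
    """Toy runtime layer: run activities in order, retrying failures."""
    results: Dict[str, str] = {}
    for act in pipeline.activities:
        attempts = 0
        while True:
            try:
                results[act.name] = act.run()
                break
            except Exception as exc:
                attempts += 1
                if attempts > act.retries:
                    results[act.name] = "Failed: " + str(exc)
                    break
    return results

# A two-step workflow: "copy" lands a file, then "transform" depends on it.
staged: List[str] = []
nightly = Pipeline("nightly_ingest", [
    Activity("CopyOrders", lambda: (staged.append("orders.parquet"), "Succeeded")[1]),
    Activity("CleanOrders", lambda: "Succeeded" if staged else "Skipped"),
])
print(execute(nightly))  # {'CopyOrders': 'Succeeded', 'CleanOrders': 'Succeeded'}
```

<p>The point of the sketch is the separation of concerns: definitions are declarative and versionable, while execution is pluggable, which is why ADF can swap between Azure, self-hosted, and SSIS runtimes without changing pipeline logic.<\/p>

<p>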
It integrates with many Azure and non-Azure data stores via built-in connectors, supports orchestration patterns (dependencies, retries, branching), and provides monitoring and operational tooling in Azure Data Factory Studio.<\/p>\n\n\n\n<p>The problem it solves: as data grows across SaaS apps, databases, files, and cloud platforms, teams need a secure and maintainable way to <strong>ingest, copy, and orchestrate data workflows<\/strong> for analytics and reporting\u2014without stitching together scripts and cron jobs.<\/p>\n\n\n\n<blockquote>\n<p>Service naming note (important): <strong>Azure Data Factory<\/strong> is an active Azure service. Microsoft also offers <strong>Azure Synapse Analytics pipelines<\/strong>, which share similar pipeline concepts and a related user experience. This tutorial is specifically for <strong>Azure Data Factory<\/strong>.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">2. What is Azure Data Factory?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Official purpose<\/h3>\n\n\n\n<p>Azure Data Factory\u2019s purpose is to provide a <strong>managed cloud ETL\/ELT and data orchestration service<\/strong> that enables you to:\n&#8211; Connect to diverse data sources\n&#8211; Move and transform data\n&#8211; Orchestrate end-to-end data workflows\n&#8211; Monitor, manage, and operationalize those workflows<\/p>\n\n\n\n<p>Official docs: https:\/\/learn.microsoft.com\/azure\/data-factory\/introduction<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Core capabilities<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Data movement (Copy Activity):<\/strong> Copy data between supported data stores using optimized connectors.<\/li>\n<li><strong>Data transformation:<\/strong> Use <strong>Mapping Data Flows<\/strong> (Spark-based, visual) and\/or invoke external compute (Databricks, HDInsight, Azure Functions, Stored Procedures, Synapse, etc.).<\/li>\n<li><strong>Orchestration:<\/strong> Schedule and control workflow execution with 
dependencies, branching, parameters, variables, looping, retries, and failure handling.<\/li>\n<li><strong>Hybrid integration:<\/strong> Use <strong>Self-hosted Integration Runtime<\/strong> to reach on-premises networks and private endpoints.<\/li>\n<li><strong>Operational tooling:<\/strong> Monitoring views, activity run details, alerts\/diagnostics via Azure Monitor, and CI\/CD via Git integration.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Major components (conceptual model)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Factory:<\/strong> The top-level Azure Data Factory resource.<\/li>\n<li><strong>Pipelines:<\/strong> Logical containers for workflow steps.<\/li>\n<li><strong>Activities:<\/strong> Individual steps inside pipelines (Copy, Data Flow, Lookup, ForEach, Web, etc.).<\/li>\n<li><strong>Datasets:<\/strong> Named references to data structures\/locations used by activities.<\/li>\n<li><strong>Linked services:<\/strong> Connection definitions (to storage, databases, compute, Key Vault, etc.).<\/li>\n<li><strong>Integration runtime (IR):<\/strong> The compute infrastructure that executes data movement and some activities.<\/li>\n<li><strong>Triggers:<\/strong> Schedule\/event\/tumbling window triggers that start pipelines.<\/li>\n<li><strong>Monitoring:<\/strong> Runs, logs, metrics, alerts, and diagnostics.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Service type<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Managed cloud service (PaaS)<\/strong> for designing and orchestrating data integration pipelines.<\/li>\n<li>Uses <strong>serverless orchestration<\/strong> concepts plus configurable runtime options (Azure IR, Self-hosted IR, and Azure-SSIS IR).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scope and deployment model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Data Factory is an <strong>Azure resource<\/strong> created in a <strong>subscription<\/strong>, within a <strong>resource group<\/strong>, 
and in a <strong>region<\/strong>.<\/li>\n<li>The <strong>orchestration\/control plane<\/strong> is managed by Azure.<\/li>\n<li>The <strong>execution<\/strong> happens via Integration Runtime:\n<ul>\n<li><strong>Azure Integration Runtime<\/strong> (managed by Azure, runs in Azure)<\/li>\n<li><strong>Self-hosted Integration Runtime<\/strong> (runs on your VM\/on-prem)<\/li>\n<li><strong>Azure-SSIS Integration Runtime<\/strong> (for SSIS package execution in Azure)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p>Regional specifics and availability can change; verify in official docs for your region and requirements.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">How it fits into the Azure ecosystem<\/h3>\n\n\n\n<p>Azure Data Factory typically sits at the center of an analytics platform:\n&#8211; Ingests from: Azure Storage, Azure SQL, SQL Server, Oracle, SAP (connector-dependent), SaaS sources, REST APIs, SFTP, etc.\n&#8211; Lands into: Azure Data Lake Storage (ADLS), Azure Blob Storage, Azure SQL, Azure Synapse Analytics, Microsoft Fabric (integration patterns vary\u2014verify), and other stores.\n&#8211; Transforms via: Mapping Data Flows, Azure Databricks, Synapse Spark\/SQL, stored procedures, and more.\n&#8211; Is governed\/secured with: Microsoft Entra ID (Azure AD), Managed Identities, Azure Key Vault, Private Link, Azure Monitor, Microsoft Purview (integration depends on configuration\u2014verify in docs).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Azure Data Factory?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster time-to-insights:<\/strong> Build repeatable ingestion pipelines for analytics and reporting.<\/li>\n<li><strong>Lower operational overhead:<\/strong> Managed service reduces the need to operate custom ETL servers and schedulers.<\/li>\n<li><strong>Standardization:<\/strong> A shared integration layer reduces \u201cone-off scripts\u201d and manual processes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Connector ecosystem:<\/strong> Large set of supported data stores and protocols.<\/li>\n<li><strong>Hybrid reach:<\/strong> Self-hosted Integration Runtime supports on-prem and private network connectivity.<\/li>\n<li><strong>Orchestration patterns:<\/strong> Robust control flow for dependency handling, retries, branching, and parameterized workflows.<\/li>\n<li><strong>Separation of concerns:<\/strong> Linked services\/datasets\/pipelines encourage reusable, maintainable designs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Monitoring:<\/strong> Run history, activity-level diagnostics, and integration with Azure Monitor.<\/li>\n<li><strong>CI\/CD support:<\/strong> Git integration and deployment patterns (e.g., ARM template-based) help promote changes across environments.<\/li>\n<li><strong>Centralized governance:<\/strong> Naming\/tagging and RBAC can be standardized across a team.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Identity-first access:<\/strong> Support for Managed Identity and Microsoft Entra ID authentication patterns (connector-dependent).<\/li>\n<li><strong>Secret management:<\/strong> Integrates with Azure Key Vault for storing 
credentials.<\/li>\n<li><strong>Network controls:<\/strong> Private endpoints and managed virtual network options (availability is connector\/feature dependent\u2014verify for your scenario).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability\/performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Elastic data movement:<\/strong> Scale characteristics depend on the Integration Runtime type and activity configuration.<\/li>\n<li><strong>Parallelism:<\/strong> Pipelines can run activities in parallel, and Copy Activity supports parallel copy patterns (source\/sink dependent).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose it<\/h3>\n\n\n\n<p>Choose Azure Data Factory when you need:\n&#8211; Repeatable and observable data ingestion and orchestration\n&#8211; Broad connector support\n&#8211; Hybrid connectivity to on-prem\/private networks\n&#8211; A managed service with enterprise security and monitoring integration<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<p>Azure Data Factory may not be the best fit when:\n&#8211; You need <strong>true streaming\/event processing<\/strong> (consider Azure Stream Analytics, Event Hubs + processing, or Spark streaming).\n&#8211; You need a <strong>full analytical warehouse\/lakehouse<\/strong> service (ADF orchestrates; it doesn\u2019t replace Synapse\/Fabric\/Databricks storage &amp; compute layers).\n&#8211; You want a code-native orchestration tool with heavy custom logic (Airflow\/Dagster\/Prefect may be a better fit depending on your platform).\n&#8211; You need near-zero latency transformation (ADF is primarily batch-oriented).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Azure Data Factory used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Retail and e-commerce (sales, inventory, customer analytics)<\/li>\n<li>Finance and insurance (risk reporting, reconciliation, regulatory reporting)<\/li>\n<li>Healthcare and life sciences (claims data, operational analytics\u2014ensure compliance needs are met)<\/li>\n<li>Manufacturing and IoT (batch ingestion from plants, ERP integration)<\/li>\n<li>Media and gaming (content analytics, user behavior data)<\/li>\n<li>Public sector (data consolidation across departments)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data engineering teams building analytics platforms<\/li>\n<li>Platform\/Cloud engineering teams standardizing ingestion tooling<\/li>\n<li>BI teams coordinating ingestion into reporting stores<\/li>\n<li>DevOps\/SRE teams operating data pipelines with reliability\/observability<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Batch ingestion into data lakes\/warehouses<\/li>\n<li>Daily\/hourly incremental loads from operational databases<\/li>\n<li>Periodic extracts from SaaS systems<\/li>\n<li>File-based ingestion from SFTP\/partners<\/li>\n<li>Orchestration of multi-step data workflows across several services<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Lake-centric: land raw \u2192 curate \u2192 serve<\/li>\n<li>Warehouse-centric: ingest \u2192 stage \u2192 transform \u2192 publish<\/li>\n<li>Hybrid: on-prem + cloud integration with private networking<\/li>\n<li>Multi-environment: dev\/test\/prod with Git + CI\/CD patterns<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production:<\/strong> multiple pipelines, strict IAM, private 
endpoints\/self-hosted IR, Key Vault integration, alerting, and runbook-based operations.<\/li>\n<li><strong>Dev\/test:<\/strong> smaller datasets, simplified networking, often fewer governance constraints, but still benefits from Git integration and parameterization.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p>Below are realistic scenarios where Azure Data Factory is commonly used.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">1) Copy from on-prem SQL Server to Azure Data Lake<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Operational data is locked in an on-prem database; analytics needs it in the cloud.<\/li>\n<li><strong>Why ADF fits:<\/strong> Self-hosted Integration Runtime can securely access on-prem SQL Server and land files into ADLS\/Blob.<\/li>\n<li><strong>Example:<\/strong> Nightly copy of \u201cOrders\u201d tables to a data lake as Parquet\/CSV for downstream analytics.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) ELT orchestration for Azure Synapse Analytics<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Multiple dependent steps must load staging tables, then execute transformations.<\/li>\n<li><strong>Why ADF fits:<\/strong> Pipelines orchestrate Copy Activities and Stored Procedure activities with dependencies and retries.<\/li>\n<li><strong>Example:<\/strong> Load raw files to staging, then run SQL stored procedures to populate dimensional models.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Ingest SaaS data (REST API) into Azure Storage<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> SaaS platforms expose REST APIs with rate limits and paging.<\/li>\n<li><strong>Why ADF fits:<\/strong> REST connector + pipeline control flow (Until\/ForEach) can orchestrate pagination and incremental loads.<\/li>\n<li><strong>Example:<\/strong> Pull daily CRM changes and store as JSON in a raw 
zone.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Partner file ingestion over SFTP<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> External partners drop files to SFTP; you must validate, archive, and load.<\/li>\n<li><strong>Why ADF fits:<\/strong> SFTP connector + Copy Activity + pipeline branching for validation.<\/li>\n<li><strong>Example:<\/strong> Copy inbound CSV to landing, move to archive, and load to curated zone if schema checks pass.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Metadata-driven ingestion framework<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Dozens\/hundreds of tables must be ingested with consistent patterns.<\/li>\n<li><strong>Why ADF fits:<\/strong> Parameterized pipelines + Lookup + ForEach support metadata-driven ingestion.<\/li>\n<li><strong>Example:<\/strong> Configuration table lists sources, table names, and sink paths; one pipeline loops through and ingests all.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Orchestrate Azure Databricks jobs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Transformations require Spark code and libraries; orchestration must be centralized.<\/li>\n<li><strong>Why ADF fits:<\/strong> Databricks activity can run notebooks\/jobs with parameters and dependency control.<\/li>\n<li><strong>Example:<\/strong> Copy raw data to lake, then trigger a Databricks notebook to clean and aggregate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Run SSIS packages in the cloud (lift-and-shift)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Existing SSIS packages must be moved off on-prem servers.<\/li>\n<li><strong>Why ADF fits:<\/strong> Azure-SSIS Integration Runtime executes SSIS packages in Azure.<\/li>\n<li><strong>Example:<\/strong> Migrate an existing SSIS-based EDW load to Azure without full rewrite.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) 
Incremental ingestion using watermark columns<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Full loads are expensive; only new\/changed rows should be ingested.<\/li>\n<li><strong>Why ADF fits:<\/strong> Lookup last watermark, query source with parameter, update watermark upon success.<\/li>\n<li><strong>Example:<\/strong> Load rows where ModifiedDate &gt; last_run_time.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) Data movement between Azure regions\/accounts with governance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Business units have separate subscriptions; data sharing must be controlled.<\/li>\n<li><strong>Why ADF fits:<\/strong> Central orchestration with managed identities\/RBAC, consistent monitoring, and auditing.<\/li>\n<li><strong>Example:<\/strong> Daily copy of curated datasets from a central lake to a departmental lake.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">10) Orchestrate multi-step file processing (validate \u2192 transform \u2192 publish)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Files must pass checks before being published.<\/li>\n<li><strong>Why ADF fits:<\/strong> Control flow activities handle branching and failure paths.<\/li>\n<li><strong>Example:<\/strong> Validate schema\/row count, copy to \u201ccurated\u201d container, trigger downstream refresh.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">11) Event-driven ingestion (where applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> You want ingestion to start when a file arrives.<\/li>\n<li><strong>Why ADF fits:<\/strong> Event-based triggers can start pipelines when storage events occur (verify supported trigger types and constraints).<\/li>\n<li><strong>Example:<\/strong> Start pipeline when a blob is created in a landing container.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">12) Centralized scheduling replacement for cron + 
scripts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Sprawling scripts across VMs lack observability and standardization.<\/li>\n<li><strong>Why ADF fits:<\/strong> Managed scheduling, retries, monitoring, RBAC, and centralized operations.<\/li>\n<li><strong>Example:<\/strong> Replace nightly Python scripts with ADF pipelines that call Functions\/Databricks as needed.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<p>This section focuses on <strong>current, commonly used<\/strong> Azure Data Factory capabilities. Some features vary by connector, runtime, and region\u2014verify for your exact combination.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6.1 Pipelines (workflow orchestration)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Defines a workflow of activities with control flow (sequence, parallel, conditions).<\/li>\n<li><strong>Why it matters:<\/strong> Orchestrates end-to-end ingestion reliably, not just individual copy jobs.<\/li>\n<li><strong>Practical benefit:<\/strong> Centralizes scheduling, error handling, and dependencies.<\/li>\n<li><strong>Caveats:<\/strong> Complex pipelines can become hard to maintain without modularization and naming standards.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.2 Activities (units of work)<\/h3>\n\n\n\n<p>Common activity categories include:\n&#8211; <strong>Data movement:<\/strong> Copy Activity\n&#8211; <strong>Transform:<\/strong> Mapping Data Flow, Databricks, HDInsight, stored procedures, etc.\n&#8211; <strong>Control flow:<\/strong> If Condition, Switch, ForEach, Until, Wait, Fail\n&#8211; <strong>Utility:<\/strong> Lookup, Get Metadata, Web, Azure Function\n&#8211; <strong>What it does:<\/strong> Executes each step in the pipeline.\n&#8211; <strong>Why it matters:<\/strong> Lets you combine data movement, transformation, and operational logic.\n&#8211; <strong>Caveats:<\/strong> External compute activities 
depend on the target service\u2019s availability and quotas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">6.3 Linked Services (connections)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Stores connection info to data stores and compute resources.<\/li>\n<li><strong>Why it matters:<\/strong> Reuse connections across datasets and pipelines; enable environment parameterization.<\/li>\n<li><strong>Practical benefit:<\/strong> Central place to configure auth (Managed Identity, Key Vault, etc.).<\/li>\n<li><strong>Caveats:<\/strong> Not all connectors support all auth methods; verify connector documentation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.4 Datasets (data structure references)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Represents data within a store (table, file path, folder, etc.).<\/li>\n<li><strong>Why it matters:<\/strong> Separates data location\/schema from pipeline logic.<\/li>\n<li><strong>Practical benefit:<\/strong> Reuse the same dataset across multiple pipelines.<\/li>\n<li><strong>Caveats:<\/strong> Over-modeling datasets can add management overhead; metadata-driven patterns can reduce dataset sprawl.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.5 Integration Runtime (IR)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Provides the compute and network bridge that enables data movement and activity execution.<\/li>\n<li><strong>Why it matters:<\/strong> Determines connectivity (public\/private\/on-prem), performance, and sometimes cost.<\/li>\n<li><strong>Types and caveats:<\/strong><\/li>\n<li><strong>Azure IR:<\/strong> Managed, easiest for Azure-to-Azure and public endpoints.<\/li>\n<li><strong>Self-hosted IR:<\/strong> Required for on-prem\/private network sources; you manage the host VM(s) and patching.<\/li>\n<li><strong>Azure-SSIS IR:<\/strong> Specialized for SSIS; cost and management differ 
significantly.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.6 Copy Activity (bulk data movement)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Copies data from source to sink with format conversion options and performance features.<\/li>\n<li><strong>Why it matters:<\/strong> This is the core ingestion engine for many data platforms.<\/li>\n<li><strong>Practical benefit:<\/strong> Handles many connectors; supports parallel copy and partitioning patterns (source\/sink dependent).<\/li>\n<li><strong>Caveats:<\/strong> Throughput depends on IR type, source\/sink limits, network, and configuration; some sources throttle.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.7 Mapping Data Flows (visual transformations)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Visual, Spark-based transformations (joins, derives, aggregates, schema drift, etc.).<\/li>\n<li><strong>Why it matters:<\/strong> Enables transformations without hand-writing Spark code.<\/li>\n<li><strong>Practical benefit:<\/strong> Unified UI, reusable transformation logic, and integration with pipelines.<\/li>\n<li><strong>Caveats:<\/strong> Data Flows use a Spark cluster behind the scenes and can become a significant cost driver. Validate performance and cost. 
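<\/li>
<\/ul>

<p>Before committing to a Data Flow design, it helps to estimate the spend with a back-of-envelope calculation. The Python below is a simplified sketch: the function name and inputs are hypothetical, the example rate is a placeholder rather than a quoted price, and real billing also depends on cluster size options, TTL, and region, so check the current Azure pricing page.<\/p>

```python
def estimate_dataflow_monthly_cost(runs_per_day: int,
                                   minutes_per_run: float,
                                   vcores: int,
                                   price_per_vcore_hour: float) -> float:
    """Rough monthly Data Flow cost: vCore-hours times an assumed unit price.

    Billing details (per-minute rounding, cluster warm-up, TTL) are
    deliberately simplified away; plug in numbers from your own runs.
    """
    hours_per_month = runs_per_day * (minutes_per_run / 60) * 30
    return round(hours_per_month * vcores * price_per_vcore_hour, 2)

# Hypothetical scenario: 24 hourly runs of 10 minutes on an 8-vCore cluster
# at an assumed 0.25 per vCore-hour (placeholder rate, not a quoted price).
print(estimate_dataflow_monthly_cost(24, 10, 8, 0.25))  # 240.0
```

<p>Running the same numbers for a few cluster sizes and schedules quickly shows whether a Data Flow, a SQL stored procedure, or a Databricks job is the cheaper home for a given transformation.<\/p>

<ul class=\"wp-block-list\">
<li>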
Some transformations can be easier\/cheaper in SQL engines or Databricks.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.8 Triggers (scheduling and automation)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Starts pipelines on schedules, events, or tumbling windows (depending on support and configuration).<\/li>\n<li><strong>Why it matters:<\/strong> Enables production automation and repeatability.<\/li>\n<li><strong>Practical benefit:<\/strong> Replace ad hoc scheduling and manual runs.<\/li>\n<li><strong>Caveats:<\/strong> Trigger semantics (especially windowing) require careful design to avoid duplicate processing.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.9 Parameterization (reusable pipelines)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Pass parameters into pipelines, datasets, linked services (pattern-dependent), and activities.<\/li>\n<li><strong>Why it matters:<\/strong> Enables multi-environment and multi-table patterns without duplicating pipelines.<\/li>\n<li><strong>Practical benefit:<\/strong> One ingestion pipeline can handle many tables by reading metadata.<\/li>\n<li><strong>Caveats:<\/strong> Too many parameters can reduce readability; enforce conventions and documentation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.10 Monitoring and operational management<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Provides pipeline run history, activity details, integration runtime monitoring, and alerts via Azure Monitor (with diagnostic settings).<\/li>\n<li><strong>Why it matters:<\/strong> Production pipelines need observability and incident response workflows.<\/li>\n<li><strong>Practical benefit:<\/strong> Faster triage with run-level metrics and logs.<\/li>\n<li><strong>Caveats:<\/strong> Long retention and verbose diagnostics can increase Log Analytics costs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.11 Git 
integration and CI\/CD (DevOps)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Integrates with Git for source control and collaboration; supports deployment patterns to other environments.<\/li>\n<li><strong>Why it matters:<\/strong> Enables repeatable releases and reduces configuration drift.<\/li>\n<li><strong>Practical benefit:<\/strong> Pull requests, code review, history, and environment promotion.<\/li>\n<li><strong>Caveats:<\/strong> Deployment approach varies (ARM templates and other patterns). Verify the current recommended deployment method in Microsoft docs for your stack.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.12 Managed identity and Key Vault integration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Use system-assigned\/user-assigned managed identity for auth; store secrets in Key Vault where needed.<\/li>\n<li><strong>Why it matters:<\/strong> Avoids embedding credentials in pipeline definitions.<\/li>\n<li><strong>Practical benefit:<\/strong> Stronger security posture with rotation-friendly secrets.<\/li>\n<li><strong>Caveats:<\/strong> Some sources still require passwords\/keys; use Key Vault references and restrict access.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.13 Networking: private endpoints and managed virtual network (where applicable)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Helps reduce public exposure and control data exfiltration paths.<\/li>\n<li><strong>Why it matters:<\/strong> Many enterprises require private connectivity to data stores.<\/li>\n<li><strong>Practical benefit:<\/strong> Lower risk of data exposure through public endpoints.<\/li>\n<li><strong>Caveats:<\/strong> Configuration differs by connector and feature set. Private networking can complicate troubleshooting. Verify current support in the official networking docs.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">7. 
Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p>Azure Data Factory separates:\n&#8211; <strong>Design-time\/control plane:<\/strong> Where you define pipelines, linked services, datasets, triggers (typically through Azure Data Factory Studio in the Azure portal).\n&#8211; <strong>Run-time execution:<\/strong> Where the Integration Runtime performs copy\/transform or calls external services.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Control flow vs data flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control flow (orchestration):<\/strong> Pipeline definitions, activity chaining, triggers, retries, variables, branching.<\/li>\n<li><strong>Data flow (data movement\/transformation):<\/strong> The movement of bytes\/rows from source to sink (Copy Activity) or transformations executed by a Spark runtime (Mapping Data Flows) or external compute (Databricks, SQL, etc.).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Typical request\/data\/control flow<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>You author\/publish a pipeline in Azure Data Factory.<\/li>\n<li>A trigger (or manual run) starts a <strong>pipeline run<\/strong>.<\/li>\n<li>The service schedules activities and dispatches execution to an Integration Runtime.<\/li>\n<li>The IR connects to the source and sink (or external compute), moves\/transforms data.<\/li>\n<li>Run status and diagnostics are recorded; optional diagnostic logs flow to Azure Monitor\/Log Analytics.<\/li>\n<li>Downstream systems (warehouse\/lakehouse\/BI) consume the output.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related services (common)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Storage \/ ADLS Gen2:<\/strong> landing zones and curated zones.<\/li>\n<li><strong>Azure SQL Database \/ SQL Managed Instance \/ SQL Server:<\/strong> operational sources or targets.<\/li>\n<li><strong>Azure Synapse 
Analytics:<\/strong> loading dedicated SQL pools, serverless SQL patterns, or Spark-based transformations.<\/li>\n<li><strong>Azure Databricks:<\/strong> advanced transformations and ML feature engineering.<\/li>\n<li><strong>Azure Key Vault:<\/strong> secret storage and rotation.<\/li>\n<li><strong>Azure Monitor + Log Analytics:<\/strong> centralized logging and alerting.<\/li>\n<li><strong>Microsoft Purview:<\/strong> data catalog\/lineage integration patterns (verify exact integration steps).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<p>Azure Data Factory usually depends on:\n&#8211; <strong>Integration Runtime<\/strong> (Azure-managed or self-hosted)\n&#8211; <strong>Network connectivity<\/strong> (public endpoints, private endpoints, VPN\/ExpressRoute for hybrid)\n&#8211; <strong>Identity provider<\/strong> (Microsoft Entra ID)\n&#8211; <strong>Storage and compute services<\/strong> you orchestrate<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model (practical view)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>RBAC<\/strong> for managing who can author and run pipelines.<\/li>\n<li>Use <strong>Managed Identity<\/strong> for accessing Azure resources that support Entra-based auth (recommended).<\/li>\n<li>Use <strong>Key Vault<\/strong> for secrets when required (passwords, keys, tokens).<\/li>\n<li>Prefer least privilege roles and separate authoring from operations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model (practical view)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data movement path depends on IR type:<\/li>\n<li><strong>Azure IR<\/strong> reaches cloud sources\/sinks.<\/li>\n<li><strong>Self-hosted IR<\/strong> runs in your network and reaches internal endpoints.<\/li>\n<li>With stricter security, you may add:<\/li>\n<li><strong>Private endpoints<\/strong> on data stores<\/li>\n<li><strong>Managed virtual network<\/strong> features for the service 
(verify current applicability)<\/li>\n<li><strong>Firewall rules<\/strong> to restrict access to known networks<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>diagnostic settings<\/strong> to send logs to Log Analytics\/Storage\/Event Hubs.<\/li>\n<li>Standardize <strong>naming, tagging, and runbook links<\/strong>.<\/li>\n<li>Use <strong>alerts<\/strong> on pipeline failures and high duration\/cost anomalies.<\/li>\n<li>Implement <strong>CI\/CD<\/strong> and environment-specific parameterization to avoid drift.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  Dev[Engineer in ADF Studio] --&gt;|Publish pipeline| ADF[Azure Data Factory]\n  Trigger[Schedule\/Event Trigger] --&gt; ADF\n  ADF --&gt;|Dispatch activity| IR[Integration Runtime]\n  IR --&gt; Source[(Source: DB\/Files\/SaaS)]\n  IR --&gt; Sink[(Sink: ADLS\/Blob\/SQL\/Synapse)]\n  ADF --&gt; Monitor[Monitoring &amp; Run History]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart TB\n  subgraph RG[Resource Group: analytics-platform-prod]\n    ADF[Azure Data Factory]\n    KV[Azure Key Vault]\n    LA[Log Analytics Workspace]\n  end\n\n  subgraph Net[Network]\n    SHIR[\"Self-hosted Integration Runtime&lt;br\/&gt;(on Azure VM or on-prem server)\"]\n    VPN[VPN\/ExpressRoute]\n  end\n\n  subgraph Data[Data Platform]\n    ADLS[(ADLS Gen2 \/ Blob Storage)]\n    SQLMI[(Azure SQL MI \/ SQL DB)]\n    SYN[Azure Synapse \/ Warehouse]\n    DBX[Azure Databricks]\n  end\n\n  SourceOnPrem[(On-prem SQL Server \/ File Shares)] --&gt; VPN --&gt; SHIR\n  ADF --&gt;|Uses MI \/ KV refs| KV\n  ADF --&gt;|Diagnostics| LA\n\n  ADF --&gt;|Copy\/Orchestrate via Azure IR| ADLS\n  ADF --&gt;|Copy\/Stored proc| SQLMI\n  ADF 
--&gt;|Trigger notebook\/job| DBX\n  ADF --&gt;|Load curated data| SYN\n\n  SHIR --&gt;|Copy from on-prem| ADLS\n  SHIR --&gt;|Copy to cloud DB| SQLMI\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">8. Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account\/subscription\/tenant requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An active <strong>Azure subscription<\/strong> with permission to create resources.<\/li>\n<li>Ability to create:<\/li>\n<li>Azure Data Factory<\/li>\n<li>Azure Storage account (Blob)<\/li>\n<li>Role assignments (RBAC) for Managed Identity<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p>At minimum (typical lab setup):\n&#8211; On the subscription or resource group:\n  &#8211; <strong>Contributor<\/strong> (or more restrictive roles that still allow creating ADF and Storage)\n&#8211; For Storage data access using Managed Identity (recommended):\n  &#8211; Assign the Data Factory managed identity <strong>Storage Blob Data Contributor<\/strong> on the Storage account (or scoped container level where supported).<\/p>\n\n\n\n<p>If your organization restricts RBAC, coordinate with your Azure administrators.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Data Factory is usage-based; you need billing enabled.<\/li>\n<li>Mapping Data Flows and SSIS IR can increase costs quickly; the lab below avoids those.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Tools needed<\/h3>\n\n\n\n<p>Choose one:\n&#8211; <strong>Azure portal<\/strong> (recommended for this lab): https:\/\/portal.azure.com\/\n&#8211; Optional CLI:\n  &#8211; <strong>Azure CLI<\/strong>: https:\/\/learn.microsoft.com\/cli\/azure\/install-azure-cli<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Data Factory is region-based. 
Pick a region supported by your subscription policies.<\/li>\n<li>Some networking features\/connectors vary by region\u2014verify in official docs if you rely on them.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<p>Azure Data Factory has service limits (pipelines, activities, concurrency, Integration Runtime constraints, etc.). Limits can evolve.\n&#8211; Verify current limits: https:\/\/learn.microsoft.com\/azure\/data-factory\/limits<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services for the lab<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Storage account<\/strong> (Blob) with two containers:<\/li>\n<li><code>source<\/code><\/li>\n<li><code>sink<\/code><\/li>\n<li>A small sample CSV file to upload (we\u2019ll provide one)<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p>Azure Data Factory pricing is <strong>consumption-based<\/strong>. Exact prices vary by region and can change over time, so use the official pricing page and calculator for current numbers.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Official pricing page: https:\/\/azure.microsoft.com\/pricing\/details\/data-factory\/<\/li>\n<li>Pricing calculator: https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions (how you are charged)<\/h3>\n\n\n\n<p>Common cost dimensions include (names may vary slightly on the pricing page):\n1. <strong>Orchestration and activity runs<\/strong>\n   &#8211; Pipelines are made of activities; you are typically charged per activity run and related orchestration operations.\n2. <strong>Data movement (Copy Activity)<\/strong>\n   &#8211; Often measured by <strong>DIU-hours<\/strong> (Data Integration Units) used during copy execution.\n   &#8211; Performance settings and parallelism influence DIU usage.\n3. 
<strong>Data Flow (Mapping Data Flows)<\/strong>\n   &#8211; Charged by compute time (commonly vCore-hours) while the Spark cluster runs.\n4. <strong>SSIS Integration Runtime<\/strong>\n   &#8211; Charged by vCore-hours for the SSIS runtime while it is running (including idle time if left running).\n5. <strong>External activity execution<\/strong>\n   &#8211; Activities that call other compute services may incur ADF orchestration charges plus the cost of the target service (Databricks, Synapse, Functions, etc.).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<p>Azure Data Factory does not generally have a \u201cfree tier\u201d in the same way some services do, but your overall Azure account may have credits\/free services depending on your subscription type. Verify current offers on the pricing page.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Main cost drivers<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Number of pipeline\/activity runs<\/strong> (especially frequent schedules)<\/li>\n<li><strong>Copy throughput configuration<\/strong> (DIU usage and duration)<\/li>\n<li><strong>Mapping Data Flows runtime duration<\/strong><\/li>\n<li><strong>SSIS IR uptime<\/strong> (keeping it running is expensive relative to a basic copy pipeline)<\/li>\n<li><strong>Log Analytics ingestion\/retention<\/strong> if you enable verbose diagnostics<\/li>\n<li><strong>Networking<\/strong> (see below)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs (common surprises)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Target system costs:<\/strong> Storage transactions, SQL\/Synapse compute, Databricks jobs, etc.<\/li>\n<li><strong>Log Analytics costs:<\/strong> High-volume logs and long retention.<\/li>\n<li><strong>Self-hosted IR VM costs:<\/strong> If you host IR on an Azure VM, you pay VM + disk + network.<\/li>\n<li><strong>SSIS IR \u201calways on\u201d costs:<\/strong> If you forget to stop it, it continues 
billing.<\/li>\n<li><strong>Data egress:<\/strong> Copying data out of Azure (or between regions) can incur bandwidth\/egress charges.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Inbound to Azure<\/strong> is often free; <strong>egress<\/strong> and <strong>cross-region<\/strong> transfers can cost money (depends on Azure bandwidth pricing).<\/li>\n<li>Private networking (VPN\/ExpressRoute) has its own costs.<\/li>\n<li>If you copy from on-prem to Azure via Self-hosted IR, you pay for on-prem bandwidth and potentially VPN\/ExpressRoute.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer <strong>batching<\/strong> work rather than running thousands of tiny pipeline runs.<\/li>\n<li>Keep <strong>activity counts<\/strong> reasonable (avoid \u201cchatty\u201d pipelines with excessive Lookup\/Web calls).<\/li>\n<li>Tune copy performance thoughtfully:<\/li>\n<li>Start with defaults, then test higher throughput only when needed.<\/li>\n<li>Avoid Mapping Data Flows for simple transformations that a SQL engine can do cheaply.<\/li>\n<li>For SSIS IR:<\/li>\n<li>Use scheduling\/auto-start patterns if applicable, and stop when not needed.<\/li>\n<li>Use <strong>diagnostic settings<\/strong> selectively:<\/li>\n<li>Send essential logs to Log Analytics, and archive the rest to Storage if required.<\/li>\n<li>Consider <strong>metadata-driven frameworks<\/strong> to reduce duplicated pipelines and operational overhead.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (no fabricated prices)<\/h3>\n\n\n\n<p>A low-cost learning setup typically includes:\n&#8211; ADF pipelines that run <strong>manually<\/strong> or once per day\n&#8211; A small Copy Activity moving a few MBs\n&#8211; Minimal diagnostics (or logs to Storage)<\/p>\n\n\n\n<p>Your bill will mainly 
reflect:\n&#8211; A small number of <strong>activity runs<\/strong>\n&#8211; A small amount of <strong>DIU-hours<\/strong> during the copy<\/p>\n\n\n\n<p>Use the pricing calculator and input:\n&#8211; Expected activity runs\/day\n&#8211; Expected copy duration and DIU level<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p>In production, cost is usually dominated by:\n&#8211; High-frequency ingestion (many runs\/hour)\n&#8211; Large-scale copies (high DIU-hours)\n&#8211; Mapping Data Flows cluster runtime\n&#8211; SSIS IR uptime\n&#8211; Downstream compute (Synapse\/Databricks\/SQL)\n&#8211; Centralized logging volume<\/p>\n\n\n\n<p>A practical approach is to:\n&#8211; Build a cost model per pipeline (runs\/day \u00d7 activities\/run \u00d7 average duration)\n&#8211; Add data movement estimates (GB\/day \u00d7 expected throughput)\n&#8211; Add logging costs based on expected run volume and retention\n&#8211; Reassess after observing real Azure Cost Management data for 1\u20132 weeks<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">10. Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p>Create an <strong>Azure Data Factory<\/strong> pipeline that copies a small CSV file from one Blob container (<code>source<\/code>) to another container (<code>sink<\/code>) in the same Azure Storage account using <strong>Managed Identity<\/strong> authentication.<\/p>\n\n\n\n<p>This lab is designed to be:\n&#8211; Beginner-friendly\n&#8211; Low-cost (no Mapping Data Flows, no SSIS IR)\n&#8211; Executable with the Azure portal<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p>You will:\n1. Create an Azure Storage account and containers.\n2. Upload a sample CSV to the <code>source<\/code> container.\n3. Create an Azure Data Factory instance and enable its system-assigned managed identity.\n4. Grant the managed identity access to Blob data.\n5. 
Create linked services and datasets.\n6. Build a pipeline with a Copy Activity.\n7. Run the pipeline and validate output.\n8. Troubleshoot common issues.\n9. Clean up resources to stop billing.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a Resource Group<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the Azure portal, open <strong>Resource groups<\/strong>.<\/li>\n<li>Select <strong>Create<\/strong>.<\/li>\n<li>Set:\n   &#8211; <strong>Subscription:<\/strong> your subscription\n   &#8211; <strong>Resource group:<\/strong> <code>rg-adf-lab<\/code>\n   &#8211; <strong>Region:<\/strong> choose a region close to you<\/li>\n<li>Select <strong>Review + create<\/strong> \u2192 <strong>Create<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> Resource group <code>rg-adf-lab<\/code> exists.<\/p>\n\n\n\n<p>Optional Azure CLI:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az group create --name rg-adf-lab --location eastus\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create an Azure Storage Account + Containers<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the portal: <strong>Storage accounts<\/strong> \u2192 <strong>Create<\/strong>.<\/li>\n<li>Basics:\n   &#8211; <strong>Resource group:<\/strong> <code>rg-adf-lab<\/code>\n   &#8211; <strong>Storage account name:<\/strong> must be globally unique, e.g. <code>stadflab&lt;random&gt;<\/code>\n   &#8211; <strong>Region:<\/strong> same region as RG (recommended)\n   &#8211; <strong>Performance:<\/strong> Standard\n   &#8211; <strong>Redundancy:<\/strong> LRS (lowest cost, fine for lab)<\/li>\n<li>Networking: keep defaults for lab (public endpoint enabled). If your org enforces restrictions, adapt accordingly.<\/li>\n<li>Select <strong>Review<\/strong> \u2192 <strong>Create<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p>After deployment:\n1. Open the storage account.\n2. 
Go to <strong>Data storage<\/strong> \u2192 <strong>Containers<\/strong>.\n3. Create two containers:\n   &#8211; <code>source<\/code>\n   &#8211; <code>sink<\/code><\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> Storage account exists with <code>source<\/code> and <code>sink<\/code> containers.<\/p>\n\n\n\n<p>Optional Azure CLI (container creation requires auth context):<\/p>\n\n\n\n<pre><code class=\"language-bash\"># Sign in first (az login). With --auth-mode login, your identity needs a Blob data role such as Storage Blob Data Contributor.\n# For a quick lab, --account-key with the account key also works (lab only; prefer Entra ID auth elsewhere).\naz storage container create --name source --account-name stadflab&lt;random&gt; --auth-mode login\naz storage container create --name sink --account-name stadflab&lt;random&gt; --auth-mode login\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Upload a Sample CSV to the <code>source<\/code> Container<\/h3>\n\n\n\n<p>Create a local file named <code>customers.csv<\/code>:<\/p>\n\n\n\n<pre><code class=\"language-csv\">customer_id,name,country,signup_date\n1,Ana,US,2024-01-02\n2,Ben,CA,2024-02-10\n3,Chen,SG,2024-03-21\n<\/code><\/pre>\n\n\n\n<p>Upload via portal:\n1. Storage account \u2192 <strong>Containers<\/strong> \u2192 <code>source<\/code>\n2. 
<strong>Upload<\/strong> \u2192 select <code>customers.csv<\/code> \u2192 <strong>Upload<\/strong><\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> <code>customers.csv<\/code> is present in <code>source<\/code>.<\/p>\n\n\n\n<p><strong>Verification:<\/strong> In the <code>source<\/code> container, you can see <code>customers.csv<\/code> and its size is non-zero.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create Azure Data Factory<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the portal: search <strong>Data factories<\/strong> \u2192 <strong>Create<\/strong>.<\/li>\n<li>Basics:\n   &#8211; <strong>Subscription:<\/strong> your subscription\n   &#8211; <strong>Resource group:<\/strong> <code>rg-adf-lab<\/code>\n   &#8211; <strong>Name:<\/strong> <code>adf-lab-&lt;unique&gt;<\/code>\n   &#8211; <strong>Region:<\/strong> same region\n   &#8211; <strong>Version:<\/strong> V2 (this is the current service generation)<\/li>\n<li>Select <strong>Review + create<\/strong> \u2192 <strong>Create<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p>After deployment:\n1. Open the Data Factory resource.\n2. 
Select <strong>Launch studio<\/strong> (opens Azure Data Factory Studio).<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> Azure Data Factory Studio opens and you can see the authoring UI.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Enable Managed Identity and Grant Blob Access<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">5.1 Enable the Data Factory system-assigned managed identity<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In the Data Factory resource (not Studio), go to <strong>Identity<\/strong>.<\/li>\n<li>Under <strong>System assigned<\/strong>, set <strong>Status<\/strong> to <strong>On<\/strong> \u2192 <strong>Save<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> The Data Factory now has a system-assigned managed identity (an enterprise application\/service principal in your tenant).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">5.2 Grant the managed identity access to the Storage account<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Open the <strong>Storage account<\/strong>.<\/li>\n<li>Go to <strong>Access control (IAM)<\/strong> \u2192 <strong>Add role assignment<\/strong>.<\/li>\n<li>Choose role: <strong>Storage Blob Data Contributor<\/strong><\/li>\n<li>Assign access to: <strong>Managed identity<\/strong><\/li>\n<li>Select members: choose your <strong>Azure Data Factory<\/strong> resource identity<\/li>\n<li><strong>Review + assign<\/strong><\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> ADF\u2019s managed identity has permission to read\/write blobs in the storage account.<\/p>\n\n\n\n<p><strong>Verification tip:<\/strong> It can take a few minutes for role assignments to propagate.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Create a Linked Service to Azure Blob Storage (Managed Identity)<\/h3>\n\n\n\n<p>In <strong>Azure Data Factory Studio<\/strong>:\n1. 
Go to <strong>Manage<\/strong> (toolbox icon) \u2192 <strong>Linked services<\/strong> \u2192 <strong>New<\/strong>.\n2. Search for <strong>Azure Blob Storage<\/strong>.\n3. Create linked service:\n   &#8211; <strong>Name:<\/strong> <code>ls_blob_adflab<\/code>\n   &#8211; <strong>Authentication method:<\/strong> <strong>Managed Identity<\/strong> (wording may vary slightly)\n   &#8211; <strong>Storage account name\/URL:<\/strong> select or enter your storage account\n   &#8211; Test connection \u2192 <strong>Create<\/strong><\/p>\n\n\n\n<p>If you cannot select Managed Identity for your chosen connector\/settings:\n&#8211; Use <strong>Azure Data Lake Storage Gen2<\/strong> linked service if you used ADLS Gen2.\n&#8211; Or use <strong>Account key<\/strong> for this lab only (store it in Key Vault in real deployments).<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> Linked service <code>ls_blob_adflab<\/code> is created and tests successfully.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Create Source and Sink Datasets<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In Studio, go to <strong>Author<\/strong> \u2192 <strong>+<\/strong> \u2192 <strong>Dataset<\/strong>.<\/li>\n<li>Choose <strong>Azure Blob Storage<\/strong>.<\/li>\n<li>Choose format: <strong>DelimitedText<\/strong> (CSV).<\/li>\n<li>Set:\n   &#8211; <strong>Name:<\/strong> <code>ds_source_customers_csv<\/code>\n   &#8211; <strong>Linked service:<\/strong> <code>ls_blob_adflab<\/code>\n   &#8211; <strong>File path:<\/strong> container <code>source<\/code>, file <code>customers.csv<\/code>\n   &#8211; First row as header: enabled<\/li>\n<li>Create.<\/li>\n<\/ol>\n\n\n\n<p>Repeat for sink:\n1. <strong>+ Dataset<\/strong> \u2192 <strong>Azure Blob Storage<\/strong> \u2192 <strong>DelimitedText<\/strong>\n2. 
Set:\n   &#8211; <strong>Name:<\/strong> <code>ds_sink_customers_csv<\/code>\n   &#8211; <strong>Linked service:<\/strong> <code>ls_blob_adflab<\/code>\n   &#8211; <strong>File path:<\/strong> container <code>sink<\/code>\n   &#8211; <strong>File name:<\/strong> <code>customers.csv<\/code> (or <code>customers_copied.csv<\/code>)\n3. Create.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> Two datasets exist: one pointing to the source file, one to the destination path.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: Create a Pipeline with a Copy Activity<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>In Studio \u2192 <strong>Author<\/strong> \u2192 <strong>+<\/strong> \u2192 <strong>Pipeline<\/strong>.<\/li>\n<li>Name: <code>pl_copy_customers_blob_to_blob<\/code><\/li>\n<li>In <strong>Activities<\/strong>, expand <strong>Move &amp; transform<\/strong> and drag <strong>Copy data<\/strong> onto the canvas.<\/li>\n<li>Select the Copy activity and configure:<ul>\n<li><strong>Source<\/strong> tab \u2192 <strong>Source dataset:<\/strong> <code>ds_source_customers_csv<\/code><\/li>\n<li><strong>Sink<\/strong> tab \u2192 <strong>Sink dataset:<\/strong> <code>ds_sink_customers_csv<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ol>\n\n\n\n<p>Optional settings:\n&#8211; In <strong>Settings<\/strong>, you can configure logging and skip-incompatible-rows options, depending on the connector. 
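<\/p>\n\n\n\n<p>Under the hood, everything you author in Studio is stored as JSON. A Copy-activity pipeline like this one corresponds roughly to the following sketch (an illustrative outline, not an exact export; property names can vary by connector and service version):<\/p>\n\n\n\n<pre><code class=\"language-json\">{\n  \"name\": \"pl_copy_customers_blob_to_blob\",\n  \"properties\": {\n    \"activities\": [\n      {\n        \"name\": \"CopyCustomers\",\n        \"type\": \"Copy\",\n        \"inputs\": [ { \"referenceName\": \"ds_source_customers_csv\", \"type\": \"DatasetReference\" } ],\n        \"outputs\": [ { \"referenceName\": \"ds_sink_customers_csv\", \"type\": \"DatasetReference\" } ],\n        \"typeProperties\": {\n          \"source\": { \"type\": \"DelimitedTextSource\" },\n          \"sink\": { \"type\": \"DelimitedTextSink\" }\n        }\n      }\n    ]\n  }\n}\n<\/code><\/pre>\n\n\n\n<p>You can compare this against the real definition by opening the pipeline\u2019s code view in Studio.<\/p>\n\n\n\n<p>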
Keep defaults for the lab.<\/p>\n\n\n\n<p>Click <strong>Validate<\/strong> (top bar) to check for obvious errors.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> A pipeline exists with a Copy activity wired from the source dataset to the sink dataset.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 9: Debug Run, then Publish<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">9.1 Debug run (quick test)<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Click <strong>Debug<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p>Wait for completion (bottom panel shows status).<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> Debug run succeeds and reports rows read\/written.<\/p>\n\n\n\n<p>If it fails, go to the <strong>Output<\/strong> details and proceed to Troubleshooting.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">9.2 Publish<\/h4>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Click <strong>Publish all<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p><strong>Expected outcome:<\/strong> The pipeline artifacts are published to the live Data Factory service.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Step 10: Trigger a Manual Run and Monitor<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Click <strong>Add trigger<\/strong> \u2192 <strong>Trigger now<\/strong>.<\/li>\n<li>Confirm \u2192 <strong>OK<\/strong>.<\/li>\n<\/ol>\n\n\n\n<p>Monitor:\n1. Go to <strong>Monitor<\/strong> (left panel).\n2. Under <strong>Pipeline runs<\/strong>, find your pipeline run.\n3. Click it to see <strong>Activity runs<\/strong> and details.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> Pipeline run status is <strong>Succeeded<\/strong>.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p>Validate the output file exists in the sink container:\n1. Storage account \u2192 <strong>Containers<\/strong> \u2192 <code>sink<\/code>\n2. 
Confirm <code>customers.csv<\/code> (or your chosen output name) exists.<\/p>\n\n\n\n<p>Optionally download the file and confirm contents match the source.<\/p>\n\n\n\n<p><strong>Expected outcome:<\/strong> The sink container contains a copied CSV file with the same rows.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p>Common issues and practical fixes:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>AuthorizationPermissionMismatch \/ 403 when accessing Blob<\/strong>\n   &#8211; Cause: Managed identity lacks data-plane role.\n   &#8211; Fix: Ensure <strong>Storage Blob Data Contributor<\/strong> is assigned to the Data Factory managed identity at the Storage account scope (or container scope if supported), then wait a few minutes and retry.<\/p>\n<\/li>\n<li>\n<p><strong>Linked service test fails<\/strong>\n   &#8211; Cause: Wrong auth method, network restrictions, or role propagation delay.\n   &#8211; Fix: Re-test after a few minutes; verify Storage firewall settings allow access; verify you enabled system-assigned identity and assigned RBAC.<\/p>\n<\/li>\n<li>\n<p><strong>File not found<\/strong>\n   &#8211; Cause: Dataset path wrong (container name\/file name mismatch).\n   &#8211; Fix: Re-check dataset file path and case sensitivity; ensure file exists in <code>source<\/code>.<\/p>\n<\/li>\n<li>\n<p><strong>Publish succeeds but Trigger now fails<\/strong>\n   &#8211; Cause: Parameter mismatch or dataset referencing draft changes.\n   &#8211; Fix: Re-validate pipeline; ensure datasets and linked services are published; re-run.<\/p>\n<\/li>\n<li>\n<p><strong>Storage firewall\/private endpoints<\/strong>\n   &#8211; Cause: Storage account blocks public access; Azure IR cannot reach it.\n   &#8211; Fix: For this lab, keep Storage networking default. 
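<\/p>\n<p>A quick way to see whether the account is restricting public access is to query its network rule set with the Azure CLI (illustrative sketch; <code>&lt;storage-account&gt;<\/code> is a placeholder for your account name):<\/p>\n<pre><code class=\"language-bash\"># Show the default network action for the storage account.\n# Allow = public access permitted; Deny = firewall\/private-endpoint rules apply.\naz storage account show --name &lt;storage-account&gt; --resource-group rg-adf-lab --query networkRuleSet.defaultAction --output tsv\n<\/code><\/pre>\n<p>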
In production, use private endpoints and the appropriate ADF networking approach (verify current support for your connector and IR type).<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p>To stop billing and remove resources:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Delete the resource group:\n   &#8211; Portal: Resource groups \u2192 <code>rg-adf-lab<\/code> \u2192 <strong>Delete resource group<\/strong>\n   &#8211; Type the name to confirm \u2192 <strong>Delete<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Optional Azure CLI:<\/p>\n\n\n\n<pre><code class=\"language-bash\">az group delete --name rg-adf-lab --yes --no-wait\n<\/code><\/pre>\n\n\n\n<p><strong>Expected outcome:<\/strong> Azure Data Factory and Storage resources are deleted.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">11. Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use a <strong>layered lake pattern<\/strong>: <code>raw\/<\/code> \u2192 <code>curated\/<\/code> \u2192 <code>served\/<\/code> containers\/folders.<\/li>\n<li>Split complex logic into <strong>modular pipelines<\/strong>:<\/li>\n<li>One pipeline per domain or per ingestion pattern<\/li>\n<li>Reusable child pipelines (Execute Pipeline activity) for shared steps<\/li>\n<li>Prefer <strong>metadata-driven ingestion<\/strong> for many similar sources\/tables.<\/li>\n<li>Keep ADF responsible for orchestration; push heavy transformation to the most appropriate engine (SQL\/Spark\/Databricks) based on cost\/performance.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer <strong>Managed Identity<\/strong> over keys\/passwords whenever supported.<\/li>\n<li>Use <strong>Azure Key Vault<\/strong> for secrets; avoid storing secrets in linked services as plain values.<\/li>\n<li>Apply <strong>least 
privilege<\/strong>:<\/li>\n<li>Separate roles for authors vs operators vs viewers<\/li>\n<li>Limit who can edit linked services and triggers<\/li>\n<li>Use separate Data Factories (or strong environment separation) for <strong>dev\/test\/prod<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Reduce run frequency where acceptable; batch small ingestions.<\/li>\n<li>Minimize chatty control-flow calls (excessive web\/lookups).<\/li>\n<li>For Mapping Data Flows:<\/li>\n<li>Right-size runtime and avoid long-running clusters<\/li>\n<li>Stop\/test quickly; measure with real data<\/li>\n<li>Avoid leaving SSIS IR running when not in use.<\/li>\n<li>Monitor cost in <strong>Azure Cost Management<\/strong> and tag resources (<code>env<\/code>, <code>owner<\/code>, <code>costCenter<\/code>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Copy Activity performance features where appropriate:<\/li>\n<li>Partitioning\/parallel copy (when supported)<\/li>\n<li>Staging options (when supported)<\/li>\n<li>Optimize at the source and sink:<\/li>\n<li>Indexing for source queries<\/li>\n<li>Bulk load patterns for sinks<\/li>\n<li>Avoid \u201crow-by-row\u201d patterns; prefer bulk operations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use retries with exponential backoff for transient failures (HTTP, SaaS throttling).<\/li>\n<li>Implement idempotency:<\/li>\n<li>Write to date-partitioned folders<\/li>\n<li>Use overwrite vs incremental patterns intentionally<\/li>\n<li>Use tumbling window triggers for time-sliced processing where appropriate (verify semantics).<\/li>\n<li>Implement <strong>dead-letter<\/strong> patterns for failed files\/records.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li>Enable diagnostic settings to Azure Monitor\/Log Analytics with a retention policy aligned to your needs.<\/li>\n<li>Standardize runbooks:<\/li>\n<li>What to do on failure<\/li>\n<li>How to re-run safely<\/li>\n<li>How to handle partial loads<\/li>\n<li>Use alerting:<\/li>\n<li>Pipeline failure alerts<\/li>\n<li>Duration anomalies<\/li>\n<li>IR offline alerts (Self-hosted IR)<\/li>\n<li>Use Git for source control; require pull requests for production changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Naming conventions (example):<\/li>\n<li>Factories: <code>adf-&lt;org&gt;-&lt;env&gt;-&lt;region&gt;<\/code><\/li>\n<li>Linked services: <code>ls_&lt;system&gt;_&lt;auth&gt;<\/code><\/li>\n<li>Datasets: <code>ds_&lt;zone&gt;_&lt;entity&gt;_&lt;format&gt;<\/code><\/li>\n<li>Pipelines: <code>pl_&lt;domain&gt;_&lt;action&gt;<\/code><\/li>\n<li>Tag resources:<\/li>\n<li><code>env<\/code>, <code>owner<\/code>, <code>dataClassification<\/code>, <code>costCenter<\/code><\/li>\n<li>Document pipeline purpose and SLAs in descriptions and\/or repo docs.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure RBAC<\/strong> controls who can create\/edit\/run pipelines and manage the factory.<\/li>\n<li><strong>Managed Identity<\/strong> (system-assigned or user-assigned) is recommended for connecting to Azure services that support Entra ID auth.<\/li>\n<li>Use <strong>separation of duties<\/strong>:\n<ul>\n<li>Authors can develop pipelines<\/li>\n<li>Operators can monitor and re-run<\/li>\n<li>Security admins manage RBAC and secrets<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data in Azure Storage and many Azure services is encrypted at rest by default (service dependent).<\/li>\n<li>Data in transit uses TLS for supported connectors.<\/li>\n<li>For customer-managed keys (CMK) or advanced encryption requirements, verify current ADF and dependent service support in official docs.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Default setups often use public endpoints for Storage and other services.<\/li>\n<li>For enterprise security:\n<ul>\n<li>Use <strong>private endpoints<\/strong> for data stores where possible<\/li>\n<li>Use restricted firewalls and allowed networks<\/li>\n<li>Consider <strong>Self-hosted IR<\/strong> for private network reach<\/li>\n<li>Evaluate <strong>managed virtual network<\/strong> features where applicable (verify support for your connector and region)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Do not hardcode secrets in pipeline JSON or code repositories.<\/li>\n<li>Store secrets in <strong>Azure Key Vault<\/strong> and reference them from linked services.<\/li>\n<li>Rotate credentials and audit access.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Send diagnostics to <strong>Azure Monitor \/ Log Analytics<\/strong>.<\/li>\n<li>Track:\n<ul>\n<li>Pipeline run history<\/li>\n<li>Trigger changes<\/li>\n<li>Linked service changes<\/li>\n<\/ul>\n<\/li>\n<li>For broader governance, integrate with organizational logging and SIEM.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data residency: choose regions carefully.<\/li>\n<li>PII\/PHI: implement masking, restricted access, and least privilege.<\/li>\n<li>If you operate under specific frameworks (HIPAA, PCI, SOC, ISO), align controls with your organization\u2019s compliance program and verify service compliance documentation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Using Storage account keys everywhere instead of Managed Identity\/Key Vault.<\/li>\n<li>Leaving public network access open with no firewall controls in production.<\/li>\n<li>Granting overly broad roles (Owner\/Contributor) to all users.<\/li>\n<li>No environment separation, leading to accidental production changes.<\/li>\n<li>No auditing\/diagnostic settings, making investigations difficult.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use Managed Identity + RBAC for Azure Storage and Azure SQL where supported.<\/li>\n<li>Use Key Vault references for any required secrets.<\/li>\n<li>Restrict networking (private endpoints \/ SHIR) for sensitive data paths.<\/li>\n<li>Implement CI\/CD with approvals for production deployments.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<p>Azure Data Factory is mature, but there are practical constraints to plan for:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Not a streaming engine<\/strong>\n   &#8211; ADF is primarily for batch ingestion\/orchestration.<\/p>\n<\/li>\n<li>\n<p><strong>Integration Runtime choice affects everything<\/strong>\n   &#8211; Connectivity, performance, and even feasibility can hinge on Azure IR vs Self-hosted IR.<\/p>\n<\/li>\n<li>\n<p><strong>Connector capabilities vary<\/strong>\n   &#8211; Authentication methods, performance options, and supported operations differ by connector. Always check the connector\u2019s official documentation.<\/p>\n<\/li>\n<li>\n<p><strong>Private networking can be complex<\/strong>\n   &#8211; Storage firewalls\/private endpoints + IR networking frequently cause connectivity issues during initial setup.<\/p>\n<\/li>\n<li>\n<p><strong>Mapping Data Flows cost<\/strong>\n   &#8211; Spark cluster startup and runtime can be expensive for small transforms.<\/p>\n<\/li>\n<li>\n<p><strong>SSIS IR billing behavior<\/strong>\n   &#8211; If you leave SSIS IR running, you pay for uptime. Plan start\/stop and scheduling.<\/p>\n<\/li>\n<li>\n<p><strong>Operational overhead for Self-hosted IR<\/strong>\n   &#8211; You manage patching, scaling, HA, and network connectivity for the host machines.<\/p>\n<\/li>\n<li>\n<p><strong>DevOps deployments require planning<\/strong>\n   &#8211; Git\/CI\/CD is powerful but can be confusing without standard templates and environment parameterization.<\/p>\n<\/li>\n<li>\n<p><strong>Activity-level limits and concurrency<\/strong>\n   &#8211; There are service limits (pipelines, concurrent runs, integration runtime constraints). 
Verify current limits:\n   https:\/\/learn.microsoft.com\/azure\/data-factory\/limits<\/p>\n<\/li>\n<li>\n<p><strong>Schema drift and data quality<\/strong>\n   &#8211; File-based ingestion can fail on unexpected schema changes unless designed for drift handling and validation.<\/p>\n<\/li>\n<li>\n<p><strong>SaaS API throttling<\/strong>\n   &#8211; REST\/SaaS sources often enforce rate limits; add retries\/backoff and incremental patterns.<\/p>\n<\/li>\n<li>\n<p><strong>Time zones and scheduling<\/strong>\n   &#8211; Carefully validate trigger time zone behavior and daylight savings implications (verify trigger settings in the UI and docs).<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p>Azure Data Factory is one of several ways to orchestrate data workflows.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Comparison table<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Azure Data Factory<\/strong><\/td>\n<td>Batch data integration + orchestration<\/td>\n<td>Broad connectors, managed service, hybrid IR, monitoring, enterprise RBAC<\/td>\n<td>Costs can rise with frequent runs; complex networking; not streaming<\/td>\n<td>Standard Azure-centric batch ETL\/ELT orchestration<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Synapse pipelines<\/strong><\/td>\n<td>Pipelines tightly integrated with Synapse workspace<\/td>\n<td>Similar pipeline experience; close to Synapse artifacts<\/td>\n<td>Tied to Synapse workspace model; feature parity can vary<\/td>\n<td>If your team is all-in on Synapse workspace-centric development<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Databricks Workflows<\/strong><\/td>\n<td>Spark-first data engineering<\/td>\n<td>Great for code-driven pipelines; strong Spark ecosystem<\/td>\n<td>More engineering overhead; connector breadth 
differs<\/td>\n<td>When transformations are Spark-heavy and teams prefer code<\/td>\n<\/tr>\n<tr>\n<td><strong>Microsoft Fabric Data Factory \/ pipelines<\/strong><\/td>\n<td>Fabric-centric analytics<\/td>\n<td>Integrated with Fabric experiences (verify capabilities)<\/td>\n<td>Platform scope differs; feature mapping vs ADF varies<\/td>\n<td>When your organization has standardized on Fabric for analytics<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Logic Apps<\/strong><\/td>\n<td>Application integration and business workflows<\/td>\n<td>Huge SaaS\/event integrations; low-code<\/td>\n<td>Not optimized for big data movement\/ETL<\/td>\n<td>For app\/event workflows rather than analytics ingestion at scale<\/td>\n<\/tr>\n<tr>\n<td><strong>Apache Airflow (self-managed or managed offerings)<\/strong><\/td>\n<td>Code-based orchestration<\/td>\n<td>Python DAGs, strong ecosystem<\/td>\n<td>Operational overhead; connectors depend on your setup<\/td>\n<td>When teams want code-native orchestration with custom logic<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS Glue (other cloud)<\/strong><\/td>\n<td>AWS-native ETL<\/td>\n<td>Serverless ETL, crawler\/catalog integration<\/td>\n<td>Different cloud, migration effort<\/td>\n<td>If your data platform is primarily on AWS<\/td>\n<\/tr>\n<tr>\n<td><strong>Google Cloud Data Fusion \/ Dataflow (other cloud)<\/strong><\/td>\n<td>GCP-native data integration<\/td>\n<td>Strong GCP integrations<\/td>\n<td>Different cloud, migration effort<\/td>\n<td>If your platform is primarily on GCP<\/td>\n<\/tr>\n<tr>\n<td><strong>Apache NiFi (self-managed)<\/strong><\/td>\n<td>Flow-based data movement<\/td>\n<td>Visual flows, great for routing<\/td>\n<td>Operate\/scale it yourself<\/td>\n<td>When you need on-prem flow routing and are OK managing infrastructure<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">15. 
Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example: Hybrid data platform for a regulated retailer<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Retailer has on-prem SQL Server for POS data and an SFTP drop from logistics partners. They need daily analytics in Azure with strict network controls.<\/li>\n<li><strong>Proposed architecture:<\/strong>\n<ul>\n<li>Azure Data Factory in a production subscription<\/li>\n<li>Self-hosted Integration Runtime on hardened VMs (or on-prem servers) with HA<\/li>\n<li>Land data in ADLS Gen2 raw zone<\/li>\n<li>Transform using Synapse SQL and\/or Databricks depending on workload<\/li>\n<li>Store secrets in Key Vault and use Managed Identity where supported<\/li>\n<li>Central logs in Azure Monitor\/Log Analytics and alerts to on-call tooling<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Azure Data Factory was chosen:<\/strong>\n<ul>\n<li>Hybrid connectivity with Self-hosted IR<\/li>\n<li>Strong orchestration, retries, monitoring<\/li>\n<li>Fits enterprise RBAC and Key Vault patterns<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes:<\/strong>\n<ul>\n<li>Reliable daily ingestion with audit trail<\/li>\n<li>Reduced manual operations and faster troubleshooting<\/li>\n<li>Standardized ingestion approach across business units<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example: SaaS product analytics ingestion<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> Startup needs daily ingestion from production Postgres and a few SaaS endpoints into a lake for reporting, without hiring a large platform team.<\/li>\n<li><strong>Proposed architecture:<\/strong>\n<ul>\n<li>Azure Data Factory for orchestration and Copy Activity<\/li>\n<li>Azure Storage (Blob\/ADLS) as landing zone<\/li>\n<li>Lightweight transformations in SQL (Azure SQL) or a small Databricks job when needed<\/li>\n<li>Git integration for version control<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Azure Data Factory was chosen:<\/strong>\n<ul>\n<li>Quick setup, minimal ops overhead<\/li>\n<li>Visual authoring helps small teams move quickly<\/li>\n<li>Schedules\/monitoring reduce ad hoc scripts<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes:<\/strong>\n<ul>\n<li>Predictable daily refresh for dashboards<\/li>\n<li>Clear run history and failure notifications<\/li>\n<li>Gradual evolution to metadata-driven ingestion as sources grow<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Is Azure Data Factory an ETL or ELT tool?<\/strong><br\/>\n   It supports both patterns. You can copy data to a lake\/warehouse first (ELT) and then transform using SQL\/Spark, or transform using Mapping Data Flows as part of the pipeline (ETL-style).<\/p>\n<\/li>\n<li>\n<p><strong>Does Azure Data Factory store my data?<\/strong><br\/>\n   No. Azure Data Factory orchestrates and moves\/transforms data, but your data lives in your chosen storage\/DB services.<\/p>\n<\/li>\n<li>\n<p><strong>What is the Integration Runtime (IR)?<\/strong><br\/>\n   The IR is the execution infrastructure used for data movement and some transformations. Choosing Azure IR vs Self-hosted IR is a key design decision.<\/p>\n<\/li>\n<li>\n<p><strong>When do I need a Self-hosted Integration Runtime?<\/strong><br\/>\n   When your source\/sink is in a private network\/on-prem environment not reachable from Azure-managed runtimes, or when you must control the network path.<\/p>\n<\/li>\n<li>\n<p><strong>Can Azure Data Factory access Azure Storage using Managed Identity?<\/strong><br\/>\n   Yes, for many Azure connectors you can use Managed Identity and RBAC roles (e.g., Storage Blob Data Contributor). Verify support for your chosen connector.<\/p>\n<\/li>\n<li>\n<p><strong>How do I schedule pipelines?<\/strong><br\/>\n   Use triggers (schedule, event-based, or tumbling window depending on your needs). 
Always test time zone and DST behavior.<\/p>\n<\/li>\n<li>\n<p><strong>How do I handle incremental loads?<\/strong><br\/>\n   Common patterns include watermark columns, \u201clast modified\u201d timestamps, CDC approaches (source-dependent), and file partitioning by date.<\/p>\n<\/li>\n<li>\n<p><strong>Is Azure Data Factory the same as Synapse pipelines?<\/strong><br\/>\n   They are closely related in concept and user experience, but they are different products\/resources. Choose based on whether you want a standalone ADF factory or a Synapse workspace-centric approach.<\/p>\n<\/li>\n<li>\n<p><strong>Can I do transformations without Databricks?<\/strong><br\/>\n   Yes. You can use Mapping Data Flows, SQL stored procedures, Synapse SQL\/Spark, or other Azure services.<\/p>\n<\/li>\n<li>\n<p><strong>How do I version control Azure Data Factory assets?<\/strong><br\/>\n   Use Git integration in ADF Studio. For multi-environment deployments, follow a documented CI\/CD approach (verify Microsoft\u2019s current guidance).<\/p>\n<\/li>\n<li>\n<p><strong>How do I monitor failures and send alerts?<\/strong><br\/>\n   Use ADF monitoring views plus Azure Monitor diagnostic logs\/metrics and alert rules based on failures\/duration. Integrate alerts with email\/webhooks\/ITSM as needed.<\/p>\n<\/li>\n<li>\n<p><strong>What are common causes of pipeline failures?<\/strong><br\/>\n   Permissions (RBAC), network\/firewall restrictions, source throttling, schema drift, and incorrect dataset paths are common.<\/p>\n<\/li>\n<li>\n<p><strong>How do I secure secrets used by connectors?<\/strong><br\/>\n   Store them in Azure Key Vault and reference them from linked services; restrict Key Vault access and enable auditing.<\/p>\n<\/li>\n<li>\n<p><strong>Does Azure Data Factory support CI\/CD?<\/strong><br\/>\n   Yes, but the mechanics (Git mode, publish artifacts, deployment) require planning. 
Validate the recommended approach in official docs.<\/p>\n<\/li>\n<li>\n<p><strong>How do I estimate costs before going to production?<\/strong><br\/>\n   Model activity runs\/day, copy duration\/throughput (DIU-hours), data flow runtime (vCore-hours), SSIS IR uptime, and logging volume. Then validate with the Azure pricing calculator and a small proof-of-concept.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">17. Top Online Resources to Learn Azure Data Factory<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Azure Data Factory documentation (Learn) \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/<\/td>\n<td>Canonical reference for concepts, connectors, activities, networking, and security<\/td>\n<\/tr>\n<tr>\n<td>Official overview<\/td>\n<td>Introduction to Azure Data Factory \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/introduction<\/td>\n<td>Clear, official service overview and core terminology<\/td>\n<\/tr>\n<tr>\n<td>Limits\/quotas<\/td>\n<td>Azure Data Factory limits \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/limits<\/td>\n<td>Helps avoid surprises in production planning<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Azure Data Factory pricing \u2014 https:\/\/azure.microsoft.com\/pricing\/details\/data-factory\/<\/td>\n<td>Current pricing model and billing dimensions<\/td>\n<\/tr>\n<tr>\n<td>Cost estimation<\/td>\n<td>Azure Pricing Calculator \u2014 https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/td>\n<td>Estimate costs based on expected activity runs and runtime usage<\/td>\n<\/tr>\n<tr>\n<td>Tutorials<\/td>\n<td>Tutorials in Azure Data Factory \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/tutorial-copy-data-portal<\/td>\n<td>Step-by-step walkthroughs (copy data, triggers, etc.)<\/td>\n<\/tr>\n<tr>\n<td>Connector 
reference<\/td>\n<td>Azure Data Factory connectors \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/connector-overview<\/td>\n<td>Official list of connectors and connector-specific notes<\/td>\n<\/tr>\n<tr>\n<td>Networking guidance<\/td>\n<td>Azure Data Factory networking and security topics \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/<\/td>\n<td>Official networking\/security sections (verify current pages for Private Link\/managed VNet)<\/td>\n<\/tr>\n<tr>\n<td>CI\/CD guidance<\/td>\n<td>Source control and CI\/CD in ADF \u2014 https:\/\/learn.microsoft.com\/azure\/data-factory\/source-control<\/td>\n<td>Official Git integration and collaboration concepts<\/td>\n<\/tr>\n<tr>\n<td>Samples (GitHub)<\/td>\n<td>Azure Data Factory samples (GitHub) \u2014 https:\/\/github.com\/Azure\/Azure-DataFactory<\/td>\n<td>Community + Microsoft-maintained samples and templates (review repo contents and applicability)<\/td>\n<\/tr>\n<tr>\n<td>Architecture center<\/td>\n<td>Azure Architecture Center \u2014 https:\/\/learn.microsoft.com\/azure\/architecture\/<\/td>\n<td>Reference architectures and best practices for analytics platforms<\/td>\n<\/tr>\n<tr>\n<td>Video learning<\/td>\n<td>Microsoft Azure YouTube \u2014 https:\/\/www.youtube.com\/@MicrosoftAzure<\/td>\n<td>Official videos; search within channel for \u201cAzure Data Factory\u201d sessions<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">18. 
Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, cloud engineers, platform teams<\/td>\n<td>Azure DevOps, automation, cloud fundamentals; may include data pipeline operations<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Beginners to intermediate IT professionals<\/td>\n<td>Software\/configuration management and DevOps-aligned tooling<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CloudOpsNow.in<\/td>\n<td>Cloud operations teams<\/td>\n<td>Cloud operations, monitoring, reliability practices<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, operations, reliability engineers<\/td>\n<td>SRE practices, observability, incident response<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams and engineers exploring AIOps<\/td>\n<td>AIOps concepts, monitoring automation<\/td>\n<td>check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">19. 
Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content (verify course coverage)<\/td>\n<td>Beginners to professionals seeking guided training<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps training services\/platform (verify specific Azure coverage)<\/td>\n<td>DevOps engineers, cloud engineers<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps services\/training platform (verify offerings)<\/td>\n<td>Teams wanting flexible coaching\/support<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and training resources (verify offerings)<\/td>\n<td>Ops\/DevOps teams needing practical assistance<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company Name<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps\/engineering consulting (verify exact services)<\/td>\n<td>Architecture, implementation support, operational readiness<\/td>\n<td>Designing secure ADF ingestion, setting up CI\/CD, monitoring and runbooks<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps and cloud consulting\/training organization<\/td>\n<td>Enablement, platform practices, DevOps processes<\/td>\n<td>ADF operationalization, IaC strategy, governance and cost controls (verify scope)<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting services (verify offerings)<\/td>\n<td>Automation, DevOps pipelines, operational tooling<\/td>\n<td>Building deployment pipelines for ADF, integrating alerts and incident workflows<\/td>\n<td>https:\/\/www.devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before Azure Data Factory<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure fundamentals: subscriptions, resource groups, RBAC, networking basics<\/li>\n<li>Data fundamentals: files vs tables, batch processing, basic SQL<\/li>\n<li>Storage basics: Blob\/ADLS containers, folders, access keys vs Entra ID auth<\/li>\n<li>Security basics: Managed Identity, Key Vault, least privilege<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after Azure Data Factory<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data lake architecture: medallion\/layered zones, partitioning strategies<\/li>\n<li>Transformation engines:\n<ul>\n<li>SQL-based transformations (Synapse\/SQL DB)<\/li>\n<li>Spark-based transformations (Databricks\/Synapse Spark)<\/li>\n<\/ul>\n<\/li>\n<li>Governance: Microsoft Purview concepts (catalog, lineage\u2014verify integration steps)<\/li>\n<li>DataOps: CI\/CD patterns, testing strategies for pipelines, monitoring\/alerting<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use Azure Data Factory<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data Engineer<\/li>\n<li>Analytics Engineer (or orchestration-focused)<\/li>\n<li>Cloud Engineer \/ Platform Engineer (data platform)<\/li>\n<li>DevOps Engineer supporting data platforms<\/li>\n<li>BI Engineer (in smaller teams)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (Azure)<\/h3>\n\n\n\n<p>Microsoft certification offerings change over time. 
Commonly relevant certifications include Azure data and analytics tracks.<br\/>\n&#8211; Verify current role-based certifications on Microsoft Learn: https:\/\/learn.microsoft.com\/credentials\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Build a metadata-driven ingestion pipeline that loads 10 CSV files to a curated zone.<\/li>\n<li>Implement incremental loads from Azure SQL using a watermark.<\/li>\n<li>Create a Self-hosted IR on a VM and ingest from a private endpoint (in a controlled lab).<\/li>\n<li>Add Azure Monitor alerts for pipeline failures and build a basic runbook.<\/li>\n<li>Use Git integration and deploy dev \u2192 test \u2192 prod with parameterization.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Activity:<\/strong> A single step in an Azure Data Factory pipeline (e.g., Copy, Lookup, If Condition).<\/li>\n<li><strong>ADF Studio:<\/strong> Web UI for authoring and monitoring Azure Data Factory (launched from the Azure portal).<\/li>\n<li><strong>Azure Integration Runtime (Azure IR):<\/strong> Microsoft-managed runtime used for data movement and some activities in Azure.<\/li>\n<li><strong>Azure-SSIS Integration Runtime:<\/strong> ADF runtime option to execute SSIS packages in Azure.<\/li>\n<li><strong>CI\/CD:<\/strong> Continuous Integration\/Continuous Delivery; automating build\/test\/deploy of ADF artifacts.<\/li>\n<li><strong>Copy Activity:<\/strong> Core ADF activity used to copy data from source to sink.<\/li>\n<li><strong>Dataset:<\/strong> A named reference to data within a data store (table, file path, folder).<\/li>\n<li><strong>DIU (Data Integration Unit):<\/strong> A billing\/performance concept used for Copy Activity data movement (see pricing docs for current definition).<\/li>\n<li><strong>Integration Runtime (IR):<\/strong> Compute and connectivity layer used by ADF for 
execution.<\/li>\n<li><strong>Linked service:<\/strong> Connection configuration to a data store or compute service.<\/li>\n<li><strong>Managed Identity:<\/strong> Azure identity for a resource, used to authenticate to other Azure services without managing secrets.<\/li>\n<li><strong>Mapping Data Flow:<\/strong> Visual transformation feature that runs Spark-based transformations.<\/li>\n<li><strong>Pipeline:<\/strong> A container for activities representing an orchestration workflow.<\/li>\n<li><strong>Private Endpoint:<\/strong> Azure Private Link endpoint providing private connectivity to a service.<\/li>\n<li><strong>Self-hosted Integration Runtime (SHIR):<\/strong> Runtime installed on your machine\/VM for on-prem\/private network access.<\/li>\n<li><strong>Trigger:<\/strong> A schedule\/event definition that starts pipeline runs automatically.<\/li>\n<li><strong>Tumbling window trigger:<\/strong> A trigger type for fixed-size time windows (verify exact behavior in docs).<\/li>\n<li><strong>Watermark:<\/strong> A value (timestamp\/ID) used to load only new\/changed data incrementally.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p>Azure Data Factory is Azure\u2019s managed <strong>Analytics-focused data integration and orchestration<\/strong> service. It helps you build, schedule, and monitor pipelines that move and transform data across cloud and hybrid environments.<\/p>\n\n\n\n<p>It matters because most real analytics platforms need a reliable ingestion layer with strong operational controls\u2014retries, monitoring, access control, and repeatable deployments. 
Azure Data Factory fills that role by combining pipelines, connectors, Integration Runtime options (Azure IR and Self-hosted IR), and integrations with Key Vault and Azure Monitor.<\/p>\n\n\n\n<p>Cost-wise, focus on the main drivers: <strong>activity runs<\/strong>, <strong>data movement (DIU-hours)<\/strong>, <strong>Mapping Data Flow runtime<\/strong>, <strong>SSIS IR uptime<\/strong>, and <strong>logging volume<\/strong>. Security-wise, prefer <strong>Managed Identity<\/strong>, least privilege RBAC, Key Vault for secrets, and private networking patterns where required.<\/p>\n\n\n\n<p>Use Azure Data Factory when you need standardized batch ingestion and orchestration in Azure. If you need streaming or a full warehouse\/lakehouse engine, pair ADF with the right compute\/storage services rather than expecting ADF to replace them.<\/p>\n\n\n\n<p>Next step: build a second pipeline that ingests incrementally (watermark pattern) and enable Azure Monitor diagnostics so you can practice operating Azure Data Factory like a production 
service.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Analytics<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21,40,12],"tags":[],"class_list":["post-378","post","type-post","status-publish","format-standard","hentry","category-analytics","category-azure","category-databases"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/378","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=378"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/378\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=378"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=378"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=378"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}