{"id":383,"date":"2026-04-13T21:08:35","date_gmt":"2026-04-13T21:08:35","guid":{"rendered":"https:\/\/www.devopsschool.com\/tutorials\/azure-data-catalog-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics\/"},"modified":"2026-04-13T21:08:35","modified_gmt":"2026-04-13T21:08:35","slug":"azure-data-catalog-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics","status":"publish","type":"post","link":"https:\/\/www.devopsschool.com\/tutorials\/azure-data-catalog-tutorial-architecture-pricing-use-cases-and-hands-on-guide-for-analytics\/","title":{"rendered":"Azure Data Catalog Tutorial: Architecture, Pricing, Use Cases, and Hands-On Guide for Analytics"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">Category<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Analytics<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">1. Introduction<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Important service status note (read first):<\/strong> The original <strong>Azure Data Catalog<\/strong> service (the standalone product that existed years ago) was <strong>retired<\/strong> by Microsoft. Today, the \u201cdata catalog\u201d experience in Azure is delivered as <strong>Microsoft Purview Data Catalog<\/strong> (part of <strong>Microsoft Purview<\/strong>, an Azure-deployable governance service). In this tutorial, <strong>Data Catalog<\/strong> refers to the <strong>current, supported<\/strong> catalog capability in <strong>Microsoft Purview<\/strong>\u2014the modern replacement for the retired Azure Data Catalog. Always verify the latest product status and feature availability in official docs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What this service is<\/strong>\n&#8211; <strong>Data Catalog<\/strong> in Azure is a <strong>central inventory of data assets<\/strong> (tables, files, reports, pipelines, etc.) 
plus the metadata that makes those assets discoverable, understandable, and governable.\n&#8211; It is used by data consumers (analysts, engineers, scientists) to <strong>find trustworthy data<\/strong> and by governance teams to <strong>standardize definitions<\/strong> and oversee metadata quality.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Simple explanation (one paragraph)<\/strong><br\/>\nData Catalog is like a searchable \u201clibrary index\u201d for your organization\u2019s data. It helps people quickly find datasets, understand what they mean, see who owns them, and learn how they are used\u2014without having to ask around or dig through storage accounts and databases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Technical explanation (one paragraph)<\/strong><br\/>\nIn Azure, Data Catalog is implemented through <strong>Microsoft Purview Data Catalog<\/strong>. It builds a <strong>metadata graph<\/strong> (via automated scanning and integrations) for supported sources such as Azure Storage, Azure SQL, Azure Synapse, Power BI, and more. It stores technical metadata (schemas, columns, file formats), business metadata (glossary, owners, descriptions), and relationship metadata (lineage where supported). Access is governed through <strong>Azure AD<\/strong> identities, <strong>Purview roles<\/strong>, and Azure resource permissions for scanning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What problem it solves<\/strong>\n&#8211; Data sprawl and \u201cunknown datasets\u201d\n&#8211; Repeated questions like \u201cWhere is the customer table?\u201d and \u201cWhich dataset is the source of truth?\u201d\n&#8211; Inconsistent definitions (e.g., \u201cactive customer\u201d means different things across teams)\n&#8211; Risk from sensitive data being poorly understood or incorrectly shared\n&#8211; Slow onboarding for new engineers and analysts<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
What is Data Catalog?<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Official purpose<\/strong><br\/>\nData Catalog (Microsoft Purview Data Catalog in Azure) is designed to provide <strong>data discovery<\/strong> and <strong>metadata management<\/strong> across an organization\u2019s data estate. Its purpose is to help users <strong>find, understand, trust, and govern<\/strong> data assets.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Core capabilities<\/strong>\n&#8211; <strong>Register and scan data sources<\/strong> to extract metadata (schemas, columns, file types, etc.)\n&#8211; <strong>Search and browse<\/strong> assets with filters and facets\n&#8211; <strong>Enrich metadata<\/strong> with descriptions, owners, classifications, glossary terms, and tags (capabilities vary by source and configuration)\n&#8211; <strong>Lineage<\/strong> visualization for supported systems and integrations (availability depends on connectors and workloads)\n&#8211; <strong>Govern access to the catalog<\/strong> and organize assets via <strong>collections<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Major components (as implemented in Microsoft Purview)<\/strong>\n&#8211; <strong>Microsoft Purview account<\/strong>: The Azure resource you deploy; it hosts governance capabilities.\n&#8211; <strong>Data Map<\/strong>: The underlying metadata store\/index used by Data Catalog.\n&#8211; <strong>Collections<\/strong>: Hierarchical partitions for organizing assets and delegating administration.\n&#8211; <strong>Scans and credentials<\/strong>: Configuration to connect to and scan sources.\n&#8211; <strong>Business glossary<\/strong>: Central vocabulary and definitions linked to assets.\n&#8211; <strong>Insights \/ reporting<\/strong> (where available): Views into data estate coverage, classifications, and governance posture (feature availability may vary\u2014verify in official docs).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Service 
type<\/strong>\n&#8211; Managed cloud service for <strong>metadata management and governance<\/strong>, deployed as an Azure resource and accessed via web portals and APIs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Scope: regional\/global\/account-scoped<\/strong>\n&#8211; You create a <strong>Microsoft Purview account in an Azure region<\/strong>.<br\/>\n&#8211; The <strong>catalog metadata<\/strong> is <strong>account-scoped<\/strong> (within that Purview account) and tied to your <strong>Azure AD tenant<\/strong> identity model.\n&#8211; You can typically catalog across <strong>multiple Azure subscriptions<\/strong> (within the same Azure AD tenant) as long as the Purview managed identity and\/or configured credentials have permission to read metadata from those sources.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How it fits into the Azure ecosystem<\/strong>\n&#8211; Complements <strong>Analytics<\/strong> services such as <strong>Azure Synapse Analytics<\/strong>, <strong>Azure Data Factory<\/strong>, <strong>Azure Databricks<\/strong>, <strong>Power BI<\/strong>, and <strong>Azure Data Lake Storage<\/strong> by adding the missing layer: <strong>enterprise metadata + discovery + governance<\/strong>.\n&#8211; Integrates with Azure identity (Azure AD), Azure networking (private endpoints where supported), and Azure monitoring (diagnostic logs).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Official entry points to verify:<br\/>\n&#8211; Microsoft Purview documentation: https:\/\/learn.microsoft.com\/purview\/<br\/>\n&#8211; Microsoft Purview Data Catalog overview: https:\/\/learn.microsoft.com\/purview\/purview-data-catalog-overview<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Why use Data Catalog?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Business reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Faster analytics outcomes<\/strong>: Teams spend less time searching for data and more time using it.<\/li>\n<li><strong>Shared definitions<\/strong>: A business glossary reduces KPI disputes and inconsistent reporting.<\/li>\n<li><strong>Better data trust<\/strong>: Ownership, certification\/curation patterns, and lineage (where available) increase confidence.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Technical reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Centralized metadata<\/strong> across many systems (storage, databases, BI, etc.).<\/li>\n<li><strong>Automation<\/strong> via scanning and integrations reduces manual documentation.<\/li>\n<li><strong>Lineage<\/strong> helps engineers assess downstream impact before making schema changes.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operational reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Onboarding<\/strong>: New hires find data faster.<\/li>\n<li><strong>Reduced tribal knowledge<\/strong>: Metadata is documented and searchable.<\/li>\n<li><strong>Change impact assessment<\/strong>: Lineage + ownership shortens incident resolution (when supported).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security \/ compliance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Sensitive data discovery<\/strong> via classifications (built-in and custom\u2014verify exact classifiers in your region\/tenant).<\/li>\n<li><strong>Governance boundaries<\/strong> with collections and role-based access.<\/li>\n<li><strong>Auditability<\/strong> by using Azure logging and Purview activity visibility (verify logging specifics in official docs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Scalability \/ performance reasons<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Designed for <strong>enterprise-scale<\/strong> 
metadata indexing rather than spreadsheet-based catalogs.<\/li>\n<li>Supports scanning at scale, but you must manage scan schedules, throughput, and cost.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should choose Data Catalog<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You have multiple data platforms and need a <strong>single discovery layer<\/strong>.<\/li>\n<li>Data is growing faster than documentation.<\/li>\n<li>You need <strong>governance<\/strong>: ownership, glossary, classification, and (ideally) lineage.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">When teams should not choose it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You have a very small, single-database environment and can document with lightweight tooling.<\/li>\n<li>You require a fully on-prem-only catalog with no cloud footprint (unless your governance requirements allow Purview with hybrid connectors; verify).<\/li>\n<li>You need advanced data access governance features for sources not supported by Purview connectors; consider alternatives or complementary tools.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Where is Data Catalog used?<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Industries<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Financial services (risk, regulatory reporting, data controls)<\/li>\n<li>Healthcare and life sciences (PHI discovery, auditability)<\/li>\n<li>Retail and e-commerce (customer analytics, product catalog analytics)<\/li>\n<li>Manufacturing\/IoT (sensor data lakes, quality data)<\/li>\n<li>Public sector (data sharing boundaries, compliance)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Team types<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data engineering<\/li>\n<li>BI \/ analytics teams<\/li>\n<li>Data science \/ ML engineering<\/li>\n<li>Governance and compliance teams<\/li>\n<li>Platform engineering (data platform owners)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Workloads<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data lake\/lakehouse discovery (ADLS Gen2, Databricks, Synapse)<\/li>\n<li>Data warehouse governance (Azure SQL, Synapse SQL)<\/li>\n<li>BI semantic model discovery (Power BI)<\/li>\n<li>ETL\/ELT pipeline documentation (ADF\/Synapse pipelines; lineage depends on integration)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Architectures<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Central data platform with domain data products (data mesh patterns)<\/li>\n<li>Multi-subscription landing zones where data is distributed<\/li>\n<li>Hybrid estates with some on-prem sources (connectors vary; verify supported sources)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Real-world deployment contexts<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Production<\/strong>: Most value comes when scanning production data sources (metadata-only access) and curating trusted datasets.<\/li>\n<li><strong>Dev\/Test<\/strong>: Used to validate scanning, role delegation, glossary workflow, and integration patterns before production rollout.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 
class=\"wp-block-heading\">5. Top Use Cases and Scenarios<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Below are realistic scenarios where Data Catalog fits well. Each includes the problem, why Data Catalog fits, and an example.<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Enterprise data discovery for a data lake<\/strong>\n   &#8211; <strong>Problem:<\/strong> Hundreds of containers and folders in ADLS Gen2; nobody knows what\u2019s in them.\n   &#8211; <strong>Why it fits:<\/strong> Automated scanning extracts schemas and file metadata; search helps users find datasets.\n   &#8211; <strong>Example:<\/strong> Finance analysts search \u201cinvoice\u201d and find curated parquet datasets with owners and refresh cadence.<\/p>\n<\/li>\n<li>\n<p><strong>Business glossary for KPI standardization<\/strong>\n   &#8211; <strong>Problem:<\/strong> \u201cRevenue\u201d and \u201cActive customer\u201d definitions vary across teams.\n   &#8211; <strong>Why it fits:<\/strong> Glossary terms become shared definitions and can be linked to assets and columns.\n   &#8211; <strong>Example:<\/strong> The CFO office maintains glossary terms used by Power BI reports and data warehouse tables.<\/p>\n<\/li>\n<li>\n<p><strong>Sensitive data inventory and risk reduction<\/strong>\n   &#8211; <strong>Problem:<\/strong> Unknown locations of personal data increase breach and compliance risk.\n   &#8211; <strong>Why it fits:<\/strong> Classification\/tagging + catalog browsing supports sensitive data discovery.\n   &#8211; <strong>Example:<\/strong> Security team identifies datasets containing national IDs and ensures access reviews are in place.<\/p>\n<\/li>\n<li>\n<p><strong>Lineage-driven change management (where supported)<\/strong>\n   &#8211; <strong>Problem:<\/strong> Schema changes break downstream reports.\n   &#8211; <strong>Why it fits:<\/strong> Lineage helps understand upstream\/downstream dependencies.\n   &#8211; <strong>Example:<\/strong> Engineers see that a 
Synapse table feeds a critical Power BI semantic model before altering a column.<\/p>\n<\/li>\n<li>\n<p><strong>Data product catalog for data mesh<\/strong>\n   &#8211; <strong>Problem:<\/strong> Domains publish data products but consumers can\u2019t easily discover them.\n   &#8211; <strong>Why it fits:<\/strong> Collections map to domains; assets are curated and searchable across the enterprise.\n   &#8211; <strong>Example:<\/strong> \u201cSales\u201d and \u201cSupply Chain\u201d collections publish certified datasets with clear owners.<\/p>\n<\/li>\n<li>\n<p><strong>Audit support and compliance evidence<\/strong>\n   &#8211; <strong>Problem:<\/strong> Auditors request evidence of data classification and ownership.\n   &#8211; <strong>Why it fits:<\/strong> Central metadata (owners, classifications) can be reviewed and exported (methods vary; verify).\n   &#8211; <strong>Example:<\/strong> Compliance team demonstrates where sensitive fields exist and who owns them.<\/p>\n<\/li>\n<li>\n<p><strong>Faster incident response for data quality issues<\/strong>\n   &#8211; <strong>Problem:<\/strong> A dashboard is wrong; nobody knows the data origin.\n   &#8211; <strong>Why it fits:<\/strong> Catalog metadata and lineage reduce MTTR.\n   &#8211; <strong>Example:<\/strong> On-call analyst traces a KPI to a pipeline and finds a failed step introduced yesterday.<\/p>\n<\/li>\n<li>\n<p><strong>Analytics platform onboarding<\/strong>\n   &#8211; <strong>Problem:<\/strong> New hires take weeks to learn data locations and meanings.\n   &#8211; <strong>Why it fits:<\/strong> Searchable catalog + glossary shortens ramp-up.\n   &#8211; <strong>Example:<\/strong> A new data engineer uses the catalog to find canonical customer and product tables.<\/p>\n<\/li>\n<li>\n<p><strong>Consolidated metadata across multiple subscriptions<\/strong>\n   &#8211; <strong>Problem:<\/strong> Data is split across landing zones; no unified discovery.\n   &#8211; <strong>Why it fits:<\/strong> Purview 
can catalog across subscriptions if permissions are granted.\n   &#8211; <strong>Example:<\/strong> Central governance catalogs storage accounts from multiple business units.<\/p>\n<\/li>\n<li>\n<p><strong>Migration support (on-prem to Azure)<\/strong>\n   &#8211; <strong>Problem:<\/strong> During migration, assets are duplicated and mapping is unclear.\n   &#8211; <strong>Why it fits:<\/strong> Catalog can track both old and new assets and help link documentation.\n   &#8211; <strong>Example:<\/strong> Teams catalog on-prem SQL (if supported via connector) and Azure SQL to manage the transition.<\/p>\n<\/li>\n<li>\n<p><strong>Self-service analytics enablement<\/strong>\n   &#8211; <strong>Problem:<\/strong> Analysts request data extracts because they can\u2019t find trusted sources.\n   &#8211; <strong>Why it fits:<\/strong> Curated assets plus owners reduce ad-hoc extract requests.\n   &#8211; <strong>Example:<\/strong> Analysts use the catalog to find certified \u201csales_orders\u201d and stop using emailed CSVs.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">6. Core Features<\/h2>\n\n\n\n<blockquote>\n<p>Feature availability can vary by region, connector, and Microsoft Purview release. 
Verify details in official docs: https:\/\/learn.microsoft.com\/purview\/<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">1) Asset discovery: search and browse<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Provides a search UI and browsing experience across cataloged assets.<\/li>\n<li><strong>Why it matters:<\/strong> Users can find data without knowing where it lives.<\/li>\n<li><strong>Practical benefit:<\/strong> Reduces time spent asking around and reduces duplicate datasets.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Search quality depends on scanning coverage and metadata enrichment.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">2) Automated scanning and metadata extraction<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Connects to supported sources and extracts metadata such as schemas, file types, and structural information.<\/li>\n<li><strong>Why it matters:<\/strong> Keeps metadata current without manual documentation.<\/li>\n<li><strong>Practical benefit:<\/strong> Scheduled scans can maintain freshness.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Requires correct permissions; some sources require credentials\/managed identity setup.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">3) Collections and delegated administration<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Organizes catalog assets into a hierarchy (collections) and supports role delegation.<\/li>\n<li><strong>Why it matters:<\/strong> Large organizations need governance boundaries and decentralized ownership.<\/li>\n<li><strong>Practical benefit:<\/strong> Domain teams manage their own assets while central governance sets standards.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Misaligned collection strategy can create administrative complexity.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">4) Business glossary<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Maintains business terms, definitions, and relationships; terms can be linked to assets.<\/li>\n<li><strong>Why it matters:<\/strong> Aligns business and technical teams on definitions.<\/li>\n<li><strong>Practical benefit:<\/strong> Fewer KPI disputes; better report consistency.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Requires governance process to keep terms accurate.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">5) Metadata enrichment (owners, descriptions, contacts)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Adds human context on top of technical metadata.<\/li>\n<li><strong>Why it matters:<\/strong> Ownership and context drive trust and accountability.<\/li>\n<li><strong>Practical benefit:<\/strong> People know who to contact and how data should be used.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Needs operating model (RACI) to avoid stale ownership.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6) Classification and labeling (sensitive data discovery)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Applies classifications to assets\/columns based on scanning and rules.<\/li>\n<li><strong>Why it matters:<\/strong> Helps locate sensitive data and manage exposure.<\/li>\n<li><strong>Practical benefit:<\/strong> Supports compliance initiatives and access reviews.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Classification accuracy varies; false positives\/negatives are possible.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">7) Lineage (for supported integrations)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Shows data movement and transformations across pipelines and systems.<\/li>\n<li><strong>Why it matters:<\/strong> Enables impact analysis and troubleshooting.<\/li>\n<li><strong>Practical benefit:<\/strong> Faster root cause analysis when downstream 
reports or pipelines break.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> Lineage coverage depends on supported sources\/integrations (verify current list).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">8) Role-based access control (Purview roles + Azure RBAC)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Controls who can administer the catalog, manage sources, run scans, and curate metadata.<\/li>\n<li><strong>Why it matters:<\/strong> Governance tools often contain sensitive metadata.<\/li>\n<li><strong>Practical benefit:<\/strong> Separation of duties (admins vs curators vs readers).<\/li>\n<li><strong>Limitations\/caveats:<\/strong> You must manage two layers: Purview roles and source permissions.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">9) APIs and automation (where supported)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>What it does:<\/strong> Enables metadata operations via REST APIs and supported SDKs.<\/li>\n<li><strong>Why it matters:<\/strong> Enterprises need repeatable deployment and automation.<\/li>\n<li><strong>Practical benefit:<\/strong> Infrastructure-as-code patterns for Purview account deployment; scripted scans\/metadata workflows.<\/li>\n<li><strong>Limitations\/caveats:<\/strong> API coverage and tooling evolve; verify in official docs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">7. Architecture and How It Works<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">High-level architecture<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level:\n1. You deploy a <strong>Microsoft Purview account<\/strong> in Azure.\n2. You <strong>register data sources<\/strong> (ADLS, Azure SQL, etc.) in the account.\n3. You configure <strong>credentials<\/strong> (managed identity or other supported methods) and <strong>scans<\/strong>.\n4. Purview scanning reads <strong>metadata<\/strong> from sources and stores it in the <strong>Data Map<\/strong>.\n5. 
Users search\/browse <strong>Data Catalog<\/strong> and enrich assets with business context.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Request\/data\/control flow<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Control plane:<\/strong> Admins configure collections, roles, sources, scans, and policies in Purview.<\/li>\n<li><strong>Data plane (metadata plane):<\/strong> Scanning connects to data sources to read <em>metadata<\/em> (and possibly sample content for classification depending on configuration; verify and control this).<\/li>\n<li><strong>Consumption:<\/strong> Users query\/search the metadata index and view asset pages.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Integrations with related Azure services (common)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Data Lake Storage Gen2<\/strong>: Catalog file systems, folders, and files; extract schema for supported formats.<\/li>\n<li><strong>Azure SQL \/ Synapse<\/strong>: Catalog databases, schemas, tables, and columns.<\/li>\n<li><strong>Azure Data Factory \/ Synapse pipelines<\/strong>: Potential lineage integration (verify supported scenarios).<\/li>\n<li><strong>Power BI<\/strong>: Catalog reports\/datasets and enable discovery (tenant configuration required; verify).<\/li>\n<li><strong>Azure Key Vault<\/strong>: Store credentials\/secrets where applicable (depending on connector method).<\/li>\n<li><strong>Azure Monitor<\/strong>: Diagnostic logs and metrics routing (availability varies; verify in docs).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Dependency services<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure AD<\/strong> for identity and authentication.<\/li>\n<li><strong>Source services<\/strong> being scanned (Storage, SQL, etc.).<\/li>\n<li>Optional <strong>networking<\/strong> components for private connectivity.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Security\/authentication model<\/h3>\n\n\n\n<ul 
class=\"wp-block-list\">\n<li><strong>User access to Data Catalog:<\/strong> authenticated via Azure AD; authorization governed by Purview roles and potentially collection-level permissions.<\/li>\n<li><strong>Scanner access to data sources:<\/strong> typically via the Purview managed identity (system-assigned managed identity) or configured credentials, plus Azure RBAC\/data-plane permissions on the source.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Networking model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>By default, Purview endpoints may be publicly accessible.<\/li>\n<li>For production, consider <strong>private endpoints<\/strong> and managed network features where supported, especially when scanning sources inside private VNets. Verify current networking options in Purview docs for your region.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Monitoring\/logging\/governance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Treat metadata as sensitive: asset names, schemas, and classifications can reveal business secrets.<\/li>\n<li>Enable diagnostic logs where supported.<\/li>\n<li>Establish governance processes:\n<ul class=\"wp-block-list\">\n<li>Ownership and stewardship<\/li>\n<li>Glossary curation<\/li>\n<li>Scan schedules and change management<\/li>\n<li>Access reviews for catalog roles<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Simple architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code class=\"language-mermaid\">flowchart LR\n  U[Data Consumer] --&gt;|Search\/Browse| DC[&quot;Data Catalog&lt;br\/&gt;(Microsoft Purview)&quot;]\n  A[Data Steward] --&gt;|Curate glossary &amp; metadata| DC\n\n  DC --&gt; DM[&quot;Data Map&lt;br\/&gt;(Metadata Store)&quot;]\n  DC --&gt;|Scan metadata| SRC[(Azure Data Sources)]\n  SRC --&gt; ADLS[ADLS Gen2]\n  SRC --&gt; SQL[&quot;Azure SQL \/ Synapse&quot;]\n  DC --&gt; AAD[&quot;Azure AD&lt;br\/&gt;AuthN\/AuthZ&quot;]\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Production-style architecture diagram (Mermaid)<\/h3>\n\n\n\n<pre><code 
class=\"language-mermaid\">flowchart TB\n  subgraph Tenant[Azure AD Tenant]\n    AAD[Azure AD]\n  end\n\n  subgraph Azure[&quot;Azure Subscription(s)&quot;]\n    subgraph RG[Resource Group]\n      P[&quot;Microsoft Purview Account&lt;br\/&gt;Data Catalog + Data Map&quot;]\n      PE[&quot;Private Endpoints&lt;br\/&gt;(if enabled)&quot;]\n      LAW[&quot;Log Analytics Workspace&lt;br\/&gt;(optional)&quot;]\n    end\n\n    subgraph DataPlane[Data Sources]\n      ADLS[&quot;ADLS Gen2&lt;br\/&gt;Private Endpoint (optional)&quot;]\n      SQL[&quot;Azure SQL \/ Synapse&lt;br\/&gt;Private Endpoint (optional)&quot;]\n      ADF[&quot;Azure Data Factory \/ Synapse Pipelines&lt;br\/&gt;(lineage where supported)&quot;]\n      PBI[&quot;Power BI&lt;br\/&gt;(tenant integration)&quot;]\n    end\n  end\n\n  Users[&quot;Analysts \/ Engineers \/ Stewards&quot;] --&gt;|Azure AD sign-in| AAD\n  AAD --&gt;|Tokens| P\n\n  P --&gt;|Metadata scans via MI\/credentials| ADLS\n  P --&gt;|Metadata scans via MI\/credentials| SQL\n  ADF --&gt;|&quot;Lineage integration (supported cases)&quot;| P\n  PBI --&gt;|&quot;Metadata integration (tenant config)&quot;| P\n\n  P --&gt;|Diagnostic logs| LAW\n  P --&gt; PE\n  PE --&gt; ADLS\n  PE --&gt; SQL\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">8. 
Prerequisites<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Account\/subscription\/tenant requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>An active <strong>Azure subscription<\/strong>.<\/li>\n<li>Access to an <strong>Azure AD tenant<\/strong> associated with the subscription.<\/li>\n<li>Ability to register resource providers if needed (commonly <code>Microsoft.Purview<\/code>).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Permissions \/ IAM roles<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To complete the lab you typically need:\n&#8211; <strong>At Azure scope<\/strong>:\n  &#8211; <code>Owner<\/code> or <code>Contributor<\/code> on the resource group\/subscription to create resources.\n  &#8211; Permissions to assign roles (for RBAC) if you will grant the Purview managed identity access to Storage.\n&#8211; <strong>Inside Purview (Data Catalog)<\/strong>:\n  &#8211; Purview roles (for example, data source admin \/ collection admin) depending on your tasks. Role names and exact permissions should be verified in current Purview RBAC documentation.\n&#8211; <strong>On the data source<\/strong> (lab uses ADLS Gen2):\n  &#8211; Grant Purview\u2019s managed identity <strong>read\/list<\/strong> permissions. 
This commonly involves Azure RBAC roles like <strong>Storage Blob Data Reader<\/strong> at the storage account scope (verify required permissions for scanning in official docs).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Billing requirements<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>You must have billing enabled; Purview pricing is usage-based (see section 9).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">CLI\/SDK\/tools needed<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure Portal access.<\/li>\n<li>Optional:\n<ul class=\"wp-block-list\">\n<li><strong>Azure CLI<\/strong> for resource creation and uploads: https:\/\/learn.microsoft.com\/cli\/azure\/install-azure-cli<\/li>\n<li><strong>Azure Storage Explorer<\/strong> for uploading sample files: https:\/\/azure.microsoft.com\/products\/storage\/storage-explorer\/<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Region availability<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microsoft Purview availability varies by region. Verify the currently supported regions:<br\/>\n  https:\/\/learn.microsoft.com\/purview\/purview-portal<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Quotas\/limits<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Purview enforces limits on scans, capacity, and catalog scale; these change over time. Verify current limits in official documentation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Prerequisite services for the lab<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Azure Storage account<\/strong> with <strong>Hierarchical namespace enabled<\/strong> (ADLS Gen2).<\/li>\n<li>A small sample file (CSV or Parquet) to demonstrate discovery.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">9. Pricing \/ Cost<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Do not treat this section as a price quote.<\/strong> Pricing varies by region, billing agreement, and feature usage. 
Always confirm on official pricing pages.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Current pricing model (high level)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Microsoft Purview pricing commonly includes:\n&#8211; <strong>Data Map capacity<\/strong> (often billed as \u201ccapacity units\u201d per hour):<br\/>\n  The Data Map powers catalog search and metadata graph storage\/operations.\n&#8211; <strong>Scanning and classification<\/strong> (often billed by compute\/time, such as vCore-hours or similar):<br\/>\n  Costs depend on number of sources, scan frequency, and depth of classification.\n&#8211; Potential additional charges for optional capabilities (for example, insights\/reporting, data estate features)\u2014verify in official pricing.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Official pricing page (verify latest):<br\/>\n&#8211; Microsoft Purview pricing: https:\/\/azure.microsoft.com\/pricing\/details\/microsoft-purview\/<br\/>\n&#8211; Azure Pricing Calculator: https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pricing dimensions to understand<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Hours of Data Map capacity<\/strong>: baseline cost driver for keeping the catalog active.<\/li>\n<li><strong>Scan execution costs<\/strong>: driven by scan runtime and classification settings.<\/li>\n<li><strong>Number\/type of sources<\/strong>: some connectors are heavier (large SQL estates, deep folder trees).<\/li>\n<li><strong>Metadata volume<\/strong>: number of assets, columns, files, partitions; can impact performance and operational overhead.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Free tier<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Microsoft Purview has historically offered limited free scanning\/capacity in some contexts or trials, but this changes. 
<strong>Verify in official pricing<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost drivers (direct)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Running Data Map capacity continuously.<\/li>\n<li>Scheduling frequent scans (e.g., hourly scans across many sources).<\/li>\n<li>Enabling deep classification or patterns that increase scan runtime.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Hidden or indirect costs<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Data source costs<\/strong>: scanning may generate requests\/transactions on Storage or SQL.<\/li>\n<li><strong>Network<\/strong>: if scanning crosses network boundaries, private endpoints and data egress considerations may apply.<\/li>\n<li><strong>Log storage<\/strong>: sending diagnostics to Log Analytics incurs ingestion\/retention cost.<\/li>\n<li><strong>Operational labor<\/strong>: stewardship and governance activities are real costs; plan an operating model.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network\/data transfer implications<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Metadata scanning usually reads metadata, but classification can require reading file contents or samples depending on configuration and connector behavior. Control scan rules and scope to manage both cost and risk. 
<strong>Verify scan behavior per connector<\/strong>.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">How to optimize cost<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Start with a <strong>narrow scope<\/strong> (one collection, one storage account) and expand.<\/li>\n<li>Schedule scans based on data change rate (daily\/weekly for stable sources).<\/li>\n<li>Exclude noisy paths (temp folders, staging, logs).<\/li>\n<li>Curate \u201cgold\u201d datasets first; avoid trying to catalog everything on day one.<\/li>\n<li>Use diagnostics selectively; retain only what you need.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Example low-cost starter estimate (no fabricated numbers)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A low-cost pilot typically includes:\n&#8211; One Purview account\n&#8211; One ADLS Gen2 source\n&#8211; A small number of scanned folders\/containers\n&#8211; A daily scan schedule\n&#8211; Minimal classification rules initially<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Exact costs depend on:\n&#8211; Your region\u2019s Purview rates\n&#8211; How long the Data Map capacity is active\n&#8211; Scan runtime for your data patterns<br\/>\nUse the pricing calculator with your region and expected scan schedule.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Example production cost considerations<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Production environments commonly add:\n&#8211; Multiple sources (data lake + SQL + BI)\n&#8211; Higher scan frequency for critical assets\n&#8211; Deeper classification and glossary stewardship\n&#8211; Private networking and centralized logging<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The biggest levers are:\n&#8211; Data Map capacity-hours\n&#8211; Total scan compute\/time across the estate\n&#8211; Scan scope (asset counts and folder depth)<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">10. 
Step-by-Step Hands-On Tutorial<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">This lab uses the <strong>current Azure approach<\/strong>: <strong>Microsoft Purview Data Catalog<\/strong> (Data Catalog) scanning an <strong>ADLS Gen2<\/strong> storage account. It is designed to be low-risk and relatively low-cost, but costs can still occur\u2014monitor your Purview pricing dimensions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Objective<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Deploy Data Catalog (via a Microsoft Purview account), create a small ADLS Gen2 dataset, scan it into the catalog, and validate that you can search and enrich metadata (owner\/description\/glossary).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lab Overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">You will:\n1. Create an ADLS Gen2 storage account and upload a sample CSV.\n2. Create a Microsoft Purview account (Data Catalog).\n3. Grant Purview\u2019s managed identity read access to the storage account for scanning.\n4. Register the storage account as a data source and run a scan.\n5. Search and browse the discovered asset.\n6. Add business metadata (description\/glossary term).\n7. Clean up resources.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 1: Create a resource group<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure Portal<\/strong>\n1. Go to <strong>Resource groups<\/strong> \u2192 <strong>Create<\/strong>.\n2. Choose your subscription.\n3. Name: <code>rg-datacatalog-lab<\/code>\n4. 
Region: choose one where Microsoft Purview is available (verify availability).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> A new resource group exists.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Optional Azure CLI<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az group create --name rg-datacatalog-lab --location &lt;YOUR_AZURE_REGION&gt;\n<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Step 2: Create an ADLS Gen2 storage account<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure Portal<\/strong>\n1. Go to <strong>Storage accounts<\/strong> \u2192 <strong>Create<\/strong>.\n2. Resource group: <code>rg-datacatalog-lab<\/code>\n3. Storage account name: globally unique, e.g. <code>stcataloglab&lt;random&gt;<\/code>\n4. Region: same as your RG (recommended)\n5. Performance: Standard (fine for lab)\n6. In <strong>Advanced<\/strong> (or <strong>Data Lake Storage Gen2<\/strong>) enable:\n   &#8211; <strong>Hierarchical namespace<\/strong> = <strong>Enabled<\/strong>\n7. Create the account.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> Storage account is created with hierarchical namespace enabled.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 3: Create a container (file system) and upload a sample CSV<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure Portal<\/strong>\n1. Open the storage account.\n2. Go to <strong>Storage browser<\/strong> (or <strong>Containers<\/strong> \/ <strong>Data Lake Gen2<\/strong> depending on portal UI).\n3. Create a container \/ file system named: <code>raw<\/code>\n4. Create a folder named: <code>sales<\/code>\n5. 
Upload a file named <code>orders.csv<\/code> into the <code>sales<\/code> folder.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you do not already have a sample file, create a local <code>orders.csv<\/code> with content like:<\/p>\n\n\n\n<pre><code class=\"language-csv\">order_id,order_date,customer_id,amount,country\n1001,2025-01-05,C001,120.50,US\n1002,2025-01-08,C002,89.99,GB\n1003,2025-01-11,C003,42.10,IN\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> File exists at a path like <code>raw\/sales\/orders.csv<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; Confirm you can see the file in the portal Storage browser.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 4: Create the Data Catalog (Microsoft Purview account)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure Portal<\/strong>\n1. Search for <strong>Microsoft Purview<\/strong> and open it.\n2. Click <strong>Create<\/strong>.\n3. Basics:\n   &#8211; Subscription: your subscription\n   &#8211; Resource group: <code>rg-datacatalog-lab<\/code>\n   &#8211; Name: <code>pv-datacatalog-lab<\/code> (must be unique per naming rules)\n   &#8211; Region: choose a supported region\n4. Review + create.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> A Microsoft Purview account is deployed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; Open the Purview account resource \u2192 confirm the deployment shows as <strong>Succeeded<\/strong>.\n&#8211; Locate the link to open the Purview portal \/ governance portal (naming in portal may vary).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 5: Assign permissions for scanning (Purview managed identity \u2192 Storage)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Purview scans your data sources using an identity\/credential. 
For many Azure sources, the simplest approach is granting the Purview account\u2019s <strong>managed identity<\/strong> the required data-plane permissions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure Portal (Storage account RBAC)<\/strong>\n1. Open the <strong>Storage account<\/strong> you created.\n2. Go to <strong>Access control (IAM)<\/strong> \u2192 <strong>Add role assignment<\/strong>.\n3. Role: typically <strong>Storage Blob Data Reader<\/strong> (verify required roles for Purview scanning in the official connector doc).\n4. Assign access to: <strong>Managed identity<\/strong>\n5. Select members:\n   &#8211; Choose your <strong>Microsoft Purview account<\/strong> managed identity.\n6. Review + assign.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> Purview managed identity has read access to Storage blobs\/files for scanning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; In the storage account IAM, confirm the role assignment exists.<\/p>\n\n\n\n<blockquote>\n<p>If you cannot find the Purview managed identity in the picker, ensure the Purview account has a system-assigned managed identity enabled (commonly enabled by default for scanning scenarios). Verify in Purview account identity settings and official docs.<\/p>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\">Step 6: Open the Purview portal and configure collections (optional but recommended)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Inside the Purview portal:\n1. Find <strong>Collections<\/strong>.\n2. Confirm the default root collection exists.\n3. (Optional) Create a child collection named <code>lab<\/code>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> A collection exists to organize lab assets.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 7: Register the ADLS Gen2 storage account as a data source<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In the Purview portal:\n1. 
Go to <strong>Data map<\/strong> \u2192 <strong>Sources<\/strong> (UI labels vary slightly).\n2. <strong>Register<\/strong> a source.\n3. Choose source type: <strong>Azure Data Lake Storage Gen2<\/strong> (or Azure Storage; pick the option matching ADLS Gen2).\n4. Select your subscription and storage account (or enter resource ID if required).\n5. Assign it to the <code>lab<\/code> collection (or default collection).\n6. Save\/register.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> The storage account appears as a registered source.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 8: Create and run a scan<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In the Purview portal:\n1. Open the registered ADLS Gen2 source.\n2. Choose <strong>New scan<\/strong>.\n3. Configure:\n   &#8211; Name: <code>scan-raw-sales<\/code>\n   &#8211; Credential \/ authentication: choose the managed identity option if available for your configuration\n   &#8211; Scope: select the <code>raw\/sales<\/code> path (avoid scanning the entire account for the lab)\n   &#8211; Classification rules: start with defaults (you can tune later)\n   &#8211; Scan rule set: default (unless you have a reason to customize)\n4. Run the scan now.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> Scan status moves from queued\/running to succeeded, and assets appear in the catalog.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Verification<\/strong>\n&#8211; Check scan run history\/status.\n&#8211; If the scan succeeds, proceed. If it fails, go to Troubleshooting.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 9: Search\/browse the discovered asset in Data Catalog<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In the Purview portal:\n1. Go to <strong>Data Catalog<\/strong> \u2192 <strong>Browse<\/strong> or <strong>Search<\/strong>.\n2. Search for: <code>orders.csv<\/code> or <code>orders<\/code>.\n3. 
Open the asset.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> You can view the asset details page, including technical metadata (file type, path, possibly schema inference depending on support).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Step 10: Enrich the asset with business metadata (description, owner, glossary)<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>On the asset page, add:\n   &#8211; <strong>Description<\/strong>: e.g., \u201cRaw sales orders exported daily from ecommerce system.\u201d\n   &#8211; <strong>Owner \/ Contact<\/strong>: assign yourself for the lab.<\/li>\n<li>Create a glossary term:\n   &#8211; Go to <strong>Glossary<\/strong> \u2192 <strong>New term<\/strong>\n   &#8211; Term: <code>Order<\/code>\n   &#8211; Definition: \u201cA purchase transaction placed by a customer.\u201d<\/li>\n<li>Link the glossary term to the asset (UI typically supports adding terms to assets).<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Expected outcome:<\/strong> The asset becomes easier to understand; search results may reflect glossary linkages and description content.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Validation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Use this checklist:\n&#8211; [ ] Purview account exists and you can open the Purview portal.\n&#8211; [ ] ADLS Gen2 storage account exists and has <code>raw\/sales\/orders.csv<\/code>.\n&#8211; [ ] Purview managed identity has Storage read permissions.\n&#8211; [ ] Source is registered in Purview.\n&#8211; [ ] Scan completes successfully.\n&#8211; [ ] <code>orders.csv<\/code> appears in Data Catalog search\/browse.\n&#8211; [ ] You can add description\/owner and create\/link a glossary term.<\/p>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Troubleshooting<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Common issues and realistic fixes:<\/p>\n\n\n\n<ol 
class=\"wp-block-list\">\n<li>\n<p><strong>Scan fails with permission\/authorization error<\/strong>\n   &#8211; <strong>Cause:<\/strong> Purview identity lacks data-plane permissions.\n   &#8211; <strong>Fix:<\/strong> Re-check Storage IAM role assignment for the Purview managed identity. Ensure it\u2019s applied at the correct scope (storage account is simplest). Verify required role in connector docs.<\/p>\n<\/li>\n<li>\n<p><strong>Cannot select the Purview managed identity<\/strong>\n   &#8211; <strong>Cause:<\/strong> Identity not enabled or directory permissions delay.\n   &#8211; <strong>Fix:<\/strong> Confirm the Purview account has a managed identity enabled (where applicable). Wait a few minutes and retry. Verify in official docs.<\/p>\n<\/li>\n<li>\n<p><strong>No assets appear after a successful scan<\/strong>\n   &#8211; <strong>Cause:<\/strong> Scan scope excludes the folder, or file type not included, or scan rule set excludes CSV.\n   &#8211; <strong>Fix:<\/strong> Confirm scan path includes <code>raw\/sales<\/code>. Verify scan rule set includes CSV. Re-run scan.<\/p>\n<\/li>\n<li>\n<p><strong>Scan takes too long<\/strong>\n   &#8211; <strong>Cause:<\/strong> Scope too broad (scanning entire storage account).\n   &#8211; <strong>Fix:<\/strong> Narrow the scan scope to a specific container\/folder. 
Reduce classification depth for the lab.<\/p>\n<\/li>\n<li>\n<p><strong>You can\u2019t edit metadata (description\/owner)<\/strong>\n   &#8211; <strong>Cause:<\/strong> Your user lacks Purview catalog curation permissions for that collection.\n   &#8211; <strong>Fix:<\/strong> Have a Purview admin grant you the appropriate Purview role at the correct collection scope.<\/p>\n<\/li>\n<li>\n<p><strong>Purview resource not available in your region<\/strong>\n   &#8211; <strong>Cause:<\/strong> Region availability constraints.\n   &#8211; <strong>Fix:<\/strong> Create resources in a supported region (verify current region list).<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">Cleanup<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To avoid ongoing charges:\n1. Delete the resource group <strong>rg-datacatalog-lab<\/strong>:\n   &#8211; This removes the Purview account and storage account.\n2. Confirm deletion completes.\n3. If you used diagnostic logs to Log Analytics, ensure the workspace is deleted (if created) or stop retention\/ingestion as required.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Azure CLI cleanup<\/strong><\/p>\n\n\n\n<pre><code class=\"language-bash\">az group delete --name rg-datacatalog-lab --yes --no-wait\n<\/code><\/pre>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. 
Best Practices<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Architecture best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Start with a target operating model<\/strong>: define who owns collections, scanning, glossary, and curation.<\/li>\n<li>Use <strong>collections aligned to your org structure<\/strong> (domains, business units, environments).<\/li>\n<li>Implement a <strong>curation pattern<\/strong>:\n<ul class=\"wp-block-list\">\n<li>Raw \u2192 curated \u2192 certified (even if \u201ccertified\u201d is a process rather than a feature flag)<\/li>\n<\/ul>\n<\/li>\n<li>Integrate Data Catalog with your analytics platform rollout:\n<ul class=\"wp-block-list\">\n<li>As new sources are deployed, add \u201ccatalog registration + scan\u201d to the delivery checklist.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">IAM\/security best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Apply <strong>least privilege<\/strong>:\n<ul class=\"wp-block-list\">\n<li>The Purview scanning identity should have <em>read-only<\/em> access, limited to what scanning requires.<\/li>\n<li>Avoid granting write permissions to data sources.<\/li>\n<\/ul>\n<\/li>\n<li>Separate duties:\n<ul class=\"wp-block-list\">\n<li>Catalog admins vs data stewards vs catalog readers.<\/li>\n<\/ul>\n<\/li>\n<li>Perform <strong>access reviews<\/strong> regularly for Purview roles and collection admins.<\/li>\n<li>Treat metadata as sensitive: schemas and names can reveal confidential business concepts.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Cost best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Control scan scope:\n<ul class=\"wp-block-list\">\n<li>Avoid scanning entire lakes by default.<\/li>\n<li>Exclude transient\/staging paths.<\/li>\n<\/ul>\n<\/li>\n<li>Schedule scans based on change frequency.<\/li>\n<li>Pilot with a limited number of sources and expand based on value.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Performance best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Keep scans targeted and avoid unnecessary deep classification.<\/li>\n<li>Use incremental governance: prioritize high-value datasets first.<\/li>\n<\/ul>\n\n\n\n<h3 
class=\"wp-block-heading\">Reliability best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use consistent naming conventions for scans and collections.<\/li>\n<li>Document scan schedules and owners for operational continuity.<\/li>\n<li>Use monitoring\/alerts (where available) for scan failures and operational health.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Operations best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Maintain a runbook:\n<ul class=\"wp-block-list\">\n<li>Scan failure triage<\/li>\n<li>Permission change process<\/li>\n<li>New source onboarding<\/li>\n<\/ul>\n<\/li>\n<li>Use tagging standards for Azure resources (Purview account, resource group, storage accounts):\n<ul class=\"wp-block-list\">\n<li><code>env<\/code>, <code>owner<\/code>, <code>costCenter<\/code>, <code>dataDomain<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Governance\/tagging\/naming best practices<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Collections: <code>prod<\/code>, <code>nonprod<\/code>, then domains under each (or domains first, environments second; pick one).<\/li>\n<li>Glossary:\n<ul class=\"wp-block-list\">\n<li>Define term approval workflow.<\/li>\n<li>Add synonyms and examples to improve adoption.<\/li>\n<\/ul>\n<\/li>\n<li>Metadata:\n<ul class=\"wp-block-list\">\n<li>Require owner + description for curated datasets.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">12. 
Security Considerations<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Identity and access model<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>User access<\/strong> uses Microsoft Entra ID (formerly Azure AD) authentication.<\/li>\n<li><strong>Authorization<\/strong> is enforced by Purview roles and collection scoping.<\/li>\n<li><strong>Scanning access<\/strong> to sources is separate and requires explicit permissions on each data source.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Key principle: <strong>Catalog access \u2260 data access<\/strong>.<br\/>\nSeeing metadata does not automatically grant access to the underlying data, but metadata itself can be sensitive.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Encryption<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure services generally encrypt data at rest and in transit. Confirm the specifics for Microsoft Purview in official security documentation:\n  https:\/\/learn.microsoft.com\/purview\/<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Network exposure<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>For production, evaluate:\n<ul class=\"wp-block-list\">\n<li>Public endpoint exposure<\/li>\n<li>Private endpoints for Purview and sources (where supported)<\/li>\n<li>Scanning across VNets and private networks<\/li>\n<\/ul>\n<\/li>\n<li>Verify supported private networking patterns in the Purview docs before finalizing the design.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secrets handling<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Prefer <strong>managed identity<\/strong> over stored secrets where supported.<\/li>\n<li>If a connector requires secrets:\n<ul class=\"wp-block-list\">\n<li>Store them in <strong>Azure Key Vault<\/strong><\/li>\n<li>Restrict Key Vault access<\/li>\n<li>Rotate secrets regularly<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Audit\/logging<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use <strong>Azure Activity Log<\/strong> for resource-level operations.<\/li>\n<li>Enable Purview diagnostic settings (where available) to route logs to Log Analytics \/ Event Hub 
\/ Storage for retention and analysis.<\/li>\n<li>Audit role assignments and permission changes (Azure RBAC + Purview roles).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Compliance considerations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Understand your regulatory requirements:\n<ul class=\"wp-block-list\">\n<li>Data residency (Purview region)<\/li>\n<li>Metadata retention and access<\/li>\n<li>Separation between business units<\/li>\n<\/ul>\n<\/li>\n<li>Always verify compliance statements in Microsoft\u2019s trust documentation.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Common security mistakes<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Granting Purview scanning identity overly broad permissions (e.g., Storage Blob Data Owner).<\/li>\n<li>Making the catalog widely readable when metadata reveals confidential projects.<\/li>\n<li>Scanning sensitive sources without aligning classification policies and steward review.<\/li>\n<li>No owner\/steward model, leading to stale and misleading metadata.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Secure deployment recommendations<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use private networking where appropriate.<\/li>\n<li>Limit catalog visibility by collection and role.<\/li>\n<li>Apply least privilege to scanning identities.<\/li>\n<li>Establish a governance workflow for glossary and curated dataset standards.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">13. 
Limitations and Gotchas<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Because Microsoft Purview evolves rapidly, confirm these in official docs for your tenant\/region.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Service naming confusion:<\/strong> \u201cAzure Data Catalog\u201d (legacy) is retired; the supported service is <strong>Microsoft Purview Data Catalog<\/strong>.<\/li>\n<li><strong>Connector variability:<\/strong> Not all sources support the same depth of metadata extraction, classification, or lineage.<\/li>\n<li><strong>Lineage coverage is not universal:<\/strong> Lineage is powerful but depends on supported systems and integration configuration.<\/li>\n<li><strong>Permissions are two-layered:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Purview roles control catalog administration\/curation<\/li>\n<li>Source permissions control scanning<\/li>\n<\/ul>\n<\/li>\n<li><strong>Cost surprises:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Scanning broad scopes (entire lakes) can increase scan compute\/time.<\/li>\n<li>Keeping Data Map capacity running continuously is a baseline cost driver.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Metadata sensitivity:<\/strong> Even without data access, metadata can expose sensitive business details.<\/li>\n<li><strong>Regional availability constraints:<\/strong> Some features and regions may lag; verify supported regions and features.<\/li>\n<li><strong>Operational overhead:<\/strong> A catalog without stewardship becomes stale; plan people\/process, not just tooling.<\/li>\n<li><strong>Migration challenges:<\/strong> If you are migrating from a legacy catalog, expect mapping work for glossary terms, owners, and asset identifiers.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">14. Comparison with Alternatives<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Data Catalog (Microsoft Purview Data Catalog) is one option in the Azure Analytics governance space. 
Alternatives include other cloud catalogs and open-source metadata platforms.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Option<\/th>\n<th>Best For<\/th>\n<th>Strengths<\/th>\n<th>Weaknesses<\/th>\n<th>When to Choose<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Data Catalog (Microsoft Purview Data Catalog)<\/strong><\/td>\n<td>Azure-centric enterprises needing governance + discovery<\/td>\n<td>Deep Azure integration, collections, scanning, glossary, enterprise governance patterns<\/td>\n<td>Costs can grow with scans\/capacity; connector\/lineage coverage varies<\/td>\n<td>You run analytics on Azure and need an enterprise catalog<\/td>\n<\/tr>\n<tr>\n<td><strong>Microsoft Fabric (catalog\/discovery features within Fabric)<\/strong><\/td>\n<td>Teams standardized on Fabric<\/td>\n<td>Tight integration with Fabric workloads<\/td>\n<td>Not a full replacement for enterprise-wide multi-source governance in all cases<\/td>\n<td>You are all-in on Fabric and scope is mainly Fabric assets<\/td>\n<\/tr>\n<tr>\n<td><strong>Azure Synapse Studio (workspace metadata)<\/strong><\/td>\n<td>Synapse-centric teams<\/td>\n<td>Convenient within Synapse workflows<\/td>\n<td>Not an enterprise-wide catalog across all sources<\/td>\n<td>You only need discovery within a Synapse workspace<\/td>\n<\/tr>\n<tr>\n<td><strong>Databricks Unity Catalog (on Azure Databricks)<\/strong><\/td>\n<td>Lakehouse governance for Databricks users<\/td>\n<td>Strong governance for Databricks data\/AI assets<\/td>\n<td>Focused on Databricks ecosystem; not a general Azure-wide catalog<\/td>\n<td>Your primary platform is Databricks and you want unified governance there<\/td>\n<\/tr>\n<tr>\n<td><strong>AWS Glue Data Catalog<\/strong><\/td>\n<td>AWS data lake and Athena\/Glue ecosystems<\/td>\n<td>Native AWS integration; widely used in AWS analytics<\/td>\n<td>Not Azure-native; cross-cloud adds complexity<\/td>\n<td>Your estate is primarily on 
AWS<\/td>\n<\/tr>\n<tr>\n<td><strong>Google Cloud Dataplex \/ Data Catalog<\/strong><\/td>\n<td>GCP analytics governance<\/td>\n<td>Strong GCP integration<\/td>\n<td>Not Azure-native<\/td>\n<td>Your estate is primarily on GCP<\/td>\n<\/tr>\n<tr>\n<td><strong>Apache Atlas (self-managed)<\/strong><\/td>\n<td>Custom governance in self-managed Hadoop\/lake stacks<\/td>\n<td>Open-source; customizable<\/td>\n<td>Operational burden; scaling and UI experience vary<\/td>\n<td>You need self-managed control and accept ops overhead<\/td>\n<\/tr>\n<tr>\n<td><strong>DataHub \/ Amundsen (self-managed)<\/strong><\/td>\n<td>Metadata platform with extensibility<\/td>\n<td>Strong community; integrates via pipelines<\/td>\n<td>You must host\/operate; governance workflows depend on implementation<\/td>\n<td>You want open metadata platform with custom integrations<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">15. Real-World Example<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Enterprise example (regulated industry)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A bank has multiple Azure subscriptions with ADLS Gen2, Azure SQL, Synapse, and Power BI. Auditors require proof of sensitive data discovery and ownership. 
Analysts waste time searching for \u201capproved\u201d datasets.<\/li>\n<li><strong>Proposed architecture:<\/strong>\n<ul class=\"wp-block-list\">\n<li>One or more Microsoft Purview accounts (depending on org boundaries) hosting Data Catalog.<\/li>\n<li>Collections aligned to business domains (Retail, Corporate, Risk) with delegated admins.<\/li>\n<li>Scans for ADLS Gen2 and Azure SQL scheduled daily\/weekly.<\/li>\n<li>Glossary curated by a data governance office; terms linked to key assets.<\/li>\n<li>Diagnostic logs routed to a central Log Analytics workspace.<\/li>\n<li>Private endpoints for Purview and data sources (where supported) for network control.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Why Data Catalog was chosen:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Azure-native identity model (Microsoft Entra ID, formerly Azure AD), governance features (collections), and integration with the Azure data estate.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Expected outcomes:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Faster discovery of trusted datasets.<\/li>\n<li>Improved audit readiness (ownership\/classification visibility).<\/li>\n<li>Reduced incident resolution time through better metadata and lineage (where supported).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Startup\/small-team example<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Problem:<\/strong> A startup scales from one database to a small lakehouse. 
Analysts and engineers repeatedly ask where the latest datasets live and what columns mean.<\/li>\n<li><strong>Proposed architecture:<\/strong><\/li>\n<li>Single Microsoft Purview account (Data Catalog).<\/li>\n<li>One collection for <code>prod<\/code>, one for <code>dev<\/code>.<\/li>\n<li>Scans only for the curated zone of ADLS Gen2 and the main Azure SQL database.<\/li>\n<li>Lightweight glossary with 20\u201350 critical terms.<\/li>\n<li><strong>Why Data Catalog was chosen:<\/strong><\/li>\n<li>Quick setup, managed service, and search-based discovery without building a custom metadata app.<\/li>\n<li><strong>Expected outcomes:<\/strong><\/li>\n<li>Better onboarding.<\/li>\n<li>Fewer duplicated datasets.<\/li>\n<li>Clear ownership and definitions for core KPIs.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">16. FAQ<\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li>\n<p><strong>Is Azure Data Catalog still available?<\/strong><br\/>\n   No. The standalone <strong>Azure Data Catalog<\/strong> product was retired. The supported \u201cData Catalog\u201d capability in Azure today is delivered through <strong>Microsoft Purview Data Catalog<\/strong>. Verify retirement details and timelines in Microsoft announcements and docs.<\/p>\n<\/li>\n<li>\n<p><strong>What is the difference between Microsoft Purview and Data Catalog?<\/strong><br\/>\n   Microsoft Purview is the broader governance service. <strong>Data Catalog<\/strong> is a core capability within it focused on <strong>discovering and curating metadata<\/strong>.<\/p>\n<\/li>\n<li>\n<p><strong>Does Data Catalog store my actual data?<\/strong><br\/>\n   It stores <strong>metadata<\/strong> (information about data), not typically the full dataset. Classification may inspect portions of content depending on configuration and connector behavior\u2014verify per connector.<\/p>\n<\/li>\n<li>\n<p><strong>Does catalog access grant data access?<\/strong><br\/>\n   No. 
Catalog access and underlying data access are separate. You still need permissions on the actual data source to read the data.<\/p>\n<\/li>\n<li>\n<p><strong>Which data sources are supported?<\/strong><br\/>\n   Support depends on Purview connectors. Azure sources commonly include ADLS Gen2 and Azure SQL, plus others. Verify the current supported sources list in official docs.<\/p>\n<\/li>\n<li>\n<p><strong>Can I catalog data across multiple Azure subscriptions?<\/strong><br\/>\n   Often yes, within the same Azure AD tenant, provided scanning identities and permissions are configured correctly.<\/p>\n<\/li>\n<li>\n<p><strong>How does scanning authenticate to Azure Storage?<\/strong><br\/>\n   Commonly via the Purview account\u2019s <strong>managed identity<\/strong> with Azure RBAC roles on the Storage account. Some scenarios may use other credential methods\u2014verify in connector docs.<\/p>\n<\/li>\n<li>\n<p><strong>How often should I run scans?<\/strong><br\/>\n   Based on data change rate and cost. Start with daily\/weekly scans for stable datasets; increase frequency only for high-value assets that change often.<\/p>\n<\/li>\n<li>\n<p><strong>What is a collection and why do I need it?<\/strong><br\/>\n   Collections organize assets and allow <strong>delegated administration<\/strong> and access boundaries\u2014important at enterprise scale.<\/p>\n<\/li>\n<li>\n<p><strong>Can I create a business glossary and link it to assets?<\/strong><br\/>\n   Yes. The glossary is a key feature for standardizing definitions and improving discoverability.<\/p>\n<\/li>\n<li>\n<p><strong>Does Data Catalog provide lineage?<\/strong><br\/>\n   It can, but lineage depends on supported sources and integrations (for example, some pipeline tools and BI integrations). 
Verify lineage support for your stack.<\/p>\n<\/li>\n<li>\n<p><strong>How do I prevent scanning sensitive folders?<\/strong><br\/>\n   Narrow scan scope to specific paths, and use exclusion patterns\/rules where supported. Also apply least-privilege permissions.<\/p>\n<\/li>\n<li>\n<p><strong>How do I monitor scan failures?<\/strong><br\/>\n   Use scan run history in the Purview portal, and enable Azure diagnostics\/logging where supported to centralize monitoring.<\/p>\n<\/li>\n<li>\n<p><strong>Can I use Infrastructure as Code to deploy it?<\/strong><br\/>\n   The Purview account is an Azure resource and can be deployed via ARM\/Bicep\/Terraform patterns. Automation for scans and curation varies\u2014verify current API support.<\/p>\n<\/li>\n<li>\n<p><strong>What are the most common reasons catalogs fail to deliver value?<\/strong><br\/>\n   Scanning everything without curation, no ownership model, stale glossary, and lack of integration into delivery processes.<\/p>\n<\/li>\n<li>\n<p><strong>Is Data Catalog an Analytics service or a governance service?<\/strong><br\/>\n   It supports Analytics by improving discovery and trust, but it is fundamentally a <strong>governance\/metadata<\/strong> service used across analytics workflows.<\/p>\n<\/li>\n<li>\n<p><strong>How do I estimate costs?<\/strong><br\/>\n   Use the official Purview pricing page plus the Azure Pricing Calculator. Your main levers are Data Map capacity-hours and scan runtime\/frequency.<\/p>\n<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">17. 
Top Online Resources to Learn Data Catalog<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Resource Type<\/th>\n<th>Name<\/th>\n<th>Why It Is Useful<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Official documentation<\/td>\n<td>Microsoft Purview docs<\/td>\n<td>Primary source for current capabilities, connectors, and setup: https:\/\/learn.microsoft.com\/purview\/<\/td>\n<\/tr>\n<tr>\n<td>Official overview<\/td>\n<td>Microsoft Purview Data Catalog overview<\/td>\n<td>Explains concepts, roles, and discovery experience: https:\/\/learn.microsoft.com\/purview\/purview-data-catalog-overview<\/td>\n<\/tr>\n<tr>\n<td>Official quickstart<\/td>\n<td>Create a Microsoft Purview account<\/td>\n<td>Step-by-step account creation and basics (verify latest): https:\/\/learn.microsoft.com\/purview\/create-microsoft-purview-account<\/td>\n<\/tr>\n<tr>\n<td>Official connectors<\/td>\n<td>Microsoft Purview data sources \/ connectors<\/td>\n<td>Lists supported sources and configuration requirements (verify latest): https:\/\/learn.microsoft.com\/purview\/purview-data-sources<\/td>\n<\/tr>\n<tr>\n<td>Official pricing<\/td>\n<td>Microsoft Purview pricing<\/td>\n<td>Explains pricing dimensions: https:\/\/azure.microsoft.com\/pricing\/details\/microsoft-purview\/<\/td>\n<\/tr>\n<tr>\n<td>Cost estimation tool<\/td>\n<td>Azure Pricing Calculator<\/td>\n<td>Build region-specific estimates: https:\/\/azure.microsoft.com\/pricing\/calculator\/<\/td>\n<\/tr>\n<tr>\n<td>Architecture guidance<\/td>\n<td>Azure Architecture Center<\/td>\n<td>Broader Azure analytics\/governance architecture patterns: https:\/\/learn.microsoft.com\/azure\/architecture\/<\/td>\n<\/tr>\n<tr>\n<td>Security\/trust<\/td>\n<td>Microsoft Trust Center<\/td>\n<td>Compliance, privacy, and security posture: https:\/\/www.microsoft.com\/trust-center<\/td>\n<\/tr>\n<tr>\n<td>Video learning<\/td>\n<td>Microsoft Mechanics (YouTube)<\/td>\n<td>Often covers Purview governance concepts and updates (verify 
relevant episodes): https:\/\/www.youtube.com\/@MicrosoftMechanics<\/td>\n<\/tr>\n<tr>\n<td>Samples (verify official)<\/td>\n<td>Microsoft Purview GitHub (search)<\/td>\n<td>Code samples and API usage may exist; validate repo authenticity and currency: https:\/\/github.com\/search?q=Microsoft+Purview+samples<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">18. Training and Certification Providers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Institute<\/th>\n<th>Suitable Audience<\/th>\n<th>Likely Learning Focus<\/th>\n<th>Mode<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>DevOps engineers, cloud engineers, platform teams<\/td>\n<td>Azure governance\/DevOps adjacent skills; may include Purview and data platform modules (verify)<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>ScmGalaxy.com<\/td>\n<td>Students, engineers transitioning to DevOps\/cloud<\/td>\n<td>Fundamentals and tooling; may include cloud governance topics (verify)<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.scmgalaxy.com\/<\/td>\n<\/tr>\n<tr>\n<td>CLoudOpsNow.in<\/td>\n<td>Cloud operations and SRE\/ops teams<\/td>\n<td>Cloud operations practices, monitoring, governance basics (verify)<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.cloudopsnow.in\/<\/td>\n<\/tr>\n<tr>\n<td>SreSchool.com<\/td>\n<td>SREs, reliability engineers, ops leads<\/td>\n<td>Reliability practices that intersect with data platform operations (verify)<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.sreschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>AiOpsSchool.com<\/td>\n<td>Ops teams exploring AIOps<\/td>\n<td>Operations automation\/monitoring concepts; relevance to Purview is indirect (verify)<\/td>\n<td>Check website<\/td>\n<td>https:\/\/www.aiopsschool.com\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr 
class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">19. Top Trainers<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Platform\/Site<\/th>\n<th>Likely Specialization<\/th>\n<th>Suitable Audience<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RajeshKumar.xyz<\/td>\n<td>DevOps\/cloud training content (verify current offerings)<\/td>\n<td>Beginners to intermediate practitioners<\/td>\n<td>https:\/\/rajeshkumar.xyz\/<\/td>\n<\/tr>\n<tr>\n<td>devopstrainer.in<\/td>\n<td>DevOps and cloud training (verify modules)<\/td>\n<td>Engineers seeking practical training<\/td>\n<td>https:\/\/www.devopstrainer.in\/<\/td>\n<\/tr>\n<tr>\n<td>devopsfreelancer.com<\/td>\n<td>Freelance DevOps services\/training marketplace style (verify)<\/td>\n<td>Teams looking for short-term expert help<\/td>\n<td>https:\/\/www.devopsfreelancer.com\/<\/td>\n<\/tr>\n<tr>\n<td>devopssupport.in<\/td>\n<td>DevOps support and training resources (verify)<\/td>\n<td>Ops teams needing hands-on guidance<\/td>\n<td>https:\/\/www.devopssupport.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">20. 
Top Consulting Companies<\/h2>\n\n\n\n<figure class=\"wp-block-table\"><table>\n<thead>\n<tr>\n<th>Company Name<\/th>\n<th>Likely Service Area<\/th>\n<th>Where They May Help<\/th>\n<th>Consulting Use Case Examples<\/th>\n<th>Website URL<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>cotocus.com<\/td>\n<td>Cloud\/DevOps consulting (verify specialties)<\/td>\n<td>Architecture reviews, implementation support, operationalization<\/td>\n<td>Purview rollout planning, RBAC\/collections design, scan strategy and runbooks (verify scope with vendor)<\/td>\n<td>https:\/\/cotocus.com\/<\/td>\n<\/tr>\n<tr>\n<td>DevOpsSchool.com<\/td>\n<td>Training + consulting (verify service catalog)<\/td>\n<td>Enablement, implementation guidance, DevOps\/cloud practices<\/td>\n<td>Governance operating model workshops, IaC patterns for Azure resources, integrations planning (verify scope)<\/td>\n<td>https:\/\/www.devopsschool.com\/<\/td>\n<\/tr>\n<tr>\n<td>DEVOPSCONSULTING.IN<\/td>\n<td>DevOps consulting (verify offerings)<\/td>\n<td>DevOps pipelines, cloud operations, best practices<\/td>\n<td>Building deployment automation around data platform resources, monitoring and access review processes (verify scope)<\/td>\n<td>https:\/\/devopsconsulting.in\/<\/td>\n<\/tr>\n<\/tbody>\n<\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">21. 
Career and Learning Roadmap<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn before this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure fundamentals:<\/li>\n<li>Subscriptions, resource groups, Azure RBAC, managed identities<\/li>\n<li>Data fundamentals:<\/li>\n<li>Databases vs data lakes<\/li>\n<li>Schemas, partitions, file formats (CSV\/Parquet)<\/li>\n<li>Security basics:<\/li>\n<li>Least privilege, network concepts, Key Vault basics<\/li>\n<li>Analytics basics:<\/li>\n<li>Common Azure analytics services (ADLS Gen2, Azure SQL, Synapse, Data Factory)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">What to learn after this service<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Advanced Microsoft Purview topics:<\/li>\n<li>Connector expansion (Power BI, SQL estate, multi-subscription)<\/li>\n<li>Lineage integrations<\/li>\n<li>Governance workflows and stewardship operations<\/li>\n<li>Data platform architecture:<\/li>\n<li>Lakehouse patterns, medallion architecture<\/li>\n<li>Data quality tooling and monitoring<\/li>\n<li>Cloud security:<\/li>\n<li>Private endpoints, centralized logging, policy enforcement<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Job roles that use it<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Data engineer \/ senior data engineer<\/li>\n<li>Analytics engineer<\/li>\n<li>Data platform engineer<\/li>\n<li>Cloud solutions architect<\/li>\n<li>Data governance analyst \/ data steward<\/li>\n<li>Security engineer (data discovery and classification use cases)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Certification path (if available)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Microsoft certifications change frequently. There is not always a certification dedicated only to Data Catalog\/Purview. 
Consider:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Azure fundamentals and architecture certifications<\/li>\n<li>Data engineering certifications<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Verify the current Microsoft certification catalog: https:\/\/learn.microsoft.com\/credentials\/<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Project ideas for practice<\/h3>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Catalog a curated zone of an ADLS Gen2 lake; implement a collection strategy by domain.<\/li>\n<li>Build a glossary for 30 key business terms and link them to tables\/files.<\/li>\n<li>Set up scheduled scans, plus alerting and a runbook for scan failures.<\/li>\n<li>Pilot lineage for a supported pipeline tool and document the impact-analysis workflow.<\/li>\n<li>Build a \u201ctrusted datasets\u201d curation checklist and apply it to 10 assets.<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">22. Glossary<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Data Catalog:<\/strong> A searchable inventory of data assets plus metadata for discovery and governance.<\/li>\n<li><strong>Microsoft Purview account:<\/strong> The Azure resource that hosts Purview governance capabilities, including Data Catalog.<\/li>\n<li><strong>Data Map:<\/strong> The metadata store\/index behind the catalog experience.<\/li>\n<li><strong>Metadata:<\/strong> Data about data\u2014schemas, descriptions, owners, classifications, tags, and relationships.<\/li>\n<li><strong>Business glossary:<\/strong> Central dictionary of business terms and definitions mapped to data assets.<\/li>\n<li><strong>Collection:<\/strong> A hierarchical container in Purview used to organize assets and delegate administration.<\/li>\n<li><strong>Scan:<\/strong> A configured process that connects to a source and extracts metadata (and optionally classifications).<\/li>\n<li><strong>Managed identity:<\/strong> Azure AD identity automatically managed by Azure for authenticating to other services without storing 
secrets.<\/li>\n<li><strong>Classification:<\/strong> Labels applied to data assets\/columns indicating sensitivity or type (e.g., personal identifiers).<\/li>\n<li><strong>Lineage:<\/strong> The representation of data flow and transformations across systems from source to consumption.<\/li>\n<li><strong>Azure RBAC:<\/strong> Azure role-based access control for managing access to Azure resources.<\/li>\n<li><strong>Data plane vs control plane:<\/strong> Data plane is the actual data access; control plane is management\/configuration operations.<\/li>\n<li><strong>ADLS Gen2:<\/strong> Azure Data Lake Storage Gen2\u2014Azure Storage with hierarchical namespace for analytics workloads.<\/li>\n<li><strong>Least privilege:<\/strong> Security principle of granting only the minimum permissions needed.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">23. Summary<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Data Catalog in Azure\u2014implemented today as <strong>Microsoft Purview Data Catalog<\/strong>\u2014is Microsoft\u2019s supported way to deliver <strong>enterprise data discovery, metadata management, and governance<\/strong> across your Analytics estate. It helps teams find data faster, align on definitions through a glossary, and improve trust with ownership and (where supported) lineage and classification.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From a cost perspective, focus on the <strong>Data Map capacity<\/strong> and <strong>scan runtime\/frequency<\/strong> as primary drivers, and avoid scanning everything by default. From a security perspective, treat metadata as sensitive, apply least privilege to scanning identities, and use collection-based access boundaries.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Use Data Catalog when you have a growing, multi-system data estate and need a central discovery layer. 
Your next step after this tutorial is to expand from a single ADLS scan into a <strong>production operating model<\/strong>: collections by domain, curated datasets, glossary governance, and scheduled scans with monitoring.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Analytics<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[21,40],"tags":[],"class_list":["post-383","post","type-post","status-publish","format-standard","hentry","category-analytics","category-azure"],"_links":{"self":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/383","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/comments?post=383"}],"version-history":[{"count":0,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/posts\/383\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/media?parent=383"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/categories?post=383"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.devopsschool.com\/tutorials\/wp-json\/wp\/v2\/tags?post=383"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}