Here are 10 well-known data catalog and discovery tools:
- Collibra Catalog: Offers a centralized, business-friendly data catalog for discovering and understanding data assets.
- Alation: Provides a collaborative data catalog and knowledge platform to enable data discovery and governance.
- Informatica Enterprise Data Catalog: Offers AI-powered data cataloging and data discovery capabilities for enterprise data assets.
- Atlan: Offers a modern data catalog that allows users to discover, understand, and collaborate on data assets.
- AWS Glue Data Catalog: A fully managed metadata repository that integrates with various AWS services to enable data cataloging and discovery.
- Apache Atlas: A scalable and extensible open-source data governance and metadata framework for cataloging and managing data assets.
- IBM Watson Knowledge Catalog: Enables data cataloging, data discovery, and collaboration across the enterprise using AI-powered capabilities.
- Oracle Enterprise Metadata Management: Provides a comprehensive data catalog and metadata management solution for data discovery and governance.
- Infogix Data360 Govern: Infogix offers a suite of integrated data governance capabilities that include business glossaries, data cataloging, data lineage, and metadata management.
- Google Cloud Data Catalog: Part of Google Cloud’s Dataplex, Google Cloud Data Catalog is a fully managed cloud service with data discovery and metadata management capabilities.
1. Collibra Catalog
Collibra’s Data Dictionary documents an organization’s technical metadata and how it is used. It describes the structure of a piece of data, its relationship to other data, and its origin, format, and use. The solution serves as a searchable repository for users who need to understand how and where data is stored and how it can be used. Users can also document roles and responsibilities and use workflows to define and map data. Collibra stands out for having been built with business end users in mind.
2. Alation Data Catalog
Alation Data Catalog uses its Behavioral Analysis Engine, which applies artificial intelligence and machine learning, to bring the most useful information forward through popularity-driven relevancy. The product also builds governance into the workflow to help maintain data policies.
The architecture is containerized, which is intended to speed up data onboarding and shorten time-to-analysis.
Alation also supports multiple deployment styles, giving organizations the option of managing data themselves, having it managed remotely in the cloud, or choosing something in between. Alation’s Open Data Quality Initiative allows smooth data sharing between sources.
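The idea of popularity-driven relevancy can be illustrated with a minimal sketch. This is not Alation's engine or API; the `CatalogEntry` class, the `query_count` usage signal, and the `rank_by_popularity` function are all hypothetical names chosen for illustration, and a real system would likely blend many more signals.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    name: str
    description: str
    query_count: int  # hypothetical usage signal: how often the asset was queried


def rank_by_popularity(entries, search_term):
    """Return entries matching search_term, most-queried first."""
    term = search_term.lower()
    matches = [
        e for e in entries
        if term in e.name.lower() or term in e.description.lower()
    ]
    # Sort by usage, descending; break ties alphabetically for stable output.
    return sorted(matches, key=lambda e: (-e.query_count, e.name))


entries = [
    CatalogEntry("sales_raw", "Unvalidated sales export", 12),
    CatalogEntry("sales_curated", "Validated sales facts", 340),
    CatalogEntry("hr_headcount", "Monthly headcount", 55),
]
for entry in rank_by_popularity(entries, "sales"):
    print(entry.name, entry.query_count)
```

In this toy example, the heavily used curated table is listed ahead of the rarely used raw export, which is the effect a popularity signal is meant to achieve.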
3. Informatica Enterprise Data Catalog
Informatica Enterprise Data Catalog is a machine learning-based data catalog that lets you classify and organize data assets across any environment. The product also provides a metadata system of record for the enterprise. Enterprise Data Catalog automatically scans and catalogs data, indexing it for organization-wide discovery via a Google-like search engine. Key features include data provisioning, end-to-end data lineage, integrated data quality, data relationships and recommendations, and a Tableau extension.
4. Atlan
Atlan compares itself to Netflix for data, supporting multiple experiences for different kinds of users’ needs through its use of Personas. Each user has a customized homepage, custom metadata, and access to data curated to their workflows. Atlan’s Purposes allow you to create policies and grant access to data assets by business domains and project context. Atlan’s Compliance controls access to sensitive assets, which can also be auto-identified.
Atlan supports natural language search and lets users start from a business metric to find the assets linked to it, across the entire data estate. Atlan is built on open source, and all actions are API-driven. Atlan’s custom metadata builder has a no-code interface and makes it easy to share metadata with other users. Users can also collaborate and communicate through common communication and workflow tools and plug-ins without leaving Atlan.
5. AWS Glue Data Catalog
AWS Glue Data Catalog is the persistent metadata store in AWS Glue, a fully managed extract, transform, and load (ETL) service offered by AWS. The data catalog enables data management teams to store, annotate, and share metadata for use in ETL integration jobs when they create data warehouses or data lakes on the AWS cloud platform. It supports similar functionality to, and is compatible with, the metastore repository in Apache Hive, a popular open-source data warehouse tool. In some cases, organizations can also integrate the AWS data catalog as an external metastore for Hive data.
Users can share access to AWS Glue Data Catalog across an organization using their AWS Identity and Access Management (IAM) credentials. The data catalog tool helps enforce data governance requirements by tracking changes to schemas and data access controls. In addition, it supports data processes that span different AWS services, including AWS Lake Formation, Amazon Athena, Amazon Redshift, Amazon EMR, and more. AWS Glue Data Catalog can also be used to populate business data catalogs in Amazon DataZone, a separate data management service scheduled for a preview release in early 2023.
Other features offered by the AWS software include the following:
- the ability to write scripts to automatically crawl repositories and capture information on schemas and data types;
- improved visibility, control, and governance of data assets across various AWS data services; and
- a settings page in the AWS Glue management console for changing permissions and other data catalog properties.
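Once schemas and data types have been captured in the catalog, they can be read back programmatically. The following is a minimal sketch, assuming a boto3 Glue client is passed in; the function name `list_table_schemas` and the database name `analytics_db` are hypothetical. The client is injected rather than created inside the function so the logic can be exercised without AWS credentials.

```python
def list_table_schemas(glue_client, database_name):
    """Map each table name in a Glue database to its (column, type) pairs."""
    schemas = {}
    # get_tables is paginated, so walk every page of results.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Some tables may lack a storage descriptor; treat them as having no columns.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [
                (col["Name"], col.get("Type", "")) for col in columns
            ]
    return schemas


# Typical usage (requires boto3 and AWS credentials; not executed here):
#   import boto3
#   schemas = list_table_schemas(boto3.client("glue"), "analytics_db")
```

Access to the tables returned this way still depends on the IAM permissions of the caller, as described above.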
6. Apache Atlas
Apache Atlas is an open-source data governance and metadata management tool that allows businesses to collect, process, and maintain information more easily. The platform can track data processes, store data files, and manage metadata repository upgrades. Using Apache Atlas, teams can catalog their data assets, classify and manage databases, and collaborate on them with data scientists, analysts, and data governance specialists.
Why should you consider it?
Apache Atlas allows users to create and classify files, schemas, and tables, as well as to view data lineage through an intuitive user interface.
By enabling advanced data governance, the platform allows users to create new metadata types and instances and share metadata across teams through centralized analytics.
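Viewing data lineage amounts to walking a graph of "derived from" relationships. The sketch below shows that idea in plain Python; it does not use the Apache Atlas API, and the `LINEAGE` mapping, the asset names, and the `upstream_assets` function are hypothetical.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it was derived from.
LINEAGE = {
    "sales_dashboard": ["sales_summary"],
    "sales_summary": ["orders_clean", "customers_clean"],
    "orders_clean": ["orders_raw"],
    "customers_clean": ["customers_raw"],
}


def upstream_assets(asset, lineage):
    """Return every asset that `asset` depends on, nearest first (breadth-first)."""
    found = []
    visited = {asset}
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in visited:  # guard against revisiting shared ancestors
                visited.add(parent)
                found.append(parent)
                queue.append(parent)
    return found


print(upstream_assets("sales_dashboard", LINEAGE))
```

A lineage view in a catalog UI is essentially a rendering of this traversal, typically in both the upstream and downstream directions.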
7. IBM Watson Knowledge Catalog
IBM Watson Knowledge Catalog is a metadata repository that was designed from the ground up to support AI, machine learning, and other analytics workflows. It works with the company’s underlying InfoSphere Information Governance Catalog to help organizations discover and govern data across cloud and on-premises sources.
The Watson tool can catalog various data and analytics assets, including machine learning models and structured, unstructured, and semi-structured data types. It supports intelligent cataloging and data discovery, which can be driven by automated search recommendations. The tool also features a self-service portal and automated data governance functions, including active policy management capabilities, role-based access control, and dynamic masking of sensitive data. It can be deployed in the cloud, on-premises, or as a fully managed service on the IBM Cloud Pak for Data platform.
IBM Watson Knowledge Catalog also offers the following features:
- the ability to create a common business glossary as a foundation for data governance efforts;
- a set of more than 30 connectors to both IBM and external data sources; and
- tracking of data lineage, data quality scores, and data governance workflow history.
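Role-based access control combined with dynamic masking can be sketched as follows. This is an illustration of the general technique, not IBM's implementation; the policy sets, role names, and functions are hypothetical, and the masking rule (keep the last two characters) is an arbitrary choice for the example.

```python
# Hypothetical policy: columns treated as sensitive, and roles allowed to see them.
SENSITIVE_COLUMNS = {"email", "ssn"}
UNMASKED_ROLES = {"data_steward"}


def mask_value(value):
    """Replace all but the last two characters with asterisks."""
    text = str(value)
    return "*" * max(len(text) - 2, 0) + text[-2:]


def apply_masking(row, role):
    """Return a copy of row with sensitive columns masked unless role is exempt."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {
        col: mask_value(val) if col in SENSITIVE_COLUMNS else val
        for col, val in row.items()
    }


row = {"name": "Ada", "email": "ada@example.com", "ssn": "123-45-6789"}
print(apply_masking(row, "analyst"))
print(apply_masking(row, "data_steward"))
```

The masking is "dynamic" in the sense that the stored data is unchanged; what a user sees is decided at read time from their role.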
8. Oracle Enterprise Metadata Management
Oracle Cloud Infrastructure Data Catalog is a metadata management service that helps organizations find and govern data using an organized inventory of data assets. The product features a modern, intuitive user interface that includes a simple dashboard, search-and-browse capabilities, recommended actions, and shortcuts. Oracle Cloud Infrastructure Data Catalog is included with an Oracle Cloud Infrastructure subscription.
9. Infogix Data360 Govern
Infogix offers a suite of integrated data governance capabilities that include business glossaries, data cataloging, data lineage, and metadata management. The tool also provides customizable dashboards and zero-code workflows that adapt as each organizational data capability matures. Reference customers use Infogix for data governance and for risk, compliance, and data value management. The product is flexible and easy to use, and it also supports smaller data analysis jobs.
10. Google Cloud Data Catalog
Part of Google Cloud’s Dataplex, Google Cloud Data Catalog is a fully managed cloud service with data discovery and metadata management capabilities. Key features of the service include serverless architecture, metadata as a service, a central catalog, search and discovery, schematized metadata, Cloud Data Loss Prevention (DLP) integration, on-prem connectors, Cloud Identity and Access Management (IAM) integration, and governance capabilities. It also offers a faceted-search interface, metadata syncing and tagging, easy scalability, and integration with other Google Cloud services.
Other notable capabilities include the following:
- Technical and business metadata: Google Cloud Data Catalog enriches data with technical and business metadata, which supports data-driven decision-making and shortens time to insight.
- Unified view: Users get a single view of their data assets, which reduces the time spent searching for the right data.
- Cloud Data Loss Prevention (DLP): Data Catalog can use the Cloud Data Loss Prevention (DLP) scan to identify sensitive data directly within Data Catalog.
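To give a feel for what a sensitive-data scan does, here is a minimal sketch that tags columns based on pattern matches in sample rows. It does not call Cloud DLP or the Data Catalog API; the `PATTERNS` table, its label names, and the `tag_sensitive_columns` function are hypothetical, and the two regular expressions are deliberately simplistic.

```python
import re

# Illustrative patterns only; real DLP services use far richer detectors.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def tag_sensitive_columns(sample_rows):
    """Map each column name to the sorted pattern labels found in its sample values."""
    tags = {}
    for row in sample_rows:
        for column, value in row.items():
            for label, pattern in PATTERNS.items():
                if pattern.search(str(value)):
                    tags.setdefault(column, set()).add(label)
    # Columns with no matches are omitted from the result.
    return {column: sorted(labels) for column, labels in tags.items()}


rows = [
    {"id": 1, "contact": "ada@example.com", "note": "SSN on file: 123-45-6789"},
    {"id": 2, "contact": "grace@example.org", "note": "none"},
]
print(tag_sensitive_columns(rows))
```

In a catalog, labels like these would be attached to the column as tags so that sensitive data can be found through search and governed by policy.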