By Dave Wells
Published on February 13, 2020
In an earlier blog, I defined a data catalog as “a collection of metadata, combined with data management and search tools, that helps analysts and other data users to find the data that they need, serves as an inventory of available data, and provides information to evaluate fitness of data for intended uses.”
From modest beginnings as a means to manage data inventory and expose data sets to analysts, the data catalog has grown in functionality, popularity, and importance. Modern data catalogs, which originated to help data analysts find and evaluate data, continue to meet the needs of analysts, but they have expanded their reach. They are now central to data stewardship, data curation, and data governance, all of which are metadata-dependent activities.
Think of a data catalog as being similar to a traditional retail catalog. Instead of information about products, it contains metadata, along with data management and search tools, so that it can serve as an inventory of available data and provide the information needed to evaluate the fitness of that data.
Metadata management is how organizations track their data: both where it comes from and how it’s being used.
Whereas metadata describes data characteristics like structure, format, and content, a data catalog is a software tool used to manage and organize metadata about an organization's data assets. A data catalog stores metadata to facilitate metadata management and, by extension, search and discovery, governance, and collaboration.
It seems that everyone wants data management but most want to avoid metadata management. The distaste for metadata management is an artifact of past metadata approaches with disparate metadata collected by a variety of tools using proprietary formats and without integration. Metadata management in the BI era was painful, but we can’t avoid the reality that metadata is essential to data management. Just as you need data about finances for effective financial management, you need data about data (metadata) for effective data management. You can’t manage data without metadata.
As data management becomes more complex with data lakes, big data, self-service analytics, and data science, the role of metadata changes, and the importance of metadata increases exponentially. Metadata that is current, accurate, and readily accessible is an imperative. Metadata disparity is not workable, and metadata management as an afterthought is hazardous. We must actively manage metadata, and a data catalog is the right tool for the job. The data catalog has become the new gold standard for metadata and a cornerstone of data curation.
The real value of metadata is found in the answers it can provide. People who depend on data have questions about trustworthiness, latency, lineage, sensitivity, preparation, and much more. Sometimes they want to find others who know or have worked with the data to get a human perspective. They also need to know about access, privacy and security constraints, cost, and similar practical matters. Robust metadata ranging from data set names and properties to usage, access, licensing, and subject experts is the key to answering the many questions that data users and data managers will ask. In today’s self-service world, metadata is essential for three distinct groups of data management stakeholders:
Data consumers need metadata to help them find data for reporting, analysis, and data science work, and to evaluate that data to ensure that they work with the right datasets.
Data curators need metadata to observe data usage, understand the needs and interests of data consumers, and effectively manage the collection of shared data.
Data governors (owners and stewards) need metadata to identify and protect sensitive data, trace data lineage, and establish trust in data.
Metadata is the core of a data catalog. Every catalog collects data about the data inventory and also about processes, people, and platforms related to data. Metadata tools of the past collected business, process, and technical metadata, and data catalogs continue that practice. But data catalogs do much more. They collect metadata about datasets, metadata about processing, metadata for searching, and metadata for and about people. Figure 1 shows a logical data model that represents typical metadata content of a data catalog.
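To make those four kinds of metadata concrete, here is a minimal Python sketch of what a catalog entry might hold. It is an illustration only, not the model in Figure 1 and not any particular product's schema; the class names, field names, and sample records are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Person:
    """Metadata about a person with a data role (hypothetical shape)."""
    name: str
    role: str  # e.g. "steward", "subject matter expert", "consumer"


@dataclass
class ProcessingStep:
    """Metadata about a process that produced or changed a dataset."""
    description: str
    source_datasets: List[str] = field(default_factory=list)


@dataclass
class DatasetEntry:
    """One catalog entry: dataset, processing, search, and people metadata."""
    name: str
    description: str
    tags: Set[str] = field(default_factory=set)  # metadata for searching
    lineage: List[ProcessingStep] = field(default_factory=list)
    people: List[Person] = field(default_factory=list)


def search(entries, term):
    """Return names of entries whose name, description, or tags mention term."""
    term = term.lower()
    return [
        e.name
        for e in entries
        if term in e.name.lower()
        or term in e.description.lower()
        or any(term in t.lower() for t in e.tags)
    ]


# Made-up sample entries, purely for illustration.
catalog = [
    DatasetEntry(
        name="sales_orders",
        description="Daily order lines from the web store",
        tags={"sales", "finance"},
        lineage=[ProcessingStep("nightly load", ["raw_orders"])],
        people=[Person("A. Steward", "steward")],
    ),
    DatasetEntry(
        name="customer_profiles",
        description="Customer master records",
        tags={"customer", "pii"},
    ),
]

print(search(catalog, "finance"))
```

Even in a toy like this, the same entry answers a consumer's question (which dataset covers finance?), a governor's question (where did it come from?), and a curator's question (who looks after it?).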
Data catalogs change the game and elevate best practices for metadata management with:
Crowdsourced metadata. Much of catalog metadata is collected automatically by applying algorithms and machine learning. But sometimes the most valuable metadata is the knowledge and experiences of individuals and groups. Collecting that knowledge as user ratings, reviews, tips, and techniques enriches the metadata collection and converts tribal knowledge into a shared and enduring data management resource.
Data about people. Data management and data analysis are ultimately human activities. Knowing which people have data roles and relationships and the nature of those roles is valuable. Data catalogs capture metadata to identify data users, data creators, data stewards, and data subject matter experts.
Automated metadata discovery. Organizations with massive data holdings—literally tens of thousands of databases—simply don’t know about all of the data they have. It is impossible to catalog a petabyte data estate without automated discovery.
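As a rough illustration of what automated discovery means at the smallest possible scale, the Python sketch below crawls a single SQLite database and harvests table names, column names and types, and row counts. Real catalogs scan many platforms and apply machine learning on top; the function name `discover_metadata`, the shape of the returned dictionaries, and the sample tables are assumptions made for this example.

```python
import sqlite3


def discover_metadata(conn):
    """Harvest table names, column names/types, and row counts from SQLite."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = [
            {"name": row[1], "type": row[2]}
            for row in conn.execute(f'PRAGMA table_info("{table}")')
        ]
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        entries.append(
            {"dataset": table, "columns": columns, "row_count": row_count}
        )
    return entries


# Hypothetical sample database, created in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("INSERT INTO orders (amount) VALUES (19.99), (5.00)")

for entry in discover_metadata(conn):
    print(entry["dataset"], entry["row_count"], [c["name"] for c in entry["columns"]])
```

Nobody had to describe these tables by hand; the database's own system tables supplied the technical metadata. Descriptions, ratings, and subject experts still have to come from people.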
Automated metadata discovery is an important part of data cataloging. But much of the metadata in a data catalog is a result of crowdsourcing and collaboration. In my next blog, I’ll discuss the roles of Collaboration and Crowdsourcing for Data Cataloging.