By Satyen Sangani
Published on June 21, 2022
When Alation was founded, we admittedly weren’t thinking in a fine-grained way about our audience of users. The product concept back then went something like:
In a world where enterprises have numerous sources of data, let’s make a thing that helps people find the best data asset to answer their question based on what other users were using.
It was SAT-like; not the answer, but the best answer among many options. And to determine “best,” we’d ingest log files and leverage machine learning. It was Googley in spirit: define best (i.e., relevancy) by the actions of others.
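To make the idea of defining “best” by the actions of others a little more concrete, here is a minimal sketch in Python. It is an illustration only, not Alation’s actual relevancy algorithm, and it uses no machine learning: it simply ranks matching data assets by how many distinct users have queried them. The log records, asset names, and the `rank_assets` function are all hypothetical.

```python
# Hypothetical query-log records: (user, data asset that user queried).
QUERY_LOG = [
    ("ana", "sales.orders"),
    ("ben", "sales.orders"),
    ("ana", "sales.orders"),
    ("cy", "sales.orders_backup"),
    ("ben", "finance.revenue"),
]


def rank_assets(log, search_term):
    """Rank assets whose name contains search_term by distinct-user count.

    An illustrative stand-in for usage-based relevancy: the more people
    who have queried an asset, the higher it ranks. Ties are broken
    alphabetically so the ordering is deterministic.
    """
    users_per_asset = {}
    for user, asset in log:
        if search_term in asset:
            users_per_asset.setdefault(asset, set()).add(user)
    return sorted(users_per_asset, key=lambda a: (-len(users_per_asset[a]), a))


print(rank_assets(QUERY_LOG, "orders"))
```

Here `sales.orders` ranks above `sales.orders_backup` because two different people have used it, which is the SAT-like “best answer among many options” in miniature.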
Thus, our vision for the audience was “people” and our early use-case was “search and discovery.” Over time, we called the “thing” a data catalog, blending the Google-style, AI/ML-based relevancy with more Yahoo-style manual curation and wikis.
Thus was born the data catalog.
In our early days, “people” largely meant data analysts and business analysts. They were scarce and expensive, and if we could provide a platform that made them happier and more productive, then it was a good investment. The audience grew to include data scientists (who were even more scarce and expensive) and their supporting resources (e.g., ML and DataOps teams). After that came data governance, privacy, and compliance staff. Power business users and other non-purely-analytic data citizens came after that.
As the audience grew, so did the diversity of information assets they wanted in the catalog. Analysts didn’t just want to catalog data sources; they wanted to include dashboards, reports, and visualizations. Why start with a data source and build a visualization if you can just find one that already exists, complete with metadata about it? Data scientists went beyond database tables to data lakes and cloud data stores, and they wanted to catalog not just information sources, but models. Data engineers wanted to catalog data pipelines.
As the audience grew, so did their associated use cases and the functional requirements to support them. Data governance people wanted policies, glossaries, and guided navigation. Compliance staff wanted workflow support. Privacy staff wanted tagging.
The point is that data catalogs started out with one audience (analysts) executing one use-case (search and discovery) and have evolved over the last decade to include multiple audiences executing multiple use cases. This is why we say that the data catalog is the platform for data intelligence. Platform means capable of supporting multiple audiences (e.g., analysts, data scientists, compliance, stewards, data engineers, analytics engineers) executing multiple use cases (e.g., search and discovery, governance, privacy, lineage, metadata management).
So, given this vision for the market and its evolution, we weren’t sure what to think when Forrester Research recently published a piece on Enterprise Data Catalogs for DataOps. At one level, it makes sense: there is certainly a lot of interest in DataOps today. It’s one of the hottest topics in data, right up there with the data mesh, the modern data stack (of which DataOps tools are a key part), and the data fabric. And we’re excited about DataOps because building a data culture requires delivering quality data, often as a product, to the business with low latency and at scale. Moreover, decentralized architectures like the data mesh and the data fabric increase the requirements for both DataOps as a discipline and data catalogs as a platform.
Of course, talking about DataOps invariably raises the question of how to define DataOps. While most define DataOps abstractly as a discipline (or as a set of practices, processes, and technologies), you can also think of DataOps, like DevOps, as a team within the organization.
When considered in this light, DataOps and the data (and/or analytics) engineers who work within it are another audience. Like any audience, they bring their own requirements in terms of use cases (e.g., observability) and information assets (e.g., data pipelines) to support. But, as with any audience, you want them on the same data intelligence platform as everyone else, because the purpose of the platform is to provide, in a decentralized world, a single layer that ties everything back together and provides high-level functionality like search and discovery, collaboration, and governance. Note that while the catalog should provide a single source of metadata, it does not have to meet every functional requirement. See our recently announced Open Data Quality Initiative for an example of how to tie data catalogs to various data quality tools that take different approaches to ensuring data quality.
If we have collectively agreed that keeping everything together in one place doesn’t work, then we should further agree that cataloging it all in one place is the smart alternative. As we sometimes like to say at Alation: we went searching for a single source of truth and found a single source of reference.
The whole purpose of the data catalog is that it’s for everyone. That’s our vision: to embrace new audiences and use cases so that the data catalog is, and remains, the platform for data intelligence.
Curious to learn more? Explore why customers and analysts alike choose Alation.