By John Wills
Published on August 18, 2022
We all know that data lineage is a complex and challenging topic. In this blog, I am drilling into something I’ve been thinking about and studying for a long time: fundamental approaches to lineage creation and maintenance. There are several reasons why I am compelled to address it:
I continue to meet people who don’t understand how to frame the lineage challenge and put it in context. These people are at the beginning of the learning curve and are typically hopeful that a magic ‘easy button’ exists.
In the world of data lineage, there are a lot of exciting changes and evolution that warrant investigation.
I’m convinced that a fundamental shift in approach to lineage is needed to drive both the value of data (and the analytics culture) to a new level of effectiveness. Simply put, I find it fascinating!
What do I mean by lineage creation and maintenance? For the purposes of this blog, it means the process of understanding how and why things are related.
But what data things are interconnected? Most of the time we think about data fields & files, columns & tables, reports & dashboards. And we think about how they are manipulated using some form of processing. We also think about how all of these are strung together to form long ‘chains’ of dependencies in pipelines and orchestrations.
The challenge today is to think more broadly about what these data things could or should be. It’s important to realize that we need visibility into lineage and relationships between all data and data-related assets, including business terms, metric definitions, policies, quality rules, access controls, algorithms, etc. My focus is on the fundamental approaches we use to understand what physical data assets exist and the processing that relates one to another.
It’s critical that you consider not just how these things are discovered and created the first time, but how they are maintained on an ongoing basis. Data drift is tough to track but essential to understand, as it can lead to low-quality data and erode human trust. When it comes time to create lineage for the first time, there are a lot of one-off approaches (including ‘brute force’) that can be used. But how do you track its drift and evolution? The complex challenge is to have the lineage intelligently updated as the data landscape and its processing shift and change daily across an enterprise. Active metadata will play a critical role in automating such updates as they arise.
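To make the drift problem concrete, here is a minimal sketch (in Python) of one way to flag it: capture the lineage graph as a set of source-to-target edges and diff today’s snapshot against yesterday’s. The table names and snapshot contents are hypothetical illustrations, not a prescribed format.

```python
# A minimal sketch: detect lineage "drift" by diffing two snapshots of the
# lineage graph, each captured as a set of (source, target) edges.
# All table names and snapshot contents below are hypothetical.

def diff_lineage(previous: set, current: set) -> dict:
    """Return the edges that appeared or disappeared between two snapshots."""
    return {
        "added": current - previous,
        "removed": previous - current,
    }

yesterday = {("raw.orders", "staging.orders"), ("staging.orders", "mart.revenue")}
today = {("raw.orders", "staging.orders"), ("staging.orders_v2", "mart.revenue")}

drift = diff_lineage(yesterday, today)
print(drift["added"])    # {('staging.orders_v2', 'mart.revenue')}
print(drift["removed"])  # {('staging.orders', 'mart.revenue')}
```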
The fundamental value proposition of lineage is increased productivity. Data engineers and others tasked with maintaining data feeds and pipelines, as well as those responsible for issue remediation, use lineage to identify dependent data and processing more quickly. This enables them to troubleshoot and address broken pipelines more efficiently.
Data consumers, too, stand to benefit. Visibility into lineage increases consumers’ trust and confidence, as they can see where data comes from. (This may be a red herring, which I’ll address in a future blog!).
For now, I will explore the two fundamental approaches to data lineage creation and maintenance. I’ve adopted the statistics-related terminology of deterministic and non-deterministic to help define and explain each.
I define non-deterministic lineage as a post-processing effort to discern the existence of data assets and discover how the assets are related through processing logic. Said in a more direct way, it’s an effort to examine code and parse out what is in it, including how it’s related. With this process you’d address the question, “How do these data assets interconnect?”
This has been the dominant approach for nearly 50 years, and in my opinion, was born out of the work of Thomas McCabe in the 1970s to measure the complexity of COBOL programs. His work produced control-flow graphs with nodes and edges as a visual representation of complexity.
Fast forward and we still see the same basic approach with lineage tools either connecting to a ‘source’ or being fed files and then trying to tease out the nodes and edges. The big difference is the explosion of data sources. Today, sources can include hundreds of technologies, including a wide range of coding languages, pipeline environments, data storage/streaming representations, and API endpoints. Of course, the other big change is the complexity of the modern application and data architecture. This places a premium on the need to understand ‘cross-system’ lineage.
For non-deterministic lineage to work well, it requires a significant investment in building parsers by engineers who deeply understand specific environments and languages. Attempting to stitch together a representation of cross-system lineage requires an even deeper investment in a proprietary approach to mapping in order to maintain reasonable accuracy.
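To give a flavor of what that parsing work involves, here is a deliberately toy sketch of the non-deterministic approach: a regular expression teases source and target tables out of a finished SQL statement after the fact. The table names are made up, and a production parser needs a full grammar per dialect rather than a regex; the point is only that the lineage edges are inferred from code that was never written with lineage in mind.

```python
import re

# A toy illustration of the non-deterministic approach: examine a finished SQL
# statement and try to tease the lineage edges out of it after the fact.
# A real parser needs a full grammar per dialect; this regex only handles the
# simplest "INSERT INTO ... SELECT ... FROM/JOIN ..." shape. Names are made up.

sql = """
INSERT INTO mart.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM staging.orders o
JOIN staging.refunds r ON r.order_id = o.id
GROUP BY o.order_date
"""

target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE).group(1)
sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)

edges = [(source, target) for source in sources]
print(edges)
# [('staging.orders', 'mart.daily_revenue'), ('staging.refunds', 'mart.daily_revenue')]
```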
Typically, it’s been VC-backed vendors who have taken on this challenge, and over the years several have come and gone. The vendors have done some good work, but have generally struggled to meet false expectations that lineage would be an easy ‘push button’ exercise. Even for data lineage companies, creating an accurate representation of the entire data landscape is still an enormous challenge.
Of course, a realist will see the fatal flaw. The non-deterministic approach is retroactive: a post-mortem. It involves deep technical expertise, guesswork, and troubleshooting. This is an attempt to extract highly accurate meaning from a wide range of code and technical artifacts without a premeditated approach or framework. Even with the most powerful ML/AI it will always be an after-the-fact approach to divine understanding, producing significant gaps and inaccuracies.
As a counter-example, what if you planned for lineage? What if you built your processing framework to accommodate lineage from the start?
I define deterministic lineage as the use of lineage markup embedded in processing logic to construct a view of lineage. In other words, lineage is premeditated, and the developers and maintainers of pipelines proactively insert lineage indicators along with the code that performs data movement and manipulation.
This approach ensures lineage is easy to visualize. Because the lineage markup uses a consistent ‘dialect’, no matter the coding environment, the production of a visual representation of lineage is a much lighter lift. All that’s needed is a parser that looks for the markup in everything it’s handed.
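As a rough illustration, here is a minimal sketch of what such a markup ‘dialect’ and its lightweight parser might look like. The lineage.source and lineage.target comment keys are invented for this example (they are not an existing standard); what matters is that the extractor only has to look for the markup, never at the surrounding code.

```python
import re

# A hypothetical markup "dialect" embedded as comments right next to the code
# that moves the data. The lineage.source / lineage.target keys are invented
# for this sketch; the parser only looks for the markup, never at the SQL.

pipeline_step = """
-- lineage.source: staging.orders
-- lineage.source: staging.refunds
-- lineage.target: mart.daily_revenue
INSERT INTO mart.daily_revenue
SELECT ...
"""

def extract_markup(text: str) -> list:
    """Build (source, target) edges from embedded lineage markup."""
    sources = re.findall(r"lineage\.source:\s*([\w.]+)", text)
    targets = re.findall(r"lineage\.target:\s*([\w.]+)", text)
    return [(s, t) for s in sources for t in targets]

print(extract_markup(pipeline_step))
# [('staging.orders', 'mart.daily_revenue'), ('staging.refunds', 'mart.daily_revenue')]
```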
Naysayers here may argue that the burden of responsibility and effort is shifted from the ‘magic’ of (machine) parsers to the (human) developer. At first, that may seem like a bad trade-off; after all, who wants more work to do? But let’s examine the benefits of deterministic lineage:
Lineage will represent exactly what is embedded in the code. If it’s wrong, it’s within your power and control to change as opposed to relying on a vendor to enhance their ‘black box’.
You can decide what level of granularity you want to represent in your lineage ranging from high-level representation down to field level transformations and mapping.
If a language can include metadata in the form of comments (and they all can), then markup can be inserted. This means every tool you use today can carry lineage markup, and so can any tool you adopt in the future.
The markup can be extracted and used in a wide array of visual tools. This could also include storage in a graph database and/or being mashed up with other technologies (see the sketch after this list).
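To sketch that last point, here is one hedged example: once the edges have been extracted from the markup, loading them into an in-memory directed graph (networkx is used here purely as an illustration of a downstream tool) already answers basic impact-analysis questions. The edge list is hypothetical.

```python
import networkx as nx  # used here purely as an example of a downstream tool

# Once edges are extracted from the markup, they can be pushed into a graph
# store or a visual tool. Even an in-memory directed graph already answers
# basic impact-analysis questions. The edge list is hypothetical.
edges = [
    ("staging.orders", "mart.daily_revenue"),
    ("staging.refunds", "mart.daily_revenue"),
    ("mart.daily_revenue", "dashboard.revenue_report"),
]

G = nx.DiGraph()
G.add_edges_from(edges)

# Everything upstream of the dashboard -- what a consumer checks before trusting it.
print(nx.ancestors(G, "dashboard.revenue_report"))
# Everything downstream of a source table -- what an engineer checks before changing it.
print(nx.descendants(G, "staging.orders"))
```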
The Alation Data Catalog has industry-leading lineage capabilities. This is especially true as it relates to our unique query log parsing capability for a wide array of database environments. Additional leading capabilities include our automation of BI lineage and the ability to create custom database connectors that include your own query log parsing logic.
Lineage capabilities have also recently been extended to support manual creation through the UI. To put it in the context of the definitions above, I would say we are an industry leader in the non-deterministic lineage approach and are providing tremendous value for our customers.
We also are laying the groundwork to be the leader in the newer deterministic lineage approach, but that will take a while. Why? The standards and approach are in an embryonic stage in the industry. It will take a few years for the concepts and early proofs of concept to become production viable, but we intend to be part of that process and to offer the value it produces as an option for our customers.
These approaches to lineage are not mutually exclusive. An integrated approach that leverages the strengths of man and machine is increasingly the sought-after end state. You can easily imagine a man/machine partnership where parsers do a great deal of heavy lifting for well-known environments (intra-system) and humans provide the vital markup for linkage (cross-system) and specialized processing.
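Sketching that partnership in code, under the same hypothetical naming as the earlier examples: union the edges a parser discovered with the edges humans declared via markup, and keep a note of where each one came from so the two can be audited separately.

```python
# A minimal sketch of the blended approach: union the edges a parser discovered
# (intra-system) with the edges humans declared via markup (cross-system),
# keeping a note of where each edge came from. All names are illustrative.

parsed_edges = {("staging.orders", "mart.daily_revenue")}
markup_edges = {("mart.daily_revenue", "partner_feed.daily_export")}

combined = (
    [{"edge": e, "origin": "parser"} for e in sorted(parsed_edges)]
    + [{"edge": e, "origin": "markup"} for e in sorted(markup_edges)]
)
for record in combined:
    print(record)
```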
I am convinced that lineage is one of the keys to reaching a new level of productivity fueled by data. It is an area ripe for fundamental change. And you now have a frame of reference you can use to track its evolution.