By Neil Raden
Published on April 14, 2022
Moving data to the cloud can bring immense operational benefits. However, the sheer volume and complexity of today’s enterprise data can cause downstream headaches for data users. Semantics, context, and how data is tracked and used mean even more as you stretch to reach post-migration goals. This is why, when data moves, it’s imperative for organizations to prioritize data discovery.
Data discovery is also critical for data governance, which, when ineffective, can actually hinder organizational growth. And, as organizations progress and grow, “data drift” starts to impact data usage, models, and your business. In today’s AI/ML-driven world of data analytics, explainability needs a repository just as much as those doing the explaining need access to metadata, i.e., information about the data being used.
This two-part series will explore how the challenges of data discovery, fragmented data governance, ongoing data drift, and ML explainability can all be addressed with a data catalog for accurate data and metadata record keeping.
With the onslaught of AI/ML, data volumes, cadence, and complexity have exploded. Cloud providers like Amazon Web Services, Microsoft Azure, Google, and Alibaba not only provide capacity beyond what the data center can offer; their current and emerging capabilities and services also drive the execution of AI/ML away from the data center.
The future lies in the cloud. A cloud-ready data discovery process can ease your transition to cloud computing and streamline processes upon arrival. So how do you take full advantage of the cloud? Migration leaders would be wise to enable all the enhancements a cloud environment offers, including:
Special requirements for AI/ML
Data pipeline orchestration
Collaboration and governance
Low-code, no-code operation
Support for languages and SQL
Moving/integrating data in the cloud/data exploration and quality assessment
Once migration is complete, it’s important that your data scientists and engineers have the tools to search, assemble, and manipulate data sources through the following techniques and tools.
An inference algorithm that informs the analyst with a ranked set of suggestions about the transformation.
A technique to automate changes in iterative passes.
A useful feature for exposing patterns in the data.
Supports the ability to interact with the actual data and perform analysis on it.
Automatic sampling to test transformation.
This provides the facility to set a time or event for a job to run and offers useful post-run information.
Similar to a data warehouse schema, this prep tool automates the development of the recipe to match.
Support for multiple analysts to work together, plus the facility to share quality work for reuse.
Taken together, these techniques enable everyone to trust the data, as well as the insights of their peers. A cloud environment with such features will support collaboration across departments and across common data types, including CSV, JSON, XML, Avro, Parquet, Hyper, TDE, and more.
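To make one of these techniques concrete, here is a minimal sketch of automatic sampling to test a transformation. It assumes rows are plain dictionaries; the names `sample_and_check` and `parse_amount` are hypothetical and do not come from any particular prep tool.

```python
import random

def sample_and_check(rows, transform, sample_size=100, seed=0):
    """Apply `transform` to a random sample and report the rows it fails on."""
    rows = list(rows)
    rng = random.Random(seed)  # fixed seed so the same sample is drawn each run
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    failures = []
    for row in sample:
        try:
            transform(row)
        except (KeyError, TypeError, ValueError) as exc:
            failures.append((row, repr(exc)))
    return {"sampled": len(sample), "failed": len(failures), "failures": failures}

def parse_amount(row):
    # Hypothetical transformation: turn a formatted string into a number.
    return {**row, "amount": float(row["amount"].replace(",", ""))}

rows = [{"amount": "1,200.50"}, {"amount": "99"}, {"amount": "n/a"}]
report = sample_and_check(rows, parse_amount, sample_size=10)
print(report["sampled"], report["failed"])
```

Running the transformation on a sample first surfaces bad values such as "n/a" before the full job is committed.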
The vision of big data freed organizations to capture more data sources at lower levels of detail and in vastly greater volumes. The problem with this much collection was that it exposed a far more complex semantic dissonance problem.
For example, data science always consumes “historical” data, and there is no guarantee that the semantics of older datasets are the same, even if their names are unchanged. Pushing data to a data lake and assuming it is ready for use is shortsighted.
Organizations launched initiatives to be “data-driven” (though we at Hired Brains Research prefer the term “data-aware”). They strove to ramp up skills in all manner of predictive modeling, machine learning, AI, or even deep learning. And, of course, the existing analytics could not be left behind, so any solution must satisfy those requirements as well. Integrating data from your own ERP and CRM systems may be a chore, but for today’s data-aware applications, the fabric of data is multi-colored.
The primary issue is that enterprise data no longer exists solely in a data center or even in a single cloud. It may span several clouds, or a combination of clouds and data centers.
Edge analytics for IoT, for example, captures, digests, curates, and even pulls data from other, different application platforms and live connections to partners (previously a snail-like exercise using obsolete processes like EDI). Edge computing can be decentralized across on-premises sites, cellular networks, data centers, or the cloud. As a result, data increasingly originates in far-flung environments, where the data structures and semantics are not well understood or documented.
Problems arise when data sources are semantically incompatible. And valuable analytics are often derived by drawing from multiple sources. The challenge of smoothly moving data and its logic while everything is in motion is too extreme for manual methods.
AI/ML models to automate the discovery and semantics of the data
Cloud governance
On-premises business intelligence and databases
A data catalog sophisticated enough to support the other components
Data security throughout the migration process is also essential. A data catalog that tracks labeled data and spotlights the most useful data can help migration managers ensure the process goes smoothly and securely. A data catalog with a governance framework can also ensure that cloud data governance is in place once data is migrated.
Security and governance are often confused because they are tightly bound, but security is only a part of governance. According to Strategies in IT Governance, “governance is the system by which entities are directed and controlled. It is concerned with structure and processes for decision making, accountability, control and behavior at the top of an entity. Governance influences how an organization’s objectives are set and achieved, how risk is monitored and addressed, and how performance is optimized.”
It’s not a simple definition. Governance has to be codified in an open system for applications across the enterprise to apply it. Adding to the confusion here is the fact that ethics and compliance are often used interchangeably. Ethics is about the right thing to do; compliance includes the rules, regulations, statutes, and even organizational direction, which try to realize and guide this “correct” course of action.
The role of governance is to define the rules and policies for how individuals and groups access data properties and the kind of access they are allowed. Yet people in an organization rarely operate according to well-defined roles. They perform in multiple roles, often provisionally. On-ramping has to happen immediately; off-ramping has to be a centralized function. One very large organization we dealt with discovered that departing employees still had access to critical data for seven to nine days!
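As a rough illustration of why off-ramping works best as a centralized function, the sketch below keeps every grant in one registry, gives provisional roles an expiry date, and revokes all of a departing user's access in a single call. The `Grant` and `AccessRegistry` names are assumptions made for this example, not part of any governance product.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Grant:
    user: str
    dataset: str
    level: str      # kind of access allowed, e.g. "read" or "write"
    expires: date   # provisional roles get a short expiry

@dataclass
class AccessRegistry:
    grants: list = field(default_factory=list)

    def grant(self, user, dataset, level, expires):
        self.grants.append(Grant(user, dataset, level, expires))

    def allowed(self, user, dataset, level, today):
        # Access requires a matching grant that has not expired.
        return any(
            g.user == user and g.dataset == dataset
            and g.level == level and today <= g.expires
            for g in self.grants
        )

    def offboard(self, user):
        """Revoke every grant held by a departing user in one central step."""
        before = len(self.grants)
        self.grants = [g for g in self.grants if g.user != user]
        return before - len(self.grants)

registry = AccessRegistry()
registry.grant("ana", "sales", "read", date(2022, 6, 30))
registry.grant("ana", "payroll", "read", date(2022, 12, 31))
active = registry.allowed("ana", "sales", "read", date(2022, 4, 14))
expired = registry.allowed("ana", "sales", "read", date(2022, 7, 1))
revoked = registry.offboard("ana")
after_exit = registry.allowed("ana", "payroll", "read", date(2022, 4, 15))
```

Because every grant lives in one place, access ends the moment `offboard` runs rather than days later.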
After all, without governance, security would be arbitrary. Many organizations that employ security schemes struggle, because such schemes tend to be either too loose or too tight and almost always too rigid (insufficiently dynamic).
In this way, security can hinder the progress of the organization. Yet given the complexity of data architecture today, it’s become impossible to manage security for individuals without a coherent and dynamic governance policy to drive security allowance or grants for exceptions to those rules. It is impossible to have a coherent security policy that isn’t part of the larger governance framework.
In today’s complex data architecture, governance has grown too complicated for manual methods. A data governance application with the ability to connect to data and security is needed. As discussed in the next installment, that data governance app must also connect to a data catalog.
Part of the complexity of managing security is the constant change or “drift” in the data, the models, the semantics, the Master Data Management, and all dependencies.
Once data is ingested into an organization’s repository or is connected through a managed pipeline, data sources tend to drift. Drift changes the data over time, so data cannot have the kind of perfect objectivity that we tend to attribute to it. Data that originates from primarily stable operational systems (also known as operational exhaust) and data extracted from static database tables generally exhibit less drift.
Almost any other data, however, is more prone to drift. This so-called “digital exhaust” includes user-generated files on web-based systems and networks such as cookies, logs, temporary browsing history, and indicators to help website managers.
In addition, there are external datasets from data brokers that need scrutiny every time they are accessed. Incorporating this data into your corpus of information without constant surveillance for drift would render your repository unreliable.
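One way to put this kind of surveillance into practice is to compare each new batch of a numeric field against a baseline. The sketch below uses the Population Stability Index (PSI), a common drift measure chosen here for illustration rather than one this article prescribes. The equal-width binning, the `eps` floor for empty bins, and the 0.25 alert level (a frequently quoted rule of thumb) are all assumptions.

```python
import math

def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of a numeric field."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp outliers into the edge bins
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm below is defined.
        return [max(c / len(values), eps) for c in counts]

    base, cur = shares(baseline), shares(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))

baseline = [float(v) for v in range(100)]
shifted = [v + 50.0 for v in baseline]   # same field, values moved upward
same_score = psi(baseline, baseline)
drift_score = psi(baseline, shifted)
needs_review = drift_score > 0.25        # assumed alert threshold
```

A score of zero means the two samples fall into the bins in identical proportions; larger scores mean the distribution has moved.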
Another source of drift is organizational. It arises from activities like mergers and acquisitions, joint ventures, and dynamic supply chains. All these changes render simple, high-level schemes for security unworkable and create considerable liability to an organization.
Legacy data adds to the challenge. Coupled with inadequate security are last-generation solutions to metadata, effectively rows and columns that have to be queried and often joined in a relational format. This approach is generations old and too limited and rigid for the requirements of a digital organization.
Models, too, can drift and transform over time. Analysts and data scientists may or may not register their models, but effective governance of AI/ML models should cover tracking the models in production, versioning them, managing documentation and update notifications, monitoring models and their results, and aligning machine learning with existing IT policies.
AI/ML models also complicate versioning and testing, because both their results and the methods needed to test them are specific to each model. For instance, new users may find it difficult to understand how an ML model arrived at its conclusion. This is the so-called black-box problem. Models built with procedural methods can be traced step by step, but ML is far more opaque in its operations. However, ML gives up some of its mystery through a newer set of techniques called explainable AI, or XAI for short. These examinations include measuring bias and fairness by understanding which variables affected the conclusions, among other investigations. XAI, though, has only started to show some maturity.
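A simple way to see which variables affected a model's conclusions is permutation importance: shuffle one input at a time and measure how much the error grows. The sketch below is one illustrative technique, not the method of any specific XAI tool, and the toy `predict` function and data are assumptions.

```python
import random

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Increase in mean squared error when each feature is shuffled in turn."""
    rng = random.Random(seed)

    def mse(data):
        return sum((predict(r) - t) ** 2 for r, t in zip(data, targets)) / len(targets)

    base_error = mse(rows)
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(mse(permuted) - base_error)
    return scores

def predict(row):
    # Toy black box: the output depends on feature 0 only.
    return 3.0 * row[0]

rows = [(float(i), float(i % 3)) for i in range(20)]
targets = [3.0 * r[0] for r in rows]
scores = permutation_importance(predict, rows, targets, n_features=2)
```

Here the first score is positive and the second is zero, which matches the fact that the toy model ignores its second input.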
Record-keeping in a data catalog is key. By tracking, documenting, monitoring, versioning, and controlling access to all models, organizations can closely control model inputs and begin to understand all the variables that might affect the results. A key benefit of model governance is identifying who owns a model while a company changes over time. For example, if someone worked on a project recently but has left the company, model governance helps keep track of their projects, how they run, and where the work left off.
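The record-keeping described above could look something like the following sketch of a model registry that versions each model, records its inputs, and hands ownership over when someone leaves. All class and method names are hypothetical; a real catalog would persist these records and control access to them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelVersion:
    version: int
    owner: str
    inputs: tuple
    notes: str = ""

class ModelRegistry:
    def __init__(self):
        self._models = {}  # model name -> list of ModelVersion, oldest first

    def register(self, name, owner, inputs, notes=""):
        versions = self._models.setdefault(name, [])
        record = ModelVersion(len(versions) + 1, owner, tuple(inputs), notes)
        versions.append(record)
        return record

    def latest(self, name):
        return self._models[name][-1]

    def owned_by(self, owner):
        return sorted(n for n, v in self._models.items() if v[-1].owner == owner)

    def reassign(self, old_owner, new_owner):
        """Record a new version under the new owner for each affected model."""
        moved = self.owned_by(old_owner)
        for name in moved:
            last = self.latest(name)
            self.register(name, new_owner, last.inputs,
                          f"ownership moved from {old_owner}")
        return moved

registry = ModelRegistry()
registry.register("churn", "dana", ["tenure", "plan"])
registry.register("churn", "dana", ["tenure", "plan", "region"], "added region")
registry.register("forecast", "lee", ["orders"])
moved = registry.reassign("dana", "lee")
```

Keeping the handover as a new version, instead of overwriting the owner, preserves the history of who was responsible for which inputs and when.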
The many kinds of data drift are challenging to address manually and may create problems with data security down the line. How can data leaders safeguard security while ensuring data drift is kept in check?
The solution to the problem is a data catalog. Data catalog software collects metadata, combines it with data management and search tools, and helps analysts and other data users find the data that they need. Data catalogs can be continuously updated with AI/ML routines to provide much richer metadata, broader and deeper coverage, and far faster performance. In the next installment, we’ll look at how AI/ML helps data catalogs improve data governance, data discoverability, data usage, and more.
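At its simplest, the collect-and-search core of a catalog can be pictured as in the sketch below. It is a toy, in-memory illustration under assumed names (`CatalogEntry`, `DataCatalog`) and uses plain keyword matching; commercial catalogs add automated metadata harvesting, lineage, and ranking that are not modeled here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CatalogEntry:
    name: str
    description: str
    owner: str
    tags: tuple = ()

class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, query):
        """Return dataset names whose metadata mentions any query term."""
        terms = query.lower().split()
        scored = []
        for entry in self._entries.values():
            text = " ".join((entry.name, entry.description, *entry.tags)).lower()
            hits = sum(term in text for term in terms)
            if hits:
                scored.append((hits, entry.name))
        # Most matching terms first, then alphabetical for stable output.
        return [name for hits, name in sorted(scored, key=lambda s: (-s[0], s[1]))]

catalog = DataCatalog()
catalog.register(CatalogEntry("crm_accounts", "Customer accounts from the CRM",
                              "sales-ops", ("customer", "pii")))
catalog.register(CatalogEntry("erp_orders", "Order lines from the ERP",
                              "finance", ("orders",)))
results = catalog.search("customer orders")
pii_sets = catalog.search("pii")
```

Tagging an entry with a label such as "pii" is also what lets governance rules find sensitive datasets through the same search.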
1. https://www.researchgate.net/publication/314517377_Governance_in_IT_Outsourcing_Partnerships