By Raja Perumal
Published on June 16, 2020
Increasingly, enterprises seeking a modern data infrastructure are adopting cloud environments for greater productivity and flexibility as well as to drive digital transformation. But unlocking these benefits can be challenging.
Enterprises must plan both which data to migrate and how to keep the cloud environment up to date, or risk a slow, drawn-out migration and unnecessary costs. The cloud data lake must be accessible to a broad range of data consumers, who need to be able to find data within it, or it risks becoming part of the problem, another siloed data source with limited usage, rather than part of the solution. Data scientists need an easy way to collaborate on datasets and analytic models and to share results and insights, or the cloud risks becoming a hindrance to productive data science rather than an enabler. To help enterprises realize the promise of the cloud, Alation and Databricks have partnered to enable organizations to establish and maintain a cloud environment, provide data context and fast analytics processing, and enable collaboration across the enterprise to deliver faster and more accurate data science insights.
Migrating to a cloud environment is a complex project requiring significant time and resources as well as meticulous planning to ensure a successful move and healthy adoption. IT needs to determine what data to move, who is using it, and how to communicate changes in order to avoid productivity gaps. Moving the wrong data, or missing critical data or dependent data assets, will cause frustration and inefficiency for data users and prolong the migration. Enterprises can find themselves caught between indefinitely maintaining legacy systems and needlessly migrating unused and unnecessary data.
Alation eliminates the uncertainty inherent in data migration projects through its data catalog, which identifies which data is being used, by whom, and how. The Alation data catalog identifies the top users of data, proactively alerts relevant users about data relocation operations, and tells them where to find the relocated data in the cloud. Once the team determines what data to migrate and maintain in the data lake, the Databricks Unified Data Platform simplifies the data pipeline across batch and streaming data to maintain up-to-date data in the cloud. Databricks provides a fast and reliable cloud platform to ingest, process, store, access, and share data.
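To make the idea of usage-driven migration planning concrete, here is a minimal Python sketch. It does not use the Alation or Databricks APIs. The query log, the table inventory, and the `summarize_usage` helper are all hypothetical, invented for illustration only. It shows the kind of question a catalog answers: which tables are queried, who queries them most, and which tables nobody touches.

```python
from collections import Counter, defaultdict

# Hypothetical query-log records as (user, table) pairs. Illustrative data only.
QUERY_LOG = [
    ("ana", "sales.orders"),
    ("ana", "sales.orders"),
    ("ben", "sales.orders"),
    ("ben", "finance.ledger"),
    ("ana", "finance.ledger"),
    ("ana", "finance.ledger"),
]

# Hypothetical inventory of tables in the legacy system.
ALL_TABLES = ["sales.orders", "finance.ledger", "archive.events_2009"]


def summarize_usage(query_log, all_tables, top_n=3):
    """Return per-table usage (query count, most active users) and unused tables."""
    users_by_table = defaultdict(Counter)
    for user, table in query_log:
        users_by_table[table][user] += 1

    summary = {}
    for table in all_tables:
        per_user = users_by_table.get(table, Counter())
        summary[table] = {
            "queries": sum(per_user.values()),
            # Users ordered from most to least active on this table.
            "top_users": [user for user, _ in per_user.most_common(top_n)],
        }

    # Tables with no recorded queries are candidates to leave out of the migration.
    unused = [t for t in all_tables if summary[t]["queries"] == 0]
    return summary, unused


if __name__ == "__main__":
    summary, unused = summarize_usage(QUERY_LOG, ALL_TABLES)
    for table, info in summary.items():
        print(table, info["queries"], info["top_users"])
    print("Unused tables:", unused)
```

In this toy data, the archive table has no queries and would be flagged for review before anyone spends effort migrating it, while the top users of the other tables are the people to notify when those tables move.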
Data scientists waste significant time and energy searching for data to use for machine learning (ML). Estimates have put the amount of time data scientists spend searching for and unpacking data at as much as 80% of the analytics cycle. Data science teams need to be able to find and understand data quickly to select the best data to use for modeling. Data scientists need a platform that enables them to quickly access and process data to effectively build, train, and deploy data science models.
Alation enables data users to easily discover data sources with the context needed to understand the data, identify who is using it, and understand how it has been used. Alation makes finding data in a data lake easy with natural language search and helps users to discover new data sources. Databricks provides fast analytics processing on auto-scaling infrastructure that is powered by Apache Spark™. Data scientists speed their time-to-insight by quickly identifying the data they need in Alation, and then applying Databricks’ reliable cloud platform to access and analyze complete datasets, delivering rapid results from data science models.
Data scientists often work in silos, unaware of the data assets and projects happening across the enterprise. As a result, work is needlessly recreated, data assets are underutilized, and opportunities to build on the advances of others are lost.
The ability to collaborate with experts across the organization and share existing models and business intelligence assets increases the overall value of data science. Data and insights that once benefitted only a select few are now more valuable because they can be easily identified, recommended, reused, and built upon by other data users and experts.
Alation spurs collaboration by adding context and searchable conversations. Discovering new data assets, analyses, and business intelligence assets, and identifying the people who worked on them, can give data scientists a jump start on data science projects. Once data scientists know who the experts are, Databricks provides a platform for them to share models and easily collaborate to track experiments and build models in real time. Finally, data science results and business intelligence assets can be shared in Alation for visibility across the entire organization, enabling a data-driven culture.
Together, Alation and Databricks enable enterprises to build and maintain a fast and secure cloud environment for data discovery, analytics processing, and collaboration, and to build a data culture.