By Jason Lim
Published on February 10, 2022
Data classification is necessary for leveraging data effectively and efficiently. Effective data classification helps businesses mitigate risk, maintain governance and compliance, improve efficiency, and better understand and use their data. Let’s discuss what data classification is, the processes for classifying data, data types, and the steps to follow for data classification.
Data classification is the process of analyzing and organizing structured and unstructured data into categories by tagging data based on:
File type
Contents
Metadata
Either completed manually or using automation, the data classification process is based on the data’s context, content, and user discretion.
Companies use data classification to maintain compliance and answer important questions about data. For example, to maintain compliance with data security mandates, organizations need to classify Personally Identifiable Information (PII) appropriately; such labels ensure that sensitive data is being used and accessed appropriately.
Manual data classification is when data owners or users determine and create the data classification policies. It involves:
Entering assets in a data catalog
Determining and setting sensitivity levels
Labeling the asset
Ensuring the asset is being used properly based on its classification
As your organization collects and uses more data, manual data classification becomes overwhelming for data owners. This is why many organizations leverage automated data classification, which augments labeling projects with help from Artificial Intelligence and Machine Learning (AI & ML).
Two kinds of automated classification have grown popular. Real automation uses ML to locate certain categories of data and label them appropriately based on common patterns (for example, some passport numbers are a letter followed by 8 digits, and a U.S. social security number is 9 digits). Since sensitive data types are often interlinked, automation helps analyze and categorize them more efficiently.
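As a rough illustration of pattern-based labeling, the sketch below uses plain regular expressions rather than ML, which is a deliberate simplification. The pattern table, the label names, and the `detect_label` function are hypothetical, and real identifier formats vary by country and issuer.

```python
import re

# Hypothetical patterns for illustration only; real formats vary by issuer.
PATTERNS = {
    # U.S. social security number: 9 digits, dashes optional
    "US_SSN": re.compile(r"^\d{3}-?\d{2}-?\d{4}$"),
    # Assumed passport-like shape: one capital letter followed by 8 or 9 digits
    "PASSPORT_LIKE": re.compile(r"^[A-Z]\d{8,9}$"),
}


def detect_label(values, threshold=0.8):
    """Return the first label whose pattern matches at least
    `threshold` of the non-empty sample values, else None."""
    cleaned = [v.strip() for v in values if v and v.strip()]
    if not cleaned:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.match(v))
        if hits / len(cleaned) >= threshold:
            return label
    return None
```

Sampling a column and requiring most values to match, rather than a single value, is one way to reduce false positives from stray numbers that happen to fit a pattern.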
Hybrid automation entails a person creating a rule to classify data. That rule involves if-then logic to locate data, for example, IF column title = name, THEN apply the appropriate label.
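The if-then rule described above could be sketched as follows. The rule table, the label names, and the `label_columns` function are assumptions made for illustration, not part of any particular catalog product.

```python
# Hypothetical rule table written by a person: column titles mapped to labels.
RULES = [
    ({"name", "first_name", "last_name"}, "PII:NAME"),
    ({"ssn", "social_security_number"}, "PII:SSN"),
    ({"salary", "compensation"}, "CONFIDENTIAL:PAY"),
]


def label_columns(columns):
    """IF a column title matches a rule THEN apply its label.
    Columns that match no rule get None so a person can review them."""
    labels = {}
    for col in columns:
        key = col.strip().lower()
        labels[col] = next(
            (label for titles, label in RULES if key in titles), None
        )
    return labels
```

Returning `None` for unmatched columns keeps a person in the loop for anything the rules do not cover.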
AI & ML can lighten the burden of classification, but it’s important to remember that people will understand context far better. Therefore, hybrid automation, which takes a human-driven approach to automate classification, is often the best approach for those seeking to scale labeling projects, particularly in large data environments.
Data is only as good as your ability to use it effectively. Without knowing what you have, you won’t be able to maximize the value of it. Sensitive data also needs to be protected. Whether from a data security or privacy perspective, you need to make sure that you limit access to personally identifiable information (PII), like names, addresses, telephone numbers, banking information, and social security numbers.
Here are some of the ways data classification makes it easier to leverage information while keeping sensitive data protected:
The more PII you collect, the more carefully you need to manage user access to data. While risk mitigation often focuses on insider threats, privacy laws address both threats and unnecessary visibility. Further, as emerging privacy laws mandate how data can be used, data classification helps you meet these requirements.
With data classification, metadata tags are used to:
Protect sensitive data
Identify data governed by GDPR, CCPA, HIPAA, PCI, SOX, and BCBS
Quarantine data, place legal holds, or meet data archiving requirements
Respond to “Right to Be Forgotten” requests and Data Subject Access Requests (DSARs)
For example, when data users pull employee data from sources and tables, that data might include salary information – which only human resources should be able to see. Data classification plays a crucial role here, by labeling data and masking it based on a specific user’s access rights.
The result? Users can still analyze this information in aggregate: for example, employee names are concealed, yet salary is visible. People aren’t locked out of the data completely, but private details remain private. Analysis can proceed and compliance with privacy laws is maintained.
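A minimal sketch of label-driven masking, assuming each column carries at most one classification label and each user holds a set of labels they may view. The `mask_rows` function, the label names, and the `"***"` placeholder are all hypothetical.

```python
MASK = "***"  # placeholder shown in place of a concealed value


def mask_rows(rows, column_labels, allowed_labels):
    """Return copies of `rows` in which any labeled column the viewer
    is not cleared for is replaced by MASK. Unlabeled columns pass through."""
    masked = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            label = column_labels.get(col)
            visible = label is None or label in allowed_labels
            new_row[col] = value if visible else MASK
        masked.append(new_row)
    return masked
```

For instance, an analyst cleared for a `"PAY"` label but not `"PII"` would see salaries with names concealed, which still allows aggregate analysis.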
Beyond compliance, data classification can maximize the return on investment of your analytics tools.
Time savings is one example. Data classification helps business users discover data and understand how to use it quickly. Well-labeled data can also improve business operations and simplify the role of a data analyst.
Operational efficiency is also supported by classification. Organized tagging and categorization enables efficient access for data users. Tagging also makes it easier for people to find what they are looking for without compromising data security.
By engaging in this primary analysis, you learn what data you have and where it resides. These are important details, not just to maintain compliance, but also to democratize access appropriately.
As you begin to migrate data to the cloud, you need a way to understand what data is most critical to your operations. Data classification helps you rank the most used data, which is a reliable indicator of the most useful data. Visibility into your most useful data empowers you to ensure that only your most valuable data is migrated to the cloud.
By migrating the data that matters most, leaders broaden access to the most high-value data – and drive smarter, more efficient analysis for more people. A cloud migration strategy should begin with the questions, “Which data is most valuable to migrate?” and “What data can be migrated to the cloud?” This strategy enables leaders to also cut costs and mitigate risk in the cloud down the line.
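One way to approach those two questions is to rank tables by how often they are queried and to skip anything whose classification rules out migration. The query-log shape, the `"RESTRICTED"` label, and the `migration_candidates` function below are assumptions for illustration.

```python
from collections import Counter


def migration_candidates(query_log, table_labels,
                         blocked=frozenset({"RESTRICTED"}), top_n=3):
    """Rank tables by query count, most used first, leaving out tables
    whose classification label is in `blocked`."""
    counts = Counter(
        entry["table"]
        for entry in query_log
        if table_labels.get(entry["table"]) not in blocked
    )
    return [table for table, _ in counts.most_common(top_n)]
```

Usage counts are only a proxy for value, so a ranking like this is a starting point for review rather than a final migration list.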
For many organizations, security and compliance are primary data classification drivers. You can’t protect what you don’t know you have or where you have it. Financially motivated cybercriminals may try to steal sensitive data; data classification helps to mitigate that risk.
Data classification does this by:
Labeling sensitive data so you can limit access and mask it if necessary.
Reducing the number of locations where you store sensitive information to reduce the attack surface.
Giving you a way to integrate sensitive data types into your data loss prevention and other policy-enforcing applications.
For example, if you’ve appropriately classified sensitive information, it’s easier to apply role-based and attribute-based controls to that data to limit access. Typically, attribute-based controls incorporate the user’s role, geographic location, and the data’s sensitivity level.
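An attribute-based check along those lines might look like the sketch below. The `User` and `Asset` types, their field names, and the three-level ordering are hypothetical and only meant to show how role, location, and sensitivity can be combined.

```python
from dataclasses import dataclass

# Assumed ordering of sensitivity levels, lowest to highest.
LEVELS = {"low": 0, "medium": 1, "high": 2}


@dataclass(frozen=True)
class User:
    role: str
    region: str
    clearance: str  # highest sensitivity level this user may read


@dataclass(frozen=True)
class Asset:
    sensitivity: str
    allowed_roles: frozenset
    allowed_regions: frozenset


def can_access(user, asset):
    """Grant access only when role, region, and clearance all satisfy
    the asset's classification."""
    return (
        user.role in asset.allowed_roles
        and user.region in asset.allowed_regions
        and LEVELS[user.clearance] >= LEVELS[asset.sensitivity]
    )
```

Because every condition must hold, a user with the right role but the wrong region, or insufficient clearance, is still denied.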
Why categorize data? The answer is often linked to your reasons for engaging in the process in the first place. But nowadays, as multiple uses for data grow, there is no “one-size-fits-all” approach. Across the industry, three traditional approaches to data classification are viewed as best practices. However, you may want to use one, two, or even all three, depending on your data strategy.
Content-based classification addresses security or privacy compliance mandates. This process inspects files for sensitive, personal, and confidential information. Then, it labels them accordingly. (This is the primary approach for compliance).
For example, under the Health Insurance Portability and Accountability Act (HIPAA), a hospital must label electronic protected health information (ePHI) to protect it from unauthorized access.
Context-based classification focuses on how the information is being used and who uses it. This process focuses on application, location, or the creator to help determine whether the information is sensitive.
For example, local regulations increasingly mandate how data can be used within a specific geographic region. A multinational corporation may tag the data generated in Iceland with its origin, so users of that data know they must comply with stringent local, Icelandic law.
User-based classification relies on manual processes because people often have unique knowledge and use their discretion to classify the data. With user-based data classification, users review documents, files, or databases, and then categorize them based on their personal judgment. However, it’s important to set the appropriate permissions or tracking so that you maintain control over how they classify it.
For example, if a data steward must retroactively label data as sensitive, they can use lineage capabilities in a data catalog to track related data and label the sensitive data accordingly.
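Retroactive labeling through lineage can be pictured as a graph traversal: label one asset, then follow the edges to everything derived from it. The lineage mapping and the `propagate_label` function are hypothetical and do not represent any specific catalog's API.

```python
from collections import deque


def propagate_label(lineage, start, label, labels=None):
    """Label `start` and every asset downstream of it.

    `lineage` maps an asset name to the assets derived from it.
    Returns a new dict; the `labels` argument is not modified.
    """
    result = dict(labels or {})
    queue, seen = deque([start]), {start}
    while queue:  # breadth-first walk over downstream edges
        node = queue.popleft()
        result[node] = label
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return result
```

The `seen` set keeps the walk from revisiting assets that are reachable by more than one path.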
If you’re embarking on your data classification journey, then you need to create and follow a process. The data classification process ensures that you have defined goals, standard naming conventions, and prioritized workflows. Here are the steps your organization needs to follow when classifying data:
The first step to any process is understanding your business goals. Classifying data should have a purpose. You might have more than one end business goal which will inform how you set out your process. For example, you might need to meet compliance requirements and want to streamline your data analysts’ work.
The next step in the process is to assess risk. You need to identify the data that you collect to appropriately set sensitivity and risk levels. Your classification or sensitivity levels can be low, medium, or high. Low sensitivity data is usually anything that’s intended to be public. Medium sensitivity data might be emails between employees and external recipients. High sensitivity data is usually data that falls into a protected category, like PII. Classification definitions should be well defined and easy to understand.
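To make level definitions easy to apply consistently, they can be written down as a small lookup. The category names and the choice to default unknown data to medium are assumptions in this sketch, not a recommendation from any standard.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping from data category to sensitivity level.
CATEGORY_LEVELS = {
    "public": Sensitivity.LOW,
    "external_email": Sensitivity.MEDIUM,
    "pii": Sensitivity.HIGH,
}


def assess(categories, default=Sensitivity.MEDIUM):
    """An asset takes the highest level among its categories.
    Unknown or missing categories fall back to `default`."""
    if not categories:
        return default
    return max(CATEGORY_LEVELS.get(c, default) for c in categories)
```

Taking the maximum means a single high-sensitivity field raises the level of the whole asset.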
With the business objectives and data types identified, you’re ready to start creating processes for maintaining your data classification initiatives. This means scanning data to apply the classifications as well as the order in which data should be scanned. You also need to consider when to automate scanning and what data should be user classified.
Depending on the data you collect, you might have multiple categories of sensitive information, and overlapping rules may be at play. We’ve all heard the saying, “All squares are rectangles, but not all rectangles are squares,” and similar rules are likely to arise in this step.
For example, some PII is PHI, but not all PHI is PII. A social security number is PII, and under HIPAA it is also treated as PHI when it is tied to a patient’s health record. Meanwhile, a medical record number is definitely PHI, but may not fall under the more general term of PII for other regulations.
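Overlapping categories can be modeled as sets that share members. The field names and category contents below are illustrative assumptions, not a legal definition of either term.

```python
# Hypothetical category membership; a field may belong to more than one.
CATEGORY_MEMBERS = {
    "PII": {"name", "address", "ssn"},
    "PHI": {"ssn", "medical_record_number", "diagnosis"},
}


def categories_for(field):
    """Return every category a field belongs to, in alphabetical order."""
    return sorted(
        category
        for category, members in CATEGORY_MEMBERS.items()
        if field in members
    )
```

Returning a list rather than a single label lets downstream policies apply the rules of every category that covers a field.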
Determining categories for classification supports compliance. If you’ve identified the types of data that need to be protected under a regulation, then you can use data classification to help meet those requirements.
Identifying usage focuses on who should have access to data, and how they should use it. This means you need to define how sensitive information can be used within the organization so that you can set the appropriate compliance controls around it — like masking data or restricting access entirely (based on a user’s permissions).
Data classification isn’t just a one-and-done process. You need to create streamlined workflows that allow you to classify data appropriately as you collect it and discover new data that may not yet be classified.
Data classification is a fundamental step to having a robust data governance strategy. To appropriately use your data, you need to know what you have and how it can be used. Data governance is the strategy for use and data classification is the foundation that tells you what data needs to be included and how. Data classification enables you to incorporate the following into your data governance strategy:
Maintain data integrity by knowing what information you have and who uses it.
Identify the processes for setting rules, procedures, and analytics within the boundaries of compliance mandates.
Secure sensitive data as required by law and industry standards by knowing what data needs to be protected.
Data classification is also fundamental to a successful data fabric or data mesh. Here are three core pillars of fabric and mesh in which classification plays an important role.
Automated provisioning: If a user is blocked from accessing data they need, what do they do? Automated provisioning describes the process in which a person can request access within the data catalog; that access is automatically denied or granted based on a range of questions. A data fabric will integrate governance with IT to answer those questions (such as, where is the person located? What is their use case? And how is this data classified?)
Data as a product for producers: A data mesh emphasizes data as a product, and positions those who distribute data as data producers. Classification communicates vital details for those data producers who must deliver it as a strategic asset. For consumers, classification details enable them to quickly understand an asset and determine its trustworthiness.
Self-service analytics: Analysts working independently may leverage a data mesh to access data on a self-serve data infrastructure platform. In this case, classification helps the analyst explore data products, curate quality data, and manage security policies appropriately.
Alation helps customers gain the most out of their data with data classification and data governance strategies that give them a holistic view of data to drive analytics.
Hulu uses Alation’s data catalog to gain the most value from viewership, content, and engagement data by making it more accessible to data users. With Alation’s data catalog, Hulu is able to build the foundation for accessibility and trust that enables collaboration, data stewardship, and data democratization at scale.
Texas Mutual leverages Alation’s modern data catalog platform to make smarter choices faster. With new pressures facing insurers, Texas Mutual uses Alation and Snowflake for consolidated, actionable, and trusted information. Creating an information architecture enables data intelligence to reduce the delivery time of key business dashboards by 80%.
With the Alation Data Catalog and the ability to manage PII Compliance, you can automate the mapping and categorization of data to ensure continued data classification and discovery. Companies get the insights they need to establish data governance workflows, processes, and self-service analytics within the Alation platform, all while maintaining control of sensitive data.
Sensitive data is classified and tagged to meet evolving compliance requirements so that privacy controls can be maintained. With our Data Governance App, companies ingest policies from Snowflake, then implement and enforce the classification in the data catalog, masking sensitive data to ensure continued compliance. Additionally, Alation’s partners enable rapid data classification and rule application within our catalog, streamlining compliance activities and reducing operational costs.
For more information, request a free demo to learn how your organization can optimize the governance of your sensitive data.