According to industry reports, enterprise data is projected to grow by as much as 800%, and about 80% of it will be unstructured.
Enterprise data is growing rapidly and spreading with the advent of cloud technologies and cloud collaboration, making it very challenging to secure. The smart strategy is to first identify the data in your enterprise that is valuable and must therefore be secured with steps above and beyond your standard enterprise security posture: essentially, separating the vital few from the trivial many.
Data Classification is the critical step that helps you identify high-value content in your enterprise by categorizing your data into an agreed set of categories that are specific and meaningful to your enterprise. Data Classification drives multiple use cases such as data labelling, sensitive data identification, automated protection, compliance, security, access control and data retention. With so much dependent on this critical step, it is useful to know the current state of the art in Data Classification, and the uses and limitations of these approaches. The three main means to classify data are:
End User Based
Centralized Policy Based
Meta-data Based

End User Based
Enterprises rely on the end user to manually tag the document with the appropriate category and assign a sensitivity level to the document. When done right, and at the time of document creation, this is a very effective method because a person can apply their expert judgement and classify appropriately. For example, AIP (Azure Information Protection) enables end users to classify documents as sensitive and can also limit access to them. This approach relies on a combination of end-user agents, centralized trigger rules and, finally, the end user manually applying tags to documents and communications. End-user based classification has some limitations:
End users need to be trained on classification, which adds significant time and cost
End users may not be able to accurately identify the sensitivity level of the document, thereby increasing data risk
Gaps in labelling are easily introduced since end users may not remember to label/classify the document unless forced to do so through an auto-triggered rule. If an auto-triggered rule does not fire, then a sensitive document may go unclassified by the end-user
Labels may be entered in free-form fields and applied inconsistently, which can lead to label proliferation.
Centralized Policy Based

A centralized policy-based data classification technique entails central rules that are used to classify documents without the need for end-user input. DLP (Data Loss Prevention) and CASB (Cloud Access Security Broker) products generally have some policy-based classification in place. An example policy could be to “Mark all documents that contain the code word Sedona as sensitive”. This approach can easily tag documents that are identified by rules, taking the human element out, but it also has some limitations:
Rules create false positives and false negatives. A false negative is when a sensitive document is identified as non-sensitive and not tagged appropriately. False negatives are especially problematic because they permit transmittal of sensitive documents without knowledge of the end-user
False positives, when documents are incorrectly tagged as sensitive, can cause significant time to be spent on corrective action by the end-user
Complex rules are difficult for administrators to write. For many documents it is not possible to write a rule at all, or a rule written for one purpose contradicts rules written for another, so rule sets become unwieldy to manage
Synonymy (different words that mean the same thing) and polysemy (same word has different meaning based on the context) are common challenges and can trip up many a rule
Rules created for data classification require continuous maintenance and upkeep over time. Active business input is needed for this maintenance, which increases the cost and complexity of the endeavor.
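To make the rule-based approach concrete, here is a minimal sketch of a centralized policy engine built on keyword and pattern rules, using the “Sedona” code-word policy described above. The policy names and the SSN-style pattern are illustrative assumptions, not the API of any particular DLP or CASB product:

```python
import re

# Hypothetical central policy set: each policy name maps to a pattern
# that, when matched, marks the document as sensitive.
POLICIES = {
    "code-word-sedona": re.compile(r"\bSedona\b", re.IGNORECASE),
    "ssn-like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(text: str) -> list[str]:
    """Return the names of all policies the document text triggers."""
    return [name for name, pattern in POLICIES.items() if pattern.search(text)]

def is_sensitive(text: str) -> bool:
    """A document is sensitive if any policy fires."""
    return bool(classify(text))
```

Note how even this tiny example exhibits the limitations listed above: the word-boundary match on “Sedona” would also fire on a travel itinerary mentioning Sedona, Arizona (a false positive), while a document that spells the code word differently would slip through (a false negative).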
Meta-data Based

Organizations also classify data by identifying document sensitivity using the associated meta-data. Some examples of such meta-data are (a) the file folder in which the document is stored, (b) the role of the person who creates or accesses the document, and (c) the title level or organization the person belongs to. An example of meta-data driven classification is marking as sensitive any document that a company executive has access to, or automatically marking as sensitive any document stored in the “Revenue Forecasts” folder. The limitations of this approach are:
A sweeping and broad approach leaves little room for nuance
It is possible to restrict access to legitimate content, thereby creating friction in the business
The quality of the meta-data itself may be suspect (e.g. user’s title, role etc.), which will reflect in the quality of classification.
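A meta-data driven rule of the kind described above can be sketched as follows. The field names, folder names and role values are hypothetical examples chosen to mirror the “Revenue Forecasts” and executive-access rules from the text:

```python
from dataclasses import dataclass

# Hypothetical meta-data record attached to each document.
@dataclass
class DocMeta:
    folder: str       # e.g. the file folder the document is stored in
    owner_role: str   # e.g. the role of the document's creator

SENSITIVE_FOLDERS = {"Revenue Forecasts"}
SENSITIVE_ROLES = {"executive"}

def classify_by_metadata(meta: DocMeta) -> str:
    # Broad, sweeping checks: the document content itself is never inspected,
    # so classification is only as good as the meta-data it relies on.
    if meta.folder in SENSITIVE_FOLDERS or meta.owner_role in SENSITIVE_ROLES:
        return "sensitive"
    return "unclassified"
```

Because the decision rests entirely on the meta-data, a mis-filed revenue document goes unclassified and an executive's lunch menu gets locked down, which is exactly the lack of nuance the limitations above describe.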
Data Classification, when done right, can be a foundational element of meaningful data security. Security leaders must weigh both the capabilities and the limitations of these approaches when assessing the business risk posed by their current data classification strategy, and continually assess the quality of their Data Classification efforts.