In many DLP systems things are still roughly the same, only the "gray area" is populated not by dragons and mermaids but by false negative triggers (FNTs), illegitimate business processes, and false positive triggers (FPTs), that is, "incidents" that are not incidents at all. To sail safely, you need to know where you are sailing.
A next-generation DLP system differs from classic ones in that it leaves no mysteries in the form of FNTs, FPTs, and "gray areas." At the event we will dig into the topic of FNTs and FPTs and share recommendations on technologies and methods for working effectively with the "gray area" of information flows, so that you can fight leaks rather than false triggers.
- Why FNTs and FPTs appear in the first place, and how vendors of classic DLP systems propose to solve this problem, which is rarely discussed in the market
- How to leave false triggers in the past: technologies, automation, and ML for keeping security policies up to date, using the InfoWatch Traffic Monitor DLP system as an example.
- Machine learning capabilities for clustering documents versus manual analysis of the "gray area."
The event is led by Alexander Klevtsov, Product Development Manager for InfoWatch Traffic Monitor at InfoWatch. He talks about FNTs and FPTs, that is, false negative and false positive triggers of a DLP system: why it is so important to dig into how a DLP system is actually built, and what can serve as a metric of how effectively you work with it. We will talk about technology and details; we will not talk about methodology or security culture, only about these two specific aspects. And here we have a wonderful slide: how an artist might picture the problems of a DLP system, namely false positive and false negative triggers.
A false positive trigger, when we talk about DLP, is when the system considered something a violation that is not a violation: it triggered on some event, some fact, treating it as a violation, but there was no violation. A false negative trigger is when the system missed something: an incident occurred, something bad happened, but the DLP system remained silent.
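The talk does not name concrete formulas, but one standard way to turn FPT and FNT counts into the effectiveness metric mentioned above is precision and recall. A minimal sketch, with purely illustrative numbers:

```python
def dlp_effectiveness(true_positives: int, false_positives: int, false_negatives: int):
    """Precision: share of DLP alerts that are real violations (low precision = many FPTs).
    Recall: share of real violations that DLP caught (low recall = many FNTs)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only: 80 confirmed incidents caught, 40 false alerts, 20 missed incidents.
print(dlp_effectiveness(80, 40, 20))   # -> (0.666..., 0.8, 0.727...)
```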
Here I will try to knock a few idols off their pedestal and question three basic DLP technologies: regular expressions, digital fingerprints, and linguistic dictionaries.
Speaking of FNTs and FPTs, I have one recipe for dealing with them: the transition from content analysis to contextual analysis. Content analysis is when the system can detect a credit card number, understand that a given correspondence is about logistics, and correctly determine the topic and structure of the data being sent. Contextual analysis, in addition to detecting the data, also understands the business context. Take an elementary example: a surname, first name, patronymic, and phone number may simply be a signature in a letter, or someone may have sent a colleague a contact to work with. But it may also be the contact of a key client. It is important to understand not merely that this is a phone number and a full name, but that it is, for example, the phone number and full name of a VIP client. The entire event is about exactly this: how to move from content analysis, where we simply recognize data structures belonging to one category or another, to contextual analysis, where we understand not only the category of the data but also its value and importance for the business, taking the document flow or business process into account.
Regular expressions
So, the transition from content analysis to contextual analysis, framed as a re-examination of these three basic DLP technologies and of what they should become in order to support contextual rather than purely content analysis.
The first analysis technology we will consider, and criticize, one that has become a classic of the DLP market, is regular expressions. Everyone is familiar with them: a construction that can pick out some identifier, a card number, a phone number, an account number, a TIN. From the point of view of contextual analysis, a regular expression is the weakest tool. As in the example above, we need to distinguish a signature in a letter from the data of a client, or even of an employee who does not work in the client department, does not interact with counterparties, and is not a public figure. Regular expressions can determine that something is a surname-name-patronymic, an e-mail address, a phone number, or even a personnel number, but they have no idea whether that full name belongs to an employee or a client, or is just a mention of some counterparty.

We have a technology for protecting client databases, nomenclature databases, and employee databases that does understand the business context. It achieves this by integrating with a CRM, core banking (ABS), or ERP system, extracting data from there and checking each message and each letter: is this employee data, is this client data, or is this particular name, for example, a mention from the nomenclature database? This gives a clear view of the business context: the thing detected is not just some structure, some number, some surname, but specifically a client, an employee, a partner, or some other counterparty. The technology extracts data from the business system, where the context lives, and then compares that data against every intercepted message. You can say for certain that this is the client's phone number, and that is the fundamental difference between content analysis and contextual analysis.
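The talk does not show the actual Traffic Monitor implementation. As a rough illustration of the difference, here is a minimal sketch that first matches a phone-number pattern with a regular expression (content analysis) and then checks it against a hypothetical client list exported from a CRM (contextual analysis); all reference data is made up:

```python
import re

# Hypothetical reference data exported from a CRM/ERP system (not a real API).
CLIENT_PHONES = {"+7 495 123-45-67", "+7 812 765-43-21"}
VIP_CLIENTS = {"+7 495 123-45-67"}

PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{8,}\d")

def normalize(phone: str) -> str:
    """Keep digits only so formatting differences don't matter."""
    return re.sub(r"\D", "", phone)

CLIENT_INDEX = {normalize(p) for p in CLIENT_PHONES}
VIP_INDEX = {normalize(p) for p in VIP_CLIENTS}

def classify(message: str) -> str:
    """Content analysis finds phone-like strings; contextual analysis decides whose phone it is."""
    for match in PHONE_RE.findall(message):
        digits = normalize(match)
        if digits in VIP_INDEX:
            return "phone number of a VIP client"      # business context is known
        if digits in CLIENT_INDEX:
            return "phone number of a client"
    return "no client data found (perhaps just a signature)"

print(classify("Call me back: +7 (495) 123-45-67"))
```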
Digital fingerprints
When a digital fingerprint is taken from a specific document, the system can then identify that document, or a fragment of it, wherever it appears. How do we move from content analysis to contextual here? The market has settled on the convention that digital fingerprints are either text fingerprints or binary ones. For digital fingerprints to become more sensitive and more useful for detecting official and confidential information, they must cover the widest possible range of data types.
To be effective, good digital fingerprints must understand not only text and streamed information such as audio and video files, but also raster images.
We had a client whose reportage photos were being stolen straight from the scene.
Their goal was to catch any mention of these photos in traffic. It did not matter whether the photo had been converted, say from RAF to JPEG or from RAF to PNG, or whether its resolution and pixel count had been changed: the system still had to detect it. We have a digital fingerprint technology built into the DLP that catches such photos even if they have been slightly cropped, flipped, or converted to another format.
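The exact fingerprinting algorithm is not disclosed in the talk. A common way to get format- and resolution-independent image fingerprints is perceptual hashing; a minimal sketch using the Pillow and imagehash libraries, with made-up file names:

```python
from PIL import Image            # pip install pillow imagehash
import imagehash

def image_fingerprint(path: str) -> imagehash.ImageHash:
    """Perceptual hash: stable across format conversion and resizing."""
    return imagehash.phash(Image.open(path))

def looks_like_protected(candidate: str, protected_hashes, max_distance: int = 8) -> bool:
    """A small Hamming distance between hashes means 'probably the same picture'.
    Catching mirrored copies would additionally require hashing flipped variants."""
    h = image_fingerprint(candidate)
    return any(h - p <= max_distance for p in protected_hashes)

# Build the fingerprint database from the originals (hypothetical file names).
protected = [image_fingerprint(p) for p in ["scene_photo_01.png", "scene_photo_02.jpg"]]
print(looks_like_protected("intercepted_attachment.jpg", protected))
```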
There is also a separate module that makes digital fingerprints from vector images, from CAD files. For example, we load a very rough drawing of a tank into the system, and if someone forwards even part of that drawing, the system still understands that it is part of a confidential drawing. The analysis works on graphic primitives: points, lines, curves, and the relationships between them. It is not just a comparison of binary data, but an actual understanding of what is depicted in the vector fingerprint.
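The patented primitive-level analysis itself is not described in the talk. As a toy illustration of matching a drawing fragment by its primitives rather than by raw bytes, here is a minimal sketch that compares translation-invariant signatures of line segments; the real technology also handles curves, relationships between primitives, and much more:

```python
from math import atan2, hypot
from collections import Counter

# A toy "drawing" is a list of line segments ((x1, y1), (x2, y2)).
Segment = tuple[tuple[float, float], tuple[float, float]]

def signature(seg: Segment) -> tuple:
    """Translation-invariant signature of one primitive: rounded length and angle."""
    (x1, y1), (x2, y2) = seg
    return (round(hypot(x2 - x1, y2 - y1), 2), round(atan2(y2 - y1, x2 - x1), 2))

def fragment_overlap(fragment: list[Segment], original: list[Segment]) -> float:
    """Fraction of the fragment's primitives that also occur in the protected drawing."""
    frag, orig = Counter(map(signature, fragment)), Counter(map(signature, original))
    matched = sum(min(count, orig[sig]) for sig, count in frag.items())
    return matched / sum(frag.values()) if frag else 0.0

# Toy example: the leak contains two of the three segments of the protected drawing, shifted.
original = [((0, 0), (10, 0)), ((10, 0), (10, 5)), ((10, 5), (0, 0))]
leak = [((100, 100), (110, 100)), ((110, 100), (110, 105))]
print(fragment_overlap(leak, original))   # -> 1.0, so the fragment would be flagged
```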
And how do we push digital fingerprint technology from content analysis to contextual? Naturally, fingerprints must be integrated with electronic document management systems, with all the repositories where the originals of the documents we want to protect are stored. This entire fingerprint database must be kept up to date: there must be constant synchronization and updating of fingerprints. Traffic Monitor has a special API for this, which keeps all digital fingerprints current and integrated with the electronic document management system.
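The specific Traffic Monitor API calls are not shown in the talk. The general shape of such a synchronization job might look like this minimal sketch, where both endpoints and field names are hypothetical placeholders, not the real product API:

```python
import time
import requests  # pip install requests

# Hypothetical placeholders, not the real Traffic Monitor or DMS endpoints.
DMS_CHANGES_URL = "https://dms.example.local/api/documents/changed"
DLP_FINGERPRINT_URL = "https://dlp.example.local/api/fingerprints"

def sync_fingerprints(since: float) -> float:
    """Pull documents changed in the DMS since the last run and re-register their fingerprints."""
    changed = requests.get(DMS_CHANGES_URL, params={"since": since}, timeout=30).json()
    for doc in changed:
        content = requests.get(doc["download_url"], timeout=30).content
        requests.post(DLP_FINGERPRINT_URL,
                      files={"file": (doc["name"], content)},
                      data={"document_id": doc["id"]},
                      timeout=60)
    return time.time()

last_run = 0.0
while True:                      # simple periodic job; a real deployment would use a scheduler
    last_run = sync_fingerprints(last_run)
    time.sleep(3600)
```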
Linguistic dictionaries
Now we move on to another technology that is a classic of the DLP market: linguistic dictionaries.
How do we push this dictionary technology from content analysis to contextual, so that it takes all the nuances into account? Here machine learning comes to our aid, because creating dictionaries is a very costly business. Building an effective dictionary, one that really detects a particular category of information reliably, takes a professional linguist five to seven working days: a full-fledged linguistic model has to be compiled, and it may contain a couple of hundred terms. Such a dictionary is hard to create. Machine learning cuts the labor cost: you feed in example documents, and in a minute a new dictionary is ready, just as effective as if a professional linguist had built it. Once dictionaries are cheap and fast to create, we can generate them by the dozen, experiment, build one today, see that it does not work, and build or retrain another one tomorrow, and this lets us reflect every category of data the company has granularly in the policy. For example, you can take the out-of-the-box dictionary on logistics, disable it, and instead create a dictionary for "logistics of consumables" or "logistics of goods supply" based on your own documents, or a dictionary that precisely captures commercial proposals. Because artificial intelligence and machine learning make dictionaries cheap to create, we can generate as many of them as we want, and they will reflect all the nuances of your document flow and business processes. A security officer can constantly rework them and, I repeat, create a separate dictionary for each type of contract, reflecting each category of data granularly in the policy.
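The actual dictionary-generation algorithm is not described in the talk. One simple way to get a term dictionary from example documents is TF-IDF keyword extraction, shown in this minimal sketch with scikit-learn; the category and the example texts are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn

def build_dictionary(example_docs: list[str], top_n: int = 20) -> list[str]:
    """Pick the terms that best characterize this category of documents.
    A production system would also weigh terms against a background corpus."""
    vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
    tfidf = vec.fit_transform(example_docs)
    scores = tfidf.sum(axis=0).A1            # aggregate term weight over the examples
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)
    return [term for term, _ in ranked[:top_n]]

# Hypothetical examples of one document category ("logistics of consumables").
examples = [
    "Delivery schedule for printer cartridges and office paper, warehouse pickup Friday.",
    "Consumables purchase order: toner, labels, packaging film; freight by road.",
]
print(build_dictionary(examples, top_n=10))
```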
And even when new, previously unaccounted-for categories of data appear, we can respond to them quickly. That is why machine learning and artificial intelligence make it possible to reflect the business context in the policy more precisely.
Now we move on to the most unpleasant aspect of operating a DLP system – false negative triggers.
In fact, this is a real scourge that it is not customary to talk about in the DLP market; an indecent, one might say forbidden, topic. What is the problem? DLP misses incidents not because a lazy security officer failed to account for something or did not update the policies in time. The problem with a false negative trigger is that the security officer may simply not know that some information asset has appeared in his traffic which the policies do not cover at all.
Accordingly, neither he nor the DLP system, which he has not trained for it, knows about this asset. Missed incidents, false negative triggers, are not even a problem of the DLP system itself; the problem is that the security officer is so overloaded that he has no time to review everything manually and regularly analyze the events the DLP system has not flagged. It is simply a problem of human resources and enormous labor costs. The only tool for dealing with false negatives has been to open the DLP system and methodically review, by hand and by eye, the events it has not processed, looking for a needle in a haystack.
One of our customers described a standard practice of theirs: when an employee submits a resignation letter, they start methodically reviewing all of his events, especially forwarding to personal mail, copying to a flash drive, and so on. And what does this customer say? Every now and then they discover some category of data, some document structure, some new form they had not accounted for in the policies, and the employee had already leaked several such documents within a week. Then they start refining the policies. It turns out that, in current realities, DLP has only one weapon against false negatives: luck. If we are lucky while reviewing events and incidents "manually", we will find that unaccounted-for information asset and turn it into a policy. If we are not lucky, we will remain in blissful ignorance and never know that something is leaking. What can we offer in this context? About a year or a year and a half ago, we introduced a machine learning technology that clusters data unknown to, and unmarked by, the DLP system.
What does it mean to cluster? The technology takes the entire gray area, everything the DLP system has not classified, divides the documents into piles, forms an annotation for each pile, and highlights the most striking example documents. Say you are at the implementation stage: the system is running, but there is not a single trigger and not a single policy yet.
You launch the tool, it is called Data Explorer, and it sorts all the unknown documents into piles for you, creates an annotation for each pile, and highlights the most striking examples. You can quickly see and understand what categories of documents move through your traffic.
How does this technology work? Imagine you are a security officer tasked with analyzing the gray area, this mass of documents the DLP system has not processed. Imagine sitting at a table while thousands, tens of thousands of documents are dumped out of a basket in front of you, and you are told to figure out what categories of information are in there. That is what a security officer does when he tries to analyze the gray area "manually": the documents carry no markings in DLP, so he starts rummaging through them chaotically, by hand and by eye.
The machine learning technology behaves differently. Imagine that instead of one heap you are handed neat stacks of stapled documents: here are contracts, here are consignment notes, here are documents related to employees. A yellow sticky note is attached to each stack listing the key terms it contains, and the most striking and diverse documents representing the stack are placed on top of it.
The machine learning technology decomposes data unknown both to the person and to the DLP system into such stacks and annotates them. You can then take a stack and feed it to the other machine learning technology, which will build a new dictionary from it in a minute.
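The talk does not reveal how Data Explorer is implemented. The general idea of clustering plus per-pile annotation can be sketched with TF-IDF and k-means in scikit-learn; the document texts below are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_gray_area(docs: list[str], n_clusters: int = 3, top_terms: int = 5):
    """Split unclassified documents into piles, annotate each pile with its key terms,
    and pick the document closest to the cluster center as the representative example."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    piles = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        annotation = [terms[i] for i in km.cluster_centers_[c].argsort()[::-1][:top_terms]]
        diffs = X[idx].toarray() - km.cluster_centers_[c]
        representative = docs[idx[int(np.argmin((diffs ** 2).sum(axis=1)))]]
        piles.append({"size": len(idx), "annotation": annotation, "example": representative})
    return piles

docs = ["merger agreement draft with the bank", "payment schedule for the merger deal",
        "office cleaning schedule", "cleaning supplies order",
        "vacation request form", "employee vacation schedule"]
for pile in cluster_gray_area(docs, n_clusters=3):
    print(pile["size"], pile["annotation"], "|", pile["example"])
```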
In this case we not only move from content to context, we actually come to understand that context.
One customer faced a problem: a certain group of employees was taking documents outside the company perimeter en masse, four thousand documents in total. DLP did not process these documents in any way, that is, it did not consider them confidential. They began studying the documents carefully; it turned out to be some category of information, but whether it was confidential was unclear. The customer estimated that analyzing these documents manually would have taken two security officers 30 hours. Instead, he used the machine learning technology, which analyzed the four thousand documents in an hour and a half. It turned out that these were documents for preparing mergers, certain VIP deals, and so on, which had not been reflected in the policies at all. They subsequently reflected these information assets in the policy. First, they realized that this was not a violation but preparation for a deal. Second, they obtained this information asset cheaply, spending only an hour and a half of research instead of 30 hours, and then used machine learning to create dictionaries and add them to the policy. The customer now reasons in terms of labor costs: how much he spent, how much he can and cannot afford to spend on servicing a deal, and how to justify to management the effort of analyzing gray-zone events.
The second case is a financial organization that used the technology I contrasted with regular expressions earlier, the one that, unlike regular expressions, understands not just the content but the context of the data. The company has 3,000 employees, has been on the market for 20 years, and has on the order of 100,000 clients. The customer's task was to verify that client data, statements, commercial offers, and similar correspondence are sent only to the specific client they belong to. If I send a commercial offer to a partner or client but mistakenly enter the wrong address, that should be treated as a violation. Using this technology, the customer implemented an automatic policy containing hundreds of thousands of rules. Here the understanding of data context is used to the fullest: first we determine which specific client the data belongs to, then we automatically check that the data is being sent to that client. Moreover, the technology works inline, in blocking mode: if a letter is addressed to the wrong client, whether by mistake or with incorrect personal data in it, the sending is blocked.
This is achieved through patented indexing and search algorithms, and the technology is highly performant. A small example: with a fingerprint of a client database of 10 million records, checking any message for a mention of any client takes a tenth of a second.
We integrated the Traffic Monitor DLP system with the customer's business system, pulled client data from it, determined which client each piece of data belonged to, and blocked the transmission whenever a portion of data was addressed to the wrong counterparty.
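The patented indexing algorithm is not described in the talk. A rough illustration of the ownership-plus-recipient check is this minimal sketch, in which the client records, the token index, and the blocking decision are simplified stand-ins:

```python
import re

# Hypothetical client records exported from the customer's business system.
CLIENTS = [
    {"id": 1, "name": "Ivan Petrov", "phone": "+7 495 111-22-33", "emails": {"i.petrov@example.com"}},
    {"id": 2, "name": "Anna Sidorova", "phone": "+7 812 444-55-66", "emails": {"a.sidorova@example.com"}},
]

def normalize(value: str) -> str:
    return re.sub(r"\W", "", value).lower()

# Hash index: normalized attribute -> owning client id. Each lookup is a constant-time probe,
# which is why even a multi-million-record database can be checked in fractions of a second.
ATTR_INDEX = {}
for c in CLIENTS:
    for attr in (c["name"], c["phone"], *c["emails"]):
        ATTR_INDEX[normalize(attr)] = c["id"]

def ngrams(words: list[str], max_n: int = 3):
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def check_message(body: str, recipient: str) -> str:
    """Block if the message contains one client's data but is addressed to someone else."""
    words = re.findall(r"[\w@.+()-]+", body)
    owners = {ATTR_INDEX[normalize(g)] for g in ngrams(words) if normalize(g) in ATTR_INDEX}
    for owner_id in owners:
        owner = next(c for c in CLIENTS if c["id"] == owner_id)
        if recipient.lower() not in owner["emails"]:
            return "BLOCK: client data addressed to the wrong counterparty"
    return "ALLOW"

print(check_message("Statement for Ivan Petrov, phone +7 495 111-22-33", "a.sidorova@example.com"))
```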
For this technology we received a national banking award in the category "Best personal data protection system for the banking sector".