In many DLP systems things are still roughly the same, only the "gray area" is populated not by dragons and mermaids but by false negative triggers (FNTs), illegitimate business processes, and false positive triggers (FPTs), that is, "incidents" that are not incidents at all. To sail safely, you need to know where you are sailing.
A next-generation DLP system differs from classic ones in that it leaves no mysteries in the form of FNTs, FPTs, and "gray areas." At the event we will dig into the topic of FNTs and FPTs and share recommendations on technologies and methods for working effectively with the "gray area" of information flows, so that you can fight leaks rather than false triggers.
- Why FNTs and FPTs appear in the first place, and how vendors of classic DLP systems propose to solve this problem, which is rarely discussed in the market
- How to leave false triggers in the past: technologies, automation, and ML for keeping security policies up to date, using the InfoWatch Traffic Monitor DLP system as an example.
- Machine learning capabilities for clustering documents versus manual analysis of the "gray area."
The event is led by Alexander Klevtsov, Product Development Manager for InfoWatch Traffic Monitor at InfoWatch. He talks about FNTs and FPTs, that is, false negative and false positive triggers of a DLP system: why it is so important to dig into how a DLP system is actually built, and what can serve as a metric of how effectively you work with it. We will talk about technology and details; we will not talk about methodology or security culture, only about these two specific aspects. And here we have a wonderful slide: how an artist might picture the problems of a DLP system, namely false positive and false negative triggers.
A false positive trigger, when we talk about DLP, is when the system considered something a violation that is not a violation: it triggered on some event, some fact, treating it as a violation, but there was no violation. A false negative trigger is when the system missed something: an incident occurred, something bad happened, but the DLP system remained silent.
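The talk does not name concrete formulas, but one standard way to turn FPT and FNT counts into the effectiveness metric mentioned above is precision and recall. A minimal sketch, with purely illustrative numbers:

```python
def dlp_effectiveness(true_positives: int, false_positives: int, false_negatives: int):
    """Precision: share of DLP alerts that are real violations (low precision = many FPTs).
    Recall: share of real violations that DLP caught (low recall = many FNTs)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only: 80 confirmed incidents caught, 40 false alerts, 20 missed incidents.
print(dlp_effectiveness(80, 40, 20))   # -> (0.666..., 0.8, 0.727...)
```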
Here I will try to knock a few idols off their pedestal and question three basic DLP technologies: regular expressions, digital fingerprints, and linguistic dictionaries.
Speaking of FNTs and FPTs, I have one recipe for dealing with them: the transition from content analysis to contextual analysis. Content analysis is when the system can detect a credit card number, understand that a given correspondence is about logistics, and correctly determine the topic and structure of the data being sent. Contextual analysis, in addition to detecting the data, also understands the business context. Take an elementary example: a surname, first name, patronymic, and phone number may simply be a signature in a letter, or someone may have sent a colleague a contact to work with. But it may also be the contact of a key client. It is important to understand not merely that this is a phone number and a full name, but that it is, for example, the phone number and full name of a VIP client. The entire event is about exactly this: how to move from content analysis, where we simply recognize data structures belonging to one category or another, to contextual analysis, where we understand not only the category of the data but also its value and importance for the business, taking the document flow or business process into account.
Regular expressions
So, the transition from content analysis to contextual analysis, framed as a re-examination of these three basic DLP technologies and of what they should become in order to support contextual rather than purely content analysis.
The first analysis technology we will consider, and criticize, one that has become a classic of the DLP market, is regular expressions. Everyone is familiar with them: a construction that can pick out some identifier, a card number, a phone number, an account number, a TIN. From the point of view of contextual analysis, a regular expression is the weakest tool. As in the example above, we need to distinguish a signature in a letter from the data of a client, or even of an employee who does not work in the client department, does not interact with counterparties, and is not a public figure. Regular expressions can determine that something is a surname-name-patronymic, an e-mail address, a phone number, or even a personnel number, but they have no idea whether that full name belongs to an employee or a client, or is just a mention of some counterparty.

We have a technology for protecting client databases, nomenclature databases, and employee databases that does understand the business context. It achieves this by integrating with a CRM, core banking (ABS), or ERP system, extracting data from there and checking each message and each letter: is this employee data, is this client data, or is this particular name, for example, a mention from the nomenclature database? This gives a clear view of the business context: the thing detected is not just some structure, some number, some surname, but specifically a client, an employee, a partner, or some other counterparty. The technology extracts data from the business system, where the context lives, and then compares that data against every intercepted message. You can say for certain that this is the client's phone number, and that is the fundamental difference between content analysis and contextual analysis.
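The talk does not show the actual Traffic Monitor implementation. As a rough illustration of the difference, here is a minimal sketch that first matches a phone-number pattern with a regular expression (content analysis) and then checks it against a hypothetical client list exported from a CRM (contextual analysis); all reference data is made up:

```python
import re

# Hypothetical reference data exported from a CRM/ERP system (not a real API).
CLIENT_PHONES = {"+7 495 123-45-67", "+7 812 765-43-21"}
VIP_CLIENTS = {"+7 495 123-45-67"}

PHONE_RE = re.compile(r"\+?\d[\d\s\-()]{8,}\d")

def normalize(phone: str) -> str:
    """Keep digits only so formatting differences don't matter."""
    return re.sub(r"\D", "", phone)

CLIENT_INDEX = {normalize(p) for p in CLIENT_PHONES}
VIP_INDEX = {normalize(p) for p in VIP_CLIENTS}

def classify(message: str) -> str:
    """Content analysis finds phone-like strings; contextual analysis decides whose phone it is."""
    for match in PHONE_RE.findall(message):
        digits = normalize(match)
        if digits in VIP_INDEX:
            return "phone number of a VIP client"      # business context is known
        if digits in CLIENT_INDEX:
            return "phone number of a client"
    return "no client data found (perhaps just a signature)"

print(classify("Call me back: +7 (495) 123-45-67"))
```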
Digital fingerprints
When a digital fingerprint is taken from a specific document, the system can then identify that document, or a fragment of it, wherever it appears. How do we move from content analysis to contextual here? The market has settled on the convention that digital fingerprints are either text fingerprints or binary ones. For digital fingerprints to become more sensitive and more useful for detecting official and confidential information, they must cover the widest possible range of data types.
To be effective, good digital fingerprints must understand not only text and streamed information such as audio and video files, but also raster images.
We had a client whose reportage photos were being stolen straight from the scene.
Their goal was to catch any mention of these photos in traffic. It did not matter whether the photo had been converted, say from RAF to JPEG or from RAF to PNG, or whether its resolution and pixel count had been changed: the system still had to detect it. We have a digital fingerprint technology built into the DLP that catches such photos even if they have been slightly cropped, flipped, or converted to another format.
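The exact fingerprinting algorithm is not disclosed in the talk. A common way to get format- and resolution-independent image fingerprints is perceptual hashing; a minimal sketch using the Pillow and imagehash libraries, with made-up file names:

```python
from PIL import Image            # pip install pillow imagehash
import imagehash

def image_fingerprint(path: str) -> imagehash.ImageHash:
    """Perceptual hash: stable across format conversion and resizing."""
    return imagehash.phash(Image.open(path))

def looks_like_protected(candidate: str, protected_hashes, max_distance: int = 8) -> bool:
    """A small Hamming distance between hashes means 'probably the same picture'.
    Catching mirrored copies would additionally require hashing flipped variants."""
    h = image_fingerprint(candidate)
    return any(h - p <= max_distance for p in protected_hashes)

# Build the fingerprint database from the originals (hypothetical file names).
protected = [image_fingerprint(p) for p in ["scene_photo_01.png", "scene_photo_02.jpg"]]
print(looks_like_protected("intercepted_attachment.jpg", protected))
```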
There is also a separate module that makes digital fingerprints from vector images, from CAD files. For example, we load a very rough drawing of a tank into the system, and if someone forwards even part of that drawing, the system still understands that it is part of a confidential drawing. The analysis works on graphic primitives: points, lines, curves, and the relationships between them. It is not just a comparison of binary data, but an actual understanding of what is depicted in the vector fingerprint.
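The patented primitive-level analysis itself is not described in the talk. As a toy illustration of matching a drawing fragment by its primitives rather than by raw bytes, here is a minimal sketch that compares translation-invariant signatures of line segments; the real technology also handles curves, relationships between primitives, and much more:

```python
from math import atan2, hypot
from collections import Counter

# A toy "drawing" is a list of line segments ((x1, y1), (x2, y2)).
Segment = tuple[tuple[float, float], tuple[float, float]]

def signature(seg: Segment) -> tuple:
    """Translation-invariant signature of one primitive: rounded length and angle."""
    (x1, y1), (x2, y2) = seg
    return (round(hypot(x2 - x1, y2 - y1), 2), round(atan2(y2 - y1, x2 - x1), 2))

def fragment_overlap(fragment: list[Segment], original: list[Segment]) -> float:
    """Fraction of the fragment's primitives that also occur in the protected drawing."""
    frag, orig = Counter(map(signature, fragment)), Counter(map(signature, original))
    matched = sum(min(count, orig[sig]) for sig, count in frag.items())
    return matched / sum(frag.values()) if frag else 0.0

# Toy example: the leak contains two of the three segments of the protected drawing, shifted.
original = [((0, 0), (10, 0)), ((10, 0), (10, 5)), ((10, 5), (0, 0))]
leak = [((100, 100), (110, 100)), ((110, 100), (110, 105))]
print(fragment_overlap(leak, original))   # -> 1.0, so the fragment would be flagged
```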
And how do we push digital fingerprint technology from content analysis to contextual? Naturally, fingerprints must be integrated with electronic document management systems, with all the repositories where the originals of the documents we want to protect are stored. This entire fingerprint database must be kept up to date: there must be constant synchronization and updating of fingerprints. Traffic Monitor has a special API for this, which keeps all digital fingerprints current and integrated with the electronic document management system.
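The specific Traffic Monitor API calls are not shown in the talk. The general shape of such a synchronization job might look like this minimal sketch, where both endpoints and field names are hypothetical placeholders, not the real product API:

```python
import time
import requests  # pip install requests

# Hypothetical placeholders, not the real Traffic Monitor or DMS endpoints.
DMS_CHANGES_URL = "https://dms.example.local/api/documents/changed"
DLP_FINGERPRINT_URL = "https://dlp.example.local/api/fingerprints"

def sync_fingerprints(since: float) -> float:
    """Pull documents changed in the DMS since the last run and re-register their fingerprints."""
    changed = requests.get(DMS_CHANGES_URL, params={"since": since}, timeout=30).json()
    for doc in changed:
        content = requests.get(doc["download_url"], timeout=30).content
        requests.post(DLP_FINGERPRINT_URL,
                      files={"file": (doc["name"], content)},
                      data={"document_id": doc["id"]},
                      timeout=60)
    return time.time()

last_run = 0.0
while True:                      # simple periodic job; a real deployment would use a scheduler
    last_run = sync_fingerprints(last_run)
    time.sleep(3600)
```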
Linguistic dictionaries
Now we move on to another technology that is a classic of the DLP market: linguistic dictionaries.
How do we push this dictionary technology from content analysis to contextual, so that it takes all the nuances into account? Here machine learning comes to our aid, because creating dictionaries is a very costly business. Building an effective dictionary, one that really detects a particular category of information reliably, takes a professional linguist five to seven working days: a full-fledged linguistic model has to be compiled, and it may contain a couple of hundred terms. Such a dictionary is hard to create. Machine learning cuts the labor cost: you feed in example documents, and in a minute a new dictionary is ready, just as effective as if a professional linguist had built it. Once dictionaries are cheap and fast to create, we can generate them by the dozen, experiment, build one today, see that it does not work, and build or retrain another one tomorrow, and this lets us reflect every category of data the company has granularly in the policy. For example, you can take the out-of-the-box dictionary on logistics, disable it, and instead create a dictionary for "logistics of consumables" or "logistics of goods supply" based on your own documents, or a dictionary that precisely captures commercial proposals. Because artificial intelligence and machine learning make dictionaries cheap to create, we can generate as many of them as we want, and they will reflect all the nuances of your document flow and business processes. A security officer can constantly rework them and, I repeat, create a separate dictionary for each type of contract, reflecting each category of data granularly in the policy.
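The actual dictionary-generation algorithm is not described in the talk. One simple way to get a term dictionary from example documents is TF-IDF keyword extraction, shown in this minimal sketch with scikit-learn; the category and the example texts are made up:

```python
from sklearn.feature_extraction.text import TfidfVectorizer  # pip install scikit-learn

def build_dictionary(example_docs: list[str], top_n: int = 20) -> list[str]:
    """Pick the terms that best characterize this category of documents.
    A production system would also weigh terms against a background corpus."""
    vec = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
    tfidf = vec.fit_transform(example_docs)
    scores = tfidf.sum(axis=0).A1            # aggregate term weight over the examples
    terms = vec.get_feature_names_out()
    ranked = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)
    return [term for term, _ in ranked[:top_n]]

# Hypothetical examples of one document category ("logistics of consumables").
examples = [
    "Delivery schedule for printer cartridges and office paper, warehouse pickup Friday.",
    "Consumables purchase order: toner, labels, packaging film; freight by road.",
]
print(build_dictionary(examples, top_n=10))
```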
And even when new, previously unaccounted-for categories of data appear, we can respond to them quickly. That is why machine learning and artificial intelligence make it possible to reflect the business context in the policy more precisely.
Now we move on to the most unpleasant aspect of operating a DLP system – false negative triggers.
In fact, this is a real scourge that it is not customary to talk about in the DLP market; an indecent, one might say forbidden, topic. What is the problem? DLP misses incidents not because a lazy security officer failed to account for something or did not update the policies in time. The problem with a false negative trigger is that the security officer may simply not know that some information asset has appeared in his traffic which the policies do not cover at all.
Accordingly, neither he nor the DLP system, which he has not trained for it, knows about this asset. Missed incidents, false negative triggers, are not even a problem of the DLP system itself; the problem is that the security officer is so overloaded that he has no time to review everything manually and regularly analyze the events the DLP system has not flagged. It is simply a problem of human resources and enormous labor costs. The only tool for dealing with false negatives has been to open the DLP system and methodically review, by hand and by eye, the events it has not processed, looking for a needle in a haystack.
One of our customers described a standard practice of theirs: when an employee submits a resignation letter, they start methodically reviewing all of his events, especially forwarding to personal mail, copying to a flash drive, and so on. And what does this customer say? Every now and then they discover some category of data, some document structure, some new form they had not accounted for in the policies, and the employee had already leaked several such documents within a week. Then they start refining the policies. It turns out that, in current realities, DLP has only one weapon against false negatives: luck. If we are lucky while reviewing events and incidents "manually", we will find that unaccounted-for information asset and turn it into a policy. If we are not lucky, we will remain in blissful ignorance and never know that something is leaking. What can we offer in this context? About a year or a year and a half ago, we introduced a machine learning technology that clusters data unknown to, and unmarked by, the DLP system.
What does it mean to cluster? The technology takes the entire gray area, everything the DLP system has not classified, divides the documents into piles, forms an annotation for each pile, and highlights the most striking example documents. Say you are at the implementation stage: the system is running, but there is not a single trigger and not a single policy yet.
You launch the tool, it is called Data Explorer, and it sorts all the unknown documents into piles for you, creates an annotation for each pile, and highlights the most striking examples. You can quickly see and understand what categories of documents move through your traffic.
How does this technology work? Imagine you are a security officer tasked with analyzing the gray area, this mass of documents the DLP system has not processed. Imagine sitting at a table while thousands, tens of thousands of documents are dumped out of a basket in front of you, and you are told to figure out what categories of information are in there. That is what a security officer does when he tries to analyze the gray area "manually": the documents carry no markings in DLP, so he starts rummaging through them chaotically, by hand and by eye.
The machine learning technology behaves differently. Imagine that instead of one heap you are handed neat stacks of stapled documents: here are contracts, here are consignment notes, here are documents related to employees. A yellow sticky note is attached to each stack listing the key terms it contains, and the most striking and diverse documents representing the stack are placed on top of it.
The machine learning technology decomposes data unknown both to the person and to the DLP system into such stacks and annotates them. You can then take a stack and feed it to the other machine learning technology, which will build a new dictionary from it in a minute.
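The talk does not reveal how Data Explorer is implemented. The general idea of clustering plus per-pile annotation can be sketched with TF-IDF and k-means in scikit-learn; the document texts below are placeholders:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_gray_area(docs: list[str], n_clusters: int = 3, top_terms: int = 5):
    """Split unclassified documents into piles, annotate each pile with its key terms,
    and pick the document closest to the cluster center as the representative example."""
    vec = TfidfVectorizer(stop_words="english")
    X = vec.fit_transform(docs)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    terms = vec.get_feature_names_out()
    piles = []
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        annotation = [terms[i] for i in km.cluster_centers_[c].argsort()[::-1][:top_terms]]
        diffs = X[idx].toarray() - km.cluster_centers_[c]
        representative = docs[idx[int(np.argmin((diffs ** 2).sum(axis=1)))]]
        piles.append({"size": len(idx), "annotation": annotation, "example": representative})
    return piles

docs = ["merger agreement draft with the bank", "payment schedule for the merger deal",
        "office cleaning schedule", "cleaning supplies order",
        "vacation request form", "employee vacation schedule"]
for pile in cluster_gray_area(docs, n_clusters=3):
    print(pile["size"], pile["annotation"], "|", pile["example"])
```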
In this case we not only move from content to context, we actually come to understand that context.
One customer faced a problem: a certain group of employees was taking documents outside the company perimeter en masse, four thousand documents in total. DLP did not process these documents in any way, that is, it did not consider them confidential. They began studying the documents carefully; it turned out to be some category of information, but whether it was confidential was unclear. The customer estimated that analyzing these documents manually would have taken two security officers 30 hours. Instead, he used the machine learning technology, which analyzed the four thousand documents in an hour and a half. It turned out that these were documents for preparing mergers, certain VIP deals, and so on, which had not been reflected in the policies at all. They subsequently reflected these information assets in the policy. First, they realized that this was not a violation but preparation for a deal. Second, they obtained this information asset cheaply, spending only an hour and a half of research instead of 30 hours, and then used machine learning to create dictionaries and add them to the policy. The customer now reasons in terms of labor costs: how much he spent, how much he can and cannot afford to spend on servicing a deal, and how to justify to management the effort of analyzing gray-zone events.
The second case is a financial organization that used the technology I contrasted with regular expressions earlier, the one that, unlike regular expressions, understands not just the content but the context of the data. The company has 3,000 employees, has been on the market for 20 years, and has on the order of 100,000 clients. The customer's task was to verify that client data, statements, commercial offers, and similar correspondence are sent only to the specific client they belong to. If I send a commercial offer to a partner or client but mistakenly enter the wrong address, that should be treated as a violation. Using this technology, the customer implemented an automatic policy containing hundreds of thousands of rules. Here the understanding of data context is used to the fullest: first we determine which specific client the data belongs to, then we automatically check that the data is being sent to that client. Moreover, the technology works inline, in blocking mode: if a letter is addressed to the wrong client, whether by mistake or with incorrect personal data in it, the sending is blocked.
This is achieved through patented indexing and search algorithms, and the technology is highly performant. A small example: with a fingerprint of a client database of 10 million records, checking any message for a mention of any client takes a tenth of a second.
We integrated the Traffic Monitor DLP system with the customer's business system, pulled client data from it, determined which client each piece of data belonged to, and blocked the transmission whenever a portion of data was addressed to the wrong counterparty.
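The patented indexing algorithm is not described in the talk. A rough illustration of the ownership-plus-recipient check is this minimal sketch, in which the client records, the token index, and the blocking decision are simplified stand-ins:

```python
import re

# Hypothetical client records exported from the customer's business system.
CLIENTS = [
    {"id": 1, "name": "Ivan Petrov", "phone": "+7 495 111-22-33", "emails": {"i.petrov@example.com"}},
    {"id": 2, "name": "Anna Sidorova", "phone": "+7 812 444-55-66", "emails": {"a.sidorova@example.com"}},
]

def normalize(value: str) -> str:
    return re.sub(r"\W", "", value).lower()

# Hash index: normalized attribute -> owning client id. Each lookup is a constant-time probe,
# which is why even a multi-million-record database can be checked in fractions of a second.
ATTR_INDEX = {}
for c in CLIENTS:
    for attr in (c["name"], c["phone"], *c["emails"]):
        ATTR_INDEX[normalize(attr)] = c["id"]

def ngrams(words: list[str], max_n: int = 3):
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def check_message(body: str, recipient: str) -> str:
    """Block if the message contains one client's data but is addressed to someone else."""
    words = re.findall(r"[\w@.+()-]+", body)
    owners = {ATTR_INDEX[normalize(g)] for g in ngrams(words) if normalize(g) in ATTR_INDEX}
    for owner_id in owners:
        owner = next(c for c in CLIENTS if c["id"] == owner_id)
        if recipient.lower() not in owner["emails"]:
            return "BLOCK: client data addressed to the wrong counterparty"
    return "ALLOW"

print(check_message("Statement for Ivan Petrov, phone +7 495 111-22-33", "a.sidorova@example.com"))
```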
For this technology we received a national banking award in the category "Best personal data protection system for the banking sector".