Sensitive Data Detection Solution
- #Data analytics
- #NLP
- #Sensitive data
- #Text analysis
About the Client
Our client is one of the top five leading enterprise cloud corporations selling hyper-converged infrastructure (HCI) appliances and software-defined storage. The company delivers its customers a simple, flexible, and cost-efficient cloud platform that offers freedom of choice and enables true hybrid and multi-cloud computing.
Business Challenge
In a modern IT environment where billions of files are stored, there is a need to ensure that sensitive information, such as passport copies, driver’s licenses, or credit card details, cannot be exposed. Companies need to control their sensitive information, prevent data leaks, and comply with personal data handling standards such as GDPR, NIST, PCI-DSS, or HIPAA.
Our client has the infrastructure to run its corporate customers’ virtual machines and wants to help them manage their data securely and cost-effectively. The problem was that, because of the company’s scale, it could not integrate an off-the-shelf system.
Solution Overview
An application deployed in the company’s cluster provides single-pane-of-glass management simplicity, software-defined flexibility, and deep analytics intelligence to meet the modern challenges around file data.
The solution we were asked to develop had to analyze the data storage, detect and classify sensitive data, and notify owners when sensitive data is present in the storage. It also had to provide dashboards with statistics on detected sensitive data and be fully integrated with the application.
Project Description
Since a card number is a 16-digit number formed according to specific rules, we used a regular expression to find 16-digit candidates in raw text and then validated them programmatically against the card number generation rules.
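To illustrate this step, here is a minimal sketch that combines a regular expression with the Luhn checksum, the standard card-number generation rule. The names `CARD_RE`, `luhn_valid`, and `find_card_numbers` are illustrative and not taken from the actual project code.

```python
import re

# Candidate 16-digit sequences, allowing common separators (spaces or dashes).
CARD_RE = re.compile(r"\b(?:\d[ -]?){15}\d\b")

def luhn_valid(number: str) -> bool:
    """Validate a candidate with the Luhn checksum used by payment networks."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    # Double every second digit from the right; subtract 9 if the result exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return raw-text matches that both look like card numbers and pass the Luhn check."""
    return [m.group(0) for m in CARD_RE.finditer(text) if luhn_valid(m.group(0))]
```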
For full name detection, we applied a Named Entity Recognition (NER) model from the Natural Language Processing domain, because there is no generalized pattern describing how full names appear in text data.
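As a rough illustration of the idea (the case study does not name the NER library or pre-trained model used in the project), a general-purpose model such as spaCy’s `en_core_web_sm` can tag person names in free text:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def find_person_names(text: str) -> list[str]:
    """Extract candidate full names via the PERSON entity type of a pre-trained NER model."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]

# A well-trained model would return something like ['John A. Smith'].
print(find_person_names("The passport was issued to John A. Smith in 2016."))
```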
To improve the accuracy of full name predictions and to filter out irrelevant names, we also trained a Conditional Random Fields (CRF) model on the same dataset as the NER model.
This approach locates named entities mentioned in unstructured text, classifies them into predefined categories, and uses that information for structured prediction.
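A hedged sketch of such a CRF is shown below, using the sklearn-crfsuite package, toy BIO-labelled data, and a deliberately simple feature set; none of these specifics are confirmed by the case study.

```python
import sklearn_crfsuite

def token_features(sent: list[str], i: int) -> dict:
    """Simple per-token features; the project's real feature set is not documented here."""
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_title": word.istitle(),
        "is_upper": word.isupper(),
        "is_digit": word.isdigit(),
        "prev_lower": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next_lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

def sent_to_features(sent: list[str]) -> list[dict]:
    return [token_features(sent, i) for i in range(len(sent))]

# Toy training data: token sequences with BIO labels marking person names.
train_sents = [["Invoice", "issued", "to", "John", "Smith", "."]]
train_labels = [["O", "O", "O", "B-PER", "I-PER", "O"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit([sent_to_features(s) for s in train_sents], train_labels)

print(crf.predict([sent_to_features(["Report", "prepared", "by", "Jane", "Doe"])]))
```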
The solution includes a built-in library of almost 50 rules and more than 400 patterns covering common laws and standards (see the sketch after this list):
- Personal information: credit card numbers, passport numbers, driver’s license numbers, social security numbers, IBAN, and more
- Financial records
- Security file types (.cer, .crt, .p7b, etc.)
- Regulated data (GDPR, HIPAA, PHI, PCI, Sarbanes Oxley, GLBA, etc.)
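One way such a rule library can be organized is to pair each compiled pattern with the standards it is reported under. The rule names and patterns below are hypothetical simplifications, not the product’s actual rules:

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    pattern: re.Pattern
    standards: tuple[str, ...]  # regulations the finding is reported under

# Hypothetical examples; the production library holds ~50 rules and 400+ patterns.
RULES = [
    Rule("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), ("PII", "HIPAA")),
    Rule("iban", re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"), ("PII", "GDPR")),
    Rule("security_file", re.compile(r"\.(cer|crt|p7b)$", re.IGNORECASE), ("Security files",)),
]

def match_rules(text: str) -> list[str]:
    """Return the names of all rules whose pattern occurs in the given text."""
    return [rule.name for rule in RULES if rule.pattern.search(text)]
```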
Our NLP-based model finds and classifies sensitive information in files and ranks each file based on its findings, providing a convenient tool for security auditors and system administrators.
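The scoring scheme itself is not described in the case study; a minimal sketch of file ranking, with entirely hypothetical category weights, might look like this:

```python
# Hypothetical severity weights per finding category.
SEVERITY = {"credit_card": 10, "passport_number": 8, "person_name": 3}

def rank_file(findings: dict[str, int]) -> int:
    """Score a file from its findings ({category: count}) so auditors can triage the riskiest first."""
    return sum(SEVERITY.get(category, 1) * count for category, count in findings.items())

files = {
    "backup.csv": {"credit_card": 12, "person_name": 40},
    "notes.txt": {"person_name": 2},
}
for name, findings in sorted(files.items(), key=lambda kv: rank_file(kv[1]), reverse=True):
    print(name, rank_file(findings))
```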
As a result, Quantum created a scalable software solution with a multitier architecture that allows new file formats and detection patterns to be added easily and the solution to scale to processing thousands of documents simultaneously. Moreover, the model we used also allowed us to detect and classify sensitive data within images.
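One way to extend the same pipeline to images is to extract text with OCR first and reuse the text detectors; the case study does not name the OCR component, so pytesseract and the `detectors` module below are assumptions for illustration only.

```python
from PIL import Image
import pytesseract

# Assumption: the earlier detector sketches are packaged as an importable module.
from detectors import find_card_numbers, find_person_names

def scan_image(path: str) -> dict[str, list[str]]:
    """OCR an image, then run the same text detectors over the extracted text."""
    text = pytesseract.image_to_string(Image.open(path))
    return {
        "card_numbers": find_card_numbers(text),
        "person_names": find_person_names(text),
    }
```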
Let's discuss your idea!
Technological Details
The backend was written in Python. For this project, we decided to use Named Entity Recognition (NER) and Conditional Random Fields (CRF) models from the NLP domain. The frontend was developed with React JS.
To make the solution scalable, we set up three Docker containers: the application with the main processing part, Redis for communication between the application modules, and Prometheus for metrics collection.
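The exact messaging pattern is not detailed in the case study; a minimal sketch of module-to-module communication through Redis, assuming a simple list-backed work queue and an illustrative `scan_jobs` queue name, could look like this:

```python
import json
import redis

# Assumed setup: the "redis" hostname resolves to the Redis container on the shared Docker network.
queue = redis.Redis(host="redis", port=6379, decode_responses=True)

def submit_document(path: str) -> None:
    """Producer side: the ingestion module pushes a scan job onto a Redis list."""
    queue.lpush("scan_jobs", json.dumps({"path": path}))

def worker_loop() -> None:
    """Consumer side: a processing module blocks until a job arrives, then scans it."""
    while True:
        _, raw = queue.brpop("scan_jobs")
        job = json.loads(raw)
        print("scanning", job["path"])  # real code would run the detection pipeline here
```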