NLP Text Classification Solution

About the Client

A Germany-based company that works with startups and specializes in high-performance web and mobile apps and scalable portals.

Business Challenge

Document flow is usually a time-consuming process. Thousands of different documents, like invoices and bills, circulate daily, adding tons of manual work for office employees. Fortunately, document processing and decision making can be automated, and that was the idea the client wanted to implement. The invoice handling solution also had to comply with European standards of online document exchange.

Quantum applied computer vision algorithms in the solution to automate and simplify document flow.

Solution Overview

Text mining studies are getting more important as the number of electronic documents from various sources increases. These documents include unstructured and semi-structured information. The main goal of text mining is to let users extract information from textual resources and automate operations like retrieval and classification.

Advanced NLP techniques make it possible to interact with computing systems and obtain useful information from them. For example, an NLP text classification solution can recognize invoice types, numerical values, the sender, the receiver and other parameters.

The solution we developed takes attached PDF documents, processes them and returns the parsing results. The heart of the system is an engine that can work with different types of invoices, automatically detect fields and read the necessary information. Quantum’s solution significantly reduced the number of incidents caused by human error.

Our invoice parsing rules allow users to get information from PDF documents in a matter of minutes. The extraction algorithms use a mixture of pattern- and keyword-based recognition that achieves 82% accuracy.
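To give a feel for what pattern- and keyword-based recognition can look like, here is a minimal Python sketch. It is not the production rule set: the field names, the keywords ("Invoice No", "Date", "Total") and the value formats are assumptions chosen purely for illustration, and the text is assumed to have been extracted from the PDF already.

```python
import re

# Hypothetical rules: each field pairs a keyword (the label printed on the
# invoice) with a pattern describing the value expected right after it.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"Date\s*:?\s*(\d{2}\.\d{2}\.\d{4})", re.IGNORECASE),
    "total": re.compile(
        r"Total\s*(?:amount)?\s*:?\s*([\d.,]+)\s*(?:EUR|€)", re.IGNORECASE
    ),
}


def extract_fields(text):
    """Return a dict of field name -> extracted value (None when not found)."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


sample = "Invoice No: INV-2020-17\nDate: 03.04.2020\nTotal amount: 1.234,56 EUR"
fields = extract_fields(sample)
```

Fields that no rule matches come back as None, which makes it easy to flag a document for manual review instead of guessing a value.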

Project Description

In the first stage of this project, we created an API that could process several types of invoices. This version of the invoice parser was template-based. If the structure of a document didn’t match any of the templates, the system used computer vision algorithms to detect the necessary fields. At this step, we also developed a tool that was integrated into the client’s business process to test for and find potentially problematic cases: types of invoices, different structures of the same documents, flexible information formats, etc.
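The template-first flow with a fallback can be sketched roughly as follows. This is an illustrative outline only: the template names, anchor strings and field patterns are hypothetical, and the computer vision path is reduced to a placeholder function because its details are not described here.

```python
import re

# Hypothetical templates: anchor strings identify a known layout, and
# per-field patterns say where to read each value in that layout.
TEMPLATES = {
    "supplier_a": {
        "anchors": ["ACME GmbH", "Rechnungsnummer"],
        "fields": {"invoice_number": r"Rechnungsnummer:\s*(\S+)"},
    },
    "supplier_b": {
        "anchors": ["Globex Ltd", "Invoice #"],
        "fields": {"invoice_number": r"Invoice #\s*(\S+)"},
    },
}


def match_template(text):
    """Return the name of the first template whose anchors all occur in text."""
    for name, template in TEMPLATES.items():
        if all(anchor in text for anchor in template["anchors"]):
            return name
    return None


def detect_fields_fallback(text):
    """Placeholder for the computer vision path; returns empty fields here."""
    return {"invoice_number": None}


def parse_invoice(text):
    """Use a matching template if there is one, otherwise fall back."""
    name = match_template(text)
    if name is None:
        return {"method": "fallback", "fields": detect_fields_fallback(text)}
    fields = {}
    for field, pattern in TEMPLATES[name]["fields"].items():
        match = re.search(pattern, text)
        fields[field] = match.group(1) if match else None
    return {"method": "template", "template": name, "fields": fields}


known = parse_invoice("ACME GmbH\nRechnungsnummer: R-1001")
unknown = parse_invoice("Unknown Corp\nRef 55")
```

Recording which path handled each document, as the "method" key does here, is one simple way to collect the problematic cases mentioned above.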

In the second stage, we developed a generalized invoice parser that could process any type of invoice. To be able to process documents with different structures, we moved block detection to DNNs and used fuzzy string comparison to find the necessary information.
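Fuzzy string comparison helps when OCR output contains small recognition errors, so a label like "Invoice Number" may arrive slightly misspelled. The sketch below uses Python's standard difflib module as a stand-in; the library actually used in the project is not stated, and the function names and the 0.8 threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_label(candidates, target, threshold=0.8):
    """Return the candidate closest to target, or None if none is close enough."""
    best = max(candidates, key=lambda c: similarity(c, target), default=None)
    if best is not None and similarity(best, target) >= threshold:
        return best
    return None


# OCR-style noise: "I" read as "l" and "m" read as "n".
ocr_blocks = ["lnvoice Nunber", "Date", "Total"]
label = find_label(ocr_blocks, "Invoice Number")
```

The threshold trades recall for precision: a lower value tolerates more OCR noise but raises the risk of matching the wrong block.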

Technological Details

The Quantum team developed a fast and scalable web service. Python was our programming language of choice since it makes it easy to develop asynchronous APIs and has a full spectrum of machine learning, deep learning and computer vision libraries.

Python
OpenCV
scikit-learn
scikit-image
Tesseract OCR
Keras
AIOHTTP
