NLP System for Determining How Gene Mutations Affect Diseases

#AI & Machine Learning
#Deep learning
#Healthcare
#NLP
#Text analysis

Business Challenge

Almost any information can be found on the Internet, but general-purpose search engines handle complex queries poorly: the volume of content available online is too large to sift for precise answers. NLP and data analysis can solve such tasks by retrieving information from specified sources, classifying it, and finding relationships in context.

Our client wanted to create a system that would analyze relevant abstracts from the PubMed library and establish appropriate connections between diseases and related mutations to facilitate diagnosis.

Solution Overview

The Quantum team developed an NLP-based system that reads scientific abstracts from the PubMed library and analyzes their content to determine how genetic mutations affect a disease. Through API requests, the user can submit records of diseases and mutations, along with MeSH and OMIM codes, and receive a list of disease-mutation connection states:

The mutation causes the disease.
The mutation doesn’t cause the disease.
The mutation reduces the risk of the disease.

This information is highly valuable for medical staff examining patients and diagnosing their health conditions.
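
As an illustration only, the request records and the three connection states described above could be modeled as follows. This is a minimal sketch: the class names, field names, and state labels are assumptions made for this example and do not come from the client's actual API.

```python
from dataclasses import asdict, dataclass
from enum import Enum
from typing import Optional


class ConnectionState(str, Enum):
    """The three disease-mutation connection states (labels are hypothetical)."""
    CAUSES = "causes"
    DOES_NOT_CAUSE = "does_not_cause"
    REDUCES_RISK = "reduces_risk"


@dataclass
class ConnectionQuery:
    """Hypothetical request record: a disease, a mutation, optional codes."""
    disease: str
    mutation: str
    mesh_id: Optional[str] = None
    omim_id: Optional[str] = None


def to_payload(query: ConnectionQuery) -> dict:
    """Serialize a query, dropping identifiers that were not supplied."""
    return {key: value for key, value in asdict(query).items() if value is not None}


# Illustrative values: OMIM 219700 is the cystic fibrosis entry.
query = ConnectionQuery(disease="cystic fibrosis", mutation="F508del", omim_id="219700")
payload = to_payload(query)
```

Using an enumeration for the states keeps the response vocabulary closed, so a client cannot receive a label outside the three listed above.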

Project Description

First, we created an NLP model for processing abstracts, based on the open-source BioBERT model pre-trained on biomedical text.

We trained our model to identify disease-mutation pairs using a previously structured dataset of 6,000 training and 2,000 test mutation-disease pairs drawn from nearly 600 abstracts. A single abstract could contain several such connections at once.
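
Because one abstract can contain several connections, each mutation-disease pair has to be presented to the classifier separately. One common way to do this with BERT-style models is to wrap the two mentions in marker tokens and classify the marked text. The sketch below shows that preprocessing step only. It is an assumption about how such input could be prepared, not a description of the pipeline used in the project, and the marker strings and helper names are hypothetical.

```python
from itertools import product
from typing import List, NamedTuple


class Span(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


def find_span(text: str, mention: str) -> Span:
    """Locate the first occurrence of a mention (raises ValueError if absent)."""
    start = text.index(mention)
    return Span(start, start + len(mention))


def mark_pair(text: str, mutation: Span, disease: Span) -> str:
    """Wrap one mutation and one disease mention in marker tokens.

    Assumes the two spans do not overlap or touch.
    """
    inserts = [
        (mutation.start, "[MUT] "), (mutation.end, " [/MUT]"),
        (disease.start, "[DIS] "), (disease.end, " [/DIS]"),
    ]
    # Insert from right to left so that earlier offsets stay valid.
    for pos, token in sorted(inserts, key=lambda item: item[0], reverse=True):
        text = text[:pos] + token + text[pos:]
    return text


def candidate_inputs(text: str, mutations: List[Span], diseases: List[Span]) -> List[str]:
    """Build one marked copy of the text per mutation-disease pair."""
    return [mark_pair(text, m, d) for m, d in product(mutations, diseases)]


# Made-up sentence for illustration, not taken from the dataset.
sentence = "The F508del and G551D variants were associated with cystic fibrosis."
mutations = [find_span(sentence, "F508del"), find_span(sentence, "G551D")]
diseases = [find_span(sentence, "cystic fibrosis")]
inputs = candidate_inputs(sentence, mutations, diseases)
```

Each string in `inputs` would then be tokenized and passed to the fine-tuned classifier, which assigns one connection state per pair.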

Medical staff also needed to know the ethnicity of the patients described in the scientific abstracts. To make this possible, we added a spaCy model that recognizes and extracts named entities from the “mutation-disease” clause for further demographic assessment.

After creating an infrastructure for storing and processing scientific articles, we integrated our solution into the client’s system via API.

Summary: our team built a tool that automatically downloads and processes medical abstracts and periodically checks for new publications. All abstracts and established connections are stored in an Elasticsearch database.
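
The periodic check for new publications could look roughly like the sketch below. It assumes the abstracts are discovered through the public NCBI E-utilities ESearch endpoint, which the case study does not confirm, and the function names are hypothetical. Only the URL construction and the filtering of already-processed PubMed IDs are shown; the HTTP call and the Elasticsearch write are left out.

```python
from datetime import date
from typing import Iterable, List, Set
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (assumed, not stated in the case study).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(term: str, since: date, until: date, retmax: int = 100) -> str:
    """Build an ESearch URL for records that entered PubMed in a date window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "edat",  # filter by Entrez date (when the record was added)
        "mindate": since.strftime("%Y/%m/%d"),
        "maxdate": until.strftime("%Y/%m/%d"),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def select_new(pmids: Iterable[str], seen: Set[str]) -> List[str]:
    """Return IDs not processed before, in order, and record them as seen."""
    fresh = []
    for pmid in pmids:
        if pmid not in seen:
            seen.add(pmid)
            fresh.append(pmid)
    return fresh


url = build_search_url("cystic fibrosis mutation", date(2024, 1, 1), date(2024, 1, 31))
already_processed = {"1"}
new_ids = select_new(["1", "2", "2", "3"], already_processed)
```

A scheduler would call `build_search_url` with the window since the last run, fetch the result, and pass only the IDs returned by `select_new` on to the NLP model.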

Technological Details

The project was developed mainly in Python. We used the BioBERT model, trained on our dataset, to identify connections between diseases and mutations, and a spaCy model for ethnicity identification. All abstracts are stored in an Elasticsearch database. The solution was deployed on Amazon Web Services.