Predicting Oil Fields Using Solids DNA

#Machine learning

Business Challenge

Geologists and geophysicists use different methods to search for geological structures that may form oil reservoirs.

Currently, it costs around $85,000 per square mile for an oil and gas company to check a field for oil.

In total, a company may spend at least $1M, and possibly more than $40M, before it sees any results.

Our client had DNA samples of solids from 13 different areas, with more than 3,000 fields of microelements and characteristics recorded for each area.

The client wanted to use solids DNA to make predicting oil fields easier.

Solution Overview

Quantum found a way to predict oil fields from solids DNA, saving the client time and money.

Our machine learning model predicts the location of an oil field with 70% accuracy.

As the client collects more data about different areas, our R&D team will be able to improve this result.

Project Description

Data understanding and preparation

Our research began with an analytical attempt to identify the main features of solids DNA, which meant determining how important each field was. For feature selection, we applied regression, statistical, and other methods to all of our mixed data. Each way of combining the data gave us a new dataset.
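As an illustration of a statistical approach to field importance, here is a minimal sketch that ranks fields with a one-way ANOVA F-test. The synthetic data, the binary "oil / no oil" label, and the choice of scikit-learn's SelectKBest with f_classif are assumptions made for the example, not the exact method used in the project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)

# Synthetic stand-in for the sample table: 200 samples x 50 fields.
X = rng.normal(size=(200, 50))
# Hypothetical binary label (oil / no oil) driven by fields 3 and 7 only.
y = (X[:, 3] + 0.5 * X[:, 7] + 0.3 * rng.normal(size=200) > 0).astype(int)

# Score every field with a one-way ANOVA F-test and keep the 5 strongest.
selector = SelectKBest(score_func=f_classif, k=5).fit(X, y)
selected = np.flatnonzero(selector.get_support())

# Fields ordered from most to least important according to the F-score.
ranking = np.argsort(selector.scores_)[::-1]

print("selected field indices:", selected)
print("top-ranked field:", ranking[0])
```

The same pattern works with other scoring functions, so several rankings can be compared on the same data.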

ML model

Our team used the mixed datasets to train ML models. As usual, we started by trying different algorithms to find the best one, but in the first iteration the models could not distinguish between the 13 groups of data and reached only 50% accuracy.

After discussing the problem with the client, we concluded that the cause was the data itself: it had been collected in different regions and in different seasons. We split the datasets by region and season, which improved accuracy by 20 percentage points, from 50% to 70%.
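The idea of splitting by region and season can be sketched as follows. This is a toy example under stated assumptions: the column names (region, season, f1, f2, oil) are hypothetical, and the data is constructed so that the effect of f1 flips between seasons. It shows why a pooled model can fail where per-group models succeed, and it does not reproduce the project's figures.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 400

# Hypothetical columns: region, season, and two feature fields.
df = pd.DataFrame({
    "region": rng.choice(["A", "B"], size=n),
    "season": rng.choice(["summer", "winter"], size=n),
    "f1": rng.normal(size=n),
    "f2": rng.normal(size=n),
})
# In this toy data the sign of the f1 effect flips between seasons, so a
# single pooled linear model cannot fit both seasons at once.
sign = np.where(df["season"] == "summer", 1.0, -1.0)
df["oil"] = (sign * df["f1"] + 0.2 * rng.normal(size=n) > 0).astype(int)

features = ["f1", "f2"]

# Baseline: one model evaluated on the pooled data.
pooled = cross_val_score(
    LogisticRegression(), df[features], df["oil"], cv=5
).mean()

# Split by (region, season) and evaluate a separate model on each subset.
per_group = {
    key: cross_val_score(
        LogisticRegression(), part[features], part["oil"], cv=5
    ).mean()
    for key, part in df.groupby(["region", "season"])
}

print(f"pooled accuracy: {pooled:.2f}")
for key, score in per_group.items():
    print(key, f"{score:.2f}")
```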

Visualization

Rather than presenting a thousand rows with thousands of fields each, we decided to show each field as a group of pixels whose saturation changes with the importance of certain features. This gave us a complete image that we could analyze.
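A minimal sketch of this kind of pixel map is shown below. It is a simplification under stated assumptions: each field gets a single pixel rather than a group of pixels, the importance values are random placeholders, and the helper importance_to_pixels is a hypothetical name introduced for the example.

```python
import colorsys
import numpy as np

def importance_to_pixels(importances, width, hue=0.6):
    """Map each field's importance to one pixel whose saturation encodes it."""
    imp = np.asarray(importances, dtype=float)
    span = imp.max() - imp.min()
    # Rescale to [0, 1]; a constant vector maps to zero saturation.
    sat = (imp - imp.min()) / span if span > 0 else np.zeros_like(imp)
    height = -(-len(sat) // width)  # ceiling division
    image = np.ones((height, width, 3))  # unused trailing pixels stay white
    for i, s in enumerate(sat):
        # Fixed hue and brightness; only the saturation varies per field.
        image[i // width, i % width] = colorsys.hsv_to_rgb(hue, s, 1.0)
    return image

rng = np.random.default_rng(2)
importances = rng.random(3000)  # hypothetical importance per field
image = importance_to_pixels(importances, width=60)
print(image.shape)  # (50, 60, 3)
```

The resulting array holds RGB values in the range 0 to 1, so it can be passed directly to Matplotlib's imshow to render the picture.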


Technological Details

We used the FloydHub platform for data processing and model training. Scikit-learn, SciPy, Matplotlib, and Seaborn were used for EDA and visualization.

We also developed an automated pipeline to find the best algorithm for selecting the most valuable features. It covered a range of approaches, from correlation analysis and stacks of L1-regularized regressors to unsupervised methods and dimensionality reduction. The output of each approach was saved as a separate dataset.
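The structure of such a pipeline can be sketched as follows: several interchangeable selectors are applied to the same data, and each result is kept as its own dataset. The three selectors (a correlation filter, a single Lasso regressor instead of a stack, and PCA), their parameters, and the synthetic data are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

def top_by_correlation(X, y, k=10):
    """Keep the k columns with the largest absolute Pearson correlation to y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    keep = np.argsort(np.abs(corr))[::-1][:k]
    return X[:, np.sort(keep)]

def l1_selected(X, y, alpha=0.05):
    """Keep the columns that an L1-regularized regressor leaves non-zero."""
    return SelectFromModel(Lasso(alpha=alpha)).fit_transform(X, y)

def pca_components(X, y=None, n_components=10):
    """Unsupervised alternative: project onto the leading principal components."""
    return PCA(n_components=n_components).fit_transform(X)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 40))  # synthetic stand-in for the field table
y = (X[:, 0] - X[:, 5] + 0.3 * rng.normal(size=200) > 0).astype(int)

selectors = {
    "correlation": top_by_correlation,
    "l1": l1_selected,
    "pca": pca_components,
}

# Each selector yields its own reduced dataset, stored under its name.
datasets = {name: select(X, y) for name, select in selectors.items()}
for name, data in datasets.items():
    print(name, data.shape)
```

Keeping the selectors behind a common signature makes it easy to add or remove approaches without touching the rest of the pipeline.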

Model selection was also automated: we specified the candidate models and evaluated each of them against every dataset. The most “valuable” datasets were then used to build more accurate models through fine-grained hyperparameter tuning and more sophisticated models such as deep neural networks (DNNs) and gradient boosting.
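Evaluating every model against every dataset can be sketched as a simple grid of cross-validation runs. The two synthetic datasets, the model list, and the use of scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost are assumptions made for this example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
y = rng.integers(0, 2, size=300)

# Two hypothetical candidate datasets: one carries signal, one is pure noise.
datasets = {
    "informative": rng.normal(size=(300, 5)) + y[:, None],
    "noise_only": rng.normal(size=(300, 5)),
}

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "gboost": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy for every (dataset, model) pair.
scores = {
    (d_name, m_name): cross_val_score(model, X, y, cv=5).mean()
    for d_name, X in datasets.items()
    for m_name, model in models.items()
}

best_dataset, best_model = max(scores, key=scores.get)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, f"{score:.2f}")
print("best combination:", best_dataset, best_model)
```

The best-scoring combinations are natural candidates for the later, more expensive tuning step.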

The libraries we used:

pandas, NumPy for data manipulation and visualization
Scikit-learn for data analysis and processing, feature selection, and clustering
Scikit-feature for unsupervised feature selection
SciPy for data analysis
Matplotlib, Seaborn for visualization
XGBoost for gradient boosting
Keras for DNN
hyperopt for parameter tuning
imblearn for dealing with imbalanced data