A quick introduction to NLP

    September 17, 2019

    Natural Language Processing (NLP) is an area of Data Science, Machine Learning, and Linguistics that focuses on processing human language. NLP used to be one of the slowest-developing areas: while Computer Vision has been using fancy neural networks since the dawn of AlexNet, NLP lagged behind. In recent years, it has been getting closer and closer to the development speed of CV.

    You might have heard about the Transformer, BERT, XLNet, and ERNIE. But what are those? What is NLP overall, how do machines understand our speech, and do they really? Read on to find out. This article has a complementary Google Colab notebook where you can fiddle, in real time, with some of the libraries and approaches I describe here.

    Text cleaning
    Taken from xkcd

    Tokenization

    Delimiters

    RegEx

    Rule-based

    Byte Pair Encoding (BPE)

    Token preprocessing

    Stop Words removal (+rare words)

    Stemming and Lemmatization

    WordNet

    POS / NE Tagging

    Visualization made using CoreNLP demo by Stanford.

    For example, let’s imagine there’s a company called ‘Good Dudes’, and somebody wrote a review: ‘Good Dudes are awful’. Here the sentiment model might get confused because both ‘good’ and ‘awful’ appear in the same sentence. Which of them expresses the reviewer’s opinion? From the raw words alone, the model cannot tell. But if you apply NER tagging, you will see that the words ‘Good Dudes’ are very likely a named entity, such as a company name. Using the name of the company as a feature in a classification task is a bad idea: if people only write bad reviews for the company, the model will just memorize that the company is bad, and eventually it might classify even legitimately good reviews for that company as bad. So the best practice is to remove the name or replace it with its NER tag. Now the review is ‘COMPANY are awful’, and it’s obvious what sentiment score to give it.
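
    The replacement step can be sketched in a few lines. This is a minimal illustration, not a real NER pipeline: the `entities` list stands in for the output of a tagger, and the `mask_entities` helper and the `COMPANY` tag name are assumptions made for this example.

```python
def mask_entities(text, entities):
    """Replace each named entity in `text` with its tag.

    `entities` is a list of (surface_form, tag) pairs, here assumed to
    come from some NER tagger that is not part of this sketch.
    """
    # Replace longer names first so a short name never clobbers
    # part of a longer one.
    for surface, tag in sorted(entities, key=lambda e: len(e[0]), reverse=True):
        text = text.replace(surface, tag)
    return text


# Hypothetical tagger output for the review from the text.
entities = [("Good Dudes", "COMPANY")]
masked = mask_entities("Good Dudes are awful", entities)
print(masked)  # COMPANY are awful
```

    After masking, the word ‘Good’ no longer reaches the sentiment model, so only ‘awful’ carries sentiment.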

    Constituency parsing

    This processing approach gives you information about how the words in a sentence relate to each other. This information can be crucial for sentiment tasks. Imagine a review ‘This movie was not very good’. If we just take a word-by-word approach, this review is very likely to get classified as positive: the word ‘not’ can appear in both positive and negative reviews, but the word ‘good’ is more likely to be present in a positive one.

    An example of such a parsed tree. Taken from Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank.

    But if we apply constituency parsing, we will see that the phrase ‘very good’ is modified by the word ‘not’, which inverts the meaning and makes the review negative. All this looks easy on paper, but implementing such an approach is hard, and most state-of-the-art sentiment models do not use this information directly.
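
    To make the difference concrete, here is a toy sketch. It does not run a parser: the tree is written by hand as nested tuples, and the tiny lexicon, the negation and intensifier rules, and the `flat_score` and `tree_score` helpers are all assumptions made for illustration. It only shows why composing scores along the tree lets ‘not’ flip the phrase it modifies.

```python
# Toy resources, invented for this example only.
LEXICON = {"good": 1.0, "awful": -1.0}
NEGATORS = {"not"}
INTENSIFIERS = {"very": 2.0}


def flat_score(tokens):
    """Word-by-word score: sum the lexicon values, ignoring structure."""
    return sum(LEXICON.get(token, 0.0) for token in tokens)


def tree_score(node):
    """Score a binary tree of nested tuples from the leaves upwards."""
    if isinstance(node, str):
        return LEXICON.get(node, 0.0)
    left, right = node
    if isinstance(left, str) and left in NEGATORS:
        # A negator flips the sign of the phrase it attaches to.
        return -tree_score(right)
    if isinstance(left, str) and left in INTENSIFIERS:
        # An intensifier scales the phrase it attaches to.
        return INTENSIFIERS[left] * tree_score(right)
    return tree_score(left) + tree_score(right)


tokens = ["This", "movie", "was", "not", "very", "good"]
# Hand-written bracketing of the same sentence (not parser output).
tree = (("This", "movie"), ("was", ("not", ("very", "good"))))

flat = flat_score(tokens)     # positive: only 'good' contributes
structured = tree_score(tree) # negative: 'not' flips 'very good'
```

    The flat score comes out positive while the tree-based score comes out negative, which matches the intuition described above.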

    Coreference resolution

    We humans can understand that the word ‘it’ in the sentence ‘Chicken did not cross the road because it was too tired’ refers to the ‘Chicken’ and not the ‘road’. But for machines, this is a relatively hard task to solve on its own. Using such information can be tricky if you do this by hand, but models based on the Transformer architecture or some other type of attention might learn to represent this information in a helpful way. It’s not completely understood what kind of information such models actually capture. But more on that in our later publications.

    Feature creation

    Now we have an augmented sequence of tokens, but we need numbers or sequences of numbers to feed to our machine learning algorithms. Most approaches use a vocabulary, a simple mapping from tokens to numbers. We do not pass those numbers directly to the algorithm but rather use them as indices. For example, say you have the word ‘good’ in your sentence. If this word is #4 in your vocabulary, you would put a 1 for feature #4. Now it’s time to take a more detailed look at different feature creation methods.
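
    A vocabulary of this kind can be built with a plain dictionary. The sketch below is one simple way to do it; the `build_vocab` helper and the two tokenized example texts are assumptions made for illustration, and indices are counted from zero.

```python
def build_vocab(tokenized_texts):
    """Map each distinct token to an integer index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


# Example texts, already tokenized (made up for this sketch).
texts = [
    ["this", "movie", "was", "very", "good"],
    ["this", "movie", "was", "awful"],
]
vocab = build_vocab(texts)
indices = [vocab[token] for token in texts[1]]

print(vocab["good"])  # 4
print(indices)        # [0, 1, 2, 5]
```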

    Bag of Words

    As the name suggests, this means representing the text as just a bunch of words and treating each word’s presence, absence, or count as a feature. So each sentence or text is converted into a long sparse vector of the length of the vocabulary, with nonzero values only at the indices of the words present. Going back to our example of the word ‘good’, a text containing just that word might look like this: [0, 0, 0, 0, 1, 0, …, 0]
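
    As a minimal sketch, a count-based bag of words can be computed like this. The `bag_of_words` helper and the six-word vocabulary are assumptions made for illustration; the vocabulary places ‘good’ at index 4 (counting from zero) to match the vector above.

```python
from collections import Counter


def bag_of_words(tokens, vocab):
    """Return a count vector with one slot per vocabulary entry."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] = count
    return vector


# Hypothetical vocabulary for this sketch.
vocab = {"this": 0, "movie": 1, "was": 2, "very": 3, "good": 4, "awful": 5}

bow_good = bag_of_words(["good"], vocab)
bow_review = bag_of_words(["very", "very", "good"], vocab)

print(bow_good)    # [0, 0, 0, 0, 1, 0]
print(bow_review)  # [0, 0, 0, 2, 1, 0]
```

    Note that the word order is lost: any reordering of the same tokens produces the same vector.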

    Ngrams

    A way of turning N tokens that appear next to one another into a single feature. There are unigrams (a single token, pretty much bag of words), bigrams (a sequence of two tokens), trigrams (a sequence of three tokens), and so on. In practice, it’s rarely a good idea to go further than trigrams because the feature matrix becomes very big and sparse, which costs memory and leaves too few examples per feature to learn from.
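
    Extracting n-grams from a token list is a short sliding-window operation. The `ngrams` helper below is a sketch written for this example rather than a function from a particular library.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["this", "movie", "was", "not", "very", "good"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[3])     # ('not', 'very')
print(len(bigrams))   # 5
print(len(trigrams))  # 4
```

    Each distinct tuple can then be treated as its own vocabulary entry, which is why the number of features grows quickly with N.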

    TF-IDF (Term Frequency — Inverse Document Frequency)

    The intuition behind this is to give a higher score to words that are frequent in a given document but uncommon across the collection. This is done by counting the number of occurrences of a token (term) in the current document and the number of documents it occurs in overall. The formulas can be found on Wikipedia or the sklearn page. TF-IDF can be used on its own or as weights for some other representation, such as embeddings.
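
    Here is a small sketch of the computation using one common textbook variant: term frequency is the count divided by the document length, and inverse document frequency is log(N / df). This choice of formula, the `tf_idf` helper, and the example documents are assumptions made for illustration; libraries such as scikit-learn use smoothed variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))

    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Made-up tokenized documents.
docs = [
    ["this", "movie", "was", "good"],
    ["this", "movie", "was", "awful"],
    ["good", "movie"],
]
scores = tf_idf(docs)

# 'movie' occurs in every document, so its score is zero;
# 'awful' occurs in only one, so it outranks 'this' in that document.
print(scores[1]["movie"])                     # 0.0
print(scores[1]["awful"] > scores[1]["this"]) # True
```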

    Embeddings

    An embedding is pretty much just a mapping from a token to a vector. The vectors can either be learned together with the model on a task like Language Modelling or Text Classification, or be precomputed separately using approaches made solely for this purpose, e.g. Word2Vec. A huge benefit of this approach compared to all the previous ones is that it represents a token not with a single number but with a combination of them, like [0.4322, -0.1123, …, 0.9542]. This allows the vector to carry a lot more information and is said to even encode the meaning of the word. There is a lot to be said about embeddings; they deserve their own article.
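
    The sketch below shows the lookup and one common way to compare two vectors, cosine similarity. The three-dimensional vectors are made up by hand purely for illustration (real embeddings are learned and have many more dimensions), and the `embed` and `cosine_similarity` helpers are assumptions of this example.

```python
import math

# Hand-made toy vectors, NOT learned embeddings.
embeddings = {
    "good": [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.4],
    "awful": [-0.7, 0.1, 0.2],
}


def embed(tokens):
    """Look up the vector for each token."""
    return [embeddings[token] for token in tokens]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


vectors = embed(["good", "awful"])
sim_close = cosine_similarity(embeddings["good"], embeddings["great"])
sim_far = cosine_similarity(embeddings["good"], embeddings["awful"])

print(sim_close > sim_far)  # True
```

    With vectors that were actually learned, a higher similarity between two tokens is usually read as the tokens being used in similar contexts.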

    Conclusion

    Today we have taken a look at some of the basics, such as preprocessing, tokenization, and some methods of feature creation. These are the foundations of NLP pipelines. Now we are ready to move on to more advanced topics and take a closer look at some of the approaches we already know. The topic of our next article in the ‘Introduction to NLP’ series will be ‘Embeddings and Language Models’. Stay tuned!
