An introduction to NLP

A summary of all that I have learned while working with random NLP stuff.

Picture Credits: http://nlp.cs.tamu.edu/

The what?

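Natural Language Processing (NLP) is a field of computer science at the intersection of artificial intelligence and linguistics, concerned with getting computers to process and understand human language.
When you tell Cortana, Siri, or Google Now to set an alarm for tomorrow, you are talking to a machine and it understands you. That is NLP at work.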

5 Major uses of NLP

1. Classification

  • Text Analysis
  • Sentiment Analysis

2. Matching

  • Search
  • Dialogue
  • Question Answering

3. Recognition/Translation

  • Machine Translation
  • Speech recognition
  • Handwriting recognition
  • Dialogue

4. Structure Prediction

  • Named Entity Extraction
  • Part-of-Speech Tagging
  • Sentence Parsing
  • Semantic Parsing

5. Markov Decision Process

  • Dialogue (e.g. API.ai)

Text preprocessing

Text preprocessing cleans up raw text so that it is ready for NLP tasks.
It is generally divided into 3 steps:

  1. Noise removal (using a dictionary of noisy words, regular expressions, or simple enumeration to remove uninformative words such as is, the, a, am)
  2. Dictionary normalization (stemming and lemmatization), which can be done using NLTK packages.
  3. Object standardization (dealing with words that do not exist in the common vocabulary, such as slang and web language; the simplest approach is a lookup dictionary)

Noise removal

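Noise removal means dropping common words that carry little meaning, such as the, and, or, if.
Sample code to remove noisy words from a text:

# Example noise words (illustrative; extend the list for your data)
noise_list = ["is", "a", "this", "the", "am"]

def _remove_noise(input_text):
    # Split on whitespace and keep only the words that are not in noise_list
    words = input_text.split()
    noise_free_words = [word for word in words if word not in noise_list]
    noise_free_text = " ".join(noise_free_words)
    return noise_free_text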

_remove_noise("this is a sample text")
>>> "sample text" 
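The function above only matches exact, lowercase, whitespace-separated words, so "This" or "text," would slip through. Here is a minimal sketch of one way to handle case and punctuation. The `STOP_WORDS` set and the `remove_stop_words` name are my own placeholders for illustration, not part of any library.

```python
import string

# Hypothetical stop-word set for illustration; real projects usually load a larger list.
STOP_WORDS = {"is", "a", "an", "the", "this", "and", "or", "if", "am"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop tokens whose lowercased, punctuation-stripped form is a stop word."""
    kept = []
    for token in text.split():
        # Normalise only for the comparison; the original token is what gets kept.
        core = token.strip(string.punctuation).lower()
        if core and core not in stop_words:
            kept.append(token)
    return " ".join(kept)

print(remove_stop_words("This is a sample text, and THE end."))
# prints: sample text, end.
```

The kept tokens are left untouched, so punctuation and capitalisation in the remaining text survive for later steps.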

Sample code to remove a regex pattern

See my previous post on regex to know more about it.


import re

def _remove_regex(input_text, regex_pattern):
    # Delete every substring that matches the pattern
    return re.sub(regex_pattern, '', input_text)

regex_pattern = r"#[\w]*"  # a '#' followed by any run of word characters

_remove_regex("remove this #hashtag from this text", regex_pattern)
>>> "remove this  from this text"
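The same idea extends to other kinds of web noise. Below is a rough sketch that strips URLs, mentions and hashtags in one go and then tidies up the leftover spaces. The two patterns are deliberately simple assumptions of mine; real URLs and handles are messier than this.

```python
import re

# Illustrative patterns, not exhaustive.
URL_PATTERN = re.compile(r"https?://\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")

def strip_social_noise(text):
    """Remove URLs, @mentions and #hashtags, then collapse repeated whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_OR_HASHTAG.sub("", text)
    # Removing tokens leaves double spaces behind; squeeze them into one.
    return re.sub(r"\s+", " ", text).strip()

print(strip_social_noise("check this #nlp post by @someone at http://example.com now"))
# prints: check this post by at now
```

Compiling the patterns once is handy when the function runs over thousands of tweets or comments.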

Vocabulary standardization

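Stemming strips suffixes from a word by rule, while lemmatization maps a word to its dictionary form.

from nltk.stem.wordnet import WordNetLemmatizer
# The lemmatizer needs the WordNet corpus: run nltk.download("wordnet") once
lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer
stem = PorterStemmer()

word = "multiplying"
lem.lemmatize(word, "v")
>>> "multiply"
stem.stem(word)
>>> "multipli"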
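To get a feel for why the two results differ, here is a toy version of each approach in plain Python. This is only a sketch: the suffix rules and the tiny dictionary are made up for illustration and are nowhere near what PorterStemmer or WordNet actually use.

```python
# Toy rules and a toy dictionary, both hypothetical.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMA_DICT = {
    "multiplying": "multiply",
    "multiplied": "multiply",
    "multiplies": "multiply",
    "better": "good",
}

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def naive_lemmatize(word):
    """Return the dictionary form if it is known, otherwise the word unchanged."""
    return LEMMA_DICT.get(word, word)

for w in ["multiplying", "multiplied", "multiplies", "better"]:
    print(w, naive_stem(w), naive_lemmatize(w))
```

The rule-based version can return fragments that are not real words ("multipli") and cannot relate "better" to "good", whereas the lookup version can, but only for words it already knows.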

Why use standardization

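The same word can show up in several forms, including slang and abbreviations that are not in any standard dictionary, such as LOL (laugh out loud) or RT (retweet). That can be confusing to the computer.
Below is example code showing how we standardize the text before it is sent for NLP tasks.

# Example lookup table mapping slang to standard words (illustrative entries)
lookup_dict = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        # Replace the word if its lowercase form is in the lookup table
        if word.lower() in lookup_dict:
            word = lookup_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)
    return new_text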

_lookup_words("RT this is a retweeted tweet by RonitRay")
>> "Retweet this is a retweeted tweet by RonitRay"
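Splitting on whitespace means a token like "RT:" or "awsm!" will not match the lookup table. One possible way around that, sketched below, is to let a regex pick out the word part and leave the punctuation alone. The `SLANG` table and the `expand_slang` name are assumptions made for this example.

```python
import re

# Hypothetical lookup table for illustration.
SLANG = {"rt": "Retweet", "dm": "direct message", "awsm": "awesome", "luv": "love"}

def expand_slang(text, table=SLANG):
    """Replace known slang words while leaving surrounding punctuation intact."""
    def replace(match):
        word = match.group(0)
        # Fall back to the original word when it is not in the table.
        return table.get(word.lower(), word)
    # \w+ matches each run of word characters, so ":" "," "!" are never touched.
    return re.sub(r"\w+", replace, text)

print(expand_slang("RT: luv this, awsm!"))
# prints: Retweet: love this, awesome!
```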

Text embedding

Text embedding (or word embedding) is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from the vocabulary are mapped to vectors of real numbers.
Commonly used tools for this are word2vec and GloVe.

The main applications include:
– Finding similar words
– Information retrieval
– Word segmentation
– Named entity recognition
– Sentiment analysis
– Document topic classification
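Once words are vectors, "finding similar words" usually comes down to comparing vectors, most often with cosine similarity. The sketch below uses three hand-written vectors purely for illustration. They are not output from word2vec or GloVe, whose vectors are learned from a corpus and have far more dimensions.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors=VECTORS):
    """Return the other word whose vector has the highest cosine similarity."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("king"))
# prints: queen
```

With real embeddings the lookup is the same idea, just over a vocabulary of many thousands of words.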

Other NLP tasks

  1. Text Summarization – Given a text article or paragraph, automatically produce a summary made up of its most important and relevant sentences, in order.
  2. Machine Translation – Automatically translate text from one human language to another, taking into account grammar, semantics, information about the real world, etc.
  3. Natural Language Generation and Understanding – Converting information from computer databases or semantic intents into readable human language is called language generation. Converting chunks of text into more logical structures that are easier for computer programs to manipulate is called language understanding.
  4. Optical Character Recognition – Given an image representing printed text, determine the corresponding text.
  5. Document to Information – This involves parsing the textual data present in documents (websites, files, PDFs, and images) into a clean, analyzable format.
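As a rough illustration of the first task in the list, here is a very naive extractive summarizer: it scores each sentence by how frequent its words are in the whole text and keeps the top ones in their original order. This is a sketch under simple assumptions (regex sentence splitting, raw word counts, a `summarize` function I made up for the example), not how production summarizers work.

```python
import re
from collections import Counter

def summarize(text, n_sentences=1):
    """Keep the n highest-scoring sentences, in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        # Sum of corpus-wide counts of the sentence's words (favours long sentences).
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in chosen)

text = ("NLP is fun. NLP helps machines read text and NLP helps machines write text. "
        "Cats sleep a lot.")
print(summarize(text))
# prints: NLP helps machines read text and NLP helps machines write text.
```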

Popular NLP libraries

Scikit-learn – Machine learning in Python.
Natural Language Toolkit (NLTK) – The complete toolkit for all NLP techniques.
Pattern – A web mining module for Python with tools for NLP and machine learning.
TextBlob – Easy-to-use NLP tools API, built on top of NLTK and Pattern.
spaCy – Industrial-strength NLP with Python and Cython.
Gensim – Topic Modelling for Humans.
Stanford CoreNLP – NLP services and packages by the Stanford NLP Group.

About Me

Hey, I am Thomas Ashish Cherian. What I cannot create, I do not understand.