Unit 1b
Pre-processing
Basic NLP Pipeline
NLP uses Language Processing Pipelines to read, decipher, and understand human languages.
Extended NLP pipeline
spaCy Data Processing Pipeline
Tokenization
• Tokenization is the process of breaking raw text into small chunks, such as
words or sentences, called tokens. These tokens help in understanding the
context and in developing models for NLP. Tokenization helps in interpreting
the meaning of the text by analyzing the sequence of words.
• For example, the text “It is raining” can be tokenized into ‘It’, ‘is’, ‘raining’
• There are different methods and libraries available to perform tokenization.
NLTK, Gensim, and Keras are some of the libraries that can be used to
accomplish the task.
• Stop words are those words in the text which do not add any meaning to
the sentence, and their removal will not affect the processing of the text for
the defined purpose. They are removed from the vocabulary to reduce noise
and to reduce the dimensionality of the feature set.
Various Tokenization Techniques
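One such technique can be sketched with a regular expression. This is a simplified stand-in for library tokenizers (e.g. NLTK's `word_tokenize` or spaCy's tokenizer), which handle contractions, abbreviations, and language-specific rules far more robustly:

```python
import re

def tokenize(text):
    # Match runs of word characters, or any single non-space,
    # non-word character (so punctuation becomes its own token).
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("It is raining"))   # ['It', 'is', 'raining']
print(tokenize("Hello, world!"))   # ['Hello', ',', 'world', '!']
```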
Stop words removal
• The words which are generally filtered out before processing a natural
language are called stop words. These are actually the most common
words in any language (like articles, prepositions, pronouns,
conjunctions, etc.) and do not add much information to the text.
Examples of a few stop words in English are “the”, “a”, “an”, “so”, and
“what”.
• Many libraries are available to carry this out.
We can remove stop words while performing
the following tasks:
• Text Classification
• Spam Filtering
• Language Classification
• Genre Classification
• Caption Generation
• Auto-Tag Generation
Remove Stop Words using spaCy
Stop Word Removal using NLTK
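With either library, stop-word removal reduces to a set-membership filter over the tokens. The sketch below uses a small hand-picked stop list for illustration; in practice the list would come from `nltk.corpus.stopwords.words('english')` or spaCy's `nlp.Defaults.stop_words`:

```python
# Tiny illustrative stop list; real lists from NLTK/spaCy contain
# a few hundred entries.
STOP_WORDS = {"the", "a", "an", "is", "on", "so", "what", "are", "not"}

def remove_stop_words(tokens):
    # Keep only tokens that are not in the stop list (case-insensitive).
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "on", "the", "mat"]))
# ['cat', 'mat']
```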
Avoid Stop word Removal
• Machine Translation
• Language Modeling
• Text Summarization
• Question-Answering problems
Text Normalization
• When we normalize text, we attempt to reduce its randomness,
bringing it closer to a predefined “standard”. This helps us to reduce
the amount of different information that the computer has to deal
with, and therefore improves efficiency. The goal of normalization
techniques like stemming and lemmatization is to reduce inflectional
forms and sometimes derivationally related forms of a word to a
common base form.
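Before stemming or lemmatization, normalization usually starts with simpler steps such as case-folding and punctuation removal. A minimal sketch:

```python
import re

def normalize(text):
    # Case-fold, then strip punctuation: two simple normalization
    # steps that reduce surface variation in the text.
    text = text.lower()
    return re.sub(r"[^\w\s]", "", text)

print(normalize("It's RAINING!"))  # "its raining"
```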
Stemming
• We use Stemming to remove suffixes from words and end up with a so-called word stem. The
words “likes”, “likely” and “liked”, for example, all result in their common word stem “like” which
can be used as a synonym for all three words. That way, an NLP model can learn that all three
words are somehow similar and are used in a similar context.
• Stemming lets us standardize words to their base stem irrespective of their inflections, which
helps many applications like clustering or classifying text. Search engines use these techniques
extensively to give better results irrespective of the word form. Before Google implemented word
stemming in 2003, a search for “fish” did not include websites on fishes or fishing.
• Over-stemming: where a much larger part of a word is chopped off than required, which
in turn leads to words being incorrectly reduced to the same root word or stem when they
should have been reduced to different stems. For example, the words “university” and
“universe” both get reduced to “univers”.
• Under-stemming: occurs when two or more words are wrongly reduced to more than one
root word when they actually should be reduced to the same root word. For example, the
words “data” and “datum” get reduced to “dat” and “datu” respectively (instead of the same
stem “dat”).
Stemming is an elementary rule-based process for removing
inflectional forms from a given token. The output is the
stem of the word. For example, “laughing”, “laughed”, “laughs”, and
“laugh” will all become “laugh” after the stemming process.
Stemming is not always a good process for normalization, since it can produce
non-meaningful words which are not present in the dictionary. Consider the
sentence “His teams are not winning”. After stemming we get “Hi team are not
winn”. Notice that the keyword “winn” is not a regular word; also, “hi” has changed
the context of the entire sentence.
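A toy suffix-stripping stemmer makes both the idea and its failure modes concrete. This is a deliberately crude sketch, not a real algorithm like NLTK's `PorterStemmer`, which applies much more careful rules:

```python
# Strip the first matching suffix, keeping at least 3 characters of stem.
SUFFIXES = ["ing", "ed", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["laughing", "laughed", "laughs", "laugh"]:
    print(w, "->", stem(w))   # all four map to "laugh"

# The crude rules also reproduce the non-word "winn" from the text:
print(stem("winning"))        # "winn"
```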
Lemmatization
• Unlike stemming, lemmatization reduces words to their base word, reducing the
inflected words properly and ensuring that the root word belongs to the
language. It is usually more sophisticated than stemming, since stemmers work
on an individual word without knowledge of the context. In lemmatization, the
root word is called a lemma. A lemma is the canonical form, dictionary form, or
citation form of a set of words.
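Because a lemma must be a real dictionary word, lemmatization needs a vocabulary lookup rather than suffix rules. The sketch below uses a tiny hand-made lookup table for illustration; real lemmatizers (e.g. NLTK's `WordNetLemmatizer` or spaCy's `token.lemma_`) consult a full vocabulary plus the word's part of speech:

```python
# Toy lemma dictionary; a real lemmatizer covers the whole language
# and disambiguates using part-of-speech tags.
LEMMAS = {"are": "be", "is": "be", "winning": "win",
          "teams": "team", "better": "good"}

def lemmatize(word):
    # Fall back to the lowercased word when no lemma is known.
    return LEMMAS.get(word.lower(), word.lower())

print([lemmatize(w) for w in "His teams are not winning".split()])
# ['his', 'team', 'be', 'not', 'win']
```

Note how, unlike the stemmed “winn”, every output here is a valid English word.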
Tags in spaCy