Text Preprocessing For NLP
1. Lower Casing
2. Tokenization
3. Punctuation Mark Removal
4. Stop Word Removal
5. Stemming
6. Lemmatization
Words like “The” and “the” are otherwise treated as two different tokens. To resolve this issue, we must convert all the words to lower case. This provides uniformity in the text.
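As a quick illustration, here is a minimal sketch; the sample string is our own, not from the article:

text = "Natural Language Processing provides UNIFORMITY"
print(text.lower())  # natural language processing provides uniformity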
To do this, we shall use the NLTK library. NLTK (the Natural Language Toolkit) is a Python library widely used for text preprocessing.
import nltk
nltk.download('punkt')  # download the Punkt models used by NLTK's tokenizers
Sentence Tokenization
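The snippets that follow operate on a variable named paragraph holding the raw text. Its definition is not shown in this excerpt, so the string below is a hypothetical stand-in; the article's original paragraph was longer (49 tokens) and, judging by the stemming examples later on, discussed linguistics:

# hypothetical stand-in for the article's original, longer paragraph
paragraph = ("Linguistics is the scientific study of language. "
             "It involves the analysis of every aspect of language.")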
import nltk
# nltk.download('punkt')
sentences = nltk.sent_tokenize(paragraph)  # split the paragraph into a list of sentences
print(sentences)
Similarly, we can also tokenize the paragraph into words. The result is
a list called ‘words’, containing each word of the paragraph. The length
of the list gives us the total number of words present in our paragraph.
# Tokenize Words
words = nltk.word_tokenize(paragraph.lower())
print(words)
print(len(words))
This brings us to the next step. We must now remove the punctuation
marks from our list of words. Let us first display our original list of
words.
print(words)
Now, we can remove all the punctuation marks from our list of words by keeping only the alphanumeric elements. This can be easily done in this manner:
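The original snippet is not shown in this excerpt; a minimal sketch that matches the description and produces the new_words list used in the next step:

# keep only tokens made up entirely of letters and digits, dropping punctuation tokens
new_words = [word for word in words if word.isalnum()]
print(new_words)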
Have you ever observed that certain words, such as “the”, “is”, and “and”, pop up very frequently in any language, irrespective of what you are writing? These are known as stop words. Before working with them in NLTK, we first need to download the stopwords corpus.
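A minimal setup sketch (the original snippet is not shown in this excerpt):

import nltk
nltk.download('stopwords')  # download NLTK's built-in stop word lists
from nltk.corpus import stopwords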
Once this is done, we can now display the stop words of any language
by simply using the following command and passing the language
name as a parameter.
print(stopwords.words("english"))
You can also get the stop words of other languages by simply changing the parameter. Have some fun and try passing “spanish” or “french” as the parameter! (Note that NLTK expects the language name in lower case.)
Since these stop words do not add much value to the overall meaning
of the sentence, we can easily remove these words from our text data.
This helps in dimensionality reduction by eliminating unnecessary
information.
WordSet = []
for word in new_words:
    if word not in set(stopwords.words("english")):
        WordSet.append(word)
print(WordSet)
print(len(WordSet))
Output: 24
We observe that all the stop words have been successfully removed from our word list. On printing the length of the new list, we see that it is now 24, much less than the original length of 49. This shows how we can effectively reduce the dimensionality of a text dataset by removing stop words, without losing any vital information. This becomes extremely useful for large text datasets.
Stemming
First, we create a PorterStemmer object ‘ps’. Then, we call its stem method on every word in our word list.
from nltk.stem import PorterStemmer

ps = PorterStemmer()
WordSetStem = []
for word in WordSet:
    WordSetStem.append(ps.stem(word))
print(WordSetStem)
Carefully observe the result that we have obtained. All the words in our list have been reduced to their stems. For example, “linguistics” has been reduced to “linguist”, the word “scientific” has been reduced to “scientif”, and so on.
Note: The word list obtained after performing stemming does not always contain words that are part of the English vocabulary. In our example, words such as “scientif”, “studi”, “everi” are not proper words, i.e. they do not make sense to us.
Lemmatization
We have just seen how we can reduce words to their root forms using stemming.
However, Stemming does not always result in words that are part of
the language vocabulary. It often results in words that have no
meaning to the users. In order to overcome this drawback, we shall
use the concept of Lemmatization.
Let’s dive into the code.
from nltk.stem import WordNetLemmatizer
# nltk.download('wordnet')  # needed once for the WordNet data

lm = WordNetLemmatizer()
WordSetLem = []
for word in WordSet:
    WordSetLem.append(lm.lemmatize(word))
print(WordSetLem)
We see that the words in our list have been lemmatized: each word has been converted into a meaningful parent word. By default, lemmatize treats every word as a noun; we can pass a part-of-speech argument to change this. Let us try lemmatizing a few verbs:
test = []
for word in ["changing", "dancing", "is", "was"]:
    test.append(lm.lemmatize(word, pos="v"))  # pos="v" treats each word as a verb
print(test)
Now the words have been changed to their proper root words. Words such as “is” and “was” have been converted to “be”. Thus we observe that we can accurately control lemmatization by passing the desired part of speech to the lemmatize method.
Chunking
Recall from our English grammar studies at school that there are eight parts of speech: noun, verb, adjective, adverb, preposition, conjunction, pronoun, and interjection. In chunking, short phrases are formed by combining these parts of speech.
Chunking can break down sentences into phrases that are more useful
than single words and provide meaningful outcomes.
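As an illustration, here is a small sketch of noun-phrase chunking with NLTK's RegexpParser; the sentence and the NP grammar below are our own choices, not from the original article:

import nltk
# nltk.download('averaged_perceptron_tagger')  # needed once for POS tagging

sentence = "The little yellow dog barked at the cat"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # tag each token with its part of speech

# an NP chunk: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)
print(tree)  # the tree groups tokens into NP chunks, e.g. (NP the/DT cat/NN)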
Chunking Up
Chunking Down
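Separately from grammatical chunking, the next example chunks a plain Python list into fixed-size pieces. The chunk_list function itself is not shown in this excerpt; based on the description below (range() over the indices plus list slicing), a minimal reconstruction might be:

def chunk_list(lst, chunk_size):
    # step through the indices in strides of chunk_size and slice out each piece
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

print(chunk_list(list(range(1, 11)), 3))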
Output:
[[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
This function chunk_list takes a list lst and a chunk_size as input and returns a list of lists, each containing up to chunk_size elements from the original list (the final chunk may be shorter).
The range() function is used to iterate over the indices of the original
list, and list slicing is used to extract chunks of the specified size.
Applications
Approaches
1. Rule-based approaches
These approaches make use of handcrafted linguistic rules to classify text. One way to classify text is to create a list of words related to a certain category and then judge the text based on the occurrences of these words. For example, words like “fur”, “feathers”, “claws”, and “scales” could help a zoologist identify texts talking about animals online; a minimal sketch of this idea appears after this list. These approaches require a lot of domain knowledge to be comprehensive, take a lot of time to compile, and are difficult to scale.
2. Machine learning approaches
We can use machine learning to train models on large sets of text
data to predict categories of new text. To train models, we need
to transform text data into numerical data – this is known as
feature extraction. Important feature extraction techniques
include bag of words and n-grams.
There are several useful machine learning algorithms we can use
for text classification. The most popular ones are:
- Naive Bayes classifiers
- Support vector machines
- Deep learning algorithms
3. Hybrid approaches
These approaches combine the two approaches above. They make use of both rule-based and machine learning techniques to build a classifier that can be fine-tuned for certain scenarios.
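To make the rule-based idea concrete, here is a minimal sketch; the keyword set, threshold, and labels are hypothetical choices echoing the zoologist example above:

# hypothetical keyword rule: label a text "animal" if it mentions
# enough animal-related words, otherwise "other"
ANIMAL_WORDS = {"fur", "feathers", "claws", "scales", "paws", "beak"}

def classify(text, threshold=1):
    tokens = text.lower().split()
    hits = sum(1 for token in tokens if token.strip(".,!?") in ANIMAL_WORDS)
    return "animal" if hits >= threshold else "other"

print(classify("The bird ruffled its feathers and sharpened its claws."))  # animal
print(classify("The stock market closed higher today."))                   # other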
Reasons to consider text classification
You may segment your audience depending on the words and phrases
they use, allowing you to develop more focused campaigns.
One of your users could tweet “If this product would have a logo
generation feature, it would be perfect for me.” This is valuable
feedback and you can leverage it to make your product more useful.