Text Preprocessing in Python | Set 2
Last Updated :
18 Mar, 2024
Text preprocessing is one of the initial steps of Natural Language Processing (NLP). It involves cleaning and transforming raw text into a form suitable for further processing. It enhances the quality of the text, makes it easier to work with, and improves the performance of machine learning models.
In this article, we will look at some more advanced text preprocessing techniques.
Prerequisites
Before starting with this article, you should go through Text Preprocessing in Python | Set 1.
Also, refer to Introduction to NLP to learn more about Natural Language Processing.
We have already seen the basic preprocessing steps for working with textual data in Set 1. We can use the techniques in this article to gain more insights into the data that we have. Let’s import the necessary libraries.
Python3
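# import the necessary libraries
import nltk
import string
import re
# tokenizer and part of speech tagger models
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
# models used later by ne_chunk for named entity recognition
nltk.download('maxent_ne_chunker')
nltk.download('words')
# recent NLTK releases may ask for differently named resources;
# the LookupError message names the one to download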
Part of Speech Tagging
The part of speech explains how a word is used in a sentence. The same word can take on different roles and meanings depending on its context. Basic natural language processing models like bag-of-words fail to identify these relations between words. Hence, we use part of speech tagging to label each word with its part of speech tag based on its context in the data. It is also used to extract relationships between words.
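To make the bag-of-words limitation concrete, here is a minimal sketch that uses only the standard library. The helper name `bag_of_words` and the two sample sentences are illustrative assumptions, not part of NLTK. Because only word counts are kept, two sentences with different meanings end up with the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Count lowercase whitespace-separated tokens, ignoring their order.

    Hypothetical helper for illustration only.
    """
    return Counter(text.lower().split())

# Same words, different roles: who chases whom is lost in the counts.
first = bag_of_words("the dog chased the cat")
second = bag_of_words("the cat chased the dog")

print(first)
print(first == second)
```

The comparison shows that the model cannot tell which noun is the subject and which is the object, which is the kind of information part of speech tags and the steps built on them help recover.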
Python3
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# convert text into word_tokens with their tags
def pos_tagging(text):
    word_tokens = word_tokenize(text)
    return pos_tag(word_tokens)

pos_tagging('You just gave me a scare')
Output:
[('You', 'PRP'),
('just', 'RB'),
('gave', 'VBD'),
('me', 'PRP'),
('a', 'DT'),
('scare', 'NN')]
In the given example, PRP stands for personal pronoun, RB for adverb, VBD for verb past tense, DT for determiner and NN for noun. We can get the details of all the part of speech tags using the Penn Treebank tagset.
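As a small illustration, the tag meanings listed above can be attached to the tagged tokens with a plain dictionary. This is only a sketch: `TAG_DESCRIPTIONS` and `describe` are hypothetical names, the dictionary covers just the five tags from this example, and the tagged list is copied from the output above rather than produced by calling NLTK.

```python
# Descriptions limited to the tags named in the paragraph above.
TAG_DESCRIPTIONS = {
    "PRP": "personal pronoun",
    "RB": "adverb",
    "VBD": "verb, past tense",
    "DT": "determiner",
    "NN": "noun",
}

# Tagged tokens copied from the pos_tagging output shown earlier.
tagged = [("You", "PRP"), ("just", "RB"), ("gave", "VBD"),
          ("me", "PRP"), ("a", "DT"), ("scare", "NN")]

def describe(tagged_tokens):
    """Pair each word with a readable description of its tag."""
    return [(word, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
            for word, tag in tagged_tokens]

for word, description in describe(tagged):
    print(f"{word}: {description}")
```

Tags outside the small dictionary fall back to "unknown tag", so a fuller mapping would be needed for real text.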
Python3
# download the tagset
nltk.download('tagsets')
# extract information about the tag
nltk.help.upenn_tagset('NN')
Output:
NN: noun, common, singular or mass
common-carrier cabbage knuckle-duster Casino afghan shed thermostat
investment slide humour falloff slick wind hyena override subhumanity
machinist ...
Chunking
Chunking is the process of extracting phrases from unstructured text and giving it more structure. It is also known as shallow parsing and is performed on top of part of speech tagging. It groups words into “chunks”, mainly noun phrases. In NLTK, chunks are defined with regular expressions over part of speech tags.
Python3
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# define chunking function with text and regular
# expression representing grammar as parameter
def chunking(text, grammar):
    word_tokens = word_tokenize(text)
    # label words with part of speech
    word_pos = pos_tag(word_tokens)
    # create a chunk parser using grammar
    chunkParser = nltk.RegexpParser(grammar)
    # test it on the list of word tokens with tagged pos
    tree = chunkParser.parse(word_pos)
    for subtree in tree.subtrees():
        print(subtree)

sentence = 'the little yellow bird is flying in the sky'
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunking(sentence, grammar)
Output:
(S
(NP the/DT little/JJ yellow/JJ bird/NN)
is/VBZ
flying/VBG
in/IN
(NP the/DT sky/NN))
(NP the/DT little/JJ yellow/JJ bird/NN)
(NP the/DT sky/NN)
In the given example, the grammar is defined using a simple regular expression rule. This rule says that an NP (Noun Phrase) chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). Libraries like spaCy and TextBlob offer ready-made noun phrase extraction and are often better suited for chunking.
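To see how such a rule selects words, the sketch below applies the same pattern as an ordinary regular expression over the sequence of tags. It is a simplified illustration of the idea and not NLTK's implementation; `NP_PATTERN` and `find_noun_phrases` are hypothetical names, and the tagged tokens are copied from the output above.

```python
import re

# Tagged tokens copied from the chunking output shown earlier.
tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("bird", "NN"),
          ("is", "VBZ"), ("flying", "VBG"), ("in", "IN"),
          ("the", "DT"), ("sky", "NN")]

# Optional determiner, any number of adjectives, then a noun.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

def find_noun_phrases(tagged_tokens):
    """Return the word lists whose tags match NP_PATTERN."""
    # Join the tags into one string such as "<DT><JJ><JJ><NN>...".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Character offset at which each token's tag starts in tag_string.
    starts = []
    position = 0
    for _, tag in tagged_tokens:
        starts.append(position)
        position += len(tag) + 2  # two extra characters for "<" and ">"
    chunks = []
    for match in NP_PATTERN.finditer(tag_string):
        first = starts.index(match.start())
        # Number of tokens whose tag begins before the match ends.
        last = sum(1 for start in starts if start < match.end())
        chunks.append([word for word, _ in tagged_tokens[first:last]])
    return chunks

noun_phrases = find_noun_phrases(tagged)
print(noun_phrases)
```

The two matches correspond to the two NP subtrees printed by the chunk parser. The words tagged VBZ, VBG and IN are left outside any chunk because the rule has no place for them.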

Named Entity Recognition
Named Entity Recognition (NER) is used to extract information from unstructured text. It classifies the entities present in a text into categories such as person, organization, event, and place. This gives us detailed knowledge about the text and the relationships between the different entities.
Python3
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

def named_entity_recognition(text):
    # tokenize the text
    word_tokens = word_tokenize(text)
    # part of speech tagging of words
    word_pos = pos_tag(word_tokens)
    # tree of word entities
    print(ne_chunk(word_pos))

text = 'Bill works for GeeksforGeeks so he went to Delhi for a meetup.'
named_entity_recognition(text)
Output:
(S
(PERSON Bill/NNP)
works/VBZ
for/IN
(ORGANIZATION GeeksforGeeks/NNP)
so/RB
he/PRP
went/VBD
to/TO
(GPE Delhi/NNP)
for/IN
a/DT
meetup/NN
./.)
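In practice we usually want the entities as a simple list rather than a printed tree. The sketch below pulls the labelled chunks out of the printed output with a regular expression so that it runs without any NLTK data. `ENTITY_PATTERN` and `extract_entities` are hypothetical names, and the approach assumes flat chunks with no nesting. When working with the tree object itself, iterating over its subtrees and reading `label()` and `leaves()` would be the more direct route.

```python
import re

# Printed tree copied from the output above.
tree_text = """(S
  (PERSON Bill/NNP)
  works/VBZ
  for/IN
  (ORGANIZATION GeeksforGeeks/NNP)
  so/RB
  he/PRP
  went/VBD
  to/TO
  (GPE Delhi/NNP)
  for/IN
  a/DT
  meetup/NN
  ./.)"""

# A labelled chunk looks like "(LABEL word/TAG word/TAG ...)".
ENTITY_PATTERN = re.compile(r"\((\w+)\s+([^()]+)\)")

def extract_entities(text):
    """Return (label, phrase) pairs for every flat labelled chunk."""
    entities = []
    for label, body in ENTITY_PATTERN.findall(text):
        # Drop the "/TAG" suffix from each token to recover the words.
        words = [token.rsplit("/", 1)[0] for token in body.split()]
        entities.append((label, " ".join(words)))
    return entities

entities = extract_entities(tree_text)
print(entities)
```

The outer `(S ...)` node is skipped because it contains other parentheses, so only the PERSON, ORGANIZATION and GPE chunks are returned.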
Conclusion
In conclusion, part of speech tagging, chunking and named entity recognition add grammatical and semantic structure to raw text, building on the basic cleaning steps from Set 1. Techniques like these are part of how natural language processing (NLP) bridges the gap between human communication and computer understanding.