Normalizing Textual Data with Python
Last Updated :
26 Nov, 2022
In this article, we will learn how to normalize textual data with Python. Let's first discuss some concepts:
- Textual data is systematically collected material consisting of written, printed, or electronically published words, typically either purposefully written or transcribed from speech.
- Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since the input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.
Steps Required
Here, we will discuss some basic steps needed for text normalization.
- Input the text string,
- Convert all letters of the string to one case (either lower or upper case),
- If numbers are essential, convert them to words; otherwise remove all numbers,
- Remove punctuation and other grammatical marks,
- Remove leading and trailing white space,
- Remove stop words,
- Apply any other processing the task requires.
We will normalize text using the steps above. Each step can be done in several ways, so we will go through them one at a time.
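As a preview, the steps can be chained into one small function. This is only a minimal sketch: the function name `normalize` and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration and are not part of the article's code, which uses NLTK's stop word list later on.

```python
import re

# Hypothetical, deliberately tiny stop word set used only for this sketch.
DEMO_STOP_WORDS = {"a", "the", "is", "in", "of", "and", "was"}


def normalize(text, stop_words=DEMO_STOP_WORDS):
    """Apply the listed steps in order and return the remaining words."""
    text = text.lower()                    # convert to one case
    text = re.sub(r"\d+", "", text)        # remove numbers
    text = re.sub(r"[^\w\s]", "", text)    # remove punctuation
    words = text.split()                   # split on any run of white space
    return [w for w in words if w not in stop_words]


print(normalize(" Python 3.0, released in 2008, was a major revision. "))
# ['python', 'released', 'major', 'revision']
```

Returning a list of words rather than a string is a design choice of this sketch; joining the list with a single space gives the string form used in the rest of the article.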
Text String
Python3
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
print(string)
Output:
" Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
Case Conversion (Lower Case)
In Python, lower() is a built-in string method. It returns a copy of the string with all uppercase characters converted to lowercase. If the string contains no uppercase characters, the returned string is equal to the original.
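For text that is not limited to English, Python also provides str.casefold(), a more aggressive form of lowercasing intended for caseless comparison. The short sketch below contrasts the two on a German word; the sample word is chosen only for illustration.

```python
word = "Straße"

# lower() keeps the sharp s, casefold() expands it to "ss"
print(word.lower())      # straße
print(word.casefold())   # strasse

# casefold() lets the two spellings compare as equal
print("STRASSE".casefold() == word.casefold())   # True
```

For plain English input such as the sample string in this article, lower() and casefold() give the same result.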
Python3
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
print(lower_string)
Output:
" python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much python 2 code does not run unmodified on python 3. with python 2's end-of-life, only python 3.6.x[30] and later are supported, with older versions still supporting e.g. windows 7 (and old installers not restricted to 64-bit windows)."
Removing Numbers
Remove numbers if they are not relevant to your analysis. Regular expressions are the usual tool: the pattern \d+ matches each run of one or more digits, and re.sub() replaces every match with an empty string.
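When the numbers do carry meaning, the alternative named in the steps list is to convert them to words instead of deleting them. The sketch below is one simple way to do that with the standard library alone: it spells out each digit individually rather than producing full number names such as "two thousand eight". The helper name `spell_digits` and the `DIGIT_WORDS` table are hypothetical and exist only for this example.

```python
import re

# Lookup table for this sketch: list index = digit value.
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spell_digits(text):
    """Replace every run of ASCII digits with its digits spelled out."""
    def repl(match):
        return " ".join(DIGIT_WORDS[int(d)] for d in match.group())
    return re.sub(r"[0-9]+", repl, text)


print(spell_digits("python 3 was released in 2008"))
# python three was released in two zero zero eight
```

The rest of this article takes the removal route.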
Python3
# import regex
import re
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
print(no_number_string)
Output:
" python ., released in , was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python . with python 's end-of-life, only python ..x[] and later are supported, with older versions still supporting e.g. windows (and old installers not restricted to -bit windows)."
Removing Punctuation
Punctuation can also be removed using a regular expression. Here, the pattern [^\w\s] matches any character that is neither a word character (letter, digit, or underscore) nor white space, and every match is replaced with an empty string.
Python3
# import regex
import re
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
# remove everything except word characters and white space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
print(no_punc_string)
Output:
' python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
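An alternative to the regular expression is str.translate() combined with string.punctuation from the standard library. The following is a minimal sketch, with a helper name (`strip_ascii_punctuation`) chosen only for this example. The two approaches are not identical: string.punctuation covers ASCII punctuation only and includes the underscore, whereas [^\w\s] keeps underscores but also removes non-ASCII symbols such as curly quotes.

```python
import string


def strip_ascii_punctuation(text):
    """Delete every character listed in string.punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)


print(strip_ascii_punctuation("end-of-life, e.g. x[30]!"))
# endoflife eg x30
```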
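Removing White Space
strip() is a built-in string method that returns a copy of the string with leading and trailing characters removed. Called without an argument, it removes white space; if a string argument is passed, it removes any of the characters in that argument instead. White space inside the string is left untouched.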
Python3
# import regex
import re
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
# remove everything except word characters and white space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
# remove leading and trailing white space
no_wspace_string = no_punc_string.strip()
print(no_wspace_string)
Output:
'python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
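Because strip() only trims the two ends, deleting numbers and punctuation can leave runs of several spaces between words. If single spaces are wanted throughout, the runs can be collapsed as well. The sketch below shows two equivalent ways; the function names are hypothetical and used only here.

```python
import re


def collapse_with_split(text):
    """split() with no argument drops all white space runs; join rebuilds."""
    return " ".join(text.split())


def collapse_with_regex(text):
    """Replace each white space run with one space, then trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


sample = "  python  released in  was a major   revision "
print(collapse_with_split(sample))   # python released in was a major revision
print(collapse_with_regex(sample))   # python released in was a major revision
```

The later stop word step calls split() on the string anyway, so extra interior spaces do not affect its result.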
Removing Stop Words
Stop words are the most common words in a language, such as “the”, “a”, “on”, “is”, and “all”. These words do not carry important meaning and are usually removed from texts. Stop words can be removed using the Natural Language Toolkit (NLTK), a suite of libraries and programs for symbolic and statistical natural language processing.
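Before bringing in NLTK, the idea itself can be sketched with a hand-written set. The set below contains only the five example words mentioned above and is an assumption made for illustration; a real stop word list is much longer.

```python
# Hypothetical stop word set holding only the five example words.
DEMO_STOP_WORDS = {"the", "a", "on", "is", "all"}


def remove_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Keep only the words that are not in the stop word set."""
    return " ".join(w for w in text.split() if w not in stop_words)


print(remove_stop_words("the cat is on a mat"))
# cat mat
```

Using a set rather than a list keeps each membership test fast, which is also why the NLTK example converts its word list with set().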
Python3
# download stopwords
import nltk
nltk.download('stopwords')
# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)
# assign string
no_wspace_string='python released in was a major revision of the language that is not completely backward compatible and much python code does not run unmodified on python with python s endoflife only python x and later are supported with older versions still supporting eg windows and old installers not restricted to bit windows'
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
# remove stopwords and join the remaining words with single spaces
kept_words = [word for word in lst_string if word not in stop_words]
no_stpwords_string = ' '.join(kept_words)
print(no_stpwords_string)
Output:

With these steps, we can normalize textual data using Python. Below is the complete Python program:
Python3
# import regex
import re
# download stopwords
import nltk
nltk.download('stopwords')
# import nltk for stopwords
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
# input string
string = " Python 3.0, released in 2008, was a major revision of the language that is not completely backward compatible and much Python 2 code does not run unmodified on Python 3. With Python 2's end-of-life, only Python 3.6.x[30] and later are supported, with older versions still supporting e.g. Windows 7 (and old installers not restricted to 64-bit Windows)."
# convert to lower case
lower_string = string.lower()
# remove numbers
no_number_string = re.sub(r'\d+','',lower_string)
# remove everything except word characters and white space
no_punc_string = re.sub(r'[^\w\s]','', no_number_string)
# remove leading and trailing white space
no_wspace_string = no_punc_string.strip()
# convert string to list of words
lst_string = no_wspace_string.split()
print(lst_string)
# remove stopwords and join the remaining words with single spaces
kept_words = [word for word in lst_string if word not in stop_words]
no_stpwords_string = ' '.join(kept_words)
# output
print(no_stpwords_string)
Output:
