Processing text using NLP | Basics
Last Updated : 22 Sep, 2022
Author: noob_coders_ka_baap

In this article, we will learn the steps used to process text data before it is used to train a Machine Learning model.

Importing Libraries

The following must be available in the current working environment:

NLTK library: a suite of Python libraries and programs for processing English-language text.
urllib library: Python's URL handling library. It ships with the standard library, so it needs no separate installation.
BeautifulSoup library: a library for extracting data from HTML and XML documents.
TextBlob library: used for the lemmatization step later in this article.

Python3

    import re

    import nltk
    from bs4 import BeautifulSoup
    from nltk.corpus import stopwords
    from urllib.request import urlopen

    # The tokenizer, stopword and WordNet data must be fetched once
    # with nltk.download() before the functions below are first used.

Once all the libraries are imported, we need to extract the text. The text can be a string or a file that we have to process.

Extracting Data

For this article, we use web scraping to read a webpage, then call the get_text() function to convert it to a str.

Python3

    raw = urlopen("https://www.w3.org/TR/PNG/iso_8859-1.txt").read()
    # name the parser explicitly so BeautifulSoup does not have to guess one
    raw1 = BeautifulSoup(raw, "html.parser")
    raw2 = raw1.get_text()
    raw2

Evaluating raw2 on the last line displays the extracted text when the code is run in a notebook.

Data Preprocessing

Once the data has been extracted, it is ready to be processed. Follow these steps:

1. Deletion of punctuation and numerical text

Python3

    # replace every character that is not a letter with a space
    def punc(raw2):
        raw2 = re.sub('[^a-zA-Z]', ' ', raw2)
        return raw2

2. Creating Tokens

Python3

    # split the text into word tokens
    def token(raw2):
        tokens = nltk.word_tokenize(raw2)
        return tokens

3. Removing Stopwords

Python3

    # lowercase the tokens and remove stopwords
    def remove_(tokens):
        # build the stopword set once instead of on every iteration
        stop_words = set(stopwords.words("english"))
        # lowercase before the check so capitalized stopwords ("The") are removed too
        final = [word.lower() for word in tokens
                 if word.lower() not in stop_words]
        return final

4. Lemmatization

Python3

    # Lemmatizing
    from textblob import TextBlob

    def lemma(final):
        # rebuild a single string so TextBlob can split it into Word objects
        str1 = ' '.join(final)
        s = TextBlob(str1)
        # return the lemmatized tokens as a list, ready for the joining step
        final = [w.lemmatize() for w in s.words]
        return final

5. Joining the final tokens

Python3

    # join the processed tokens back into one string
    def join_(final):
        review = ' '.join(final)
        return review

To execute the above functions, refer to this code:

Python3

    # Calling all the functions in order
    raw2 = punc(raw2)
    tokens = token(raw2)
    final = remove_(tokens)
    final = lemma(final)
    ans = join_(final)
    ans
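The article's output screenshots are not available, so the effect of each step is hard to see. As a rough illustration, the punctuation, tokenizing, stopword and joining steps can be tried offline on a short string. This is only a sketch under stated assumptions: SAMPLE_STOPWORDS is a tiny hypothetical stand-in for NLTK's English stopword list, str.split() stands in for nltk.word_tokenize, the function names are invented for this example, and lemmatization is left out because it needs the WordNet data.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (an assumption,
# kept tiny so the example runs without any downloads).
SAMPLE_STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to", "it"}


def strip_non_letters(text):
    """Replace every character that is not an ASCII letter with a space."""
    return re.sub(r"[^a-zA-Z]", " ", text)


def simple_tokens(text):
    """Whitespace tokenizer, a simplification of nltk.word_tokenize."""
    return text.split()


def drop_stopwords(tokens, stopword_set=SAMPLE_STOPWORDS):
    """Lowercase each token and keep it only if it is not a stopword."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in stopword_set]


sample = "The 3 cats are sitting in the garden, and it is sunny!"
letters_only = strip_non_letters(sample)   # digits and punctuation become spaces
tokens = simple_tokens(letters_only)       # list of words, original casing
cleaned = drop_stopwords(tokens)           # lowercased, stopwords removed
result = " ".join(cleaned)                 # single string for a model

print(tokens)
print(result)  # cats sitting garden sunny
```

Printing the intermediate variables shows what each stage contributes: the digit and punctuation marks disappear first, then the capitalized "The" is dropped along with the other stopwords because the comparison happens after lowercasing.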