
POS Tagging and Lemmatization Using SpaCy in Python
Python is an integral tool for understanding the concepts and applications of machine learning and deep learning. It offers numerous libraries and modules that provide a solid platform for building useful techniques. In this article we will discuss one such library, "spaCy".
spaCy is an open-source library used to analyse and compare textual data. Before we dive into the topic, let's quickly go through an overview of this article.
This article is divided into two sections:
In the first section we will understand the significance of spaCy and discuss the concepts of PoS tagging and lemmatization.
The second section will focus on the application of spaCy to PoS tagging and lemmatization.
What is spaCy?
spaCy is an open-source library for Natural Language Processing (NLP). NLP is a field of artificial intelligence that paves the way for human-computer interaction by letting machines derive meaning from human languages. With the help of spaCy we can process text at large scale and extract that meaning for the machine.
spaCy is written in Python and Cython and exposes its functionality through a simple API.
Installation
spaCy is installed with the help of "pip".
pip install spacy
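Once spaCy is installed, we can import it in our IDE. The pipeline package is not bundled with the library, so it has to be downloaded once:
python -m spacy download en_core_web_sm
We then load the pipeline package by passing its name to "spacy.load". For PoS tagging and lemmatization we will use: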
en_core_web_sm
This naming convention describes what kind of pipeline package we want. "en" is the language, "core" is the capability set (a general-purpose pipeline), "web" is the genre of the training text and "sm" is the size.
So this name loads a small English pipeline, trained on written web text, whose capabilities include PoS tagging and lemmatization.
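The naming convention described above can be sketched in plain Python. The helper `parse_pipeline_name` and the `PipelineName` type are hypothetical names introduced for illustration; they are not part of spaCy, and the sketch assumes the name has exactly four underscore-separated parts.

```python
from typing import NamedTuple


class PipelineName(NamedTuple):
    """The four parts of a pipeline package name (illustrative type)."""
    language: str      # e.g. "en" for English
    capabilities: str  # e.g. "core" for a general-purpose pipeline
    genre: str         # e.g. "web" for written web text
    size: str          # e.g. "sm" for small


def parse_pipeline_name(name: str) -> PipelineName:
    """Split a name such as 'en_core_web_sm' into its four parts."""
    parts = name.split("_")
    if len(parts) != 4:
        raise ValueError(f"expected 4 underscore-separated parts, got {name!r}")
    return PipelineName(*parts)


info = parse_pipeline_name("en_core_web_sm")
print(info.language, info.capabilities, info.genre, info.size)
```

Reading the parts this way makes it easier to see what changes when a different package is chosen, for example a larger size.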
What is PoS tagging?
PoS (part-of-speech) tagging is a technique for categorizing the words in a text by their grammatical role. By analysing each word in its context, we can check the grammar of a sentence and describe its structure.
It also covers unknown words, which receive a tag based on their context. We can check which words in the text are verbs, nouns, pronouns, prepositions and so on.
What is lemmatization?
Lemmatization is the technique of grouping together the different inflected forms of a word so that they can be treated as the same word. It is an integral tool of NLP and is used to normalize the inflected words found in a text.
We morphologically analyse the text and target the words with inflectional endings so that those endings can be removed. The entire logic of lemmatization is to recover the base form (the lemma) of an inflected word.
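To make the idea concrete, here is a deliberately simplified sketch of lemmatization in plain Python. It is not how spaCy works internally: the lookup table `IRREGULAR`, the list `SUFFIX_RULES` and both functions are hypothetical and cover only a handful of English forms.

```python
from collections import defaultdict

# Hypothetical lookup table for irregular forms (illustration only).
IRREGULAR = {"is": "be", "are": "be", "was": "be", "made": "make", "went": "go"}

# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def toy_lemma(word: str) -> str:
    """Return a rough base form using the lookup table, then the suffix rules."""
    lower = word.lower()
    if lower in IRREGULAR:
        return IRREGULAR[lower]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least 3 characters so short words stay intact.
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)] + replacement
    return lower


def group_by_lemma(words):
    """Group inflected forms under their shared base form."""
    groups = defaultdict(list)
    for w in words:
        groups[toy_lemma(w)].append(w)
    return dict(groups)


words = ["offers", "offered", "offering", "is", "are", "studies", "made"]
print(group_by_lemma(words))
```

Rules this simple mishandle many words (for example, "used" is left unchanged), which is why a trained pipeline is used in practice.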
Example
We will construct a program that identifies the different parts of speech in a text using spaCy. First we will use PoS tagging and see how it functions:
Here,
We import spacy after installing it from the command prompt.
We create a variable named "load_capabilites" that holds the NLP pipeline loaded from the package "en_core_web_sm".
We pass the textual data to the pipeline for analysis.
We create a variable named "Anadata".
Anadata stores the processed document, which contains every word (token) of the textual data.
We iterate over the document one word at a time and read "word.pos_" to get the PoS tag of each word.
import spacy

# Load the small English pipeline
load_capabilites = spacy.load("en_core_web_sm")

data_text = """Python programming can be used to perform numerous mathematical operations and provide solutions for different problems.
Python is a very powerful language as it offers multiple modules and methods that are tailor made to perform various operations"""

# Run the pipeline on the text; the result is a document made of tokens
Anadata = load_capabilites(data_text)

# Print every token together with its part-of-speech tag
for word in Anadata:
    print(word, word.pos_)
Output
Python PROPN
programming NOUN
can AUX
be AUX
used VERB
to PART
perform VERB
numerous ADJ
mathematical ADJ
operations NOUN
and CCONJ
provide VERB
solutions NOUN
for ADP
different ADJ
problems NOUN
. PUNCT
SPACE
Python PROPN
is AUX
a DET
very ADV
powerful ADJ
language NOUN
as SCONJ
it PRON
offers VERB
multiple ADJ
modules NOUN
and CCONJ
methods NOUN
that PRON
are AUX
tailor AUX
made VERB
to PART
perform VERB
various ADJ
operations NOUN
Here, each tag has a meaning. For example, "PROPN" means proper noun, "PUNCT" means punctuation and "ADJ" means adjective.
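The tags that appear in the output can be collected into a small glossary. The dictionary below is a hand-written stand-in so that the snippet runs without spaCy installed; `TAG_DESCRIPTIONS` and `describe` are hypothetical names. spaCy itself exposes `spacy.explain()` for looking up such descriptions.

```python
# Plain-English descriptions of the coarse-grained tags seen in the output.
# The names follow the Universal POS tag set; SPACE is the tag that spaCy
# assigns to whitespace tokens such as a line break.
TAG_DESCRIPTIONS = {
    "ADJ": "adjective",
    "ADP": "adposition (preposition or postposition)",
    "ADV": "adverb",
    "AUX": "auxiliary verb",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "NOUN": "noun",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SPACE": "whitespace",
    "VERB": "verb",
}


def describe(tag: str) -> str:
    """Return a readable description, or a fallback for tags not listed."""
    return TAG_DESCRIPTIONS.get(tag, "unknown tag")


for tag in ("PROPN", "AUX", "ADP", "SCONJ"):
    print(f"{tag:<6} {describe(tag)}")
```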
Example
We can also filter the words by a single tag and print only those words.
import spacy

load_capabilites = spacy.load("en_core_web_sm")

data_text = """Python programming can be used to perform numerous mathematical operations and provide solutions for different problems.
Python is a very powerful language as it offers multiple modules and methods that are tailor made to perform various operations"""

visdata = load_capabilites(data_text)

# Keep only the words whose tag is ADJ (adjective)
print("Adjectives:", [word.text for word in visdata if word.pos_ == "ADJ"])
Output
Adjectives: ['numerous', 'mathematical', 'different', 'powerful', 'multiple', 'various']
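The same filtering idea extends to counting how often each tag occurs. The sketch below works on a few (word, tag) pairs copied by hand from the first PoS output, so it runs without spaCy; `tagged` and `words_with_tag` are hypothetical names used only here.

```python
from collections import Counter

# A few (word, tag) pairs copied from the first PoS tagging output.
tagged = [
    ("Python", "PROPN"),
    ("programming", "NOUN"),
    ("can", "AUX"),
    ("be", "AUX"),
    ("used", "VERB"),
    ("numerous", "ADJ"),
    ("mathematical", "ADJ"),
    ("operations", "NOUN"),
]


def words_with_tag(pairs, tag):
    """Return the words that carry the given tag, in their original order."""
    return [word for word, word_tag in pairs if word_tag == tag]


# Count how many words received each tag.
tag_counts = Counter(word_tag for _, word_tag in tagged)

print(words_with_tag(tagged, "ADJ"))
print(tag_counts)
```

With a real spaCy document, the pairs could be built from each word's text and its "pos_" value instead of being typed in by hand.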
Example
Now that we have understood how PoS tagging works, let's understand the functioning of lemmatization.
import spacy

load_capabilites = spacy.load("en_core_web_sm")

data_text = """Python programming can be used to perform numerous mathematical operations and provide solutions for different problems.
Python is a very powerful language as it offers multiple modules and methods that are tailor made to perform various operations"""

visdata = load_capabilites(data_text)

# Print every word next to its base form (lemma)
for word in visdata:
    print(word, word.lemma_)
Output
Python Python
programming programming
can can
be be
used use
to to
perform perform
numerous numerous
mathematical mathematical
operations operation
and and
provide provide
solutions solution
for for
different different
problems problem
. .
Python Python
is be
a a
very very
powerful powerful
language language
as as
it it
offers offer
multiple multiple
modules module
and and
methods method
that that
are be
tailor tailor
made make
to to
perform perform
various various
operations operation
Here, we used "lemma_" to perform lemmatization. Every inflected word is printed next to its base form, and these base forms can be added to an external dictionary to extend the local vocabulary.
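One possible reading of the "external dictionary" idea is a plain lookup table from inflected form to base form that can be reused on new text. The sketch below builds such a table from a few (word, lemma) pairs copied by hand from the lemmatization output; `lemma_lookup` and `normalize` are hypothetical names, and words missing from the table are simply left unchanged.

```python
# A few (word, lemma) pairs copied from the lemmatization output.
pairs = [
    ("used", "use"),
    ("operations", "operation"),
    ("solutions", "solution"),
    ("is", "be"),
    ("are", "be"),
    ("offers", "offer"),
    ("modules", "module"),
    ("made", "make"),
]

# Lookup table: lower-cased inflected form -> base form.
lemma_lookup = {form.lower(): lemma for form, lemma in pairs}


def normalize(tokens):
    """Replace every known inflected form by its base form; keep the rest."""
    return [lemma_lookup.get(token.lower(), token) for token in tokens]


print(normalize(["Python", "offers", "modules", "that", "are", "used"]))
```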
Conclusion
In this article we covered the basic concepts of PoS tagging and lemmatization and understood their significance in NLP. We also applied both techniques using the spaCy library.