Text Classification Using NLP
TEXT CLASSIFICATION USING NLP
PRESENTATION BY:
SURYANSH DEV(54912)
HRITIK BATHLA(54892)
ANJALI PANDEY(54886)
TOPICS WE WILL BE DISCUSSING:
What is Text Classification?
What is NLP?
Why NLP?
Text Classification with Code Examples
Text Classification Algorithms
Conclusion.
What is Text Classification?
Text Classification: Text classification, also known as text tagging or text
categorization, is the process of categorizing text into organized
groups. By using Natural Language Processing (NLP), text classifiers can
automatically analyze text and then assign a set of pre-defined tags or
categories based on its content.
Corpus: A corpus is simply a collection of documents.
For example, suppose you are studying a topic in computer science, say
operating systems, and you start collecting related material from the web,
from research papers, and so on. By the end you might have around 100 to
115 documents in hand. That collection of documents is called a corpus,
and you could name it the operating system corpus.
What is NLP?
Natural language processing (NLP) refers to the branch of computer science—
and more specifically, the branch of artificial intelligence or AI—concerned with
giving computers the ability to understand text and spoken words in much the
same way human beings can.
NLP combines computational linguistics—rule-based modeling of human
language—with statistical, machine learning, and deep learning models.
Together, these technologies enable computers to process human language in
the form of text or voice data and to ‘understand’ its full meaning, complete
with the speaker or writer’s intent and sentiment.
NLP drives computer programs that translate text from one language to
another, respond to spoken commands, and summarize large volumes of text
rapidly—even in real time. There’s a good chance you’ve interacted with NLP in
the form of voice-operated GPS systems, digital assistants, speech-to-text
dictation software, customer service chatbots, and other consumer
conveniences. But NLP also plays a growing role in enterprise solutions that help
streamline business operations, increase employee productivity, and simplify
mission-critical business processes.
Why NLP?
NLP is all about processing text data.
There is an abundance of text data, and almost all of it is unstructured.
To make use of this information and build useful applications, one needs
to know how to process this data.
That is where NLP comes into the picture.
Processing text data is essential in almost every search engine, for
example in query completion.
Text preprocessing for NLP typically includes the following steps (an
illustrative cleaning sketch follows the list):
▪ Text cleaning
▪ Stopword removal
▪ Stemming and lemmatization
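As a rough illustration of text cleaning and stopword removal, here is a small
sketch using NLTK; it assumes NLTK is installed and its stopwords corpus has
been downloaded, and the sample sentence is invented for the example.

import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

def clean_text(text):
    # Lowercase and keep only letters and spaces.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    # Drop stopwords such as "the", "is", "and".
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(tokens)

print(clean_text("The operating system schedules 100 processes, and it is fast!"))
# expected output: "operating system schedules processes fast"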
Text Classification with Code Examples
Stemming Code
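The original code slide is not reproduced here; below is an illustrative
stemming sketch using NLTK's PorterStemmer, with an invented word list.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["draft", "drafted", "drafts", "drafting", "studies", "studying"]
# Stemming chops word endings by rule, so results are not always real words.
print([stemmer.stem(w) for w in words])
# expected output: ['draft', 'draft', 'draft', 'draft', 'studi', 'studi']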
Lemmatization Code
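Likewise, the lemmatization slide's code is not included; here is an
illustrative sketch using NLTK's WordNetLemmatizer (it requires the WordNet
corpus to be downloaded), again with invented example words.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
lemmatizer = WordNetLemmatizer()

# pos="v" tells the lemmatizer to treat the words as verbs.
print([lemmatizer.lemmatize(w, pos="v") for w in ["drafted", "drafting", "drafts"]])
# expected output: ['draft', 'draft', 'draft']

# Unlike stemming, lemmatization returns dictionary words.
print(lemmatizer.lemmatize("studies", pos="n"))
# expected output: 'study'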
Text Classification Algorithms
1. Naïve Bayes:
The Naive Bayes algorithm is a probabilistic classifier that makes use of Bayes' Theorem – a
rule that uses probability to predict the tag of a text based on prior knowledge of conditions
that might be related. It calculates the probability of each tag for a given text, and then
predicts the tag with the highest probability.
You can also improve Naive Bayes’ performance by applying various techniques:
• Removing stopwords: common words that don't add value, for example 'such
as', 'able to', 'either', 'else', 'ever', etc.
• Lemmatizing words: Grouping different inflections of the same word. For example, draft,
drafted, drafts, drafting, etc.
• N-grams: An n-gram is a sequence of n consecutive words; n-gram models
estimate the probability of such a sequence appearing within a text.
• TF-IDF: Short for term frequency-inverse document frequency, TF-IDF is a metric that
quantifies how important a word is to a document in a document set. It is very powerful
when used to score words, i.e. it increases proportionally to the number of times a specific
word appears in a document, but is offset by the number of documents that contain said
word.
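To make this concrete, here is a minimal, hypothetical sketch of a Naive Bayes
text classifier built with scikit-learn's TfidfVectorizer and MultinomialNB;
the tiny training texts and tags below are invented purely for illustration.

# Naive Bayes picks the tag t that maximizes P(t | text) ~ P(text | t) * P(t).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "process scheduling in the operating system",
    "memory management and paging",
    "stock prices fell sharply today",
    "quarterly earnings beat market expectations",
]
train_tags = ["os", "os", "finance", "finance"]

# TF-IDF turns each document into weighted word scores (stopwords removed,
# unigrams and bigrams as features); MultinomialNB then predicts the most
# probable tag for a new text.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(train_texts, train_tags)

print(model.predict(["the scheduler allocates cpu time to each process"]))
# expected output: ['os']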
2. Support Vector Machines
Support Vector Machines (SVM) is a classification algorithm that performs
at its best when handling a limited amount of data. It finds the boundary
that best separates the vectors belonging to a given group or category from
the vectors that do not belong to it.
For example, let's say you have two tags, expensive and cheap, and the
data has two features, x and y. Each data point is a pair of coordinates
(x, y) that must be classified as either expensive or cheap. To do this,
SVM draws a dividing line between the data points, known as the decision
boundary, and classifies everything that falls on one side as expensive
and everything that falls on the other side as cheap.
The decision boundary divides a space into two subspaces, one for
vectors that belong to a group and another for vectors that don’t belong
to that group. Here, vectors represent training text and a group
represents the tag you use to tag your texts. A perk of using SVM is that it
doesn't require a lot of training data to produce accurate results, although
it does require more computational resources than Naive Bayes.
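Here is a hedged sketch of the same idea in code, using scikit-learn's
LinearSVC on TF-IDF vectors; the "expensive"/"cheap" product descriptions
are made up for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "hand-stitched leather briefcase with gold fittings",
    "limited edition designer watch",
    "plastic keychain two for one",
    "budget ballpoint pens sold in bulk",
]
tags = ["expensive", "expensive", "cheap", "cheap"]

# LinearSVC learns the decision boundary (a hyperplane in TF-IDF space) that
# separates the "expensive" vectors from the "cheap" ones with the widest margin.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, tags)

print(model.predict(["designer leather watch"]))
# expected output: ['expensive']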
3. Deep Learning
Deep Learning comprises algorithms and techniques designed to mimic the
human brain. For text classification, two deep learning models are widely
used: Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN).
CNN is a type of neural network that consists of an input layer, an output layer, and multiple
hidden layers that are made of convolutional layers. Convolutional layers are the major
building blocks used in CNN, while a convolution is the linear operation that automatically
applies filters to inputs – resulting in activations. These complex layers are key ingredients in a
convolutional neural network as they assign importance to various inputs and differentiate
one from the other.
Within the context of text classification, a CNN applies its filters to words
or n-grams to extract high-level features.
RNNs are specialized neural networks that process sequential information. At
each step of an input sequence, the RNN's computation is conditioned on the
outputs of the previous steps. The key advantage of an RNN is its ability to
memorize the results of previous computations and use them for the current one.
It’s important to remember that deep learning algorithms require millions of tagged
examples, as they work best when fed more data.
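As a rough sketch of the CNN architecture described above, written with Keras:
the vocabulary size, sequence length, number of tags, and layer sizes are
arbitrary assumptions, and actual training would require a large tagged corpus.

from tensorflow.keras import layers, models

vocab_size = 20000   # assumed vocabulary size
seq_len = 200        # assumed (padded) document length in tokens
num_tags = 5         # assumed number of categories

model = models.Sequential([
    layers.Input(shape=(seq_len,)),
    layers.Embedding(vocab_size, 128),            # word embeddings
    layers.Conv1D(64, 5, activation="relu"),      # filters slide over word windows (n-grams)
    layers.GlobalMaxPooling1D(),                  # keep the strongest activation per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(num_tags, activation="softmax"), # one probability per tag
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(padded_token_ids, tag_ids, epochs=5)  # needs a large tagged dataset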
Thank YOU.